Agent System Overview

The BYOC Agentic AI Operations platform uses a multi-agent architecture to monitor, detect, investigate, and report on cluster health across the entire CelerData BYOC fleet.

Architecture

The system follows a Sentinel-Investigator pattern with two additional agent roles for fleet sweeps and interactive analysis:

Sentinel -- Pure Python change-detection layer. No LLM calls, zero cost. Runs every 15 minutes after each risk/score refresh. When a trigger fires, it writes a task to the agent_tasks queue.
Investigator Agent -- LLM-powered autonomous agent. Picks up pending tasks from agent_tasks, performs a structured investigation using SQL tools and the Knowledge Lake, and writes findings to agent_investigations.
Patrol Agent -- LLM-powered fleet-wide health scan. Runs twice daily (morning and evening) or on demand. It pre-aggregates fleet state first, then sends a compact summary to the LLM for deeper analysis.
BYOC Agent -- Interactive chat agent. Used by operators to ask ad-hoc questions about cluster health, query performance, and usage patterns.
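
The hand-off between Sentinel and Investigator can be sketched as below. This is a minimal in-memory illustration, not the platform's code: the function names `check_triggers` and `run_pending_investigations` appear in the scheduling table, but their signatures, the `Task` class, the score-drop trigger, and the 10-point threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """Hypothetical stand-in for one row in the agent_tasks queue."""
    cluster_id: str
    reason: str
    status: str = "pending"  # pending -> running -> completed | failed


def check_triggers(previous: Dict[str, float],
                   current: Dict[str, float],
                   threshold: float = 10.0) -> List[Task]:
    """Pure-Python change detection: no LLM call is made here.

    Assumed trigger: a cluster's score dropped by at least `threshold`.
    """
    tasks = []
    for cluster_id, score in current.items():
        old = previous.get(cluster_id)
        if old is not None and old - score >= threshold:
            tasks.append(Task(cluster_id, f"score dropped {old:.0f} -> {score:.0f}"))
    return tasks


def run_pending_investigations(queue: List[Task],
                               investigate: Callable[[Task], str]) -> List[str]:
    """Process pending tasks; `investigate` stands in for the LLM-backed step."""
    findings = []
    for task in queue:
        if task.status != "pending":
            continue
        task.status = "running"
        try:
            findings.append(investigate(task))
            task.status = "completed"
        except Exception:
            task.status = "failed"
    return findings
```

Keeping detection free of LLM calls means the expensive investigation step runs only for clusters that actually changed.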

Data Flow

Agent Comparison

Agent	Trigger	LLM?	Model	Cost	Output
Sentinel	Every 15 min (cron)	No	--	Zero	`agent_tasks` rows
Investigator	Pending tasks in queue	Yes	Claude (configurable)	Per-investigation	`agent_investigations` rows
Patrol	2x daily or on-demand	Yes	Claude (configurable)	Per-report	`patrol_reports` rows
BYOC Agent	User chat message	Yes	Claude Sonnet	Per-conversation	Chat response
AIIssueGrouper	New alerts (every 5 min)	Yes	Claude Haiku	Per-alert batch	`ai_issues` rows

Supporting Components

UnifiedScorer -- Single pipeline merging metrics, alerts, and customer tier into one weighted score per cluster. Replaces separate HealthScorer and ClusterRiskAnalyzer runs.
HealthScorer -- Per-cluster health scores (0-100) from YAML rules across 8 dimensions.
ClusterRiskAnalyzer -- Threshold-based risk classification across 10 dimensions in 4 categories.
IssueTracker -- Alert-to-issue grouping with lifecycle management (anomaly vs. failure patterns).
AIIssueGrouper -- Claude Haiku-powered alert classification and issue grouping.
LocalAnalyst -- Pre-built analyses (health overview, latency trends, alert summary) that run without an LLM.
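
To make the idea of "one weighted score per cluster" concrete, here is a minimal sketch of a weighted merge. The weights, the component names, and the assumption that every component is already normalized to the same scale are hypothetical; the real UnifiedScorer may combine its inputs differently.

```python
from typing import Dict, Optional

# Hypothetical weights, chosen only for illustration.
DEFAULT_WEIGHTS: Dict[str, float] = {"metrics": 0.5, "alerts": 0.3, "tier": 0.2}


def unified_score(metrics: float,
                  alerts: float,
                  tier: float,
                  weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of three component scores on a shared scale."""
    w = DEFAULT_WEIGHTS if weights is None else weights
    components = {"metrics": metrics, "alerts": alerts, "tier": tier}
    total_weight = sum(w[name] for name in components)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total lets callers pass weights that do not sum to 1.
    return sum(w[name] * value for name, value in components.items()) / total_weight
```

With the default weights, component scores of 80, 60, and 100 combine to roughly 78.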

Key Database Tables

Table	Purpose
`alerts.agent_tasks`	Sentinel task queue (pending/running/completed/failed)
`alerts.agent_investigations`	Investigator findings with structured fields
`alerts.patrol_reports`	Fleet patrol reports with parsed sections
`alerts.cluster_unified_scores`	Unified scoring snapshots (DUPLICATE KEY)
`alerts.cluster_health_scores`	Health scoring snapshots (DUPLICATE KEY)
`alerts.cluster_risk_snapshots`	Risk classification snapshots (PRIMARY KEY, upsert)
`alerts.ai_issues`	AI-grouped issues with triage workflow
`alerts.issues`	Rule-based grouped issues with lifecycle
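
The DUPLICATE KEY and PRIMARY KEY annotations imply different write behavior. Assuming they refer to StarRocks table types, a duplicate-key table keeps every row written, while a primary-key table keeps one row per key and overwrites it on upsert. The sketch below imitates that difference with plain Python containers; the `Snapshot` layout and the choice of `cluster_id` as the key are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical row layout: (cluster_id, captured_at, score)
Snapshot = Tuple[str, str, float]


def append_snapshot(table: List[Snapshot], row: Snapshot) -> None:
    """Duplicate-key behavior: every write is kept, so history accumulates."""
    table.append(row)


def upsert_snapshot(table: Dict[str, Snapshot], row: Snapshot) -> None:
    """Primary-key behavior: a write replaces any existing row with the same key."""
    table[row[0]] = row


history: List[Snapshot] = []
latest: Dict[str, Snapshot] = {}
for row in [("c1", "09:00", 82.0), ("c1", "09:15", 74.0)]:
    append_snapshot(history, row)
    upsert_snapshot(latest, row)
```

After both writes, `history` holds two rows for `c1`, while `latest` holds only the 09:15 row.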

Scheduling

Job	Frequency	What it does
`refresh_unified.py`	Every 15 min	Runs UnifiedScorer, then Sentinel `check_triggers()`
`refresh_risk.py`	Every 15 min	Runs ClusterRiskAnalyzer, then Sentinel
`refresh_scores.py`	Every 15 min	Runs HealthScorer, then Sentinel
Investigator	After Sentinel	Picks up pending tasks (`run_pending_investigations`)
Patrol	2x daily + on-demand	`run_patrol(report_type="morning"|"evening")`
Alert pipeline	Every 5 min	Loads new Lark alerts, runs AIIssueGrouper
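
The Patrol entry above takes a `report_type` of either "morning" or "evening", and the Patrol Agent pre-aggregates fleet state before calling the LLM. A rough sketch of that pre-aggregation step might look like the following. The helper names, the score thresholds, and the prompt format are all hypothetical, and the LLM call itself is left out.

```python
from collections import Counter
from typing import Dict

VALID_REPORT_TYPES = ("morning", "evening")


def risk_level(score: float) -> str:
    """Bucket a score into a level; the cut-offs are illustrative only."""
    if score < 50:
        return "critical"
    if score < 80:
        return "warning"
    return "healthy"


def summarize_fleet(scores: Dict[str, float], worst_n: int = 3) -> dict:
    """Reduce per-cluster scores to a compact summary."""
    by_level = Counter(risk_level(s) for s in scores.values())
    worst = sorted(scores.items(), key=lambda kv: kv[1])[:worst_n]
    return {"total": len(scores), "by_level": dict(by_level), "worst": worst}


def build_patrol_prompt(report_type: str, scores: Dict[str, float]) -> str:
    """Build the compact text that would be sent to the LLM (call not shown)."""
    if report_type not in VALID_REPORT_TYPES:
        raise ValueError(f"report_type must be one of {VALID_REPORT_TYPES}")
    summary = summarize_fleet(scores)
    return (f"{report_type} patrol: {summary['total']} clusters, "
            f"levels={summary['by_level']}, worst={summary['worst']}")
```

Sending a summary rather than raw per-cluster rows keeps the prompt size roughly constant as the fleet grows.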
