The BYOC Agentic AI Operations platform uses a multi-agent architecture to monitor, detect, investigate, and report on cluster health across the entire CelerData BYOC fleet.
Architecture
The system follows a Sentinel-Investigator pattern with two additional agent roles for fleet sweeps and interactive analysis:
- Sentinel -- Pure Python change-detection layer. No LLM calls, zero cost. Runs every 15 minutes after each risk/score refresh. When a trigger fires, it writes a task to the agent_tasks queue.
- Investigator Agent -- LLM-powered autonomous agent. Picks up pending tasks from agent_tasks, performs a structured investigation using SQL tools and the Knowledge Lake, and writes findings to agent_investigations.
- Patrol Agent -- LLM-powered fleet-wide health scan. Runs 2x daily (morning + evening) or on-demand. Pre-aggregates fleet state, then sends a compact summary to the LLM for deep analysis.
- BYOC Agent -- Interactive chat agent. Used by operators to ask ad-hoc questions about cluster health, query performance, and usage patterns.
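The Sentinel-to-Investigator handoff described above can be sketched as follows. This is a minimal illustration, not the platform's code: the names check_triggers and run_pending_investigations appear in this document, but their signatures, the in-memory TaskQueue standing in for agent_tasks, and the score-drop trigger with its threshold are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    cluster_id: str
    reason: str
    status: str = "pending"  # pending -> running -> completed | failed


@dataclass
class TaskQueue:
    """In-memory stand-in for the agent_tasks table (assumption)."""
    tasks: List[Task] = field(default_factory=list)

    def enqueue(self, cluster_id: str, reason: str) -> Task:
        task = Task(cluster_id, reason)
        self.tasks.append(task)
        return task

    def pending(self) -> List[Task]:
        return [t for t in self.tasks if t.status == "pending"]


def check_triggers(previous: Dict[str, float], current: Dict[str, float],
                   queue: TaskQueue, drop_threshold: float = 10.0) -> List[Task]:
    """Pure-Python change detection: no LLM call is made here.

    Enqueues one task per cluster whose score fell by at least
    drop_threshold (a hypothetical trigger rule) between snapshots.
    """
    created = []
    for cluster_id, score in current.items():
        old = previous.get(cluster_id)
        if old is not None and old - score >= drop_threshold:
            reason = f"score dropped {old:.0f} -> {score:.0f}"
            created.append(queue.enqueue(cluster_id, reason))
    return created


def run_pending_investigations(queue: TaskQueue,
                               investigate: Callable[[Task], str]) -> List[str]:
    """Drain pending tasks; `investigate` stands in for the LLM-powered agent."""
    findings = []
    for task in queue.pending():
        task.status = "running"
        try:
            findings.append(investigate(task))
            task.status = "completed"
        except Exception:
            task.status = "failed"
    return findings
```

Keeping the trigger check free of LLM calls means the frequent 15-minute pass costs nothing, and model spend is incurred only for clusters that actually changed.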
Data Flow
Agent Comparison
| Agent | Trigger | LLM? | Model | Cost | Output |
|---|---|---|---|---|---|
| Sentinel | Every 15 min (cron) | No | -- | Zero | agent_tasks rows |
| Investigator | Pending tasks in queue | Yes | Claude (configurable) | Per-investigation | agent_investigations rows |
| Patrol | 2x daily or on-demand | Yes | Claude (configurable) | Per-report | patrol_reports rows |
| BYOC Agent | User chat message | Yes | Claude Sonnet | Per-conversation | Chat response |
| AIIssueGrouper | New alerts (every 5 min) | Yes | Claude Haiku | Per-alert batch | ai_issues rows |
Supporting Components
- UnifiedScorer -- Single pipeline merging metrics, alerts, and customer tier into one weighted score per cluster. Replaces separate HealthScorer and ClusterRiskAnalyzer runs.
- HealthScorer -- Per-cluster health scores (0-100) from YAML rules across 8 dimensions.
- ClusterRiskAnalyzer -- Threshold-based risk classification across 10 dimensions in 4 categories.
- IssueTracker -- Alert-to-issue grouping with lifecycle management (anomaly vs. failure patterns).
- AIIssueGrouper -- Claude Haiku-powered alert classification and issue grouping.
- LocalAnalyst -- Pre-built analyses (health overview, latency trends, alert summary) that run without an LLM.
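To make the idea of "one weighted score per cluster" concrete, here is a rough sketch of how three inputs might be merged. The weights, the 0-100 scale for each input, and the names ClusterInputs and unified_score are hypothetical; the source does not state how UnifiedScorer actually combines its inputs.

```python
from dataclasses import dataclass

# Hypothetical weights; the real UnifiedScorer weights are not given in this document.
WEIGHTS = {"metrics": 0.5, "alerts": 0.3, "tier": 0.2}


@dataclass(frozen=True)
class ClusterInputs:
    metrics_score: float  # assumed 0-100, higher is healthier
    alert_score: float    # assumed 0-100, higher means fewer or milder alerts
    tier_score: float     # assumed 0-100 value derived from customer tier


def unified_score(inputs: ClusterInputs) -> float:
    """Combine the three inputs into one weighted score, clamped to 0-100."""
    total = (
        WEIGHTS["metrics"] * inputs.metrics_score
        + WEIGHTS["alerts"] * inputs.alert_score
        + WEIGHTS["tier"] * inputs.tier_score
    )
    return round(max(0.0, min(100.0, total)), 1)
```

With these illustrative weights, `unified_score(ClusterInputs(80, 60, 100))` evaluates to 78.0.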
Key Database Tables
| Table | Purpose |
|---|---|
| alerts.agent_tasks | Sentinel task queue (pending/running/completed/failed) |
| alerts.agent_investigations | Investigator findings with structured fields |
| alerts.patrol_reports | Fleet patrol reports with parsed sections |
| alerts.cluster_unified_scores | Unified scoring snapshots (DUPLICATE KEY) |
| alerts.cluster_health_scores | Health scoring snapshots (DUPLICATE KEY) |
| alerts.cluster_risk_snapshots | Risk classification snapshots (PRIMARY KEY, upsert) |
| alerts.ai_issues | AI-grouped issues with triage workflow |
| alerts.issues | Rule-based grouped issues with lifecycle |
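The DUPLICATE KEY and PRIMARY KEY annotations above imply different write behavior: the former keeps every snapshot row, while the latter overwrites the row that shares a key. The sketch below mimics that difference in memory. The class names are hypothetical, and using cluster_id as the primary key is an assumption, since the table schemas are not shown here.

```python
from typing import Dict, List, Tuple


class AppendOnlySnapshots:
    """Mimics a DUPLICATE KEY table: every write is retained as history."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, int, float]] = []

    def write(self, cluster_id: str, ts: int, value: float) -> None:
        self.rows.append((cluster_id, ts, value))


class UpsertSnapshots:
    """Mimics a PRIMARY KEY table: a write replaces the row with the same key."""

    def __init__(self) -> None:
        self.rows: Dict[str, Tuple[int, float]] = {}

    def write(self, cluster_id: str, ts: int, value: float) -> None:
        self.rows[cluster_id] = (ts, value)  # assumed key: cluster_id


history = AppendOnlySnapshots()
latest = UpsertSnapshots()
for ts, score in [(1, 90.0), (2, 72.0)]:
    history.write("cluster-a", ts, score)
    latest.write("cluster-a", ts, score)
```

After both writes, `history.rows` holds two rows for cluster-a, which supports trend queries, while `latest.rows` holds only the most recent one.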
Scheduling
| Job | Frequency | What it does |
|---|---|---|
| refresh_unified.py | Every 15 min | Runs UnifiedScorer, then Sentinel check_triggers() |
| refresh_risk.py | Every 15 min | Runs ClusterRiskAnalyzer, then Sentinel |
| refresh_scores.py | Every 15 min | Runs HealthScorer, then Sentinel |
| Investigator | After Sentinel | Picks up pending tasks (run_pending_investigations) |
| Patrol | 2x daily + on-demand | run_patrol(report_type="morning"\|"evening") |
| Alert pipeline | Every 5 min | Loads new Lark alerts, runs AIIssueGrouper |
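The cadences in the schedule can be expressed as a small due-check, shown here purely as an illustration. The function jobs_due, the PATROL_HOURS mapping, and the specific patrol hours (08:00 and 20:00) are assumptions; the source says only that Patrol runs in the morning and evening, and does not say how jobs are actually scheduled.

```python
from datetime import datetime
from typing import List

# Hypothetical patrol hours; the document only says "morning + evening".
PATROL_HOURS = {8: "morning", 20: "evening"}


def jobs_due(now: datetime) -> List[str]:
    """Return the names of jobs whose cadence matches the given wall-clock minute."""
    due: List[str] = []
    if now.minute % 5 == 0:
        due.append("alert_pipeline")  # every 5 min
    if now.minute % 15 == 0:
        # Each refresh job runs its scorer and then the Sentinel check.
        due.extend(["refresh_unified.py", "refresh_risk.py", "refresh_scores.py"])
    if now.minute == 0 and now.hour in PATROL_HOURS:
        due.append(f"patrol:{PATROL_HOURS[now.hour]}")
    return due
```

The Investigator is not listed because it is driven by the queue rather than the clock: it runs after Sentinel whenever pending tasks exist.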