Agent System Overview

The BYOC Agentic AI Operations platform uses a multi-agent architecture to monitor, detect, investigate, and report on cluster health across the entire CelerData BYOC fleet.

Architecture

The system follows a Sentinel-Investigator pattern with two additional agent roles for fleet sweeps and interactive analysis:

  1. Sentinel -- Pure Python change-detection layer. It makes no LLM calls, so it incurs no LLM cost. Runs every 15 minutes, after each risk/score refresh. When a trigger fires, it writes a task to the agent_tasks queue.
  2. Investigator Agent -- LLM-powered autonomous agent. Picks up pending tasks from agent_tasks, performs a structured investigation using SQL tools and the Knowledge Lake, and writes findings to agent_investigations.
  3. Patrol Agent -- LLM-powered fleet-wide health scan. Runs 2x daily (morning + evening) or on-demand. Pre-aggregates fleet state, then sends a compact summary to the LLM for deep analysis.
  4. BYOC Agent -- Interactive chat agent. Used by operators to ask ad-hoc questions about cluster health, query performance, and usage patterns.
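
The hand-off between the Sentinel and the Investigator can be sketched as follows. This is a minimal illustration, not the platform's implementation: the in-memory TaskQueue stands in for the agent_tasks table, the score-drop trigger and its drop_threshold are hypothetical (the actual trigger conditions are not listed here), and the investigate callback stands in for the LLM-powered investigation. Only the names check_triggers and run_pending_investigations and the four task statuses come from this page; their signatures are assumed.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    cluster_id: str
    reason: str
    status: str = "pending"  # pending / running / completed / failed


@dataclass
class TaskQueue:
    """In-memory stand-in for the agent_tasks table (assumption)."""
    tasks: list = field(default_factory=list)

    def pending(self) -> list:
        return [t for t in self.tasks if t.status == "pending"]


def check_triggers(previous: dict, current: dict, queue: TaskQueue,
                   drop_threshold: float = 10.0) -> list:
    """Sentinel step: plain Python comparison of two score snapshots, no LLM call.

    The trigger used here (score fell by at least drop_threshold) is hypothetical.
    """
    created = []
    for cluster_id, score in current.items():
        old = previous.get(cluster_id)
        if old is not None and old - score >= drop_threshold:
            task = Task(cluster_id, f"score dropped {old:.0f} -> {score:.0f}")
            queue.tasks.append(task)
            created.append(task)
    return created


def run_pending_investigations(queue: TaskQueue,
                               investigate: Callable[[Task], dict]) -> list:
    """Investigator step: drain pending tasks and collect findings.

    investigate is a placeholder for the LLM-backed investigation.
    """
    findings = []
    for task in queue.pending():
        task.status = "running"
        try:
            findings.append(investigate(task))
            task.status = "completed"
        except Exception:
            task.status = "failed"
    return findings
```

Keeping the detection step free of LLM calls means the 15-minute check costs nothing when no trigger fires; the LLM is invoked only for tasks that reach the queue.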

Data Flow

Agent Comparison

| Agent | Trigger | LLM? | Model | Cost | Output |
|---|---|---|---|---|---|
| Sentinel | Every 15 min (cron) | No | -- | Zero | agent_tasks rows |
| Investigator | Pending tasks in queue | Yes | Claude (configurable) | Per-investigation | agent_investigations rows |
| Patrol | 2x daily or on-demand | Yes | Claude (configurable) | Per-report | patrol_reports rows |
| BYOC Agent | User chat message | Yes | Claude Sonnet | Per-conversation | Chat response |
| AIIssueGrouper | New alerts (every 5 min) | Yes | Claude Haiku | Per-alert batch | ai_issues rows |

Supporting Components

  • UnifiedScorer -- Single pipeline merging metrics, alerts, and customer tier into one weighted score per cluster. Replaces separate HealthScorer and ClusterRiskAnalyzer runs.
  • HealthScorer -- Per-cluster health scores (0-100) from YAML rules across 8 dimensions.
  • ClusterRiskAnalyzer -- Threshold-based risk classification across 10 dimensions in 4 categories.
  • IssueTracker -- Alert-to-issue grouping with lifecycle management (anomaly vs. failure patterns).
  • AIIssueGrouper -- Claude Haiku-powered alert classification and issue grouping.
  • LocalAnalyst -- Pre-built analyses (health overview, latency trends, alert summary) that run without an LLM.
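
To make the idea of "one weighted score per cluster" concrete, here is a minimal sketch of how metrics, alerts, and customer tier could be merged. Everything numeric in it is an assumption for illustration: the 0-100 input scale, the 0.6/0.4 weights, the tier names, and the tier multipliers are hypothetical and are not taken from UnifiedScorer.

```python
# Hypothetical tier multipliers; the real tiers and weights are not documented here.
TIER_WEIGHT = {"enterprise": 1.0, "standard": 0.8, "trial": 0.5}


def unified_score(metric_risk: float, alert_risk: float, tier: str,
                  w_metrics: float = 0.6, w_alerts: float = 0.4) -> float:
    """Combine two 0-100 risk inputs into one score, scaled by customer tier."""
    base = w_metrics * metric_risk + w_alerts * alert_risk
    return round(base * TIER_WEIGHT[tier], 2)
```

With these assumed weights, two clusters showing identical metrics and alerts would rank differently depending on tier, which is one way a single score can prioritize attention across the fleet.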

Key Database Tables

| Table | Purpose |
|---|---|
| alerts.agent_tasks | Sentinel task queue (pending/running/completed/failed) |
| alerts.agent_investigations | Investigator findings with structured fields |
| alerts.patrol_reports | Fleet patrol reports with parsed sections |
| alerts.cluster_unified_scores | Unified scoring snapshots (DUPLICATE KEY) |
| alerts.cluster_health_scores | Health scoring snapshots (DUPLICATE KEY) |
| alerts.cluster_risk_snapshots | Risk classification snapshots (PRIMARY KEY, upsert) |
| alerts.ai_issues | AI-grouped issues with triage workflow |
| alerts.issues | Rule-based grouped issues with lifecycle |
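
The DUPLICATE KEY and PRIMARY KEY labels imply different write behaviour. The sketch below assumes they refer to StarRocks-style table models, where a duplicate-key table keeps every written row and a primary-key table replaces the row that shares the same key. It models the two behaviours with plain Python containers; the cluster_id key and the row fields are hypothetical.

```python
def append_snapshot(rows: list, row: dict) -> None:
    """DUPLICATE KEY behaviour: every write is kept, so history accumulates."""
    rows.append(row)


def upsert_snapshot(table: dict, row: dict, key: str = "cluster_id") -> None:
    """PRIMARY KEY behaviour: a write replaces the existing row with the same key."""
    table[row[key]] = row


# Two refresh cycles for the same (hypothetical) cluster.
score_history: list = []   # models a DUPLICATE KEY scores table
risk_latest: dict = {}     # models the PRIMARY KEY risk snapshot table
for cycle, score in enumerate([82, 64]):
    row = {"cluster_id": "c-1", "cycle": cycle, "score": score}
    append_snapshot(score_history, row)
    upsert_snapshot(risk_latest, row)
```

After both cycles, score_history holds two rows while risk_latest holds only the most recent one, which fits score tables being used for trends and the risk table for current state.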

Scheduling

| Job | Frequency | What it does |
|---|---|---|
| refresh_unified.py | Every 15 min | Runs UnifiedScorer, then Sentinel check_triggers() |
| refresh_risk.py | Every 15 min | Runs ClusterRiskAnalyzer, then Sentinel |
| refresh_scores.py | Every 15 min | Runs HealthScorer, then Sentinel |
| Investigator | After Sentinel | Picks up pending tasks (run_pending_investigations) |
| Patrol | 2x daily + on-demand | run_patrol(report_type=...) with "morning" or "evening" |
| Alert pipeline | Every 5 min | Loads new Lark alerts, runs AIIssueGrouper |
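
The interval-based jobs above can be expressed as a small lookup, shown here as a sketch. It assumes the jobs are aligned to wall-clock minutes (for example :00, :15, :30, :45), which this page does not state; the alert_pipeline label and the jobs_due helper are hypothetical. Patrol is left out because its exact run times are not given.

```python
from datetime import datetime

# Interval in minutes for each periodic job, taken from the schedule above.
INTERVAL_JOBS = {
    "refresh_unified.py": 15,
    "refresh_risk.py": 15,
    "refresh_scores.py": 15,
    "alert_pipeline": 5,  # hypothetical label for the alert pipeline job
}


def jobs_due(now: datetime) -> list:
    """Return the jobs whose interval divides the current minute of the day."""
    minute_of_day = now.hour * 60 + now.minute
    return sorted(name for name, every in INTERVAL_JOBS.items()
                  if minute_of_day % every == 0)
```

Under this alignment assumption, the alert pipeline runs three times per refresh cycle, and every third run coincides with the scoring jobs.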