Skip to main content

Architecture Overview

The platform follows a layered architecture: data sources feed into ingestion pipelines, which populate StarRocks OLAP storage. An agent layer runs autonomous analysis, exposed through a FastAPI backend to a React frontend.

System Diagram

Component Details

Data Sources

SourceProtocolVolume
Lark Alert ChannelLark Open API (MCP im_v1_message_list)~8,500+ alert messages since Jan 2026
Grafana / AlertmanagerWebhook POST to alert_webhook.pyReal-time firing/resolved notifications
StarRocks MetricsDirect OLAP queries via mysql-connector-python~830M rows in metrics_table, materialized to ~33.7M hourly snapshots
CelerData Control PlaneReplicated to byoc database (16 tables)1,143 orgs, 3,534 clusters, ~2M billing rows

Ingestion Layer

ComponentFileTriggerWhat it does
Daily Alert Pipelinedaily_alert_pipeline.pyCron (daily) or manualFetches Lark messages via API, parses alert cards, inserts into lark_alerts, generates recommendations
Alert Webhookalert_webhook.pyGrafana webhook POSTReceives structured alert JSON, computes SHA256-based message_id for dedup, inserts into lark_alerts
Lark Alert Loaderload_lark_alerts.py + process_lark_page.pyManual / MCP batchParses interactive card JSON from saved MCP results, handles multi-alert cards (Firing + Resolved)

Storage (StarRocks)

Three databases, all queried via mysql-connector-python over the StarRocks MySQL-compatible protocol:

  • metrics -- Telemetry data. The materialized view amv_hourly_snapshots_v1 is the primary analysis table. Enriched views (v5, v6) join metric dictionary metadata. dim_customer_profile provides customer tier (A/B/S).
  • byoc -- Cluster infrastructure. 16 tables covering the full entity hierarchy: organizations -> accounts -> clusters -> resources -> nodes -> VM specs. Plus billing (bill_detail), operations (billing_order), and weekly CCU rollups.
  • alerts -- Operational data. Raw alerts (lark_alerts), grouped issues (issues), risk snapshots (cluster_risk_snapshots), health scores (cluster_health_scores), agent tasks and investigations, patrol reports.

Agent Layer

Four agents with distinct roles:

AgentModuleScheduleLLM?Purpose
Sentinelbyoc_agent/sentinel.pyEvery 15 minNoDetects trigger conditions (new Critical, score cliff >15pt, alert storm >5/hr, Tier A/S warning) and creates agent_tasks
Investigatorbyoc_agent/investigator.pyAfter each Sentinel runYes (Claude)Picks up pending tasks, runs 8-step investigation protocol using SQL tools + Knowledge Lake, writes structured findings to agent_investigations
Patrol Agentbyoc_agent/patrol_agent.py2x/day (morning + evening)Yes (Claude)Fleet-wide health scan: aggregates risk/scores/alerts, identifies problem clusters, detects cross-cluster patterns, produces actionable reports
BYOCAgentbyoc_agent/agent.pyOn-demand (chat)Yes (Claude)Interactive chat agent for ad-hoc cluster health queries. Uses MCP for open-ended StarRocks exploration

All LLM-backed agents inherit from AgentBase (byoc_agent/agent_base.py), which provides a reusable tool-use loop supporting both Anthropic and OpenAI-compatible providers.

Knowledge Lake (byoc_agent/knowledge_lake_client.py) -- an MCP server providing vector + fulltext search over proprietary StarRocks knowledge: DW on-call logs, known bugs by SR version, monitoring guidelines, and proven resolutions.

API Layer (FastAPI)

backend/main.py mounts 14 routers:

RouterPrefixPurpose
auth/api/authJWT login, Supabase user management, invite/activate
health/api/healthCluster health scores, unified scoring results
issues/api/issuesIssue tracker (grouped alerts with lifecycle)
risk/api/riskRisk snapshots, risk distribution
analysis/api/analysisUsage patterns, breaking-point analysis
settings/api/settingsUser preferences, notification config
investigations/api/investigationsAgent investigation results
llm_usage/api/llm-usageToken consumption tracking
looker/api/lookerLooker embed SSO token generation
critical_clusters/api/critical-clustersCritical cluster list and details
chat/api/chatBYOCAgent chat interface (streaming)
ai_issues/api/ai-issuesAI-grouped issue summaries
fleet/api/fleetFleet-wide stats and cluster list
patrol/api/patrolPatrol reports and on-demand triggers

Frontend (React SPA)

Built with React 19 + Vite + TypeScript + Tailwind CSS v4. Pages:

PageRouteDescription
Overview/overviewFleet health dashboard with risk distribution, critical clusters
Issues/issuesIssue tracker with filtering, triage workflow
Raw Alerts/raw-alertsDirect view of lark_alerts table
Breaking Point/breakingCapacity analysis and risk forecasting
Usage Patterns/usageCluster utilization trends
Investigations/investigationsAgent investigation reports
AI Issues/ai-issuesAI-grouped and summarized issues
Patrol/patrolFleet patrol reports
Chat/chatInteractive chat with BYOCAgent
LLM Usage/llm-usageToken and cost tracking for all agents
Settings/settingsUser and notification preferences
Help/helpPlatform documentation and guides