Architecture Overview
The platform follows a layered architecture: data sources feed into ingestion pipelines, which populate StarRocks OLAP storage. An agent layer runs autonomous analysis, exposed through a FastAPI backend to a React frontend.
System Diagram
Component Details
Data Sources
| Source | Protocol | Volume |
|---|---|---|
| Lark Alert Channel | Lark Open API (MCP im_v1_message_list) | ~8,500+ alert messages since Jan 2026 |
| Grafana / Alertmanager | Webhook POST to alert_webhook.py | Real-time firing/resolved notifications |
| StarRocks Metrics | Direct OLAP queries via mysql-connector-python | ~830M rows in metrics_table, materialized to ~33.7M hourly snapshots |
| CelerData Control Plane | Replicated to byoc database (16 tables) | 1,143 orgs, 3,534 clusters, ~2M billing rows |
Ingestion Layer
| Component | File | Trigger | What it does |
|---|---|---|---|
| Daily Alert Pipeline | daily_alert_pipeline.py | Cron (daily) or manual | Fetches Lark messages via API, parses alert cards, inserts into lark_alerts, generates recommendations |
| Alert Webhook | alert_webhook.py | Grafana webhook POST | Receives structured alert JSON, computes SHA256-based message_id for dedup, inserts into lark_alerts |
| Lark Alert Loader | load_lark_alerts.py + process_lark_page.py | Manual / MCP batch | Parses interactive card JSON from saved MCP results, handles multi-alert cards (Firing + Resolved) |
Storage (StarRocks)
Three databases, all queried via mysql-connector-python over the StarRocks MySQL-compatible protocol:
metrics-- Telemetry data. The materialized viewamv_hourly_snapshots_v1is the primary analysis table. Enriched views (v5,v6) join metric dictionary metadata.dim_customer_profileprovides customer tier (A/B/S).byoc-- Cluster infrastructure. 16 tables covering the full entity hierarchy: organizations -> accounts -> clusters -> resources -> nodes -> VM specs. Plus billing (bill_detail), operations (billing_order), and weekly CCU rollups.alerts-- Operational data. Raw alerts (lark_alerts), grouped issues (issues), risk snapshots (cluster_risk_snapshots), health scores (cluster_health_scores), agent tasks and investigations, patrol reports.
Agent Layer
Four agents with distinct roles:
| Agent | Module | Schedule | LLM? | Purpose |
|---|---|---|---|---|
| Sentinel | byoc_agent/sentinel.py | Every 15 min | No | Detects trigger conditions (new Critical, score cliff >15pt, alert storm >5/hr, Tier A/S warning) and creates agent_tasks |
| Investigator | byoc_agent/investigator.py | After each Sentinel run | Yes (Claude) | Picks up pending tasks, runs 8-step investigation protocol using SQL tools + Knowledge Lake, writes structured findings to agent_investigations |
| Patrol Agent | byoc_agent/patrol_agent.py | 2x/day (morning + evening) | Yes (Claude) | Fleet-wide health scan: aggregates risk/scores/alerts, identifies problem clusters, detects cross-cluster patterns, produces actionable reports |
| BYOCAgent | byoc_agent/agent.py | On-demand (chat) | Yes (Claude) | Interactive chat agent for ad-hoc cluster health queries. Uses MCP for open-ended StarRocks exploration |
All LLM-backed agents inherit from AgentBase (byoc_agent/agent_base.py), which provides a reusable tool-use loop supporting both Anthropic and OpenAI-compatible providers.
Knowledge Lake (byoc_agent/knowledge_lake_client.py) -- an MCP server providing vector + fulltext search over proprietary StarRocks knowledge: DW on-call logs, known bugs by SR version, monitoring guidelines, and proven resolutions.
API Layer (FastAPI)
backend/main.py mounts 14 routers:
| Router | Prefix | Purpose |
|---|---|---|
auth | /api/auth | JWT login, Supabase user management, invite/activate |
health | /api/health | Cluster health scores, unified scoring results |
issues | /api/issues | Issue tracker (grouped alerts with lifecycle) |
risk | /api/risk | Risk snapshots, risk distribution |
analysis | /api/analysis | Usage patterns, breaking-point analysis |
settings | /api/settings | User preferences, notification config |
investigations | /api/investigations | Agent investigation results |
llm_usage | /api/llm-usage | Token consumption tracking |
looker | /api/looker | Looker embed SSO token generation |
critical_clusters | /api/critical-clusters | Critical cluster list and details |
chat | /api/chat | BYOCAgent chat interface (streaming) |
ai_issues | /api/ai-issues | AI-grouped issue summaries |
fleet | /api/fleet | Fleet-wide stats and cluster list |
patrol | /api/patrol | Patrol reports and on-demand triggers |
Frontend (React SPA)
Built with React 19 + Vite + TypeScript + Tailwind CSS v4. Pages:
| Page | Route | Description |
|---|---|---|
| Overview | /overview | Fleet health dashboard with risk distribution, critical clusters |
| Issues | /issues | Issue tracker with filtering, triage workflow |
| Raw Alerts | /raw-alerts | Direct view of lark_alerts table |
| Breaking Point | /breaking | Capacity analysis and risk forecasting |
| Usage Patterns | /usage | Cluster utilization trends |
| Investigations | /investigations | Agent investigation reports |
| AI Issues | /ai-issues | AI-grouped and summarized issues |
| Patrol | /patrol | Fleet patrol reports |
| Chat | /chat | Interactive chat with BYOCAgent |
| LLM Usage | /llm-usage | Token and cost tracking for all agents |
| Settings | /settings | User and notification preferences |
| Help | /help | Platform documentation and guides |