Architecture Overview

The platform follows a layered architecture: data sources feed into ingestion pipelines, which populate StarRocks OLAP storage. An agent layer runs autonomous analysis, exposed through a FastAPI backend to a React frontend.

System Diagram

Component Details

Data Sources

Source	Protocol	Volume
Lark Alert Channel	Lark Open API (MCP `im_v1_message_list`)	~8,500+ alert messages since Jan 2026
Grafana / Alertmanager	Webhook POST to `alert_webhook.py`	Real-time firing/resolved notifications
StarRocks Metrics	Direct OLAP queries via `mysql-connector-python`	~830M rows in `metrics_table`, materialized to ~33.7M hourly snapshots
CelerData Control Plane	Replicated to `byoc` database (16 tables)	1,143 orgs, 3,534 clusters, ~2M billing rows

Ingestion Layer

Component	File	Trigger	What it does
Daily Alert Pipeline	`daily_alert_pipeline.py`	Cron (daily) or manual	Fetches Lark messages via API, parses alert cards, inserts into `lark_alerts`, generates recommendations
Alert Webhook	`alert_webhook.py`	Grafana webhook POST	Receives structured alert JSON, computes SHA256-based `message_id` for dedup, inserts into `lark_alerts`
Lark Alert Loader	`load_lark_alerts.py` + `process_lark_page.py`	Manual / MCP batch	Parses interactive card JSON from saved MCP results, handles multi-alert cards (Firing + Resolved)

Storage (StarRocks)

Three databases, all queried via mysql-connector-python over the StarRocks MySQL-compatible protocol:

metrics -- Telemetry data. The materialized view amv_hourly_snapshots_v1 is the primary analysis table. Enriched views (v5, v6) join metric dictionary metadata. dim_customer_profile provides customer tier (A/B/S).
byoc -- Cluster infrastructure. 16 tables covering the full entity hierarchy: organizations -> accounts -> clusters -> resources -> nodes -> VM specs. Plus billing (bill_detail), operations (billing_order), and weekly CCU rollups.
alerts -- Operational data. Raw alerts (lark_alerts), grouped issues (issues), risk snapshots (cluster_risk_snapshots), health scores (cluster_health_scores), agent tasks and investigations, patrol reports.

Agent Layer

Four agents with distinct roles:

Agent	Module	Schedule	LLM?	Purpose
Sentinel	`byoc_agent/sentinel.py`	Every 15 min	No	Detects trigger conditions (new Critical, score cliff >15pt, alert storm >5/hr, Tier A/S warning) and creates `agent_tasks`
Investigator	`byoc_agent/investigator.py`	After each Sentinel run	Yes (Claude)	Picks up pending tasks, runs 8-step investigation protocol using SQL tools + Knowledge Lake, writes structured findings to `agent_investigations`
Patrol Agent	`byoc_agent/patrol_agent.py`	2x/day (morning + evening)	Yes (Claude)	Fleet-wide health scan: aggregates risk/scores/alerts, identifies problem clusters, detects cross-cluster patterns, produces actionable reports
BYOCAgent	`byoc_agent/agent.py`	On-demand (chat)	Yes (Claude)	Interactive chat agent for ad-hoc cluster health queries. Uses MCP for open-ended StarRocks exploration

All LLM-backed agents inherit from AgentBase (byoc_agent/agent_base.py), which provides a reusable tool-use loop supporting both Anthropic and OpenAI-compatible providers.

Knowledge Lake (byoc_agent/knowledge_lake_client.py) -- an MCP server providing vector + fulltext search over proprietary StarRocks knowledge: DW on-call logs, known bugs by SR version, monitoring guidelines, and proven resolutions.

API Layer (FastAPI)

backend/main.py mounts 14 routers:

Router	Prefix	Purpose
`auth`	`/api/auth`	JWT login, Supabase user management, invite/activate
`health`	`/api/health`	Cluster health scores, unified scoring results
`issues`	`/api/issues`	Issue tracker (grouped alerts with lifecycle)
`risk`	`/api/risk`	Risk snapshots, risk distribution
`analysis`	`/api/analysis`	Usage patterns, breaking-point analysis
`settings`	`/api/settings`	User preferences, notification config
`investigations`	`/api/investigations`	Agent investigation results
`llm_usage`	`/api/llm-usage`	Token consumption tracking
`looker`	`/api/looker`	Looker embed SSO token generation
`critical_clusters`	`/api/critical-clusters`	Critical cluster list and details
`chat`	`/api/chat`	BYOCAgent chat interface (streaming)
`ai_issues`	`/api/ai-issues`	AI-grouped issue summaries
`fleet`	`/api/fleet`	Fleet-wide stats and cluster list
`patrol`	`/api/patrol`	Patrol reports and on-demand triggers

Frontend (React SPA)

Built with React 19 + Vite + TypeScript + Tailwind CSS v4. Pages:

Page	Route	Description
Overview	`/overview`	Fleet health dashboard with risk distribution, critical clusters
Issues	`/issues`	Issue tracker with filtering, triage workflow
Raw Alerts	`/raw-alerts`	Direct view of `lark_alerts` table
Breaking Point	`/breaking`	Capacity analysis and risk forecasting
Usage Patterns	`/usage`	Cluster utilization trends
Investigations	`/investigations`	Agent investigation reports
AI Issues	`/ai-issues`	AI-grouped and summarized issues
Patrol	`/patrol`	Fleet patrol reports
Chat	`/chat`	Interactive chat with BYOCAgent
LLM Usage	`/llm-usage`	Token and cost tracking for all agents
Settings	`/settings`	User and notification preferences
Help	`/help`	Platform documentation and guides

System Diagram​

Component Details​

Data Sources​

Ingestion Layer​

Storage (StarRocks)​

Agent Layer​

API Layer (FastAPI)​

Frontend (React SPA)​