Data Flow
Three primary data flows feed the platform: metrics, alerts, and agent pipelines.
1. Metrics Flow
Raw telemetry from StarRocks clusters is collected into `metrics.metrics_table` (~830M rows). A materialized view pre-aggregates this into hourly snapshots used by all downstream consumers.
Scoring Pipeline
The unified scorer (`byoc_agent/unified_scorer.py`) runs every 15 minutes:
- Fetches gauge and counter metrics from `amv_hourly_snapshots_v1` (14 dimensions including CPU, memory, disk, compaction, query latency, error rate)
- Fetches 7-day-ago baselines for trend comparison
- Pulls firing alert counts from `lark_alerts` (7-day window)
- Pulls customer tier from `dim_customer_profile`
- Computes composite score: 60% metrics + 25% alerts + 15% tier-adjusted
- Classifies: Healthy (≥80), Warning (50-79), Critical (<50)
- Writes results to `cluster_health_scores` and `cluster_risk_snapshots`
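The weighting and bands above can be sketched as a small function. This is a minimal sketch assuming the three sub-scores each arrive on a 0-100 scale; the scorer's actual sub-score computation is not shown in this document:

```python
def composite_score(metrics_score: float, alerts_score: float,
                    tier_adjusted_score: float) -> tuple[float, str]:
    """Blend the three sub-scores per the 60/25/15 weighting and
    classify into the documented health bands."""
    score = 0.60 * metrics_score + 0.25 * alerts_score + 0.15 * tier_adjusted_score
    if score >= 80:
        level = "Healthy"
    elif score >= 50:
        level = "Warning"
    else:
        level = "Critical"
    return round(score, 1), level
```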
Key Metrics
| Metric | Category | Thresholds (Healthy / Warning / Critical) |
|---|---|---|
| `starrocks_be_cpu_util_percent` | CPU | <55% / 55-70% / >70% avg |
| `starrocks_be_jvm_heap_used_percent` | Memory | <55% / 55-75% / >75% |
| `starrocks_be_disks_data_used_pct` | Disk | <70% / 70-85% / >85% |
| `starrocks_be_compaction_score_average` | Compaction | <100 / 100-500 / >500 |
| `starrocks_fe_query_latency_ms_p99` | Latency | <5s / 5-15s / >15s |
| `starrocks_fe_query_err` | Errors | <0.1% / 0.1-2% / >2% |
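A minimal classifier over these thresholds might look like the following. The `(warning, critical)` bound pairs are transcribed from the table (latency in ms); the table structure and helper name are illustrative, not the scorer's real code:

```python
# Hypothetical threshold map: metric -> (warning_bound, critical_bound).
# Values below the warning bound are Healthy; above the critical bound, Critical.
THRESHOLDS = {
    "starrocks_be_cpu_util_percent": (55.0, 70.0),
    "starrocks_be_jvm_heap_used_percent": (55.0, 75.0),
    "starrocks_be_disks_data_used_pct": (70.0, 85.0),
    "starrocks_be_compaction_score_average": (100.0, 500.0),
    "starrocks_fe_query_latency_ms_p99": (5000.0, 15000.0),
    "starrocks_fe_query_err": (0.1, 2.0),
}

def classify_metric(name: str, value: float) -> str:
    warn, crit = THRESHOLDS[name]
    if value < warn:
        return "Healthy"
    if value <= crit:
        return "Warning"
    return "Critical"
```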
2. Alert Flow
Alerts arrive from two sources and converge into the `alerts.lark_alerts` table.
Lark Pipeline Details
`daily_alert_pipeline.py` -- the primary cron-based ingestion path:
- Runs on EC2 (same VPC as StarRocks, no SSH tunnel needed)
- Fetches messages from Lark API with pagination (`page_size: 50`)
- Each message is an interactive card containing one or more alert sections
- `parse_alert_card()` extracts: cluster_name, region, account_name, cluster_id, alert_status (Firing/Resolved), alert_name, alert_detail, dashboard/silence URLs
- Supports `--backfill N`, `--catch-up`, and `--today` modes
- Multi-alert cards (e.g., Firing + Resolved in same card) produce pipe-delimited fields in a single row
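The pagination step can be sketched as a generator that drains pages until the token runs out. `client.list_messages` is a hypothetical wrapper over the Lark API, not the pipeline's actual interface:

```python
def fetch_all_messages(client, chat_id: str, page_size: int = 50):
    """Yield every message in a chat by following pagination tokens.
    `client.list_messages` (hypothetical) returns (items, next_page_token),
    with next_page_token falsy on the final page."""
    page_token = None
    while True:
        items, page_token = client.list_messages(
            chat_id=chat_id, page_size=page_size, page_token=page_token)
        yield from items
        if not page_token:
            break
```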
`alert_webhook.py` -- real-time Grafana integration:
- Flask app on port 5050 receiving Alertmanager webhook POSTs
- Each alert gets a deterministic `message_id` = `wh_` + SHA256(alertname + cluster_id + status + startsAt)
- Re-posting the same alert is a no-op (idempotent)
- The `wh_` prefix prevents collision with Lark message IDs (`om_` prefix)
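The deterministic id scheme reduces to a few lines. The plain concatenation of the four fields inside the hash is assumed from the description above; the real code may delimit or order them differently:

```python
import hashlib

def webhook_message_id(alertname: str, cluster_id: str,
                       status: str, starts_at: str) -> str:
    """Derive a stable message_id so re-posting the same alert is a no-op.
    The `wh_` prefix keeps these ids disjoint from Lark's `om_`-prefixed ones."""
    digest = hashlib.sha256(
        f"{alertname}{cluster_id}{status}{starts_at}".encode()).hexdigest()
    return "wh_" + digest
```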
Issue Grouping
The Issue Tracker (`byoc_agent/issue_tracker.py`) groups raw alerts into actionable issues:
- ANOMALY alerts (ongoing conditions like `OperationDurationGT`, `HeapUsageTooHigh`): all firings on the same cluster merge into one issue until resolved
- FAILURE alerts (discrete events like `OperationAbnormal`, `BeAliveAbnormal`): time-window grouping (30 min default), correlation grouping (4 hr)
- ESCALATION: a failure on a cluster with an open anomaly in the same correlation group merges into the anomaly issue and escalates it
- Issue lifecycle: `issue_status` (Ongoing/Resolved), `triage_status` (New -> Acknowledged -> Investigating -> Mitigating -> Closed), `disposition_status` (New -> No Action Needed / JIRA Created)
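The ANOMALY vs FAILURE merge rules can be sketched as a predicate. The dict fields and the reduction to a single `last_seen` timestamp are illustrative simplifications of the tracker's state:

```python
from datetime import timedelta

FAILURE_WINDOW = timedelta(minutes=30)   # default time-window grouping for FAILURE

def should_merge(issue: dict, alert: dict) -> bool:
    """Decide whether a new firing alert joins an existing open issue.
    ANOMALY issues absorb any firing on the same cluster until resolved;
    FAILURE issues group discrete events within a 30-minute window."""
    if issue["cluster"] != alert["cluster"] or issue["status"] != "Ongoing":
        return False
    if issue["kind"] == "ANOMALY":
        return True
    return alert["time"] - issue["last_seen"] <= FAILURE_WINDOW
```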
3. Agent Flow
The agent pipeline is a detect-investigate-report cycle running on a 15-minute cadence.
Sentinel Triggers
| Trigger | Condition | Priority | Cooldown |
|---|---|---|---|
| New Critical | risk_level = 'Critical' and no task in last 4 hours | 1 (highest) | 4 hours per cluster |
| Score Cliff | Health score dropped >15 points vs previous snapshot | 2 | 4 hours per cluster |
| Alert Storm | >5 firing alerts for one cluster in last hour | 2 | 4 hours per cluster |
| Tier A/S Warning | Customer tier A or S with Warning or Critical risk | 3 | 4 hours per cluster |
Investigator Protocol
The Investigator uses Claude with tool-use to run an 8-step autonomous investigation:
1. Gather context -- `query_cluster_info` for customer tier, region, config
2. Check alerts -- `query_cluster_alerts` for recent firing/resolved patterns
3. Analyze metrics -- `query_cluster_metrics` for CPU, memory, disk, compaction, latency, errors
4. Check operations -- `query_cluster_operations` for recent START/STOP/UPDATE events
5. Search Knowledge Lake -- hybrid vector+fulltext search for similar past incidents
6. Check past investigations -- `query_agent_memory` for repeat issues
7. Check sibling clusters -- `query_similar_clusters` to determine if account-wide or cluster-specific
8. Custom SQL -- `run_sql` for ad-hoc data not covered by pre-built tools
Output is structured: SEVERITY, CONFIDENCE (0.0-1.0), ROOT_CAUSE, EVIDENCE, and RECOMMENDATIONS.
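The structured output could be modeled as a small dataclass. This is an illustrative container for the five fields, not the Investigator's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class InvestigationReport:
    """Illustrative shape for the Investigator's structured output."""
    severity: str                                   # e.g. "Critical", "Warning"
    confidence: float                               # must fall in 0.0-1.0
    root_cause: str
    evidence: list[str] = field(default_factory=list)
    recommendations: list[str] = field(default_factory=list)

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within 0.0-1.0")
```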
Knowledge Lake
The Knowledge Lake MCP server (`byoc_agent/knowledge_lake_client.py`) provides search over proprietary StarRocks operational knowledge:
- DW on-call logs -- Past incident reports and resolutions
- Known bugs by SR version -- Version-specific issues and workarounds
- Monitoring guidelines -- Threshold recommendations and alert tuning
- Proven resolutions -- Documented fixes for common failure modes
Search modes: `search_hybrid` (vector + fulltext), `search_fulltext` (keyword), `search_vector` (semantic similarity).
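One common way to combine the two ranked lists behind a hybrid mode is reciprocal-rank fusion; the server's actual fusion strategy is not specified in this document, so this is only a sketch:

```python
def hybrid_rank(vector_hits: list[str], fulltext_hits: list[str],
                k: int = 60) -> list[str]:
    """Merge two ranked doc-id lists with reciprocal-rank fusion (RRF):
    each appearance contributes 1 / (k + rank + 1) to the doc's score,
    so documents ranked highly by both lists rise to the top."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, fulltext_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```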