Data Flow

Three primary data flows feed the platform: metrics, alerts, and agent pipelines.

1. Metrics Flow

Raw telemetry from StarRocks clusters is collected into metrics.metrics_table (~830M rows). A materialized view pre-aggregates this into hourly snapshots used by all downstream consumers.

Scoring Pipeline

The unified scorer (byoc_agent/unified_scorer.py) runs every 15 minutes:

  1. Fetches gauge and counter metrics from amv_hourly_snapshots_v1 (14 dimensions including CPU, memory, disk, compaction, query latency, error rate)
  2. Fetches 7-day-ago baselines for trend comparison
  3. Pulls firing alert counts from lark_alerts (7-day window)
  4. Pulls customer tier from dim_customer_profile
  5. Computes composite score: 60% metrics + 25% alerts + 15% tier-adjusted
  6. Classifies: Healthy (≥80), Warning (50-79), Critical (<50)
  7. Writes results to cluster_health_scores and cluster_risk_snapshots
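The weighting and classification steps above can be sketched as follows. The function names and signatures are illustrative, not taken from unified_scorer.py; only the weights (60/25/15) and the score bands come from the pipeline description.

```python
def composite_score(metrics_score: float, alerts_score: float,
                    tier_score: float) -> float:
    """Weighted blend: 60% metrics + 25% alerts + 15% tier-adjusted.
    Each input is assumed to be a 0-100 sub-score."""
    return 0.60 * metrics_score + 0.25 * alerts_score + 0.15 * tier_score


def classify(score: float) -> str:
    """Map a composite score to the three health bands."""
    if score >= 80:
        return "Healthy"
    if score >= 50:
        return "Warning"
    return "Critical"
```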

Key Metrics

Metric | Category | Thresholds (Healthy / Warning / Critical)
starrocks_be_cpu_util_percent | CPU | <55% / 55-70% / >70% avg
starrocks_be_jvm_heap_used_percent | Memory | <55% / 55-75% / >75%
starrocks_be_disks_data_used_pct | Disk | <70% / 70-85% / >85%
starrocks_be_compaction_score_average | Compaction | <100 / 100-500 / >500
starrocks_fe_query_latency_ms_p99 | Latency | <5s / 5-15s / >15s
starrocks_fe_query_err | Errors | <0.1% / 0.1-2% / >2%
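A minimal sketch of how a metric value maps to a band under these thresholds. The `THRESHOLDS` dict and `band` function are hypothetical helpers for illustration; only the bounds themselves come from the table above.

```python
# (healthy upper bound, warning upper bound); above the warning bound
# is Critical. Values mirror the unitless/percentage thresholds above.
THRESHOLDS = {
    "starrocks_be_cpu_util_percent": (55, 70),
    "starrocks_be_jvm_heap_used_percent": (55, 75),
    "starrocks_be_disks_data_used_pct": (70, 85),
    "starrocks_be_compaction_score_average": (100, 500),
}


def band(metric: str, value: float) -> str:
    """Classify a single metric reading against its thresholds."""
    healthy_max, warning_max = THRESHOLDS[metric]
    if value < healthy_max:
        return "Healthy"
    if value <= warning_max:
        return "Warning"
    return "Critical"
```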

2. Alert Flow

Alerts arrive from two sources and converge into the alerts.lark_alerts table.

Lark Pipeline Details

daily_alert_pipeline.py -- the primary cron-based ingestion path:

  • Runs on EC2 (same VPC as StarRocks, no SSH tunnel needed)
  • Fetches messages from Lark API with pagination (page_size: 50)
  • Each message is an interactive card containing one or more alert sections
  • parse_alert_card() extracts: cluster_name, region, account_name, cluster_id, alert_status (Firing/Resolved), alert_name, alert_detail, dashboard/silence URLs
  • Supports --backfill N, --catch-up, and --today modes
  • Multi-alert cards (e.g., Firing + Resolved in same card) produce pipe-delimited fields in a single row
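The pagination step can be sketched generically. `fetch_page` here is a stand-in for the actual Lark messages call (the real client and endpoint are not shown in this doc); only the page size of 50 comes from the pipeline description.

```python
def fetch_all_messages(fetch_page, page_size: int = 50):
    """Drain a paginated API. `fetch_page(page_token, page_size)` must
    return (items, next_token), with next_token=None on the last page."""
    token = None
    while True:
        items, token = fetch_page(token, page_size)
        yield from items
        if token is None:
            break
```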

alert_webhook.py -- real-time Grafana integration:

  • Flask app on port 5050 receiving Alertmanager webhook POSTs
  • Each alert gets a deterministic message_id = wh_ + SHA256(alertname + cluster_id + status + startsAt)
  • Re-posting the same alert is a no-op (idempotent)
  • The wh_ prefix prevents collision with Lark message IDs (om_ prefix)
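The deterministic id scheme can be sketched as below. The exact field concatenation and encoding are assumptions; the `wh_` prefix and SHA256-over-identifying-fields design come from the description above.

```python
import hashlib


def webhook_message_id(alertname: str, cluster_id: str,
                       status: str, starts_at: str) -> str:
    """Deterministic id for an Alertmanager webhook alert: the same
    alert always hashes to the same id, making re-posts a no-op."""
    digest = hashlib.sha256(
        (alertname + cluster_id + status + starts_at).encode()
    ).hexdigest()
    return "wh_" + digest
```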

Issue Grouping

The Issue Tracker (byoc_agent/issue_tracker.py) groups raw alerts into actionable issues:

  • ANOMALY alerts (ongoing conditions like OperationDurationGT, HeapUsageTooHigh): All firings on the same cluster merge into one issue until resolved
  • FAILURE alerts (discrete events like OperationAbnormal, BeAliveAbnormal): Time-window grouping (30 min default), correlation grouping (4 hr)
  • ESCALATION: A failure on a cluster with an open anomaly in the same correlation group merges into the anomaly issue and escalates it
  • Issue lifecycle: issue_status (Ongoing/Resolved), triage_status (New -> Acknowledged -> Investigating -> Mitigating -> Closed), disposition_status (New -> No Action Needed / JIRA Created)
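The FAILURE time-window rule can be sketched as a rolling 30-minute merge. This is a simplified illustration (it only merges consecutive alerts per cluster and ignores correlation groups); the data shape and function name are assumptions, the 30-minute default is from the rules above.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # default FAILURE time window


def group_failures(alerts):
    """Group time-sorted (cluster, timestamp) FAILURE alerts: a new
    alert joins the previous issue if it is on the same cluster and
    within WINDOW of that issue's last firing."""
    issues = []
    for cluster, ts in alerts:
        if issues and issues[-1]["cluster"] == cluster \
                and ts - issues[-1]["last_seen"] <= WINDOW:
            issues[-1]["last_seen"] = ts
            issues[-1]["count"] += 1
        else:
            issues.append({"cluster": cluster, "last_seen": ts, "count": 1})
    return issues
```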

3. Agent Flow

The agent pipeline is a detect-investigate-report cycle running on a 15-minute cadence.

Sentinel Triggers

Trigger | Condition | Priority | Cooldown
New Critical | risk_level = 'Critical' and no task in last 4 hours | 1 (highest) | 4 hours per cluster
Score Cliff | Health score dropped >15 points vs previous snapshot | 2 | 4 hours per cluster
Alert Storm | >5 firing alerts for one cluster in last hour | 2 | 4 hours per cluster
Tier A/S Warning | Customer tier A or S with Warning or Critical risk | 3 | 4 hours per cluster
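The per-cluster cooldown gate shared by these triggers can be sketched as below. The `should_fire` helper and its state shape are illustrative; the 4-hour cooldown is from the trigger table.

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(hours=4)  # per cluster, per trigger


def should_fire(last_fired: dict, cluster: str, trigger: str,
                now: datetime) -> bool:
    """Return True and record the firing time if the (cluster, trigger)
    pair is outside its cooldown window; otherwise suppress."""
    prev = last_fired.get((cluster, trigger))
    if prev is not None and now - prev < COOLDOWN:
        return False
    last_fired[(cluster, trigger)] = now
    return True
```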

Investigator Protocol

The Investigator uses Claude with tool-use to run an 8-step autonomous investigation:

  1. Gather context -- query_cluster_info for customer tier, region, config
  2. Check alerts -- query_cluster_alerts for recent firing/resolved patterns
  3. Analyze metrics -- query_cluster_metrics for CPU, memory, disk, compaction, latency, errors
  4. Check operations -- query_cluster_operations for recent START/STOP/UPDATE events
  5. Search Knowledge Lake -- Hybrid vector+fulltext search for similar past incidents
  6. Check past investigations -- query_agent_memory for repeat issues
  7. Check sibling clusters -- query_similar_clusters to determine if account-wide or cluster-specific
  8. Custom SQL -- run_sql for ad-hoc data not covered by pre-built tools

Output is structured: SEVERITY, CONFIDENCE (0.0-1.0), ROOT_CAUSE, EVIDENCE, and RECOMMENDATIONS.
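A minimal validator for that structured output might look like this. The field names follow the report format above; the validator itself is a sketch, not code from the agent.

```python
def validate_finding(finding: dict) -> dict:
    """Check that an investigation result carries all required fields
    and a CONFIDENCE value in [0.0, 1.0]."""
    required = {"SEVERITY", "CONFIDENCE", "ROOT_CAUSE",
                "EVIDENCE", "RECOMMENDATIONS"}
    missing = required - finding.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    conf = float(finding["CONFIDENCE"])
    if not 0.0 <= conf <= 1.0:
        raise ValueError("CONFIDENCE must be in [0.0, 1.0]")
    return finding
```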

Knowledge Lake

The Knowledge Lake MCP server (byoc_agent/knowledge_lake_client.py) provides search over proprietary StarRocks operational knowledge:

  • DW on-call logs -- Past incident reports and resolutions
  • Known bugs by SR version -- Version-specific issues and workarounds
  • Monitoring guidelines -- Threshold recommendations and alert tuning
  • Proven resolutions -- Documented fixes for common failure modes

Search modes: search_hybrid (vector + fulltext), search_fulltext (keyword), search_vector (semantic similarity).
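One common way to combine vector and fulltext rankings is reciprocal rank fusion; whether search_hybrid uses RRF specifically is an assumption, so treat this as an illustrative sketch only.

```python
def hybrid_rank(vector_hits, fulltext_hits, k: int = 60):
    """Reciprocal rank fusion: each list contributes 1/(k + rank + 1)
    per document, so items ranked well by both searches rise to the
    top. Inputs are lists of doc ids, best first."""
    scores = {}
    for hits in (vector_hits, fulltext_hits):
        for rank, doc in enumerate(hits):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```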