Data Flow

Three primary data flows feed the platform: metrics, alerts, and agent pipelines.

1. Metrics Flow

Raw telemetry from StarRocks clusters is collected into metrics.metrics_table (~830M rows). A materialized view pre-aggregates this into hourly snapshots used by all downstream consumers.

Scoring Pipeline

The unified scorer (byoc_agent/unified_scorer.py) runs every 15 minutes:

  1. Fetches gauge and counter metrics from amv_hourly_snapshots_v1 (14 dimensions including CPU, memory, disk, compaction, query latency, error rate)
  2. Fetches 7-day-ago baselines for trend comparison
  3. Pulls firing alert counts from lark_alerts (7-day window)
  4. Pulls customer tier from dim_customer_profile
  5. Computes composite score: 60% metrics + 25% alerts + 15% tier-adjusted
  6. Classifies: Healthy (≥80), Warning (50-79), Critical (<50)
  7. Writes results to cluster_health_scores and cluster_risk_snapshots
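The weighting and classification steps above can be sketched as follows. The function names and signatures are illustrative, not taken from unified_scorer.py; only the weights (60/25/15) and the score bands come from the pipeline description.

```python
def composite_score(metrics_score: float, alerts_score: float,
                    tier_score: float) -> float:
    """Weighted blend: 60% metrics + 25% alerts + 15% tier-adjusted.
    Each input is assumed to be a 0-100 sub-score."""
    return 0.60 * metrics_score + 0.25 * alerts_score + 0.15 * tier_score


def classify(score: float) -> str:
    """Map a composite score to the three health bands."""
    if score >= 80:
        return "Healthy"
    if score >= 50:
        return "Warning"
    return "Critical"
```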

Key Metrics

Metric | Category | Thresholds (Healthy / Warning / Critical)
starrocks_be_cpu_util_percent | CPU | <55% / 55-70% / >70% avg
starrocks_be_jvm_heap_used_percent | Memory | <55% / 55-75% / >75%
starrocks_be_disks_data_used_pct | Disk | <70% / 70-85% / >85%
starrocks_be_compaction_score_average | Compaction | <100 / 100-500 / >500
starrocks_fe_query_latency_ms_p99 | Latency | <5s / 5-15s / >15s
starrocks_fe_query_err | Errors | <0.1% / 0.1-2% / >2%
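A minimal sketch of how a metric value maps to a band under these thresholds. The `THRESHOLDS` dict and `band` function are hypothetical helpers for illustration; only the bounds themselves come from the table above.

```python
# (healthy upper bound, warning upper bound); above the warning bound
# is Critical. Values mirror the unitless/percentage thresholds above.
THRESHOLDS = {
    "starrocks_be_cpu_util_percent": (55, 70),
    "starrocks_be_jvm_heap_used_percent": (55, 75),
    "starrocks_be_disks_data_used_pct": (70, 85),
    "starrocks_be_compaction_score_average": (100, 500),
}


def band(metric: str, value: float) -> str:
    """Classify a single metric reading against its thresholds."""
    healthy_max, warning_max = THRESHOLDS[metric]
    if value < healthy_max:
        return "Healthy"
    if value <= warning_max:
        return "Warning"
    return "Critical"
```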

2. Alert Flow

Alerts arrive from two sources and converge into the alerts.lark_alerts table.

Lark Pipeline Details

daily_alert_pipeline.py -- the primary cron-based ingestion path:

  • Runs on EC2 (same VPC as StarRocks, no SSH tunnel needed)
  • Fetches messages from Lark API with pagination (page_size: 50)
  • Each message is an interactive card containing one or more alert sections
  • parse_alert_card() extracts: cluster_name, region, account_name, cluster_id, alert_status (Firing/Resolved), alert_name, alert_detail, dashboard/silence URLs
  • Supports --backfill N, --catch-up, and --today modes
  • Multi-alert cards (e.g., Firing + Resolved in same card) produce pipe-delimited fields in a single row
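The pagination step can be sketched generically. `fetch_page` here is a stand-in for the actual Lark messages call (the real client and endpoint are not shown in this doc); only the page size of 50 comes from the pipeline description.

```python
def fetch_all_messages(fetch_page, page_size: int = 50):
    """Drain a paginated API. `fetch_page(page_token, page_size)` must
    return (items, next_token), with next_token=None on the last page."""
    token = None
    while True:
        items, token = fetch_page(token, page_size)
        yield from items
        if token is None:
            break
```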

alert_webhook.py -- real-time Grafana integration:

  • Flask app on port 5050 receiving Alertmanager webhook POSTs
  • Each alert gets a deterministic message_id = wh_ + SHA256(alertname + cluster_id + status + startsAt)
  • Re-posting the same alert is a no-op (idempotent)
  • The wh_ prefix prevents collision with Lark message IDs (om_ prefix)
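The deterministic id scheme can be sketched as below. The exact field concatenation and encoding are assumptions; the `wh_` prefix and SHA256-over-identifying-fields design come from the description above.

```python
import hashlib


def webhook_message_id(alertname: str, cluster_id: str,
                       status: str, starts_at: str) -> str:
    """Deterministic id for an Alertmanager webhook alert: the same
    alert always hashes to the same id, making re-posts a no-op."""
    digest = hashlib.sha256(
        (alertname + cluster_id + status + starts_at).encode()
    ).hexdigest()
    return "wh_" + digest
```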

Issue Grouping

The Issue Tracker (byoc_agent/issue_tracker.py) groups raw alerts into actionable issues:

  • ANOMALY alerts (ongoing conditions like OperationDurationGT, HeapUsageTooHigh): All firings on the same cluster merge into one issue until resolved
  • FAILURE alerts (discrete events like OperationAbnormal, BeAliveAbnormal): Time-window grouping (30 min default), correlation grouping (4 hr)
  • ESCALATION: A failure on a cluster with an open anomaly in the same correlation group merges into the anomaly issue and escalates it
  • Issue lifecycle: issue_status (Ongoing/Resolved), triage_status (New -> Acknowledged -> Investigating -> Mitigating -> Closed), disposition_status (New -> No Action Needed / JIRA Created)
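The FAILURE time-window rule can be sketched as a rolling 30-minute merge. This is a simplified illustration (it only merges consecutive alerts per cluster and ignores correlation groups); the data shape and function name are assumptions, the 30-minute default is from the rules above.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # default FAILURE time window


def group_failures(alerts):
    """Group time-sorted (cluster, timestamp) FAILURE alerts: a new
    alert joins the previous issue if it is on the same cluster and
    within WINDOW of that issue's last firing."""
    issues = []
    for cluster, ts in alerts:
        if issues and issues[-1]["cluster"] == cluster \
                and ts - issues[-1]["last_seen"] <= WINDOW:
            issues[-1]["last_seen"] = ts
            issues[-1]["count"] += 1
        else:
            issues.append({"cluster": cluster, "last_seen": ts, "count": 1})
    return issues
```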

3. Agent Flow

The agent pipeline is a detect-investigate-report cycle running on a 15-minute cadence.

Sentinel Triggers

Trigger | Condition | Priority | Cooldown
New Critical | risk_level = 'Critical' and no task in last 4 hours | 1 (highest) | 4 hours per cluster
Score Cliff | Health score dropped >15 points vs previous snapshot | 2 | 4 hours per cluster
Alert Storm | >5 firing alerts for one cluster in last hour | 2 | 4 hours per cluster
Tier A/S Warning | Customer tier A or S with Warning or Critical risk | 3 | 4 hours per cluster
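The per-cluster cooldown gate shared by these triggers can be sketched as below. The `should_fire` helper and its state shape are illustrative; the 4-hour cooldown is from the trigger table.

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(hours=4)  # per cluster, per trigger


def should_fire(last_fired: dict, cluster: str, trigger: str,
                now: datetime) -> bool:
    """Return True and record the firing time if the (cluster, trigger)
    pair is outside its cooldown window; otherwise suppress."""
    prev = last_fired.get((cluster, trigger))
    if prev is not None and now - prev < COOLDOWN:
        return False
    last_fired[(cluster, trigger)] = now
    return True
```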

Investigator Protocol

The Investigator uses Claude with tool-use to run an 8-step autonomous investigation:

  1. Gather context -- query_cluster_info for customer tier, region, config
  2. Check alerts -- query_cluster_alerts for recent firing/resolved patterns
  3. Analyze metrics -- query_cluster_metrics for CPU, memory, disk, compaction, latency, errors
  4. Check operations -- query_cluster_operations for recent START/STOP/UPDATE events
  5. Search Knowledge Lake -- Hybrid vector+fulltext search for similar past incidents
  6. Check past investigations -- query_agent_memory for repeat issues
  7. Check sibling clusters -- query_similar_clusters to determine if account-wide or cluster-specific
  8. Custom SQL -- run_sql for ad-hoc data not covered by pre-built tools

Output is structured: SEVERITY, CONFIDENCE (0.0-1.0), ROOT_CAUSE, EVIDENCE, and RECOMMENDATIONS.
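A minimal validator for that structured output might look like this. The field names follow the report format above; the validator itself is a sketch, not code from the agent.

```python
def validate_finding(finding: dict) -> dict:
    """Check that an investigation result carries all required fields
    and a CONFIDENCE value in [0.0, 1.0]."""
    required = {"SEVERITY", "CONFIDENCE", "ROOT_CAUSE",
                "EVIDENCE", "RECOMMENDATIONS"}
    missing = required - finding.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    conf = float(finding["CONFIDENCE"])
    if not 0.0 <= conf <= 1.0:
        raise ValueError("CONFIDENCE must be in [0.0, 1.0]")
    return finding
```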

Knowledge Lake

The Knowledge Lake MCP server (byoc_agent/knowledge_lake_client.py) provides search over proprietary StarRocks operational knowledge:

  • DW on-call logs -- Past incident reports and resolutions
  • Known bugs by SR version -- Version-specific issues and workarounds
  • Monitoring guidelines -- Threshold recommendations and alert tuning
  • Proven resolutions -- Documented fixes for common failure modes

Search modes: search_hybrid (vector + fulltext), search_fulltext (keyword), search_vector (semantic similarity).
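One common way to combine vector and fulltext rankings is reciprocal rank fusion; whether search_hybrid uses RRF specifically is an assumption, so treat this as an illustrative sketch only.

```python
def hybrid_rank(vector_hits, fulltext_hits, k: int = 60):
    """Reciprocal rank fusion: each list contributes 1/(k + rank + 1)
    per document, so items ranked well by both searches rise to the
    top. Inputs are lists of doc ids, best first."""
    scores = {}
    for hits in (vector_hits, fulltext_hits):
        for rank, doc in enumerate(hits):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```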