ClusterRiskAnalyzer -- Threshold-Based Risk Classification
Source: byoc_agent/cluster_risk_analyzer.py
Rules: byoc_agent/cluster_risk_rules.yaml (v3)
The ClusterRiskAnalyzer complements the weighted health scoring in HealthScorer. It uses a threshold-based approach: any single metric crossing a threshold flags the cluster, and the overall risk level is the worst level triggered.
Risk Dimensions
15 risk dimensions across 6 categories:
Category 1: Storage & Compaction
| Dimension | Warning | Critical | Source |
|---|---|---|---|
| compaction_score | 1,000 | 5,000 | Gauge (max) -- matches production alert |
| disk_used_pct | 80% | 90% | Derived (1 - free/size) |
| data_growth_pct | 100% | 300% | Derived (current vs 7d ago) |
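The two derived storage dimensions can be sketched as below. This is an illustration only: the function names and the exact growth formula (relative change against the 7-day-old value) are assumptions, since the table gives just "1 - free/size" and "current vs 7d ago".

```python
from typing import Optional


def disk_used_pct(free_bytes: float, size_bytes: float) -> float:
    """Percent of disk capacity in use, computed as (1 - free/size) * 100."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return (1.0 - free_bytes / size_bytes) * 100.0


def data_growth_pct(current_bytes: float, bytes_7d_ago: float) -> Optional[float]:
    """Percent growth relative to the value from ~7 days ago.

    Assumed formula: (current - old) / old * 100.
    Returns None when no usable baseline exists.
    """
    if bytes_7d_ago <= 0:
        return None
    return (current_bytes - bytes_7d_ago) / bytes_7d_ago * 100.0
```

Under this assumed formula, a cluster that doubles its data in a week sits exactly at the 100% Warning threshold, and one that quadruples sits at the 300% Critical threshold.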
Category 2: Query Performance
| Dimension | Warning | Critical | Source |
|---|---|---|---|
| query_errors_7d | 10,000 | 50,000 | Counter delta |
| query_timeouts_7d | 500 | 5,000 | Counter delta |
| slow_queries_7d | 50,000 | 500,000 | Counter delta |
| internal_errors_7d | 1,000 | 10,000 | Counter delta |
Category 3: Ingestion Health
| Dimension | Warning | Critical | Source |
|---|---|---|---|
| txn_failures_7d | 10,000 | 50,000 | Counter delta |
| txn_rejects_7d | 100 | 5,000 | Counter delta |
Category 4: Resource Pressure
| Dimension | Warning | Critical | Source |
|---|---|---|---|
| memory_available_pct | 20% | 10% | Derived, inverted |
| fe_heap_used_pct | 85% | 95% | Derived |
| be_process_mem_gb | 120 GB | 180 GB | Gauge (max), converted |
Category 5: FE & Metadata Health
| Dimension | Warning | Critical | Source |
|---|---|---|---|
| fe_journal_write_latency_ms | 1,000 ms | 10,000 ms | Gauge (max) |
| fe_connection_total | 500 | 900 | Gauge (max) |
Category 6: Process Health
| Dimension | Warning | Critical | Source |
|---|---|---|---|
| be_fd_usage | 30,000 | 50,000 | Gauge (max) |
ClusterRiskSnapshot Dataclass
Each cluster produces a ClusterRiskSnapshot containing:
- All raw metric values (compaction, disk, memory, errors, etc.)
- Context fields (avg_latency_ms, qps, node_count)
- Classification result:
  - `risk_level` (Healthy/Warning/Critical)
  - `risk_reasons` (list of triggered reasons)
  - `suggested_actions` (list of remediation steps)
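A minimal sketch of what such a dataclass might look like follows. Only `risk_level`, `risk_reasons`, `suggested_actions`, the context fields, and `cluster_id` are named in this document. The raw-metric fields shown are a small illustrative subset borrowed from the dimension names, and the types and defaults are assumptions rather than the actual definition in the source file.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClusterRiskSnapshot:
    # Identity; the persistence table is keyed on cluster_id.
    cluster_id: str

    # Raw metric values (illustrative subset, types assumed).
    compaction_score: Optional[float] = None
    disk_used_pct: Optional[float] = None
    memory_available_pct: Optional[float] = None
    query_errors_7d: Optional[float] = None

    # Context fields.
    avg_latency_ms: Optional[float] = None
    qps: Optional[float] = None
    node_count: Optional[int] = None

    # Classification result.
    risk_level: str = "Healthy"
    risk_reasons: List[str] = field(default_factory=list)
    suggested_actions: List[str] = field(default_factory=list)
```

Using `field(default_factory=list)` gives every snapshot its own list, so reasons appended for one cluster never leak into another.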
Classification Logic
For each dimension:
- Read the raw metric value from the snapshot.
- Compare against Warning and Critical thresholds.
- For inverted dimensions (lower = worse): `value <= critical` triggers Critical.
- For normal dimensions (higher = worse): `value >= critical` triggers Critical.
- Each triggered dimension adds a reason (from `reason_template`) and an action.
- Overall `risk_level` = the worst triggered level across all dimensions.
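The steps above can be sketched as follows. The `Rule` structure, the function names, and the template placeholders are hypothetical stand-ins for whatever the rules YAML actually defines. The sketch also assumes that Warning uses the same comparison direction as Critical and that a missing metric does not trigger.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

# Higher rank = worse level.
LEVEL_ORDER = {"Healthy": 0, "Warning": 1, "Critical": 2}


class Rule(NamedTuple):
    """Hypothetical in-memory form of one rule entry."""
    warning: float
    critical: float
    inverted: bool = False  # True when lower values are worse
    reason_template: str = "{name}={value} crossed {level} threshold {threshold}"
    action: str = "Investigate"


def classify_dimension(value: Optional[float], rule: Rule) -> str:
    """Return the level triggered by a single metric value."""
    if value is None:  # assumption: missing data does not trigger
        return "Healthy"
    if rule.inverted:
        if value <= rule.critical:
            return "Critical"
        if value <= rule.warning:
            return "Warning"
    else:
        if value >= rule.critical:
            return "Critical"
        if value >= rule.warning:
            return "Warning"
    return "Healthy"


def classify(
    metrics: Dict[str, Optional[float]], rules: Dict[str, Rule]
) -> Tuple[str, List[str], List[str]]:
    """Return (overall risk_level, risk_reasons, suggested_actions)."""
    overall = "Healthy"
    reasons: List[str] = []
    actions: List[str] = []
    for name, rule in rules.items():
        value = metrics.get(name)
        level = classify_dimension(value, rule)
        if level == "Healthy":
            continue
        threshold = rule.critical if level == "Critical" else rule.warning
        reasons.append(
            rule.reason_template.format(
                name=name, value=value, level=level, threshold=threshold
            )
        )
        actions.append(rule.action)
        if LEVEL_ORDER[level] > LEVEL_ORDER[overall]:
            overall = level  # keep the worst level seen so far
    return overall, reasons, actions
```

For example, with `disk_used_pct` at 85 (Warning) and `memory_available_pct` at 8 (Critical, inverted), the overall level is Critical and both dimensions contribute a reason.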
Data Fetching
Four batch queries fetch all the data:
| Query | Purpose |
|---|---|
| _fetch_gauge_metrics(lookback) | Gauge + rate metrics (12 metric names) from MV |
| _fetch_gauge_metrics_7d_ago(lookback) | Data size from ~7 days ago for growth calculation |
| _fetch_counter_metrics(lookback) | Counter metrics (7 metric names) using MAX-MIN per host |
| _fetch_enrichment() | Cluster names, account names, node counts from byoc tables |
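The "MAX-MIN per host" technique for counter metrics can be illustrated in plain Python. The real queries presumably do this in SQL. The sample shape, the function name, and the choice to sum host deltas into a per-cluster total are all assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical sample shape: (cluster_id, host, cumulative_counter_value)
Sample = Tuple[str, str, float]


def counter_delta_by_cluster(samples: Iterable[Sample]) -> Dict[str, float]:
    """Compute MAX - MIN of a cumulative counter per host, summed per cluster."""
    lo: Dict[Tuple[str, str], float] = {}
    hi: Dict[Tuple[str, str], float] = {}
    for cluster_id, host, value in samples:
        key = (cluster_id, host)
        lo[key] = min(lo.get(key, value), value)
        hi[key] = max(hi.get(key, value), value)

    totals: Dict[str, float] = defaultdict(float)
    for (cluster_id, host), low in lo.items():
        totals[cluster_id] += hi[(cluster_id, host)] - low
    return dict(totals)
```

Taking the delta per host before aggregating matters because each host keeps its own cumulative counter. Note that MAX-MIN undercounts if a counter resets inside the window, for example after a process restart.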
Persistence
Results are stored in alerts.cluster_risk_snapshots as a PRIMARY KEY table (keyed on cluster_id). Each run upserts the latest snapshot, replacing the previous row per cluster.
Sorting
Results are sorted:
- Critical first, then Warning, then Healthy
- Within each level, by number of risk reasons descending (most reasons = most problematic)
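One way to express this ordering is a composite sort key. The sketch below uses plain `(cluster_id, risk_level, risk_reasons)` tuples as a stand-in for snapshot objects, and `sort_snapshots` is a hypothetical name.

```python
from typing import List, Tuple

# Lower rank sorts first: Critical, then Warning, then Healthy.
LEVEL_RANK = {"Critical": 0, "Warning": 1, "Healthy": 2}

# Stand-in for a snapshot: (cluster_id, risk_level, risk_reasons)
Snap = Tuple[str, str, List[str]]


def sort_snapshots(snapshots: List[Snap]) -> List[Snap]:
    """Sort by severity, then by number of reasons descending."""
    return sorted(snapshots, key=lambda s: (LEVEL_RANK[s[1]], -len(s[2])))
```

Because `sorted` is stable, clusters with the same level and the same number of reasons keep their original relative order.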
Usage
```python
from byoc_agent.cluster_risk_analyzer import compute_cluster_risk, persist_risk_snapshots

# Compute risk snapshots and persist them
snapshots = compute_cluster_risk()
persist_risk_snapshots(snapshots)

# Load current snapshots from DB
from byoc_agent.cluster_risk_analyzer import load_risk_snapshots
current = load_risk_snapshots()
```
Threshold Basis
Thresholds are calibrated from:
- Production Grafana alert rules extracted from Lark alert history (Jan-Mar 2026, 8,474 alerts)
- Real metric distributions across ~184 active BYOC clusters
- StarRocks operational domain knowledge
Example production alert mappings:
- `FEMaxTabletCompaction` fires at compaction > 1,000 (566 alerts, 9 clusters)
- `FreeDiskLessThan10%` fires at disk free < 10% (164 alerts, 7 clusters)
- `FEHeapUsageTooHigh` fires at FE JVM heap > 95% (421 alerts, 12 clusters)