ClusterRiskAnalyzer -- Threshold-Based Risk Classification

Source: `byoc_agent/cluster_risk_analyzer.py` · Rules: `byoc_agent/cluster_risk_rules.yaml` (v3)

The ClusterRiskAnalyzer complements the weighted health scoring in HealthScorer. It uses a threshold-based approach: any single metric exceeding a threshold flags the cluster, with the overall risk level set to the worst triggered level.

Risk Dimensions

The analyzer evaluates 15 risk dimensions across 6 categories:

Category 1: Storage & Compaction

| Dimension | Warning | Critical | Source |
| --- | --- | --- | --- |
| compaction_score | 1,000 | 5,000 | Gauge (max); matches production alert |
| disk_used_pct | 80% | 90% | Derived (1 - free/size) |
| data_growth_pct | 100% | 300% | Derived (current vs 7d ago) |
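The two derived storage metrics can be sketched as below. The formulas follow the table's "Derived" notes; the function names are illustrative, and reading data_growth_pct as growth relative to the 7-days-ago size is one plausible interpretation, not the module's confirmed formula.

```python
# Illustrative sketch of the derived storage metrics; names are assumptions.

def disk_used_pct(free_bytes: float, size_bytes: float) -> float:
    """Derived as 1 - free/size, expressed as a percentage."""
    return (1 - free_bytes / size_bytes) * 100

def data_growth_pct(current_bytes: float, bytes_7d_ago: float) -> float:
    """Growth versus ~7 days ago, as a percentage of the old size."""
    return (current_bytes - bytes_7d_ago) / bytes_7d_ago * 100

print(disk_used_pct(100, 1000))   # 90.0 -> at the Critical threshold (90%)
print(data_growth_pct(400, 100))  # 300.0 -> at the Critical threshold (300%)
```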

Category 2: Query Performance

| Dimension | Warning | Critical | Source |
| --- | --- | --- | --- |
| query_errors_7d | 10,000 | 50,000 | Counter delta |
| query_timeouts_7d | 500 | 5,000 | Counter delta |
| slow_queries_7d | 50,000 | 500,000 | Counter delta |
| internal_errors_7d | 1,000 | 10,000 | Counter delta |

Category 3: Ingestion Health

| Dimension | Warning | Critical | Source |
| --- | --- | --- | --- |
| txn_failures_7d | 10,000 | 50,000 | Counter delta |
| txn_rejects_7d | 100 | 5,000 | Counter delta |

Category 4: Resource Pressure

| Dimension | Warning | Critical | Source |
| --- | --- | --- | --- |
| memory_available_pct | 20% | 10% | Derived, inverted |
| fe_heap_used_pct | 85% | 95% | Derived |
| be_process_mem_gb | 120 GB | 180 GB | Gauge (max), converted |

Category 5: FE & Metadata Health

| Dimension | Warning | Critical | Source |
| --- | --- | --- | --- |
| fe_journal_write_latency_ms | 1,000 ms | 10,000 ms | Gauge (max) |
| fe_connection_total | 500 | 900 | Gauge (max) |

Category 6: Process Health

| Dimension | Warning | Critical | Source |
| --- | --- | --- | --- |
| be_fd_usage | 30,000 | 50,000 | Gauge (max) |

ClusterRiskSnapshot Dataclass

Each cluster produces a ClusterRiskSnapshot containing:

  • All raw metric values (compaction, disk, memory, errors, etc.)
  • Context fields (avg_latency_ms, qps, node_count)
  • Classification result: risk_level (Healthy/Warning/Critical), risk_reasons (list of triggered reasons), suggested_actions (list of remediation steps)
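A minimal sketch of what such a dataclass might look like, assuming the fields named in the bullets above; any field beyond those (and all defaults) are illustrative, not the module's actual definition:

```python
# Hypothetical shape of ClusterRiskSnapshot; only a subset of the raw
# metric fields is shown, and defaults are assumptions.
from dataclasses import dataclass, field

@dataclass
class ClusterRiskSnapshot:
    cluster_id: str
    # Raw metric values (subset)
    compaction_score: float = 0.0
    disk_used_pct: float = 0.0
    memory_available_pct: float = 100.0
    query_errors_7d: int = 0
    # Context fields
    avg_latency_ms: float = 0.0
    qps: float = 0.0
    node_count: int = 0
    # Classification result
    risk_level: str = "Healthy"  # Healthy / Warning / Critical
    risk_reasons: list = field(default_factory=list)
    suggested_actions: list = field(default_factory=list)

snap = ClusterRiskSnapshot(cluster_id="c-123")
print(snap.risk_level)  # Healthy
```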

Classification Logic

For each dimension:

  1. Read the raw metric value from the snapshot.
  2. Compare against Warning and Critical thresholds.
  3. For inverted dimensions (lower = worse): value <= critical triggers Critical.
  4. For normal dimensions (higher = worse): value >= critical triggers Critical.
  5. Each triggered dimension adds a reason (from reason_template) and an action.
  6. Overall risk_level = the worst triggered level across all dimensions.
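The steps above can be sketched as a small classification loop. The rule schema (the inverted flag and reason_template) mirrors the description, but the exact field names in `cluster_risk_rules.yaml` may differ; the two rules shown are taken from the threshold tables.

```python
# Sketch of the threshold classification, assuming this rule schema.
RULES = {
    "disk_used_pct":        {"warning": 80, "critical": 90, "inverted": False,
                             "reason_template": "disk usage at {value}%"},
    "memory_available_pct": {"warning": 20, "critical": 10, "inverted": True,
                             "reason_template": "only {value}% memory available"},
}

LEVELS = {"Healthy": 0, "Warning": 1, "Critical": 2}

def classify(metrics: dict) -> tuple[str, list]:
    level, reasons = "Healthy", []
    for name, rule in RULES.items():
        value = metrics.get(name)
        if value is None:
            continue
        if rule["inverted"]:  # lower = worse
            triggered = ("Critical" if value <= rule["critical"]
                         else "Warning" if value <= rule["warning"] else None)
        else:                 # higher = worse
            triggered = ("Critical" if value >= rule["critical"]
                         else "Warning" if value >= rule["warning"] else None)
        if triggered:
            reasons.append(rule["reason_template"].format(value=value))
            if LEVELS[triggered] > LEVELS[level]:
                level = triggered  # overall = worst triggered level
    return level, reasons

print(classify({"disk_used_pct": 85, "memory_available_pct": 8}))
# ('Critical', ['disk usage at 85%', 'only 8% memory available'])
```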

Data Fetching

Three batch metric queries plus an enrichment query fetch all data efficiently:

| Query | Purpose |
| --- | --- |
| _fetch_gauge_metrics(lookback) | Gauge + rate metrics (12 metric names) from MV |
| _fetch_gauge_metrics_7d_ago(lookback) | Data size from ~7 days ago for growth calculation |
| _fetch_counter_metrics(lookback) | Counter metrics (7 metric names) using MAX-MIN per host |
| _fetch_enrichment() | Cluster names, account names, node counts from byoc tables |
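The MAX-MIN aggregation for counters can be illustrated as follows. This assumes per-host deltas are summed per cluster, which is a plausible reading of "MAX-MIN per host" rather than the confirmed SQL; the function name is illustrative.

```python
# Sketch of counter-delta aggregation: per host, the window delta is
# max - min of the cumulative counter; host deltas are then summed.
from collections import defaultdict

def counter_delta_7d(samples: list[tuple[str, float]]) -> float:
    """samples: (host, counter_value) pairs over the lookback window."""
    per_host = defaultdict(list)
    for host, value in samples:
        per_host[host].append(value)
    # max - min per host tolerates hosts reporting at different times
    return sum(max(vals) - min(vals) for vals in per_host.values())

samples = [("be-1", 100), ("be-1", 900), ("be-2", 50), ("be-2", 450)]
print(counter_delta_7d(samples))  # 1200
```

Note that a plain MAX-MIN does not account for counter resets (e.g. after a process restart); handling that would need reset detection.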

Persistence

Results are stored in alerts.cluster_risk_snapshots as a PRIMARY KEY table (keyed on cluster_id). Each run upserts the latest snapshot, replacing the previous row per cluster.
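On a StarRocks PRIMARY KEY table, a plain INSERT with an existing key replaces that row, which is what makes the per-cluster upsert work. A hypothetical sketch of building such a statement (the real module's SQL and column list may differ; the escaping here is deliberately naive):

```python
# Illustrative upsert builder for alerts.cluster_risk_snapshots.
# Column names beyond cluster_id/risk_level/risk_reasons are assumptions.
def build_upsert(cluster_id: str, risk_level: str, reasons: list[str]) -> str:
    reasons_sql = ";".join(reasons).replace("'", "''")  # naive escaping only
    return (
        "INSERT INTO alerts.cluster_risk_snapshots "
        "(cluster_id, risk_level, risk_reasons) "
        f"VALUES ('{cluster_id}', '{risk_level}', '{reasons_sql}')"
    )

print(build_upsert("c-123", "Warning", ["disk usage at 85%"]))
```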

Sorting

Results are sorted:

  1. Critical first, then Warning, then Healthy
  2. Within each level, by number of risk reasons descending (most reasons = most problematic)
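The two-level ordering above maps directly to a tuple sort key. The dict-based snapshots here are stand-ins for the real objects, which are assumed to expose risk_level and risk_reasons:

```python
# Sort: Critical < Warning < Healthy, then most risk reasons first.
LEVEL_ORDER = {"Critical": 0, "Warning": 1, "Healthy": 2}

def sort_snapshots(snapshots):
    return sorted(
        snapshots,
        key=lambda s: (LEVEL_ORDER[s["risk_level"]], -len(s["risk_reasons"])),
    )

rows = [
    {"cluster_id": "a", "risk_level": "Warning",  "risk_reasons": ["r1"]},
    {"cluster_id": "b", "risk_level": "Critical", "risk_reasons": ["r1"]},
    {"cluster_id": "c", "risk_level": "Critical", "risk_reasons": ["r1", "r2"]},
]
print([r["cluster_id"] for r in sort_snapshots(rows)])  # ['c', 'b', 'a']
```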

Usage

```python
from byoc_agent.cluster_risk_analyzer import compute_cluster_risk, persist_risk_snapshots

snapshots = compute_cluster_risk()
persist_risk_snapshots(snapshots)

# Load current snapshots from DB
from byoc_agent.cluster_risk_analyzer import load_risk_snapshots
current = load_risk_snapshots()
```

Threshold Basis

Thresholds are calibrated from:

  • Production Grafana alert rules extracted from Lark alert history (Jan-Mar 2026, 8,474 alerts)
  • Real metric distributions across ~184 active BYOC clusters
  • StarRocks operational domain knowledge

Example production alert mappings:

  • FEMaxTabletCompaction fires at compaction > 1,000 (566 alerts, 9 clusters)
  • FreeDiskLessThan10% fires at disk free < 10% (164 alerts, 7 clusters)
  • FEHeapUsageTooHigh fires at FE JVM heap > 95% (421 alerts, 12 clusters)