ClusterRiskAnalyzer -- Threshold-Based Risk Classification

Source: `byoc_agent/cluster_risk_analyzer.py` · Rules: `byoc_agent/cluster_risk_rules.yaml` (v3)

The ClusterRiskAnalyzer complements the weighted health scoring in HealthScorer. It uses a threshold-based approach: any single metric exceeding a threshold flags the cluster, with the overall risk level set to the worst triggered level.

Risk Dimensions

The analyzer evaluates 15 risk dimensions across 6 categories:

Category 1: Storage & Compaction

| Dimension | Warning | Critical | Source |
| --- | --- | --- | --- |
| compaction_score | 1,000 | 5,000 | Gauge (max); matches production alert |
| disk_used_pct | 80% | 90% | Derived (1 - free/size) |
| data_growth_pct | 100% | 300% | Derived (current vs 7d ago) |
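The two derived storage metrics can be sketched as below. The formulas follow the table's "Derived" notes; the function names are illustrative, and reading data_growth_pct as growth relative to the 7-days-ago size is one plausible interpretation, not the module's confirmed formula.

```python
# Illustrative sketch of the derived storage metrics; names are assumptions.

def disk_used_pct(free_bytes: float, size_bytes: float) -> float:
    """Derived as 1 - free/size, expressed as a percentage."""
    return (1 - free_bytes / size_bytes) * 100

def data_growth_pct(current_bytes: float, bytes_7d_ago: float) -> float:
    """Growth versus ~7 days ago, as a percentage of the old size."""
    return (current_bytes - bytes_7d_ago) / bytes_7d_ago * 100

print(disk_used_pct(100, 1000))   # 90.0 -> at the Critical threshold (90%)
print(data_growth_pct(400, 100))  # 300.0 -> at the Critical threshold (300%)
```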

Category 2: Query Performance

| Dimension | Warning | Critical | Source |
| --- | --- | --- | --- |
| query_errors_7d | 10,000 | 50,000 | Counter delta |
| query_timeouts_7d | 500 | 5,000 | Counter delta |
| slow_queries_7d | 50,000 | 500,000 | Counter delta |
| internal_errors_7d | 1,000 | 10,000 | Counter delta |

Category 3: Ingestion Health

| Dimension | Warning | Critical | Source |
| --- | --- | --- | --- |
| txn_failures_7d | 10,000 | 50,000 | Counter delta |
| txn_rejects_7d | 100 | 5,000 | Counter delta |

Category 4: Resource Pressure

| Dimension | Warning | Critical | Source |
| --- | --- | --- | --- |
| memory_available_pct | 20% | 10% | Derived, inverted |
| fe_heap_used_pct | 85% | 95% | Derived |
| be_process_mem_gb | 120 GB | 180 GB | Gauge (max), converted |

Category 5: FE & Metadata Health

| Dimension | Warning | Critical | Source |
| --- | --- | --- | --- |
| fe_journal_write_latency_ms | 1,000 ms | 10,000 ms | Gauge (max) |
| fe_connection_total | 500 | 900 | Gauge (max) |

Category 6: Process Health

| Dimension | Warning | Critical | Source |
| --- | --- | --- | --- |
| be_fd_usage | 30,000 | 50,000 | Gauge (max) |

ClusterRiskSnapshot Dataclass

Each cluster produces a ClusterRiskSnapshot containing:

  • All raw metric values (compaction, disk, memory, errors, etc.)
  • Context fields (avg_latency_ms, qps, node_count)
  • Classification result: risk_level (Healthy/Warning/Critical), risk_reasons (list of triggered reasons), suggested_actions (list of remediation steps)
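A minimal sketch of what such a dataclass might look like, assuming the fields named in the bullets above; any field beyond those (and all defaults) are illustrative, not the module's actual definition:

```python
# Hypothetical shape of ClusterRiskSnapshot; only a subset of the raw
# metric fields is shown, and defaults are assumptions.
from dataclasses import dataclass, field

@dataclass
class ClusterRiskSnapshot:
    cluster_id: str
    # Raw metric values (subset)
    compaction_score: float = 0.0
    disk_used_pct: float = 0.0
    memory_available_pct: float = 100.0
    query_errors_7d: int = 0
    # Context fields
    avg_latency_ms: float = 0.0
    qps: float = 0.0
    node_count: int = 0
    # Classification result
    risk_level: str = "Healthy"  # Healthy / Warning / Critical
    risk_reasons: list = field(default_factory=list)
    suggested_actions: list = field(default_factory=list)

snap = ClusterRiskSnapshot(cluster_id="c-123")
print(snap.risk_level)  # Healthy
```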

Classification Logic

For each dimension:

  1. Read the raw metric value from the snapshot.
  2. Compare against Warning and Critical thresholds.
  3. For inverted dimensions (lower = worse): value <= critical triggers Critical.
  4. For normal dimensions (higher = worse): value >= critical triggers Critical.
  5. Each triggered dimension adds a reason (from reason_template) and an action.
  6. Overall risk_level = the worst triggered level across all dimensions.
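The steps above can be sketched as a small classification loop. The rule schema (the inverted flag and reason_template) mirrors the description, but the exact field names in `cluster_risk_rules.yaml` may differ; the two rules shown are taken from the threshold tables.

```python
# Sketch of the threshold classification, assuming this rule schema.
RULES = {
    "disk_used_pct":        {"warning": 80, "critical": 90, "inverted": False,
                             "reason_template": "disk usage at {value}%"},
    "memory_available_pct": {"warning": 20, "critical": 10, "inverted": True,
                             "reason_template": "only {value}% memory available"},
}

LEVELS = {"Healthy": 0, "Warning": 1, "Critical": 2}

def classify(metrics: dict) -> tuple[str, list]:
    level, reasons = "Healthy", []
    for name, rule in RULES.items():
        value = metrics.get(name)
        if value is None:
            continue
        if rule["inverted"]:  # lower = worse
            triggered = ("Critical" if value <= rule["critical"]
                         else "Warning" if value <= rule["warning"] else None)
        else:                 # higher = worse
            triggered = ("Critical" if value >= rule["critical"]
                         else "Warning" if value >= rule["warning"] else None)
        if triggered:
            reasons.append(rule["reason_template"].format(value=value))
            if LEVELS[triggered] > LEVELS[level]:
                level = triggered  # overall = worst triggered level
    return level, reasons

print(classify({"disk_used_pct": 85, "memory_available_pct": 8}))
# ('Critical', ['disk usage at 85%', 'only 8% memory available'])
```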

Data Fetching

Three batch metric queries plus an enrichment query fetch all data efficiently:

| Query | Purpose |
| --- | --- |
| _fetch_gauge_metrics(lookback) | Gauge + rate metrics (12 metric names) from MV |
| _fetch_gauge_metrics_7d_ago(lookback) | Data size from ~7 days ago for growth calculation |
| _fetch_counter_metrics(lookback) | Counter metrics (7 metric names) using MAX-MIN per host |
| _fetch_enrichment() | Cluster names, account names, node counts from byoc tables |
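The MAX-MIN aggregation for counters can be illustrated as follows. This assumes per-host deltas are summed per cluster, which is a plausible reading of "MAX-MIN per host" rather than the confirmed SQL; the function name is illustrative.

```python
# Sketch of counter-delta aggregation: per host, the window delta is
# max - min of the cumulative counter; host deltas are then summed.
from collections import defaultdict

def counter_delta_7d(samples: list[tuple[str, float]]) -> float:
    """samples: (host, counter_value) pairs over the lookback window."""
    per_host = defaultdict(list)
    for host, value in samples:
        per_host[host].append(value)
    # max - min per host tolerates hosts reporting at different times
    return sum(max(vals) - min(vals) for vals in per_host.values())

samples = [("be-1", 100), ("be-1", 900), ("be-2", 50), ("be-2", 450)]
print(counter_delta_7d(samples))  # 1200
```

Note that a plain MAX-MIN does not account for counter resets (e.g. after a process restart); handling that would need reset detection.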

Persistence

Results are stored in alerts.cluster_risk_snapshots as a PRIMARY KEY table (keyed on cluster_id). Each run upserts the latest snapshot, replacing the previous row per cluster.
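On a StarRocks PRIMARY KEY table, a plain INSERT with an existing key replaces that row, which is what makes the per-cluster upsert work. A hypothetical sketch of building such a statement (the real module's SQL and column list may differ; the escaping here is deliberately naive):

```python
# Illustrative upsert builder for alerts.cluster_risk_snapshots.
# Column names beyond cluster_id/risk_level/risk_reasons are assumptions.
def build_upsert(cluster_id: str, risk_level: str, reasons: list[str]) -> str:
    reasons_sql = ";".join(reasons).replace("'", "''")  # naive escaping only
    return (
        "INSERT INTO alerts.cluster_risk_snapshots "
        "(cluster_id, risk_level, risk_reasons) "
        f"VALUES ('{cluster_id}', '{risk_level}', '{reasons_sql}')"
    )

print(build_upsert("c-123", "Warning", ["disk usage at 85%"]))
```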

Sorting

Results are sorted:

  1. Critical first, then Warning, then Healthy
  2. Within each level, by number of risk reasons descending (most reasons = most problematic)
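The two-level ordering above maps directly to a tuple sort key. The dict-based snapshots here are stand-ins for the real objects, which are assumed to expose risk_level and risk_reasons:

```python
# Sort: Critical < Warning < Healthy, then most risk reasons first.
LEVEL_ORDER = {"Critical": 0, "Warning": 1, "Healthy": 2}

def sort_snapshots(snapshots):
    return sorted(
        snapshots,
        key=lambda s: (LEVEL_ORDER[s["risk_level"]], -len(s["risk_reasons"])),
    )

rows = [
    {"cluster_id": "a", "risk_level": "Warning",  "risk_reasons": ["r1"]},
    {"cluster_id": "b", "risk_level": "Critical", "risk_reasons": ["r1"]},
    {"cluster_id": "c", "risk_level": "Critical", "risk_reasons": ["r1", "r2"]},
]
print([r["cluster_id"] for r in sort_snapshots(rows)])  # ['c', 'b', 'a']
```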

Usage

```python
from byoc_agent.cluster_risk_analyzer import compute_cluster_risk, persist_risk_snapshots

snapshots = compute_cluster_risk()
persist_risk_snapshots(snapshots)

# Load current snapshots from DB
from byoc_agent.cluster_risk_analyzer import load_risk_snapshots
current = load_risk_snapshots()
```

Threshold Basis

Thresholds are calibrated from:

  • Production Grafana alert rules extracted from Lark alert history (Jan-Mar 2026, 8,474 alerts)
  • Real metric distributions across ~184 active BYOC clusters
  • StarRocks operational domain knowledge

Example production alert mappings:

  • FEMaxTabletCompaction fires at compaction > 1,000 (566 alerts, 9 clusters)
  • FreeDiskLessThan10% fires at disk free < 10% (164 alerts, 7 clusters)
  • FEHeapUsageTooHigh fires at FE JVM heap > 95% (421 alerts, 12 clusters)