This page documents all YAML-based scoring rules used by the health scoring, risk analysis, unified scoring, and issue tracking systems.
Rule Files
| File | Version | Used By |
|---|---|---|
| unified_rules.yaml | v1 | UnifiedScorer |
| health_rules.yaml | v3 | HealthScorer |
| cluster_risk_rules.yaml | v3 | ClusterRiskAnalyzer |
| issue_rules.yaml | v2 | IssueTracker |
Unified Rules (unified_rules.yaml)
The authoritative scoring configuration: a single pipeline that replaces the separate health and risk runs.
Weights
| Component | Weight |
|---|---|
| Metrics | 0.60 |
| Alerts | 0.25 |
| Tier adjustment | 0.15 |
Tier Multipliers
| Tier | Multiplier | Effect |
|---|---|---|
| S | 1.5 | 50% amplification of penalty |
| A | 1.3 | 30% amplification of penalty |
| B | 1.0 | No amplification (default) |
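As a rough illustration of what "amplification of penalty" could mean, the sketch below scales a penalty by the tier multiplier. The 0-100 penalty scale, the cap at 100, the fallback to tier B, and the function name `amplified_penalty` are assumptions for illustration; the source does not state how UnifiedScorer applies the multiplier.

```python
# Multipliers copied from the Tier Multipliers table.
TIER_MULTIPLIERS = {"S": 1.5, "A": 1.3, "B": 1.0}


def amplified_penalty(penalty: float, tier: str) -> float:
    """Scale a penalty by the tier multiplier.

    Assumptions (not taken from unified_rules.yaml): penalties are on a
    0-100 scale, the result is capped at 100, and unknown tiers fall back
    to the default tier B.
    """
    multiplier = TIER_MULTIPLIERS.get(tier, TIER_MULTIPLIERS["B"])
    return min(100.0, penalty * multiplier)
```

Under these assumptions, a penalty of 20 on a tier S cluster becomes 30, while the same penalty on a tier B cluster stays at 20.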
Alert Scoring Bands
| Max Firing Alerts (7d) | Score |
|---|---|
| 0 | 100 |
| 3 | 80 |
| 10 | 50 |
| 20 | 20 |
| >20 | 0 |
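The band lookup above can be sketched as a small function. It assumes each row is an inclusive upper bound on the 7-day firing alert count, which is one reading of the "Max Firing Alerts" column; the names `ALERT_BANDS` and `alert_score` are hypothetical.

```python
# (max firing alerts over 7 days, score), copied from the Alert Scoring Bands table.
ALERT_BANDS = [(0, 100), (3, 80), (10, 50), (20, 20)]


def alert_score(firing_alerts_7d: int) -> int:
    """Return the score of the first band whose upper bound covers the count."""
    for upper_bound, score in ALERT_BANDS:
        if firing_alerts_7d <= upper_bound:
            return score
    return 0  # more than 20 firing alerts
```

With this reading, 3 alerts still score 80, while a fourth alert drops the cluster into the 50 band.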
Classification Bands
| Classification | Min Score |
|---|---|
| Healthy | 80 |
| Warning | 50 |
| Critical | 0 |
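A minimal sketch of how these bands map a score to a label, assuming the minimum score is inclusive (a score of exactly 80 counts as Healthy). The function name `classify` is hypothetical.

```python
# (label, minimum score), copied from the Classification Bands table,
# ordered from the highest band to the lowest.
CLASSIFICATION_BANDS = [("Healthy", 80), ("Warning", 50), ("Critical", 0)]


def classify(score: float) -> str:
    """Return the first band whose minimum score is met."""
    for label, min_score in CLASSIFICATION_BANDS:
        if score >= min_score:
            return label
    return "Critical"  # defensive fallback for negative scores
```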
AI Investigation Trigger
| Parameter | Value |
|---|---|
| bottom_n | 20 |
| score_threshold | 30 |
Of the two criteria (bottom_n and score_threshold), whichever yields fewer candidates is used.
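The selection rule can be sketched as follows. This assumes that `bottom_n` means the N lowest-scoring clusters and that `score_threshold` selects clusters scoring strictly below it; neither detail is confirmed by the source, and `investigation_candidates` is a hypothetical name.

```python
def investigation_candidates(
    scores: dict[str, float],
    bottom_n: int = 20,
    score_threshold: float = 30,
) -> list[str]:
    """Pick clusters for AI investigation, using whichever rule yields fewer."""
    ranked = sorted(scores, key=scores.get)  # lowest score first
    bottom = ranked[:bottom_n]
    # Assumption: "threshold" means strictly below score_threshold.
    below = [name for name in ranked if scores[name] < score_threshold]
    return below if len(below) < len(bottom) else bottom
```

For example, with scores `{"a": 10, "b": 25, "c": 60, "d": 90}` and `bottom_n=3`, only `a` and `b` fall below 30, so the threshold rule wins and two clusters are returned.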
Metric Dimensions (15 total)
Storage & Compaction
| Dimension | Weight | Green | Yellow | Red | Inverted |
|---|---|---|---|---|---|
| compaction_score | 0.10 | 500 | 1,000 | 5,000 | No |
| disk_used_pct | 0.08 | 70% | 80% | 90% | No |
| data_growth_pct | 0.05 | 50% | 100% | 300% | No |
| Dimension | Weight | Green | Yellow | Red | Inverted |
|---|---|---|---|---|---|
| query_errors_7d | 0.10 | 5,000 | 10,000 | 50,000 | No |
| query_timeouts_7d | 0.06 | 200 | 500 | 5,000 | No |
| slow_queries_7d | 0.06 | 20,000 | 50,000 | 500,000 | No |
| internal_errors_7d | 0.08 | 500 | 1,000 | 10,000 | No |
Ingestion Health
| Dimension | Weight | Green | Yellow | Red | Inverted |
|---|---|---|---|---|---|
| txn_failures_7d | 0.07 | 5,000 | 10,000 | 50,000 | No |
| txn_rejects_7d | 0.07 | 50 | 100 | 5,000 | No |
Resource Pressure
| Dimension | Weight | Green | Yellow | Red | Inverted |
|---|---|---|---|---|---|
| memory_available_pct | 0.08 | 40% | 20% | 10% | Yes |
| fe_heap_used_pct | 0.06 | 75% | 85% | 95% | No |
| be_process_mem_gb | 0.06 | 80 GB | 120 GB | 180 GB | No |
| Dimension | Weight | Green | Yellow | Red | Inverted |
|---|---|---|---|---|---|
| fe_journal_write_latency_ms | 0.05 | 500 ms | 1,000 ms | 10,000 ms | No |
| fe_connection_total | 0.04 | 300 | 500 | 900 | No |
Process Health
| Dimension | Weight | Green | Yellow | Red | Inverted |
|---|---|---|---|---|---|
| be_fd_usage | 0.04 | 20,000 | 30,000 | 50,000 | No |
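The Green/Yellow/Red columns and the Inverted flag in the tables above can be turned into a per-dimension score in several ways. The source does not specify the curve, so the sketch below is only one possibility: it assumes linear interpolation with anchor scores of 100 at Green, 50 at Yellow, and 0 at Red. Both the anchors and the function name `dimension_score` are hypothetical.

```python
def dimension_score(
    value: float,
    green: float,
    yellow: float,
    red: float,
    inverted: bool = False,
) -> float:
    """Map a raw metric value to a 0-100 score.

    Assumed anchors (not from unified_rules.yaml): 100 at or better than
    Green, 50 at Yellow, 0 at or worse than Red, linear in between.
    """
    if inverted:
        # Higher is better (e.g. memory_available_pct): negate everything
        # so the rest of the function can treat smaller as better.
        value, green, yellow, red = -value, -green, -yellow, -red
    if value <= green:
        return 100.0
    if value >= red:
        return 0.0
    if value <= yellow:
        return 100.0 - 50.0 * (value - green) / (yellow - green)
    return 50.0 - 50.0 * (value - yellow) / (red - yellow)
```

Under these assumptions, a compaction_score of 750 (halfway between Green and Yellow) scores 75, and a memory_available_pct of 30 with thresholds 40/20/10 also scores 75.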
Health Rules (health_rules.yaml)
Used by the standalone HealthScorer. It defines 8 dimensions, each evaluated over a 7-day lookback.
Dimensions
| Dimension | Weight | Green | Yellow | Red | Inverted | Description |
|---|---|---|---|---|---|---|
| error_rate | 0.20 | 0.5% | 2.0% | 5.0% | No | Query error rate (errors/total*100) |
| compaction | 0.15 | 500 | 2,000 | 5,000 | No | Max tablet compaction score |
| memory_available_pct | 0.15 | 30% | 15% | 5% | Yes | Node memory available % |
| be_process_mem_gb | 0.15 | 80 GB | 120 GB | 180 GB | No | BE process memory (OOM risk) |
| disk_used_pct | 0.10 | 70% | 85% | 95% | No | Disk utilization % |
| alert_activity | 0.15 | 2 | 5 | 10 | No | Firing alert count |
| fe_connection_total | 0.05 | 300 | 500 | 900 | No | Max FE connections |
| fe_journal_latency | 0.05 | 500 ms | 1,000 ms | 10,000 ms | No | FE journal write latency |
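The eight weights in this table sum to 1.00. A check along the following lines could catch drift when the file is edited; it is a sketch with the weights copied inline, not a description of how HealthScorer validates its configuration, and `weights_are_normalized` is a hypothetical helper.

```python
import math

# Weights copied from the Dimensions table above.
HEALTH_WEIGHTS = {
    "error_rate": 0.20,
    "compaction": 0.15,
    "memory_available_pct": 0.15,
    "be_process_mem_gb": 0.15,
    "disk_used_pct": 0.10,
    "alert_activity": 0.15,
    "fe_connection_total": 0.05,
    "fe_journal_latency": 0.05,
}


def weights_are_normalized(weights: dict[str, float], tolerance: float = 1e-9) -> bool:
    """Return True when the weights sum to 1.0 within a float tolerance."""
    return math.isclose(sum(weights.values()), 1.0, abs_tol=tolerance)
```

The tolerance matters because decimal weights such as 0.15 are not exactly representable as floats, so an exact `== 1.0` comparison may fail.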
Classification
| Level | Min Score | Label | Color |
|---|---|---|---|
| Healthy | 80 | Healthy | #2ecc71 |
| Warning | 50 | Warning | #f39c12 |
| Critical | 0 | Critical | #e74c3c |
Cluster Risk Rules (cluster_risk_rules.yaml)
Used by the standalone ClusterRiskAnalyzer. The rules are threshold-based: any single metric exceeding a threshold flags the cluster. The tables below list 15 dimensions across 6 categories.
Category 1: Storage & Compaction
| Dimension | Warning | Critical | Notes |
|---|---|---|---|
| compaction_score | 1,000 | 5,000 | Matches production alert (566 alerts, 9 clusters) |
| disk_used_pct | 80% | 90% | Production alert at >90% (164 alerts, 7 clusters) |
| data_growth_pct | 100% | 300% | No production alert; statistical threshold |
| Dimension | Warning | Critical | Notes |
|---|---|---|---|
| query_errors_7d | 10,000 | 50,000 | Real data: top=75K, p90~5K |
| query_timeouts_7d | 500 | 5,000 | Real data: top=9.5K, p90~100 |
| slow_queries_7d | 50,000 | 500,000 | Real data: top=1.3M, p90~25K |
| internal_errors_7d | 1,000 | 10,000 | Real data: top=25K, p90~500 |
Category 3: Ingestion Health
| Dimension | Warning | Critical | Notes |
|---|---|---|---|
| txn_failures_7d | 10,000 | 50,000 | Real data: top=85K, p90~4K |
| txn_rejects_7d | 100 | 5,000 | Real data: top=21K, most <50 |
Category 4: Resource Pressure
| Dimension | Warning | Critical | Notes |
|---|---|---|---|
| memory_available_pct | 20% | 10% | Inverted. Real data: bottom=31%, p10~40% |
| fe_heap_used_pct | 85% | 95% | Production alert at >95% (421 alerts) |
| be_process_mem_gb | 120 GB | 180 GB | Real data: top=200GB, p90~60GB |
| Dimension | Warning | Critical | Notes |
|---|---|---|---|
| fe_journal_write_latency_ms | 1,000 ms | 10,000 ms | Production alert P99 >10s (46 alerts) |
| fe_connection_total | 500 | 900 | Default max_connections=1024 |
Category 6: Process Health
| Dimension | Warning | Critical | Notes |
|---|---|---|---|
| be_fd_usage | 30,000 | 50,000 | Linux default ulimit=65536 (53 alerts) |
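The "any single metric flags the cluster" rule described for ClusterRiskAnalyzer can be sketched as a worst-level scan. Only three of the dimensions are copied in for brevity. The strict "greater than" comparison, the level names, and the function names are assumptions for illustration.

```python
# (warning, critical, inverted), copied from the category tables above.
RISK_THRESHOLDS = {
    "compaction_score": (1_000, 5_000, False),
    "disk_used_pct": (80, 90, False),
    "memory_available_pct": (20, 10, True),
}

LEVELS = ["ok", "warning", "critical"]  # ordered from best to worst


def metric_level(value: float, warning: float, critical: float, inverted: bool = False) -> str:
    """Classify one metric; inverted metrics breach when they fall below a threshold."""
    if inverted:
        value, warning, critical = -value, -warning, -critical
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def cluster_level(metrics: dict[str, float]) -> str:
    """Return the worst level across all known metrics of a cluster."""
    worst = "ok"
    for name, value in metrics.items():
        if name not in RISK_THRESHOLDS:
            continue  # ignore metrics without a configured threshold
        level = metric_level(value, *RISK_THRESHOLDS[name])
        if LEVELS.index(level) > LEVELS.index(worst):
            worst = level
    return worst
```

A cluster with healthy disk usage but a compaction score of 1,200 is flagged as "warning" under this sketch, since a single breached threshold is enough.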
Issue Rules (issue_rules.yaml)
Used by the IssueTracker for alert grouping and correlation.
Grouping Configuration
| Parameter | Value | Description |
|---|---|---|
| time_window_minutes | 30 | Max gap for failure alert grouping |
| correlation_window_minutes | 240 | Wider window for correlated types |
| reopen_window_minutes | 15 | Reopen if fires again within this window |
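To illustrate the `time_window_minutes` setting, the sketch below groups alert timestamps so that consecutive alerts separated by at most 30 minutes land in the same group. Measuring the gap from the previous alert (rather than from the first alert of the group) is an assumption, as is the function name `group_by_time_window`.

```python
from datetime import datetime, timedelta


def group_by_time_window(
    timestamps: list[datetime],
    time_window_minutes: int = 30,
) -> list[list[datetime]]:
    """Group timestamps whose gap to the previous alert is within the window."""
    window = timedelta(minutes=time_window_minutes)
    groups: list[list[datetime]] = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][-1] <= window:
            groups[-1].append(ts)  # close enough to the previous alert
        else:
            groups.append([ts])  # gap too large: start a new group
    return groups
```

With this reading, a chain of alerts spaced 25 minutes apart stays in one group indefinitely, because each gap is measured against the most recent alert.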
Alert Categories
| Pattern | Category | Behavior |
|---|---|---|
| OperationDurationGT | anomaly | Merge all into one issue until resolved |
| FEHeapUsageTooHigh | anomaly | |
| HeapUsageTooHigh | anomaly | |
| FEGCCount | anomaly | |
| JVMOldGC | anomaly | |
| FEMaxTabletCompaction | anomaly | |
| CompactionScore | anomaly | |
| FreeDiskLessThan | anomaly | |
| RootFreeDiskLessThan | anomaly | |
| BEUsedFdTooMuch | anomaly | |
| FEQueryErrRateMoreThan | anomaly | |
| OperationAbnormal | failure | Time-window grouping |
| BeAliveAbnormal | failure | |
| ProcNotRunning | failure | |
| BeNodeAbnormal | failure | |
| FeNodeAbnormal | failure | |
| ClusterStateAbnormal | failure | |
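The pattern-to-category mapping above could be applied with a simple lookup. The sketch assumes patterns are matched as substrings of the alert name, which the source does not confirm, and `alert_category` is a hypothetical helper.

```python
# Patterns copied from the Alert Categories table.
ANOMALY_PATTERNS = [
    "OperationDurationGT", "FEHeapUsageTooHigh", "HeapUsageTooHigh",
    "FEGCCount", "JVMOldGC", "FEMaxTabletCompaction", "CompactionScore",
    "FreeDiskLessThan", "RootFreeDiskLessThan", "BEUsedFdTooMuch",
    "FEQueryErrRateMoreThan",
]
FAILURE_PATTERNS = [
    "OperationAbnormal", "BeAliveAbnormal", "ProcNotRunning",
    "BeNodeAbnormal", "FeNodeAbnormal", "ClusterStateAbnormal",
]


def alert_category(alert_name: str):
    """Return "failure", "anomaly", or None for an unrecognized alert.

    Assumption: a pattern matches when it occurs anywhere in the alert name.
    """
    if any(pattern in alert_name for pattern in FAILURE_PATTERNS):
        return "failure"
    if any(pattern in alert_name for pattern in ANOMALY_PATTERNS):
        return "anomaly"
    return None
```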
Correlation Groups
| Group | Severity | Patterns | Escalation |
|---|---|---|---|
| be_node_failure | Critical | BeAliveAbnormal, ProcNotRunning, BeNodeAbnormal, ClusterStateAbnormal | -- |
| fe_memory_pressure | Warning | FEHeapUsageTooHigh, HeapUsageTooHigh, FEGCCount, JVMOldGC, ProcNotRunning | Escalates to Critical when ProcNotRunning (OOM crash) fires |
| compaction_backlog | Warning | FEMaxTabletCompaction, CompactionScore | -- |
| disk_pressure | Warning | FreeDiskLessThan, RootFreeDiskLessThan | -- |
| operational_anomaly | Warning | OperationDurationGT, OperationAbnormal | "Operational Anomaly" becomes "Operational Failure" when OperationAbnormal fires |
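The escalation rule for `fe_memory_pressure` can be sketched as below. The sketch assumes the group only applies when at least one heap or GC pattern is firing, so that a lone ProcNotRunning is left to other groups; that condition and the function name are assumptions, not behavior confirmed by issue_rules.yaml.

```python
# Patterns copied from the fe_memory_pressure row of the table above.
FE_MEMORY_PRESSURE = {
    "FEHeapUsageTooHigh", "HeapUsageTooHigh", "FEGCCount",
    "JVMOldGC", "ProcNotRunning",
}
ESCALATION_PATTERN = "ProcNotRunning"


def fe_memory_pressure_severity(firing_patterns):
    """Return "Warning", "Critical", or None when the group does not apply."""
    matched = FE_MEMORY_PRESSURE & set(firing_patterns)
    # Assumption: require a heap/GC signal before treating this as memory pressure.
    memory_signals = matched - {ESCALATION_PATTERN}
    if not memory_signals:
        return None
    # Escalate when ProcNotRunning (a likely OOM crash) fires alongside.
    return "Critical" if ESCALATION_PATTERN in matched else "Warning"
```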
Threshold Comparison Across Systems
The three scoring systems use different thresholds for the same metrics. Here is a side-by-side comparison for key dimensions:
| Metric | Unified (Green/Yellow/Red) | Health (Green/Yellow/Red) | Risk (Warning/Critical) |
|---|---|---|---|
| Compaction Score | 500 / 1K / 5K | 500 / 2K / 5K | 1K / 5K |
| Disk Used % | 70 / 80 / 90 | 70 / 85 / 95 | 80 / 90 |
| Memory Available % | 40 / 20 / 10 | 30 / 15 / 5 | 20 / 10 |
| BE Process Mem GB | 80 / 120 / 180 | 80 / 120 / 180 | 120 / 180 |
| FE Connections | 300 / 500 / 900 | 300 / 500 / 900 | 500 / 900 |
| FE Journal Latency ms | 500 / 1K / 10K | 500 / 1K / 10K | 1K / 10K |
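Where the unified and health rules differ, the unified thresholds are the stricter ones: compaction score and disk usage reach Yellow or Red earlier, and more available memory is required at every level. The unified Yellow/Red values match the risk Warning/Critical values for every metric shown. The unified rules are the authoritative source going forward.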