Scoring Rules Reference

This page documents all YAML-based scoring rules used by the health scoring, risk analysis, unified scoring, and issue tracking systems.

Rule Files

| File | Version | Used By |
| --- | --- | --- |
| unified_rules.yaml | v1 | UnifiedScorer |
| health_rules.yaml | v3 | HealthScorer |
| cluster_risk_rules.yaml | v3 | ClusterRiskAnalyzer |
| issue_rules.yaml | v2 | IssueTracker |

Unified Rules (unified_rules.yaml)

The authoritative scoring configuration. It drives a single pipeline that replaces the separate health and risk runs.

Weights

| Component | Weight |
| --- | --- |
| Metrics | 0.60 |
| Alerts | 0.25 |
| Tier adjustment | 0.15 |

Tier Multipliers

| Tier | Multiplier | Effect |
| --- | --- | --- |
| S | 1.5 | 50% amplification of penalty |
| A | 1.3 | 30% amplification of penalty |
| B | 1.0 | No amplification (default) |
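
One plausible reading of "amplification of penalty" is that the shortfall from a perfect score of 100 is multiplied by the tier multiplier. The sketch below assumes that reading. The function name `apply_tier`, the fallback to tier B for unknown tiers, and the clamp at 0 are illustrative assumptions, not taken from unified_rules.yaml.

```python
# Multipliers copied from the table above.
TIER_MULTIPLIERS = {"S": 1.5, "A": 1.3, "B": 1.0}


def apply_tier(score: float, tier: str) -> float:
    """Amplify the penalty (100 - score) by the tier multiplier.

    Assumption: unknown tiers fall back to B (1.0), and the
    result is clamped so it never drops below 0.
    """
    multiplier = TIER_MULTIPLIERS.get(tier, 1.0)
    penalty = 100.0 - score
    return max(0.0, 100.0 - penalty * multiplier)
```

Under this reading, a score of 80 has a penalty of 20, which becomes 30 on an S-tier cluster and yields 70.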

Alert Scoring Bands

| Max Firing Alerts (7d) | Score |
| --- | --- |
| 0 | 100 |
| 3 | 80 |
| 10 | 50 |
| 20 | 20 |
| >20 | 0 |
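
The bands above can be read as a step lookup. The sketch below assumes each row is an inclusive upper bound and that counts between rows take the score of the next row up (so 2 alerts score 80). The real scorer may interpolate instead. `alert_score` is a hypothetical name.

```python
# (inclusive upper bound on firing alerts, score), from the table above.
ALERT_BANDS = [(0, 100), (3, 80), (10, 50), (20, 20)]


def alert_score(max_firing_alerts: int) -> int:
    """Return the alert score for the 7-day max firing alert count."""
    for upper_bound, score in ALERT_BANDS:
        if max_firing_alerts <= upper_bound:
            return score
    return 0  # more than 20 firing alerts
```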

Classification Bands

| Classification | Min Score |
| --- | --- |
| Healthy | 80 |
| Warning | 50 |
| Critical | 0 |
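
As a minimal sketch, the classification is the first band whose minimum score the cluster meets. The name `classify` is illustrative. Treating each minimum as inclusive is an assumption.

```python
# (label, inclusive minimum score), highest band first.
CLASSIFICATION_BANDS = [("Healthy", 80), ("Warning", 50), ("Critical", 0)]


def classify(score: float) -> str:
    """Map a 0-100 score to its classification band."""
    for label, min_score in CLASSIFICATION_BANDS:
        if score >= min_score:
            return label
    return "Critical"  # defensive fallback for negative scores
```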

AI Investigation Trigger

| Parameter | Value |
| --- | --- |
| bottom_n | 20 |
| score_threshold | 30 |

Whichever criterion yields fewer candidates is used.
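
A sketch of this selection follows. It assumes `bottom_n` means the N lowest-scoring clusters and `score_threshold` means a score strictly below the threshold. The function name and the `scores` mapping are hypothetical.

```python
from typing import Dict, List


def investigation_candidates(
    scores: Dict[str, float],
    bottom_n: int = 20,
    score_threshold: float = 30,
) -> List[str]:
    """Pick clusters for AI investigation, lowest score first."""
    ranked = sorted(scores, key=lambda name: scores[name])
    by_rank = ranked[:bottom_n]
    by_threshold = [name for name in ranked if scores[name] < score_threshold]
    # Use whichever criterion yields fewer candidates.
    return by_rank if len(by_rank) <= len(by_threshold) else by_threshold
```

With four clusters scoring 10, 25, 40, and 90, the threshold criterion yields two candidates and the rank criterion yields all four, so the two clusters below 30 are selected.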

Metric Dimensions (15 total)

Storage & Compaction

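| Dimension | Weight | Green | Yellow | Red | Inverted |
| --- | --- | --- | --- | --- | --- |
| compaction_score | 0.10 | 500 | 1,000 | 5,000 | No |
| disk_used_pct | 0.08 | 70% | 80% | 90% | No |
| data_growth_pct | 0.05 | 50% | 100% | 300% | No |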

Query Performance

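| Dimension | Weight | Green | Yellow | Red | Inverted |
| --- | --- | --- | --- | --- | --- |
| query_errors_7d | 0.10 | 5,000 | 10,000 | 50,000 | No |
| query_timeouts_7d | 0.06 | 200 | 500 | 5,000 | No |
| slow_queries_7d | 0.06 | 20,000 | 50,000 | 500,000 | No |
| internal_errors_7d | 0.08 | 500 | 1,000 | 10,000 | No |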

Ingestion Health

| Dimension | Weight | Green | Yellow | Red | Inverted |
| --- | --- | --- | --- | --- | --- |
| txn_failures_7d | 0.07 | 5,000 | 10,000 | 50,000 | No |
| txn_rejects_7d | 0.07 | 50 | 100 | 5,000 | No |

Resource Pressure

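| Dimension | Weight | Green | Yellow | Red | Inverted |
| --- | --- | --- | --- | --- | --- |
| memory_available_pct | 0.08 | 40% | 20% | 10% | Yes |
| fe_heap_used_pct | 0.06 | 75% | 85% | 95% | No |
| be_process_mem_gb | 0.06 | 80 GB | 120 GB | 180 GB | No |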

FE & Metadata Health

| Dimension | Weight | Green | Yellow | Red | Inverted |
| --- | --- | --- | --- | --- | --- |
| fe_journal_write_latency_ms | 0.05 | 500 ms | 1,000 ms | 10,000 ms | No |
| fe_connection_total | 0.04 | 300 | 500 | 900 | No |

Process Health

| Dimension | Weight | Green | Yellow | Red | Inverted |
| --- | --- | --- | --- | --- | --- |
| be_fd_usage | 0.04 | 20,000 | 30,000 | 50,000 | No |
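
The tables give only the three thresholds per dimension, not the curve between them. The sketch below assumes a piecewise-linear curve anchored at 100 (green), 50 (yellow), and 0 (red), and handles inverted dimensions by negating the value and thresholds. Both the anchor scores and the name `dimension_score` are assumptions for illustration.

```python
def dimension_score(
    value: float,
    green: float,
    yellow: float,
    red: float,
    inverted: bool = False,
) -> float:
    """Score one dimension on a 0-100 scale (assumed curve)."""
    if inverted:
        # Lower is worse: flip signs so "larger is worse" holds below.
        value, green, yellow, red = -value, -green, -yellow, -red
    if value <= green:
        return 100.0
    if value >= red:
        return 0.0
    if value <= yellow:
        # Linear from 100 at green down to 50 at yellow.
        return 100.0 - 50.0 * (value - green) / (yellow - green)
    # Linear from 50 at yellow down to 0 at red.
    return 50.0 - 50.0 * (value - yellow) / (red - yellow)
```

For example, with the compaction_score thresholds (500 / 1,000 / 5,000), a value of 750 lands halfway between green and yellow and scores 75 under these assumptions. With the inverted memory_available_pct thresholds (40% / 20% / 10%), 30% available also scores 75.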

Health Rules (health_rules.yaml)

Used by the standalone HealthScorer. It scores 8 dimensions over a 7-day lookback.

Dimensions

| Dimension | Weight | Green | Yellow | Red | Inverted | Description |
| --- | --- | --- | --- | --- | --- | --- |
| error_rate | 0.20 | 0.5% | 2.0% | 5.0% | No | Query error rate (errors/total*100) |
| compaction | 0.15 | 500 | 2,000 | 5,000 | No | Max tablet compaction score |
| memory_available_pct | 0.15 | 30% | 15% | 5% | Yes | Node memory available % |
| be_process_mem_gb | 0.15 | 80 GB | 120 GB | 180 GB | No | BE process memory (OOM risk) |
| disk_used_pct | 0.10 | 70% | 85% | 95% | No | Disk utilization % |
| alert_activity | 0.15 | 2 | 5 | 10 | No | Firing alert count |
| fe_connection_total | 0.05 | 300 | 500 | 900 | No | Max FE connections |
| fe_journal_latency | 0.05 | 500 ms | 1,000 ms | 10,000 ms | No | FE journal write latency |

Classification

| Level | Min Score | Label | Color |
| --- | --- | --- | --- |
| Healthy | 80 | Healthy | #2ecc71 |
| Warning | 50 | Warning | #f39c12 |
| Critical | 0 | Critical | #e74c3c |

Cluster Risk Rules (cluster_risk_rules.yaml)

Used by the standalone ClusterRiskAnalyzer. Scoring is threshold-based: any single metric exceeding a threshold flags the cluster. The rules cover 15 dimensions across 6 categories.

Category 1: Storage & Compaction

| Dimension | Warning | Critical | Notes |
| --- | --- | --- | --- |
| compaction_score | 1,000 | 5,000 | Matches production alert (566 alerts, 9 clusters) |
| disk_used_pct | 80% | 90% | Production alert at >90% (164 alerts, 7 clusters) |
| data_growth_pct | 100% | 300% | No production alert; statistical threshold |

Category 2: Query Performance

| Dimension | Warning | Critical | Notes |
| --- | --- | --- | --- |
| query_errors_7d | 10,000 | 50,000 | Real data: top=75K, p90~5K |
| query_timeouts_7d | 500 | 5,000 | Real data: top=9.5K, p90~100 |
| slow_queries_7d | 50,000 | 500,000 | Real data: top=1.3M, p90~25K |
| internal_errors_7d | 1,000 | 10,000 | Real data: top=25K, p90~500 |

Category 3: Ingestion Health

| Dimension | Warning | Critical | Notes |
| --- | --- | --- | --- |
| txn_failures_7d | 10,000 | 50,000 | Real data: top=85K, p90~4K |
| txn_rejects_7d | 100 | 5,000 | Real data: top=21K, most <50 |

Category 4: Resource Pressure

| Dimension | Warning | Critical | Notes |
| --- | --- | --- | --- |
| memory_available_pct | 20% | 10% | Inverted. Real data: bottom=31%, p10~40% |
| fe_heap_used_pct | 85% | 95% | Production alert at >95% (421 alerts) |
| be_process_mem_gb | 120 GB | 180 GB | Real data: top=200GB, p90~60GB |

Category 5: FE & Metadata Health

| Dimension | Warning | Critical | Notes |
| --- | --- | --- | --- |
| fe_journal_write_latency_ms | 1,000 ms | 10,000 ms | Production alert P99 >10s (46 alerts) |
| fe_connection_total | 500 | 900 | Default max_connections=1024 |

Category 6: Process Health

| Dimension | Warning | Critical | Notes |
| --- | --- | --- | --- |
| be_fd_usage | 30,000 | 50,000 | Linux default ulimit=65536 (53 alerts) |
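
The "any single metric flags the cluster" behavior can be sketched as taking the worst level across all metrics. The example below includes only three of the dimensions and assumes strict comparisons (a value equal to the threshold does not breach it). `metric_level` and `cluster_level` are hypothetical names.

```python
from typing import Dict

# (warning, critical, inverted) per dimension, copied from the tables above.
RISK_THRESHOLDS = {
    "compaction_score": (1_000, 5_000, False),
    "disk_used_pct": (80, 90, False),
    "memory_available_pct": (20, 10, True),
}

LEVEL_ORDER = ["ok", "warning", "critical"]


def metric_level(name: str, value: float) -> str:
    """Return the risk level of a single metric."""
    warning, critical, inverted = RISK_THRESHOLDS[name]
    if inverted:
        # Lower is worse: flip signs so "larger is worse" holds below.
        value, warning, critical = -value, -warning, -critical
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def cluster_level(metrics: Dict[str, float]) -> str:
    """Return the worst level across all known metrics."""
    levels = [
        metric_level(name, value)
        for name, value in metrics.items()
        if name in RISK_THRESHOLDS
    ]
    return max(levels, key=LEVEL_ORDER.index, default="ok")
```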

Issue Rules (issue_rules.yaml)

Used by the IssueTracker for alert grouping and correlation.

Grouping Configuration

| Parameter | Value | Description |
| --- | --- | --- |
| time_window_minutes | 30 | Max gap for failure alert grouping |
| correlation_window_minutes | 240 | Wider window for correlated types |
| reopen_window_minutes | 15 | Reopen if fires again within this window |
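
To illustrate what a "max gap" window means, the sketch below groups alert timestamps so that a new group starts whenever the gap to the previous alert exceeds `time_window_minutes`. It is a simplified illustration, not the IssueTracker implementation. The function name is hypothetical, and treating a gap of exactly 30 minutes as part of the same group is an assumption.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def group_by_gap(
    timestamps: Iterable[datetime],
    window_minutes: int = 30,
) -> List[List[datetime]]:
    """Group timestamps; a gap larger than the window starts a new group."""
    window = timedelta(minutes=window_minutes)
    groups: List[List[datetime]] = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][-1] <= window:
            groups[-1].append(ts)  # within the window of the previous alert
        else:
            groups.append([ts])  # gap too large: open a new group
    return groups
```

Because the window is measured from the most recent alert in the group, a steady stream of alerts 25 minutes apart stays in one group even if it spans several hours.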

Alert Categories

| Pattern | Category | Behavior |
| --- | --- | --- |
| OperationDurationGT | anomaly | Merge all into one issue until resolved |
| FEHeapUsageTooHigh | anomaly | |
| HeapUsageTooHigh | anomaly | |
| FEGCCount | anomaly | |
| JVMOldGC | anomaly | |
| FEMaxTabletCompaction | anomaly | |
| CompactionScore | anomaly | |
| FreeDiskLessThan | anomaly | |
| RootFreeDiskLessThan | anomaly | |
| BEUsedFdTooMuch | anomaly | |
| FEQueryErrRateMoreThan | anomaly | |
| OperationAbnormal | failure | Time-window grouping |
| BeAliveAbnormal | failure | |
| ProcNotRunning | failure | |
| BeNodeAbnormal | failure | |
| FeNodeAbnormal | failure | |
| ClusterStateAbnormal | failure | |

Correlation Groups

| Group | Severity | Patterns | Escalation |
| --- | --- | --- | --- |
| be_node_failure | Critical | BeAliveAbnormal, ProcNotRunning, BeNodeAbnormal, ClusterStateAbnormal | -- |
| fe_memory_pressure | Warning | FEHeapUsageTooHigh, HeapUsageTooHigh, FEGCCount, JVMOldGC, ProcNotRunning | Escalates to Critical when ProcNotRunning (OOM crash) fires |
| compaction_backlog | Warning | FEMaxTabletCompaction, CompactionScore | -- |
| disk_pressure | Warning | FreeDiskLessThan, RootFreeDiskLessThan | -- |
| operational_anomaly | Warning | OperationDurationGT, OperationAbnormal | "Operational Anomaly" becomes "Operational Failure" when OperationAbnormal fires |

Threshold Comparison Across Systems

The three scoring systems do not always agree on thresholds for the same metric. The table below compares key dimensions side by side:

| Metric | Unified (Green/Yellow/Red) | Health (Green/Yellow/Red) | Risk (Warning/Critical) |
| --- | --- | --- | --- |
| Compaction Score | 500 / 1K / 5K | 500 / 2K / 5K | 1K / 5K |
| Disk Used % | 70 / 80 / 90 | 70 / 85 / 95 | 80 / 90 |
| Memory Available % | 40 / 20 / 10 | 30 / 15 / 5 | 20 / 10 |
| BE Process Mem GB | 80 / 120 / 180 | 80 / 120 / 180 | 120 / 180 |
| FE Connections | 300 / 500 / 900 | 300 / 500 / 900 | 500 / 900 |
| FE Journal Latency ms | 500 / 1K / 10K | 500 / 1K / 10K | 1K / 10K |

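The unified rules share the health rules' green thresholds for every metric in this table except Memory Available % (40% vs. 30%), and their yellow and red thresholds equal the risk rules' warning and critical thresholds. The unified rules are the authoritative source going forward.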