Scoring Rules Reference

This page documents all YAML-based scoring rules used by the health scoring, risk analysis, unified scoring, and issue tracking systems.

Rule Files

File	Version	Used By
`unified_rules.yaml`	v1	UnifiedScorer
`health_rules.yaml`	v3	HealthScorer
`cluster_risk_rules.yaml`	v3	ClusterRiskAnalyzer
`issue_rules.yaml`	v2	IssueTracker

Unified Rules (unified_rules.yaml)

The authoritative scoring configuration. It drives a single pipeline that replaces the separate health and risk runs.

Weights

Component	Weight
Metrics	0.60
Alerts	0.25
Tier adjustment	0.15

Tier Multipliers

Tier	Multiplier	Effect
S	1.5	50% amplification of penalty
A	1.3	30% amplification of penalty
B	1.0	No amplification (default)

Alert Scoring Bands

Max Firing Alerts (7d)	Score
0	100
3	80
10	50
20	20
>20	0
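
The band table can be read as a step function. The following is a minimal sketch, assuming each count in the table is an inclusive upper bound and that scores do not interpolate between bands; the function name `alert_score` is hypothetical and not taken from UnifiedScorer.

```python
# Alert scoring bands from the table above: (max firing alerts in 7d, score).
# Assumption: each count is an inclusive upper bound and scores step down
# without interpolation; the real scorer may behave differently.
ALERT_BANDS = [(0, 100), (3, 80), (10, 50), (20, 20)]


def alert_score(max_firing_alerts: int) -> int:
    """Return the alert sub-score for a 7-day max firing-alert count."""
    if max_firing_alerts < 0:
        raise ValueError("alert count cannot be negative")
    for upper_bound, score in ALERT_BANDS:
        if max_firing_alerts <= upper_bound:
            return score
    return 0  # more than 20 firing alerts
```

Under this reading, a cluster with 4 firing alerts falls in the "10" band and scores 50.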

Classification Bands

Classification	Min Score
Healthy	80
Warning	50
Critical	0
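
As an illustration of how the minimum scores partition the 0-100 range, here is a small sketch. It assumes each minimum is inclusive (a score of exactly 80 is Healthy); `classify` is a hypothetical helper name.

```python
# Classification bands from the table above, ordered from highest minimum score.
CLASSIFICATION_BANDS = [("Healthy", 80), ("Warning", 50), ("Critical", 0)]


def classify(score: float) -> str:
    """Map a 0-100 score to its classification label."""
    for label, min_score in CLASSIFICATION_BANDS:
        if score >= min_score:  # assumption: minimum scores are inclusive
            return label
    return "Critical"  # scores below 0 are treated as Critical in this sketch
```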

AI Investigation Trigger

Parameter	Value
bottom_n	20
score_threshold	30

Both parameters are evaluated, and whichever yields fewer candidates is used.
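
One plausible reading of this rule is sketched below. It assumes `bottom_n` selects the N lowest-scoring clusters and `score_threshold` selects clusters scoring strictly below the threshold; neither detail is confirmed by this page, and `investigation_candidates` is a hypothetical name.

```python
def investigation_candidates(scores: dict[str, float],
                             bottom_n: int = 20,
                             score_threshold: float = 30) -> list[str]:
    """Pick clusters for investigation, using whichever rule selects fewer."""
    ranked = sorted(scores, key=lambda name: scores[name])  # lowest score first
    by_rank = ranked[:bottom_n]
    # Assumption: "below the threshold" is strict (<); the source does not say.
    by_threshold = [name for name in ranked if scores[name] < score_threshold]
    return by_rank if len(by_rank) <= len(by_threshold) else by_threshold
```

With these defaults, a fleet in which only five clusters score below 30 would yield five candidates rather than twenty.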

Metric Dimensions (15 total)

Storage & Compaction

Dimension	Weight	Green	Yellow	Red	Inverted
compaction_score	0.10	500	1,000	5,000	No
disk_used_pct	0.08	70%	80%	90%	No
data_growth_pct	0.05	50%	100%	300%	No

Query Performance

Dimension	Weight	Green	Yellow	Red	Inverted
query_errors_7d	0.10	5,000	10,000	50,000	No
query_timeouts_7d	0.06	200	500	5,000	No
slow_queries_7d	0.06	20,000	50,000	500,000	No
internal_errors_7d	0.08	500	1,000	10,000	No

Ingestion Health

Dimension	Weight	Green	Yellow	Red	Inverted
txn_failures_7d	0.07	5,000	10,000	50,000	No
txn_rejects_7d	0.07	50	100	5,000	No

Resource Pressure

Dimension	Weight	Green	Yellow	Red	Inverted
memory_available_pct	0.08	40%	20%	10%	Yes
fe_heap_used_pct	0.06	75%	85%	95%	No
be_process_mem_gb	0.06	80 GB	120 GB	180 GB	No

FE & Metadata Health

Dimension	Weight	Green	Yellow	Red	Inverted
fe_journal_write_latency_ms	0.05	500 ms	1,000 ms	10,000 ms	No
fe_connection_total	0.04	300	500	900	No

Process Health

Dimension	Weight	Green	Yellow	Red	Inverted
be_fd_usage	0.04	20,000	30,000	50,000	No
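
The sketch below illustrates how a Green/Yellow/Red triple and the Inverted flag could turn a raw metric into a 0-100 dimension score. The anchor scores (Green = 100, Yellow = 50, Red = 0) and the linear interpolation between them are assumptions made for illustration; this page does not specify the actual curve. The weights are copied from the tables above, and the check confirms that the 15 weights sum to 1.00.

```python
import math

# Dimension weights copied from the six tables above.
UNIFIED_WEIGHTS = {
    "compaction_score": 0.10, "disk_used_pct": 0.08, "data_growth_pct": 0.05,
    "query_errors_7d": 0.10, "query_timeouts_7d": 0.06, "slow_queries_7d": 0.06,
    "internal_errors_7d": 0.08, "txn_failures_7d": 0.07, "txn_rejects_7d": 0.07,
    "memory_available_pct": 0.08, "fe_heap_used_pct": 0.06,
    "be_process_mem_gb": 0.06, "fe_journal_write_latency_ms": 0.05,
    "fe_connection_total": 0.04, "be_fd_usage": 0.04,
}
# Sanity check: the listed weights add up to 1.0 (within float tolerance).
assert math.isclose(sum(UNIFIED_WEIGHTS.values()), 1.0)


def dimension_score(value: float, green: float, yellow: float, red: float,
                    inverted: bool = False) -> float:
    """Score one metric on a 0-100 scale from its Green/Yellow/Red thresholds.

    Assumed anchors (not from the source): Green -> 100, Yellow -> 50,
    Red -> 0, with linear interpolation in between.
    """
    if inverted:
        # Higher is better (e.g. memory_available_pct): flip the axis so the
        # same "lower is better" logic applies.
        value, green, yellow, red = -value, -green, -yellow, -red
    if value <= green:
        return 100.0
    if value >= red:
        return 0.0
    if value <= yellow:
        return 100.0 - 50.0 * (value - green) / (yellow - green)
    return 50.0 - 50.0 * (value - yellow) / (red - yellow)
```

For example, under these assumptions `dimension_score(750, 500, 1000, 5000)` lands halfway between Green and Yellow, and `dimension_score(30, 40, 20, 10, inverted=True)` does the same for available memory.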

Health Rules (health_rules.yaml)

Used by the standalone HealthScorer. It scores 8 dimensions over a 7-day lookback window.

Dimensions

Dimension	Weight	Green	Yellow	Red	Inverted	Description
error_rate	0.20	0.5%	2.0%	5.0%	No	Query error rate (errors/total*100)
compaction	0.15	500	2,000	5,000	No	Max tablet compaction score
memory_available_pct	0.15	30%	15%	5%	Yes	Node memory available %
be_process_mem_gb	0.15	80 GB	120 GB	180 GB	No	BE process memory (OOM risk)
disk_used_pct	0.10	70%	85%	95%	No	Disk utilization %
alert_activity	0.15	2	5	10	No	Firing alert count
fe_connection_total	0.05	300	500	900	No	Max FE connections
fe_journal_latency	0.05	500 ms	1,000 ms	10,000 ms	No	FE journal write latency

Classification

Level	Min Score	Label	Color
Healthy	80	Healthy	#2ecc71
Warning	50	Warning	#f39c12
Critical	0	Critical	#e74c3c

Cluster Risk Rules (cluster_risk_rules.yaml)

Used by the standalone ClusterRiskAnalyzer. Threshold-based: any single metric exceeding a threshold flags the cluster. The tables below list 15 dimensions across 6 categories.

Category 1: Storage & Compaction

Dimension	Warning	Critical	Notes
compaction_score	1,000	5,000	Matches production alert (566 alerts, 9 clusters)
disk_used_pct	80%	90%	Production alert at >90% (164 alerts, 7 clusters)
data_growth_pct	100%	300%	No production alert; statistical threshold

Category 2: Query Performance

Dimension	Warning	Critical	Notes
query_errors_7d	10,000	50,000	Real data: top=75K, p90~5K
query_timeouts_7d	500	5,000	Real data: top=9.5K, p90~100
slow_queries_7d	50,000	500,000	Real data: top=1.3M, p90~25K
internal_errors_7d	1,000	10,000	Real data: top=25K, p90~500

Category 3: Ingestion Health

Dimension	Warning	Critical	Notes
txn_failures_7d	10,000	50,000	Real data: top=85K, p90~4K
txn_rejects_7d	100	5,000	Real data: top=21K, most <50

Category 4: Resource Pressure

Dimension	Warning	Critical	Notes
memory_available_pct	20%	10%	Inverted. Real data: bottom=31%, p10~40%
fe_heap_used_pct	85%	95%	Production alert at >95% (421 alerts)
be_process_mem_gb	120 GB	180 GB	Real data: top=200GB, p90~60GB

Category 5: FE & Metadata Health

Dimension	Warning	Critical	Notes
fe_journal_write_latency_ms	1,000 ms	10,000 ms	Production alert P99 >10s (46 alerts)
fe_connection_total	500	900	Default max_connections=1024

Category 6: Process Health

Dimension	Warning	Critical	Notes
be_fd_usage	30,000	50,000	Linux default ulimit=65536 (53 alerts)
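
The "any single metric flags the cluster" behavior can be sketched as taking the worst level across all metrics. This is an illustrative sketch only: it includes a subset of the thresholds above, treats "exceeding" as a strict comparison, and uses hypothetical names (`metric_level`, `cluster_risk`) that are not taken from ClusterRiskAnalyzer.

```python
# A few thresholds from the tables above: (warning, critical, inverted).
RISK_THRESHOLDS = {
    "compaction_score": (1_000, 5_000, False),
    "disk_used_pct": (80, 90, False),
    "memory_available_pct": (20, 10, True),
    "fe_heap_used_pct": (85, 95, False),
}

LEVELS = ["ok", "warning", "critical"]  # ordered from best to worst


def metric_level(name: str, value: float) -> str:
    """Return the risk level of a single metric."""
    warning, critical, inverted = RISK_THRESHOLDS[name]
    if inverted:
        # Lower is worse: negate so "greater than" still means "worse than".
        value, warning, critical = -value, -warning, -critical
    # Assumption: a value must strictly exceed a threshold to trigger it.
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def cluster_risk(metrics: dict[str, float]) -> str:
    """Return the worst level across all metrics with known thresholds."""
    levels = [metric_level(n, v) for n, v in metrics.items()
              if n in RISK_THRESHOLDS]
    return max(levels, key=LEVELS.index, default="ok")
```

In this sketch a cluster with healthy disk usage but only 8% memory available is flagged critical, because one metric past its Critical threshold is enough.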

Issue Rules (issue_rules.yaml)

Used by the IssueTracker for alert grouping and correlation.

Grouping Configuration

Parameter	Value	Description
time_window_minutes	30	Max gap for failure alert grouping
correlation_window_minutes	240	Wider window for correlated types
reopen_window_minutes	15	Reopen if fires again within this window
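
To illustrate what a "max gap" window means for failure alerts, the sketch below groups alert timestamps so that consecutive alerts no more than `time_window_minutes` apart land in the same issue. It is a simplified assumption about the grouping behavior; `group_by_gap` is a hypothetical name, and the correlation and reopen windows are not modeled.

```python
from datetime import datetime, timedelta

TIME_WINDOW = timedelta(minutes=30)  # time_window_minutes from the table above


def group_by_gap(fired_at: list[datetime],
                 max_gap: timedelta = TIME_WINDOW) -> list[list[datetime]]:
    """Split alert timestamps into groups whose consecutive gaps are <= max_gap."""
    groups: list[list[datetime]] = []
    for ts in sorted(fired_at):
        if groups and ts - groups[-1][-1] <= max_gap:
            groups[-1].append(ts)  # close enough to the previous alert
        else:
            groups.append([ts])    # gap too large: start a new group
    return groups
```

Because the gap is measured between consecutive alerts, a group can span more than 30 minutes in total as long as no single gap exceeds the window.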

Alert Categories

Pattern	Category	Behavior
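OperationDurationGT	anomaly	Merge all into one issue until resolved
FEHeapUsageTooHigh	anomaly	Merge all into one issue until resolved
HeapUsageTooHigh	anomaly	Merge all into one issue until resolved
FEGCCount	anomaly	Merge all into one issue until resolved
JVMOldGC	anomaly	Merge all into one issue until resolved
FEMaxTabletCompaction	anomaly	Merge all into one issue until resolved
CompactionScore	anomaly	Merge all into one issue until resolved
FreeDiskLessThan	anomaly	Merge all into one issue until resolved
RootFreeDiskLessThan	anomaly	Merge all into one issue until resolved
BEUsedFdTooMuch	anomaly	Merge all into one issue until resolved
FEQueryErrRateMoreThan	anomaly	Merge all into one issue until resolved
OperationAbnormal	failure	Time-window grouping
BeAliveAbnormal	failure	Time-window grouping
ProcNotRunning	failure	Time-window grouping
BeNodeAbnormal	failure	Time-window grouping
FeNodeAbnormal	failure	Time-window grouping
ClusterStateAbnormal	failure	Time-window grouping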

Correlation Groups

Group	Severity	Patterns	Escalation
be_node_failure	Critical	BeAliveAbnormal, ProcNotRunning, BeNodeAbnormal, ClusterStateAbnormal	--
fe_memory_pressure	Warning	FEHeapUsageTooHigh, HeapUsageTooHigh, FEGCCount, JVMOldGC, ProcNotRunning	Escalates to Critical when ProcNotRunning (OOM crash) fires
compaction_backlog	Warning	FEMaxTabletCompaction, CompactionScore	--
disk_pressure	Warning	FreeDiskLessThan, RootFreeDiskLessThan	--
operational_anomaly	Warning	OperationDurationGT, OperationAbnormal	"Operational Anomaly" becomes "Operational Failure" when OperationAbnormal fires
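
The two escalation rules in the table can be sketched as follows. This is a hedged illustration: it matches patterns by exact name and requires at least one heap or GC alert before `fe_memory_pressure` applies, both of which are assumptions, and the function names are hypothetical.

```python
from typing import Optional

# Heap and GC patterns of the fe_memory_pressure group (ProcNotRunning excluded).
HEAP_PATTERNS = {"FEHeapUsageTooHigh", "HeapUsageTooHigh", "FEGCCount", "JVMOldGC"}


def fe_memory_pressure_severity(fired: set[str]) -> Optional[str]:
    """Severity of the fe_memory_pressure group, or None if it does not apply."""
    # Assumption: the group needs at least one heap/GC alert to be active.
    if not fired & HEAP_PATTERNS:
        return None
    # ProcNotRunning (OOM crash) escalates the group from Warning to Critical.
    return "Critical" if "ProcNotRunning" in fired else "Warning"


def operational_issue_title(fired: set[str]) -> Optional[str]:
    """Title of the operational_anomaly group, or None if it does not apply."""
    if "OperationAbnormal" in fired:
        return "Operational Failure"
    if "OperationDurationGT" in fired:
        return "Operational Anomaly"
    return None
```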

Threshold Comparison Across Systems

The three scoring systems do not always use the same thresholds for the same metric. Here is a side-by-side comparison for key dimensions:

Metric	Unified (Green/Yellow/Red)	Health (Green/Yellow/Red)	Risk (Warning/Critical)
Compaction Score	500 / 1K / 5K	500 / 2K / 5K	1K / 5K
Disk Used %	70 / 80 / 90	70 / 85 / 95	80 / 90
Memory Available %	40 / 20 / 10	30 / 15 / 5	20 / 10
BE Process Mem GB	80 / 120 / 180	80 / 120 / 180	120 / 180
FE Connections	300 / 500 / 900	300 / 500 / 900	500 / 900
FE Journal Latency ms	500 / 1K / 10K	500 / 1K / 10K	1K / 10K

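For these dimensions, the unified rules' Green thresholds match the Health rules or are stricter (Memory Available % requires 40% rather than 30%), and their Yellow and Red thresholds equal the Risk rules' Warning and Critical thresholds. The unified rules are the authoritative source going forward.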
