HealthScorer -- Per-Cluster Health Scoring

Source: byoc_agent/health_scorer.py
Rules: byoc_agent/health_rules.yaml (v3)

The HealthScorer computes per-cluster health scores (0-100) from configurable YAML rules. Each cluster is scored on several dimensions by linear interpolation between green, yellow, and red thresholds, and the dimension scores are then combined into a weighted average.

Dataclasses

DimensionScore

Represents a single dimension's score for one cluster.

Field	Type	Description
`name`	str	Dimension name (e.g., `error_rate`, `compaction`)
`score`	float	0-100 score for this dimension
`value`	float	Raw metric value used
`threshold_green`	float	Green (healthy) threshold
`threshold_yellow`	float	Yellow (warning) threshold
`threshold_red`	float	Red (critical) threshold
`weight`	float	Relative weight in overall score
`description`	str	Human-readable description

ClusterHealthScore

Composite health score for one cluster.

Field	Type	Description
`cluster_id`	str	Cluster UUID
`account_id`	str	Account ID
`overall_score`	float	Weighted average of all dimensions (0-100)
`classification`	str	Healthy / Warning / Critical
`emoji`	str	Color indicator
`color`	str	Hex color code
`dimension_scores`	dict	Map of dimension name to `DimensionScore`

CustomerSummary

Aggregated health for a customer across all their clusters.

Field	Type	Description
`account_id`	str	Account ID
`customer_name`	str	Display name
`tier`	str	A, B, or S
`cluster_count`	int	Number of clusters
`avg_score`	float	Average score across clusters
`worst_cluster_id`	str	Cluster with lowest score
`risk_factors`	list[str]	Dimensions scoring below 50
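
The aggregation behind these fields can be sketched as follows. This is an illustration only: the function name `summarize_customer` and the dict-based input are hypothetical, the average is assumed to be an unweighted mean, and `risk_factors` is assumed to be the union of dimensions scoring below 50 in any of the customer's clusters.

```python
def summarize_customer(clusters):
    """Aggregate per-cluster results into customer-level fields.

    clusters: non-empty list of dicts with keys 'cluster_id',
    'overall_score', and 'dimension_scores' ({dimension name: score}).
    (Hypothetical input shape, chosen for illustration.)
    """
    worst = min(clusters, key=lambda c: c["overall_score"])
    # Assumption: a dimension is a risk factor if it scores below 50
    # in at least one cluster.
    risk = sorted(
        {
            name
            for c in clusters
            for name, score in c["dimension_scores"].items()
            if score < 50
        }
    )
    return {
        "cluster_count": len(clusters),
        # Assumption: plain (unweighted) mean of the cluster scores.
        "avg_score": sum(c["overall_score"] for c in clusters) / len(clusters),
        "worst_cluster_id": worst["cluster_id"],
        "risk_factors": risk,
    }


summary = summarize_customer(
    [
        {"cluster_id": "abc-123", "overall_score": 90.0,
         "dimension_scores": {"error_rate": 100.0, "compaction": 80.0}},
        {"cluster_id": "def-456", "overall_score": 60.0,
         "dimension_scores": {"error_rate": 40.0, "compaction": 80.0}},
    ]
)
```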

Scoring Algorithm

Linear Interpolation

Each dimension is scored via _score_dimension():

Normal mode (higher = worse, e.g., compaction score):
- value <= green: score = 100
- value between green and yellow: score = 100 to 50 (linear)
- value between yellow and red: score = 50 to 0 (linear)
- value >= red: score = 0

Inverted mode (lower = worse, e.g., memory available %):
- value >= green: score = 100
- value between yellow and green: score = 100 to 50 (linear)
- value between red and yellow: score = 50 to 0 (linear)
- value <= red: score = 0
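
The two modes above can be expressed as one piecewise-linear function. The sketch below is not the actual `_score_dimension()` implementation; the name `score_dimension`, its signature, and the trick of negating the inputs for inverted mode are assumptions made for illustration.

```python
def score_dimension(value, green, yellow, red, inverted=False):
    """Map a raw metric value to a 0-100 score (hypothetical signature)."""
    if inverted:
        # Negating the value and thresholds turns "lower = worse"
        # into the normal "higher = worse" case.
        value, green, yellow, red = -value, -green, -yellow, -red
    if value <= green:
        return 100.0
    if value >= red:
        return 0.0
    if value <= yellow:
        # Between green and yellow: 100 falls linearly to 50.
        return 100.0 - 50.0 * (value - green) / (yellow - green)
    # Between yellow and red: 50 falls linearly to 0.
    return 50.0 - 50.0 * (value - yellow) / (red - yellow)


# Normal mode with thresholds 500 / 2000 / 5000:
# 1250 is halfway between green and yellow.
compaction = score_dimension(1250, 500, 2000, 5000)

# Inverted mode with thresholds 30 / 15 / 5:
# 10 is halfway between yellow and red.
memory = score_dimension(10, 30, 15, 5, inverted=True)
```

Here `compaction` evaluates to 75.0 and `memory` to 25.0. A value exactly on the yellow threshold scores 50 in both modes.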

Overall Score

overall = SUM(dimension_score * weight) / SUM(weight)
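
A minimal sketch of this formula follows. The function name `overall_score` and the `(score, weight)` pair input are hypothetical, and returning `None` when no weights are present is an assumption. Because the sum is divided by the total weight, the weights do not have to add up to 1.

```python
def overall_score(scored_dimensions):
    """Weighted average of (score, weight) pairs.

    Returns None when the total weight is zero (an assumption made
    here to avoid dividing by zero).
    """
    pairs = list(scored_dimensions)
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        return None
    return sum(score * weight for score, weight in pairs) / total_weight


# Two dimensions: the first carries three times the weight of the second.
result = overall_score([(100.0, 3), (0.0, 1)])
```

With these inputs `result` is 75.0: (100 * 3 + 0 * 1) / 4.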

Classification

Classification	Min Score
Healthy	80
Warning	50
Critical	0
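
The table reads as a lookup on the overall score. The sketch below assumes the minimums are inclusive, so a score of exactly 80 counts as Healthy. The function name `classify` is hypothetical.

```python
def classify(overall_score):
    """Return the classification label for a 0-100 overall score."""
    # Assumption: each "Min Score" bound is inclusive.
    if overall_score >= 80:
        return "Healthy"
    if overall_score >= 50:
        return "Warning"
    return "Critical"


labels = [classify(s) for s in (92.5, 80, 79.9, 50, 12)]
```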

Data Fetching

The scorer uses two distinct fetch strategies based on metric type:

Gauge Metrics

Fetched directly from amv_hourly_snapshots_v1 using MAX/AVG aggregation. These are point-in-time values where MAX and AVG are meaningful.

Counter Metrics

Fetched as a MAX-MIN delta per host, then summed across hosts for each cluster. This gives the correct 7-day count for cumulative counters.

Important: The v6 view's LAG-based delta is NOT used because it inflates cumulative values for flat counters by orders of magnitude.
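
The MAX-MIN delta can be illustrated in plain Python. The real scorer presumably does this in SQL against the snapshot table. The function name `counter_totals` and the `(cluster_id, host, value)` tuple layout are hypothetical.

```python
from collections import defaultdict


def counter_totals(samples):
    """Sum per-host (max - min) deltas of a cumulative counter per cluster.

    samples: iterable of (cluster_id, host, value) readings taken over
    the scoring window (hypothetical layout).
    """
    lowest, highest = {}, {}
    for cluster_id, host, value in samples:
        key = (cluster_id, host)
        lowest[key] = min(lowest.get(key, value), value)
        highest[key] = max(highest.get(key, value), value)

    totals = defaultdict(float)
    for (cluster_id, host), low in lowest.items():
        # A flat counter contributes exactly 0 here.
        totals[cluster_id] += highest[(cluster_id, host)] - low
    return dict(totals)


totals = counter_totals(
    [
        ("c1", "h1", 100), ("c1", "h1", 150),  # grew by 50
        ("c1", "h2", 10), ("c1", "h2", 10),    # flat counter
        ("c2", "h1", 5), ("c2", "h1", 9),      # grew by 4
    ]
)
```

One caveat of this approach: if a counter resets inside the window (for example, after a process restart), MAX-MIN no longer equals the true increase. This sketch does not handle that case.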

Alert Counts

Firing alert counts per cluster from lark_alerts.

Dimensions (from health_rules.yaml)

Dimension	Weight	Green	Yellow	Red	Notes
error_rate	0.20	0.5%	2.0%	5.0%	Derived: errors/total*100
compaction	0.15	500	2,000	5,000	Max tablet compaction score
memory_available_pct	0.15	30%	15%	5%	Inverted (lower=worse)
be_process_mem_gb	0.15	80 GB	120 GB	180 GB	Peak BE process memory
disk_used_pct	0.10	70%	85%	95%	Derived from filesystem bytes
alert_activity	0.15	2	5	10	Firing alert count
fe_connection_total	0.05	300	500	900	Max FE connections
fe_journal_latency	0.05	500ms	1,000ms	10,000ms	Max journal write latency
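
As an illustration of a derived dimension, the `error_rate` formula from the Notes column (errors/total*100) can be written as below. The function name `error_rate_pct` is hypothetical, and treating a window with no requests as a 0% error rate is an assumption.

```python
def error_rate_pct(errors, total):
    """Derived error rate in percent: errors / total * 100."""
    if total <= 0:
        # Assumption: no traffic in the window means no errors to report.
        return 0.0
    return errors / total * 100.0


# 30 errors out of 2,000 requests is 1.5%, which falls between the
# green (0.5%) and yellow (2.0%) thresholds in the table above.
rate = error_rate_pct(30, 2000)
```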

Persistence

Scores are stored as snapshots in alerts.cluster_health_scores (DUPLICATE KEY table). Each run inserts a new row per cluster with the current timestamp, enabling historical comparison.

Usage

from byoc_agent.health_scorer import compute_cluster_scores, persist_scores

scores = compute_cluster_scores()
persist_scores(scores)

# Score specific clusters only
scores = compute_cluster_scores(cluster_ids=["abc-123", "def-456"])

# Load most recent snapshot from DB
from byoc_agent.health_scorer import load_latest_scores
latest = load_latest_scores()
