HealthScorer -- Per-Cluster Health Scoring
Source: byoc_agent/health_scorer.py
Rules: byoc_agent/health_rules.yaml (v3)
The HealthScorer computes per-cluster health scores (0-100) from configurable YAML rules. Each cluster is scored across multiple dimensions using linear interpolation between green/yellow/red thresholds, then combined via weighted average.
Dataclasses
DimensionScore
Represents a single dimension's score for one cluster.
| Field | Type | Description |
|---|---|---|
| name | str | Dimension name (e.g., error_rate, compaction) |
| score | float | 0-100 score for this dimension |
| value | float | Raw metric value used |
| threshold_green | float | Green (healthy) threshold |
| threshold_yellow | float | Yellow (warning) threshold |
| threshold_red | float | Red (critical) threshold |
| weight | float | Relative weight in overall score |
| description | str | Human-readable description |
ClusterHealthScore
Composite health score for one cluster.
| Field | Type | Description |
|---|---|---|
| cluster_id | str | Cluster UUID |
| account_id | str | Account ID |
| overall_score | float | Weighted average of all dimensions (0-100) |
| classification | str | Healthy / Warning / Critical |
| emoji | str | Color indicator |
| color | str | Hex color code |
| dimension_scores | dict | Map of dimension name to DimensionScore |
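As a rough picture of how the two tables above might translate into code, the sketch below declares both as Python dataclasses. The field names and types come from the tables. The default values and the example emoji and color strings are assumptions for illustration and may not match health_scorer.py.

```python
from dataclasses import dataclass, field


@dataclass
class DimensionScore:
    name: str
    score: float             # 0-100
    value: float             # raw metric value
    threshold_green: float
    threshold_yellow: float
    threshold_red: float
    weight: float
    description: str = ""    # default is an assumption


@dataclass
class ClusterHealthScore:
    cluster_id: str
    account_id: str
    overall_score: float     # 0-100
    classification: str      # "Healthy", "Warning", or "Critical"
    emoji: str
    color: str
    # Keyed by dimension name; the empty-dict default is an assumption.
    dimension_scores: dict[str, DimensionScore] = field(default_factory=dict)


# Hypothetical values, for illustration only.
compaction = DimensionScore(
    name="compaction", score=75.0, value=1250.0,
    threshold_green=500.0, threshold_yellow=2000.0, threshold_red=5000.0,
    weight=0.15, description="Max tablet compaction score",
)
cluster = ClusterHealthScore(
    cluster_id="abc-123", account_id="acct-1", overall_score=75.0,
    classification="Warning", emoji=":yellow_circle:", color="#f1c40f",
    dimension_scores={compaction.name: compaction},
)
```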
CustomerSummary
Aggregated health for a customer across all their clusters.
| Field | Type | Description |
|---|---|---|
| account_id | str | Account ID |
| customer_name | str | Display name |
| tier | str | A, B, or S |
| cluster_count | int | Number of clusters |
| avg_score | float | Average score across clusters |
| worst_cluster_id | str | Cluster with lowest score |
| risk_factors | list[str] | Dimensions scoring below 50 |
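The aggregate fields above could be derived from per-cluster scores roughly as follows. This is a minimal sketch: the function name `summarize_clusters` and its input shape are hypothetical, and it assumes a dimension counts as a risk factor when it scores below 50 in at least one of the customer's clusters, which the table does not spell out.

```python
from statistics import mean


def summarize_clusters(cluster_scores, risk_threshold=50.0):
    """Aggregate per-cluster scores for one customer.

    cluster_scores maps cluster_id -> (overall_score, {dimension: score}).
    This input shape is an assumption made for the sketch.
    """
    if not cluster_scores:
        raise ValueError("need at least one cluster")
    overall = {cid: score for cid, (score, _dims) in cluster_scores.items()}
    # Assumption: a dimension is a risk factor if ANY cluster scores below the threshold.
    risk = {
        name
        for _score, dims in cluster_scores.values()
        for name, dim_score in dims.items()
        if dim_score < risk_threshold
    }
    return {
        "cluster_count": len(overall),
        "avg_score": mean(overall.values()),
        "worst_cluster_id": min(overall, key=overall.get),
        "risk_factors": sorted(risk),
    }


summary = summarize_clusters({
    "c1": (90.0, {"error_rate": 95.0, "compaction": 40.0}),
    "c2": (60.0, {"error_rate": 45.0, "compaction": 80.0}),
})
```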
Scoring Algorithm
Linear Interpolation
Each dimension is scored via _score_dimension():
- Normal mode (higher = worse, e.g., compaction score):
  - value <= green: score = 100
  - value between green and yellow: score = 100 to 50 (linear)
  - value between yellow and red: score = 50 to 0 (linear)
  - value >= red: score = 0
- Inverted mode (lower = worse, e.g., memory available %):
  - value >= green: score = 100
  - value between yellow and green: score = 100 to 50 (linear)
  - value between red and yellow: score = 50 to 0 (linear)
  - value <= red: score = 0
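The piecewise rules above can be sketched as a single function. This is an illustration only: `score_dimension` and its signature are stand-ins, not the actual `_score_dimension()` in health_scorer.py. It handles inverted mode by negating the value and thresholds so the normal-mode branches can be reused.

```python
def score_dimension(value, green, yellow, red, inverted=False):
    """Map a raw metric value to a 0-100 score by piecewise linear interpolation."""
    if inverted:
        # Negate everything so that "higher = worse" holds, then reuse the same logic.
        value, green, yellow, red = -value, -green, -yellow, -red
    if value <= green:
        return 100.0
    if value >= red:
        return 0.0
    if value <= yellow:
        # Green-to-yellow band: 100 down to 50.
        return 100.0 - 50.0 * (value - green) / (yellow - green)
    # Yellow-to-red band: 50 down to 0.
    return 50.0 - 50.0 * (value - yellow) / (red - yellow)


# Normal mode with the compaction thresholds (500 / 2,000 / 5,000):
mid_green_yellow = score_dimension(1250, 500, 2000, 5000)   # halfway through the band
# Inverted mode with the memory_available_pct thresholds (30 / 15 / 5):
mid_yellow_red = score_dimension(10, 30, 15, 5, inverted=True)
```

With these inputs, a compaction score of 1,250 lands halfway between green and yellow and scores 75, while 10% available memory lands halfway between yellow and red and scores 25.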
Overall Score
overall = SUM(dimension_score * weight) / SUM(weight)
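A minimal sketch of the formula above, assuming the dimensions arrive as (score, weight) pairs; the function name and input shape are hypothetical. Because the sum is divided by the total weight, the weights do not have to sum to 1, and when a dimension is left out the remaining weights are effectively renormalized.

```python
def overall_score(dimensions):
    """Weighted average of (score, weight) pairs."""
    pairs = list(dimensions)
    total_weight = sum(weight for _score, weight in pairs)
    if total_weight == 0:
        raise ValueError("total weight must be non-zero")
    return sum(score * weight for score, weight in pairs) / total_weight


# (100*2 + 50*1 + 0*1) / (2 + 1 + 1) = 62.5
example = overall_score([(100.0, 2), (50.0, 1), (0.0, 1)])
```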
Classification
| Classification | Min Score |
|---|---|
| Healthy | 80 |
| Warning | 50 |
| Critical | 0 |
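The table can be read as a simple threshold lookup. The sketch below assumes each minimum score is inclusive (a score of exactly 80 is Healthy); the function name is hypothetical, and the emoji and color fields are not covered.

```python
def classify(score):
    """Return the classification for a 0-100 overall score."""
    if score >= 80:
        return "Healthy"
    if score >= 50:
        return "Warning"
    return "Critical"
```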
Data Fetching
The scorer uses two distinct fetch strategies based on metric type:
Gauge Metrics
Fetched directly from amv_hourly_snapshots_v1 using MAX/AVG aggregation. These are point-in-time values where MAX and AVG are meaningful.
Counter Metrics
Fetched as a MAX-MIN delta per host, then summed across hosts for each cluster. For cumulative counters, this yields the increase over the 7-day window.
Important: The v6 view's LAG-based delta is NOT used because it inflates cumulative values for flat counters by orders of magnitude.
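The MAX-MIN approach can be illustrated in plain Python over in-memory samples. This is a sketch of the arithmetic only, not the query the scorer runs; the function name and the (cluster_id, host, value) sample shape are assumptions. It also ignores counter resets: if a host restarts and its counter drops back to zero, MAX-MIN no longer equals the true increase.

```python
from collections import defaultdict


def counter_totals(samples):
    """Sum per-host (max - min) counter deltas for each cluster.

    samples is an iterable of (cluster_id, host, counter_value) tuples.
    """
    lowest, highest = {}, {}
    for cluster_id, host, value in samples:
        key = (cluster_id, host)
        lowest[key] = min(lowest.get(key, value), value)
        highest[key] = max(highest.get(key, value), value)

    totals = defaultdict(float)
    for (cluster_id, host), top in highest.items():
        totals[cluster_id] += top - lowest[(cluster_id, host)]
    return dict(totals)


totals = counter_totals([
    ("c1", "h1", 100), ("c1", "h1", 160),  # h1 grew by 60
    ("c1", "h2", 40), ("c1", "h2", 40),    # flat counter contributes 0
    ("c2", "h1", 5), ("c2", "h1", 9),      # grew by 4
])
```

A flat counter contributes exactly zero here, which is the behavior the note about the LAG-based delta is concerned with.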
Alert Counts
The number of firing alerts per cluster is read from lark_alerts.
Dimensions (from health_rules.yaml)
| Dimension | Weight | Green | Yellow | Red | Notes |
|---|---|---|---|---|---|
| error_rate | 0.20 | 0.5% | 2.0% | 5.0% | Derived: errors/total*100 |
| compaction | 0.15 | 500 | 2,000 | 5,000 | Max tablet compaction score |
| memory_available_pct | 0.15 | 30% | 15% | 5% | Inverted (lower=worse) |
| be_process_mem_gb | 0.15 | 80 GB | 120 GB | 180 GB | Peak BE process memory |
| disk_used_pct | 0.10 | 70% | 85% | 95% | Derived from filesystem bytes |
| alert_activity | 0.15 | 2 | 5 | 10 | Firing alert count |
| fe_connection_total | 0.05 | 300 | 500 | 900 | Max FE connections |
| fe_journal_latency | 0.05 | 500ms | 1,000ms | 10,000ms | Max journal write latency |
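One property of the table worth checking is that the eight weights sum to 1.0, so when every dimension is present the weighted average reduces to a plain weighted sum. The snippet below copies the weights from the table into a Python dict to verify this; the dict is only a stand-in for the real rules in health_rules.yaml.

```python
import math

# Weights copied from the dimension table above.
WEIGHTS = {
    "error_rate": 0.20,
    "compaction": 0.15,
    "memory_available_pct": 0.15,
    "be_process_mem_gb": 0.15,
    "disk_used_pct": 0.10,
    "alert_activity": 0.15,
    "fe_connection_total": 0.05,
    "fe_journal_latency": 0.05,
}

total_weight = sum(WEIGHTS.values())
# Compare with a tolerance because the weights are binary floats.
weights_normalized = math.isclose(total_weight, 1.0)
```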
Persistence
Scores are stored as snapshots in alerts.cluster_health_scores (DUPLICATE KEY table). Each run inserts a new row per cluster with the current timestamp, enabling historical comparison.
Usage
from byoc_agent.health_scorer import compute_cluster_scores, persist_scores
scores = compute_cluster_scores()
persist_scores(scores)
# Score specific clusters only
scores = compute_cluster_scores(cluster_ids=["abc-123", "def-456"])
# Load most recent snapshot from DB
from byoc_agent.health_scorer import load_latest_scores
latest = load_latest_scores()