HealthScorer -- Per-Cluster Health Scoring

Source: byoc_agent/health_scorer.py
Rules: byoc_agent/health_rules.yaml (v3)

The HealthScorer computes a health score (0-100) for each cluster from configurable YAML rules. Every cluster is scored on several dimensions by linear interpolation between green, yellow, and red thresholds, and the dimension scores are then combined into a weighted average.

Dataclasses

DimensionScore

Represents a single dimension's score for one cluster.

| Field | Type | Description |
|---|---|---|
| name | str | Dimension name (e.g., error_rate, compaction) |
| score | float | 0-100 score for this dimension |
| value | float | Raw metric value used |
| threshold_green | float | Green (healthy) threshold |
| threshold_yellow | float | Yellow (warning) threshold |
| threshold_red | float | Red (critical) threshold |
| weight | float | Relative weight in overall score |
| description | str | Human-readable description |

ClusterHealthScore

Composite health score for one cluster.

| Field | Type | Description |
|---|---|---|
| cluster_id | str | Cluster UUID |
| account_id | str | Account ID |
| overall_score | float | Weighted average of all dimensions (0-100) |
| classification | str | Healthy / Warning / Critical |
| emoji | str | Color indicator |
| color | str | Hex color code |
| dimension_scores | dict | Map of dimension name to DimensionScore |

CustomerSummary

Aggregated health for a customer across all their clusters.

| Field | Type | Description |
|---|---|---|
| account_id | str | Account ID |
| customer_name | str | Display name |
| tier | str | A, B, or S |
| cluster_count | int | Number of clusters |
| avg_score | float | Average score across clusters |
| worst_cluster_id | str | Cluster with lowest score |
| risk_factors | list[str] | Dimensions scoring below 50 |
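
The aggregation from cluster scores to a customer summary can be sketched as follows. This is an illustration only: `summarize_customer` and the plain-dict input shape are hypothetical, and it assumes that a dimension counts as a risk factor when it scores below 50 on any of the customer's clusters, which the source does not spell out.

```python
def summarize_customer(cluster_scores, risk_threshold=50.0):
    """Aggregate per-cluster scores into customer-level fields.

    cluster_scores (hypothetical shape):
        {cluster_id: {"overall_score": float,
                      "dimension_scores": {dimension_name: score}}}
    """
    if not cluster_scores:
        raise ValueError("customer has no scored clusters")

    overall = {cid: c["overall_score"] for cid, c in cluster_scores.items()}

    # Assumption: a dimension is a risk factor if it is below the
    # threshold on at least one cluster.
    risk = {
        name
        for c in cluster_scores.values()
        for name, score in c["dimension_scores"].items()
        if score < risk_threshold
    }

    return {
        "cluster_count": len(overall),
        "avg_score": sum(overall.values()) / len(overall),
        "worst_cluster_id": min(overall, key=overall.get),
        "risk_factors": sorted(risk),
    }
```

Note that the comparison is strict, so a dimension scoring exactly 50 is not reported as a risk factor in this sketch.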

Scoring Algorithm

Linear Interpolation

Each dimension is scored via _score_dimension():

  • Normal mode (higher = worse, e.g., compaction score):

    • value <= green: score = 100
    • value between green and yellow: score = 100 to 50 (linear)
    • value between yellow and red: score = 50 to 0 (linear)
    • value >= red: score = 0
  • Inverted mode (lower = worse, e.g., memory available %):

    • value >= green: score = 100
    • value between yellow and green: score = 100 to 50 (linear)
    • value between red and yellow: score = 50 to 0 (linear)
    • value <= red: score = 0
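
The two modes above can be sketched in a few lines. This is a minimal illustration, not the actual body of `_score_dimension()`: the name `score_dimension`, its signature, and the trick of negating inputs for inverted mode are assumptions. It also assumes the thresholds are strictly ordered (green < yellow < red in normal mode, reversed in inverted mode).

```python
def score_dimension(value, green, yellow, red, inverted=False):
    """Map a raw metric value to a 0-100 score by piecewise-linear interpolation."""
    if inverted:
        # Lower is worse: negate everything so the normal-mode rules apply.
        value, green, yellow, red = -value, -green, -yellow, -red

    if value <= green:
        return 100.0
    if value >= red:
        return 0.0
    if value <= yellow:
        # Between green and yellow: score falls linearly from 100 to 50.
        return 100.0 - 50.0 * (value - green) / (yellow - green)
    # Between yellow and red: score falls linearly from 50 to 0.
    return 50.0 * (red - value) / (red - yellow)
```

With the compaction thresholds (500 / 2,000 / 5,000), `score_dimension(1250, 500, 2000, 5000)` gives 75.0, halfway between green and yellow. With the memory thresholds (30 / 15 / 5), `score_dimension(22.5, 30, 15, 5, inverted=True)` also gives 75.0.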

Overall Score

overall = SUM(dimension_score * weight) / SUM(weight)
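
As a sketch, the formula translates directly into Python. The function name `overall_score` and the list-of-pairs input are illustrative assumptions, not the module's API.

```python
def overall_score(dimensions):
    """Weighted average of dimension scores.

    dimensions: iterable of (score, weight) pairs.
    """
    pairs = list(dimensions)
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        raise ValueError("at least one dimension must have a non-zero weight")
    return sum(score * weight for score, weight in pairs) / total_weight
```

Because the sum is divided by the total weight, the result stays on the 0-100 scale even when the weights passed in do not add up to 1. For example, scores of 100, 50, and 0 with weights 0.20, 0.15, and 0.15 combine to 27.5 / 0.5 = 55.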

Classification

| Classification | Min Score |
|---|---|
| Healthy | 80 |
| Warning | 50 |
| Critical | 0 |
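
A minimal sketch of the threshold lookup, assuming each minimum score is inclusive (a score of exactly 80 is Healthy). The function name `classify` is hypothetical.

```python
def classify(overall_score):
    """Return the classification label for a 0-100 overall score."""
    # Assumption: the "Min Score" bounds are inclusive.
    if overall_score >= 80:
        return "Healthy"
    if overall_score >= 50:
        return "Warning"
    return "Critical"
```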

Data Fetching

The scorer uses two distinct fetch strategies based on metric type:

Gauge Metrics

Fetched directly from amv_hourly_snapshots_v1 using MAX/AVG aggregation. These are point-in-time values where MAX and AVG are meaningful.

Counter Metrics

Fetched using MAX-MIN delta per host, then SUM across hosts per cluster. This correctly computes the 7-day count for cumulative counters.

Important: The v6 view's LAG-based delta is NOT used because it inflates cumulative values for flat counters by orders of magnitude.
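
The MAX-MIN approach can be illustrated in memory. This sketch is an assumption about the shape of the computation, not the query the scorer runs: `counter_totals` and the `(cluster_id, host, value)` tuples are hypothetical stand-ins for rows read from the snapshot table.

```python
from collections import defaultdict


def counter_totals(samples):
    """Sum per-host (max - min) deltas of a cumulative counter per cluster.

    samples: iterable of (cluster_id, host, value) readings taken
    within the scoring window.
    """
    lowest, highest = {}, {}
    for cluster_id, host, value in samples:
        key = (cluster_id, host)
        lowest[key] = min(value, lowest.get(key, value))
        highest[key] = max(value, highest.get(key, value))

    totals = defaultdict(float)
    for key, max_value in highest.items():
        cluster_id, _host = key
        # Growth of the counter on this host over the window.
        totals[cluster_id] += max_value - lowest[key]
    return dict(totals)
```

A flat counter contributes a delta of zero here, which is the case the note above is concerned with. One limitation of this sketch is that a counter reset inside the window (for example after a process restart) would distort the MAX-MIN delta; how the scorer handles that case is not stated in the source.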

Alert Counts

The number of firing alerts per cluster is read from lark_alerts.

Dimensions (from health_rules.yaml)

| Dimension | Weight | Green | Yellow | Red | Notes |
|---|---|---|---|---|---|
| error_rate | 0.20 | 0.5% | 2.0% | 5.0% | Derived: errors/total*100 |
| compaction | 0.15 | 500 | 2,000 | 5,000 | Max tablet compaction score |
| memory_available_pct | 0.15 | 30% | 15% | 5% | Inverted (lower=worse) |
| be_process_mem_gb | 0.15 | 80 GB | 120 GB | 180 GB | Peak BE process memory |
| disk_used_pct | 0.10 | 70% | 85% | 95% | Derived from filesystem bytes |
| alert_activity | 0.15 | 2 | 5 | 10 | Firing alert count |
| fe_connection_total | 0.05 | 300 | 500 | 900 | Max FE connections |
| fe_journal_latency | 0.05 | 500 ms | 1,000 ms | 10,000 ms | Max journal write latency |
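
Two of the dimensions are derived rather than read directly. The sketch below shows one plausible way to compute them. The function names, the zero-denominator handling, and the exact inputs for disk usage (total and available bytes) are assumptions; the source only states that error_rate is errors/total*100 and that disk_used_pct comes from filesystem bytes.

```python
def error_rate_pct(errors, total):
    """Error rate as a percentage: errors / total * 100."""
    if total == 0:
        # Assumption: no traffic is treated as a 0% error rate.
        return 0.0
    return 100.0 * errors / total


def disk_used_pct(size_bytes, avail_bytes):
    """Used disk space as a percentage of filesystem size.

    Assumption: usage is derived as (size - available) / size.
    """
    if size_bytes == 0:
        return 0.0
    return 100.0 * (size_bytes - avail_bytes) / size_bytes
```

For example, 5 errors out of 1,000 requests gives 0.5%, which sits exactly on the green threshold for error_rate.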

Persistence

Scores are stored as snapshots in alerts.cluster_health_scores (DUPLICATE KEY table). Each run inserts a new row per cluster with the current timestamp, enabling historical comparison.

Usage

from byoc_agent.health_scorer import compute_cluster_scores, persist_scores

scores = compute_cluster_scores()
persist_scores(scores)

# Score specific clusters only
scores = compute_cluster_scores(cluster_ids=["abc-123", "def-456"])

# Load most recent snapshot from DB
from byoc_agent.health_scorer import load_latest_scores
latest = load_latest_scores()