AIIssueGrouper -- AI-Powered Alert Classification
Source: byoc_agent/ai_issue_grouper.py
The AIIssueGrouper replaces static rule-based grouping with LLM intelligence. It uses Claude Haiku to determine whether each new alert belongs to an existing issue or requires a new issue, classifies severity, sets status, and generates human-readable summaries.
Called by the alert pipeline every 5 minutes on the jumper host.
Configuration
| Parameter | Value |
|---|---|
| Model | claude-haiku-4-5-20251001 |
| Provider | Anthropic |
| Max tokens | 512 per classification |
How It Works
Batch Processing Flow
- Group by cluster -- New alerts are grouped by cluster_id for efficient processing.
- Fetch open issues -- For each cluster, fetch open or recently-resolved (within 4 hours) issues from alerts.ai_issues.
- Classify each alert -- In chronological order, call Claude to classify each alert against the open-issues context.
- Reconcile batch statuses -- If the same issue has both Firing and Resolved alerts in one batch, the final status is Ongoing (prevents status flip-flop).
- Persist -- Upsert results to alerts.ai_issues.
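The grouping and ordering steps of the flow above can be sketched as follows (group_and_order is an illustrative name, not necessarily the module's real helper):

```python
from collections import defaultdict

def group_and_order(alerts):
    """Group new alerts by cluster_id, then sort each cluster's alerts
    chronologically so classification sees them in arrival order."""
    by_cluster = defaultdict(list)
    for alert in alerts:
        by_cluster[alert["cluster_id"]].append(alert)
    for cluster_alerts in by_cluster.values():
        # Timestamps are "YYYY-MM-DD HH:MM:SS" strings, which sort
        # correctly as plain strings.
        cluster_alerts.sort(key=lambda a: a["created_at"])
    return dict(by_cluster)
```

Each per-cluster list is then classified alert by alert, with the cluster's open issues passed as context.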
AI Decision Rules
The system prompt instructs Claude to:
Group into the SAME issue when:
- Same root cause (e.g., CompactionScore and FEMaxTabletCompaction both relate to compaction backlog)
- A "Resolved" alert clearly resolves a "Firing" alert of the same type
- Multiple node failures on the same cluster likely stem from the same incident
- Memory pressure alerts on the same cluster are related
Create a NEW issue when:
- Fundamentally different alert type from open issues
- Different subsystem (FE vs BE) with no causal link
- Significant time gap (>4 hours) with no related alerts
- Important: Do NOT create new issues just because an existing issue shows "Resolved" -- recently-resolved issues (within 4 hours) with the same alert type should be merged (flapping/recurring incident)
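The "recently resolved counts as open" rule determines which issues appear in Claude's grouping context. A minimal sketch, assuming a resolved_at timestamp field (the column name is an assumption for illustration):

```python
from datetime import datetime, timedelta

def in_grouping_context(issue, now):
    """An issue is offered to Claude as a grouping candidate if it is
    still Ongoing, or was resolved within the last 4 hours (so flapping
    or recurring incidents merge instead of spawning new issues)."""
    if issue["issue_status"] == "Ongoing":
        return True
    # resolved_at is an assumed field name for the resolution time
    resolved_at = datetime.strptime(issue["resolved_at"], "%Y-%m-%d %H:%M:%S")
    return now - resolved_at <= timedelta(hours=4)
```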
Severity Classification
| Severity | Criteria |
|---|---|
| Critical | Node down, process not running, cluster state abnormal, data loss risk |
| Warning | Performance degradation, resource pressure, elevated error rates |
| Info | Operational events, silences, minor threshold breaches |
Severity escalates when: alert frequency is high (>5 firings in 1 hour), multiple alert types fire together, or a failure follows an anomaly.
Issue Status Logic
| Condition | Status |
|---|---|
| Latest alert is "Resolved" and no other active firings | Resolved |
| Any alert is "Firing" | Ongoing |
Deterministic override: at the individual-alert level, the latest alert's status always wins; batch reconciliation then ensures Ongoing wins whenever any Firing alert exists for the issue.
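The batch-reconciliation rule reduces to a one-liner (the function name is illustrative):

```python
def reconcile_issue_status(statuses):
    """Final status for an issue after a batch: Ongoing wins if any
    alert in the batch is Firing, preventing status flip-flop when
    Firing and Resolved alerts for the same issue arrive together."""
    return "Ongoing" if "Firing" in statuses else "Resolved"
```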
Response Format
Claude returns strict JSON:
```json
{
  "decision": "existing" | "new",
  "existing_issue_id": "<id if existing, else null>",
  "issue_name": "BE Memory Pressure + Compaction Backlog",
  "severity": "Critical" | "Warning" | "Info",
  "issue_status": "Ongoing" | "Resolved",
  "reasoning": "1-2 sentences explaining the decision",
  "summary": "Brief description for the ops team"
}
```
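Because the fallback path depends on detecting a bad reply, the response should be validated strictly. A minimal parser sketch (parse_classification is an illustrative name; the real module may validate differently):

```python
import json

REQUIRED_KEYS = {"decision", "existing_issue_id", "issue_name",
                 "severity", "issue_status", "reasoning", "summary"}

def parse_classification(raw):
    """Parse Claude's strict-JSON reply, raising ValueError on missing
    fields or out-of-range enum values so the caller can fall back."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["decision"] not in ("existing", "new"):
        raise ValueError(f"bad decision: {data['decision']}")
    if data["severity"] not in ("Critical", "Warning", "Info"):
        raise ValueError(f"bad severity: {data['severity']}")
    if data["issue_status"] not in ("Ongoing", "Resolved"):
        raise ValueError(f"bad issue_status: {data['issue_status']}")
    return data
```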
Fallback Behavior
If AI classification fails for an alert, a fallback creates a standalone issue without AI:
- Issue name: {alert_name} -- {cluster_name}
- Severity: Critical if the alert name contains "Abnormal", "Failed", or "NotRunning"; otherwise Warning
- No AI reasoning or summary
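The fallback rules above amount to a small pure function (fallback_issue is an illustrative name):

```python
def fallback_issue(alert):
    """Non-AI fallback: build a standalone issue with a templated name
    and a keyword-based severity heuristic."""
    keywords = ("Abnormal", "Failed", "NotRunning")
    severity = ("Critical"
                if any(k in alert["alert_name"] for k in keywords)
                else "Warning")
    return {
        "issue_name": f'{alert["alert_name"]} -- {alert["cluster_name"]}',
        "severity": severity,
        "ai_reasoning": None,  # no AI reasoning on the fallback path
        "ai_summary": None,
    }
```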
Persistence
Results are written to alerts.ai_issues with:
| Column | Description |
|---|---|
| issue_id | Deterministic hash from cluster_id + timestamp |
| issue_name | AI-generated descriptive name |
| severity | AI-classified severity |
| issue_status | Ongoing / Resolved |
| triage_status | Starts as "New" |
| disposition_status | Starts as "New" |
| alert_count | Counts only Firing alerts (a Resolved alert is a notification, not a real alert) |
| alert_names | Comma-separated distinct alert types |
| ai_reasoning | Claude's reasoning for the grouping decision |
| ai_summary | Human-readable issue summary |
| alert_message_ids | Comma-separated Lark message IDs |
For existing issues, the update merges alert counts, names, and message IDs.
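The merge for an existing issue can be sketched as follows; field names mirror the alerts.ai_issues columns, and the sorted ordering of alert_names is an assumption for illustration:

```python
def merge_issue_update(existing, new_alerts):
    """Merge new alerts into an existing issue row: only Firing alerts
    increment alert_count, alert names are unioned as a distinct set,
    and Lark message IDs are appended."""
    firing = [a for a in new_alerts if a["alert_status"] == "Firing"]
    names = (set(existing["alert_names"].split(","))
             | {a["alert_name"] for a in new_alerts})
    msg_ids = (existing["alert_message_ids"].split(",")
               + [a["message_id"] for a in new_alerts])
    return {
        "alert_count": existing["alert_count"] + len(firing),
        "alert_names": ",".join(sorted(names)),
        "alert_message_ids": ",".join(msg_ids),
    }
```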
LLM Usage Tracking
Every classification call logs usage to alerts.llm_usage with source "ai_issue_grouper", including input/output tokens and duration.
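A sketch of capturing those fields around a classification call; the wrapper shape and the response layout (a dict with a "usage" entry) are assumptions for illustration, not the module's actual interface:

```python
import time

def record_usage(classify_call):
    """Run one classification call and build the record logged to
    alerts.llm_usage: fixed source tag, token counts, and wall-clock
    duration in milliseconds."""
    start = time.monotonic()
    response = classify_call()  # assumed to return {"usage": {...}, ...}
    return {
        "source": "ai_issue_grouper",
        "input_tokens": response["usage"]["input_tokens"],
        "output_tokens": response["usage"]["output_tokens"],
        "duration_ms": int((time.monotonic() - start) * 1000),
    }
```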
Usage
```python
from byoc_agent.ai_issue_grouper import process_new_alerts_with_ai

alerts = [
    {"cluster_id": "abc", "cluster_name": "prod-1", "alert_name": "CompactionScore",
     "alert_status": "Firing", "alert_detail": "...", "created_at": "2026-03-25 10:00:00",
     "message_id": "msg123", "account_name": "Acme", "region": "us-east-1"},
]

count = process_new_alerts_with_ai(alerts)
# Returns: number of alerts processed
```