AIIssueGrouper -- AI-Powered Alert Classification

Source: byoc_agent/ai_issue_grouper.py

The AIIssueGrouper replaces static rule-based grouping with LLM intelligence. It uses Claude Haiku to determine whether each new alert belongs to an existing issue or requires a new issue, classifies severity, sets status, and generates human-readable summaries.

Called by the alert pipeline every 5 minutes on the jumper host.

Configuration

| Parameter | Value |
| --- | --- |
| Model | claude-haiku-4-5-20251001 |
| Provider | Anthropic |
| Max tokens | 512 per classification |

How It Works

Batch Processing Flow

  1. Group by cluster -- New alerts are grouped by cluster_id for efficient processing.
  2. Fetch open issues -- For each cluster, fetch open or recently-resolved (within 4 hours) issues from alerts.ai_issues.
  3. Classify each alert -- In chronological order, call Claude to classify each alert against the open issues context.
  4. Reconcile batch statuses -- If the same issue has both Firing and Resolved alerts in one batch, the final status is Ongoing (prevents flip-flop).
  5. Persist -- Upsert results to alerts.ai_issues.
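
The five steps above can be sketched as follows; `fetch_open_issues`, `classify_alert`, and `persist_issue` are hypothetical stand-ins for the module's real helpers, not its actual API:

```python
from collections import defaultdict

def process_batch(alerts, fetch_open_issues, classify_alert, persist_issue):
    """Sketch of the batch flow (helper callables are assumed, not real)."""
    # 1. Group new alerts by cluster_id for efficient processing
    by_cluster = defaultdict(list)
    for alert in alerts:
        by_cluster[alert["cluster_id"]].append(alert)

    processed = 0
    for cluster_id, cluster_alerts in by_cluster.items():
        # 2. Fetch open / recently-resolved issues for this cluster
        open_issues = fetch_open_issues(cluster_id)
        # 3. Classify each alert in chronological order against that context
        results = []
        for alert in sorted(cluster_alerts, key=lambda a: a["created_at"]):
            results.append(classify_alert(alert, open_issues))
            processed += 1
        # 4./5. Reconciliation and persistence happen per classified result
        for result in results:
            persist_issue(result)
    return processed
```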

AI Decision Rules

The system prompt instructs Claude to:

Group into the SAME issue when:

  • Same root cause (e.g., CompactionScore and FEMaxTabletCompaction both relate to compaction backlog)
  • A "Resolved" alert clearly resolves a "Firing" alert of the same type
  • Multiple node failures on the same cluster likely stem from the same incident
  • Memory pressure alerts on the same cluster are related

Create a NEW issue when:

  • Fundamentally different alert type from open issues
  • Different subsystem (FE vs BE) with no causal link
  • Significant time gap (>4 hours) with no related alerts
  • Important: Do NOT create new issues just because an existing issue shows "Resolved" -- recently-resolved issues (within 4 hours) with the same alert type should be merged (flapping/recurring incident)
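
The "recently-resolved within 4 hours" merge window implies a candidate-selection step like the sketch below; field names such as `resolved_at` are assumptions, not taken from the source:

```python
from datetime import datetime, timedelta

def candidate_issues(issues, now, window=timedelta(hours=4)):
    """Sketch: open issues plus issues resolved within the last 4 hours
    are offered to Claude as merge candidates (field names assumed)."""
    keep = []
    for issue in issues:
        if issue["issue_status"] == "Ongoing":
            keep.append(issue)  # open issues are always candidates
        elif now - issue["resolved_at"] <= window:
            keep.append(issue)  # recently resolved: may be flapping/recurring
    return keep
```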

Severity Classification

| Severity | Criteria |
| --- | --- |
| Critical | Node down, process not running, cluster state abnormal, data loss risk |
| Warning | Performance degradation, resource pressure, elevated error rates |
| Info | Operational events, silences, minor threshold breaches |

Severity escalates when: alert frequency is high (>5 firings in 1 hour), multiple alert types fire together, or a failure follows an anomaly.
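
The frequency trigger (>5 firings in 1 hour) could be checked with a rolling window like this sketch; the function name and windowing approach are assumptions, not the module's actual implementation:

```python
from datetime import datetime, timedelta

def should_escalate(firing_times, window=timedelta(hours=1), threshold=5):
    """Sketch: escalate when more than `threshold` firings fall inside
    any rolling `window` (names and defaults are assumptions)."""
    times = sorted(firing_times)
    for i, start in enumerate(times):
        # count firings within the window starting at this firing
        n = sum(1 for t in times[i:] if t - start <= window)
        if n > threshold:
            return True
    return False
```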

Issue Status Logic

| Condition | Status |
| --- | --- |
| Latest alert is "Resolved" and no other active firings | Resolved |
| Any alert is "Firing" | Ongoing |

Deterministic override: at the individual-alert level, the most recent alert's status wins; batch reconciliation then overrides this so that Ongoing wins whenever any Firing alert exists for the issue.
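
A minimal sketch of the batch reconciliation rule, assuming classifications arrive as (issue_id, alert_status) pairs:

```python
def reconcile_batch(classifications):
    """Sketch: a Firing alert anywhere in the batch pins the issue to
    Ongoing; only an all-Resolved batch yields Resolved."""
    final = {}
    for issue_id, alert_status in classifications:
        if alert_status == "Firing":
            final[issue_id] = "Ongoing"        # Firing always wins
        else:
            final.setdefault(issue_id, "Resolved")  # never overrides Ongoing
    return final
```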

Response Format

Claude returns strict JSON:

{
  "decision": "existing" | "new",
  "existing_issue_id": "<id if existing, else null>",
  "issue_name": "BE Memory Pressure + Compaction Backlog",
  "severity": "Critical" | "Warning" | "Info",
  "issue_status": "Ongoing" | "Resolved",
  "reasoning": "1-2 sentences explaining the decision",
  "summary": "Brief description for the ops team"
}
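
Since Claude must return strict JSON, the caller presumably validates the reply before use. This is a hedged sketch; the function name and error handling are assumptions:

```python
import json

REQUIRED = {"decision", "existing_issue_id", "issue_name",
            "severity", "issue_status", "reasoning", "summary"}

def parse_classification(raw_text):
    """Sketch: validate Claude's JSON reply; raise ValueError on
    malformed output so the caller can trigger the fallback path."""
    data = json.loads(raw_text)
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["decision"] not in ("existing", "new"):
        raise ValueError(f"bad decision: {data['decision']}")
    if data["decision"] == "existing" and not data["existing_issue_id"]:
        raise ValueError("decision is 'existing' but existing_issue_id is empty")
    return data
```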

Fallback Behavior

If AI classification fails for an alert, a fallback creates a standalone issue without AI:

  • Issue name: {alert_name} -- {cluster_name}
  • Severity: Critical if alert name contains "Abnormal", "Failed", or "NotRunning"; otherwise Warning
  • No AI reasoning or summary
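
The fallback rules above can be expressed directly; this sketch assumes a dict-shaped alert like the one in the Usage section, and the function name is hypothetical:

```python
def fallback_issue(alert):
    """Sketch: standalone issue built without AI when classification fails."""
    name = alert["alert_name"]
    # Keyword-based severity, per the fallback rules
    severity = ("Critical"
                if any(k in name for k in ("Abnormal", "Failed", "NotRunning"))
                else "Warning")
    return {
        "issue_name": f"{name} -- {alert['cluster_name']}",
        "severity": severity,
        "ai_reasoning": None,   # no AI reasoning in the fallback path
        "ai_summary": None,
    }
```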

Persistence

Results are written to alerts.ai_issues with:

| Column | Description |
| --- | --- |
| issue_id | Deterministic hash from cluster_id + timestamp |
| issue_name | AI-generated descriptive name |
| severity | AI-classified severity |
| issue_status | Ongoing / Resolved |
| triage_status | Starts as "New" |
| disposition_status | Starts as "New" |
| alert_count | Only counts Firing alerts (Resolved is a notification, not a real alert) |
| alert_names | Comma-separated distinct alert types |
| ai_reasoning | Claude's reasoning for the grouping decision |
| ai_summary | Human-readable issue summary |
| alert_message_ids | Comma-separated Lark message IDs |
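
A sketch of the deterministic issue_id derivation; the source only says "deterministic hash from cluster_id + timestamp", so the hash function and truncation here are assumptions:

```python
import hashlib

def make_issue_id(cluster_id, timestamp):
    """Sketch: same inputs always yield the same issue_id (hash choice assumed)."""
    return hashlib.sha256(f"{cluster_id}:{timestamp}".encode()).hexdigest()[:16]
```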

For existing issues, the update merges alert counts, names, and message IDs.
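
The merge for an existing issue might look like this sketch. The comma-joined string columns follow the persistence columns described above, but the sorting and set-union details are assumptions:

```python
def merge_issue(existing, new_alerts):
    """Sketch: upsert merge for an existing issue. Only Firing alerts
    increment alert_count; names and Lark message IDs are merged as
    comma-separated distinct values."""
    firing = [a for a in new_alerts if a["alert_status"] == "Firing"]
    names = (set(existing["alert_names"].split(","))
             | {a["alert_name"] for a in new_alerts})
    msg_ids = (set(existing["alert_message_ids"].split(","))
               | {a["message_id"] for a in new_alerts})
    return {
        **existing,
        "alert_count": existing["alert_count"] + len(firing),
        "alert_names": ",".join(sorted(names)),
        "alert_message_ids": ",".join(sorted(msg_ids)),
    }
```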

LLM Usage Tracking

Every classification call logs usage to alerts.llm_usage with source "ai_issue_grouper", including input/output tokens and duration.
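
Usage tracking can be sketched as a small timing context manager; `track_usage` and the `sink` callable are hypothetical (the source only specifies the logged fields and the "ai_issue_grouper" source tag):

```python
import time
from contextlib import contextmanager

@contextmanager
def track_usage(sink):
    """Sketch: time a classification call and hand a usage row to `sink`,
    e.g. an INSERT into alerts.llm_usage (names are assumptions)."""
    usage = {"source": "ai_issue_grouper", "input_tokens": 0, "output_tokens": 0}
    start = time.monotonic()
    try:
        yield usage  # caller fills in token counts from the API response
    finally:
        usage["duration_ms"] = int((time.monotonic() - start) * 1000)
        sink(usage)
```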

Usage

from byoc_agent.ai_issue_grouper import process_new_alerts_with_ai

alerts = [
    {"cluster_id": "abc", "cluster_name": "prod-1", "alert_name": "CompactionScore",
     "alert_status": "Firing", "alert_detail": "...", "created_at": "2026-03-25 10:00:00",
     "message_id": "msg123", "account_name": "Acme", "region": "us-east-1"},
]

count = process_new_alerts_with_ai(alerts)
# Returns: number of alerts processed