AIIssueGrouper -- AI-Powered Alert Classification

Source: byoc_agent/ai_issue_grouper.py

The AIIssueGrouper replaces static rule-based grouping with LLM intelligence. It uses Claude Haiku to determine whether each new alert belongs to an existing issue or requires a new issue, classifies severity, sets status, and generates human-readable summaries.

Called by the alert pipeline every 5 minutes on the jumper host.

Configuration

| Parameter | Value |
| --- | --- |
| Model | claude-haiku-4-5-20251001 |
| Provider | Anthropic |
| Max tokens | 512 per classification |

How It Works

Batch Processing Flow

  1. Group by cluster -- New alerts are grouped by cluster_id for efficient processing.
  2. Fetch open issues -- For each cluster, fetch open or recently-resolved (within 4 hours) issues from alerts.ai_issues.
  3. Classify each alert -- In chronological order, call Claude to classify each alert against the open issues context.
  4. Reconcile batch statuses -- If the same issue has both Firing and Resolved alerts in one batch, the final status is Ongoing (prevents flip-flop).
  5. Persist -- Upsert results to alerts.ai_issues.
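
The five steps above can be sketched as follows; `fetch_open_issues`, `classify_alert`, and `persist_issue` are hypothetical stand-ins for the module's real helpers, not its actual API:

```python
from collections import defaultdict

def process_batch(alerts, fetch_open_issues, classify_alert, persist_issue):
    """Sketch of the batch flow (helper callables are assumed, not real)."""
    # 1. Group new alerts by cluster_id for efficient processing
    by_cluster = defaultdict(list)
    for alert in alerts:
        by_cluster[alert["cluster_id"]].append(alert)

    processed = 0
    for cluster_id, cluster_alerts in by_cluster.items():
        # 2. Fetch open / recently-resolved issues for this cluster
        open_issues = fetch_open_issues(cluster_id)
        # 3. Classify each alert in chronological order against that context
        results = []
        for alert in sorted(cluster_alerts, key=lambda a: a["created_at"]):
            results.append(classify_alert(alert, open_issues))
            processed += 1
        # 4./5. Reconciliation and persistence happen per classified result
        for result in results:
            persist_issue(result)
    return processed
```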

AI Decision Rules

The system prompt instructs Claude to:

Group into the SAME issue when:

  • Same root cause (e.g., CompactionScore and FEMaxTabletCompaction both relate to compaction backlog)
  • A "Resolved" alert clearly resolves a "Firing" alert of the same type
  • Multiple node failures on the same cluster likely stem from the same incident
  • Memory pressure alerts on the same cluster are related

Create a NEW issue when:

  • Fundamentally different alert type from open issues
  • Different subsystem (FE vs BE) with no causal link
  • Significant time gap (>4 hours) with no related alerts
  • Important: Do NOT create new issues just because an existing issue shows "Resolved" -- recently-resolved issues (within 4 hours) with the same alert type should be merged (flapping/recurring incident)
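
The "recently-resolved within 4 hours" merge window implies a candidate-selection step like the sketch below; field names such as `resolved_at` are assumptions, not taken from the source:

```python
from datetime import datetime, timedelta

def candidate_issues(issues, now, window=timedelta(hours=4)):
    """Sketch: open issues plus issues resolved within the last 4 hours
    are offered to Claude as merge candidates (field names assumed)."""
    keep = []
    for issue in issues:
        if issue["issue_status"] == "Ongoing":
            keep.append(issue)  # open issues are always candidates
        elif now - issue["resolved_at"] <= window:
            keep.append(issue)  # recently resolved: may be flapping/recurring
    return keep
```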

Severity Classification

| Severity | Criteria |
| --- | --- |
| Critical | Node down, process not running, cluster state abnormal, data loss risk |
| Warning | Performance degradation, resource pressure, elevated error rates |
| Info | Operational events, silences, minor threshold breaches |

Severity escalates when: alert frequency is high (>5 firings in 1 hour), multiple alert types fire together, or a failure follows an anomaly.
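
The frequency trigger (>5 firings in 1 hour) could be checked with a rolling window like this sketch; the function name and windowing approach are assumptions, not the module's actual implementation:

```python
from datetime import datetime, timedelta

def should_escalate(firing_times, window=timedelta(hours=1), threshold=5):
    """Sketch: escalate when more than `threshold` firings fall inside
    any rolling `window` (names and defaults are assumptions)."""
    times = sorted(firing_times)
    for i, start in enumerate(times):
        # count firings within the window starting at this firing
        n = sum(1 for t in times[i:] if t - start <= window)
        if n > threshold:
            return True
    return False
```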

Issue Status Logic

| Condition | Status |
| --- | --- |
| Latest alert is "Resolved" and no other active firings | Resolved |
| Any alert is "Firing" | Ongoing |

Deterministic override: at the individual-alert level, the most recent alert's status wins; batch reconciliation then overrides this so that Ongoing wins whenever any Firing alert exists for the issue.
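
A minimal sketch of the batch reconciliation rule, assuming classifications arrive as (issue_id, alert_status) pairs:

```python
def reconcile_batch(classifications):
    """Sketch: a Firing alert anywhere in the batch pins the issue to
    Ongoing; only an all-Resolved batch yields Resolved."""
    final = {}
    for issue_id, alert_status in classifications:
        if alert_status == "Firing":
            final[issue_id] = "Ongoing"        # Firing always wins
        else:
            final.setdefault(issue_id, "Resolved")  # never overrides Ongoing
    return final
```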

Response Format

Claude returns strict JSON:

{
  "decision": "existing" | "new",
  "existing_issue_id": "<id if existing, else null>",
  "issue_name": "BE Memory Pressure + Compaction Backlog",
  "severity": "Critical" | "Warning" | "Info",
  "issue_status": "Ongoing" | "Resolved",
  "reasoning": "1-2 sentences explaining the decision",
  "summary": "Brief description for the ops team"
}
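
Since Claude must return strict JSON, the caller presumably validates the reply before use. This is a hedged sketch; the function name and error handling are assumptions:

```python
import json

REQUIRED = {"decision", "existing_issue_id", "issue_name",
            "severity", "issue_status", "reasoning", "summary"}

def parse_classification(raw_text):
    """Sketch: validate Claude's JSON reply; raise ValueError on
    malformed output so the caller can trigger the fallback path."""
    data = json.loads(raw_text)
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["decision"] not in ("existing", "new"):
        raise ValueError(f"bad decision: {data['decision']}")
    if data["decision"] == "existing" and not data["existing_issue_id"]:
        raise ValueError("decision is 'existing' but existing_issue_id is empty")
    return data
```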

Fallback Behavior

If AI classification fails for an alert, a fallback creates a standalone issue without AI:

  • Issue name: {alert_name} -- {cluster_name}
  • Severity: Critical if alert name contains "Abnormal", "Failed", or "NotRunning"; otherwise Warning
  • No AI reasoning or summary
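
The fallback rules above can be expressed directly; this sketch assumes a dict-shaped alert like the one in the Usage section, and the function name is hypothetical:

```python
def fallback_issue(alert):
    """Sketch: standalone issue built without AI when classification fails."""
    name = alert["alert_name"]
    # Keyword-based severity, per the fallback rules
    severity = ("Critical"
                if any(k in name for k in ("Abnormal", "Failed", "NotRunning"))
                else "Warning")
    return {
        "issue_name": f"{name} -- {alert['cluster_name']}",
        "severity": severity,
        "ai_reasoning": None,   # no AI reasoning in the fallback path
        "ai_summary": None,
    }
```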

Persistence

Results are written to alerts.ai_issues with:

| Column | Description |
| --- | --- |
| issue_id | Deterministic hash from cluster_id + timestamp |
| issue_name | AI-generated descriptive name |
| severity | AI-classified severity |
| issue_status | Ongoing / Resolved |
| triage_status | Starts as "New" |
| disposition_status | Starts as "New" |
| alert_count | Only counts Firing alerts (Resolved is a notification, not a real alert) |
| alert_names | Comma-separated distinct alert types |
| ai_reasoning | Claude's reasoning for the grouping decision |
| ai_summary | Human-readable issue summary |
| alert_message_ids | Comma-separated Lark message IDs |
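
A sketch of the deterministic issue_id derivation; the source only says "deterministic hash from cluster_id + timestamp", so the hash function and truncation here are assumptions:

```python
import hashlib

def make_issue_id(cluster_id, timestamp):
    """Sketch: same inputs always yield the same issue_id (hash choice assumed)."""
    return hashlib.sha256(f"{cluster_id}:{timestamp}".encode()).hexdigest()[:16]
```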

For existing issues, the update merges alert counts, names, and message IDs.
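
The merge for an existing issue might look like this sketch. The comma-joined string columns follow the persistence columns described above, but the sorting and set-union details are assumptions:

```python
def merge_issue(existing, new_alerts):
    """Sketch: upsert merge for an existing issue. Only Firing alerts
    increment alert_count; names and Lark message IDs are merged as
    comma-separated distinct values."""
    firing = [a for a in new_alerts if a["alert_status"] == "Firing"]
    names = (set(existing["alert_names"].split(","))
             | {a["alert_name"] for a in new_alerts})
    msg_ids = (set(existing["alert_message_ids"].split(","))
               | {a["message_id"] for a in new_alerts})
    return {
        **existing,
        "alert_count": existing["alert_count"] + len(firing),
        "alert_names": ",".join(sorted(names)),
        "alert_message_ids": ",".join(sorted(msg_ids)),
    }
```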

LLM Usage Tracking

Every classification call logs usage to alerts.llm_usage with source "ai_issue_grouper", including input/output tokens and duration.
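
Usage tracking can be sketched as a small timing context manager; `track_usage` and the `sink` callable are hypothetical (the source only specifies the logged fields and the "ai_issue_grouper" source tag):

```python
import time
from contextlib import contextmanager

@contextmanager
def track_usage(sink):
    """Sketch: time a classification call and hand a usage row to `sink`,
    e.g. an INSERT into alerts.llm_usage (names are assumptions)."""
    usage = {"source": "ai_issue_grouper", "input_tokens": 0, "output_tokens": 0}
    start = time.monotonic()
    try:
        yield usage  # caller fills in token counts from the API response
    finally:
        usage["duration_ms"] = int((time.monotonic() - start) * 1000)
        sink(usage)
```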

Usage

from byoc_agent.ai_issue_grouper import process_new_alerts_with_ai

alerts = [
    {"cluster_id": "abc", "cluster_name": "prod-1", "alert_name": "CompactionScore",
     "alert_status": "Firing", "alert_detail": "...", "created_at": "2026-03-25 10:00:00",
     "message_id": "msg123", "account_name": "Acme", "region": "us-east-1"},
]

count = process_new_alerts_with_ai(alerts)
# Returns: number of alerts processed