IssueTracker -- Issue Lifecycle Management

Source: byoc_agent/issue_tracker.py
Rules: byoc_agent/issue_rules.yaml (v2)

The IssueTracker groups alerts into issues, manages the issue lifecycle, and provides triage workflow capabilities. Issues are persisted in the alerts.issues table (a PRIMARY KEY table, which supports UPDATE).

Alert Grouping Strategy (v2)

Alerts fall into two categories requiring fundamentally different grouping:

Anomaly Alerts (ongoing conditions)

Examples: OperationDurationGT10m, HeapUsageTooHigh, CompactionScore

These represent a condition that persists. The alert fires repeatedly (~every 15 min) while the condition exists.

Grouping: All firings on the same cluster for the same alert type belong to the SAME issue until a Resolved event arrives. No time window needed.
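The anomaly rule above can be sketched as a small state machine keyed by (cluster, alert type). This is a minimal illustration, not the code in issue_tracker.py: the class names, the "Firing"/"Resolved" state strings, and the in-memory dict are all assumptions, and the reopen window is not modeled here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class AnomalyIssue:  # hypothetical, simplified stand-in for an issue record
    cluster_id: str
    alert_name: str
    status: str = "Ongoing"
    firings: List[str] = field(default_factory=list)  # firing timestamps


class AnomalyGrouper:
    """Keeps one open issue per (cluster, alert type) until it is resolved."""

    def __init__(self) -> None:
        self._open: Dict[Tuple[str, str], AnomalyIssue] = {}
        self.closed: List[AnomalyIssue] = []

    def on_event(self, cluster_id: str, alert_name: str,
                 state: str, timestamp: str) -> Optional[AnomalyIssue]:
        key = (cluster_id, alert_name)
        issue = self._open.get(key)
        if state == "Resolved":
            # A Resolved event closes the open issue, if there is one.
            if issue is not None:
                issue.status = "Resolved"
                self.closed.append(self._open.pop(key))
            return issue
        if issue is None:
            # First firing for this cluster and alert type opens a new issue.
            issue = AnomalyIssue(cluster_id, alert_name)
            self._open[key] = issue
        # Repeat firings attach to the same issue; no time window is consulted.
        issue.firings.append(timestamp)
        return issue
```

Because the key ignores time entirely, a condition that fires every 15 minutes for a day still yields a single issue.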

Failure Alerts (discrete events)

Examples: OperationAbnormal, BeAliveAbnormal, ProcNotRunning

These represent something that broke. Each firing is a discrete event rather than an ongoing condition.

Grouping: Time-window based (30 min default). Multiple failures on the same cluster within the window are the same incident. Correlated failures use a wider window (4 hours).
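As a rough sketch of window-based grouping, the function below splits one cluster's failure timestamps into incidents. It assumes the window is measured as the gap from the most recent alert in the incident (the configuration table describes it as a "max gap"); the function name and constants are illustrative, not taken from issue_tracker.py.

```python
from datetime import datetime, timedelta
from typing import List

TIME_WINDOW = timedelta(minutes=30)          # default failure window
CORRELATION_WINDOW = timedelta(minutes=240)  # wider window for correlated failures


def group_failures(timestamps: List[datetime],
                   window: timedelta = TIME_WINDOW) -> List[List[datetime]]:
    """Split one cluster's failure timestamps into incidents by maximum gap."""
    incidents: List[List[datetime]] = []
    for ts in sorted(timestamps):
        if incidents and ts - incidents[-1][-1] <= window:
            # Within the window of the previous failure: same incident.
            incidents[-1].append(ts)
        else:
            # Gap too large (or first failure): start a new incident.
            incidents.append([ts])
    return incidents
```

With failures at 12:00, 12:20, 12:45, and 14:00, the default window yields two incidents (the 75-minute gap splits them), while the 4-hour correlation window keeps all four together.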

Escalation

When a failure fires on a cluster that already has an open anomaly issue in the same correlation group, the failure merges into the anomaly issue and escalates it. For example, "Operational Anomaly" becomes "Operational Failure", and its severity rises from Warning to Critical.
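A minimal sketch of the merge-and-escalate step follows. The OpenIssue class and merge_failure function are hypothetical, and deriving the new name by replacing "Anomaly" with "Failure" is an assumption based only on the example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OpenIssue:  # hypothetical, simplified view of an open anomaly issue
    cluster_id: str
    correlation_group: str
    issue_name: str
    severity: str = "Warning"
    alert_names: List[str] = field(default_factory=list)


def merge_failure(issue: OpenIssue, cluster_id: str,
                  group: str, alert_name: str) -> bool:
    """Merge a failure alert into an open anomaly issue and escalate it.

    Returns False and leaves the issue untouched when the failure is on a
    different cluster or belongs to a different correlation group.
    """
    if issue.cluster_id != cluster_id or issue.correlation_group != group:
        return False
    if alert_name not in issue.alert_names:
        issue.alert_names.append(alert_name)
    # Assumed renaming rule: "... Anomaly" becomes "... Failure".
    issue.issue_name = issue.issue_name.replace("Anomaly", "Failure")
    issue.severity = "Critical"
    return True
```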

Issue Lifecycle (v3)

Issue Status

| Status | Meaning |
| --- | --- |
| Ongoing | Alert is currently firing |
| Resolved | Alert has been resolved |

Triage Status

| Status | Description |
| --- | --- |
| New | Just created, not yet reviewed |
| Acknowledged | Operator has seen it |
| Monitoring/Investigating | Under active investigation |
| Mitigating | Fix in progress |
| Closed - Fixed Issue | Root cause fixed |
| Closed - Auto Resolved | Resolved without intervention |
| Closed - False Positive | Not a real issue |

Disposition Status

| Status | Description |
| --- | --- |
| New | Not yet dispositioned |
| No Action Needed | Benign or self-resolved |
| JIRA Created | Tracking ticket created |

Severity Classification

Alert names are mapped to severity based on pattern matching (more specific patterns take precedence):

| Pattern | Severity | Examples |
| --- | --- | --- |
| BeAliveAbnormal, BeNodeAbnormal, ClusterStateAbnormal, FeNodeAbnormal | Critical | Node/cluster failures |
| Failed, NotRunning | Critical | Process failures |
| Abnormal (catch-all) | Critical | Other abnormal states |
| OperationAbnormal | Warning | Cluster create/resume failures (not critical) |
| HeapUsageTooHigh, FEHeapUsageTooHigh | Warning | Memory pressure |
| FEMaxTabletCompaction, CompactionScore | Warning | Compaction backlog |
| DurationGT*, GT200, GT0, GT10 | Warning | Latency/threshold breaches |
| Silence | Info | Operational events |
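One way to get "more specific patterns take precedence" is an ordered rule list where the first match wins. The sketch below assumes plain substring matching and a hand-chosen ordering. The real matcher in issue_tracker.py may use regexes or a different order, and the source does not say what happens when nothing matches, so this version returns None in that case.

```python
from typing import List, Optional, Tuple

# Ordered from most specific to least specific; the first match wins.
SEVERITY_RULES: List[Tuple[str, str]] = [
    ("OperationAbnormal", "Warning"),   # must precede the "Abnormal" catch-all
    ("BeAliveAbnormal", "Critical"),
    ("BeNodeAbnormal", "Critical"),
    ("ClusterStateAbnormal", "Critical"),
    ("FeNodeAbnormal", "Critical"),
    ("HeapUsageTooHigh", "Warning"),    # also matches FEHeapUsageTooHigh
    ("FEMaxTabletCompaction", "Warning"),
    ("CompactionScore", "Warning"),
    ("DurationGT", "Warning"),
    ("GT200", "Warning"),
    ("GT10", "Warning"),
    ("GT0", "Warning"),
    ("Silence", "Info"),
    ("Failed", "Critical"),
    ("NotRunning", "Critical"),
    ("Abnormal", "Critical"),           # catch-all, checked last
]


def classify_severity(alert_name: str) -> Optional[str]:
    """Return the severity of the first rule whose pattern occurs in the name."""
    for pattern, severity in SEVERITY_RULES:
        if pattern in alert_name:
            return severity
    return None  # fallback behavior is not specified in the source
```

Placing OperationAbnormal ahead of the Abnormal catch-all is what keeps cluster create/resume failures at Warning instead of Critical.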

Correlation Groups

Alerts in the same group merge into one issue when they fire on the same cluster:

be_node_failure (Critical)

  • BeAliveAbnormal, ProcNotRunning, BeNodeAbnormal, ClusterStateAbnormal

fe_memory_pressure (Warning, escalates to Critical)

  • FEHeapUsageTooHigh, HeapUsageTooHigh, FEGCCount, JVMOldGC
  • ProcNotRunning (OOM crash = escalation trigger)

compaction_backlog (Warning)

  • FEMaxTabletCompaction, CompactionScore

disk_pressure (Warning)

  • FreeDiskLessThan, RootFreeDiskLessThan

operational_anomaly (Warning)

  • OperationDurationGT (anomaly), OperationAbnormal (failure)
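The groups above can be represented as a plain mapping from group name to member patterns. This is an illustrative sketch: the dict layout, the substring matching (so that OperationDurationGT also covers names such as OperationDurationGT10m), and the groups_for helper are assumptions rather than the structure used in issue_rules.yaml.

```python
from typing import Dict, List

CORRELATION_GROUPS: Dict[str, List[str]] = {
    "be_node_failure": ["BeAliveAbnormal", "ProcNotRunning",
                        "BeNodeAbnormal", "ClusterStateAbnormal"],
    "fe_memory_pressure": ["FEHeapUsageTooHigh", "HeapUsageTooHigh",
                           "FEGCCount", "JVMOldGC", "ProcNotRunning"],
    "compaction_backlog": ["FEMaxTabletCompaction", "CompactionScore"],
    "disk_pressure": ["FreeDiskLessThan", "RootFreeDiskLessThan"],
    "operational_anomaly": ["OperationDurationGT", "OperationAbnormal"],
}


def groups_for(alert_name: str) -> List[str]:
    """Return every correlation group whose patterns occur in the alert name."""
    return [
        group
        for group, patterns in CORRELATION_GROUPS.items()
        if any(pattern in alert_name for pattern in patterns)
    ]
```

Note that ProcNotRunning belongs to two groups, so a lookup has to return a list. Which group wins when both have an open issue on the cluster is not stated in the source.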

Runbooks

The IssueTracker includes built-in runbooks for common alert types (ALERT_RUNBOOKS dict). Each runbook provides:

  • A description of the problem
  • Step-by-step investigation and remediation instructions
  • A runbook URL reference

Covered alert types: ProcNotRunning, ClusterStateAbnormal, BeAliveAbnormal, BeNodeAbnormal, FEHeapUsageTooHigh, HeapUsageTooHigh, FEMaxTabletCompaction, CompactionScore, OperationDurationGT, OperationAbnormal, FEQueryErrRate, FreeDiskLessThan.
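To make the shape of a runbook entry concrete, here is a hedged sketch. Only the dict name ALERT_RUNBOOKS and the three kinds of content come from this page. The key names (description, steps, url), the exact-match lookup, and the suggested_action helper are assumptions, and the values are placeholders rather than real runbook text.

```python
from typing import Dict, List, Optional, TypedDict


class Runbook(TypedDict):  # hypothetical entry shape
    description: str
    steps: List[str]
    url: str


# Placeholder entry only; the real text and URLs live in issue_tracker.py.
ALERT_RUNBOOKS: Dict[str, Runbook] = {
    "ProcNotRunning": {
        "description": "<problem description>",
        "steps": ["<investigation step>", "<remediation step>"],
        "url": "<runbook URL>",
    },
}


def suggested_action(alert_name: str) -> Optional[str]:
    """Render a runbook's steps as a numbered list, or None if uncovered."""
    runbook = ALERT_RUNBOOKS.get(alert_name)
    if runbook is None:
        return None
    return "\n".join(
        f"{number}. {step}" for number, step in enumerate(runbook["steps"], start=1)
    )
```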

AlertIssue Dataclass

| Field | Type | Description |
| --- | --- | --- |
| issue_id | str | Unique issue identifier |
| issue_number | int | Sequential number |
| issue_name | str | Derived from correlation groups |
| cluster_id | str | Affected cluster |
| issue_status | str | Ongoing / Resolved |
| severity | str | Critical / Warning / Info |
| triage_status | str | Lifecycle stage |
| disposition_status | str | Action taken |
| alert_count | int | Number of grouped alerts |
| alert_names | list | Distinct alert types |
| assignee | str | Assigned operator |
| suggested_action | str | From runbook |
| jira_key | str | Linked JIRA ticket |
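Written out as code, the fields listed above might look like the following. Field names and types follow the table. Every default value is an assumption chosen for illustration, since this page does not document the real defaults or any additional fields.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertIssue:
    issue_id: str
    issue_number: int
    issue_name: str
    cluster_id: str
    issue_status: str = "Ongoing"      # Ongoing / Resolved
    severity: str = "Warning"          # Critical / Warning / Info
    triage_status: str = "New"         # assumed initial triage stage
    disposition_status: str = "New"    # assumed initial disposition
    alert_count: int = 0
    alert_names: List[str] = field(default_factory=list)
    assignee: str = ""
    suggested_action: str = ""
    jira_key: str = ""
```

For example, `AlertIssue("iss-1", 1, "Operational Anomaly", "cluster-a")` creates an Ongoing issue with an empty alert list (the identifier values here are made up).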

Configuration (issue_rules.yaml)

| Setting | Value | Description |
| --- | --- | --- |
| time_window_minutes | 30 | Max gap for failure alert grouping |
| correlation_window_minutes | 240 | Wider window for correlated alert types |
| reopen_window_minutes | 15 | If resolved then fires again within this window, reopen |
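The reopen rule can be sketched as a simple time comparison. The flat RULES dict stands in for the parsed contents of issue_rules.yaml, whose actual key layout is not shown here, and should_reopen is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import Dict

# Assumed flat view of the settings; the real YAML may nest them differently.
RULES: Dict[str, int] = {
    "time_window_minutes": 30,
    "correlation_window_minutes": 240,
    "reopen_window_minutes": 15,
}


def should_reopen(resolved_at: datetime, fired_at: datetime,
                  rules: Dict[str, int] = RULES) -> bool:
    """True when an alert fires again within the reopen window after resolving."""
    window = timedelta(minutes=rules["reopen_window_minutes"])
    elapsed = fired_at - resolved_at
    # Firings that predate the resolution never trigger a reopen.
    return timedelta(0) <= elapsed <= window
```

Under these values, an alert that resolves at 12:00 and fires again at 12:10 reopens the existing issue, while a firing at 12:16 falls outside the window.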