IssueTracker -- Issue Lifecycle Management
Source: byoc_agent/issue_tracker.py
Rules: byoc_agent/issue_rules.yaml (v2)
The IssueTracker groups alerts into issues, manages the issue lifecycle, and provides triage workflow capabilities. Issues are persisted in alerts.issues, a PRIMARY KEY table that supports UPDATE.
Alert Grouping Strategy (v2)
Alerts fall into two categories requiring fundamentally different grouping:
Anomaly Alerts (ongoing conditions)
Examples: OperationDurationGT10m, HeapUsageTooHigh, CompactionScore
These represent a condition that persists. The alert fires repeatedly (roughly every 15 minutes) for as long as the condition exists.
Grouping: All firings on the same cluster for the same alert type belong to the SAME issue until a Resolved event arrives. No time window needed.
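A minimal sketch of this rule, assuming events arrive in time order as (cluster_id, alert_name, state) tuples. The AnomalyIssue class, the group_anomalies function, and the "firing"/"resolved" state strings are hypothetical names used for illustration, not the actual code in issue_tracker.py. The sketch ignores the reopen window and correlation groups.

```python
from dataclasses import dataclass


@dataclass
class AnomalyIssue:
    """Hypothetical stand-in for an issue built from anomaly alerts."""
    cluster_id: str
    alert_name: str
    firings: int = 0
    resolved: bool = False


def group_anomalies(events):
    """Group time-ordered (cluster_id, alert_name, state) events into issues.

    Every firing for the same cluster and alert type joins the open issue
    until a "resolved" event closes it. No time window is involved.
    """
    open_issues = {}  # (cluster_id, alert_name) -> currently open issue
    all_issues = []
    for cluster_id, alert_name, state in events:
        key = (cluster_id, alert_name)
        issue = open_issues.get(key)
        if state == "firing":
            if issue is None:
                issue = AnomalyIssue(cluster_id, alert_name)
                open_issues[key] = issue
                all_issues.append(issue)
            issue.firings += 1
        elif state == "resolved" and issue is not None:
            issue.resolved = True
            del open_issues[key]
    return all_issues
```

A firing that arrives after the Resolved event opens a new issue in this sketch.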
Failure Alerts (discrete events)
Examples: OperationAbnormal, BeAliveAbnormal, ProcNotRunning
These represent discrete events: something broke.
Grouping: Time-window based (30 min default). Multiple failures on the same cluster within the window are the same incident. Correlated failures use a wider window (4 hours).
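The time-window rule can be sketched as follows. This assumes the window is measured as the gap since the previous firing on the same cluster, and that a gap exactly equal to the window still counts as the same incident. The group_failures function and the (cluster_id, fired_at) tuple shape are hypothetical, and the wider correlated window is left out.

```python
from datetime import datetime, timedelta

TIME_WINDOW = timedelta(minutes=30)  # default window from the text above


def group_failures(firings, window=TIME_WINDOW):
    """Group (cluster_id, fired_at) failure firings into incidents.

    A firing joins the cluster's current incident when it arrives within
    `window` of that incident's most recent firing; otherwise it starts a
    new incident.
    """
    incidents = []
    current_by_cluster = {}  # cluster_id -> that cluster's latest incident
    for cluster_id, fired_at in sorted(firings, key=lambda f: f[1]):
        current = current_by_cluster.get(cluster_id)
        if current is not None and fired_at - current[-1][1] <= window:
            current.append((cluster_id, fired_at))
        else:
            current = [(cluster_id, fired_at)]
            incidents.append(current)
            current_by_cluster[cluster_id] = current
    return incidents
```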
Escalation
When a failure fires on a cluster that already has an open anomaly issue in the same correlation group, the failure merges into the anomaly issue and escalates it. For example, "Operational Anomaly" becomes "Operational Failure" and its severity rises from Warning to Critical.
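One way to picture the merge-and-escalate step, as a sketch only. The Issue class, the merge_failure function, and the renaming by string replacement are assumptions made for illustration. The real tracker may derive the new name differently.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """Hypothetical, trimmed-down issue record."""
    cluster_id: str
    group: str
    name: str
    severity: str
    alert_names: list = field(default_factory=list)


def merge_failure(open_anomalies, cluster_id, group, alert_name):
    """Merge a failure alert into a matching open anomaly issue.

    `open_anomalies` maps (cluster_id, correlation_group) to an open issue.
    Returns the escalated issue, or None when no open anomaly issue matches
    (the caller would then fall back to normal failure grouping).
    """
    issue = open_anomalies.get((cluster_id, group))
    if issue is None:
        return None
    issue.alert_names.append(alert_name)
    # Assumed renaming rule: "Operational Anomaly" -> "Operational Failure".
    issue.name = issue.name.replace("Anomaly", "Failure")
    issue.severity = "Critical"
    return issue
```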
Issue Lifecycle (v3)
Issue Status
| Status | Meaning |
|---|---|
| Ongoing | Alert is currently firing |
| Resolved | Alert has been resolved |
Triage Status
| Status | Description |
|---|---|
| New | Just created, not yet reviewed |
| Acknowledged | Operator has seen it |
| Monitoring/Investigating | Under active investigation |
| Mitigating | Fix in progress |
| Closed - Fixed Issue | Root cause fixed |
| Closed - Auto Resolved | Resolved without intervention |
| Closed - False Positive | Not a real issue |
Disposition Status
| Status | Description |
|---|---|
| New | Not yet dispositioned |
| No Action Needed | Benign or self-resolved |
| JIRA Created | Tracking ticket created |
Severity Classification
Alert names are mapped to severity based on pattern matching (more specific patterns take precedence):
| Pattern | Severity | Description |
|---|---|---|
| BeAliveAbnormal, BeNodeAbnormal, ClusterStateAbnormal, FeNodeAbnormal | Critical | Node/cluster failures |
| Failed, NotRunning | Critical | Process failures |
| Abnormal (catch-all) | Critical | Other abnormal states |
| OperationAbnormal | Warning | Cluster create/resume failures (not critical) |
| HeapUsageTooHigh, FEHeapUsageTooHigh | Warning | Memory pressure |
| FEMaxTabletCompaction, CompactionScore | Warning | Compaction backlog |
| DurationGT*, GT200, GT0, GT10 | Warning | Latency/threshold breaches |
| Silence | Info | Operational events |
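The precedence rule can be sketched by treating each pattern as a substring and letting the longest match win, so OperationAbnormal (Warning) beats the Abnormal catch-all (Critical). This is one possible reading of "more specific", not necessarily how issue_tracker.py implements it. SEVERITY_PATTERNS and classify_severity are hypothetical names, the DurationGT* wildcard is approximated by the substring DurationGT, and returning None for unmatched names is an assumption.

```python
# (pattern, severity) pairs taken from the table above.
SEVERITY_PATTERNS = [
    ("BeAliveAbnormal", "Critical"),
    ("BeNodeAbnormal", "Critical"),
    ("ClusterStateAbnormal", "Critical"),
    ("FeNodeAbnormal", "Critical"),
    ("Failed", "Critical"),
    ("NotRunning", "Critical"),
    ("Abnormal", "Critical"),  # catch-all
    ("OperationAbnormal", "Warning"),
    ("HeapUsageTooHigh", "Warning"),
    ("FEHeapUsageTooHigh", "Warning"),
    ("FEMaxTabletCompaction", "Warning"),
    ("CompactionScore", "Warning"),
    ("DurationGT", "Warning"),  # stands in for the DurationGT* wildcard
    ("GT200", "Warning"),
    ("GT0", "Warning"),
    ("GT10", "Warning"),
    ("Silence", "Info"),
]


def classify_severity(alert_name, default=None):
    """Return the severity of the longest pattern contained in alert_name."""
    matches = [(p, s) for p, s in SEVERITY_PATTERNS if p in alert_name]
    if not matches:
        return default  # assumption: unmatched alerts have no severity
    _pattern, severity = max(matches, key=lambda m: len(m[0]))
    return severity
```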
Correlation Groups
Alerts in the same group merge into one issue when they fire on the same cluster:
be_node_failure (Critical)
BeAliveAbnormal, ProcNotRunning, BeNodeAbnormal, ClusterStateAbnormal
fe_memory_pressure (Warning, escalates to Critical)
FEHeapUsageTooHigh, HeapUsageTooHigh, FEGCCount, JVMOldGC, ProcNotRunning (OOM crash = escalation trigger)
compaction_backlog (Warning)
FEMaxTabletCompaction, CompactionScore
disk_pressure (Warning)
FreeDiskLessThan, RootFreeDiskLessThan
operational_anomaly (Warning)
OperationDurationGT (anomaly), OperationAbnormal (failure)
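As a sketch of how group membership could drive merging, the snippet below builds a reverse index from alert name to group and keys issues on (cluster_id, group). It covers only three of the groups above and uses exact name matching. The names CORRELATION_GROUPS, ALERT_TO_GROUP, and issue_key are hypothetical, and the fallback for ungrouped alerts is an assumption.

```python
# Subset of the groups listed above, for illustration only.
CORRELATION_GROUPS = {
    "be_node_failure": {
        "BeAliveAbnormal", "ProcNotRunning", "BeNodeAbnormal", "ClusterStateAbnormal",
    },
    "compaction_backlog": {"FEMaxTabletCompaction", "CompactionScore"},
    "disk_pressure": {"FreeDiskLessThan", "RootFreeDiskLessThan"},
}

# Reverse index: alert name -> correlation group.
ALERT_TO_GROUP = {
    alert: group
    for group, alerts in CORRELATION_GROUPS.items()
    for alert in alerts
}


def issue_key(cluster_id, alert_name):
    """Key under which an alert is merged into an issue.

    Alerts in the same group on the same cluster share a key. An alert
    outside every group falls back to its own name (an assumption).
    """
    group = ALERT_TO_GROUP.get(alert_name, alert_name)
    return (cluster_id, group)
```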
Runbooks
The IssueTracker includes built-in runbooks for common alert types (ALERT_RUNBOOKS dict). Each runbook provides:
- A description of the problem
- Step-by-step investigation and remediation instructions
- A runbook URL reference
Covered alert types: ProcNotRunning, ClusterStateAbnormal, BeAliveAbnormal, BeNodeAbnormal, FEHeapUsageTooHigh, HeapUsageTooHigh, FEMaxTabletCompaction, CompactionScore, OperationDurationGT, OperationAbnormal, FEQueryErrRate, FreeDiskLessThan.
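The shape of a runbook entry might look like the sketch below. The key names (description, steps, url), the placeholder text, the example.com URL, and the suggested_action helper are all hypothetical. The actual contents of ALERT_RUNBOOKS are not shown in this document.

```python
# Hypothetical entry shape; values are placeholders, not real runbook text.
ALERT_RUNBOOKS = {
    "ProcNotRunning": {
        "description": "<problem description>",
        "steps": ["<investigation step>", "<remediation step>"],
        "url": "https://example.com/runbooks/proc-not-running",  # placeholder
    },
}


def suggested_action(alert_names):
    """Return numbered steps from the first alert that has a runbook.

    Returns an empty string when none of the alerts is covered.
    """
    for name in alert_names:
        runbook = ALERT_RUNBOOKS.get(name)
        if runbook:
            return "\n".join(
                f"{i}. {step}" for i, step in enumerate(runbook["steps"], start=1)
            )
    return ""
```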
AlertIssue Dataclass
| Field | Type | Description |
|---|---|---|
| issue_id | str | Unique issue identifier |
| issue_number | int | Sequential number |
| issue_name | str | Derived from correlation groups |
| cluster_id | str | Affected cluster |
| issue_status | str | Ongoing / Resolved |
| severity | str | Critical / Warning / Info |
| triage_status | str | Lifecycle stage |
| disposition_status | str | Action taken |
| alert_count | int | Number of grouped alerts |
| alert_names | list | Distinct alert types |
| assignee | str | Assigned operator |
| suggested_action | str | From runbook |
| jira_key | str | Linked JIRA ticket |
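The fields above could be declared roughly as follows. The field names and types come from the table. The default values and the choice of which fields are required are assumptions, and the real dataclass may carry more fields.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertIssue:
    # Identity fields are required in this sketch.
    issue_id: str
    issue_number: int
    issue_name: str
    cluster_id: str
    # Defaults below are assumed, matching the "New"/"Ongoing" starting states.
    issue_status: str = "Ongoing"
    severity: str = "Warning"
    triage_status: str = "New"
    disposition_status: str = "New"
    alert_count: int = 0
    alert_names: List[str] = field(default_factory=list)
    assignee: str = ""
    suggested_action: str = ""
    jira_key: str = ""
```

Using field(default_factory=list) gives each issue its own alert_names list instead of one shared mutable default.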
Configuration (issue_rules.yaml)
| Setting | Value | Description |
|---|---|---|
| time_window_minutes | 30 | Max gap for failure alert grouping |
| correlation_window_minutes | 240 | Wider window for correlated alert types |
| reopen_window_minutes | 15 | If a resolved issue fires again within this window, reopen it |
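A sketch of how these settings might be applied, assuming they have already been loaded from issue_rules.yaml into a plain dict. The RULES dict and the grouping_window and should_reopen helpers are hypothetical, and counting a gap exactly equal to the window as inside it is an assumption.

```python
from datetime import datetime, timedelta

# Values copied from the table above.
RULES = {
    "time_window_minutes": 30,
    "correlation_window_minutes": 240,
    "reopen_window_minutes": 15,
}


def grouping_window(correlated, rules=RULES):
    """Pick the failure-grouping window: the wider one for correlated alerts."""
    key = "correlation_window_minutes" if correlated else "time_window_minutes"
    return timedelta(minutes=rules[key])


def should_reopen(resolved_at, fired_at, rules=RULES):
    """True when an alert fires again soon enough after resolution to reopen."""
    gap = fired_at - resolved_at
    return timedelta(0) <= gap <= timedelta(minutes=rules["reopen_window_minutes"])
```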