IssueTracker -- Issue Lifecycle Management

Source: byoc_agent/issue_tracker.py
Rules: byoc_agent/issue_rules.yaml (v2)

The IssueTracker groups alerts into issues, manages the issue lifecycle, and provides triage workflow capabilities. Issues are persisted in the alerts.issues table, a PRIMARY KEY table that supports UPDATE.

Alert Grouping Strategy (v2)

Alerts fall into two categories requiring fundamentally different grouping:

Anomaly Alerts (ongoing conditions)

Examples: OperationDurationGT10m, HeapUsageTooHigh, CompactionScore

These represent a condition that persists. The alert fires repeatedly (~every 15 min) while the condition exists.

Grouping: All firings on the same cluster for the same alert type belong to the SAME issue until a Resolved event arrives. No time window needed.

Failure Alerts (discrete events)

Examples: OperationAbnormal, BeAliveAbnormal, ProcNotRunning

These represent something that broke at a point in time: a discrete event rather than an ongoing condition.

Grouping: Time-window based (30 minutes by default). Multiple failures on the same cluster within the window belong to the same incident. Failures from correlated alert types use a wider window (4 hours).
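
The two grouping rules above can be sketched as a single membership check. This is a minimal illustration, not the code in issue_tracker.py: the OpenIssue class, the belongs_to_issue function, and the "anomaly"/"failure" kind strings are hypothetical names, and the window is assumed to be measured as the gap since the last firing on the issue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Values taken from the configuration table; the constant names are hypothetical.
TIME_WINDOW = timedelta(minutes=30)
CORRELATION_WINDOW = timedelta(minutes=240)


@dataclass
class OpenIssue:
    cluster_id: str
    alert_name: str
    kind: str            # "anomaly" or "failure" (assumed labels)
    last_seen: datetime
    resolved: bool = False


def belongs_to_issue(issue: OpenIssue, cluster_id: str, alert_name: str,
                     kind: str, fired_at: datetime,
                     correlated: bool = False) -> bool:
    """Return True if a new firing should join the given issue."""
    if issue.resolved or issue.cluster_id != cluster_id:
        return False
    if kind == "anomaly":
        # Same cluster and same alert type: one issue until Resolved, no window.
        return issue.kind == "anomaly" and issue.alert_name == alert_name
    # Failures group by the time gap since the last firing on the issue.
    window = CORRELATION_WINDOW if correlated else TIME_WINDOW
    return issue.kind == "failure" and fired_at - issue.last_seen <= window
```

Under these assumptions, an anomaly firing six hours after the first one still joins the same issue, while a failure 45 minutes after the previous one starts a new incident unless the alert types are correlated.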

Escalation

When a failure fires on a cluster that already has an open anomaly issue in the same correlation group, the failure merges into the anomaly issue and escalates it. For example, "Operational Anomaly" becomes "Operational Failure", and its severity rises from Warning to Critical.
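
The escalation step can be sketched as follows. The Issue class and the escalate function are hypothetical, and renaming by replacing "Anomaly" with "Failure" is an assumption based only on the example above; the real tracker may derive the name differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Issue:
    issue_name: str
    severity: str
    correlation_group: str
    alert_names: List[str] = field(default_factory=list)


def escalate(issue: Issue, failure_alert: str, failure_group: str) -> bool:
    """Merge a failure into an open anomaly issue of the same group and escalate it."""
    if failure_group != issue.correlation_group:
        return False  # different group: the failure does not merge here
    if failure_alert not in issue.alert_names:
        issue.alert_names.append(failure_alert)
    # Assumed naming rule, taken from the "Operational Anomaly" example.
    issue.issue_name = issue.issue_name.replace("Anomaly", "Failure")
    issue.severity = "Critical"
    return True
```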

Issue Lifecycle (v3)

Issue Status

Status	Meaning
`Ongoing`	Alert is currently firing
`Resolved`	Alert has been resolved

Triage Status

Status	Description
`New`	Just created, not yet reviewed
`Acknowledged`	Operator has seen it
`Monitoring/Investigating`	Under active investigation
`Mitigating`	Fix in progress
`Closed - Fixed Issue`	Root cause fixed
`Closed - Auto Resolved`	Resolved without intervention
`Closed - False Positive`	Not a real issue

Disposition Status

Status	Description
`New`	Not yet dispositioned
`No Action Needed`	Benign or self-resolved
`JIRA Created`	Tracking ticket created
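
The three status vocabularies above map naturally onto enumerations. The sketch below is illustrative only: the source stores these values as plain strings, and the class names, member names, and the is_closed helper are hypothetical.

```python
from enum import Enum


class IssueStatus(str, Enum):
    ONGOING = "Ongoing"
    RESOLVED = "Resolved"


class TriageStatus(str, Enum):
    NEW = "New"
    ACKNOWLEDGED = "Acknowledged"
    INVESTIGATING = "Monitoring/Investigating"
    MITIGATING = "Mitigating"
    CLOSED_FIXED = "Closed - Fixed Issue"
    CLOSED_AUTO_RESOLVED = "Closed - Auto Resolved"
    CLOSED_FALSE_POSITIVE = "Closed - False Positive"


class DispositionStatus(str, Enum):
    NEW = "New"
    NO_ACTION_NEEDED = "No Action Needed"
    JIRA_CREATED = "JIRA Created"


def is_closed(status: TriageStatus) -> bool:
    """All terminal triage states share the 'Closed' prefix."""
    return status.value.startswith("Closed")
```

Because each member's value is the exact string from the tables, a value read from storage can be converted with, for example, TriageStatus("Closed - Fixed Issue").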

Severity Classification

Alert names are mapped to severity based on pattern matching (more specific patterns take precedence):

Pattern	Severity	Examples
`BeAliveAbnormal`, `BeNodeAbnormal`, `ClusterStateAbnormal`, `FeNodeAbnormal`	Critical	Node/cluster failures
`Failed`, `NotRunning`	Critical	Process failures
`Abnormal` (catch-all)	Critical	Other abnormal states
`OperationAbnormal`	Warning	Cluster create/resume failures (not critical)
`HeapUsageTooHigh`, `FEHeapUsageTooHigh`	Warning	Memory pressure
`FEMaxTabletCompaction`, `CompactionScore`	Warning	Compaction backlog
`DurationGT*`, `GT200`, `GT0`, `GT10`	Warning	Latency/threshold breaches
`Silence`	Info	Operational events
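
One way to realize "more specific patterns take precedence" is an ordered rule list in which the first match wins. The sketch below assumes plain substring matching and a configurable default for unmatched names; SEVERITY_RULES and classify_severity are hypothetical names, and the actual rules in issue_rules.yaml may be expressed differently.

```python
# Ordered from most to least specific; the first matching rule wins.
# Substring matching is an assumption, not confirmed by the source.
SEVERITY_RULES = [
    (("OperationAbnormal",), "Warning"),  # must precede the Abnormal catch-all
    (("BeAliveAbnormal", "BeNodeAbnormal", "ClusterStateAbnormal", "FeNodeAbnormal"), "Critical"),
    (("HeapUsageTooHigh",), "Warning"),   # also matches FEHeapUsageTooHigh
    (("FEMaxTabletCompaction", "CompactionScore"), "Warning"),
    (("Failed", "NotRunning"), "Critical"),
    (("Abnormal",), "Critical"),          # catch-all for other abnormal states
    (("DurationGT", "GT200", "GT10", "GT0"), "Warning"),
    (("Silence",), "Info"),
]


def classify_severity(alert_name: str, default: str = "Info") -> str:
    """Return the severity of the first rule whose pattern occurs in the name."""
    for patterns, severity in SEVERITY_RULES:
        if any(pattern in alert_name for pattern in patterns):
            return severity
    return default  # the fallback for unmatched names is an assumption
```

Placing OperationAbnormal ahead of the generic Abnormal rule is what keeps cluster create/resume failures at Warning instead of Critical.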

Correlation Groups

Alerts in the same group merge into one issue when they fire on the same cluster:

be_node_failure (Critical)

BeAliveAbnormal, ProcNotRunning, BeNodeAbnormal, ClusterStateAbnormal

fe_memory_pressure (Warning, escalates to Critical)

FEHeapUsageTooHigh, HeapUsageTooHigh, FEGCCount, JVMOldGC
ProcNotRunning (an OOM crash; acts as the escalation trigger)

compaction_backlog (Warning)

FEMaxTabletCompaction, CompactionScore

disk_pressure (Warning)

FreeDiskLessThan, RootFreeDiskLessThan

operational_anomaly (Warning)

OperationDurationGT (anomaly), OperationAbnormal (failure)
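
The groups listed above can be represented as a mapping from group name to member alert names. In this sketch, CORRELATION_GROUPS and groups_for are hypothetical names, and prefix matching (so that OperationDurationGT10m matches OperationDurationGT) is an assumption.

```python
CORRELATION_GROUPS = {
    "be_node_failure": ["BeAliveAbnormal", "ProcNotRunning", "BeNodeAbnormal",
                        "ClusterStateAbnormal"],
    "fe_memory_pressure": ["FEHeapUsageTooHigh", "HeapUsageTooHigh", "FEGCCount",
                           "JVMOldGC", "ProcNotRunning"],
    "compaction_backlog": ["FEMaxTabletCompaction", "CompactionScore"],
    "disk_pressure": ["FreeDiskLessThan", "RootFreeDiskLessThan"],
    "operational_anomaly": ["OperationDurationGT", "OperationAbnormal"],
}


def groups_for(alert_name: str) -> list:
    """Return every correlation group whose member list matches the alert name."""
    return [
        group
        for group, members in CORRELATION_GROUPS.items()
        if any(alert_name.startswith(member) for member in members)
    ]
```

Note that ProcNotRunning belongs to two groups, so a lookup can return more than one candidate; how the tracker chooses between them is not specified here.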

Runbooks

The IssueTracker includes built-in runbooks for common alert types (ALERT_RUNBOOKS dict). Each runbook provides:

A description of the problem
Step-by-step investigation and remediation instructions
A runbook URL reference

Covered alert types: ProcNotRunning, ClusterStateAbnormal, BeAliveAbnormal, BeNodeAbnormal, FEHeapUsageTooHigh, HeapUsageTooHigh, FEMaxTabletCompaction, CompactionScore, OperationDurationGT, OperationAbnormal, FEQueryErrRate, FreeDiskLessThan.
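
A runbook entry with the three parts listed above might be shaped as follows. Only the existence of an ALERT_RUNBOOKS dict comes from the source; the key names ("description", "steps", "url"), the placeholder content, the example.invalid URL, and the format_runbook helper are all hypothetical.

```python
# Hypothetical shape for one entry; the text and URL are placeholders, not real runbook content.
EXAMPLE_RUNBOOKS = {
    "HeapUsageTooHigh": {
        "description": "Placeholder description of the problem.",
        "steps": ["Placeholder investigation step", "Placeholder remediation step"],
        "url": "https://example.invalid/runbooks/heap-usage",
    },
}


def format_runbook(alert_name: str, runbooks: dict = EXAMPLE_RUNBOOKS) -> str:
    """Render a runbook as text, or return an empty string if none exists."""
    runbook = runbooks.get(alert_name)
    if runbook is None:
        return ""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(runbook["steps"], 1))
    return f"{runbook['description']}\n{steps}\nRunbook: {runbook['url']}"
```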

AlertIssue Dataclass

Field	Type	Description
`issue_id`	str	Unique issue identifier
`issue_number`	int	Sequential number
`issue_name`	str	Derived from correlation groups
`cluster_id`	str	Affected cluster
`issue_status`	str	Ongoing / Resolved
`severity`	str	Critical / Warning / Info
`triage_status`	str	Lifecycle stage
`disposition_status`	str	Action taken
`alert_count`	int	Number of grouped alerts
`alert_names`	list	Distinct alert types
`assignee`	str	Assigned operator
`suggested_action`	str	From runbook
`jira_key`	str	Linked JIRA ticket
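
For orientation, the field table above corresponds to a dataclass along these lines. The field names and types follow the table; the default values are assumptions, and the real class in issue_tracker.py may define additional fields or methods.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertIssue:
    issue_id: str
    issue_number: int
    issue_name: str
    cluster_id: str
    issue_status: str = "Ongoing"        # Ongoing / Resolved
    severity: str = "Warning"            # Critical / Warning / Info
    triage_status: str = "New"
    disposition_status: str = "New"
    alert_count: int = 0
    alert_names: List[str] = field(default_factory=list)
    assignee: str = ""
    suggested_action: str = ""
    jira_key: str = ""
```

A new issue would then only need its identifying fields, for example AlertIssue("iss-1", 1, "Operational Anomaly", "cluster-a"), with the identifiers here being made-up sample values.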

Configuration (issue_rules.yaml)

Setting	Value	Description
`time_window_minutes`	30	Max gap for failure alert grouping
`correlation_window_minutes`	240	Wider window for correlated alert types
`reopen_window_minutes`	15	If an alert fires again within this window after its issue was resolved, reopen the issue
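
The reopen setting can be read as a simple time comparison. This sketch holds the three settings in a plain dict instead of parsing issue_rules.yaml; RULES and should_reopen are hypothetical names, and treating the window as inclusive is an assumption.

```python
from datetime import datetime, timedelta

# Values from the table above, held in a dict for illustration.
RULES = {
    "time_window_minutes": 30,
    "correlation_window_minutes": 240,
    "reopen_window_minutes": 15,
}


def should_reopen(resolved_at: datetime, fired_at: datetime, rules: dict = RULES) -> bool:
    """True if the alert fired again soon enough after resolution to reopen the issue."""
    gap = fired_at - resolved_at
    window = timedelta(minutes=rules["reopen_window_minutes"])
    return timedelta(0) <= gap <= window
```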
