InvestigatorAgent -- Autonomous Cluster Investigation
Source: byoc_agent/investigator.py
The InvestigatorAgent picks up pending tasks from the Sentinel's agent_tasks queue, performs a thorough autonomous investigation using SQL tools and the Knowledge Lake, and persists structured findings to agent_investigations.
Pipeline
Sentinel creates agent_task (status=pending)
↓
run_pending_investigations() fetches up to N pending tasks
↓
For each task:
1. Mark task status → "running"
2. Build investigation prompt from task context
3. Run InvestigatorAgent (autonomous LLM loop)
4. Parse structured output (severity, root cause, evidence, actions)
5. Persist to agent_investigations
6. Mark task status → "completed" (or "failed")
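The status lifecycle above can be sketched as a small loop. This is a minimal illustration, not the code in byoc_agent/investigator.py: the Task dataclass, the process_tasks name, and the investigate callable (a stand-in for the autonomous LLM loop) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical in-memory stand-in for a row in the agent_tasks queue."""
    task_id: int
    cluster_id: str
    status: str = "pending"


def process_tasks(tasks, investigate, max_tasks=5):
    """Run up to max_tasks pending tasks through the status lifecycle.

    `investigate` is any callable that takes a Task and returns a dict of
    findings; here it stands in for the autonomous LLM loop.
    """
    findings = []
    pending = [t for t in tasks if t.status == "pending"][:max_tasks]
    for task in pending:
        task.status = "running"
        try:
            result = investigate(task)
        except Exception as exc:
            # A failed investigation marks the task and moves on to the next.
            task.status = "failed"
            findings.append({"task_id": task.task_id, "error": str(exc)})
            continue
        task.status = "completed"
        findings.append({"task_id": task.task_id, **result})
    return findings
```

Catching the exception per task keeps one failed investigation from blocking the rest of the batch, which matches the "completed" (or "failed") outcome in step 6.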
Investigation Protocol (8 steps)
The system prompt instructs the agent to follow this protocol:
- Gather context -- query_cluster_info for customer tier, region, cluster config.
- Check recent alerts -- query_cluster_alerts for firing/resolved patterns.
- Analyze metrics -- query_cluster_metrics with metric names specific to the alert type:
  - CPU: starrocks_be_cpu_util_percent
  - Memory: starrocks_be_jvm_heap_used_percent, starrocks_be_process_mem_bytes
  - Disk: starrocks_be_disks_data_used_pct
  - Compaction: starrocks_be_compaction_score_average
  - Query performance: starrocks_fe_query_latency_ms_p99, starrocks_fe_query_total
  - Errors: starrocks_fe_query_err
- Check operations -- query_cluster_operations for recent START/STOP/UPDATE/NODE_UPDATE changes.
- Search Knowledge Lake -- CRITICAL: always search for similar past incidents, known bugs by SR version, monitoring guidelines, and proven resolutions.
- Check past investigations -- query_agent_memory for prior investigations of this cluster or issue type.
- Check sibling clusters -- query_similar_clusters to determine if the issue is account-wide or cluster-specific.
- Custom SQL -- run_sql for specific data not covered by pre-built tools.
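The alert-type-to-metric mapping in the "Analyze metrics" step could be held in a lookup table like the one below. The metric names come from the protocol; the dictionary keys, the METRICS_BY_ALERT_TYPE name, and the metrics_for helper are hypothetical and only illustrate the idea.

```python
# Metric names are copied from the protocol; the keys and helper are
# illustrative assumptions, not part of the documented agent.
METRICS_BY_ALERT_TYPE = {
    "cpu": ["starrocks_be_cpu_util_percent"],
    "memory": [
        "starrocks_be_jvm_heap_used_percent",
        "starrocks_be_process_mem_bytes",
    ],
    "disk": ["starrocks_be_disks_data_used_pct"],
    "compaction": ["starrocks_be_compaction_score_average"],
    "query_performance": [
        "starrocks_fe_query_latency_ms_p99",
        "starrocks_fe_query_total",
    ],
    "errors": ["starrocks_fe_query_err"],
}


def metrics_for(alert_type):
    """Return the metric names to query for an alert type, or [] if unknown."""
    key = alert_type.strip().lower().replace(" ", "_")
    return METRICS_BY_ALERT_TYPE.get(key, [])
```

Returning an empty list for an unknown alert type leaves room for the agent to fall back to run_sql.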
Output Format
The agent produces a structured report:
| Field | Description |
|---|---|
| SEVERITY | Critical, Warning, or Info |
| CONFIDENCE | Float 0.0-1.0 |
| ROOT_CAUSE | 1-2 sentence hypothesis |
| EVIDENCE | Bullet points of supporting data |
| KNOWLEDGE_REFS | Relevant Knowledge Lake documents, past incidents, known bugs |
| RECOMMENDED_ACTIONS | Prioritized actionable steps |
| SUMMARY | 2-3 sentence executive summary suitable for Tier A customer updates |
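Parsing such a report into structured fields might look like the sketch below. The exact text layout the agent emits is not documented here, so the sketch assumes each field starts a line as "FIELD: value" and that list fields use dash or asterisk bullets; parse_report and its lower-cased output keys are hypothetical.

```python
import re

FIELDS = ("SEVERITY", "CONFIDENCE", "ROOT_CAUSE", "EVIDENCE",
          "KNOWLEDGE_REFS", "RECOMMENDED_ACTIONS", "SUMMARY")
LIST_FIELDS = {"EVIDENCE", "KNOWLEDGE_REFS", "RECOMMENDED_ACTIONS"}
_HEADER = re.compile(r"^(%s):\s*(.*)$" % "|".join(FIELDS))


def parse_report(text):
    """Split an assumed 'FIELD: value' report into a dict of parsed fields."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = _HEADER.match(line)
        if match:
            # A new field header starts a new section.
            current = match.group(1)
            sections[current] = [match.group(2)] if match.group(2) else []
        elif current and line:
            # Continuation lines belong to the most recent field.
            sections[current].append(line)

    parsed = {}
    for name, lines in sections.items():
        if name in LIST_FIELDS:
            parsed[name.lower()] = [item.lstrip("-* ").strip() for item in lines]
        else:
            parsed[name.lower()] = " ".join(lines)

    if "confidence" in parsed:
        try:
            # Clamp to the documented 0.0-1.0 range.
            parsed["confidence"] = min(1.0, max(0.0, float(parsed["confidence"])))
        except ValueError:
            parsed["confidence"] = None
    return parsed
```

Fields missing from the report are simply absent from the result, so the caller can decide how to handle an incomplete report.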
Configuration
| Parameter | Value | Description |
|---|---|---|
| agent_type | "investigator" | Agent identifier |
| max_rounds | 12 | Maximum tool-use rounds (aims for 5-8 tool calls) |
| max_tasks | 5 | Default number of pending tasks to process per run |
Task Prioritization
Tasks are fetched from the queue ordered by:
- priority ASC (1 = highest priority)
- created_at ASC (oldest first within same priority)
Priority is set by the Sentinel based on trigger type:
- Priority 1: New Critical cluster, alert storm, Tier A/S Critical
- Priority 2: Score cliff, Tier A/S Warning
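The fetch order can be illustrated with a small query. The sketch below uses an in-memory SQLite database purely as a stand-in; the real storage backend, the table schema, and the fetch_pending and make_demo_db names are assumptions, and only the priority, created_at, and status semantics come from the description above.

```python
import sqlite3


def make_demo_db():
    """Create a throwaway agent_tasks table with an assumed minimal schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE agent_tasks ("
        "task_id INTEGER, priority INTEGER, created_at TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO agent_tasks VALUES (?, ?, ?, ?)",
        [
            (1, 2, "2024-01-01T10:00:00", "pending"),
            (2, 1, "2024-01-01T12:00:00", "pending"),
            (3, 1, "2024-01-01T11:00:00", "pending"),
            (4, 1, "2024-01-01T09:00:00", "completed"),
        ],
    )
    return conn


def fetch_pending(conn, max_tasks=5):
    """Return pending task ids, highest priority first, oldest first on ties."""
    rows = conn.execute(
        "SELECT task_id FROM agent_tasks "
        "WHERE status = 'pending' "
        "ORDER BY priority ASC, created_at ASC "
        "LIMIT ?",
        (max_tasks,),
    ).fetchall()
    return [row[0] for row in rows]
```

With the demo rows, task 3 comes before task 2 because both are Priority 1 and task 3 is older; task 4 is skipped because it is no longer pending.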
Task Prompt Construction
Each investigation prompt includes:
- Cluster ID and name
- Customer name and tier
- Trigger reason (from Sentinel)
- Priority level
- Context JSON (risk level, risk reasons, score details, alert counts)
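A prompt builder covering these fields might look like the following sketch. The task dictionary keys, the wording of the prompt, and the build_investigation_prompt name are illustrative assumptions; the actual prompt text in investigator.py may differ.

```python
import json


def build_investigation_prompt(task):
    """Render a task dict (assumed keys) into an investigation prompt string."""
    context = json.dumps(task.get("context", {}), indent=2, sort_keys=True)
    return "\n".join([
        f"Investigate cluster {task['cluster_id']} ({task['cluster_name']}).",
        f"Customer: {task['customer_name']} (Tier {task['customer_tier']})",
        f"Trigger: {task['trigger_reason']}",
        f"Priority: {task['priority']}",
        "Context:",
        context,
    ])
```

Serializing the context with sort_keys=True keeps the prompt stable across runs for the same task, which makes investigations easier to compare.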
Persistence
Findings are saved to alerts.agent_investigations with:
- Parsed structured fields (severity, root_cause, evidence, actions, knowledge_refs, confidence)
- Full report text
- LLM usage metadata (provider, model, tokens, duration)
- Task linkage (task_id, cluster_id, customer info)
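The shape of a persisted record can be sketched as a flat dictionary assembled from the four groups above. The column names, the choice to JSON-encode list fields, and the build_investigation_row helper are assumptions for illustration; the real alerts.agent_investigations schema is not shown in this document.

```python
import json


def build_investigation_row(task, parsed, report_text, usage):
    """Flatten task, parsed findings, and LLM usage into one row (assumed columns)."""
    return {
        # Task linkage
        "task_id": task["task_id"],
        "cluster_id": task["cluster_id"],
        "customer_name": task.get("customer_name"),
        "customer_tier": task.get("customer_tier"),
        # Parsed structured fields; lists are JSON-encoded (an assumption)
        "severity": parsed.get("severity"),
        "confidence": parsed.get("confidence"),
        "root_cause": parsed.get("root_cause"),
        "evidence": json.dumps(parsed.get("evidence", [])),
        "actions": json.dumps(parsed.get("recommended_actions", [])),
        "knowledge_refs": json.dumps(parsed.get("knowledge_refs", [])),
        # Full report text
        "report_text": report_text,
        # LLM usage metadata
        "llm_provider": usage.get("provider"),
        "llm_model": usage.get("model"),
        "llm_tokens": usage.get("tokens"),
        "duration_seconds": usage.get("duration"),
    }
```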
Usage
from byoc_agent.investigator import run_pending_investigations
# Process up to 5 pending tasks
results = run_pending_investigations(max_tasks=5)
# CLI
# python -m byoc_agent.investigator 5
Important Rules
- Be thorough but efficient -- 5-8 tool calls, not 15.
- If Knowledge Lake has a matching past incident with a known fix, lead with that.
- Factor customer tier into severity: Tier A/S issues are always higher priority.
- If metrics look normal despite the alert, say so -- false positives happen.
- Always check the SR version -- many issues are version-specific.