InvestigatorAgent -- Autonomous Cluster Investigation

Source: byoc_agent/investigator.py

The InvestigatorAgent picks up pending tasks from the Sentinel's agent_tasks queue, performs a thorough autonomous investigation using SQL tools and the Knowledge Lake, and persists structured findings to agent_investigations.

Pipeline

Sentinel creates agent_task (status=pending)

run_pending_investigations() fetches up to N pending tasks

For each task:
1. Mark task status → "running"
2. Build investigation prompt from task context
3. Run InvestigatorAgent (autonomous LLM loop)
4. Parse structured output (severity, root cause, evidence, actions)
5. Persist to agent_investigations
6. Mark task status → "completed" (or "failed")
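The per-task loop above can be sketched as follows. This is an illustrative in-memory version, not the actual implementation; the function name and the `investigate`/`persist` callables are hypothetical stand-ins for the prompt-building, agent-run, parsing, and persistence steps.

```python
# Hypothetical sketch of the pipeline above. Tasks are dicts with "id"
# and "status"; investigate(task) covers steps 2-4, persist(id, findings)
# covers step 5. Names are illustrative, not the real byoc_agent API.
def run_pending_investigations_sketch(tasks, investigate, persist, max_tasks=5):
    pending = [t for t in tasks if t["status"] == "pending"][:max_tasks]
    results = []
    for task in pending:
        task["status"] = "running"             # step 1
        try:
            findings = investigate(task)       # steps 2-4
            persist(task["id"], findings)      # step 5
            task["status"] = "completed"       # step 6 (success)
        except Exception:
            task["status"] = "failed"          # step 6 (failure)
            continue
        results.append(findings)
    return results
```

A failed investigation marks only its own task as "failed"; the loop continues with the remaining tasks.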

Investigation Protocol (8 steps)

The system prompt instructs the agent to follow this protocol:

  1. Gather context -- query_cluster_info for customer tier, region, cluster config.
  2. Check recent alerts -- query_cluster_alerts for firing/resolved patterns.
  3. Analyze metrics -- query_cluster_metrics with metric names specific to the alert type:
    • CPU: starrocks_be_cpu_util_percent
    • Memory: starrocks_be_jvm_heap_used_percent, starrocks_be_process_mem_bytes
    • Disk: starrocks_be_disks_data_used_pct
    • Compaction: starrocks_be_compaction_score_average
    • Query performance: starrocks_fe_query_latency_ms_p99, starrocks_fe_query_total
    • Errors: starrocks_fe_query_err
  4. Check operations -- query_cluster_operations for recent START/STOP/UPDATE/NODE_UPDATE changes.
  5. Search Knowledge Lake -- CRITICAL: always search for similar past incidents, known bugs by SR version, monitoring guidelines, and proven resolutions.
  6. Check past investigations -- query_agent_memory for prior investigations of this cluster or issue type.
  7. Check sibling clusters -- query_similar_clusters to determine if the issue is account-wide or cluster-specific.
  8. Custom SQL -- run_sql for specific data not covered by pre-built tools.
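The alert-type-to-metric mapping in step 3 can be expressed as a lookup table. The metric names below are the ones documented above; the dict structure and `metrics_for` helper are illustrative, not part of the real module.

```python
# Illustrative mapping from alert type to the metric names listed in
# step 3 of the protocol. Metric names are as documented; the dict and
# helper are a sketch, not the actual investigator.py code.
ALERT_METRICS = {
    "cpu": ["starrocks_be_cpu_util_percent"],
    "memory": ["starrocks_be_jvm_heap_used_percent",
               "starrocks_be_process_mem_bytes"],
    "disk": ["starrocks_be_disks_data_used_pct"],
    "compaction": ["starrocks_be_compaction_score_average"],
    "query_performance": ["starrocks_fe_query_latency_ms_p99",
                          "starrocks_fe_query_total"],
    "errors": ["starrocks_fe_query_err"],
}

def metrics_for(alert_type: str) -> list[str]:
    """Metric names to pass to query_cluster_metrics for this alert type."""
    return ALERT_METRICS.get(alert_type.lower(), [])
```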

Output Format

The agent produces a structured report:

  • SEVERITY -- Critical, Warning, or Info
  • CONFIDENCE -- Float 0.0-1.0
  • ROOT_CAUSE -- 1-2 sentence hypothesis
  • EVIDENCE -- Bullet points of supporting data
  • KNOWLEDGE_REFS -- Relevant Knowledge Lake documents, past incidents, known bugs
  • RECOMMENDED_ACTIONS -- Prioritized actionable steps
  • SUMMARY -- 2-3 sentence executive summary suitable for Tier A customer updates
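A labeled report like this can be parsed into structured fields with a simple regex over the field names. This parser is a hypothetical sketch; the actual parsing logic in investigator.py may differ.

```python
import re

# Hypothetical parser for the labeled report format above. Each field
# starts a line as "FIELD: value"; a value runs until the next field
# label or end of text.
FIELDS = ["SEVERITY", "CONFIDENCE", "ROOT_CAUSE", "EVIDENCE",
          "KNOWLEDGE_REFS", "RECOMMENDED_ACTIONS", "SUMMARY"]

def parse_report(text: str) -> dict:
    names = "|".join(FIELDS)
    pattern = re.compile(
        rf"^({names}):\s*(.*?)(?=^(?:{names}):|\Z)",
        re.MULTILINE | re.DOTALL,
    )
    out = {k.lower(): v.strip() for k, v in pattern.findall(text)}
    if "confidence" in out:
        out["confidence"] = float(out["confidence"])  # "0.8" -> 0.8
    return out
```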

Configuration

  • agent_type = "investigator" -- Agent identifier
  • max_rounds = 12 -- Maximum tool-use rounds (aims for 5-8 tool calls)
  • max_tasks = 5 -- Default number of pending tasks to process per run

Task Prioritization

Tasks are fetched from the queue ordered by:

  1. priority ASC (1 = highest priority)
  2. created_at ASC (oldest first within same priority)

Priority is set by the Sentinel based on trigger type:

  • Priority 1: New Critical cluster, alert storm, Tier A/S Critical
  • Priority 2: Score cliff, Tier A/S Warning
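The queue ordering above is equivalent to a two-key sort: priority first (1 = highest), then creation time. A minimal in-memory version, with illustrative field names:

```python
# In-memory equivalent of the queue ordering described above:
# priority ASC (1 = highest), then created_at ASC (oldest first).
# Field names are illustrative, not the actual schema.
def order_tasks(tasks):
    return sorted(tasks, key=lambda t: (t["priority"], t["created_at"]))
```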

Task Prompt Construction

Each investigation prompt includes:

  • Cluster ID and name
  • Customer name and tier
  • Trigger reason (from Sentinel)
  • Priority level
  • Context JSON (risk level, risk reasons, score details, alert counts)
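A prompt builder covering the fields listed above might look like this. The field names and template are assumptions for illustration; the actual template in investigator.py may differ.

```python
import json

# Hypothetical prompt builder for the investigation task; key names on
# the task dict are assumptions, not the real schema.
def build_prompt(task: dict) -> str:
    return (
        f"Cluster: {task['cluster_name']} ({task['cluster_id']})\n"
        f"Customer: {task['customer_name']} (Tier {task['tier']})\n"
        f"Trigger: {task['trigger_reason']}\n"
        f"Priority: {task['priority']}\n"
        f"Context: {json.dumps(task['context'])}\n"
        "Investigate this cluster following the 8-step protocol."
    )
```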

Persistence

Findings are saved to alerts.agent_investigations with:

  • Parsed structured fields (severity, root_cause, evidence, actions, knowledge_refs, confidence)
  • Full report text
  • LLM usage metadata (provider, model, tokens, duration)
  • Task linkage (task_id, cluster_id, customer info)
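Put together, a persisted row would carry the parsed fields, the raw report, the LLM usage metadata, and the task linkage. The column names below are assumptions inferred from the bullets above, not the actual alerts.agent_investigations schema.

```python
# Illustrative shape of a row written to alerts.agent_investigations;
# column names are assumptions based on the fields listed above.
def build_investigation_row(task, findings, report_text, usage):
    return {
        "task_id": task["id"],
        "cluster_id": task["cluster_id"],
        "severity": findings.get("severity"),
        "root_cause": findings.get("root_cause"),
        "evidence": findings.get("evidence"),
        "recommended_actions": findings.get("recommended_actions"),
        "knowledge_refs": findings.get("knowledge_refs"),
        "confidence": findings.get("confidence"),
        "report": report_text,              # full report text
        "llm_provider": usage.get("provider"),
        "llm_model": usage.get("model"),
        "llm_tokens": usage.get("tokens"),
        "duration_ms": usage.get("duration_ms"),
    }
```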

Usage

from byoc_agent.investigator import run_pending_investigations

# Process up to 5 pending tasks
results = run_pending_investigations(max_tasks=5)

# CLI
# python -m byoc_agent.investigator 5

Important Rules

  • Be thorough but efficient -- 5-8 tool calls, not 15.
  • If Knowledge Lake has a matching past incident with a known fix, lead with that.
  • Factor customer tier into severity: Tier A/S issues are always higher priority.
  • If metrics look normal despite the alert, say so -- false positives happen.
  • Always check the SR version -- many issues are version-specific.