InvestigatorAgent -- Autonomous Cluster Investigation

Source: byoc_agent/investigator.py

The InvestigatorAgent picks up pending tasks from the Sentinel's agent_tasks queue, performs a thorough autonomous investigation using SQL tools and the Knowledge Lake, and persists structured findings to agent_investigations.

Pipeline

Sentinel creates agent_task (status=pending)

run_pending_investigations() fetches up to N pending tasks

For each task:
1. Mark task status → "running"
2. Build investigation prompt from task context
3. Run InvestigatorAgent (autonomous LLM loop)
4. Parse structured output (severity, root cause, evidence, actions)
5. Persist to agent_investigations
6. Mark task status → "completed" (or "failed")
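The per-task loop above can be sketched as follows. This is an illustrative in-memory version, not the actual implementation; the function name and the `investigate`/`persist` callables are hypothetical stand-ins for the prompt-building, agent-run, parsing, and persistence steps.

```python
# Hypothetical sketch of the pipeline above. Tasks are dicts with "id"
# and "status"; investigate(task) covers steps 2-4, persist(id, findings)
# covers step 5. Names are illustrative, not the real byoc_agent API.
def run_pending_investigations_sketch(tasks, investigate, persist, max_tasks=5):
    pending = [t for t in tasks if t["status"] == "pending"][:max_tasks]
    results = []
    for task in pending:
        task["status"] = "running"             # step 1
        try:
            findings = investigate(task)       # steps 2-4
            persist(task["id"], findings)      # step 5
            task["status"] = "completed"       # step 6 (success)
        except Exception:
            task["status"] = "failed"          # step 6 (failure)
            continue
        results.append(findings)
    return results
```

A failed investigation marks only its own task as "failed"; the loop continues with the remaining tasks.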

Investigation Protocol (8 steps)

The system prompt instructs the agent to follow this protocol:

  1. Gather context -- query_cluster_info for customer tier, region, cluster config.
  2. Check recent alerts -- query_cluster_alerts for firing/resolved patterns.
  3. Analyze metrics -- query_cluster_metrics with metric names specific to the alert type:
    • CPU: starrocks_be_cpu_util_percent
    • Memory: starrocks_be_jvm_heap_used_percent, starrocks_be_process_mem_bytes
    • Disk: starrocks_be_disks_data_used_pct
    • Compaction: starrocks_be_compaction_score_average
    • Query performance: starrocks_fe_query_latency_ms_p99, starrocks_fe_query_total
    • Errors: starrocks_fe_query_err
  4. Check operations -- query_cluster_operations for recent START/STOP/UPDATE/NODE_UPDATE changes.
  5. Search Knowledge Lake -- CRITICAL: always search for similar past incidents, known bugs by SR version, monitoring guidelines, and proven resolutions.
  6. Check past investigations -- query_agent_memory for prior investigations of this cluster or issue type.
  7. Check sibling clusters -- query_similar_clusters to determine if the issue is account-wide or cluster-specific.
  8. Custom SQL -- run_sql for specific data not covered by pre-built tools.
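The alert-type-to-metric mapping in step 3 can be expressed as a lookup table. The metric names below are the ones documented above; the dict structure and `metrics_for` helper are illustrative, not part of the real module.

```python
# Illustrative mapping from alert type to the metric names listed in
# step 3 of the protocol. Metric names are as documented; the dict and
# helper are a sketch, not the actual investigator.py code.
ALERT_METRICS = {
    "cpu": ["starrocks_be_cpu_util_percent"],
    "memory": ["starrocks_be_jvm_heap_used_percent",
               "starrocks_be_process_mem_bytes"],
    "disk": ["starrocks_be_disks_data_used_pct"],
    "compaction": ["starrocks_be_compaction_score_average"],
    "query_performance": ["starrocks_fe_query_latency_ms_p99",
                          "starrocks_fe_query_total"],
    "errors": ["starrocks_fe_query_err"],
}

def metrics_for(alert_type: str) -> list[str]:
    """Metric names to pass to query_cluster_metrics for this alert type."""
    return ALERT_METRICS.get(alert_type.lower(), [])
```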

Output Format

The agent produces a structured report:

  • SEVERITY -- Critical, Warning, or Info
  • CONFIDENCE -- Float 0.0-1.0
  • ROOT_CAUSE -- 1-2 sentence hypothesis
  • EVIDENCE -- Bullet points of supporting data
  • KNOWLEDGE_REFS -- Relevant Knowledge Lake documents, past incidents, known bugs
  • RECOMMENDED_ACTIONS -- Prioritized actionable steps
  • SUMMARY -- 2-3 sentence executive summary suitable for Tier A customer updates
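A labeled report like this can be parsed into structured fields with a simple regex over the field names. This parser is a hypothetical sketch; the actual parsing logic in investigator.py may differ.

```python
import re

# Hypothetical parser for the labeled report format above. Each field
# starts a line as "FIELD: value"; a value runs until the next field
# label or end of text.
FIELDS = ["SEVERITY", "CONFIDENCE", "ROOT_CAUSE", "EVIDENCE",
          "KNOWLEDGE_REFS", "RECOMMENDED_ACTIONS", "SUMMARY"]

def parse_report(text: str) -> dict:
    names = "|".join(FIELDS)
    pattern = re.compile(
        rf"^({names}):\s*(.*?)(?=^(?:{names}):|\Z)",
        re.MULTILINE | re.DOTALL,
    )
    out = {k.lower(): v.strip() for k, v in pattern.findall(text)}
    if "confidence" in out:
        out["confidence"] = float(out["confidence"])  # "0.8" -> 0.8
    return out
```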

Configuration

  • agent_type = "investigator" -- Agent identifier
  • max_rounds = 12 -- Maximum tool-use rounds (aims for 5-8 tool calls)
  • max_tasks = 5 -- Default number of pending tasks to process per run

Task Prioritization

Tasks are fetched from the queue ordered by:

  1. priority ASC (1 = highest priority)
  2. created_at ASC (oldest first within same priority)

Priority is set by the Sentinel based on trigger type:

  • Priority 1: New Critical cluster, alert storm, Tier A/S Critical
  • Priority 2: Score cliff, Tier A/S Warning
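The queue ordering above is equivalent to a two-key sort: priority first (1 = highest), then creation time. A minimal in-memory version, with illustrative field names:

```python
# In-memory equivalent of the queue ordering described above:
# priority ASC (1 = highest), then created_at ASC (oldest first).
# Field names are illustrative, not the actual schema.
def order_tasks(tasks):
    return sorted(tasks, key=lambda t: (t["priority"], t["created_at"]))
```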

Task Prompt Construction

Each investigation prompt includes:

  • Cluster ID and name
  • Customer name and tier
  • Trigger reason (from Sentinel)
  • Priority level
  • Context JSON (risk level, risk reasons, score details, alert counts)
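A prompt builder covering the fields listed above might look like this. The field names and template are assumptions for illustration; the actual template in investigator.py may differ.

```python
import json

# Hypothetical prompt builder for the investigation task; key names on
# the task dict are assumptions, not the real schema.
def build_prompt(task: dict) -> str:
    return (
        f"Cluster: {task['cluster_name']} ({task['cluster_id']})\n"
        f"Customer: {task['customer_name']} (Tier {task['tier']})\n"
        f"Trigger: {task['trigger_reason']}\n"
        f"Priority: {task['priority']}\n"
        f"Context: {json.dumps(task['context'])}\n"
        "Investigate this cluster following the 8-step protocol."
    )
```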

Persistence

Findings are saved to alerts.agent_investigations with:

  • Parsed structured fields (severity, root_cause, evidence, actions, knowledge_refs, confidence)
  • Full report text
  • LLM usage metadata (provider, model, tokens, duration)
  • Task linkage (task_id, cluster_id, customer info)
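Put together, a persisted row would carry the parsed fields, the raw report, the LLM usage metadata, and the task linkage. The column names below are assumptions inferred from the bullets above, not the actual alerts.agent_investigations schema.

```python
# Illustrative shape of a row written to alerts.agent_investigations;
# column names are assumptions based on the fields listed above.
def build_investigation_row(task, findings, report_text, usage):
    return {
        "task_id": task["id"],
        "cluster_id": task["cluster_id"],
        "severity": findings.get("severity"),
        "root_cause": findings.get("root_cause"),
        "evidence": findings.get("evidence"),
        "recommended_actions": findings.get("recommended_actions"),
        "knowledge_refs": findings.get("knowledge_refs"),
        "confidence": findings.get("confidence"),
        "report": report_text,              # full report text
        "llm_provider": usage.get("provider"),
        "llm_model": usage.get("model"),
        "llm_tokens": usage.get("tokens"),
        "duration_ms": usage.get("duration_ms"),
    }
```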

Usage

from byoc_agent.investigator import run_pending_investigations

# Process up to 5 pending tasks
results = run_pending_investigations(max_tasks=5)

# CLI
# python -m byoc_agent.investigator 5

Important Rules

  • Be thorough but efficient -- 5-8 tool calls, not 15.
  • If Knowledge Lake has a matching past incident with a known fix, lead with that.
  • Factor customer tier into severity: Tier A/S issues are always higher priority.
  • If metrics look normal despite the alert, say so -- false positives happen.
  • Always check the SR version -- many issues are version-specific.