PatrolAgent -- Autonomous Fleet-Wide Health Scan

Source: byoc_agent/patrol_agent.py

The PatrolAgent performs comprehensive fleet-wide health scans and produces actionable reports for the operations team. It runs autonomously twice daily (morning and evening) or on demand from the UI.

How It Works

  1. Pre-aggregate fleet state -- Before calling the LLM, the agent runs 6 SQL queries to build a compact summary of current fleet health. This reduces tool calls and gives the LLM a head start.
  2. Run the LLM agent -- The PatrolAgent (subclass of AutonomousAgent) runs with the pre-aggregated context as the user prompt. It has access to the full tool registry for deeper investigation.
  3. Parse the report -- The LLM's response is parsed into structured sections.
  4. Persist -- The parsed report is saved to alerts.patrol_reports.
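The four steps above can be sketched as a single orchestration function. This is a stubbed, hypothetical outline -- only run_patrol's signature comes from this document; the helper names and stub bodies are assumptions:

```python
# Hypothetical sketch of the patrol flow; the real implementation lives in
# byoc_agent/patrol_agent.py and the helpers below are stand-ins.

def _aggregate_fleet_state() -> str:
    # Step 1: would run 6 SQL queries; stubbed as a compact summary string.
    return "clusters: 42 total / 40 active; risk: 1 critical, 3 warning"

def _run_agent(fleet_state: str) -> str:
    # Step 2: would run the PatrolAgent LLM loop with fleet_state as prompt.
    return "FLEET_NARRATIVE: Fleet is broadly healthy.\nPREDICTIONS: none"

def _parse_report(raw: str) -> dict:
    # Step 3: split the LLM response into labelled sections.
    sections = {}
    for line in raw.splitlines():
        key, _, value = line.partition(":")
        sections[key.strip()] = value.strip()
    return sections

def run_patrol(report_type: str = "on_demand", triggered_by: str = "manual"):
    fleet_state = _aggregate_fleet_state()
    sections = _parse_report(_run_agent(fleet_state))
    report_id = 1  # Step 4 would INSERT into alerts.patrol_reports
    return report_id, sections
```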

Pre-Aggregated Fleet State

The _aggregate_fleet_state() function collects:

| Section | Query |
| --- | --- |
| Cluster counts | Total and active clusters from byoc.clusters |
| Risk distribution | Critical/Warning/Healthy counts from cluster_risk_snapshots |
| Health distribution | Healthy (80+), Moderate (60-80), Poor (<60) from cluster_health_scores |
| Alerts (24h) | Firing/Resolved counts from lark_alerts |
| Open issues | Ongoing issues by severity from alerts.issues |
| Worst 10 clusters | Bottom health scores joined with risk reasons |
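Two of the aggregations can be illustrated with simplified, hypothetical SQL run against an in-memory SQLite schema. The real queries target the byoc.* and alerts.* tables and may be phrased differently:

```python
import sqlite3

# Toy schema mirroring the tables named above (illustrative only).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE clusters (id TEXT, status TEXT);
    CREATE TABLE cluster_health_scores (cluster_id TEXT, score REAL);
    INSERT INTO clusters VALUES ('c1', 'active'), ('c2', 'inactive');
    INSERT INTO cluster_health_scores VALUES ('c1', 92), ('c2', 55);
""")

def aggregate_fleet_state(con) -> str:
    # Cluster counts: total and active.
    total, active = con.execute(
        "SELECT COUNT(*), SUM(status = 'active') FROM clusters"
    ).fetchone()
    # Health distribution: Healthy (80+), Moderate (60-80), Poor (<60).
    healthy, moderate, poor = con.execute(
        """SELECT SUM(score >= 80), SUM(score >= 60 AND score < 80),
                  SUM(score < 60) FROM cluster_health_scores"""
    ).fetchone()
    return (f"Clusters: {total} total, {active} active\n"
            f"Health: {healthy} healthy / {moderate} moderate / {poor} poor")
```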

Patrol Protocol (6 steps)

The system prompt instructs the LLM to follow this protocol:

  1. Fleet Overview -- query_fleet_summary for account-wide stats, query_open_issues for current issues.
  2. Identify Problem Clusters -- SQL join of risk snapshots and health scores, filtered to Critical/Warning or score < 70.
  3. Deep-Dive Top Clusters -- For the top 5 most critical: check metrics, alerts, live infrastructure state, and logs.
  4. Cross-Cluster Patterns -- Look for systemic issues: same alert type across clusters, version-specific bugs, account-level concerns.
  5. Search Knowledge Lake -- Match observed patterns against known issues and past incidents.
  6. Predictions -- Based on metric trends, predict which clusters may have issues in 24-48 hours.
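One plausible way the protocol could be encoded into the system prompt is as a numbered list. The exact wording used in patrol_agent.py is not shown in this document, so the strings below are paraphrases:

```python
# Paraphrased 6-step protocol; the production prompt text may differ.
PATROL_PROTOCOL = [
    "Fleet Overview: call query_fleet_summary and query_open_issues.",
    "Identify Problem Clusters: join risk snapshots with health scores, "
    "keep Critical/Warning or score < 70.",
    "Deep-Dive Top Clusters: check metrics, alerts, live infrastructure "
    "state, and logs for the top 5 most critical.",
    "Cross-Cluster Patterns: look for repeated alert types, "
    "version-specific bugs, account-level concerns.",
    "Search Knowledge Lake: match observed patterns against known issues "
    "and past incidents.",
    "Predictions: flag clusters likely to degrade in the next 24-48 hours.",
]

SYSTEM_PROMPT = "Follow this patrol protocol:\n" + "\n".join(
    f"{i}. {step}" for i, step in enumerate(PATROL_PROTOCOL, 1)
)
```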

Report Output Format

The LLM produces a report with these structured sections:

| Section | Content |
| --- | --- |
| FLEET_NARRATIVE | 2-4 paragraph overview for ops managers |
| FLAGGED_CLUSTERS | JSON array with cluster_id, severity, reason, key_metrics |
| PREDICTIONS | Bullet points with confidence levels (High/Medium/Low) |
| CUSTOMER_SUMMARIES | 2-3 sentence summary per Tier A/S customer with issues |
| RECOMMENDED_ACTIONS | Prioritized action items with urgency (Immediate/Today/This Week) |
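A minimal parser for this format can split the response on the known section headers. This is a sketch of one reasonable approach, not the parser actually used in patrol_agent.py:

```python
import re

SECTIONS = ["FLEET_NARRATIVE", "FLAGGED_CLUSTERS", "PREDICTIONS",
            "CUSTOMER_SUMMARIES", "RECOMMENDED_ACTIONS"]

def parse_report(text: str) -> dict:
    # Split on any known header; the capturing group keeps the header
    # names in the split output so they can be paired with their bodies.
    pattern = "(" + "|".join(SECTIONS) + "):"
    parts = re.split(pattern, text)
    result = {}
    it = iter(parts[1:])  # parts[0] is any preamble before the first header
    for name, body in zip(it, it):
        result[name] = body.strip()
    return result
```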

Configuration

| Parameter | Value | Description |
| --- | --- | --- |
| agent_type | "patrol" | Agent identifier |
| max_rounds | 10 | Maximum tool-use rounds (aims for 6-10 tool calls) |
| tools_schema | TOOL_SCHEMAS | Full tool registry from agent_tools |
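The configuration above maps naturally onto class attributes on the subclass. The AutonomousAgent attribute names and defaults below are assumptions; only the three PatrolAgent values come from the table:

```python
# Stand-in for the full registry imported from agent_tools in production.
TOOL_SCHEMAS = ["query_fleet_summary", "query_open_issues"]

class AutonomousAgent:
    # Hypothetical base-class defaults.
    agent_type = "base"
    max_rounds = 5
    tools_schema: list = []

class PatrolAgent(AutonomousAgent):
    agent_type = "patrol"        # agent identifier
    max_rounds = 10              # aims for 6-10 tool calls per patrol
    tools_schema = TOOL_SCHEMAS  # full tool registry from agent_tools
```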

Report Types

| Type | Trigger |
| --- | --- |
| morning | Cron job, morning shift |
| evening | Cron job, evening shift |
| weekly | Weekly summary |
| on_demand | Manual trigger from UI or CLI |
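Since the CLI passes the report type as a positional argument, a small validation helper keeps unknown values out of the pipeline. This helper is hypothetical, sketched from the table above:

```python
VALID_REPORT_TYPES = {"morning", "evening", "weekly", "on_demand"}

def resolve_report_type(arg=None):
    # The CLI entry point would pass sys.argv[1]; manual runs with no
    # argument default to on_demand.
    report_type = arg or "on_demand"
    if report_type not in VALID_REPORT_TYPES:
        raise ValueError(f"unknown report type: {report_type}")
    return report_type
```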

Persistence

Reports are saved to alerts.patrol_reports with columns for each parsed section, plus LLM usage metadata (provider, model, input/output tokens, duration).
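The persistence step might look like the following. The column set here is a guess at a subset of alerts.patrol_reports (the document names the sections and usage metadata but not the exact columns), and an in-memory SQLite table stands in for the real database:

```python
import sqlite3

# Hypothetical subset of the alerts.patrol_reports columns.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE patrol_reports (
        report_type TEXT, fleet_narrative TEXT, predictions TEXT,
        provider TEXT, model TEXT,
        input_tokens INTEGER, output_tokens INTEGER, duration_s REAL
    )
""")

def save_report(con, sections: dict, usage: dict, report_type: str) -> int:
    # One row per patrol run: parsed sections plus LLM usage metadata.
    cur = con.execute(
        "INSERT INTO patrol_reports VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (report_type,
         sections.get("FLEET_NARRATIVE"), sections.get("PREDICTIONS"),
         usage["provider"], usage["model"],
         usage["input_tokens"], usage["output_tokens"], usage["duration_s"]),
    )
    return cur.lastrowid
```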

Usage

```python
from byoc_agent.patrol_agent import run_patrol

# Programmatic
report_id, result = run_patrol(report_type="morning", triggered_by="cron")

# CLI:
#   python -m byoc_agent.patrol_agent on_demand
```

Important Design Decisions

  • Pre-aggregation reduces LLM tool calls from ~15+ to 6-10, cutting cost and latency.
  • Tier A/S customers are always prioritized in analysis.
  • The agent is instructed: "If the fleet is healthy, say so confidently -- don't invent problems."