# PatrolAgent -- Autonomous Fleet-Wide Health Scan

Source: `byoc_agent/patrol_agent.py`

The PatrolAgent performs comprehensive fleet-wide health scans and produces actionable reports for the operations team. It runs autonomously twice daily (morning and evening) or on demand from the UI.

## How It Works
- Pre-aggregate fleet state -- Before calling the LLM, the agent runs six SQL queries to build a compact summary of current fleet health. This reduces tool calls and gives the LLM a head start.
- Run the LLM agent -- The `PatrolAgent` (a subclass of `AutonomousAgent`) runs with the pre-aggregated context as the user prompt. It has access to the full tool registry for deeper investigation.
- Parse the report -- The LLM's response is parsed into structured sections.
- Persist -- The parsed report is saved to `alerts.patrol_reports`.
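The four steps above can be sketched as a small pipeline. This is a simplified, hypothetical skeleton: only `run_patrol`'s signature comes from the Usage section below; the helper names, return shapes, and stub bodies are assumptions.

```python
def _aggregate_fleet_state() -> str:
    # Real agent: six SQL queries over the fleet tables. Stubbed here.
    return "Clusters: 120 total / 98 active\nRisk: 3 critical, 11 warning"

def _run_llm(user_prompt: str) -> str:
    # Real agent: AutonomousAgent tool-use loop with the full tool registry.
    return "FLEET_NARRATIVE:\nFleet is broadly healthy; two clusters flagged."

def _parse_report(raw: str) -> dict:
    # Real agent: splits the response into the five structured sections.
    return {"raw": raw}

_SAVED = []  # stand-in for the alerts.patrol_reports table

def _persist(report: dict, report_type: str, triggered_by: str) -> int:
    _SAVED.append({**report, "type": report_type, "by": triggered_by})
    return len(_SAVED)  # report_id

def run_patrol(report_type: str = "on_demand", triggered_by: str = "ui"):
    context = _aggregate_fleet_state()                       # 1. pre-aggregate
    raw = _run_llm(context)                                  # 2. run the LLM agent
    report = _parse_report(raw)                              # 3. parse
    report_id = _persist(report, report_type, triggered_by)  # 4. persist
    return report_id, report
```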
## Pre-Aggregated Fleet State

The `_aggregate_fleet_state()` function collects:
| Section | Query |
|---|---|
| Cluster counts | Total and active clusters from `byoc.clusters` |
| Risk distribution | Critical/Warning/Healthy counts from `cluster_risk_snapshots` |
| Health distribution | Healthy (80+), Moderate (60-80), Poor (<60) from `cluster_health_scores` |
| Alerts (24h) | Firing/Resolved counts from `lark_alerts` |
| Open issues | Ongoing issues by severity from `alerts.issues` |
| Worst 10 clusters | Bottom health scores joined with risk reasons |
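A sketch of how these six sections could be assembled into one compact text block. The SQL strings are illustrative placeholders, not the actual queries, and the injected `run_query` callable is an assumption made so the sketch stays self-contained.

```python
# Illustrative stand-ins for the six aggregation queries (not the real SQL).
FLEET_QUERIES = {
    "Cluster counts": "SELECT COUNT(*) FROM byoc.clusters",
    "Risk distribution": "SELECT risk_level, COUNT(*) FROM cluster_risk_snapshots GROUP BY risk_level",
    "Health distribution": "SELECT health_band, COUNT(*) FROM cluster_health_scores GROUP BY health_band",
    "Alerts (24h)": "SELECT status, COUNT(*) FROM lark_alerts WHERE fired_at > now() - interval '24 hours' GROUP BY status",
    "Open issues": "SELECT severity, COUNT(*) FROM alerts.issues WHERE status = 'ongoing' GROUP BY severity",
    "Worst 10 clusters": "SELECT cluster_id, score FROM cluster_health_scores ORDER BY score ASC LIMIT 10",
}

def aggregate_fleet_state(run_query) -> str:
    """Build the pre-aggregated summary; run_query(sql) -> list of row tuples."""
    sections = []
    for title, sql in FLEET_QUERIES.items():
        rows = run_query(sql)
        body = "\n".join(str(row) for row in rows) or "(no rows)"
        sections.append(f"### {title}\n{body}")
    return "\n\n".join(sections)
```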
## Patrol Protocol (6 Steps)

The system prompt instructs the LLM to follow this protocol:
1. Fleet Overview -- `query_fleet_summary` for account-wide stats, `query_open_issues` for current issues.
2. Identify Problem Clusters -- SQL join of risk snapshots and health scores, filtered to Critical/Warning or score < 70.
3. Deep-Dive Top Clusters -- For the five most critical clusters: check metrics, alerts, live infrastructure state, and logs.
4. Cross-Cluster Patterns -- Look for systemic issues: the same alert type across clusters, version-specific bugs, account-level concerns.
5. Search Knowledge Lake -- Match observed patterns against known issues and past incidents.
6. Predictions -- Based on metric trends, predict which clusters may develop issues in the next 24-48 hours.
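Step 2's filter (Critical/Warning risk, or health score below 70) can be expressed in plain Python; the real agent does this with a SQL join, and the data shapes here are assumptions.

```python
SEVERITY_ORDER = {"Critical": 0, "Warning": 1, "Healthy": 2}

def identify_problem_clusters(risk: dict, health: dict, score_cutoff: int = 70):
    """risk: cluster_id -> risk level; health: cluster_id -> 0-100 score."""
    flagged = []
    for cluster_id, score in health.items():
        level = risk.get(cluster_id, "Healthy")
        if level in ("Critical", "Warning") or score < score_cutoff:
            flagged.append((cluster_id, level, score))
    # Most critical first: by risk level, then by lowest score.
    return sorted(flagged, key=lambda f: (SEVERITY_ORDER[f[1]], f[2]))
```

Note that a cluster with a healthy score but a Critical risk snapshot (or vice versa) is still flagged, which matches the "or" in the protocol.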
## Report Output Format
The LLM produces a report with these structured sections:
| Section | Content |
|---|---|
| `FLEET_NARRATIVE` | 2-4 paragraph overview for ops managers |
| `FLAGGED_CLUSTERS` | JSON array with cluster_id, severity, reason, key_metrics |
| `PREDICTIONS` | Bullet points with confidence levels (High/Medium/Low) |
| `CUSTOMER_SUMMARIES` | Per Tier A/S customer with issues, 2-3 sentence summaries |
| `RECOMMENDED_ACTIONS` | Prioritized action items with urgency (Immediate/Today/This Week) |
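A hypothetical parser for this sectioned format (the actual parsing logic in `patrol_agent.py` may differ): split on the section headers, then decode `FLAGGED_CLUSTERS` as JSON, falling back to raw text if the model emitted something unparseable.

```python
import json
import re

SECTIONS = [
    "FLEET_NARRATIVE",
    "FLAGGED_CLUSTERS",
    "PREDICTIONS",
    "CUSTOMER_SUMMARIES",
    "RECOMMENDED_ACTIONS",
]

# Matches a header line like "FLAGGED_CLUSTERS:" on its own line.
_MARKER = re.compile(rf"^({'|'.join(SECTIONS)}):?\s*$", re.MULTILINE)

def parse_report(text: str) -> dict:
    parts = _MARKER.split(text)
    # split() with a capturing group yields [preamble, header, body, header, body, ...]
    report = dict(zip(parts[1::2], (body.strip() for body in parts[2::2])))
    if isinstance(report.get("FLAGGED_CLUSTERS"), str):
        try:
            report["FLAGGED_CLUSTERS"] = json.loads(report["FLAGGED_CLUSTERS"])
        except ValueError:
            pass  # keep the raw text if the model emitted non-JSON
    return report
```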
## Configuration

| Parameter | Value | Description |
|---|---|---|
| `agent_type` | `"patrol"` | Agent identifier |
| `max_rounds` | `10` | Maximum tool-use rounds (aims for 6-10 tool calls) |
| `tools_schema` | `TOOL_SCHEMAS` | Full tool registry from `agent_tools` |
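One way the table above could be wired into the subclass constructor; `AutonomousAgent`'s real signature and the shape of `TOOL_SCHEMAS` are assumptions here.

```python
# Placeholder registry; the real TOOL_SCHEMAS comes from agent_tools.
TOOL_SCHEMAS = [{"name": "query_fleet_summary"}, {"name": "query_open_issues"}]

class AutonomousAgent:
    # Stand-in base class (assumed signature).
    def __init__(self, agent_type: str, max_rounds: int, tools_schema: list):
        self.agent_type = agent_type
        self.max_rounds = max_rounds
        self.tools_schema = tools_schema

class PatrolAgent(AutonomousAgent):
    def __init__(self):
        super().__init__(
            agent_type="patrol",       # agent identifier
            max_rounds=10,             # cap on tool-use rounds
            tools_schema=TOOL_SCHEMAS, # full tool registry
        )
```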
## Report Types

| Type | Trigger |
|---|---|
| `morning` | Cron job, morning shift |
| `evening` | Cron job, evening shift |
| `weekly` | Weekly summary |
| `on_demand` | Manual trigger from UI or CLI |
## Persistence

Reports are saved to `alerts.patrol_reports` with a column for each parsed section, plus LLM usage metadata (provider, model, input/output tokens, duration).
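A sketch of the write path: one row per report, a column per section plus the usage metadata. The column names and the `%s`-style placeholders are assumptions for illustration, not the real schema.

```python
def build_insert(sections: dict, usage: dict):
    """Return (sql, params) for saving one parsed patrol report."""
    row = {**sections, **usage}
    cols = ", ".join(row)
    placeholders = ", ".join(["%s"] * len(row))
    sql = f"INSERT INTO alerts.patrol_reports ({cols}) VALUES ({placeholders})"
    return sql, list(row.values())
```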
## Usage

```python
from byoc_agent.patrol_agent import run_patrol

# Programmatic
report_id, result = run_patrol(report_type="morning", triggered_by="cron")

# CLI
# python -m byoc_agent.patrol_agent on_demand
```
## Important Design Decisions
- Pre-aggregation reduces LLM tool calls from ~15+ to 6-10, cutting cost and latency.
- Tier A/S customers are always prioritized in analysis.
- The agent is instructed: "If the fleet is healthy, say so confidently -- don't invent problems."