PatrolAgent -- Autonomous Fleet-Wide Health Scan

Source: byoc_agent/patrol_agent.py

The PatrolAgent performs comprehensive fleet-wide health scans and produces actionable reports for the operations team. It runs autonomously twice daily (morning and evening) or on demand from the UI.

How It Works

  1. Pre-aggregate fleet state -- Before calling the LLM, the agent runs 6 SQL queries to build a compact summary of current fleet health. This reduces tool calls and gives the LLM a head start.
  2. Run the LLM agent -- The PatrolAgent (subclass of AutonomousAgent) runs with the pre-aggregated context as the user prompt. It has access to the full tool registry for deeper investigation.
  3. Parse the report -- The LLM's response is parsed into structured sections.
  4. Persist -- The parsed report is saved to alerts.patrol_reports.
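The four steps above can be sketched as a single orchestration function. This is a stubbed, hypothetical outline -- only run_patrol's signature comes from this document; the helper names and stub bodies are assumptions:

```python
# Hypothetical sketch of the patrol flow; the real implementation lives in
# byoc_agent/patrol_agent.py and the helpers below are stand-ins.

def _aggregate_fleet_state() -> str:
    # Step 1: would run 6 SQL queries; stubbed as a compact summary string.
    return "clusters: 42 total / 40 active; risk: 1 critical, 3 warning"

def _run_agent(fleet_state: str) -> str:
    # Step 2: would run the PatrolAgent LLM loop with fleet_state as prompt.
    return "FLEET_NARRATIVE: Fleet is broadly healthy.\nPREDICTIONS: none"

def _parse_report(raw: str) -> dict:
    # Step 3: split the LLM response into labelled sections.
    sections = {}
    for line in raw.splitlines():
        key, _, value = line.partition(":")
        sections[key.strip()] = value.strip()
    return sections

def run_patrol(report_type: str = "on_demand", triggered_by: str = "manual"):
    fleet_state = _aggregate_fleet_state()
    sections = _parse_report(_run_agent(fleet_state))
    report_id = 1  # Step 4 would INSERT into alerts.patrol_reports
    return report_id, sections
```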

Pre-Aggregated Fleet State

The _aggregate_fleet_state() function collects:

| Section | Query |
| --- | --- |
| Cluster counts | Total and active clusters from byoc.clusters |
| Risk distribution | Critical/Warning/Healthy counts from cluster_risk_snapshots |
| Health distribution | Healthy (80+), Moderate (60-80), Poor (<60) from cluster_health_scores |
| Alerts (24h) | Firing/Resolved counts from lark_alerts |
| Open issues | Ongoing issues by severity from alerts.issues |
| Worst 10 clusters | Bottom health scores joined with risk reasons |
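Two of the aggregations can be illustrated with simplified, hypothetical SQL run against an in-memory SQLite schema. The real queries target the byoc.* and alerts.* tables and may be phrased differently:

```python
import sqlite3

# Toy schema mirroring the tables named above (illustrative only).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE clusters (id TEXT, status TEXT);
    CREATE TABLE cluster_health_scores (cluster_id TEXT, score REAL);
    INSERT INTO clusters VALUES ('c1', 'active'), ('c2', 'inactive');
    INSERT INTO cluster_health_scores VALUES ('c1', 92), ('c2', 55);
""")

def aggregate_fleet_state(con) -> str:
    # Cluster counts: total and active.
    total, active = con.execute(
        "SELECT COUNT(*), SUM(status = 'active') FROM clusters"
    ).fetchone()
    # Health distribution: Healthy (80+), Moderate (60-80), Poor (<60).
    healthy, moderate, poor = con.execute(
        """SELECT SUM(score >= 80), SUM(score >= 60 AND score < 80),
                  SUM(score < 60) FROM cluster_health_scores"""
    ).fetchone()
    return (f"Clusters: {total} total, {active} active\n"
            f"Health: {healthy} healthy / {moderate} moderate / {poor} poor")
```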

Patrol Protocol (6 steps)

The system prompt instructs the LLM to follow this protocol:

  1. Fleet Overview -- query_fleet_summary for account-wide stats, query_open_issues for current issues.
  2. Identify Problem Clusters -- SQL join of risk snapshots and health scores, filtered to Critical/Warning or score < 70.
  3. Deep-Dive Top Clusters -- For the top 5 most critical: check metrics, alerts, live infrastructure state, and logs.
  4. Cross-Cluster Patterns -- Look for systemic issues: same alert type across clusters, version-specific bugs, account-level concerns.
  5. Search Knowledge Lake -- Match observed patterns against known issues and past incidents.
  6. Predictions -- Based on metric trends, predict which clusters may have issues in 24-48 hours.
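One plausible way the protocol could be encoded into the system prompt is as a numbered list. The exact wording used in patrol_agent.py is not shown in this document, so the strings below are paraphrases:

```python
# Paraphrased 6-step protocol; the production prompt text may differ.
PATROL_PROTOCOL = [
    "Fleet Overview: call query_fleet_summary and query_open_issues.",
    "Identify Problem Clusters: join risk snapshots with health scores, "
    "keep Critical/Warning or score < 70.",
    "Deep-Dive Top Clusters: check metrics, alerts, live infrastructure "
    "state, and logs for the top 5 most critical.",
    "Cross-Cluster Patterns: look for repeated alert types, "
    "version-specific bugs, account-level concerns.",
    "Search Knowledge Lake: match observed patterns against known issues "
    "and past incidents.",
    "Predictions: flag clusters likely to degrade in the next 24-48 hours.",
]

SYSTEM_PROMPT = "Follow this patrol protocol:\n" + "\n".join(
    f"{i}. {step}" for i, step in enumerate(PATROL_PROTOCOL, 1)
)
```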

Report Output Format

The LLM produces a report with these structured sections:

| Section | Content |
| --- | --- |
| FLEET_NARRATIVE | 2-4 paragraph overview for ops managers |
| FLAGGED_CLUSTERS | JSON array with cluster_id, severity, reason, key_metrics |
| PREDICTIONS | Bullet points with confidence levels (High/Medium/Low) |
| CUSTOMER_SUMMARIES | 2-3 sentence summary per Tier A/S customer with issues |
| RECOMMENDED_ACTIONS | Prioritized action items with urgency (Immediate/Today/This Week) |
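A minimal parser for this format can split the response on the known section headers. This is a sketch of one reasonable approach, not the parser actually used in patrol_agent.py:

```python
import re

SECTIONS = ["FLEET_NARRATIVE", "FLAGGED_CLUSTERS", "PREDICTIONS",
            "CUSTOMER_SUMMARIES", "RECOMMENDED_ACTIONS"]

def parse_report(text: str) -> dict:
    # Split on any known header; the capturing group keeps the header
    # names in the split output so they can be paired with their bodies.
    pattern = "(" + "|".join(SECTIONS) + "):"
    parts = re.split(pattern, text)
    result = {}
    it = iter(parts[1:])  # parts[0] is any preamble before the first header
    for name, body in zip(it, it):
        result[name] = body.strip()
    return result
```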

Configuration

| Parameter | Value | Description |
| --- | --- | --- |
| agent_type | "patrol" | Agent identifier |
| max_rounds | 10 | Maximum tool-use rounds (aims for 6-10 tool calls) |
| tools_schema | TOOL_SCHEMAS | Full tool registry from agent_tools |
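The configuration above maps naturally onto class attributes on the subclass. The AutonomousAgent attribute names and defaults below are assumptions; only the three PatrolAgent values come from the table:

```python
# Stand-in for the full registry imported from agent_tools in production.
TOOL_SCHEMAS = ["query_fleet_summary", "query_open_issues"]

class AutonomousAgent:
    # Hypothetical base-class defaults.
    agent_type = "base"
    max_rounds = 5
    tools_schema: list = []

class PatrolAgent(AutonomousAgent):
    agent_type = "patrol"        # agent identifier
    max_rounds = 10              # aims for 6-10 tool calls per patrol
    tools_schema = TOOL_SCHEMAS  # full tool registry from agent_tools
```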

Report Types

| Type | Trigger |
| --- | --- |
| morning | Cron job, morning shift |
| evening | Cron job, evening shift |
| weekly | Weekly summary |
| on_demand | Manual trigger from UI or CLI |
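Since the CLI passes the report type as a positional argument, a small validation helper keeps unknown values out of the pipeline. This helper is hypothetical, sketched from the table above:

```python
VALID_REPORT_TYPES = {"morning", "evening", "weekly", "on_demand"}

def resolve_report_type(arg=None):
    # The CLI entry point would pass sys.argv[1]; manual runs with no
    # argument default to on_demand.
    report_type = arg or "on_demand"
    if report_type not in VALID_REPORT_TYPES:
        raise ValueError(f"unknown report type: {report_type}")
    return report_type
```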

Persistence

Reports are saved to alerts.patrol_reports with columns for each parsed section, plus LLM usage metadata (provider, model, input/output tokens, duration).
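The persistence step might look like the following. The column set here is a guess at a subset of alerts.patrol_reports (the document names the sections and usage metadata but not the exact columns), and an in-memory SQLite table stands in for the real database:

```python
import sqlite3

# Hypothetical subset of the alerts.patrol_reports columns.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE patrol_reports (
        report_type TEXT, fleet_narrative TEXT, predictions TEXT,
        provider TEXT, model TEXT,
        input_tokens INTEGER, output_tokens INTEGER, duration_s REAL
    )
""")

def save_report(con, sections: dict, usage: dict, report_type: str) -> int:
    # One row per patrol run: parsed sections plus LLM usage metadata.
    cur = con.execute(
        "INSERT INTO patrol_reports VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (report_type,
         sections.get("FLEET_NARRATIVE"), sections.get("PREDICTIONS"),
         usage["provider"], usage["model"],
         usage["input_tokens"], usage["output_tokens"], usage["duration_s"]),
    )
    return cur.lastrowid
```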

Usage

```python
from byoc_agent.patrol_agent import run_patrol

# Programmatic
report_id, result = run_patrol(report_type="morning", triggered_by="cron")

# CLI:
#   python -m byoc_agent.patrol_agent on_demand
```

Important Design Decisions

  • Pre-aggregation reduces LLM tool calls from ~15+ to 6-10, cutting cost and latency.
  • Tier A/S customers are always prioritized in analysis.
  • The agent is instructed: "If the fleet is healthy, say so confidently -- don't invent problems."