BYOC Agentic AI Operations Platform
Internal documentation and onboarding guide for CelerData engineers working on the BYOC cluster health monitoring and operations platform.
What is this platform?
A full-stack, AI-powered operations platform that monitors the health of CelerData's BYOC (Bring Your Own Cloud) StarRocks clusters. It combines telemetry data (~830M metric rows), customer profiles, and real-time alert streams into a unified system that detects, investigates, and reports cluster health issues with minimal human intervention.
Who is it for?
- DW Operations engineers who triage alerts and respond to cluster incidents
- Support engineers investigating customer-reported performance issues
- Engineering managers who need fleet-wide health visibility
- New team members onboarding to the BYOC operations workflow
What problems does it solve?
| Problem | Solution |
|---|---|
| Alert fatigue from hundreds of raw Lark alerts per day | AI-powered issue grouping merges related alerts into actionable issues |
| Manual investigation of each unhealthy cluster | Investigator Agent autonomously deep-dives and produces structured reports |
| No fleet-wide visibility across 170+ active clusters | Patrol Agent sweeps the fleet twice a day with cross-cluster pattern detection |
| Reactive-only operations | Sentinel checks for score drops, alert storms, and newly Critical clusters every 15 minutes |
| Scattered data across Grafana, Lark, StarRocks | Single pane of glass: React dashboard with 12 pages and a chat interface |
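
To make the Sentinel row above more concrete, here is a minimal sketch of the three checks it names (score drops, alert storms, and newly Critical clusters), comparing one cluster's state between two consecutive ticks. This is not the platform's actual implementation: the `ClusterSnapshot` type, the `sentinel_findings` function, the 0 to 100 score scale, the status labels, and both thresholds are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """Hypothetical per-cluster state captured at one Sentinel tick."""
    cluster_id: str
    health_score: float  # assumed scale: 0 (worst) to 100 (best)
    alert_count: int     # alerts received since the previous tick
    status: str          # assumed labels, e.g. "Healthy", "Warning", "Critical"


# Illustrative thresholds only; the real values are not given in this guide.
SCORE_DROP_THRESHOLD = 15.0
ALERT_STORM_THRESHOLD = 20


def sentinel_findings(prev: ClusterSnapshot, curr: ClusterSnapshot) -> list[str]:
    """Compare two consecutive snapshots and return the triggered checks."""
    findings = []
    # Score drop: health score fell by at least the threshold since the last tick.
    if prev.health_score - curr.health_score >= SCORE_DROP_THRESHOLD:
        findings.append("score_drop")
    # Alert storm: unusually many alerts arrived within a single tick.
    if curr.alert_count >= ALERT_STORM_THRESHOLD:
        findings.append("alert_storm")
    # New Critical: the cluster entered Critical status during this tick.
    if curr.status == "Critical" and prev.status != "Critical":
        findings.append("new_critical")
    return findings


prev = ClusterSnapshot("cluster-a", 92.0, 3, "Healthy")
curr = ClusterSnapshot("cluster-a", 61.0, 27, "Critical")
print(sentinel_findings(prev, curr))
# -> ['score_drop', 'alert_storm', 'new_critical']
```

Comparing consecutive snapshots rather than absolute values is one way to flag changes early while keeping clusters that are already known to be Critical from re-triggering on every tick.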
High-Level Architecture
Quick Links
| Section | What you will find |
|---|---|
| Architecture Overview | System components and how they connect |
| Data Flow | Metric, alert, and agent data pipelines |
| Tech Stack | Languages, frameworks, infrastructure |