BYOC Agentic AI Operations Platform

Internal documentation and onboarding guide for CelerData engineers working on the BYOC cluster health monitoring and operations platform.

What is this platform?

A full-stack, AI-powered operations platform that monitors the health of CelerData's BYOC (Bring Your Own Cloud) StarRocks clusters. It combines telemetry data (~830M metric rows), customer profiles, and real-time alert streams into a unified system that detects, investigates, and reports cluster health issues with minimal human intervention.

Who is it for?

  • DW Operations engineers who triage alerts and respond to cluster incidents
  • Support engineers investigating customer-reported performance issues
  • Engineering managers who need fleet-wide health visibility
  • New team members onboarding to the BYOC operations workflow

What problems does it solve?

| Problem | Solution |
| --- | --- |
| Alert fatigue from hundreds of raw Lark alerts per day | AI-powered issue grouping merges related alerts into actionable issues |
| Manual investigation of each unhealthy cluster | Investigator Agent autonomously deep-dives and produces structured reports |
| No fleet-wide visibility across 170+ active clusters | Patrol Agent runs 2x/day fleet sweeps with cross-cluster pattern detection |
| Reactive-only operations | Sentinel detects score drops, alert storms, and new Critical clusters every 15 min |
| Scattered data across Grafana, Lark, StarRocks | Single pane of glass: React dashboard with 12 pages and a chat interface |
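
To make the Sentinel row concrete, here is a minimal Python sketch of the three checks it names: score drops, alert storms, and newly Critical clusters. It is an illustration only, not the platform's implementation. The `ClusterSnapshot` type, the `sentinel_check` function, the 0-100 score scale, and both thresholds are assumptions introduced for this example.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen for illustration only.
SCORE_DROP_THRESHOLD = 20    # points lost between two consecutive checks
ALERT_STORM_THRESHOLD = 10   # alerts seen within one 15-minute window


@dataclass(frozen=True)
class ClusterSnapshot:
    """One cluster's state at a single Sentinel run (hypothetical shape)."""
    cluster_id: str
    health_score: float  # assumed scale: 0-100, higher is healthier
    status: str          # e.g. "Healthy", "Warning", "Critical"
    alert_count: int     # alerts received in the last window


def sentinel_check(previous, current):
    """Compare two snapshot dicts keyed by cluster_id and list findings."""
    findings = []
    for cluster_id, now in current.items():
        before = previous.get(cluster_id)

        # Score drop: only measurable when an earlier snapshot exists.
        if before is not None and (
            before.health_score - now.health_score >= SCORE_DROP_THRESHOLD
        ):
            findings.append((cluster_id, "score_drop"))

        # Alert storm: too many alerts inside the current window.
        if now.alert_count >= ALERT_STORM_THRESHOLD:
            findings.append((cluster_id, "alert_storm"))

        # New Critical: Critical now, but not Critical on the previous run.
        was_critical = before is not None and before.status == "Critical"
        if now.status == "Critical" and not was_critical:
            findings.append((cluster_id, "new_critical"))
    return findings
```

In this sketch a scheduler would call `sentinel_check` every 15 minutes with the previous and current snapshots. A cluster that stays Critical across runs is not reported again as new, which keeps repeat notifications down.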

High-Level Architecture

| Section | What you will find |
| --- | --- |
| Architecture Overview | System components and how they connect |
| Data Flow | Metrics, alert, and agent data pipelines |
| Tech Stack | Languages, frameworks, infrastructure |