Agent Observability 101: What to Track and Why
Your AI agents are running in production. But do you know what they're actually doing?
The Problem: Agents Are Black Boxes
Traditional software is deterministic: call a function with the same inputs and you get the same result every time. AI agents are fundamentally different. They make decisions, call tools, and generate responses, and every run can take a completely different path.
This means traditional monitoring tools (Datadog, New Relic, Sentry) catch infrastructure issues but miss the things that actually break agents: hallucinations, cost spikes, tool call failures, and gradual quality degradation.
The 5 Pillars of Agent Observability
1. Session Tracking
Every agent run is a session — a sequence of decisions, tool calls, and outputs. You need to capture the full trajectory: what the agent decided to do, in what order, and what the outcome was.
Key metrics:
- Session duration — how long each run takes
- Session outcome — success, failure, partial completion
- Step count — how many decisions/actions per session
- Tool calls per session — are agents over-using or under-using tools?
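The metrics above can be sketched in code. Here is a minimal session record and a helper that derives duration, step count, and tool-call count from it; the type names (`Session`, `SessionStep`, `summarizeSession`) are illustrative, not a real SDK API:

```typescript
// Illustrative shapes only — a real tracer would also capture
// tool arguments, model outputs, and parent/child relationships.
type SessionStep = {
  kind: "decision" | "tool_call" | "output";
  startedAt: number; // epoch ms
  endedAt: number;   // epoch ms
};

type Session = {
  id: string;
  outcome: "success" | "failure" | "partial";
  steps: SessionStep[];
};

// Derive the key per-session metrics from the raw trajectory.
function summarizeSession(s: Session) {
  const toolCalls = s.steps.filter((st) => st.kind === "tool_call").length;
  const durationMs =
    s.steps.length === 0
      ? 0
      : Math.max(...s.steps.map((st) => st.endedAt)) -
        Math.min(...s.steps.map((st) => st.startedAt));
  return { outcome: s.outcome, stepCount: s.steps.length, toolCalls, durationMs };
}
```

Even this crude summary answers the questions above: a session with 40 steps and 0 tool calls looks very different from one with 5 steps and 4 tool calls.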
2. Cost Monitoring
AI agents can burn through API credits faster than you expect. A single runaway agent loop can cost hundreds of dollars in minutes. You need real-time cost tracking per agent, per model, per session.
Key metrics:
- Cost per session — what does each agent run actually cost?
- Token usage — input vs output tokens, by model
- Cost anomalies — sudden spikes that indicate infinite loops or prompt injection
- Model comparison — are you using the most cost-effective model for each task?
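The cost math itself is simple. Here is a sketch of per-session cost estimation plus a cheap anomaly check; the per-token rates below are placeholders, not real provider pricing, and the model names are made up:

```typescript
// Placeholder rates in dollars per 1K tokens — check your provider's
// actual price sheet before using numbers like these.
const RATES_PER_1K: Record<string, { input: number; output: number }> = {
  "model-a": { input: 0.001, output: 0.002 },
  "model-b": { input: 0.01, output: 0.03 },
};

function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  const r = RATES_PER_1K[model];
  if (!r) throw new Error(`unknown model: ${model}`);
  return (inputTokens / 1000) * r.input + (outputTokens / 1000) * r.output;
}

// Flag a session whose cost is far above the recent average — a crude
// but effective heuristic for runaway loops.
function isCostAnomaly(sessionCost: number, recentCosts: number[], factor = 5): boolean {
  if (recentCosts.length === 0) return false;
  const avg = recentCosts.reduce((a, b) => a + b, 0) / recentCosts.length;
  return sessionCost > avg * factor;
}
```

A fixed multiple of the recent average is the simplest possible detector; in practice you would also want absolute budget caps per session.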
3. Tool Call Analytics
Agents interact with the world through tools: API calls, database queries, web searches, file operations. When tools fail, agents fail. Worse, tool failures are often silent: instead of reporting an error, the agent hallucinates a plausible-looking response.
Key metrics:
- Tool success rate — which tools fail most often?
- Tool latency — p50, p95, p99 response times
- Tool usage patterns — which tools are over/under-utilized?
- Failure cascades — does one tool failure cause a chain of failures?
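One way to collect these numbers is to wrap every tool in an instrumentation layer that records latency and outcome on each invocation. The names here (`instrumentTool`, `toolCallLog`, `successRate`) are illustrative:

```typescript
type ToolCallRecord = { tool: string; ok: boolean; latencyMs: number; error?: string };

// In a real system this would be flushed to a backend, not kept in memory.
const toolCallLog: ToolCallRecord[] = [];

// Wrap an async tool function so every call is timed and logged.
function instrumentTool<A extends unknown[], R>(
  name: string,
  fn: (...args: A) => Promise<R>
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    const start = Date.now();
    try {
      const result = await fn(...args);
      toolCallLog.push({ tool: name, ok: true, latencyMs: Date.now() - start });
      return result;
    } catch (err) {
      // Record the failure instead of letting the agent paper over it.
      toolCallLog.push({ tool: name, ok: false, latencyMs: Date.now() - start, error: String(err) });
      throw err;
    }
  };
}

function successRate(tool: string): number {
  const calls = toolCallLog.filter((r) => r.tool === tool);
  if (calls.length === 0) return 1;
  return calls.filter((r) => r.ok).length / calls.length;
}
```

The same log feeds latency percentiles and usage patterns; the important design choice is that failures are recorded *before* being re-thrown, so nothing fails silently.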
4. Error Detection
Agent errors aren't like traditional errors. They don't always throw exceptions. Sometimes the agent simply gives a wrong answer, makes up data, or gets stuck in a loop. You need detection that goes beyond stack traces.
Key signals:
- Explicit errors — exceptions, API failures, timeouts
- Behavioral anomalies — unusually long sessions, repeated tool calls, circular reasoning
- User frustration — follow-up corrections, rephrased requests, abandonments
- Output quality drift — gradual degradation that's hard to notice day-to-day
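To make one of these signals concrete: an agent re-issuing the same tool call with the same arguments is a classic loop symptom, and it is cheap to detect. A minimal sketch (names and threshold are illustrative):

```typescript
// Arguments are assumed to be serialized (e.g. JSON) so identical
// calls compare equal as strings.
type ToolCall = { tool: string; args: string };

// Flag a session where the same tool+arguments pair repeats too often —
// a simple behavioral-anomaly signal, not a complete detector.
function looksLikeLoop(calls: ToolCall[], threshold = 3): boolean {
  const counts = new Map<string, number>();
  for (const c of calls) {
    const key = `${c.tool}:${c.args}`;
    const n = (counts.get(key) ?? 0) + 1;
    counts.set(key, n);
    if (n >= threshold) return true;
  }
  return false;
}
```

Signals like this are heuristics: they produce candidates for review, not verdicts, and work best combined with the explicit-error and drift signals above.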
5. Daily Digest
Nobody has time to stare at dashboards all day. You need an automated daily summary: what happened, what went wrong, and what needs attention. Think of it as your agents' morning standup.
A good daily digest includes:
- Total sessions, success rate, and cost
- Top errors and new error patterns
- Cost anomalies and model performance comparison
- Action items (things that need human attention)
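The digest is ultimately a fold over a day's sessions. Here is a sketch of that roll-up; the `SessionSummary` shape and `buildDigest` function are illustrative:

```typescript
type SessionSummary = {
  outcome: "success" | "failure" | "partial";
  cost: number;       // dollars
  errors: string[];   // error labels observed in the session
};

// Collapse a day's sessions into the headline numbers a standup
// email would lead with.
function buildDigest(sessions: SessionSummary[]) {
  const total = sessions.length;
  const successes = sessions.filter((s) => s.outcome === "success").length;
  const totalCost = sessions.reduce((sum, s) => sum + s.cost, 0);

  // Count error labels and keep the five most frequent.
  const errorCounts = new Map<string, number>();
  for (const s of sessions)
    for (const e of s.errors) errorCounts.set(e, (errorCounts.get(e) ?? 0) + 1);
  const topErrors = [...errorCounts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, 5)
    .map(([error, count]) => ({ error, count }));

  return {
    totalSessions: total,
    successRate: total === 0 ? 0 : successes / total,
    totalCost,
    topErrors,
  };
}
```

Detecting *new* error patterns and generating action items takes more machinery (a baseline of yesterday's digest, at minimum), but the headline numbers reduce to an aggregation like this.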
Why Traditional APM Doesn't Work
Tools like Datadog and New Relic are built for request-response architectures. They track HTTP latency, error rates, and infrastructure health. But agents are multi-step, non-deterministic processes that span multiple API calls, tool invocations, and decision points.
Monitoring an AI agent with traditional APM is like monitoring a conversation with a ping test. You know it's happening, but you have no idea what's being said.
Getting Started
The good news: you don't need to build this from scratch. Canary gives you all five pillars out of the box with a 2-line SDK integration:
```typescript
import { Canary } from '@canary/sdk';

const canary = new Canary({ apiKey: 'ck_...' });
```

Every LLM call, tool invocation, and session outcome is automatically captured. You get a dashboard, alerts, and a daily digest — without writing any monitoring code.
Start monitoring your agents for free →
What's Next
In upcoming posts, we'll cover:
- How to set up cost budgets and alerts for AI agents
- Detecting prompt injection attacks in production
- A/B testing prompts and models with real traffic
- Building an error budget for AI agent quality