Agent Observability 101: What to Track and Why
Your AI agents are running in production. But do you know what they're actually doing?
The Problem: Agents Are Black Boxes
Traditional software is deterministic: call a function with the same inputs and you get the same result every time. AI agents are fundamentally different. They make decisions, call tools, and generate responses, and every run can take a completely different path.
This means traditional monitoring tools (Datadog, New Relic, Sentry) catch infrastructure issues but miss the things that actually break agents: hallucinations, cost spikes, tool call failures, and gradual quality degradation.
The 5 Pillars of Agent Observability
1. Session Tracking
Every agent run is a session — a sequence of decisions, tool calls, and outputs. You need to capture the full trajectory: what the agent decided to do, in what order, and what the outcome was.
Key metrics:
- Session duration — how long each run takes
- Session outcome — success, failure, partial completion
- Step count — how many decisions/actions per session
- Tool calls per session — are agents over-using or under-using tools?
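The metrics above can be sketched in code. Here is a minimal session record and a helper that derives duration, step count, and tool-call count from it; the type names (`Session`, `SessionStep`, `summarizeSession`) are illustrative, not a real SDK API:

```typescript
// Illustrative shapes only — a real tracer would also capture
// tool arguments, model outputs, and parent/child relationships.
type SessionStep = {
  kind: "decision" | "tool_call" | "output";
  startedAt: number; // epoch ms
  endedAt: number;   // epoch ms
};

type Session = {
  id: string;
  outcome: "success" | "failure" | "partial";
  steps: SessionStep[];
};

// Derive the key per-session metrics from the raw trajectory.
function summarizeSession(s: Session) {
  const toolCalls = s.steps.filter((st) => st.kind === "tool_call").length;
  const durationMs =
    s.steps.length === 0
      ? 0
      : Math.max(...s.steps.map((st) => st.endedAt)) -
        Math.min(...s.steps.map((st) => st.startedAt));
  return { outcome: s.outcome, stepCount: s.steps.length, toolCalls, durationMs };
}
```

Even this crude summary answers the questions above: a session with 40 steps and 0 tool calls looks very different from one with 5 steps and 4 tool calls.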
2. Cost Monitoring
AI agents can burn through API credits faster than you expect. A single runaway agent loop can cost hundreds of dollars in minutes. You need real-time cost tracking per agent, per model, per session.
Key metrics:
- Cost per session — what does each agent run actually cost?
- Token usage — input vs output tokens, by model
- Cost anomalies — sudden spikes that indicate infinite loops or prompt injection
- Model comparison — are you using the most cost-effective model for each task?
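The cost math itself is simple. Here is a sketch of per-session cost estimation plus a cheap anomaly check; the per-token rates below are placeholders, not real provider pricing, and the model names are made up:

```typescript
// Placeholder rates in dollars per 1K tokens — check your provider's
// actual price sheet before using numbers like these.
const RATES_PER_1K: Record<string, { input: number; output: number }> = {
  "model-a": { input: 0.001, output: 0.002 },
  "model-b": { input: 0.01, output: 0.03 },
};

function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  const r = RATES_PER_1K[model];
  if (!r) throw new Error(`unknown model: ${model}`);
  return (inputTokens / 1000) * r.input + (outputTokens / 1000) * r.output;
}

// Flag a session whose cost is far above the recent average — a crude
// but effective heuristic for runaway loops.
function isCostAnomaly(sessionCost: number, recentCosts: number[], factor = 5): boolean {
  if (recentCosts.length === 0) return false;
  const avg = recentCosts.reduce((a, b) => a + b, 0) / recentCosts.length;
  return sessionCost > avg * factor;
}
```

A fixed multiple of the recent average is the simplest possible detector; in practice you would also want absolute budget caps per session.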
3. Tool Call Analytics
Agents interact with the world through tools: API calls, database queries, web searches, file operations. When tools fail, agents fail. Worse, tool failures are often silent: instead of reporting an error, the agent hallucinates a plausible-looking response.
Key metrics:
- Tool success rate — which tools fail most often?
- Tool latency — p50, p95, p99 response times
- Tool usage patterns — which tools are over/under-utilized?
- Failure cascades — does one tool failure cause a chain of failures?
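One way to collect these numbers is to wrap every tool in an instrumentation layer that records latency and outcome on each invocation. The names here (`instrumentTool`, `toolCallLog`, `successRate`) are illustrative:

```typescript
type ToolCallRecord = { tool: string; ok: boolean; latencyMs: number; error?: string };

// In a real system this would be flushed to a backend, not kept in memory.
const toolCallLog: ToolCallRecord[] = [];

// Wrap an async tool function so every call is timed and logged.
function instrumentTool<A extends unknown[], R>(
  name: string,
  fn: (...args: A) => Promise<R>
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    const start = Date.now();
    try {
      const result = await fn(...args);
      toolCallLog.push({ tool: name, ok: true, latencyMs: Date.now() - start });
      return result;
    } catch (err) {
      // Record the failure instead of letting the agent paper over it.
      toolCallLog.push({ tool: name, ok: false, latencyMs: Date.now() - start, error: String(err) });
      throw err;
    }
  };
}

function successRate(tool: string): number {
  const calls = toolCallLog.filter((r) => r.tool === tool);
  if (calls.length === 0) return 1;
  return calls.filter((r) => r.ok).length / calls.length;
}
```

The same log feeds latency percentiles and usage patterns; the important design choice is that failures are recorded *before* being re-thrown, so nothing fails silently.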
4. Error Detection
Agent errors aren't like traditional errors. They don't always throw exceptions. Sometimes the agent simply gives a wrong answer, makes up data, or gets stuck in a loop. You need detection that goes beyond stack traces.
Key signals:
- Explicit errors — exceptions, API failures, timeouts
- Behavioral anomalies — unusually long sessions, repeated tool calls, circular reasoning
- User frustration — follow-up corrections, rephrased requests, abandonments
- Output quality drift — gradual degradation that's hard to notice day-to-day
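To make one of these signals concrete: an agent re-issuing the same tool call with the same arguments is a classic loop symptom, and it is cheap to detect. A minimal sketch (names and threshold are illustrative):

```typescript
// Arguments are assumed to be serialized (e.g. JSON) so identical
// calls compare equal as strings.
type ToolCall = { tool: string; args: string };

// Flag a session where the same tool+arguments pair repeats too often —
// a simple behavioral-anomaly signal, not a complete detector.
function looksLikeLoop(calls: ToolCall[], threshold = 3): boolean {
  const counts = new Map<string, number>();
  for (const c of calls) {
    const key = `${c.tool}:${c.args}`;
    const n = (counts.get(key) ?? 0) + 1;
    counts.set(key, n);
    if (n >= threshold) return true;
  }
  return false;
}
```

Signals like this are heuristics: they produce candidates for review, not verdicts, and work best combined with the explicit-error and drift signals above.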
5. Daily Digest
Nobody has time to stare at dashboards all day. You need an automated daily summary: what happened, what went wrong, and what needs attention. Think of it as your agents' morning standup.
A good daily digest includes:
- Total sessions, success rate, and cost
- Top errors and new error patterns
- Cost anomalies and model performance comparison
- Action items (things that need human attention)
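The digest is ultimately a fold over a day's sessions. Here is a sketch of that roll-up; the `SessionSummary` shape and `buildDigest` function are illustrative:

```typescript
type SessionSummary = {
  outcome: "success" | "failure" | "partial";
  cost: number;       // dollars
  errors: string[];   // error labels observed in the session
};

// Collapse a day's sessions into the headline numbers a standup
// email would lead with.
function buildDigest(sessions: SessionSummary[]) {
  const total = sessions.length;
  const successes = sessions.filter((s) => s.outcome === "success").length;
  const totalCost = sessions.reduce((sum, s) => sum + s.cost, 0);

  // Count error labels and keep the five most frequent.
  const errorCounts = new Map<string, number>();
  for (const s of sessions)
    for (const e of s.errors) errorCounts.set(e, (errorCounts.get(e) ?? 0) + 1);
  const topErrors = [...errorCounts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, 5)
    .map(([error, count]) => ({ error, count }));

  return {
    totalSessions: total,
    successRate: total === 0 ? 0 : successes / total,
    totalCost,
    topErrors,
  };
}
```

Detecting *new* error patterns and generating action items takes more machinery (a baseline of yesterday's digest, at minimum), but the headline numbers reduce to an aggregation like this.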
Why Traditional APM Doesn't Work
Tools like Datadog and New Relic are built for request-response architectures. They track HTTP latency, error rates, and infrastructure health. But agents are multi-step, non-deterministic processes that span multiple API calls, tool invocations, and decision points.
Monitoring an AI agent with traditional APM is like monitoring a conversation with a ping test. You know it's happening, but you have no idea what's being said.
Getting Started
The good news: you don't need to build this from scratch. Canary gives you all five pillars out of the box with a 2-line SDK integration:
```typescript
import { Canary } from '@canary/sdk';

const canary = new Canary({ apiKey: 'ck_...' });
```

Every LLM call, tool invocation, and session outcome is automatically captured. You get a dashboard, alerts, and a daily digest — without writing any monitoring code.
Start monitoring your agents for free →
What's Next
In upcoming posts, we'll cover:
- How to set up cost budgets and alerts for AI agents
- Detecting prompt injection attacks in production
- A/B testing prompts and models with real traffic
- Building an error budget for AI agent quality