February 19, 2026 · 8 min read

Agent Observability 101: What to Track and Why

Your AI agents are running in production. But do you know what they're actually doing?

The Problem: Agents Are Black Boxes

Traditional software is deterministic. Call a function and it returns the same result every time. AI agents are fundamentally different. They make decisions, call tools, generate responses — and every run can take a completely different path.

This means traditional monitoring tools (Datadog, New Relic, Sentry) catch infrastructure issues but miss the things that actually break agents: hallucinations, cost spikes, tool call failures, and gradual quality degradation.

The 5 Pillars of Agent Observability

1. Session Tracking

Every agent run is a session — a sequence of decisions, tool calls, and outputs. You need to capture the full trajectory: what the agent decided to do, in what order, and what the outcome was.

Key metrics:

  • Session duration — how long each run takes
  • Session outcome — success, failure, partial completion
  • Step count — how many decisions/actions per session
  • Tool calls per session — are agents over-using or under-using tools?
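The trajectory capture above can be sketched as a plain data structure plus one metrics pass. The types and field names here are illustrative, not the Canary SDK's actual schema:

```typescript
// Illustrative session model — not the Canary SDK's real types.
type StepKind = "decision" | "tool_call" | "output";

interface SessionStep {
  kind: StepKind;
  name: string;       // e.g. a tool name or a decision label
  durationMs: number;
  ok: boolean;
}

interface SessionRecord {
  sessionId: string;
  outcome: "success" | "failure" | "partial";
  steps: SessionStep[];
}

// Derive the key metrics listed above from a captured trajectory.
function sessionMetrics(s: SessionRecord) {
  const toolCalls = s.steps.filter((st) => st.kind === "tool_call");
  const durationMs = s.steps.reduce((sum, st) => sum + st.durationMs, 0);
  return {
    outcome: s.outcome,
    stepCount: s.steps.length,
    toolCallCount: toolCalls.length,
    durationMs,
  };
}
```

The point of keeping the full step list (rather than just the rollup) is that the aggregates above can always be recomputed, but a lost trajectory can't be reconstructed.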

2. Cost Monitoring

AI agents can burn through API credits faster than you expect. A single runaway agent loop can cost hundreds of dollars in minutes. You need real-time cost tracking per agent, per model, per session.

Key metrics:

  • Cost per session — what does each agent run actually cost?
  • Token usage — input vs output tokens, by model
  • Cost anomalies — sudden spikes that indicate infinite loops or prompt injection
  • Model comparison — are you using the most cost-effective model for each task?
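As a rough sketch of per-session cost tracking, you can price each LLM call from a per-model rate table and flag sessions that blow past the recent average. The model names and prices below are placeholders, not real vendor rates:

```typescript
// Placeholder per-1M-token prices in USD — NOT real vendor pricing;
// substitute current rates for the models you actually run.
const PRICES: Record<string, { inPerM: number; outPerM: number }> = {
  "small-model": { inPerM: 0.15, outPerM: 0.6 },
  "large-model": { inPerM: 3.0, outPerM: 15.0 },
};

interface LlmCall {
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Sum the cost of every LLM call in a session.
function sessionCost(calls: LlmCall[]): number {
  return calls.reduce((total, c) => {
    const p = PRICES[c.model];
    if (!p) throw new Error(`unknown model: ${c.model}`);
    return (
      total +
      (c.inputTokens / 1e6) * p.inPerM +
      (c.outputTokens / 1e6) * p.outPerM
    );
  }, 0);
}

// Crude anomaly check: a session far above the recent mean cost is
// worth a look — runaway loops show up here first.
function isCostAnomaly(cost: number, recentMean: number, factor = 5): boolean {
  return recentMean > 0 && cost > recentMean * factor;
}
```

A fixed multiplier is the simplest possible detector; in practice you'd likely want per-agent baselines, since a research agent and a classifier agent have very different "normal" costs.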

3. Tool Call Analytics

Agents interact with the world through tools — API calls, database queries, web searches, file operations. When tools fail, agents fail. But they often fail silently, hallucinating a response instead of reporting an error.

Key metrics:

  • Tool success rate — which tools fail most often?
  • Tool latency — p50, p95, p99 response times
  • Tool usage patterns — which tools are over/under-utilized?
  • Failure cascades — does one tool failure cause a chain of failures?
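Success rates and latency percentiles per tool fall out of the raw call records with a small aggregation pass. This is a generic sketch using nearest-rank percentiles, not Canary's implementation:

```typescript
interface ToolCall {
  tool: string;
  latencyMs: number;
  ok: boolean;
}

// Nearest-rank percentile; p is in [0, 100].
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Per-tool success rate and p50/p95/p99 latency.
function toolStats(calls: ToolCall[]) {
  const byTool = new Map<string, ToolCall[]>();
  for (const c of calls) {
    const bucket = byTool.get(c.tool);
    if (bucket) bucket.push(c);
    else byTool.set(c.tool, [c]);
  }
  const stats: Record<
    string,
    { successRate: number; p50: number; p95: number; p99: number }
  > = {};
  for (const [tool, group] of byTool) {
    const latencies = group.map((c) => c.latencyMs);
    stats[tool] = {
      successRate: group.filter((c) => c.ok).length / group.length,
      p50: percentile(latencies, 50),
      p95: percentile(latencies, 95),
      p99: percentile(latencies, 99),
    };
  }
  return stats;
}
```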

4. Error Detection

Agent errors aren't like traditional errors. They don't always throw exceptions. Sometimes the agent simply gives a wrong answer, makes up data, or gets stuck in a loop. You need detection that goes beyond stack traces.

Key signals:

  • Explicit errors — exceptions, API failures, timeouts
  • Behavioral anomalies — unusually long sessions, repeated tool calls, circular reasoning
  • User frustration — follow-up corrections, rephrased requests, abandonments
  • Output quality drift — gradual degradation that's hard to notice day-to-day
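One of the cheapest behavioral checks from the list above is loop detection: flag a session that issues the same tool call with the same arguments too many times. A minimal sketch, with an arbitrary threshold:

```typescript
// A step reduced to just what loop detection needs; args are assumed
// to be serialized deterministically (e.g. canonical JSON).
interface StepEvent {
  tool: string;
  args: string;
}

// Returns true once any identical (tool, args) pair repeats more than
// maxRepeats times — a common symptom of an agent stuck in a loop.
function looksLikeLoop(steps: StepEvent[], maxRepeats = 3): boolean {
  const counts = new Map<string, number>();
  for (const s of steps) {
    const key = `${s.tool}:${s.args}`;
    const n = (counts.get(key) ?? 0) + 1;
    counts.set(key, n);
    if (n > maxRepeats) return true;
  }
  return false;
}
```

Exact-match repetition is deliberately conservative; it misses loops where the agent varies its arguments slightly each iteration, which need fuzzier signals like session length or circular-reasoning detection.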

5. Daily Digest

Nobody has time to stare at dashboards all day. You need an automated daily summary: what happened, what went wrong, and what needs attention. Think of it as your agents' morning standup.

A good daily digest includes:

  • Total sessions, success rate, and cost
  • Top errors and new error patterns
  • Cost anomalies and model performance comparison
  • Action items (things that need human attention)
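Assembling that digest from a day's session summaries is mostly aggregation. A sketch with illustrative field names (not a real Canary payload):

```typescript
// One rolled-up session from the day — illustrative shape.
interface SessionSummary {
  ok: boolean;
  costUsd: number;
  error?: string; // error label, if the session failed
}

// Roll a day's sessions into the digest fields listed above.
function dailyDigest(sessions: SessionSummary[]) {
  const total = sessions.length;
  const successes = sessions.filter((s) => s.ok).length;

  // Count occurrences of each error label, then keep the top three.
  const errorCounts = new Map<string, number>();
  for (const s of sessions) {
    if (s.error) errorCounts.set(s.error, (errorCounts.get(s.error) ?? 0) + 1);
  }
  const topErrors = [...errorCounts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, 3)
    .map(([error, count]) => ({ error, count }));

  return {
    totalSessions: total,
    successRate: total > 0 ? successes / total : 0,
    totalCostUsd: sessions.reduce((sum, s) => sum + s.costUsd, 0),
    topErrors,
  };
}
```

Detecting *new* error patterns (versus merely frequent ones) needs yesterday's digest for comparison, which is exactly why the digest is worth persisting rather than recomputing ad hoc.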

Why Traditional APM Doesn't Work

Tools like Datadog and New Relic are built for request-response architectures. They track HTTP latency, error rates, and infrastructure health. But agents are multi-step, non-deterministic processes that span multiple API calls, tool invocations, and decision points.

Monitoring an AI agent with traditional APM is like monitoring a conversation with a ping test. You know it's happening, but you have no idea what's being said.

Getting Started

The good news: you don't need to build this from scratch. Canary gives you all five pillars out of the box with a 2-line SDK integration:

import { Canary } from '@canary/sdk';
const canary = new Canary({ apiKey: 'ck_...' });

Every LLM call, tool invocation, and session outcome is automatically captured. You get a dashboard, alerts, and a daily digest — without writing any monitoring code.

Start monitoring your agents for free →

What's Next

In upcoming posts, we'll cover:

  • How to set up cost budgets and alerts for AI agents
  • Detecting prompt injection attacks in production
  • A/B testing prompts and models with real traffic
  • Building an error budget for AI agent quality