The 5 Metrics Every AI Agent Team Should Track
Most teams either track nothing meaningful or drown in metrics that don't drive decisions. Here are the 5 that actually matter.
Metric 1: Cost Per Completed Task
Not cost per API call. Not monthly LLM spend. Cost per completed task — the total dollars spent to produce one unit of value for your user.
Why this metric specifically:
- It normalizes for complexity. A 20-step research task should cost more than a 3-step classification.
- It catches efficiency regressions. If cost per task creeps up, something changed — longer prompts, more retries, model upgrades, context bloat.
- It directly maps to unit economics. If your product charges $0.10 per task and it costs $0.15 to complete, you have a problem.
How to Measure
from agentops import Session
session = Session(tags=["support-resolution"])
result = support_agent.resolve(ticket)
session.end(
    result="success" if result.resolved else "failed",
    metadata={"ticket_id": ticket.id, "category": ticket.category}
)
# session.total_cost has the full cost
# Aggregate: AVG(cost) WHERE tag = "support-resolution" AND result = "success"
What Good Looks Like
Track three numbers: median, P95, and P99 cost per task type.
A high P99 with a normal median means you have runaway sessions — 1% of tasks are costing 50-100x the norm. Those are loops, retries, or context explosions. Fix them and your bill drops dramatically.
Alert on: P99 cost > 10x median, sustained over 1 hour.
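As a rough sketch of that alert in plain Python (the `cost_alert` helper and the cost data are illustrative, not part of any SDK):

```python
# Sketch: median/P95/P99 per-task cost plus the runaway-session alert.
# In practice the costs come from your sessions store, filtered by task type.
from statistics import median, quantiles

def cost_alert(costs, ratio=10.0):
    """Return (median, p95, p99, alert) for a list of per-task costs."""
    cuts = quantiles(costs, n=100)  # 99 percentile cut points
    med = median(costs)
    p95, p99 = cuts[94], cuts[98]
    return med, p95, p99, p99 > ratio * med

# 99 normal tasks plus one runaway session costing 100x the norm
med, p95, p99, alert = cost_alert([0.05] * 99 + [5.00])
```

Note that the runaway session trips the alert while the median stays flat, which is exactly the signature described above.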
Metric 2: Task Success Rate
What percentage of agent sessions actually accomplish their goal? Most teams measure completion (did the agent return a response?) not success (was the response correct and useful?).
Three flavors of success measurement:
Automated Evaluation
def evaluate_agent_output(query, response, context):
    """Return a 0-1 quality score for an agent response."""
    eval_prompt = f"""
    User query: {query}
    Agent response: {response}
    Available context: {context}

    Rate the response quality from 0-1:
    - 1.0: Correct, complete, well-sourced
    - 0.7: Mostly correct, minor gaps
    - 0.4: Partially correct, significant gaps
    - 0.0: Incorrect or hallucinated
    """
    score = eval_llm.score(eval_prompt)
    session.record_metric("quality_score", score)
    return score
User Feedback
# Track user signals
session.record_event("user_feedback", {
    "type": "thumbs_up",
    # or: "thumbs_down", "follow_up_question", "escalated_to_human"
    "timestamp": now()
})
What Good Looks Like
- Support agents: >85% resolution without escalation
- RAG/Q&A agents: >80% quality score (automated eval)
- Code generation: >70% accepted without edits
Alert on: Success rate drops >10% compared to 7-day rolling average.
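This alert reduces to a few lines; a minimal sketch, assuming a relative drop against the rolling average (the daily rates and the `success_rate_alert` helper are hypothetical):

```python
# Sketch: alert when today's success rate drops more than 10% (relative)
# below the 7-day rolling average. Rates are hypothetical daily fractions.
def success_rate_alert(last_7_days, today, max_drop=0.10):
    baseline = sum(last_7_days) / len(last_7_days)
    return today < baseline * (1 - max_drop), baseline

alert, baseline = success_rate_alert(
    [0.86, 0.84, 0.85, 0.87, 0.85, 0.86, 0.87],  # healthy week
    today=0.72,                                   # sudden regression
)
```

Whether "10%" means relative or absolute (percentage points) is a team convention; pick one and document it next to the alert.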
Metric 3: End-to-End Latency
Users don't care how many LLM calls your agent makes. They care how long it takes to get an answer. Track the full session duration from user input to final response.
But also decompose it — because when latency spikes, you need to know where:
trace = session.get_trace()
llm_time_ms = sum(s.duration_ms for s in trace.llm_spans)
tool_time_ms = sum(s.duration_ms for s in trace.tool_spans)
breakdown = {
    "total_ms": trace.duration_ms,
    "llm_time_ms": llm_time_ms,
    "tool_time_ms": tool_time_ms,
    "overhead_ms": trace.duration_ms - llm_time_ms - tool_time_ms,
    "llm_calls": len(trace.llm_spans),
    "time_to_first_token_ms": trace.first_token_ms,
}
Common latency patterns:
- LLM-bound: 80%+ of time in LLM calls. Optimize with smaller models, shorter prompts, or parallel calls.
- Tool-bound: Slow database queries, API timeouts. Fix the tools, not the agent.
- Overhead-bound: Framework serialization, context assembly. Your orchestration code needs work.
- Retry-bound: Agent keeps retrying failed steps. Fix the root cause.
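A minimal classifier over the breakdown dict might look like this (the thresholds are illustrative starting points, not canon; tune them against your own traces):

```python
# Sketch: bucket a session into a latency pattern from its time breakdown.
# Threshold values are assumptions, not a standard.
def classify_latency(b):
    total = b["total_ms"]
    if b["llm_time_ms"] / total >= 0.8:
        return "llm-bound"
    if b["tool_time_ms"] / total >= 0.5:
        return "tool-bound"
    if b["overhead_ms"] / total >= 0.3:
        return "overhead-bound"
    return "mixed"

pattern = classify_latency({
    "total_ms": 12_000,
    "llm_time_ms": 10_200,  # 85% of wall time in LLM calls
    "tool_time_ms": 1_300,
    "overhead_ms": 500,
})
```

Detecting the retry-bound pattern needs more than timings: you'd also count repeated spans for the same step in the trace.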
Alert on: P95 latency > 2x the 7-day average.
Metric 4: Tool Reliability
Your agent is only as reliable as its tools. If the search API returns garbage 20% of the time, your agent will confidently process that garbage 20% of the time.
tool_metrics = {
    "tool_name": "search_knowledge_base",
    "call_count": 1547,
    "success_rate": 0.94,
    "avg_latency_ms": 230,
    "p99_latency_ms": 1800,
    "empty_result_rate": 0.12,
    "used_by_agent_rate": 0.78,  # Agent actually used the result
}
That last metric — used_by_agent_rate — is crucial. If the agent calls a tool and then ignores its output 40% of the time, either the tool returns unhelpful results or the agent doesn't know how to use them. Both are bugs.
Alert on: Any tool with success rate <90% or used-by-agent rate <60%.
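These numbers fall out of simple aggregation over raw call records; a sketch, assuming a hypothetical record shape of `(succeeded, latency_ms, result_empty, result_used_by_agent)`:

```python
# Sketch: derive per-tool reliability metrics from raw call records.
# The record tuple shape and the sample data are assumptions for illustration.
def tool_metrics(name, calls):
    n = len(calls)
    return {
        "tool_name": name,
        "call_count": n,
        "success_rate": sum(c[0] for c in calls) / n,
        "avg_latency_ms": sum(c[1] for c in calls) / n,
        "empty_result_rate": sum(c[2] for c in calls) / n,
        "used_by_agent_rate": sum(c[3] for c in calls) / n,
    }

m = tool_metrics("search_knowledge_base", [
    (True, 200, False, True),
    (True, 260, True, False),    # empty result the agent ignored
    (False, 900, False, False),  # failed call
    (True, 240, False, True),
])
```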
Metric 5: Token Efficiency
Not just "how many tokens did we use" — but how many tokens per unit of useful output. This tells you whether your prompts are efficient or whether you're burning money on context that doesn't help.
session_stats = {
    "total_input_tokens": 45_200,
    "total_output_tokens": 3_800,
    "useful_output_tokens": 1_200,
    "input_output_ratio": 11.9,  # ~12 input tokens per output token
    "context_growth_rate": 1.4,  # Context grows 40% per step
    "system_prompt_overhead": 0.35,  # 35% of input tokens are system prompt
}
Key ratios to watch:
- Input-to-output ratio. 5:1 is efficient. 20:1 means massive contexts for small outputs. 50:1 means something is very wrong.
- Context growth rate. Linear growth (1.0-1.2x per step) is manageable. Super-linear (>1.5x per step) means you're appending full outputs without pruning.
- System prompt overhead. If your system prompt is 2,000 tokens and you make 15 LLM calls per session, that's 30,000 tokens just in repeated system prompts.
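All three ratios can be computed from per-call token counts; a sketch, where the `token_efficiency` helper and the sample counts are illustrative:

```python
# Sketch: token-efficiency ratios from per-LLM-call token counts.
# steps: list of (input_tokens, output_tokens); the numbers are hypothetical.
def token_efficiency(steps, system_prompt_tokens):
    total_in = sum(s[0] for s in steps)
    total_out = sum(s[1] for s in steps)
    growth = [steps[i][0] / steps[i - 1][0] for i in range(1, len(steps))]
    return {
        "input_output_ratio": total_in / total_out,
        "context_growth_rate": sum(growth) / len(growth),
        "system_prompt_overhead": system_prompt_tokens * len(steps) / total_in,
    }

stats = token_efficiency(
    steps=[(2_000, 300), (2_800, 250), (3_900, 280)],  # context growing ~1.4x/step
    system_prompt_tokens=800,
)
```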
Optimization Example
# Before: 45K input tokens per session
# System prompt: 2,100 tokens × 12 calls = 25,200 (56%!)
# Full history appended each step: grows 8K per step
# After: 18K input tokens per session
# System prompt: 800 tokens × 12 calls = 9,600 (53%, lower absolute)
# Summarized history after step 5: caps at ~3K tokens
# Net savings: 60% fewer input tokens → 60% lower cost
Alert on: Input-to-output ratio > 20:1 sustained, or context growth rate > 1.5x per step.
Putting It All Together
These five metrics give you a complete picture of agent health:
- Cost per task — unit economics health
- Task success rate — quality of agent output
- E2E latency — user experience
- Tool reliability — infrastructure health
- Token efficiency — prompt/architecture health
Together, they answer: Is my agent system working, and is it getting better or worse?
Start Tracking Today
You don't need a data team to implement these. Canary captures all five metrics automatically — cost attribution, success tracking, latency breakdown, tool reliability, and token efficiency — from a single SDK integration.
import agentops
agentops.init()
# All 5 metrics are now being captured for every agent session.
Start measuring what matters for free →
The teams that ship reliable agents aren't the ones with the best models — they're the ones with the best visibility into how those models perform.