The 5 Metrics Every AI Agent Team Should Track
Most teams either track nothing meaningful or drown in metrics that don't drive decisions. Here are the 5 that actually matter.
Metric 1: Cost Per Completed Task
Not cost per API call. Not monthly LLM spend. Cost per completed task — the total dollars spent to produce one unit of value for your user.
Why this metric specifically:
- It normalizes for complexity. A 20-step research task should cost more than a 3-step classification.
- It catches efficiency regressions. If cost per task creeps up, something changed — longer prompts, more retries, model upgrades, context bloat.
- It directly maps to unit economics. If your product charges $0.10 per task and it costs $0.15 to complete, you have a problem.
How to Measure
from agentops import Session
session = Session(tags=["support-resolution"])
result = support_agent.resolve(ticket)
session.end(
    result="success" if result.resolved else "failed",
    metadata={"ticket_id": ticket.id, "category": ticket.category}
)
# session.total_cost has the full cost
# Aggregate: AVG(cost) WHERE tag = "support-resolution" AND result = "success"
What Good Looks Like
Track three numbers: median, P95, and P99 cost per task type.
A high P99 with a normal median means you have runaway sessions — 1% of tasks are costing 50-100x the norm. Those are loops, retries, or context explosions. Fix them and your bill drops dramatically.
Alert on: P99 cost > 10x median, sustained over 1 hour.
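As a rough sketch of that alert in plain Python (the `cost_alert` helper and the cost data are illustrative, not part of any SDK):

```python
# Sketch: median/P95/P99 per-task cost plus the runaway-session alert.
# In practice the costs come from your sessions store, filtered by task type.
from statistics import median, quantiles

def cost_alert(costs, ratio=10.0):
    """Return (median, p95, p99, alert) for a list of per-task costs."""
    cuts = quantiles(costs, n=100)  # 99 percentile cut points
    med = median(costs)
    p95, p99 = cuts[94], cuts[98]
    return med, p95, p99, p99 > ratio * med

# 99 normal tasks plus one runaway session costing 100x the norm
med, p95, p99, alert = cost_alert([0.05] * 99 + [5.00])
```

Note that the runaway session trips the alert while the median stays flat, which is exactly the signature described above.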
Metric 2: Task Success Rate
What percentage of agent sessions actually accomplish their goal? Most teams measure completion (did the agent return a response?) not success (was the response correct and useful?).
Three flavors of success measurement:
Automated Evaluation
def evaluate_agent_output(query, response, context):
    """Return a 0-1 quality score for an agent response."""
    eval_prompt = f"""
    User query: {query}
    Agent response: {response}
    Available context: {context}

    Rate the response quality from 0-1:
    - 1.0: Correct, complete, well-sourced
    - 0.7: Mostly correct, minor gaps
    - 0.4: Partially correct, significant gaps
    - 0.0: Incorrect or hallucinated
    """
    score = eval_llm.score(eval_prompt)
    session.record_metric("quality_score", score)
    return score
User Feedback
# Track user signals
session.record_event("user_feedback", {
    "type": "thumbs_up",
    # or: "thumbs_down", "follow_up_question", "escalated_to_human"
    "timestamp": now()
})
What Good Looks Like
- Support agents: >85% resolution without escalation
- RAG/Q&A agents: >80% quality score (automated eval)
- Code generation: >70% accepted without edits
Alert on: Success rate drops >10% compared to 7-day rolling average.
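This alert reduces to a few lines; a minimal sketch, assuming a relative drop against the rolling average (the daily rates and the `success_rate_alert` helper are hypothetical):

```python
# Sketch: alert when today's success rate drops more than 10% (relative)
# below the 7-day rolling average. Rates are hypothetical daily fractions.
def success_rate_alert(last_7_days, today, max_drop=0.10):
    baseline = sum(last_7_days) / len(last_7_days)
    return today < baseline * (1 - max_drop), baseline

alert, baseline = success_rate_alert(
    [0.86, 0.84, 0.85, 0.87, 0.85, 0.86, 0.87],  # healthy week
    today=0.72,                                   # sudden regression
)
```

Whether "10%" means relative or absolute (percentage points) is a team convention; pick one and document it next to the alert.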
Metric 3: End-to-End Latency
Users don't care how many LLM calls your agent makes. They care how long it takes to get an answer. Track the full session duration from user input to final response.
But also decompose it — because when latency spikes, you need to know where:
trace = session.get_trace()
llm_time_ms = sum(s.duration_ms for s in trace.llm_spans)
tool_time_ms = sum(s.duration_ms for s in trace.tool_spans)
breakdown = {
    "total_ms": trace.duration_ms,
    "llm_time_ms": llm_time_ms,
    "tool_time_ms": tool_time_ms,
    "overhead_ms": trace.duration_ms - llm_time_ms - tool_time_ms,
    "llm_calls": len(trace.llm_spans),
    "time_to_first_token_ms": trace.first_token_ms,
}
Common latency patterns:
- LLM-bound: 80%+ of time in LLM calls. Optimize with smaller models, shorter prompts, or parallel calls.
- Tool-bound: Slow database queries, API timeouts. Fix the tools, not the agent.
- Overhead-bound: Framework serialization, context assembly. Your orchestration code needs work.
- Retry-bound: Agent keeps retrying failed steps. Fix the root cause.
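A minimal classifier over the breakdown dict might look like this (the thresholds are illustrative starting points, not canon; tune them against your own traces):

```python
# Sketch: bucket a session into a latency pattern from its time breakdown.
# Threshold values are assumptions, not a standard.
def classify_latency(b):
    total = b["total_ms"]
    if b["llm_time_ms"] / total >= 0.8:
        return "llm-bound"
    if b["tool_time_ms"] / total >= 0.5:
        return "tool-bound"
    if b["overhead_ms"] / total >= 0.3:
        return "overhead-bound"
    return "mixed"

pattern = classify_latency({
    "total_ms": 12_000,
    "llm_time_ms": 10_200,  # 85% of wall time in LLM calls
    "tool_time_ms": 1_300,
    "overhead_ms": 500,
})
```

Detecting the retry-bound pattern needs more than timings: you'd also count repeated spans for the same step in the trace.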
Alert on: P95 latency > 2x the 7-day average.
Metric 4: Tool Reliability
Your agent is only as reliable as its tools. If the search API returns garbage 20% of the time, your agent will confidently process that garbage 20% of the time.
tool_metrics = {
    "tool_name": "search_knowledge_base",
    "call_count": 1547,
    "success_rate": 0.94,
    "avg_latency_ms": 230,
    "p99_latency_ms": 1800,
    "empty_result_rate": 0.12,
    "used_by_agent_rate": 0.78,  # Agent actually used the result
}
That last metric — used_by_agent_rate — is crucial. If the agent calls a tool and then ignores its output 40% of the time, either the tool returns unhelpful results or the agent doesn't know how to use them. Both are bugs.
Alert on: Any tool with success rate <90% or used-by-agent rate <60%.
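These numbers fall out of simple aggregation over raw call records; a sketch, assuming a hypothetical record shape of `(succeeded, latency_ms, result_empty, result_used_by_agent)`:

```python
# Sketch: derive per-tool reliability metrics from raw call records.
# The record tuple shape and the sample data are assumptions for illustration.
def tool_metrics(name, calls):
    n = len(calls)
    return {
        "tool_name": name,
        "call_count": n,
        "success_rate": sum(c[0] for c in calls) / n,
        "avg_latency_ms": sum(c[1] for c in calls) / n,
        "empty_result_rate": sum(c[2] for c in calls) / n,
        "used_by_agent_rate": sum(c[3] for c in calls) / n,
    }

m = tool_metrics("search_knowledge_base", [
    (True, 200, False, True),
    (True, 260, True, False),    # empty result the agent ignored
    (False, 900, False, False),  # failed call
    (True, 240, False, True),
])
```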
Metric 5: Token Efficiency
Not just "how many tokens did we use" — but how many tokens per unit of useful output. This tells you whether your prompts are efficient or whether you're burning money on context that doesn't help.
session_stats = {
    "total_input_tokens": 45_200,
    "total_output_tokens": 3_800,
    "useful_output_tokens": 1_200,
    "input_output_ratio": 11.9,  # ~12 input tokens per output token
    "context_growth_rate": 1.4,  # Context grows 40% per step
    "system_prompt_overhead": 0.35,  # 35% of input tokens are system prompt
}
Key ratios to watch:
- Input-to-output ratio. 5:1 is efficient. 20:1 means massive contexts for small outputs. 50:1 means something is very wrong.
- Context growth rate. Linear growth (1.0-1.2x per step) is manageable. Super-linear (>1.5x per step) means you're appending full outputs without pruning.
- System prompt overhead. If your system prompt is 2,000 tokens and you make 15 LLM calls per session, that's 30,000 tokens just in repeated system prompts.
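All three ratios can be computed from per-call token counts; a sketch, where the `token_efficiency` helper and the sample counts are illustrative:

```python
# Sketch: token-efficiency ratios from per-LLM-call token counts.
# steps: list of (input_tokens, output_tokens); the numbers are hypothetical.
def token_efficiency(steps, system_prompt_tokens):
    total_in = sum(s[0] for s in steps)
    total_out = sum(s[1] for s in steps)
    growth = [steps[i][0] / steps[i - 1][0] for i in range(1, len(steps))]
    return {
        "input_output_ratio": total_in / total_out,
        "context_growth_rate": sum(growth) / len(growth),
        "system_prompt_overhead": system_prompt_tokens * len(steps) / total_in,
    }

stats = token_efficiency(
    steps=[(2_000, 300), (2_800, 250), (3_900, 280)],  # context growing ~1.4x/step
    system_prompt_tokens=800,
)
```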
Optimization Example
# Before: 45K input tokens per session
# System prompt: 2,100 tokens × 12 calls = 25,200 (56%!)
# Full history appended each step: grows 8K per step
# After: 18K input tokens per session
# System prompt: 800 tokens × 12 calls = 9,600 (53%, lower absolute)
# Summarized history after step 5: caps at ~3K tokens
# Net savings: 60% fewer input tokens → 60% lower cost
Alert on: Input-to-output ratio > 20:1 sustained, or context growth rate > 1.5x per step.
Putting It All Together
These five metrics give you a complete picture of agent health:
- Cost per task — unit economics health
- Task success rate — quality of agent output
- E2E latency — user experience
- Tool reliability — infrastructure health
- Token efficiency — prompt/architecture health
Together, they answer: Is my agent system working, and is it getting better or worse?
Start Tracking Today
You don't need a data team to implement these. Canary captures all five metrics automatically — cost attribution, success tracking, latency breakdown, tool reliability, and token efficiency — from a single SDK integration.
import agentops
agentops.init()
# All 5 metrics are now being captured for every agent session.
Start measuring what matters for free →
The teams that ship reliable agents aren't the ones with the best models — they're the ones with the best visibility into how those models perform.