Tracing Multi-Agent Systems: A Practical Guide
When your research agent hands off to your analysis agent, which spawns a fact-checking agent, and somewhere the answer goes wrong — where do you start looking?
Why Multi-Agent Tracing Is Different
In a single-agent system, you have one execution thread: prompt → reasoning → tool calls → response. The trace is a linear sequence or a shallow tree.
Multi-agent systems are directed graphs. Agent A calls Agent B and Agent C in parallel. Agent B calls Agent D. Agent D's result feeds back into Agent A's next decision. The execution path looks less like a call stack and more like a distributed system — because it is one.
The same problems that made distributed systems tracing hard (correlation, causality, fan-out, async boundaries) apply to multi-agent systems, plus a new one: semantic dependencies. Agent B's output isn't just data flowing to Agent A — it's information that changes Agent A's reasoning in unpredictable ways.
The Trace Structure
A multi-agent trace needs three things: sessions, spans, and parent-child relationships.
Session: "Analyze quarterly report"
├── Span: Coordinator Agent
│   ├── LLM Call: Plan decomposition (GPT-4o, 3.2K tokens)
│   ├── Span: Data Extraction Agent
│   │   ├── LLM Call: Parse document (GPT-4o-mini, 8.1K tokens)
│   │   ├── Tool Call: pdf_extract(q3_report.pdf)
│   │   └── LLM Call: Structure data (GPT-4o-mini, 2.4K tokens)
│   ├── Span: Analysis Agent
│   │   ├── LLM Call: Identify trends (GPT-4o, 6.7K tokens)
│   │   ├── Tool Call: query_database("revenue by quarter")
│   │   ├── LLM Call: Compare to benchmarks (GPT-4o, 4.1K tokens)
│   │   └── Span: Fact-Check Agent
│   │       ├── LLM Call: Verify claims (GPT-4o-mini, 3.3K tokens)
│   │       └── Tool Call: web_search("Q3 2025 industry benchmarks")
│   └── LLM Call: Synthesize final report (GPT-4o, 9.8K tokens)

Total: 4 agents, 7 LLM calls, 3 tool calls, 37.6K tokens, $0.34. Without tracing, you'd see none of this structure.
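The tree above is just nested spans with roll-up aggregation. A minimal sketch of that structure in plain Python — the `Span` class here is illustrative, not the Canary SDK:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One node in the trace tree: an agent, an LLM call, or a tool call."""
    name: str
    kind: str                                  # "agent" | "llm" | "tool"
    tokens: int = 0
    children: list["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        # Token counts roll up from leaf calls to the session root.
        return self.tokens + sum(c.total_tokens() for c in self.children)

root = Span("Coordinator Agent", "agent", children=[
    Span("Plan decomposition", "llm", tokens=3200),
    Span("Data Extraction Agent", "agent", children=[
        Span("Parse document", "llm", tokens=8100),
    ]),
])
print(root.total_tokens())  # 11300
```

Every metric later in this guide (cost, latency, call counts) is some aggregation over a tree like this one.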
Implementation: Trace Context Propagation
The critical piece is propagating trace context across agent boundaries. When the coordinator spawns a sub-agent, the sub-agent's trace needs to be linked as a child of the coordinator's span.
Pattern 1: Framework-Level Propagation
If you're using a framework like CrewAI or AutoGen, the framework manages agent handoffs. You instrument at the framework level:
import agentops
from crewai import Agent, Task, Crew

agentops.init()

researcher = Agent(
    role="Research Analyst",
    goal="Find relevant market data",
    tools=[search_tool, database_tool],
)

analyst = Agent(
    role="Financial Analyst",
    goal="Analyze trends and generate insights",
    tools=[calculator_tool, chart_tool],
)

crew = Crew(
    agents=[researcher, analyst],
    tasks=[research_task, analysis_task],
    verbose=True,
)

# Canary automatically captures the full multi-agent trace
result = crew.kickoff()

Pattern 2: Manual Context Passing
If you're building your own orchestration, propagate trace context explicitly:
import asyncio

from agentops import Session, Span

session = Session(tags=["quarterly-analysis"])

async def coordinator(query: str):
    with session.span("coordinator") as coord_span:
        plan = await planner.decompose(query)
        tasks = []
        for subtask in plan.subtasks:
            # Each sub-agent's span is linked as a child of the
            # coordinator's span, so the trace tree stays connected.
            agent = select_agent(subtask.type)
            tasks.append(
                run_sub_agent(agent, subtask, parent_span=coord_span)
            )
        results = await asyncio.gather(*tasks)
        return await synthesizer.compile(results)

async def run_sub_agent(agent, task, parent_span):
    with parent_span.child(f"agent:{agent.name}") as agent_span:
        return await agent.execute(task, trace_span=agent_span)

Pattern 3: Cross-Process Tracing
For agents running as separate services (a microservices architecture), propagate trace context via HTTP headers:
# Coordinator service
import httpx

async def call_research_agent(query: str, trace_context: dict):
    # Propagate the trace ID and parent span ID as HTTP headers.
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://research-agent/analyze",
            json={"query": query},
            headers={
                "X-Trace-ID": trace_context["trace_id"],
                "X-Parent-Span-ID": trace_context["span_id"],
            },
        )
    return response.json()

# Research agent service
import agentops
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/analyze")
async def analyze(request: Request):
    trace_id = request.headers.get("X-Trace-ID")
    parent_span = request.headers.get("X-Parent-Span-ID")
    with agentops.continue_trace(trace_id, parent_span) as span:
        body = await request.json()  # Request.json() is a coroutine
        result = await research_agent.run(body["query"])
    return {"result": result}

Debugging Multi-Agent Failures
With tracing in place, debugging becomes systematic:
Failure Pattern 1: Wrong Agent Selection
The coordinator picked the wrong sub-agent for a task. In the trace, you see:
Coordinator → LLM Call: "Route task to appropriate agent"
→ Decision: sent "calculate revenue growth" to TextSummarizer
→ TextSummarizer produced narrative instead of calculations
→ Coordinator used incorrect data in final synthesis
Fix: Inspect the coordinator's routing prompt, add examples of correct routing, and test against an eval set.
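One way to apply that fix is to put a few routing examples directly in the prompt. A hypothetical routing prompt — the agent names and task phrasings here are illustrative, not from any specific framework:

```python
# Few-shot routing prompt; sending "calculate revenue growth" to a
# summarizer is exactly the mistake the examples guard against.
ROUTING_PROMPT = """You route tasks to the best-suited agent.
Agents: DataExtractor, Analyst, TextSummarizer, FactChecker

Examples of correct routing:
- "calculate revenue growth"        -> Analyst
- "summarize the executive letter"  -> TextSummarizer
- "pull tables from the PDF"        -> DataExtractor
- "verify the benchmark figures"    -> FactChecker

Task: {task}
Answer with the agent name only."""

def build_routing_prompt(task: str) -> str:
    return ROUTING_PROMPT.format(task=task)

prompt = build_routing_prompt("calculate revenue growth")
```

Run every prompt change through the same eval set of routing cases, so a fix for one misroute doesn't silently introduce another.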
Failure Pattern 2: Context Loss at Handoff
Sub-agent didn't receive enough context from the coordinator:
Coordinator → passes "analyze the data" to AnalysisAgent
→ AnalysisAgent has no idea what "the data" refers to
→ Hallucinates analysis of generic data
→ Coordinator incorporates hallucinated analysis into report
Fix: The trace shows the exact prompt sent to the sub-agent; add the missing context to the handoff.
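A structural guard against this failure is to make handoffs explicit objects rather than free-form strings, and reject vague ones before they reach a sub-agent. A sketch, with an illustrative payload shape rather than any framework's API:

```python
from dataclasses import dataclass

@dataclass
class Handoff:
    """An explicit agent-to-agent handoff payload (illustrative)."""
    task: str        # what to do
    data_ref: str    # which data, by name — never "the data"
    context: str     # what came before, and why this task matters

    def validate(self) -> None:
        # Refuse empty fields before the sub-agent can hallucinate
        # around them.
        for field_name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"handoff missing {field_name!r}")

h = Handoff(
    task="Identify quarter-over-quarter revenue trends",
    data_ref="q3_report.pdf, tables 2-4",
    context="Extracted by the Data Extraction Agent; figures in USD thousands",
)
h.validate()  # passes; any blank field would raise ValueError
```

Logging the serialized `Handoff` into the trace also means the "exact prompt sent to the sub-agent" is always one click away.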
Failure Pattern 3: Cascading Retries
One slow or failing tool causes a cascade:
ResearchAgent → web_search("Q3 benchmarks") → timeout (30s)
→ retry → timeout (30s)
→ retry → partial results
→ AnalysisAgent waiting on ResearchAgent → stalled
→ Coordinator timeout → retries entire flow
→ Total: 4 minutes, $2.80, still incomplete
Fix: The trace reveals the bottleneck immediately. Add tool timeouts, fallback strategies, and circuit breakers.
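The timeout-plus-fallback part of that fix is a small wrapper around each tool call. A minimal sketch using only the standard library; `slow_search` and `cached_search` are stand-ins for real tools:

```python
import asyncio

async def call_tool_with_fallback(tool, query, timeout_s=5.0, fallback=None):
    """Bound a tool call with a timeout and degrade gracefully,
    instead of retrying until every agent upstream stalls."""
    try:
        return await asyncio.wait_for(tool(query), timeout=timeout_s)
    except asyncio.TimeoutError:
        # One bounded failure, visible in the trace, rather than a
        # 30s-timeout retry cascade.
        if fallback is not None:
            return await fallback(query)
        return {"error": "timeout", "query": query}

async def slow_search(query):
    await asyncio.sleep(10)  # simulates a hung web_search call
    return {"results": ["..."]}

async def cached_search(query):
    return {"results": [], "source": "cache"}

result = asyncio.run(
    call_tool_with_fallback(slow_search, "Q3 benchmarks",
                            timeout_s=0.1, fallback=cached_search)
)
print(result["source"])  # cache
```

A circuit breaker adds one more layer on top: after N consecutive timeouts, skip the tool entirely for a cool-down period instead of paying the timeout on every call.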
Key Metrics for Multi-Agent Systems
Per-agent metrics:
- Cost contribution (what % of session cost does each agent consume?)
- Success rate (how often does this agent produce usable output?)
- Latency distribution (is one agent the bottleneck?)
- Token efficiency (tokens consumed vs. useful output produced)
System-level metrics:
- Agent fan-out depth (how many levels of sub-agents?)
- Inter-agent retry rate (are agents redoing each other's work?)
- End-to-end latency breakdown (where does time actually go?)
- Cost per completed task (across all agents involved)
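The first per-agent metric, cost contribution, is a one-liner once per-span costs are available. A sketch in plain Python; the agent names and dollar figures are illustrative:

```python
def cost_contribution(span_costs: dict[str, float]) -> dict[str, float]:
    """Each agent's share of total session cost, as a percentage."""
    total = sum(span_costs.values())
    return {name: round(100 * cost / total, 1)
            for name, cost in span_costs.items()}

shares = cost_contribution({
    "Coordinator": 0.13,
    "DataExtraction": 0.04,
    "Analysis": 0.14,
    "FactCheck": 0.03,
})
print(shares["Analysis"])  # 41.2
```

When one agent consistently takes 40%+ of session cost, that's where model-downgrade or caching experiments pay off first.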
trace = session.get_trace()
for agent_span in trace.agent_spans:
    print(f"{agent_span.agent_name}:")
    print(f"  Cost: ${agent_span.total_cost:.4f}")
    print(f"  Tokens: {agent_span.total_tokens}")
    print(f"  LLM calls: {agent_span.llm_call_count}")
    print(f"  Duration: {agent_span.duration_ms}ms")
    print(f"  Children: {len(agent_span.child_spans)}")

Practical Tips
- Start with the coordinator. If you can only trace one agent, trace the orchestrator. It shows you the full task decomposition.
- Log agent-to-agent messages. The data passed between agents is where most bugs hide. Capture it in full.
- Trace in development too. Multi-agent bugs surface during development. If you only add tracing in production, you'll debug in production.
- Set up trace-based alerts. "Any session with >5 agent handoffs" or "any session where a sub-agent retried >3 times."
- Visualize the trace tree. A good trace viewer shows the full agent graph — worth 100x more than grepping logs.
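The alert rules suggested above reduce to simple predicates over a trace summary. A sketch — the field names and thresholds are illustrative, not a Canary API:

```python
def should_alert(trace_summary: dict) -> list[str]:
    """Evaluate trace-based alert rules; returns the rules that fired."""
    alerts = []
    if trace_summary.get("agent_handoffs", 0) > 5:
        alerts.append("excessive handoffs")
    if trace_summary.get("max_subagent_retries", 0) > 3:
        alerts.append("sub-agent retry storm")
    if trace_summary.get("total_cost_usd", 0.0) > 1.00:
        alerts.append("cost budget exceeded")
    return alerts

print(should_alert({"agent_handoffs": 7, "max_subagent_retries": 1}))
# ['excessive handoffs']
```

Start with loose thresholds and tighten them as you learn what "normal" looks like for your system; an alert that fires on every session teaches you nothing.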
Getting Started
Multi-agent tracing doesn't have to be a research project. Canary supports multi-agent trace capture out of the box — automatic span propagation for CrewAI, AutoGen, and LangGraph, with manual context passing for custom orchestration.
import agentops
agentops.init()
# Multi-agent traces are captured automatically
crew.kickoff()  # Full trace visible in Canary dashboard

Start tracing your multi-agent systems for free →
Your agents are collaborating. Make sure you can see the conversation.