Tracing Multi-Agent Systems: A Practical Guide
When your research agent hands off to your analysis agent, which spawns a fact-checking agent, and somewhere the answer goes wrong — where do you start looking?
Why Multi-Agent Tracing Is Different
In a single-agent system, you have one execution thread: prompt → reasoning → tool calls → response. The trace is a linear sequence or a shallow tree.
Multi-agent systems are directed graphs. Agent A calls Agent B and Agent C in parallel. Agent B calls Agent D. Agent D's result feeds back into Agent A's next decision. The execution path looks less like a call stack and more like a distributed system — because it is one.
The same problems that made distributed systems tracing hard (correlation, causality, fan-out, async boundaries) apply to multi-agent systems, plus a new one: semantic dependencies. Agent B's output isn't just data flowing to Agent A — it's information that changes Agent A's reasoning in unpredictable ways.
The Trace Structure
A multi-agent trace needs three things: sessions, spans, and parent-child relationships.
Session: "Analyze quarterly report"
├── Span: Coordinator Agent
│   ├── LLM Call: Plan decomposition (GPT-4o, 3.2K tokens)
│   ├── Span: Data Extraction Agent
│   │   ├── LLM Call: Parse document (GPT-4o-mini, 8.1K tokens)
│   │   ├── Tool Call: pdf_extract(q3_report.pdf)
│   │   └── LLM Call: Structure data (GPT-4o-mini, 2.4K tokens)
│   ├── Span: Analysis Agent
│   │   ├── LLM Call: Identify trends (GPT-4o, 6.7K tokens)
│   │   ├── Tool Call: query_database("revenue by quarter")
│   │   ├── LLM Call: Compare to benchmarks (GPT-4o, 4.1K tokens)
│   │   └── Span: Fact-Check Agent
│   │       ├── LLM Call: Verify claims (GPT-4o-mini, 3.3K tokens)
│   │       └── Tool Call: web_search("Q3 2025 industry benchmarks")
│   └── LLM Call: Synthesize final report (GPT-4o, 9.8K tokens)

Total: 4 agents, 7 LLM calls, 3 tool calls, 37.6K tokens, $0.34. Without tracing, you'd see none of this structure.
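The tree above is just nested spans with roll-up aggregation. A minimal sketch of that structure in plain Python — the `Span` class here is illustrative, not the Canary SDK:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One node in the trace tree: an agent, an LLM call, or a tool call."""
    name: str
    kind: str                                  # "agent" | "llm" | "tool"
    tokens: int = 0
    children: list["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        # Token counts roll up from leaf calls to the session root.
        return self.tokens + sum(c.total_tokens() for c in self.children)

root = Span("Coordinator Agent", "agent", children=[
    Span("Plan decomposition", "llm", tokens=3200),
    Span("Data Extraction Agent", "agent", children=[
        Span("Parse document", "llm", tokens=8100),
    ]),
])
print(root.total_tokens())  # 11300
```

Every metric later in this guide (cost, latency, call counts) is some aggregation over a tree like this one.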
Implementation: Trace Context Propagation
The critical piece is propagating trace context across agent boundaries. When the coordinator spawns a sub-agent, the sub-agent's trace needs to be linked as a child of the coordinator's span.
Pattern 1: Framework-Level Propagation
If you're using a framework like CrewAI or AutoGen, the framework manages agent handoffs. You instrument at the framework level:
import agentops
from crewai import Agent, Task, Crew

agentops.init()

researcher = Agent(
    role="Research Analyst",
    goal="Find relevant market data",
    tools=[search_tool, database_tool],
)

analyst = Agent(
    role="Financial Analyst",
    goal="Analyze trends and generate insights",
    tools=[calculator_tool, chart_tool],
)

crew = Crew(
    agents=[researcher, analyst],
    tasks=[research_task, analysis_task],
    verbose=True,
)

# Canary automatically captures the full multi-agent trace
result = crew.kickoff()

Pattern 2: Manual Context Passing
If you're building your own orchestration, propagate trace context explicitly:
import asyncio

from agentops import Session, Span

session = Session(tags=["quarterly-analysis"])

async def coordinator(query: str):
    with session.span("coordinator") as coord_span:
        plan = await planner.decompose(query)
        tasks = []
        for subtask in plan.subtasks:
            # Each sub-agent's span is linked as a child of the
            # coordinator's span, so the trace tree stays connected.
            agent = select_agent(subtask.type)
            tasks.append(
                run_sub_agent(agent, subtask, parent_span=coord_span)
            )
        results = await asyncio.gather(*tasks)
        return await synthesizer.compile(results)

async def run_sub_agent(agent, task, parent_span):
    with parent_span.child(f"agent:{agent.name}") as agent_span:
        return await agent.execute(task, trace_span=agent_span)

Pattern 3: Cross-Process Tracing
For agents running as separate services (a microservices architecture), propagate trace context via HTTP headers:
# Coordinator service
import httpx

async def call_research_agent(query: str, trace_context: dict):
    # Propagate the trace ID and parent span ID as HTTP headers.
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://research-agent/analyze",
            json={"query": query},
            headers={
                "X-Trace-ID": trace_context["trace_id"],
                "X-Parent-Span-ID": trace_context["span_id"],
            },
        )
    return response.json()

# Research agent service
import agentops
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/analyze")
async def analyze(request: Request):
    trace_id = request.headers.get("X-Trace-ID")
    parent_span = request.headers.get("X-Parent-Span-ID")
    with agentops.continue_trace(trace_id, parent_span) as span:
        body = await request.json()  # Request.json() is a coroutine
        result = await research_agent.run(body["query"])
    return {"result": result}

Debugging Multi-Agent Failures
With tracing in place, debugging becomes systematic:
Failure Pattern 1: Wrong Agent Selection
The coordinator picked the wrong sub-agent for a task. In the trace, you see:
Coordinator → LLM Call: "Route task to appropriate agent"
→ Decision: sent "calculate revenue growth" to TextSummarizer
→ TextSummarizer produced narrative instead of calculations
→ Coordinator used incorrect data in final synthesis
Fix: Inspect the coordinator's routing prompt, add examples of correct routing, and test against an eval set.
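One way to apply that fix is to put a few routing examples directly in the prompt. A hypothetical routing prompt — the agent names and task phrasings here are illustrative, not from any specific framework:

```python
# Few-shot routing prompt; sending "calculate revenue growth" to a
# summarizer is exactly the mistake the examples guard against.
ROUTING_PROMPT = """You route tasks to the best-suited agent.
Agents: DataExtractor, Analyst, TextSummarizer, FactChecker

Examples of correct routing:
- "calculate revenue growth"        -> Analyst
- "summarize the executive letter"  -> TextSummarizer
- "pull tables from the PDF"        -> DataExtractor
- "verify the benchmark figures"    -> FactChecker

Task: {task}
Answer with the agent name only."""

def build_routing_prompt(task: str) -> str:
    return ROUTING_PROMPT.format(task=task)

prompt = build_routing_prompt("calculate revenue growth")
```

Run every prompt change through the same eval set of routing cases, so a fix for one misroute doesn't silently introduce another.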
Failure Pattern 2: Context Loss at Handoff
Sub-agent didn't receive enough context from the coordinator:
Coordinator → passes "analyze the data" to AnalysisAgent
→ AnalysisAgent has no idea what "the data" refers to
→ Hallucinates analysis of generic data
→ Coordinator incorporates hallucinated analysis into report
Fix: The trace shows the exact prompt sent to the sub-agent; add the missing context to the handoff.
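A structural guard against this failure is to make handoffs explicit objects rather than free-form strings, and reject vague ones before they reach a sub-agent. A sketch, with an illustrative payload shape rather than any framework's API:

```python
from dataclasses import dataclass

@dataclass
class Handoff:
    """An explicit agent-to-agent handoff payload (illustrative)."""
    task: str        # what to do
    data_ref: str    # which data, by name — never "the data"
    context: str     # what came before, and why this task matters

    def validate(self) -> None:
        # Refuse empty fields before the sub-agent can hallucinate
        # around them.
        for field_name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"handoff missing {field_name!r}")

h = Handoff(
    task="Identify quarter-over-quarter revenue trends",
    data_ref="q3_report.pdf, tables 2-4",
    context="Extracted by the Data Extraction Agent; figures in USD thousands",
)
h.validate()  # passes; any blank field would raise ValueError
```

Logging the serialized `Handoff` into the trace also means the "exact prompt sent to the sub-agent" is always one click away.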
Failure Pattern 3: Cascading Retries
One slow or failing tool causes a cascade:
ResearchAgent → web_search("Q3 benchmarks") → timeout (30s)
→ retry → timeout (30s)
→ retry → partial results
→ AnalysisAgent waiting on ResearchAgent → stalled
→ Coordinator timeout → retries entire flow
→ Total: 4 minutes, $2.80, still incomplete
Fix: The trace reveals the bottleneck immediately. Add tool timeouts, fallback strategies, and circuit breakers.
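The timeout-plus-fallback part of that fix is a small wrapper around each tool call. A minimal sketch using only the standard library; `slow_search` and `cached_search` are stand-ins for real tools:

```python
import asyncio

async def call_tool_with_fallback(tool, query, timeout_s=5.0, fallback=None):
    """Bound a tool call with a timeout and degrade gracefully,
    instead of retrying until every agent upstream stalls."""
    try:
        return await asyncio.wait_for(tool(query), timeout=timeout_s)
    except asyncio.TimeoutError:
        # One bounded failure, visible in the trace, rather than a
        # 30s-timeout retry cascade.
        if fallback is not None:
            return await fallback(query)
        return {"error": "timeout", "query": query}

async def slow_search(query):
    await asyncio.sleep(10)  # simulates a hung web_search call
    return {"results": ["..."]}

async def cached_search(query):
    return {"results": [], "source": "cache"}

result = asyncio.run(
    call_tool_with_fallback(slow_search, "Q3 benchmarks",
                            timeout_s=0.1, fallback=cached_search)
)
print(result["source"])  # cache
```

A circuit breaker adds one more layer on top: after N consecutive timeouts, skip the tool entirely for a cool-down period instead of paying the timeout on every call.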
Key Metrics for Multi-Agent Systems
Per-agent metrics:
- Cost contribution (what % of session cost does each agent consume?)
- Success rate (how often does this agent produce usable output?)
- Latency distribution (is one agent the bottleneck?)
- Token efficiency (tokens consumed vs. useful output produced)
System-level metrics:
- Agent fan-out depth (how many levels of sub-agents?)
- Inter-agent retry rate (are agents redoing each other's work?)
- End-to-end latency breakdown (where does time actually go?)
- Cost per completed task (across all agents involved)
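The first per-agent metric, cost contribution, is a one-liner once per-span costs are available. A sketch in plain Python; the agent names and dollar figures are illustrative:

```python
def cost_contribution(span_costs: dict[str, float]) -> dict[str, float]:
    """Each agent's share of total session cost, as a percentage."""
    total = sum(span_costs.values())
    return {name: round(100 * cost / total, 1)
            for name, cost in span_costs.items()}

shares = cost_contribution({
    "Coordinator": 0.13,
    "DataExtraction": 0.04,
    "Analysis": 0.14,
    "FactCheck": 0.03,
})
print(shares["Analysis"])  # 41.2
```

When one agent consistently takes 40%+ of session cost, that's where model-downgrade or caching experiments pay off first.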
trace = session.get_trace()
for agent_span in trace.agent_spans:
    print(f"{agent_span.agent_name}:")
    print(f"  Cost: ${agent_span.total_cost:.4f}")
    print(f"  Tokens: {agent_span.total_tokens}")
    print(f"  LLM calls: {agent_span.llm_call_count}")
    print(f"  Duration: {agent_span.duration_ms}ms")
    print(f"  Children: {len(agent_span.child_spans)}")

Practical Tips
- Start with the coordinator. If you can only trace one agent, trace the orchestrator. It shows you the full task decomposition.
- Log agent-to-agent messages. The data passed between agents is where most bugs hide. Capture it in full.
- Trace in development too. Multi-agent bugs surface during development. If you only add tracing in production, you'll debug in production.
- Set up trace-based alerts. "Any session with >5 agent handoffs" or "any session where a sub-agent retried >3 times."
- Visualize the trace tree. A good trace viewer shows the full agent graph — worth 100x more than grepping logs.
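The alert rules suggested above reduce to simple predicates over a trace summary. A sketch — the field names and thresholds are illustrative, not a Canary API:

```python
def should_alert(trace_summary: dict) -> list[str]:
    """Evaluate trace-based alert rules; returns the rules that fired."""
    alerts = []
    if trace_summary.get("agent_handoffs", 0) > 5:
        alerts.append("excessive handoffs")
    if trace_summary.get("max_subagent_retries", 0) > 3:
        alerts.append("sub-agent retry storm")
    if trace_summary.get("total_cost_usd", 0.0) > 1.00:
        alerts.append("cost budget exceeded")
    return alerts

print(should_alert({"agent_handoffs": 7, "max_subagent_retries": 1}))
# ['excessive handoffs']
```

Start with loose thresholds and tighten them as you learn what "normal" looks like for your system; an alert that fires on every session teaches you nothing.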
Getting Started
Multi-agent tracing doesn't have to be a research project. Canary supports multi-agent trace capture out of the box — automatic span propagation for CrewAI, AutoGen, and LangGraph, with manual context passing for custom orchestration.
import agentops
agentops.init()
# Multi-agent traces are captured automatically
crew.kickoff()  # Full trace visible in Canary dashboard

Start tracing your multi-agent systems for free →
Your agents are collaborating. Make sure you can see the conversation.