Multi-Agent Wiki

Observability and Event Model

Trace, event, and metrics design for a multi-agent platform.

Without a trace, a multi-agent platform is essentially unmaintainable. Recording the final answer is not enough: you also need every routing decision, message, tool call, handoff, state change, approval, failure, and retry.

Event model

TypeScript
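// One record per observable step. AgentEventType is the union of the
// type names listed in the table below.
export type AgentEvent = {
  id: string;            // unique id of this event
  traceId: string;       // shared by every event in the same trace
  spanId: string;        // span this event belongs to
  parentSpanId?: string; // omitted on a root span
  sessionId: string;
  runId: string;
  taskId?: string;       // set only when the event belongs to a task
  actor: string;         // who emitted the event
  type: AgentEventType;
  payload: unknown;      // type-specific body; narrow before use
  timestamp: string;
  schemaVersion: string; // version of this event schema
};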

| Type | Meaning |
| --- | --- |
| session.started | Session begins |
| workflow.node.enter | Entered a workflow node |
| agent.message.created | Agent produced a message |
| agent.task.assigned | Task assigned to an agent |
| tool.call.started | Tool call began |
| tool.call.completed | Tool call finished |
| handoff.requested | Handoff initiated |
| handoff.accepted | Handoff accepted |
| blackboard.item.created | Shared state written |
| approval.requested | Approval requested |
| approval.granted | Approval granted |
| verifier.issue.found | Verifier raised an issue |
| loop.round.completed | Refinement loop iteration finished |
| budget.exceeded | Budget exhausted |
| session.completed | Session ended |
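
The event model references AgentEventType without defining it. A minimal sketch of one possible definition follows, built from the type names in the table above. The names AGENT_EVENT_TYPES and isAgentEventType are illustrative assumptions, not part of the original model.

```typescript
// Event type names copied from the table above.
const AGENT_EVENT_TYPES = [
  "session.started",
  "workflow.node.enter",
  "agent.message.created",
  "agent.task.assigned",
  "tool.call.started",
  "tool.call.completed",
  "handoff.requested",
  "handoff.accepted",
  "blackboard.item.created",
  "approval.requested",
  "approval.granted",
  "verifier.issue.found",
  "loop.round.completed",
  "budget.exceeded",
  "session.completed",
] as const;

// Union of the string literals in the array above.
type AgentEventType = (typeof AGENT_EVENT_TYPES)[number];

// Hypothetical runtime guard, useful when reading type strings back
// from storage where the compiler cannot check them.
function isAgentEventType(value: string): value is AgentEventType {
  return (AGENT_EVENT_TYPES as readonly string[]).indexOf(value) !== -1;
}
```

Deriving the union from a constant array keeps the compile-time type and the runtime check in one place, so adding a new event type is a one-line change.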

Metrics

| Metric | Meaning |
| --- | --- |
| Task success rate | Share of tasks that succeed |
| Handoff loop rate | Share of sessions with handoff loops |
| Verifier rejection rate | How often the verifier rejects |
| Average agent depth | Average call depth |
| Tool failure rate | Tool errors per call |
| Cost per successful task | Cost amortized over wins |
| Human approval latency | Approval queue delay |
| Context compression ratio | Compression effectiveness |
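
Most of these metrics can be derived from the event stream. As a rough sketch, the tool failure rate could be computed as below. It assumes that each tool.call.completed payload carries an `ok` boolean; the event model leaves payload as unknown, so that field, the MetricEvent type, and the function name are hypothetical.

```typescript
// Only the fields this metric needs.
type MetricEvent = { type: string; payload: unknown };

// Tool errors per completed call.
// ASSUMPTION: a failed call is reported as a tool.call.completed event
// whose payload contains { ok: false }.
function toolFailureRate(events: MetricEvent[]): number {
  let calls = 0;
  let failures = 0;
  for (const event of events) {
    if (event.type !== "tool.call.completed") continue;
    calls += 1;
    const payload = event.payload as { ok?: boolean } | null;
    if (payload !== null && typeof payload === "object" && payload.ok === false) {
      failures += 1;
    }
  }
  // Report 0 rather than NaN when no tool calls were recorded.
  return calls === 0 ? 0 : failures / calls;
}
```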

Trace UI suggestion

Render each session as a tree:

Text
Session
├─ Planner
│  └─ plan.created
├─ Search Agent
│  ├─ tool.web_search
│  └─ result.summary
├─ Code Agent
│  ├─ tool.read_file
│  ├─ tool.edit_file
│  └─ patch.created
├─ Test Agent
│  └─ test.failed
├─ Code Agent retry
└─ Reviewer
   └─ approved
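
A tree like this can be built from the spanId and parentSpanId fields of the event model. The sketch below is one way to do it, assuming each event opens its own span and that events arrive already sorted by timestamp. SpanEvent, renderTree, and the label format are illustrative choices.

```typescript
// Only the fields needed for rendering.
type SpanEvent = { spanId: string; parentSpanId?: string; actor: string; type: string };

// ASSUMPTION: one event per span, input already in timestamp order.
function renderTree(events: SpanEvent[]): string {
  const known = new Set(events.map((e) => e.spanId));
  const children = new Map<string, SpanEvent[]>();
  const roots: SpanEvent[] = [];

  // Group events under their parent; events whose parent is missing
  // from the input are treated as roots.
  for (const e of events) {
    if (e.parentSpanId !== undefined && known.has(e.parentSpanId)) {
      const siblings = children.get(e.parentSpanId) ?? [];
      siblings.push(e);
      children.set(e.parentSpanId, siblings);
    } else {
      roots.push(e);
    }
  }

  const lines: string[] = [];
  const label = (e: SpanEvent): string => `${e.type} (${e.actor})`;

  // Depth-first walk that emits one line per event.
  const walk = (node: SpanEvent, prefix: string, isLast: boolean, isRoot: boolean): void => {
    lines.push(isRoot ? label(node) : prefix + (isLast ? "└─ " : "├─ ") + label(node));
    const kids = children.get(node.spanId) ?? [];
    const childPrefix = isRoot ? "" : prefix + (isLast ? "   " : "│  ");
    kids.forEach((kid, i) => walk(kid, childPrefix, i === kids.length - 1, false));
  };

  roots.forEach((root) => walk(root, "", true, true));
  return lines.join("\n");
}
```

Treating events with an unknown parent as roots keeps partial traces renderable, which helps when some events were dropped or have not arrived yet.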

Minimum viable pipeline

  1. Write every event to append-only JSONL first.
  2. Mirror key fields into Postgres / ClickHouse.
  3. Use traceId, spanId, and parentSpanId for tree rendering.
  4. Redact sensitive fields from messages and tool calls.
  5. Eventually export the events to OpenTelemetry or your own observability platform.
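
Steps 1 and 4 can be sketched together as a pair of pure functions: one that masks sensitive payload fields and one that turns an event into a single JSONL line. The key list in SENSITIVE_KEYS, the "[REDACTED]" marker, and the choice to redact before serializing are all assumptions made for illustration. Appending the line to a file is left to the caller.

```typescript
// HYPOTHETICAL list of payload keys to mask; adjust to your own data.
const SENSITIVE_KEYS = new Set(["apiKey", "password", "token", "authorization"]);

// Recursively copy a value, replacing sensitive keys with a marker.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = SENSITIVE_KEYS.has(key) ? "[REDACTED]" : redact(source[key]);
    }
    return out;
  }
  return value;
}

// Serialize one event as a single newline-terminated JSON line.
// JSON.stringify escapes embedded newlines, so each event stays on one line.
function toJsonlLine(event: { payload: unknown } & Record<string, unknown>): string {
  return JSON.stringify({ ...event, payload: redact(event.payload) }) + "\n";
}
```

Redacting before the line is written means the append-only log never holds the raw secret, at the cost of not being able to recover it later.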