Multi-Agent Wiki

Orchestrator Implementation Guide

Scheduler, router, and state machine for a multi-agent platform.

The orchestrator is the heart of a multi-agent platform. It is not "an agent that writes good prompts"; it combines a workflow engine, a policy engine, an agent scheduler, and a trace emitter.

Responsibilities

| Responsibility | Description |
| --- | --- |
| Plan | Turn a user goal into a task tree or workflow |
| Route | Choose the agent, the tool, the next state |
| Schedule | Control concurrency, timeouts, retry, cancel |
| Checkpoint | Persist state to support resume |
| Verify | Call critic / test / human approval |
| Observe | Emit trace events |
| Govern | Enforce permission, budget, and safety policy |
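
One way to picture how these responsibilities fit together is a single run loop. The sketch below is illustrative only: `Step`, `TraceEvent`, `Hooks`, and `runWorkflow` are hypothetical names, planning is assumed to have already produced the step list, and the budget is reduced to a plain step count.

```typescript
// Hypothetical types for illustration; none of these names come from a real framework.
type Step = { id: string; agent: string; run: () => Promise<string> };
type TraceEvent = { kind: string; stepId?: string; detail?: string };

type Hooks = {
  allow: (step: Step) => boolean;       // Govern
  verify: (output: string) => boolean;  // Verify
  checkpoint: (done: string[]) => void; // Checkpoint
  emit: (event: TraceEvent) => void;    // Observe
};

// Plan is assumed done (it produced `steps`); Route is reduced to reading
// step.agent; Schedule is a sequential loop bounded by a step budget.
async function runWorkflow(steps: Step[], hooks: Hooks, maxSteps: number): Promise<string[]> {
  const done: string[] = [];
  for (const step of steps) {
    if (done.length >= maxSteps) {
      hooks.emit({ kind: "budget_exhausted", stepId: step.id });
      break;
    }
    if (!hooks.allow(step)) {
      hooks.emit({ kind: "denied", stepId: step.id, detail: step.agent });
      continue; // skip the denied step, keep going with the rest
    }
    hooks.emit({ kind: "step_started", stepId: step.id, detail: step.agent });
    const output = await step.run();
    if (!hooks.verify(output)) {
      hooks.emit({ kind: "verify_failed", stepId: step.id });
      break; // stop so a critic or a human can intervene
    }
    done.push(step.id);
    hooks.checkpoint([...done]); // persist progress after every verified step
    hooks.emit({ kind: "step_finished", stepId: step.id });
  }
  return done;
}
```

The point of the sketch is the ordering: the policy check runs before the agent does, and the checkpoint is written only after verification passes.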

Routing strategies

1. Rule-based routing

Good for deterministic flows:

TypeScript
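// Assumed minimal types; this page does not define Task or AgentId.
type AgentId = string;
type Task = { goal: string };

// The first matching keyword wins, so order checks from most to least specific.
function routeByRule(task: Task): AgentId {
  const goal = task.goal.toLowerCase(); // match "Test" as well as "test"
  if (goal.includes("test")) return "test-agent";
  if (goal.includes("code")) return "code-agent";
  if (goal.includes("search")) return "search-agent";
  return "general-agent"; // fallback when no rule matches
}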

2. LLM routing

Suitable for open-ended tasks, but the router's output must be structured so it can be validated:

TypeScript
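import { z } from "zod";

// Structured routing decision; output that fails to parse should be rejected.
const decisionSchema = z.object({
  action: z.enum(["call_agent", "handoff", "ask_user", "final"]),
  target: z.string().optional(), // agent id, when the action names one
  reason: z.string(),
  confidence: z.number().min(0).max(1), // assumed range: 0 to 1
});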
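
Structured output only helps if the orchestrator rejects anything that does not match. With zod this is normally done by calling `safeParse` on the schema; the sketch below hand-rolls an equivalent check so that it runs without dependencies. The `parseDecision` name and the rule that `confidence` lies in [0, 1] are assumptions made for illustration.

```typescript
type Action = "call_agent" | "handoff" | "ask_user" | "final";
type Decision = { action: Action; target?: string; reason: string; confidence: number };

const ACTIONS: readonly string[] = ["call_agent", "handoff", "ask_user", "final"];

// Returns a typed Decision, or null when the raw model output is unusable.
function parseDecision(raw: string): Decision | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof value !== "object" || value === null) return null;
  const { action, target, reason, confidence } = value as Record<string, unknown>;

  if (typeof action !== "string" || !ACTIONS.includes(action)) return null;
  if (typeof reason !== "string") return null;
  // Assumed range: confidence is a score between 0 and 1 inclusive.
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) return null;

  // target is optional, but when present it must be a string.
  let targetId: string | undefined;
  if (typeof target === "string") targetId = target;
  else if (target !== undefined) return null;

  return { action: action as Action, target: targetId, reason, confidence };
}
```

A caller would typically treat a `null` result as a routing failure: retry the model call, fall back to rule-based routing, or ask the user.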

3. Hybrid

Production recommendation: rules narrow the choice, the LLM picks within the allowed set.

TypeScript
const allowed = policy.allowedAgents(state);
const decision = await llmRouter.pick({ state, allowed });
// Re-check the model's choice; a missing target is rejected as well.
if (!decision.target || !allowed.includes(decision.target)) {
  throw new PolicyError();
}
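
To make the hybrid pattern concrete, here is a self-contained sketch with stubbed dependencies. The `State` shape, the rules inside `allowedAgents`, and `fakeLlmPick` (a stand-in for a real model call) are all assumptions made for illustration.

```typescript
type State = { userRole: "guest" | "member"; goal: string };

class PolicyError extends Error {}

// Rule layer (assumed rules): guests may only reach read-only agents.
function allowedAgents(state: State): string[] {
  return state.userRole === "guest"
    ? ["search-agent", "general-agent"]
    : ["search-agent", "general-agent", "code-agent", "test-agent"];
}

// Stand-in for the LLM call. A real router would prompt a model with the
// state and the allowed list, then parse its structured reply. This stub
// deliberately ignores `allowed` to show why the guard is needed.
async function fakeLlmPick(input: { state: State; allowed: string[] }): Promise<{ target?: string }> {
  return { target: input.state.goal.includes("code") ? "code-agent" : "general-agent" };
}

async function routeHybrid(state: State): Promise<string> {
  const allowed = allowedAgents(state);
  const decision = await fakeLlmPick({ state, allowed });
  // The guard runs even though `allowed` was passed in: models can ignore it.
  if (!decision.target || !allowed.includes(decision.target)) {
    throw new PolicyError(`router chose a disallowed agent: ${decision.target}`);
  }
  return decision.target;
}
```

With these stubs, a member asking for code is routed to `code-agent`, while the same request from a guest ends in a `PolicyError`, because the stubbed router picks an agent outside the guest's allowed set.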

Scheduling strategies

| Strategy | Use |
| --- | --- |
| FIFO | Ordinary task queue |
| Priority queue | Blocking user tasks first |
| Budget-aware | Trim by token / time / cost budget |
| Speculative | Run agents in parallel; keep the best |
| Retry with backoff | Tool or network failures |
| Circuit breaker | Trip when an agent fails repeatedly |
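
The last two strategies are often combined: retries absorb transient failures, and a breaker stops calling an agent that keeps failing. The sketch below assumes exponential backoff from a fixed base delay and a simple consecutive-failure threshold. `retryWithBackoff`, `CircuitBreaker`, and their parameters are illustrative names, and `sleep` is injectable so the example can run without real delays.

```typescript
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry `fn` up to `maxAttempts` times, doubling the delay after each failure.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number,
  sleep: Sleep = realSleep,
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // No sleep after the final attempt.
      if (attempt < maxAttempts - 1) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}

// Opens after `threshold` consecutive failures; any success closes it again.
class CircuitBreaker {
  private failures = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  get open(): boolean {
    return this.failures >= this.threshold;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.open) throw new Error("circuit open: agent temporarily disabled");
    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      throw err;
    }
  }
}
```

A production breaker would normally also add a cooldown before allowing a trial call; that half-open state is left out here to keep the sketch short.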

Checkpoint shape

Each checkpoint should carry at least:

TypeScript
export type Checkpoint = {
  sessionId: string;
  runId: string;
  workflowNode: string;
  taskTree: Task[];
  activeAgent?: string;
  blackboardVersion: string;
  messageCursor: string;
  toolCallCursor: string;
  budget: BudgetState;
  createdAt: string;
};
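
To show how such a record supports resume, here is a minimal in-memory sketch. The `CheckpointStore` class, the trimmed-down `MiniCheckpoint` type, and the "latest checkpoint wins" resume rule are assumptions; a real store would be durable and would carry every field listed above.

```typescript
// Trimmed-down stand-in for the Checkpoint type, for illustration only.
type MiniCheckpoint = {
  runId: string;
  workflowNode: string;
  messageCursor: string;
  createdAt: string; // ISO 8601 timestamp
};

class CheckpointStore {
  private readonly byRun = new Map<string, MiniCheckpoint[]>();

  // Append-only: earlier checkpoints are kept so a run can be rolled back.
  save(cp: MiniCheckpoint): void {
    const list = this.byRun.get(cp.runId) ?? [];
    list.push({ ...cp }); // copy so later mutation by the caller is not persisted
    this.byRun.set(cp.runId, list);
  }

  // The most recently saved checkpoint for a run, if any.
  latest(runId: string): MiniCheckpoint | undefined {
    const list = this.byRun.get(runId);
    return list && list.length > 0 ? list[list.length - 1] : undefined;
  }
}

// On restart, re-enter the workflow at the saved node instead of the first one.
function resumeNode(store: CheckpointStore, runId: string, firstNode: string): string {
  return store.latest(runId)?.workflowNode ?? firstNode;
}
```

Keeping the cursors (`messageCursor`, `toolCallCursor`) in the checkpoint is what lets a resumed run skip messages and tool calls it has already processed rather than replaying them.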

Common mistakes

  • The orchestrator does everything itself, so the system devolves into a single agent.
  • No checkpointing, so interrupted tasks cannot resume.
  • LLM routing without a policy guard, so agents call tools they shouldn't.
  • Sub-agent outputs without schemas, so downstream nodes can't parse them.