Agent Arena

[!warning] Agent Arena is experimental with known limitations.

Dispatch multiple AI models simultaneously on the same task, compare their solutions side-by-side, and select the best result to apply to your workspace.

When to use Arena

Model benchmarking — evaluate different models on real tasks in your codebase
Best-of-N selection — get multiple independent solutions, pick the best
Risk reduction — validate that multiple models converge on the same approach before committing to a critical change
Exploring approaches — see how different models reason about the same problem

Arena uses significantly more tokens than a single session (each agent has its own context window). Use it when the value of comparison justifies the cost.

Start a session


/arena --models gpt-4o,claude-sonnet-4,gemini-2.5-pro "Refactor the authentication module to use JWT tokens"

Omit --models to get an interactive model selection dialog.

What happens

proto creates isolated Git worktrees for each agent (one per model)
Each agent starts with full tool access and works independently — no shared state, no communication
You can monitor progress and send messages to individual agents via tab switching
When all agents finish, you compare results and select a winner

Navigate between agents

Use keyboard shortcuts to switch between agent tabs:

Shortcut	Action
`→` Right	Next agent tab
`←` Left	Previous agent tab
`↑` Up	Focus input box
`↓` Down	Focus tab bar

Tab status indicators: ● running, ✓ done, ✗ failed, ○ cancelled.

Select a winner

When all agents complete, choose one to apply its changes to your main workspace. The winning agent’s diff is applied and all worktrees are cleaned up automatically.

Configuration


{
  "arena": {
    "worktreeBaseDir": "~/.proto/arena",
    "maxRoundsPerAgent": 50,
    "timeoutSeconds": 600
  }
}

Best practices

2–3 agents give the best balance of insight vs. cost. Max is 5.
Choose complementary models — comparing models with different strengths gives more signal than comparing versions of the same model.
Keep tasks self-contained — Arena agents cannot communicate, so the task must be fully describable in the prompt.
Use for high-impact decisions — architecture choices, critical refactors, not routine changes.

Troubleshooting

Symptom	Fix
Agent fails to start	Verify model API credentials; check Git repo and write access to worktree dir
Worktree creation fails	Run `git worktree prune` to clean up stale worktrees; requires Git 2.5+
Agent timeout	Increase `arena.timeoutSeconds` in settings
Winner apply fails	Check for conflicting uncommitted changes in your main working directory

Limitations

In-process mode only — split-pane display (tmux/iTerm2) is not yet available
No diff preview before selecting a winner
No worktree retention after selection
No session resumption — if you close the terminal mid-session, clean up manually with git worktree prune
Maximum 5 agents
Git repository required