Evals
Adapter scoreboard
Accuracy on tool use and planning, plus streaming latency. Runs come from @agentskit/eval. Today's numbers are seed data; the CI pipeline will replace them as real runs land.
Last updated: 2026-04-24
tool-use accuracy
Did the model call the right tool with the right args?
higher is better
- 1. anthropic/claude-sonnet-4-6 · 95% · n=200
- 2. openai/gpt-4o-mini · 92% · n=200
- 3. gemini/2.5-flash · 88% · n=200
- 4. openrouter/llama-3.1-70b · 81% · n=200
- 5. ollama/llama3.1 · 74% · n=200
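A sketch of how a tool-use case might be graded. The names (`ToolCall`, `gradeToolUse`, `accuracy`) are illustrative, not the @agentskit/eval API: a case passes only when the model picked the expected tool and supplied matching arguments, and suite accuracy is the pass fraction.

```typescript
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Canonicalize a value so object key order does not affect comparison.
const canon = (v: unknown): string =>
  v !== null && typeof v === "object" && !Array.isArray(v)
    ? "{" +
      Object.keys(v as object)
        .sort()
        .map(
          (k) =>
            JSON.stringify(k) + ":" + canon((v as Record<string, unknown>)[k])
        )
        .join(",") +
      "}"
    : JSON.stringify(v);

// A case passes only if the tool name AND all args match the expectation.
function gradeToolUse(expected: ToolCall, actual: ToolCall): boolean {
  if (expected.name !== actual.name) return false;
  return canon(expected.args) === canon(actual.args);
}

// Suite accuracy is simply the fraction of passing cases.
function accuracy(results: boolean[]): number {
  return results.filter(Boolean).length / results.length;
}
```

Exact argument equality is a deliberately strict choice; a real harness might tolerate extra optional args or compare field-by-field.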
plan decomposition
Can the model split a goal into executable steps the worker then completes?
higher is better
- 1. anthropic/claude-sonnet-4-6 · 91% · n=120
- 2. openai/gpt-4o-mini · 84% · n=120
- 3. gemini/2.5-flash · 79% · n=120
- 4. openrouter/llama-3.1-70b · 72% · n=120
- 5. ollama/llama3.1 · 64% · n=120
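One plausible pass/fail rule for a plan-decomposition case, consistent with the description above (the types and function are hypothetical, not the @agentskit/eval API): every step the planner emits must be executable by the worker, and the worker must complete all of them.

```typescript
interface StepResult {
  executable: boolean; // the worker could map the step to a concrete action
  completed: boolean;  // the action ran and succeeded
}

// A case passes only when the plan is non-empty and every step both
// mapped to an action and completed. An empty plan counts as a failure.
function planCasePasses(steps: StepResult[]): boolean {
  return steps.length > 0 && steps.every((s) => s.executable && s.completed);
}
```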
streaming latency (p95, ms)
Time to first token, p95 over 500 requests, same prompt, same region.
lower is better
- 1. ollama/llama3.1 · 180 ms · n=500
- 2. gemini/2.5-flash · 310 ms · n=500
- 3. openai/gpt-4o-mini · 380 ms · n=500
- 4. anthropic/claude-sonnet-4-6 · 420 ms · n=500
- 5. openrouter/llama-3.1-70b · 950 ms · n=500
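For reference, the p95 aggregation can be computed with the nearest-rank method, one common convention (the harness may interpolate instead; this sketch assumes nearest-rank):

```typescript
// p95 over a sample of latencies using the nearest-rank method:
// sort ascending, take the value at 1-based rank ceil(0.95 * n).
function p95(latenciesMs: number[]): number {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length);
  return sorted[rank - 1];
}
```

With n=500 as above, this picks the 475th slowest request, so 25 outlier requests can exceed the reported figure.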
Want to reproduce? Every suite corresponds to a file under packages/eval/tests/scoreboard/. Run pnpm --filter @agentskit/eval bench locally.