Evals

Adapter scoreboard

Accuracy on tool use and planning, plus streaming latency. Runs come from @agentskit/eval. Today's numbers are seed data; the CI pipeline will replace them as real runs land.

Last updated: 2026-04-24

tool-use accuracy

Did the model call the right tool with the right args?

higher is better

  • 1. anthropic/claude-sonnet-4-6 · 95% · n=200
  • 2. openai/gpt-4o-mini · 92% · n=200
  • 3. gemini/2.5-flash · 88% · n=200
  • 4. openrouter/llama-3.1-70b · 81% · n=200
  • 5. ollama/llama3.1 · 74% · n=200
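One way to picture this metric: a trial passes only when the model names the expected tool and supplies matching arguments, and accuracy is the pass rate over all trials. The sketch below is illustrative, not the @agentskit/eval internals; the `ToolCall` shape and `toolUseAccuracy` name are assumptions.

```typescript
// Hypothetical sketch of the tool-use scoring rule, not the real harness.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// A trial passes when the model picked the expected tool AND supplied
// structurally equal arguments; accuracy is passes / trials.
export function toolUseAccuracy(
  expected: ToolCall[],
  actual: ToolCall[],
): number {
  const passes = expected.filter((exp, i) => {
    const act = actual[i];
    return (
      act !== undefined &&
      act.name === exp.name &&
      // Naive deep-equality via JSON; key order matters here, so a real
      // grader would want an order-insensitive comparison.
      JSON.stringify(act.args) === JSON.stringify(exp.args)
    );
  }).length;
  return passes / expected.length;
}
```

Note the JSON-stringify comparison is order-sensitive by design simplicity; a production grader would normalize argument objects before comparing.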

plan decomposition

Can the model split a goal into executable steps the worker then completes?

higher is better

  • 1. anthropic/claude-sonnet-4-6 · 91% · n=120
  • 2. openai/gpt-4o-mini · 84% · n=120
  • 3. gemini/2.5-flash · 79% · n=120
  • 4. openrouter/llama-3.1-70b · 72% · n=120
  • 5. ollama/llama3.1 · 64% · n=120
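A minimal way to grade this kind of suite, sketched under stated assumptions (the `PlanTrial` shape and `planPassRate` name are hypothetical, not the actual @agentskit/eval API): a trial passes when every step the planner emitted was completed by the worker, and the suite reports the pass rate.

```typescript
// Illustrative grading rule for plan decomposition; names are assumptions.
export interface PlanTrial {
  steps: string[]; // steps the planner produced
  completed: string[]; // steps the worker actually finished
}

// A trial passes when the plan is non-empty and every planned step
// appears in the worker's completed set; score is passes / trials.
export function planPassRate(trials: PlanTrial[]): number {
  const passed = trials.filter(
    (t) => t.steps.length > 0 && t.steps.every((s) => t.completed.includes(s)),
  ).length;
  return passed / trials.length;
}
```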

streaming latency (p95, ms)

Time to first token, p95 over 500 requests, same prompt, same region.

lower is better

  • 1. ollama/llama3.1 · 180 ms · n=500
  • 2. gemini/2.5-flash · 310 ms · n=500
  • 3. openai/gpt-4o-mini · 380 ms · n=500
  • 4. anthropic/claude-sonnet-4-6 · 420 ms · n=500
  • 5. openrouter/llama-3.1-70b · 950 ms · n=500
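For reference, a p95 over 500 samples is simply the 95th percentile of the raw time-to-first-token measurements. The helper below is a minimal sketch using the nearest-rank method, assuming the eval harness hands back a flat array of millisecond samples; it is not the @agentskit/eval implementation.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of the
// distribution at or below it. Hypothetical helper, not the real harness.
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}
```

With 500 requests, the p95 is effectively the 475th-slowest measurement, which is why a handful of slow outliers barely move it while a systemic slowdown does.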

Want to reproduce? Every suite corresponds to a file under packages/eval/tests/scoreboard/. Run pnpm --filter @agentskit/eval bench locally.