@agentskit/eval-braintrust — for agents

Braintrust scoring pipeline — 4 quality + 4 robustness scorers, runner, and CI regression helpers.

#Install

npm install @agentskit/eval-braintrust braintrust

The braintrust package is loaded lazily. If it is not installed, runs still produce scored output; the results are simply not uploaded.
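The lazy-loading behavior can be pictured as a dynamic import that tolerates a missing package. This is only a minimal sketch of the general pattern, not the package's actual loader; `loadOptional` and `describeMode` are hypothetical names introduced for illustration.

```typescript
// Hypothetical helper: resolve an optional dependency at call time.
// Returns null instead of throwing when the package is not installed.
async function loadOptional(specifier: string): Promise<unknown> {
  try {
    // A non-literal specifier keeps the import lazy (nothing loads at startup).
    return await import(specifier)
  } catch {
    return null
  }
}

// Hypothetical usage: decide whether an upload step is possible.
async function describeMode(specifier: string): Promise<string> {
  const mod = await loadOptional(specifier)
  return mod === null ? 'score only (no upload)' : 'score and upload'
}
```

Calling `describeMode('braintrust')` would presumably report the upload mode only when the package resolves in the current environment.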

#Primary exports

#Runner

  • runBraintrustEval({ cases, agent, scorers, options, bt? }) — scores cases through the agent, optionally logs to a Braintrust experiment.
  • scoreCase(scorers, args) — scores a single case; surfaces scorer crashes as scorer_error.
  • summarize(cases) → { name: { mean, n } } aggregate.
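To make the { name: { mean, n } } aggregate concrete, here is a rough sketch of how such a summary could be computed. It assumes each scored case carries a `scores` map from scorer name to a number; that shape, and the names `ScoredCase` and `summarizeSketch`, are hypothetical and may not match the package's real types.

```typescript
// Hypothetical shape: each scored case carries a map of scorer name to score.
type ScoredCase = { scores: Record<string, number> }
type Summary = Record<string, { mean: number; n: number }>

// Aggregate per-scorer totals, then convert them to { mean, n }.
function summarizeSketch(cases: ScoredCase[]): Summary {
  const totals: Record<string, { sum: number; n: number }> = {}
  for (const c of cases) {
    for (const name of Object.keys(c.scores)) {
      const t = totals[name] ?? { sum: 0, n: 0 }
      t.sum += c.scores[name] ?? 0
      t.n += 1
      totals[name] = t
    }
  }
  const summary: Summary = {}
  for (const name of Object.keys(totals)) {
    const t = totals[name]
    if (t) summary[name] = { mean: t.sum / t.n, n: t.n }
  }
  return summary
}

// Two cases; only the first was scored by schemaSurvival.
const demoSummary = summarizeSketch([
  { scores: { taskSuccess: 1, schemaSurvival: 1 } },
  { scores: { taskSuccess: 0 } },
])
console.log(demoSummary)
```

In this sketch `n` counts only the cases where a scorer actually produced a value, so scorers can have different counts within one run.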

#Quality scorers

  • taskSuccess — substring / regex / predicate match against expected.
  • factualGrounding — fraction of metadata.sources referenced in the output.
  • citationCorrectness — cite-tag presence + match against metadata.expectedCitations.
  • toolArgValidity — fraction of metadata.toolCalls with schemaValid !== false.
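The quality scorers above fall into two patterns: a match check and a fraction over metadata. The sketch below illustrates one of each under stated assumptions. The types and the `...Sketch` function names are hypothetical, and the real scorers may handle edge cases (such as an empty tool-call list) differently.

```typescript
// Hypothetical expectation type: a substring, a regex, or a predicate on the output.
type Expected = string | RegExp | ((output: string) => boolean)

// Sketch of a taskSuccess-style check: 1 on match, 0 otherwise.
function taskSuccessSketch(output: string, expected: Expected): number {
  if (typeof expected === 'string') return output.includes(expected) ? 1 : 0
  if (expected instanceof RegExp) return expected.test(output) ? 1 : 0
  return expected(output) ? 1 : 0
}

// Hypothetical tool-call record; only schemaValid matters here.
type ToolCall = { name: string; schemaValid?: boolean }

// Sketch of a toolArgValidity-style score:
// the fraction of calls whose schemaValid is not explicitly false.
// Returning 1 for an empty list is an assumption of this sketch.
function toolArgValiditySketch(toolCalls: ToolCall[]): number {
  if (toolCalls.length === 0) return 1
  const valid = toolCalls.filter(call => call.schemaValid !== false).length
  return valid / toolCalls.length
}
```

Note that `schemaValid !== false` treats a missing flag as valid, which matches the wording of the toolArgValidity entry.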

#Robustness scorers

  • schemaSurvival — 1 unless metadata.parseError or schemaValid === false.
  • hitlGateCorrectness — 1 when hitlExpected === hitlTriggered.
  • fallbackResilience — 1 on clean run or recovered fallback; 0 on uncovered errors.
  • noCrashSurvival — 0 when metadata.crashed or uncaughtException.
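The robustness scorers read as binary checks over run metadata. A minimal sketch of three of them follows, using only the field names listed above. The `RunMetadata` type and the `...Sketch` functions are hypothetical, and treating a missing flag as false is an assumption of this example.

```typescript
// Hypothetical metadata shape, limited to fields named in the scorer list.
type RunMetadata = {
  parseError?: unknown
  schemaValid?: boolean
  hitlExpected?: boolean
  hitlTriggered?: boolean
  crashed?: boolean
  uncaughtException?: unknown
}

// 1 unless parsing failed or the schema check explicitly failed.
function schemaSurvivalSketch(m: RunMetadata): number {
  return m.parseError || m.schemaValid === false ? 0 : 1
}

// 1 when the human-in-the-loop gate fired exactly when it was expected to.
// Missing flags are coerced to false (an assumption of this sketch).
function hitlGateSketch(m: RunMetadata): number {
  return Boolean(m.hitlExpected) === Boolean(m.hitlTriggered) ? 1 : 0
}

// 0 when the run crashed or raised an uncaught exception, 1 otherwise.
function noCrashSketch(m: RunMetadata): number {
  return m.crashed || m.uncaughtException ? 0 : 1
}
```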

#Families

  • qualityFamily, robustnessFamily, ALL_SCORERS.

#Subpaths

| Subpath | Contents |
| --- | --- |
| @agentskit/eval-braintrust/scorers | Individual scorer factories + families. |
| @agentskit/eval-braintrust/ci | detectRegressions(baseline, current, thresholds), formatAlertsMarkdown(alerts). |

#Minimal example

import {
  runBraintrustEval,
  ALL_SCORERS,
} from '@agentskit/eval-braintrust'

const result = await runBraintrustEval({
  cases: [{ input: 'Capital of France?', output: '', expected: 'Paris' }],
  // `agent` on the right-hand side is your own agent instance, created elsewhere.
  agent: async input => ({ output: await agent.run(input) }),
  scorers: ALL_SCORERS,
  options: { projectName: 'agentskit-showcase' },
})

console.log(result.summary, result.url)

#CI regression alert

import { detectRegressions, formatAlertsMarkdown } from '@agentskit/eval-braintrust/ci'

const alerts = detectRegressions(baseline.summary, current.summary, { default: 0.05 })
if (alerts.length) {
  process.stdout.write(formatAlertsMarkdown(alerts))
  process.exit(1)
}
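For intuition about what a threshold such as { default: 0.05 } might mean, the sketch below flags a scorer when its mean drops by more than the allowed amount between two summaries. This is an assumed interpretation, not the documented behavior of detectRegressions. The per-scorer override keys, the `Alert` shape, and both `...Sketch` functions are hypothetical.

```typescript
type Summary = Record<string, { mean: number; n: number }>
// Assumption: `default` applies unless a key named after the scorer overrides it.
type Thresholds = { default: number; [scorer: string]: number }
type Alert = { scorer: string; baseline: number; current: number; drop: number }

// Flag every scorer whose mean fell by more than its allowed threshold.
function detectRegressionsSketch(baseline: Summary, current: Summary, thresholds: Thresholds): Alert[] {
  const alerts: Alert[] = []
  for (const scorer of Object.keys(baseline)) {
    const before = baseline[scorer]
    const after = current[scorer]
    if (!before || !after) continue // scorers missing from either run are skipped here
    const drop = before.mean - after.mean
    const allowed = thresholds[scorer] ?? thresholds.default
    if (drop > allowed) {
      alerts.push({ scorer, baseline: before.mean, current: after.mean, drop })
    }
  }
  return alerts
}

// Render the alerts as a small Markdown table, e.g. for a CI comment.
function formatAlertsSketch(alerts: Alert[]): string {
  const rows = alerts.map(
    a => `| ${a.scorer} | ${a.baseline.toFixed(2)} | ${a.current.toFixed(2)} | ${a.drop.toFixed(2)} |`,
  )
  return ['| Scorer | Baseline | Current | Drop |', '| --- | --- | --- | --- |', ...rows].join('\n') + '\n'
}
```

Comparing absolute drops in the mean keeps the check simple; a relative drop or a significance test that accounts for `n` would be alternative designs.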

#Source
