@agentskit/eval-braintrust

Braintrust scoring pipeline for @agentskit/eval — quality + robustness scorers, CI regression alerts, dataset sync.

@agentskit/eval-braintrust connects AgentsKit eval suites to Braintrust. It runs your test cases through a set of deterministic scorers, ships the results to a Braintrust experiment, and surfaces regressions as CI annotations. The Braintrust SDK does not need to be installed at build time; it is resolved as a peer dependency at runtime.

#When to reach for it

  • You already use Braintrust for dataset management and want AgentsKit runs to show up in its experiment UI.
  • You want multi-dimensional scoring (task success, factual grounding, citation correctness, schema survival, HITL gate, crash resilience) with a single import.
  • You want regression alerts between two experiment snapshots in CI.

#Install

npm install @agentskit/eval-braintrust
# peer deps
npm install @agentskit/eval braintrust

braintrust is an optional peer dependency: the adapter loads it dynamically at runtime. If the API key or the package is absent, cases are still scored locally and nothing is synced to Braintrust.
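
The optional-peer behavior described above can be sketched as follows. This is an illustration only, not the adapter's actual code: `loadOptionalPeer` and `resolveMode` are hypothetical names, and the rule "remote only when both the key and the package are available" is an assumption drawn from the paragraph above.

```typescript
type LoadedModule = Record<string, unknown> | null

// Hypothetical helper: not part of @agentskit/eval-braintrust.
// Tries to load a package at runtime and returns null when it cannot be
// resolved, so the caller can fall back to local-only scoring.
async function loadOptionalPeer(name: string): Promise<LoadedModule> {
  try {
    // A dynamic import keeps the dependency out of the build-time module graph.
    return (await import(name)) as Record<string, unknown>
  } catch {
    return null
  }
}

// Assumed decision rule: sync remotely only when an API key is present
// and the peer package can actually be loaded.
async function resolveMode(
  apiKey: string | undefined,
  peerName = 'braintrust',
): Promise<'remote' | 'local'> {
  if (!apiKey) return 'local'
  const sdk = await loadOptionalPeer(peerName)
  return sdk === null ? 'local' : 'remote'
}
```

The design choice this illustrates is that a missing peer is treated as a mode switch rather than an error, so a CI job without Braintrust credentials can still produce local scores.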

#Public API

#Root (@agentskit/eval-braintrust)

| Export | Kind | Purpose |
| --- | --- | --- |
| `runBraintrustEval(args, internals?)` | async fn | Run cases, score, sync to Braintrust, return `ExperimentResult` |
| `scoreCase(scorers, input)` | async fn | Score a single `ScorerInput` against an array of scorers |
| `summarize(cases)` | fn | Aggregate `ScoredCase[]` into per-scorer `{ mean, n }` map |
| `BraintrustRunOptions` | type | Options passed to `runBraintrustEval` |
| `ScoredCase` | type | Single case result with scores + duration |
| `ExperimentResult` | type | Full run result: cases, summary, remote URL |
| `RunBraintrustEvalArgs` | type | Argument shape for `runBraintrustEval` |
| `Scorer` | type | `(args: ScorerInput) => ScorerResult \| Promise<ScorerResult>` |
| `ScorerInput` | type | `{ input, output, expected?, metadata? }` |
| `ScorerResult` | type | `{ name, score, rationale?, metadata? }` |
| `ScorerFamily` | type | `{ family: 'quality' \| 'robustness'; scorers }` |
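
Because `Scorer` is just a function from `ScorerInput` to `ScorerResult`, a custom scorer can be written in a few lines. The sketch below declares local stand-in types that mirror the shapes in the table (the field types such as `unknown` and `number` are assumptions; in a real project you would import the types from the package), and `normalizedExactMatch` is a hypothetical scorer, not one the package ships.

```typescript
// Local stand-ins mirroring the documented shapes; field types are assumed.
type ScorerInput = {
  input: unknown
  output: unknown
  expected?: unknown
  metadata?: Record<string, unknown>
}
type ScorerResult = {
  name: string
  score: number
  rationale?: string
  metadata?: Record<string, unknown>
}
type Scorer = (args: ScorerInput) => ScorerResult | Promise<ScorerResult>

// Hypothetical custom scorer: 1 when the trimmed, lowercased output equals
// the expected value, 0 otherwise (including when no expected value is given).
const normalizedExactMatch: Scorer = ({ output, expected }) => {
  const norm = (v: unknown) => String(v ?? '').trim().toLowerCase()
  const hit = expected !== undefined && norm(output) === norm(expected)
  return {
    name: 'normalizedExactMatch',
    score: hit ? 1 : 0,
    rationale: hit ? 'output matches expected' : 'output differs from expected',
  }
}
```

A scorer of this shape could presumably be passed alongside the built-in ones in the `scorers` array, since the array only requires the `Scorer` signature.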

#Scorers (@agentskit/eval-braintrust/scorers)

| Export | Family | What it measures |
| --- | --- | --- |
| `taskSuccess` | quality | Did the output satisfy the task? |
| `factualGrounding` | quality | Is the output grounded in provided context? |
| `citationCorrectness` | quality | Are citations present and accurate? |
| `toolArgValidity` | quality | Are tool call arguments schema-valid? |
| `schemaSurvival` | robustness | Does output survive schema round-trip? |
| `hitlGateCorrectness` | robustness | Did HITL gate fire correctly? |
| `fallbackResilience` | robustness | Does agent recover from injected failures? |
| `noCrashSurvival` | robustness | Does agent complete without throwing? |
| `qualityFamily` | | All four quality scorers bundled |
| `robustnessFamily` | | All four robustness scorers bundled |
| `ALL_SCORERS` | | Flat array of all eight scorers |
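
The relationship between families and a flat scorer array can be sketched with local stand-in types. Everything here is illustrative: the two scorers and `flattenFamilies` are hypothetical, and `scorers: Scorer[]` is an assumption about the `ScorerFamily` shape, which the API table leaves untyped.

```typescript
// Local stand-ins; simplified versions of the documented shapes.
type ScorerInput = { input: unknown; output: unknown; expected?: unknown }
type ScorerResult = { name: string; score: number }
type Scorer = (args: ScorerInput) => ScorerResult | Promise<ScorerResult>
type ScorerFamily = { family: 'quality' | 'robustness'; scorers: Scorer[] }

// Hypothetical scorers used only to populate the families below.
const nonEmptyOutput: Scorer = ({ output }) => ({
  name: 'nonEmptyOutput',
  score: String(output ?? '').length > 0 ? 1 : 0,
})
const parsesAsJson: Scorer = ({ output }) => {
  try {
    JSON.parse(String(output))
    return { name: 'parsesAsJson', score: 1 }
  } catch {
    return { name: 'parsesAsJson', score: 0 }
  }
}

const myQuality: ScorerFamily = { family: 'quality', scorers: [nonEmptyOutput] }
const myRobustness: ScorerFamily = { family: 'robustness', scorers: [parsesAsJson] }

// Flatten families into one array, the same idea as a flat ALL_SCORERS list.
function flattenFamilies(families: ScorerFamily[]): Scorer[] {
  return families.reduce<Scorer[]>((acc, f) => acc.concat(f.scorers), [])
}

const myScorers = flattenFamilies([myQuality, myRobustness])
```

Grouping by family makes it easy to run only the quality or only the robustness scorers in a given job while keeping one flat list for full runs.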

#CI helpers (@agentskit/eval-braintrust/ci)

| Export | Purpose |
| --- | --- |
| `detectRegressions(baseline, current, thresholds?)` | Returns `RegressionAlert[]` for scorers that dropped beyond threshold |
| `formatAlertsMarkdown(alerts)` | Renders a markdown table of regressions for PR comments / CI summary |
| `RegressionThresholds` | `{ default?: number; perScorer?: Record<string, number> }` |
| `RegressionAlert` | `{ scorer, baseline, current, delta, threshold }` |
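
To make the threshold semantics concrete, here is a minimal re-implementation sketch of regression detection. It is not the package's code and rests on stated assumptions: inputs are per-scorer `{ mean, n }` summaries, a regression is a drop in mean larger than the threshold, `perScorer` overrides `default`, and 0.05 is used when neither is given. `detectRegressionsSketch` is a hypothetical name.

```typescript
type Summary = Record<string, { mean: number; n: number }>
type RegressionThresholds = { default?: number; perScorer?: Record<string, number> }
type RegressionAlert = {
  scorer: string
  baseline: number
  current: number
  delta: number
  threshold: number
}

// Illustrative only: compares mean scores per scorer and reports drops that
// exceed the applicable threshold.
function detectRegressionsSketch(
  baseline: Summary,
  current: Summary,
  thresholds: RegressionThresholds = {},
): RegressionAlert[] {
  const alerts: RegressionAlert[] = []
  for (const scorer of Object.keys(baseline)) {
    const base = baseline[scorer]
    const cur = current[scorer]
    if (!base || !cur) continue // scorer missing from one run: skipped in this sketch
    // Assumed precedence: per-scorer threshold, then default, then 0.05.
    const threshold = thresholds.perScorer?.[scorer] ?? thresholds.default ?? 0.05
    const delta = cur.mean - base.mean
    if (delta < -threshold) {
      alerts.push({ scorer, baseline: base.mean, current: cur.mean, delta, threshold })
    }
  }
  return alerts
}
```

With a default threshold of 0.05, a drop from 0.9 to 0.8 produces an alert, while a drop from 0.8 to 0.79 under a per-scorer threshold of 0.03 does not.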

#Minimal example

import { runBraintrustEval } from '@agentskit/eval-braintrust'
import { ALL_SCORERS } from '@agentskit/eval-braintrust/scorers'

const result = await runBraintrustEval({
  cases: [
    { input: 'Capital of France?', output: 'Paris', expected: 'Paris' },
  ],
  // myAgent is a placeholder for your own agent function
  agent: async (input) => ({ output: await myAgent(input) }),
  scorers: ALL_SCORERS,
  options: {
    projectName: 'my-agent',
    experimentName: `ci-${Date.now()}`,
  },
})

console.log(result.summary)
// { taskSuccess: { mean: 1, n: 1 }, noCrashSurvival: { mean: 1, n: 1 }, ... }
console.log(result.url) // Braintrust experiment URL if BRAINTRUST_API_KEY is set
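
The `{ mean, n }` summary shape printed above can be illustrated with a local aggregation sketch. This is an assumption about how such a summary could be computed, not the package's `summarize` implementation; `ScoredCaseSketch`, its field names, and `summarizeSketch` are hypothetical.

```typescript
type ScorerResult = { name: string; score: number }
// Hypothetical case shape: the real ScoredCase field names may differ.
type ScoredCaseSketch = { scores: ScorerResult[]; durationMs: number }
type SummarySketch = Record<string, { mean: number; n: number }>

// Groups scores by scorer name and computes the arithmetic mean and count.
function summarizeSketch(cases: ScoredCaseSketch[]): SummarySketch {
  const totals = new Map<string, { sum: number; n: number }>()
  for (const c of cases) {
    for (const { name, score } of c.scores) {
      const t = totals.get(name) ?? { sum: 0, n: 0 }
      t.sum += score
      t.n += 1
      totals.set(name, t)
    }
  }
  const summary: SummarySketch = {}
  totals.forEach((t, name) => {
    summary[name] = { mean: t.sum / t.n, n: t.n }
  })
  return summary
}
```

Keeping `n` next to `mean` matters when some scorers do not apply to every case, because two means with very different sample sizes should not be compared as equals.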

#CI regression check

import { detectRegressions, formatAlertsMarkdown } from '@agentskit/eval-braintrust/ci'

const alerts = detectRegressions(baselineSummary, currentSummary, {
  default: 0.05,
  perScorer: { factualGrounding: 0.03 },
})

if (alerts.length) {
  console.error(formatAlertsMarkdown(alerts))
  process.exit(1)
}
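
For a sense of what a markdown rendering of alerts might look like, here is a small sketch. The column layout, the three-decimal formatting, and the name `formatAlertsSketch` are all assumptions; the real `formatAlertsMarkdown` output may differ.

```typescript
type RegressionAlert = {
  scorer: string
  baseline: number
  current: number
  delta: number
  threshold: number
}

// Illustrative renderer: one markdown table row per alert.
function formatAlertsSketch(alerts: RegressionAlert[]): string {
  if (alerts.length === 0) return 'No regressions detected.'
  const header = [
    '| Scorer | Baseline | Current | Delta | Threshold |',
    '| --- | --- | --- | --- | --- |',
  ]
  const rows = alerts.map(
    (a) =>
      `| ${a.scorer} | ${a.baseline.toFixed(3)} | ${a.current.toFixed(3)} ` +
      `| ${a.delta.toFixed(3)} | ${a.threshold.toFixed(3)} |`,
  )
  return header.concat(rows).join('\n')
}
```

Output in this form can be pasted into a PR comment or appended to a CI job summary without further processing.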

#Configuration

| Env var | Purpose |
| --- | --- |
| `BRAINTRUST_API_KEY` | Authenticates with the Braintrust API. Without it, scoring runs locally and no data is uploaded. |
| `BRAINTRUST_BASE_URL` | Override the Braintrust API base URL (default: Braintrust cloud). |

You can also pass apiKey / baseUrl directly in BraintrustRunOptions — explicit values take precedence over env vars.
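
The precedence rule above (explicit options first, then environment variables) can be sketched as a small resolver. `resolveConnection` is a hypothetical helper written for illustration, not part of the package; the environment is passed in as a plain object so the sketch stays self-contained.

```typescript
type ConnectionOptions = { apiKey?: string; baseUrl?: string }
type ResolvedConnection = { apiKey: string | undefined; baseUrl: string | undefined }

// Illustrative only: an explicit option wins, otherwise the env var is used.
function resolveConnection(
  options: ConnectionOptions,
  env: Record<string, string | undefined>,
): ResolvedConnection {
  return {
    apiKey: options.apiKey ?? env['BRAINTRUST_API_KEY'],
    baseUrl: options.baseUrl ?? env['BRAINTRUST_BASE_URL'],
  }
}
```

In a Node.js script you would typically call this with `process.env` as the second argument. An undefined `apiKey` in the result corresponds to the local-only mode described in the table.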

#Stability

  • Version: 0.1.0
  • Tier: beta
  • Contract: scorer shapes stable; Braintrust SDK peer-resolved at runtime.

#Source

npm: @agentskit/eval-braintrust · repo: packages/eval-braintrust
