Scorers are functions that measure your AI capability’s output. They receive the inputs and outputs of a capability run, and return a score. The same Scorer API works in both offline and online evaluations. The key difference between the two contexts is what the scorer receives:
  • Offline scorers receive input, output, and expected (ground truth from your test collection).
  • Online scorers are reference-free. They receive input and output without an expected value.
Because the API is the same, you can reuse scorers across both contexts. A scorer you write for offline evaluations works in online evaluations as long as it doesn’t depend on expected.
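For example, a scorer can treat expected as optional and fall back to a reference-free check when it's absent. A minimal sketch using the Scorer wrapper introduced below (the fallback check is illustrative):
import { Scorer } from 'axiom/ai/scorers';

const categoryScorer = Scorer(
  'category',
  ({ output, expected }: { output: string; expected?: string }) => {
    if (expected !== undefined) {
      // Offline: ground truth is available, grade against it
      return output === expected;
    }
    // Online: no ground truth, fall back to a reference-free check
    return ['support', 'complaint', 'spam', 'unknown'].includes(output);
  },
);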

Create scorers

Create scorers using the Scorer wrapper. A scorer takes a name and a scoring function:
import { Scorer } from 'axiom/ai/scorers';

const MyScorer = Scorer(
  'my-scorer',
  ({ input, output }) => {
    // Return a boolean, a number (0-1), or { score, metadata }
    return typeof output === 'string' && output.trim().length > 0; // placeholder pass/fail check
  }
);

Return types

Scorers can return three types of values:

Boolean

Return true or false for simple pass/fail checks. The SDK converts booleans to 1 (pass) or 0 (fail) and marks the score as boolean in telemetry.
const isKnownCategory = Scorer(
  'is-known-category',
  ({ output }: { output: string }) => {
    return ['support', 'complaint', 'spam', 'unknown'].includes(output);
  },
);

Numeric

Return a number between 0 and 1 for graded scoring:
const formatConfidence = Scorer(
  'format-confidence',
  ({ output }: { output: string }) => {
    const trimmed = output.trim().toLowerCase();
    const isSingleWord = !trimmed.includes(' ');
    const isClean = /^[a-z_]+$/.test(trimmed);

    return (isSingleWord ? 0.5 : 0) + (isClean ? 0.5 : 0);
  },
);

Score with metadata

Return an object with score and metadata to attach additional context to the eval span:
const validCategory = Scorer(
  'valid-category',
  ({ output }: { output: string }) => {
    const validCategories = ['support', 'complaint', 'spam', 'unknown'];
    return {
      score: validCategories.includes(output),
      metadata: {
        category: output,
        validCategories,
      },
    };
  },
);

Scorer patterns

Exact match (offline)

Compare the output directly against the expected value. This pattern only works in offline evaluations where ground truth is available.
const ExactMatchScorer = Scorer(
  'exact-match',
  ({ output, expected }) => {
    return output.sentiment === expected.sentiment;
  }
);

Heuristic checks

Validate output structure or format without ground truth. These scorers work in both offline and online evaluations.
const formatScorer = Scorer('format', ({ output }: { output: string }) => {
  const trimmed = output.trim();
  return /[.!?]$/.test(trimmed) && !trimmed.includes('\n') && trimmed.length <= 200;
});

LLM-as-judge

Use a second model to evaluate the output. Scorers can be async, so you can call a judge model inside them. LLM judges work in both contexts and are especially useful in online evaluations, where there is no ground truth and you need a semantic quality check.
import { generateObject } from 'ai';
import { z } from 'zod';

const relevanceScorer = Scorer(
  'relevance',
  async ({ input, output }: { input: string; output: string }) => {
    const result = await generateObject({
      model: judgeModel, // an AI SDK model instance, assumed to be defined elsewhere
      schema: z.object({
        relevant: z.boolean().describe('Whether the response answers the question'),
      }),
      system: 'You evaluate if an AI response answers the user question.',
      prompt: `Question: ${input}\n\nResponse: ${output}`,
    });
    return result.object.relevant;
  },
);
LLM judge scorers add latency and cost per evaluation. In online evaluations, use sampling to control how often they run.

Use autoevals

The autoevals library provides prebuilt scorers for common tasks:
npm install autoevals
import { Scorer } from 'axiom/ai/scorers';
import { Levenshtein, Factuality } from 'autoevals';

const LevenshteinScorer = Scorer(
  'levenshtein',
  ({ output, expected }) => {
    return Levenshtein({ output: output.text, expected: expected.text });
  }
);

const FactualityCheck = Scorer(
  'factuality',
  async ({ output, expected }) => {
    // Factuality also accepts the original question as `input` if you have it
    return await Factuality({
      output: output.text,
      expected: expected.text,
    });
  }
);
Use multiple scorers to evaluate different aspects of your capability. For example, check both exact accuracy and semantic similarity to get a complete picture of performance.
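As a rough sketch of what that wiring can look like (the Eval(name, { data, task, scorers }) registration shape and import path here are assumptions for illustration; see the evaluations docs for the exact API):
import { Eval } from 'axiom/ai/evals'; // assumed import path

Eval('email-categorization', {
  data: () => [{ input: 'My invoice is wrong', expected: { text: 'complaint' } }],
  task: async ({ input }) => classifyEmail(input), // your capability under test (hypothetical)
  scorers: [LevenshteinScorer, FactualityCheck], // each run is graded on both axes
});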

Telemetry

Each scorer produces an OTel span with the following attributes:
  • gen_ai.operation.name: Always eval.score
  • eval.name: The eval name
  • eval.score.name: The scorer name
  • eval.score.value: The numeric score (0-1)
  • eval.score.metadata: JSON string of scorer metadata. Includes eval.score.is_boolean: true when the scorer returned a boolean.
  • eval.capability.name: The capability being evaluated
  • eval.step.name: The step within the capability (when set)
  • eval.tags: ["online"] for online evaluations
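For illustration, a passing run of the is-known-category scorer above might produce span attributes like these (the eval and capability names are hypothetical):
const exampleSpanAttributes = {
  'gen_ai.operation.name': 'eval.score',
  'eval.name': 'email-categorization', // hypothetical
  'eval.score.name': 'is-known-category',
  'eval.score.value': 1, // boolean pass converted to 1
  'eval.score.metadata': '{"eval.score.is_boolean":true}',
  'eval.capability.name': 'categorize-email', // hypothetical
};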

What’s next?