Axiom Docs

The Write evaluations skill turns AI agents into evaluation suite authors for AI capabilities. Evaluations are the test suite for non-deterministic systems: they measure whether a capability still behaves correctly after every change. You can write offline evaluations to test your AI capability against a curated collection of inputs with expected outputs (ground truth). Or you can write online evaluations to score your AI capability’s outputs on live production traffic. For more information, see Evaluations.

What the Write evaluations skill does

Offline evaluations

Evaluation files: Generates complete evaluation files with scorers, test data, and task definitions ready to run
Scoring methods: Supports a range of scoring strategies, from simple exact matching to structured output and tool-use validation
Experiment configuration: Sets up typed configuration so you can sweep models and parameters without changing code
Project setup: Scaffolds the config file that wires your evaluations into the Axiom AI SDK
Test data: Designs test cases that cover normal use, edge cases, and failure scenarios

Online evaluations

Live traffic scoring: Scores real production responses without needing a reference answer to compare against
Sampling controls: Lets you score a fraction of traffic or apply conditional logic to decide what to evaluate
Trace linking: Connects evaluation results back to the traces that produced them

Prerequisites

Create an Axiom account.
Create a dataset in Axiom where you send your data.
Create an API token in Axiom with permissions to ingest data to the dataset you have created.

Follow the procedure in Quickstart to set up Axiom AI SDK in your TypeScript project. Authenticate with environment variables.
Install vitest as a dev dependency by running npm install vitest --save-dev.

Install Axiom Skills

Install all Axiom skills at once for Claude Code, Cursor, and other Claude-compatible agents:

npx skills add axiomhq/skills

Configure Axiom credentials

All Axiom Skills share the same credential configuration. Create a configuration file at ~/.axiom.toml:

~/.axiom.toml

[deployments.dev]
url = "https://api.axiom.co"
token = "API_TOKEN"
org_id = "ORGANIZATION_ID"
edge_url = "AXIOM_DOMAIN"

Replace API_TOKEN with the Axiom API token you have generated. For added security, store the API token in an environment variable.Replace ORGANIZATION_ID with your organization ID. For more information, see Determine organization ID.Replace AXIOM_DOMAIN with the base domain of your edge deployment. For more information, see Edge deployments.For token creation and scoping guidance, see Token hygiene for AI agents.

Use the skill

The Write evaluations skill activates automatically when you ask your AI agent to:

Write evaluations for an AI feature
Create scorers for testing AI output
Set up flag schemas for experiment configuration
Generate test data for AI capabilities

Example prompts:

“Write evaluations for the support agent’s message categorization function”
“Add adversarial test cases to the existing eval suite”
“Set up flag schemas to compare GPT-4o and GPT-5 across all evals”
“Create an online eval to monitor response quality in production”

How it works

The Write evaluations skill follows a structured workflow:

Understand the feature: The agent reads your AI function’s code, traces inputs and outputs, identifies the model call, and checks for existing evaluations.
Determine evaluation type: Based on the output type (classification, free-form text, structured object, tool calls, retrieval results), the agent selects the appropriate evaluation pattern and scorers.
Generate evaluation file: The agent creates a .eval.ts file colocated with the source, imports the actual function, and wires up typed scorers.
Design test data: The agent generates test cases covering happy path, adversarial, boundary, and negative categories.
Validate and run: The agent validates the evaluation structure and runs it locally.

Supported evaluation types and templates

The Write evaluations skill includes a pre-built template for each supported evaluation type:

Output type	Evaluation type	Scorer pattern	Use case
String category or label	Classification	Exact match	Category classification with adversarial cases
Free-form text	Text quality	Contains keywords or LLM-as-judge	Open-ended response quality
Array of items	Retrieval	Set match	RAG and document retrieval
Structured object	Structured output	Field-by-field match	Complex object validation
Agent result with tool calls	Tool use	Tool name presence	Agent tool usage validation

​What the Write evaluations skill does

​Prerequisites

​Install Axiom Skills

​Configure Axiom credentials

​Use the skill

​How it works

​Supported evaluation types and templates

What the Write evaluations skill does

Prerequisites

Install Axiom Skills

Configure Axiom credentials

Use the skill

How it works

Supported evaluation types and templates