What the Write evaluations skill does
Offline evaluations
- Evaluation files: Generates complete evaluation files with scorers, test data, and task definitions ready to run
- Scoring methods: Supports a range of scoring strategies, from simple exact matching to structured output and tool-use validation
- Experiment configuration: Sets up typed configuration so you can sweep models and parameters without changing code
- Project setup: Scaffolds the config file that wires your evaluations into the Axiom AI SDK
- Test data: Designs test cases that cover normal use, edge cases, and failure scenarios
Online evaluations
- Live traffic scoring: Scores real production responses without needing a reference answer to compare against
- Sampling controls: Lets you score a fraction of traffic or apply conditional logic to decide what to evaluate
- Trace linking: Connects evaluation results back to the traces that produced them
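The sampling controls above can be illustrated with a plain TypeScript predicate. This is a hypothetical sketch of the idea, not the SDK's actual API; the `Response` shape and function names are assumptions.

```typescript
// Hypothetical sampling predicate: decide whether a production response
// should be scored, combining rate-based sampling with optional
// conditional logic, mirroring the controls described above.
type Response = { model: string; text: string };

function shouldScore(
  sampleRate: number,                    // e.g. 0.1 scores ~10% of traffic
  condition?: (r: Response) => boolean,  // optional conditional filter
): (r: Response) => boolean {
  return (r) => {
    if (condition && !condition(r)) return false;
    return Math.random() < sampleRate;
  };
}

// Example: score only responses from one model, at a 10% sample rate.
const scoreIt = shouldScore(0.1, (r) => r.model === "gpt-4o");
```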
Prerequisites
- Create an Axiom account.
- Create a dataset in Axiom where you send your data.
- Create an API token in Axiom with permissions to ingest data to the dataset you have created.
- Follow the procedure in Quickstart to set up Axiom AI SDK in your TypeScript project. Authenticate with environment variables.
- Install vitest as a dev dependency by running `npm install vitest --save-dev`.
Install Axiom Skills
Install all Axiom skills at once for Claude Code, Cursor, and other Claude-compatible agents:

Configure Axiom credentials
All Axiom Skills share the same credential configuration. Create a configuration file at ~/.axiom.toml:
~/.axiom.toml
- Replace API_TOKEN with the Axiom API token you have generated. For added security, store the API token in an environment variable.
- Replace ORGANIZATION_ID with your organization ID. For more information, see Determine organization ID.
- Replace AXIOM_DOMAIN with the base domain of your edge deployment. For more information, see Edge deployments.

For token creation and scoping guidance, see Token hygiene for AI agents.

Use the skill
The Write evaluations skill activates automatically when you ask your AI agent to:
- Write evaluations for an AI feature
- Create scorers for testing AI output
- Set up flag schemas for experiment configuration
- Generate test data for AI capabilities
Example prompts:
- “Write evaluations for the support agent’s message categorization function”
- “Add adversarial test cases to the existing eval suite”
- “Set up flag schemas to compare GPT-4o and GPT-5 across all evals”
- “Create an online eval to monitor response quality in production”
How it works
The Write evaluations skill follows a structured workflow:
1. Understand the feature: The agent reads your AI function’s code, traces inputs and outputs, identifies the model call, and checks for existing evaluations.
2. Determine evaluation type: Based on the output type (classification, free-form text, structured object, tool calls, retrieval results), the agent selects the appropriate evaluation pattern and scorers.
3. Generate evaluation file: The agent creates a .eval.ts file colocated with the source, imports the actual function, and wires up typed scorers.
4. Design test data: The agent generates test cases covering happy path, adversarial, boundary, and negative categories.
5. Validate and run: The agent validates the evaluation structure and runs it locally.
Supported evaluation types and templates
The Write evaluations skill includes a pre-built template for each supported evaluation type:

| Output type | Evaluation type | Scorer pattern | Use case |
|---|---|---|---|
| String category or label | Classification | Exact match | Category classification with adversarial cases |
| Free-form text | Text quality | Contains keywords or LLM-as-judge | Open-ended response quality |
| Array of items | Retrieval | Set match | RAG and document retrieval |
| Structured object | Structured output | Field-by-field match | Complex object validation |
| Agent result with tool calls | Tool use | Tool name presence | Agent tool usage validation |
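Two of the scorer patterns in the table can be sketched in plain TypeScript. These are hypothetical illustrations of the patterns, not the Axiom AI SDK's actual scorer exports.

```typescript
// Exact match: suits classification outputs (string category or label).
// Returns 1 on a match, 0 otherwise; whitespace is normalized first.
function exactMatch(output: string, expected: string): number {
  return output.trim() === expected.trim() ? 1 : 0;
}

// Set match: suits retrieval outputs. Scores the fraction of expected
// items that appear in the output, ignoring order and duplicates.
function setMatch(output: string[], expected: string[]): number {
  const want = new Set(expected);
  const got = new Set(output);
  let hits = 0;
  for (const item of want) {
    if (got.has(item)) hits++;
  }
  return want.size ? hits / want.size : 1;
}
```

Exact match gives a binary signal, which is why it pairs well with adversarial classification cases; set match gives partial credit, which suits RAG retrieval where returning most of the relevant documents still has value.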