- Review: A queue-based workflow for going through AI conversations and describing what went wrong.
- Issues: As you review, Axiom matches observations to existing issues or creates new ones. Categories aren’t defined upfront.
- Evidence: Each issue collects every conversation assigned to it, along with reviewer notes and links to the full trace.
- Signal to action: Feedback, online evaluations, and error traces surface problems. Review and issues help you understand them.
If you’re using Axiom’s AI SDK, you’ve probably got GenAI traces flowing, feedback coming in, maybe online evaluations running. The dashboards might look fine, but when an output goes wrong, what happens? Someone probably spots it, mentions it in Slack, maybe files an issue. The next bad output gets handled differently by someone else. Two weeks later you’re not sure if this is a new problem or one that’s been quietly recurring.
The missing piece is the bit between “that output was bad” and “we understand the problem well enough to fix it.” A domain expert on your team reading a conversation trace can see things no automated scorer will catch, like the agent confirming an action it never actually completed, or referencing a policy that was retired months ago. But those observations need somewhere to go, otherwise they stay in people’s heads and are never codified.
Today we’re releasing Review and Issues for AI capabilities in Axiom to address all this.
How review works
You pick a queue and work through the conversations in it.
There are five built-in queues:
- Flagged: traces your team marked for attention
- Recent: the latest conversations, for routine spot checks
- Negative feedback: traces where users gave a thumbs-down
- Errored: traces with exceptions
- Reviewed: past decisions you want to revisit
For each conversation, you make a binary call: good or bad. Then you write a note about what you saw.
In qualitative research this would be called open coding: observations in your own words, without fitting them into predefined categories. Something like “agent claimed upgrade succeeded but tool output shows it didn’t” or “proceeded with return without asking about refund method.” Honest notes. The categorization happens after, and Axiom advances to the next trace when you submit to keep you in flow.
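A review in this model is just the verdict plus the note. A hypothetical sketch, with class and field names chosen for illustration rather than taken from Axiom's SDK:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Review:
    # Illustrative shape only, not Axiom's API.
    conversation_id: str
    good: bool   # the binary call
    note: str    # open-coding observation in the reviewer's own words

review = Review(
    conversation_id="conv_42",   # hypothetical id
    good=False,
    note="agent claimed upgrade succeeded but tool output shows it didn't",
)
```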
From reviews to issues
When you submit a review, Axiom checks your review against existing issue categories for that capability. If it matches something that’s come up before, the conversation gets added to that issue. New kinds of problems become new issues. Vague reviews get logged but don’t create anything.
You don’t need to define your failure taxonomy in advance. It builds itself from what reviewers write down. Every categorization decision is logged with its reasoning, so you can see why a conversation ended up where it did and correct it if needed.
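The matching step can be pictured roughly like this. Axiom's actual categorization isn't described here, so a naive word-overlap score stands in for it; the `Issue` shape and the threshold are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Issue:
    # Hypothetical issue shape for illustration only.
    title: str
    conversations: list = field(default_factory=list)

def words(text):
    return set(text.lower().split())

def assign(review_note, conversation_id, issues, threshold=0.5):
    """Attach the conversation to the best-matching issue, or open a new one.

    Returns the issue and a logged reason, mirroring the idea that every
    categorization decision is recorded with its reasoning.
    """
    best, best_score = None, 0.0
    for issue in issues:
        overlap = words(review_note) & words(issue.title)
        score = len(overlap) / max(len(words(issue.title)), 1)
        if score > best_score:
            best, best_score = issue, score
    if best is not None and best_score >= threshold:
        best.conversations.append(conversation_id)
        return best, f"matched '{best.title}' (score {best_score:.2f})"
    new = Issue(title=review_note, conversations=[conversation_id])
    issues.append(new)
    return new, "no existing issue matched; created new"
```

The point of the sketch is the control flow, not the scoring: strong matches accrete onto existing issues, and anything genuinely new seeds an issue of its own, so the taxonomy grows out of the notes reviewers actually write.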
Issues
After a week of reviews, your issues list might have entries like “confirmed action not completed” with 12 conversations, “skipped required confirmation” with 8, and “referenced data not in tool response” with 3. All of these started as freeform reviews from different people on the team, with Axiom helping you find the patterns.
When you open an issue, you see every conversation assigned to it with the reviewer’s note and a link to the full trace. You might notice that “action not completed” keeps showing up around one specific tool integration, or that “skipped confirmation” is limited to multi-step requests. That tells you where to look and what test cases to write.
If you resolve an issue and new conversations get assigned to it later, the issue reopens.
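That lifecycle can be sketched as a small state machine. The `status` values and method names here are illustrative, not Axiom's API:

```python
from dataclasses import dataclass, field

@dataclass
class Issue:
    # Hypothetical issue shape for illustration only.
    title: str
    status: str = "open"
    conversations: list = field(default_factory=list)

    def resolve(self):
        self.status = "resolved"

    def add_conversation(self, conversation_id):
        self.conversations.append(conversation_id)
        if self.status == "resolved":
            self.status = "open"   # new evidence reopens the issue
```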
Where this fits
This is the last piece of the toolkit we’ve been building. You instrument capabilities, run evaluations against test cases, collect feedback in production, score live traffic with online evaluations, and now you can review what needs human attention and track the issues that come out of it.
Get started with review workflows in the Review conversations and Track issues docs.