Each onlineEval call creates spans tagged with eval.tags: ["online"] that appear alongside your production traces in the Console; a sketch of what a call might look like follows the list below. The evaluation interface helps you answer three core questions about production quality:
- How well is your capability performing on real traffic?
- Is quality improving or regressing across deployments?
- Which production edge cases need attention?
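For orientation, here is a minimal sketch of what a call might look like. The import path, option names (scorers, samplingRate), and scorer shape are illustrative assumptions, not the documented API surface; only onlineEval itself and the eval.tags: ["online"] attribute come from this guide.

```typescript
// Hypothetical sketch: import path, option names, and scorer shape are
// illustrative assumptions; only onlineEval itself and the
// eval.tags: ["online"] attribute come from this guide.
import { onlineEval } from "@your-org/evals"; // placeholder import path

// A toy heuristic scorer: 1 when the answer echoes the question text.
const relevance = {
  name: "relevance",
  score: ({ input, output }: { input: string; output: string }) =>
    output.toLowerCase().includes(input.toLowerCase()) ? 1 : 0,
};

export async function handleRequest(question: string, answer: string) {
  // Each call creates scorer spans tagged eval.tags: ["online"] that
  // land next to this request's production trace in the Console.
  await onlineEval("support-answer", {
    scorers: [relevance],
    samplingRate: 0.1, // score roughly 10% of production traffic
    input: question,
    output: answer,
  });
}
```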
Find online evaluation scores
Online evaluation scores appear in the Console as spans linked to the originating generation span. To view only online evaluation results, filter spans by eval.tags: ["online"].
For details on all emitted attributes, see Scorers.
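If you also export spans for programmatic analysis, the same filter translates directly to code. A minimal sketch, assuming spans are exported as JSON records with an attributes map (the record shape is an assumption):

```typescript
// Hypothetical span record shape for exported trace data.
interface SpanRecord {
  name: string;
  attributes: Record<string, unknown>;
}

// Keep only online evaluation spans, i.e. those tagged
// eval.tags: ["online"].
function onlineEvalSpans(spans: SpanRecord[]): SpanRecord[] {
  return spans.filter((span) => {
    const tags = span.attributes["eval.tags"];
    return Array.isArray(tags) && tags.includes("online");
  });
}
```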
Monitor score trends
Use online evaluation scores to track production quality over time; a sketch for computing these trends from exported spans follows the list. Look for:
- Score drops that correlate with deployments, prompt changes, or upstream API updates.
- Scorer disagreement where heuristic scorers pass but LLM judges flag quality issues, or vice versa. This often reveals that a heuristic is too coarse or that a judge prompt needs refinement.
- Sampling gaps where low sampling rates on expensive scorers leave blind spots in coverage. If a scorer runs on only 1% of traffic, a brief quality regression could go undetected.
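One way to watch for the first pattern is to bucket exported scores by day and flag day-over-day drops. A minimal sketch, assuming a hypothetical ScoreRecord shape for exported scorer spans and an illustrative drop threshold:

```typescript
// Hypothetical record shape for an export of online scorer spans.
interface ScoreRecord {
  scorer: string;
  score: number; // normalized to [0, 1]
  timestamp: Date;
}

// Average score per calendar day for a single scorer.
function dailyAverages(
  records: ScoreRecord[],
  scorer: string,
): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const r of records) {
    if (r.scorer !== scorer) continue;
    const day = r.timestamp.toISOString().slice(0, 10);
    const bucket = sums.get(day) ?? { total: 0, count: 0 };
    bucket.total += r.score;
    bucket.count += 1;
    sums.set(day, bucket);
  }
  return new Map(
    [...sums].map(([day, { total, count }]) => [day, total / count]),
  );
}

// Flag any day whose average fell more than `threshold` below the
// previous day: a candidate to correlate with a deployment.
function flagDrops(averages: Map<string, number>, threshold = 0.05): string[] {
  const days = [...averages.keys()].sort();
  const flagged: string[] = [];
  for (let i = 1; i < days.length; i++) {
    const prev = averages.get(days[i - 1])!;
    const curr = averages.get(days[i])!;
    if (prev - curr > threshold) flagged.push(days[i]);
  }
  return flagged;
}
```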
Investigate score drops
When a scorer’s average drops, click into the failing spans to see:
- The exact input that triggered the low score
- What your capability returned
- The full trace of LLM calls and tool executions that produced the output
- Any metadata the scorer attached to explain the result
As you review failures, look for patterns; the sketch after this list shows one way to quantify them:
- Do failures cluster around specific input types or user segments?
- Are certain scorers failing consistently while others pass?
- Did a deployment or model update coincide with the drop?
- Is high token usage or latency correlated with lower scores?
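To put numbers on the first and last questions, you can group exported spans by segment and compare token usage between passes and failures. A minimal sketch; the record shape, segment field, and pass threshold are illustrative assumptions:

```typescript
// Hypothetical record shape for exported online evaluation spans.
interface EvalSpan {
  scorer: string;
  score: number;   // normalized to [0, 1]
  segment: string; // e.g. input type or user cohort
  tokens: number;  // total tokens used by the traced request
}

// Failure rate per segment: high outliers suggest clustered failures.
function failureRateBySegment(spans: EvalSpan[], passThreshold = 0.5) {
  const counts = new Map<string, { fails: number; total: number }>();
  for (const s of spans) {
    const c = counts.get(s.segment) ?? { fails: 0, total: 0 };
    c.total += 1;
    if (s.score < passThreshold) c.fails += 1;
    counts.set(s.segment, c);
  }
  return new Map(
    [...counts].map(([seg, { fails, total }]) => [seg, fails / total]),
  );
}

// Average token usage for failing vs. passing spans: a large gap hints
// that high token usage correlates with lower scores.
function tokensByOutcome(spans: EvalSpan[], passThreshold = 0.5) {
  const avg = (xs: number[]) =>
    xs.reduce((a, b) => a + b, 0) / Math.max(xs.length, 1);
  return {
    failing: avg(
      spans.filter((s) => s.score < passThreshold).map((s) => s.tokens),
    ),
    passing: avg(
      spans.filter((s) => s.score >= passThreshold).map((s) => s.tokens),
    ),
  };
}
```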
Compare across deployments
Because online evaluations run continuously, you can compare score distributions before and after a deployment to understand its impact on production quality. One way to compute the window averages is sketched after the example below.
- Note the timestamp of your deployment.
- In the Console, filter online evaluation spans to a time window before the deployment and note the average scores.
- Compare against the same time window after the deployment.
For example, switching from gpt-4o-mini to gpt-4o in production might show:
- Relevance score: 0.82 → 0.94
- Format score: 0.97 → 0.95
- Latency: 800 ms → 1.6 s
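A minimal sketch of the window comparison, assuming a hypothetical ScoredSpan export shape; the deployment timestamp and 24-hour windows are illustrative:

```typescript
// Hypothetical record shape for exported online evaluation spans.
interface ScoredSpan {
  scorer: string;
  score: number;
  timestamp: Date;
}

// Average score for one scorer within a half-open time window.
function windowAverage(
  spans: ScoredSpan[],
  scorer: string,
  start: Date,
  end: Date,
): number {
  const scores = spans
    .filter(
      (s) => s.scorer === scorer && s.timestamp >= start && s.timestamp < end,
    )
    .map((s) => s.score);
  return scores.reduce((a, b) => a + b, 0) / Math.max(scores.length, 1);
}

// Example: compare 24-hour windows on either side of the deploy.
const deployedAt = new Date("2025-01-15T12:00:00Z"); // illustrative
const dayMs = 24 * 60 * 60 * 1000;
declare const spans: ScoredSpan[]; // exported online evaluation spans

const before = windowAverage(
  spans, "relevance", new Date(deployedAt.getTime() - dayMs), deployedAt,
);
const after = windowAverage(
  spans, "relevance", deployedAt, new Date(deployedAt.getTime() + dayMs),
);
console.log(`relevance: ${before.toFixed(2)} → ${after.toFixed(2)}`);
```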
For controlled experiments where you test changes against a fixed collection of test cases before deploying, use offline evaluations. Online evaluations complement offline evaluations by confirming that improvements hold on real traffic.
Adjust sampling based on results
Review your sampling rates periodically as your system matures; a sketch of conditional sampling follows this list.
- Increase sampling on scorers that have recently detected regressions. More data points help you understand the scope and severity of an issue.
- Decrease sampling on scorers that consistently pass. A scorer at 100% pass rate over weeks of traffic may not need to run on every request.
- Add conditional sampling for scorers that should focus on specific traffic segments. For example, sample more aggressively on long inputs where your capability is more likely to struggle.
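Conditional sampling can be as simple as a per-request rate function. A minimal sketch; the length threshold and rates are illustrative assumptions, and the real hook depends on how your SDK exposes sampling:

```typescript
// Choose a sampling rate per request instead of one fixed global rate.
// Thresholds and rates here are illustrative assumptions.
function samplingRateFor(input: string): number {
  // Long inputs are where this capability is most likely to struggle,
  // so sample them aggressively; sample short inputs lightly.
  if (input.length > 4_000) return 0.5; // 50% of long inputs
  return 0.05; // 5% baseline
}

// Decide whether to run the scorer for this request.
function shouldScore(input: string): boolean {
  return Math.random() < samplingRateFor(input);
}
```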
Feed insights back to offline evaluations
Online evaluations often surface edge cases that your offline test collections don’t cover. When you spot a pattern of failures in production (a sketch of the first step follows this list):
- Add the failing inputs to your offline collection as new test cases with expected outputs.
- Run offline evaluations to reproduce the issue and verify your fix.
- Deploy the fix and confirm that online evaluation scores recover.
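A minimal sketch of the first step, assuming hypothetical span and test-case shapes; adapt the fields to your offline collection’s actual format, and note that expected outputs still need human review:

```typescript
// Hypothetical shape for a failing production span.
interface FailingSpan {
  input: string;
  output: string;
  score: number;
}

// Hypothetical offline test-case shape.
interface TestCase {
  input: string;
  // Expected output is filled in by hand after reviewing the failure.
  expected: string | null;
  source: "production";
}

// Convert low-scoring production inputs into draft offline test cases.
function toTestCases(spans: FailingSpan[], passThreshold = 0.5): TestCase[] {
  return spans
    .filter((s) => s.score < passThreshold)
    .map((s) => ({ input: s.input, expected: null, source: "production" }));
}
```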
What’s next?
- Set up user feedback for human-in-the-loop signals that complement automated scoring.
- Write offline evaluations to test against known-good answers before shipping.
- To iterate on your capability based on evaluation results, see Iterate.