Opening the dashboard
Click the Flask icon in the sidebar to open the Eval Studio. The Dashboard view opens by default.

Dashboard tabs
Conversations
A card for each captured agent response. Each card shows:
- The conversation input (truncated)
- A pass/fail summary across all eval definitions that fired on that response
- A timestamp
Eval Health
Per-eval-definition pass rate trends over time. Use this tab to see whether a specific eval definition is consistently passing or regressing, independent of which conversation triggered it.

Compare
An A/B view for comparing results across two conversations or time windows. See A/B Comparison for details.

Run detail view
Opening a conversation card shows a breakdown per eval definition:

| Field | Description |
|---|---|
| Eval name | The eval definition that ran |
| Type | rule, llm_judge, similarity, or metric |
| Score | Numerical score (0–1 for llm_judge/similarity; 1/0 for rule; ms/count for metric) |
| Pass / Fail | Whether the score met the eval’s pass threshold |
| Reasoning | For LLM judge evals, the judge model’s explanation |
| Human override | If a reviewer submitted a manual score, shown here |
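The fields above can be pictured as one result record per eval definition. The sketch below is a hypothetical shape mirroring the table, not a documented API: the class name, field names, threshold logic, and the rule that a human override takes precedence are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalResult:
    """Hypothetical record for one eval run, mirroring the run detail table."""
    eval_name: str        # the eval definition that ran
    eval_type: str        # "rule", "llm_judge", "similarity", or "metric"
    score: float          # 0-1 for llm_judge/similarity; 1/0 for rule; ms/count for metric
    pass_threshold: float # the score must meet this to pass
    reasoning: Optional[str] = None         # judge model's explanation (llm_judge only)
    human_override: Optional[float] = None  # manual reviewer score, if submitted

    def passed(self) -> bool:
        # Assumption: a human override, when present, replaces the raw score.
        effective = self.human_override if self.human_override is not None else self.score
        return effective >= self.pass_threshold

r = EvalResult("tone-check", "llm_judge", 0.82, 0.7, reasoning="Polite, on-topic")
print(r.passed())  # prints True
```

Note that for `metric` evals the "score" is a raw measurement (milliseconds or a count), so the comparison direction may differ in practice; this sketch only models the threshold case.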
Exporting
Click Export in the dashboard to download results as CSV or JSON. See Exporting Results.
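Once downloaded, a JSON export can be post-processed outside the dashboard. The snippet below is a minimal sketch assuming a flat list of result objects with `eval_name` and `passed` keys; the actual export schema is defined in Exporting Results and may differ.

```python
import json
from collections import defaultdict

# Hypothetical export payload; the real schema may use different keys.
export = json.loads("""
[
  {"eval_name": "tone-check", "passed": true},
  {"eval_name": "tone-check", "passed": false},
  {"eval_name": "latency", "passed": true}
]
""")

# Aggregate a per-eval pass rate: eval_name -> [passes, total]
rates = defaultdict(lambda: [0, 0])
for row in export:
    rates[row["eval_name"]][1] += 1
    if row["passed"]:
        rates[row["eval_name"]][0] += 1

for name, (p, t) in sorted(rates.items()):
    print(f"{name}: {p}/{t} passed")
```

A roll-up like this reproduces the Eval Health view's pass-rate-per-definition idea on your own data.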