The Eval Dashboard surfaces all automatically captured eval results so you can track quality over time and quickly identify regressions.

Opening the dashboard

Click the Flask icon in the sidebar to open the Eval Studio. The Dashboard view opens by default.

Dashboard tabs

Conversations

A card for each captured agent response. Each card shows:
  • The conversation input (truncated)
  • A pass/fail summary across all eval definitions that fired on that response
  • A timestamp
Click a conversation card to open the run detail view, which lists every individual eval result for that response: score, pass/fail, reasoning for LLM judge evals, and any human override.

Eval Health

Per-eval-definition pass rate trends over time. Use this tab to see whether a specific eval definition is consistently passing or regressing, independent of which conversation triggered it.
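The trend this tab plots amounts to bucketing results per eval definition over time. A minimal sketch, assuming results arrive as `(eval_name, date, passed)` tuples (an illustrative input format, not the dashboard's actual storage):

```python
from collections import defaultdict

def pass_rate_trend(results):
    """Bucket (eval_name, date_str, passed) tuples into per-eval daily
    pass rates: {eval_name: {date_str: pass_rate}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for eval_name, date, passed in results:
        buckets[eval_name][date].append(passed)
    return {
        name: {date: sum(flags) / len(flags) for date, flags in days.items()}
        for name, days in buckets.items()
    }

trend = pass_rate_trend([
    ("no_pii", "2024-05-01", True),
    ("no_pii", "2024-05-01", False),
    ("no_pii", "2024-05-02", True),
])
# trend["no_pii"] == {"2024-05-01": 0.5, "2024-05-02": 1.0}
```

A day-over-day drop in a single eval's pass rate here corresponds to the regression signal the tab is designed to surface.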

Compare

Side-by-side view for comparing results across two conversations or time windows. See A/B Comparison for details.
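At its core, a comparison like this reduces to pass rates over two result sets and their delta. A hypothetical sketch (the function and its inputs are illustrative, not the dashboard's API):

```python
def compare_pass_rates(a, b):
    """Return (rate_a, rate_b, delta) for two lists of pass/fail flags,
    e.g. results from two conversations or two time windows."""
    rate = lambda flags: sum(flags) / len(flags) if flags else 0.0
    ra, rb = rate(a), rate(b)
    return ra, rb, rb - ra

ra, rb, delta = compare_pass_rates(
    [True, True, False, False],   # window A
    [True, True, True, False],    # window B
)
print(ra, rb, delta)  # → 0.5 0.75 0.25
```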

Run detail view

Opening a conversation card shows a breakdown per eval definition:
  • Eval name: The eval definition that ran
  • Type: rule, llm_judge, similarity, or metric
  • Score: Numeric score (0–1 for llm_judge and similarity; 1/0 for rule; milliseconds or counts for metric)
  • Pass / Fail: Whether the score met the eval’s pass threshold
  • Reasoning: For LLM judge evals, the judge model’s explanation
  • Human override: If a reviewer submitted a manual score, it is shown here
Click Review on any row to submit or update a human review for that result.
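One plausible way an override interacts with the automated result is that the human score, when present, replaces the automated score before the pass threshold is applied. This precedence is an assumption for illustration, not documented behavior:

```python
def effective_score(score, human_override=None):
    """Score used for pass/fail: the human override wins if set.
    (Precedence here is an assumption, not documented behavior.)"""
    return human_override if human_override is not None else score

def passes(score, threshold, human_override=None):
    return effective_score(score, human_override) >= threshold

print(passes(0.4, 0.7))                       # → False (automated fail)
print(passes(0.4, 0.7, human_override=0.9))   # → True  (override flips it)
```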

Exporting

Click Export in the dashboard to download results as CSV or JSON. See Exporting Results.
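Once downloaded, a JSON export can be filtered with standard tooling. A sketch assuming one object per result with `eval_name`, `score`, and `passed` keys (hypothetical field names; the actual export schema is not specified here):

```python
import json

# Hypothetical excerpt of a JSON export; the real schema may differ.
raw = """[
  {"eval_name": "no_pii", "score": 1.0, "passed": true},
  {"eval_name": "helpfulness", "score": 0.62, "passed": false}
]"""

records = json.loads(raw)
failed = [r["eval_name"] for r in records if not r["passed"]]
print(failed)  # → ['helpfulness']
```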