The Eval Dashboard surfaces all automatically captured eval results so you can track quality over time and quickly identify regressions.

Opening the dashboard

Click the Flask icon in the sidebar to open the Eval Studio. The Dashboard view opens by default.

Dashboard tabs

Conversations

A card for each captured agent response. Each card shows:
  • The conversation input (truncated)
  • A pass/fail summary across all eval definitions that fired on that response
  • A timestamp
Click a conversation card to open the run detail view, which lists every individual eval result for that response: score, pass/fail, reasoning for LLM judge evals, and any human override.

Eval Health

Per-eval-definition pass rate trends over time. Use this tab to see whether a specific eval definition is consistently passing or regressing, independent of which conversation triggered it.
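The trend this tab plots amounts to bucketing results per eval definition over time. A minimal sketch, assuming results arrive as `(eval_name, date, passed)` tuples (an illustrative input format, not the dashboard's actual storage):

```python
from collections import defaultdict

def pass_rate_trend(results):
    """Bucket (eval_name, date_str, passed) tuples into per-eval daily
    pass rates: {eval_name: {date_str: pass_rate}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for eval_name, date, passed in results:
        buckets[eval_name][date].append(passed)
    return {
        name: {date: sum(flags) / len(flags) for date, flags in days.items()}
        for name, days in buckets.items()
    }

trend = pass_rate_trend([
    ("no_pii", "2024-05-01", True),
    ("no_pii", "2024-05-01", False),
    ("no_pii", "2024-05-02", True),
])
# trend["no_pii"] == {"2024-05-01": 0.5, "2024-05-02": 1.0}
```

A day-over-day drop in a single eval's pass rate here corresponds to the regression signal the tab is designed to surface.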

Compare

Side-by-side view for comparing results across two conversations or time windows. See A/B Comparison for details.
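At its core, a comparison like this reduces to pass rates over two result sets and their delta. A hypothetical sketch (the function and its inputs are illustrative, not the dashboard's API):

```python
def compare_pass_rates(a, b):
    """Return (rate_a, rate_b, delta) for two lists of pass/fail flags,
    e.g. results from two conversations or two time windows."""
    rate = lambda flags: sum(flags) / len(flags) if flags else 0.0
    ra, rb = rate(a), rate(b)
    return ra, rb, rb - ra

ra, rb, delta = compare_pass_rates(
    [True, True, False, False],   # window A
    [True, True, True, False],    # window B
)
print(ra, rb, delta)  # → 0.5 0.75 0.25
```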

Run detail view

Opening a conversation card shows a breakdown per eval definition:
  • Eval name: The eval definition that ran
  • Type: rule, llm_judge, similarity, or metric
  • Score: Numeric score (0–1 for llm_judge and similarity; 1/0 for rule; milliseconds or counts for metric)
  • Pass / Fail: Whether the score met the eval’s pass threshold
  • Reasoning: For LLM judge evals, the judge model’s explanation
  • Human override: If a reviewer submitted a manual score, it is shown here
Click Review on any row to submit or update a human review for that result.
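One plausible way an override interacts with the automated result is that the human score, when present, replaces the automated score before the pass threshold is applied. This precedence is an assumption for illustration, not documented behavior:

```python
def effective_score(score, human_override=None):
    """Score used for pass/fail: the human override wins if set.
    (Precedence here is an assumption, not documented behavior.)"""
    return human_override if human_override is not None else score

def passes(score, threshold, human_override=None):
    return effective_score(score, human_override) >= threshold

print(passes(0.4, 0.7))                       # → False (automated fail)
print(passes(0.4, 0.7, human_override=0.9))   # → True  (override flips it)
```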

Exporting

Click Export in the dashboard to download results as CSV or JSON. See Exporting Results.
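Once downloaded, a JSON export can be filtered with standard tooling. A sketch assuming one object per result with `eval_name`, `score`, and `passed` keys (hypothetical field names; the actual export schema is not specified here):

```python
import json

# Hypothetical excerpt of a JSON export; the real schema may differ.
raw = """[
  {"eval_name": "no_pii", "score": 1.0, "passed": true},
  {"eval_name": "helpfulness", "score": 0.62, "passed": false}
]"""

records = json.loads(raw)
failed = [r["eval_name"] for r in records if not r["passed"]]
print(failed)  # → ['helpfulness']
```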