## Why evaluate?
- Catch regressions after changing a provider, skill, or prompt
- Compare engines — NAE vs RLM — on your specific workloads
- Tune skills by seeing how instruction changes affect output quality
- Build confidence before deploying MIRA to a wider team
## Core concepts
### Eval Cases
Individual test scenarios — an input, optional documents, and expected output criteria.

### Eval Profiles
Named collections of eval cases that represent a repeatable test suite.

### Runs
An execution of a profile against an engine configuration. Each run stores all inputs and outputs.

### Scores
Automated and human-reviewed scores attached to each case output.
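To make the relationships concrete, here is a minimal sketch of how these concepts nest. The class and field names are illustrative assumptions — MIRA's actual schema is not documented here and may differ:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical shapes for illustration only; MIRA's real schema may differ.
@dataclass
class EvalCase:
    name: str
    input: str                                          # prompt sent to the agent
    documents: list[str] = field(default_factory=list)  # optional context documents
    expected: Optional[str] = None                      # expected output criteria

@dataclass
class EvalProfile:
    name: str
    cases: list[EvalCase] = field(default_factory=list)

# A profile groups cases into a repeatable suite; each run executes the
# whole profile and stores every input and output.
profile = EvalProfile(
    name="smoke-suite",
    cases=[EvalCase(name="greeting", input="Say hello", expected="hello")],
)
```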
## Eval case types
| Type | What it tests |
|---|---|
| Rule (`rule`) | Output matches a string, regex, keyword, JSON schema, or satisfies a length constraint — fully deterministic, no LLM calls |
| Semantic similarity (`similarity`) | Output is semantically close to a reference answer (embedding-based cosine similarity) |
| LLM judge (`llm_judge`) | A second LLM call grades the output against a rubric you write |
| Metric (`metric`) | A performance measurement (latency, token count, or tool call count) compared against a threshold |
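The two cheapest types reduce to simple computations. Below is a sketch of what a deterministic `rule` check and the cosine-similarity core of a `similarity` check look like; the function names and parameters are illustrative, not MIRA's API:

```python
import re

def rule_score(output: str, *, regex=None, keyword=None, max_len=None) -> bool:
    """Deterministic rule check: every supplied constraint must pass (no LLM calls)."""
    if regex is not None and not re.search(regex, output):
        return False
    if keyword is not None and keyword not in output:
        return False
    if max_len is not None and len(output) > max_len:
        return False
    return True

def cosine(a: list[float], b: list[float]) -> float:
    """Embedding-based similarity reduces to the cosine of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

print(rule_score("Total: 42 items", regex=r"\d+", max_len=50))  # True
```

A `similarity` eval would embed the output and the reference answer with the same model, then compare `cosine(...)` against a threshold.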
Human review is not a separate eval type. It is a post-run override mechanism: after a run
completes, any result can receive a human score via the Dashboard → Review button. Human
scores are stored alongside (and can override) automated scores.
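The override behavior amounts to a simple precedence rule. This sketch assumes a "human score wins when present" policy, which matches the description above; the function and field names are hypothetical:

```python
from typing import Optional

def effective_score(automated: float, human: Optional[float]) -> float:
    """Hypothetical precedence: a human review score, when present, overrides
    the automated score for reporting; otherwise the automated score stands."""
    return human if human is not None else automated

print(effective_score(0.8, None))  # 0.8 — no human review submitted
print(effective_score(0.8, 1.0))   # 1.0 — human override wins
```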
## How evals run
Evals are automatic. When you enable the Eval Framework, every agent response is captured and scored against your active eval definitions. There is no manual “run” button — results appear in the dashboard as you use MIRA.

## Workflow
1. Create eval definitions in the Eval Studio
2. Create a profile and assign evals to it
3. Activate the profile so it evaluates future responses
4. Use MIRA normally — results stream into the Eval Dashboard
5. Optionally compare two conversation runs (A/B) or submit human review overrides
6. Export results for reporting
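The final export step is typically a flat tabular dump. As a sketch of the kind of output to expect, here is a hypothetical flattening of run results into CSV — the field names are assumptions, and MIRA's actual export format may differ:

```python
import csv
import io

# Hypothetical per-case results from a completed run.
results = [
    {"case": "greeting", "score": 1.0, "reviewer": "auto"},
    {"case": "summary", "score": 0.7, "reviewer": "human"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["case", "score", "reviewer"])
writer.writeheader()
writer.writerows(results)
print(buf.getvalue())
```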
## Opening the Eval Framework
Click the Flask icon in the left sidebar.