MIRA’s Eval Framework lets you measure and track the quality of reasoning engine outputs over time. Instead of eyeballing responses, you define test cases, run them against one or both engines, and get structured scores.

Why evaluate?

  • Catch regressions after changing a provider, skill, or prompt
  • Compare engines — NAE vs RLM — on your specific workloads
  • Tune skills by seeing how instruction changes affect output quality
  • Build confidence before deploying MIRA to a wider team

Core concepts

Eval Cases

Individual test scenarios — an input, optional documents, and expected output criteria.

Eval Profiles

Named collections of eval cases that represent a repeatable test suite.

Runs

An execution of a profile against an engine configuration. Each run stores all inputs and outputs.

Scores

Automated and human-reviewed scores attached to each case output.

Eval case types

Rule (rule)

Output matches a string, regex, keyword, or JSON schema, or satisfies a length constraint. Fully deterministic — no LLM calls.

Semantic similarity (similarity)

Output is semantically close to a reference answer (embedding-based cosine similarity).

LLM judge (llm_judge)

A second LLM call grades the output against a rubric you write.

Metric (metric)

A performance measurement (latency, token count, or tool call count) compared against a threshold.
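Because rule cases are fully deterministic, each can be expressed as a plain predicate over the output. A sketch of exact, regex, keyword, and length checks — the rule-spec format here is invented for illustration, not MIRA's actual schema:

```python
import re

# Illustrative deterministic rule checks -- the spec format is hypothetical,
# not MIRA's real rule schema.

def check_rule(output: str, rule: dict) -> bool:
    kind = rule["kind"]
    if kind == "exact":
        return output == rule["value"]
    if kind == "regex":
        return re.search(rule["pattern"], output) is not None
    if kind == "keyword":
        # all keywords must appear, case-insensitively
        return all(kw.lower() in output.lower() for kw in rule["keywords"])
    if kind == "max_length":
        return len(output) <= rule["limit"]
    raise ValueError(f"unknown rule kind: {kind}")
```

For example, `check_rule("Order #123 confirmed", {"kind": "regex", "pattern": r"#\d+"})` passes, and no LLM call is ever made.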
Human review is not a separate eval type. It is a post-run override mechanism: after a run completes, any result can receive a human score via the Dashboard → Review button. Human scores are stored alongside (and can override) automated scores.
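Since a human score, when present, overrides the automated one, the effective score for a result reduces to "human if set, else automated". A sketch of that resolution rule, with assumed field names:

```python
# Effective-score resolution: a human review, when present, takes
# precedence over the automated score. Field names are illustrative.

def effective_score(result: dict) -> float:
    human = result.get("human_score")
    return human if human is not None else result["automated_score"]
```

Storing both values side by side (rather than overwriting) preserves the automated score for later comparison across runs.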

How evals run

Evals are automatic. When you enable the Eval Framework, every agent response is captured and scored against your active eval definitions. There is no manual “run” button — results appear in the dashboard as you use MIRA.
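Conceptually, automatic evaluation is a hook on every agent response: the response is scored against each active eval definition and the results are appended to the dashboard's store. A hypothetical sketch of that loop — none of these names are MIRA's real API:

```python
# Hypothetical capture hook -- illustrates the automatic flow only;
# this is not MIRA's actual API.

def on_agent_response(response: str, active_evals: list, results: list) -> None:
    """Score one agent response against every active eval definition."""
    for ev in active_evals:
        score = ev["scorer"](response)   # each eval supplies its own scorer
        results.append({"eval": ev["name"], "score": score})

# Example: a single trivial "non_empty" eval is active.
results = []
active = [{"name": "non_empty", "scorer": lambda out: 1.0 if out.strip() else 0.0}]
on_agent_response("Hello from MIRA", active, results)
```

The key point the sketch mirrors is that scoring is driven by the act of responding, not by a separate "run" action.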

Workflow

Create evals → Assign to profiles → Activate profile → Chat → Review dashboard
  1. Create eval definitions in the Eval Studio
  2. Create a profile and assign evals to it
  3. Activate the profile so it evaluates future responses
  4. Use MIRA normally — results stream into the Eval Dashboard
  5. Optionally compare two conversation runs (A/B) or submit human review overrides
  6. Export results for reporting
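At its simplest, the export in step 6 is a flat table with one row per scored result. A sketch of a CSV export — the column names are assumed, not MIRA's actual export format:

```python
import csv
import io

# Illustrative CSV export of eval results -- column names are assumed,
# not MIRA's real export format.

def export_results(results: list) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["case_id", "engine", "score"])
    writer.writeheader()
    writer.writerows(results)
    return buf.getvalue()
```

A flat per-result table like this is convenient for reporting because it imports directly into spreadsheets and BI tools.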

Opening the Eval Framework

Click the Flask icon in the left sidebar.