MIRA’s Eval Framework lets you measure and track the quality of reasoning engine outputs over time. Instead of eyeballing responses, you define test cases, run them against one or both engines, and get structured scores.

Why evaluate?

  • Catch regressions after changing a provider, skill, or prompt
  • Compare engines — NAE vs RLM — on your specific workloads
  • Tune skills by seeing how instruction changes affect output quality
  • Build confidence before deploying MIRA to a wider team

Core concepts

Eval Cases

Individual test scenarios — an input, optional documents, and expected output criteria.

Eval Profiles

Named collections of eval cases that represent a repeatable test suite.

Runs

An execution of a profile against an engine configuration. Each run stores all inputs and outputs.

Scores

Automated and human-reviewed scores attached to each case output.

Eval case types

Rule (rule)

Output matches a string, regex, keyword, or JSON schema, or satisfies a length constraint. Fully deterministic — no LLM calls.

Semantic similarity (similarity)

Output is semantically close to a reference answer (embedding-based cosine similarity).

LLM judge (llm_judge)

A second LLM call grades the output against a rubric you write.

Metric (metric)

A performance measurement (latency, token count, or tool call count) compared against a threshold.
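Because rule cases are fully deterministic, each can be expressed as a plain predicate over the output. A sketch of exact, regex, keyword, and length checks — the rule-spec format here is invented for illustration, not MIRA's actual schema:

```python
import re

# Illustrative deterministic rule checks -- the spec format is hypothetical,
# not MIRA's real rule schema.

def check_rule(output: str, rule: dict) -> bool:
    kind = rule["kind"]
    if kind == "exact":
        return output == rule["value"]
    if kind == "regex":
        return re.search(rule["pattern"], output) is not None
    if kind == "keyword":
        # all keywords must appear, case-insensitively
        return all(kw.lower() in output.lower() for kw in rule["keywords"])
    if kind == "max_length":
        return len(output) <= rule["limit"]
    raise ValueError(f"unknown rule kind: {kind}")
```

For example, `check_rule("Order #123 confirmed", {"kind": "regex", "pattern": r"#\d+"})` passes, and no LLM call is ever made.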
Human review is not a separate eval type. It is a post-run override mechanism: after a run completes, any result can receive a human score via the Dashboard → Review button. Human scores are stored alongside (and can override) automated scores.
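Since a human score, when present, overrides the automated one, the effective score for a result reduces to "human if set, else automated". A sketch of that resolution rule, with assumed field names:

```python
# Effective-score resolution: a human review, when present, takes
# precedence over the automated score. Field names are illustrative.

def effective_score(result: dict) -> float:
    human = result.get("human_score")
    return human if human is not None else result["automated_score"]
```

Storing both values side by side (rather than overwriting) preserves the automated score for later comparison across runs.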

How evals run

Evals are automatic. When you enable the Eval Framework, every agent response is captured and scored against your active eval definitions. There is no manual “run” button — results appear in the dashboard as you use MIRA.
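Conceptually, automatic evaluation is a hook on every agent response: the response is scored against each active eval definition and the results are appended to the dashboard's store. A hypothetical sketch of that loop — none of these names are MIRA's real API:

```python
# Hypothetical capture hook -- illustrates the automatic flow only;
# this is not MIRA's actual API.

def on_agent_response(response: str, active_evals: list, results: list) -> None:
    """Score one agent response against every active eval definition."""
    for ev in active_evals:
        score = ev["scorer"](response)   # each eval supplies its own scorer
        results.append({"eval": ev["name"], "score": score})

# Example: a single trivial "non_empty" eval is active.
results = []
active = [{"name": "non_empty", "scorer": lambda out: 1.0 if out.strip() else 0.0}]
on_agent_response("Hello from MIRA", active, results)
```

The key point the sketch mirrors is that scoring is driven by the act of responding, not by a separate "run" action.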

Workflow

Create evals → Assign to profiles → Activate profile → Chat → Review dashboard
  1. Create eval definitions in the Eval Studio
  2. Create a profile and assign evals to it
  3. Activate the profile so it evaluates future responses
  4. Use MIRA normally — results stream into the Eval Dashboard
  5. Optionally compare two conversation runs (A/B) or submit human review overrides
  6. Export results for reporting
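At its simplest, the export in step 6 is a flat table with one row per scored result. A sketch of a CSV export — the column names are assumed, not MIRA's actual export format:

```python
import csv
import io

# Illustrative CSV export of eval results -- column names are assumed,
# not MIRA's real export format.

def export_results(results: list) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["case_id", "engine", "score"])
    writer.writeheader()
    writer.writerows(results)
    return buf.getvalue()
```

A flat per-result table like this is convenient for reporting because it imports directly into spreadsheets and BI tools.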

Opening the Eval Framework

Click the Flask icon in the left sidebar.