A Similarity eval compares the engine’s output against a reference answer and returns a cosine similarity score (0–1). This catches correct answers phrased differently — something a rule-based check cannot do.

How it works

MIRA computes TF-IDF bag-of-words cosine similarity entirely locally — no external embedding model or API call is required. Both the output and the reference are converted into TF-IDF vectors, then their cosine similarity is computed in-process. This means:
  • No embedding provider configuration needed
  • No network latency or API cost
  • Works fully offline / in Local-Only Mode
The trade-off vs. neural embeddings is that pure synonym or paraphrase shifts (e.g. “begin” vs. “commence”) may score lower than with a semantic embedding model, but common vocabulary paraphrases are handled well.
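
The local TF-IDF cosine computation described above can be sketched in plain Python. This is an illustration only, not MIRA's actual implementation: the tokenizer (lowercased word characters) and the smoothed IDF formula (the one scikit-learn uses) are assumptions.

```python
import math
import re
from collections import Counter

def tfidf_cosine(output: str, reference: str) -> float:
    """Cosine similarity between TF-IDF bag-of-words vectors of two texts.

    Uses smoothed IDF, log((1 + N) / (1 + df)) + 1, so terms that appear
    in both texts are not zeroed out (with only two documents, the classic
    log(N / df) would assign every shared term an IDF of zero).
    """
    docs = [re.findall(r"\w+", text.lower()) for text in (output, reference)]
    n_docs = len(docs)

    # Document frequency: in how many of the two texts each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    def vectorize(tokens):
        tf = Counter(tokens)
        return {term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
                for term, count in tf.items()}

    a, b = vectorize(docs[0]), vectorize(docs[1])
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

print(tfidf_cosine("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(tfidf_cosine("begin the task", "commence the task"))  # roughly 0.5
```

The second call shows the trade-off mentioned above: "begin" and "commence" share no tokens, so a paraphrase that is semantically identical still loses score, while shared vocabulary ("the task") keeps it well above zero.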

When to use

  • The correct answer can be expressed in many ways (synonyms, paraphrases with overlapping vocabulary)
  • You have a gold-standard reference answer to compare against
  • You want a continuous quality score (0–1) rather than binary pass/fail

Configuring a similarity eval

1. Open the Eval Studio and click + New
   Click the Flask icon in the sidebar, then click + New in the left panel.
2. Select Similarity type
   Click the Similarity type card.
3. Enter a name and reference answer
   Give the eval a name. In the Reference Answer field, type the ideal output. This does not need to be the exact wording — just the correct content with overlapping key terms.
4. Set the pass threshold
   Set the minimum cosine similarity score to consider a pass. Default: 0.80.
   Threshold   Strictness
   0.90+       Very strict — near-identical phrasing required
   0.80–0.89   Strict — same content, different words acceptable
   0.70–0.79   Moderate — paraphrasing and some omissions acceptable
   < 0.70      Lenient — general topic alignment
5. Test and activate
   Enter a sample input and expected output in the inline tester and click Run Test. Once a test passes, click Activate.

Scoring

Each run produces a similarity score between 0 and 1:
  • ≥ threshold → ✅ Pass
  • < threshold → ❌ Fail
The raw score is stored alongside the pass/fail result so you can track quality trends over time.
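
The scoring rule above can be sketched as follows. `SimilarityResult` and `grade` are hypothetical names for illustration, not MIRA's API; the point is that the threshold drives the pass/fail flag while the raw score is kept separately.

```python
from dataclasses import dataclass

@dataclass
class SimilarityResult:
    score: float   # raw cosine similarity, retained for trend tracking
    passed: bool   # True when score >= threshold

def grade(score: float, threshold: float = 0.80) -> SimilarityResult:
    """Apply the pass threshold while preserving the raw score."""
    return SimilarityResult(score=score, passed=score >= threshold)

# Three runs against the default 0.80 threshold:
history = [grade(s) for s in (0.91, 0.78, 0.84)]
print([r.passed for r in history])  # [True, False, True]
```

Because the raw score travels with each result, a run that narrowly fails (0.78) is distinguishable from one that fails badly, which is what makes quality trends over time meaningful.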