A Similarity eval compares the engine’s output against a reference answer and returns a cosine similarity score (0–1). This catches correct answers phrased differently — something a rule-based check cannot do.

How it works

MIRA computes TF-IDF bag-of-words cosine similarity entirely locally — no external embedding model or API call is required. Both the output and the reference are converted into TF-IDF vectors, then their cosine similarity is computed in-process. This means:
  • No embedding provider configuration needed
  • No network latency or API cost
  • Works fully offline / in Local-Only Mode
The trade-off vs. neural embeddings is that pure synonym or paraphrase shifts (e.g. “begin” vs. “commence”) may score lower than with a semantic embedding model, but common vocabulary paraphrases are handled well.
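
The local TF-IDF cosine computation described above can be sketched in plain Python. This is an illustration only, not MIRA's actual implementation: the tokenizer (lowercased word characters) and the smoothed IDF formula (the one scikit-learn uses) are assumptions.

```python
import math
import re
from collections import Counter

def tfidf_cosine(output: str, reference: str) -> float:
    """Cosine similarity between TF-IDF bag-of-words vectors of two texts.

    Uses smoothed IDF, log((1 + N) / (1 + df)) + 1, so terms that appear
    in both texts are not zeroed out (with only two documents, the classic
    log(N / df) would assign every shared term an IDF of zero).
    """
    docs = [re.findall(r"\w+", text.lower()) for text in (output, reference)]
    n_docs = len(docs)

    # Document frequency: in how many of the two texts each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    def vectorize(tokens):
        tf = Counter(tokens)
        return {term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
                for term, count in tf.items()}

    a, b = vectorize(docs[0]), vectorize(docs[1])
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

print(tfidf_cosine("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(tfidf_cosine("begin the task", "commence the task"))  # roughly 0.5
```

The second call shows the trade-off mentioned above: "begin" and "commence" share no tokens, so a paraphrase that is semantically identical still loses score, while shared vocabulary ("the task") keeps it well above zero.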

When to use

  • The correct answer can be expressed in many ways (synonyms, paraphrases with overlapping vocabulary)
  • You have a gold-standard reference answer to compare against
  • You want a continuous quality score (0–1) rather than binary pass/fail

Configuring a similarity eval

1. Open the Eval Studio and click + New
   Click the Flask icon in the sidebar, then click + New in the left panel.
2. Select Similarity type
   Click the Similarity type card.
3. Enter a name and reference answer
   Give the eval a name. In the Reference Answer field, type the ideal output. This does not need to be the exact wording — just the correct content with overlapping key terms.
4. Set the pass threshold
   Set the minimum cosine similarity score to consider a pass. Default: 0.80.
   Threshold   Strictness
   0.90+       Very strict — near-identical phrasing required
   0.80–0.89   Strict — same content, different words acceptable
   0.70–0.79   Moderate — paraphrasing and some omissions acceptable
   < 0.70      Lenient — general topic alignment
5. Test and activate
   Enter a sample input and expected output in the inline tester and click Run Test. Once a test passes, click Activate.

Scoring

Each run produces a similarity score between 0 and 1:
  • ≥ threshold → ✅ Pass
  • < threshold → ❌ Fail
The raw score is stored alongside the pass/fail result so you can track quality trends over time.
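
The scoring rule above can be sketched as follows. `SimilarityResult` and `grade` are hypothetical names for illustration, not MIRA's API; the point is that the threshold drives the pass/fail flag while the raw score is kept separately.

```python
from dataclasses import dataclass

@dataclass
class SimilarityResult:
    score: float   # raw cosine similarity, retained for trend tracking
    passed: bool   # True when score >= threshold

def grade(score: float, threshold: float = 0.80) -> SimilarityResult:
    """Apply the pass threshold while preserving the raw score."""
    return SimilarityResult(score=score, passed=score >= threshold)

# Three runs against the default 0.80 threshold:
history = [grade(s) for s in (0.91, 0.78, 0.84)]
print([r.passed for r in history])  # [True, False, True]
```

Because the raw score travels with each result, a run that narrowly fails (0.78) is distinguishable from one that fails badly, which is what makes quality trends over time meaningful.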