LLM Judge Evals

An LLM Judge eval sends the engine’s output to a second model along with a judge prompt. The judge model returns a score (0.0–1.0) and a reasoning explanation. This is the most flexible eval type — you can grade on dimensions like accuracy, tone, completeness, or safety.

When to use

Outputs are subjective or long-form (essays, reports, explanations)
Exact match and semantic similarity are too coarse for your quality bar
You need nuanced grading with a written explanation per result

How it works

Engine output + Judge prompt → Judge LLM → { score: 0.0–1.0, reasoning: "...", confidence: 0.0–1.0 }

MIRA sends a structured prompt to the judge model that includes:

Your system prompt (context for the judge)
Your judge prompt with the original input and engine output wrapped in XML tags
Instructions to return JSON: {"score": <0.0–1.0>, "reasoning": "<explanation>", "confidence": <0.0–1.0>}

The XML wrapping is a prompt injection mitigation — the judge is instructed to ignore any directives inside the tagged content.

Supported judge providers

LLM judge evals support all four providers: AWS Bedrock (default), Anthropic, OpenAI, and Ollama. The judge provider can differ from the primary engine provider.

Provider	Credential
AWS Bedrock	AWS credentials (Bedrock / AWS tab)
Anthropic	`ANTHROPIC_API_KEY`
OpenAI	`OPENAI_API_KEY`
Ollama	None (local)

Configuring an LLM judge eval

Open the Eval Studio

Click the Flask icon in the sidebar, then click + New in the left panel.

Select LLM Judge type

Click the LLM Judge type card.

Enter a name and system prompt

Give the eval a name. The System Prompt sets the judge’s persona, e.g.:

You are a fair, impartial evaluator of AI assistant responses.

Write the judge prompt

The Judge Prompt describes what to grade. Instruct the judge to return JSON with score (0.0–1.0), reasoning, and confidence. Example:

Rate the response on a scale of 0.0–1.0 for accuracy, helpfulness, and clarity.
Respond only as JSON: {"score": <0.0-1.0>, "reasoning": "<brief reason>", "confidence": <0.0-1.0>}

Set the pass threshold

Scores ≥ threshold count as a pass. Default: 0.70 (on a 0–1 scale).

Select the judge provider and model

Choose the provider and enter the model ID. A capable model produces more reliable grading. The default is anthropic.claude-3-5-haiku-20241022-v1:0 via Bedrock (cost-effective and fast).

Test and activate

Enter a sample input and output in the inline tester, run the test, and click Activate once a test passes.

Scoring

Score range	Meaning
0.9–1.0	Excellent — fully meets the judge prompt criteria
0.7–0.89	Good — minor issues
0.5–0.69	Acceptable — below default pass threshold
0.1–0.49	Poor — significant issues
0.0	Fails completely or judge parse error

The judge’s reasoning and confidence are stored for each run and are visible in the run detail view.

Cost implications

Each LLM judge eval triggers an additional API call to the judge provider. For high-volume eval captures, this can be significant. Consider:

Using a smaller/cheaper model for the judge (e.g. claude-3-5-haiku or gpt-4o-mini)
Enabling Local-Only Mode in Settings → Evals to suspend LLM judge evals when not needed

Edit this page — Open a pull request

​When to use

​How it works

​Supported judge providers

​Configuring an LLM judge eval

​Scoring

​Cost implications

When to use

How it works

Supported judge providers

Configuring an LLM judge eval

Scoring

Cost implications