Skip to main content
An LLM Judge eval sends the engine’s output to a second model along with a judge prompt. The judge model returns a score (0.0–1.0) and a reasoning explanation. This is the most flexible eval type — you can grade on dimensions like accuracy, tone, completeness, or safety.

When to use

  • Outputs are subjective or long-form (essays, reports, explanations)
  • Exact match and semantic similarity are too coarse for your quality bar
  • You need nuanced grading with a written explanation per result

How it works

Engine output + Judge prompt → Judge LLM → { score: 0.0–1.0, reasoning: "...", confidence: 0.0–1.0 }
MIRA sends a structured prompt to the judge model that includes:
  1. Your system prompt (context for the judge)
  2. Your judge prompt with the original input and engine output wrapped in XML tags
  3. Instructions to return JSON: {"score": <0.0–1.0>, "reasoning": "<explanation>", "confidence": <0.0–1.0>}
The XML wrapping is a prompt injection mitigation — the judge is instructed to ignore any directives inside the tagged content.

Supported judge providers

LLM judge evals support all four providers: AWS Bedrock (default), Anthropic, OpenAI, and Ollama. The judge provider can differ from the primary engine provider.
ProviderCredential
AWS BedrockAWS credentials (Bedrock / AWS tab)
AnthropicANTHROPIC_API_KEY
OpenAIOPENAI_API_KEY
OllamaNone (local)

Configuring an LLM judge eval

1

Open the Eval Studio

Click the Flask icon in the sidebar, then click + New in the left panel.
2

Select LLM Judge type

Click the LLM Judge type card.
3

Enter a name and system prompt

Give the eval a name. The System Prompt sets the judge’s persona, e.g.:
You are a fair, impartial evaluator of AI assistant responses.
4

Write the judge prompt

The Judge Prompt describes what to grade. Instruct the judge to return JSON with score (0.0–1.0), reasoning, and confidence. Example:
Rate the response on a scale of 0.0–1.0 for accuracy, helpfulness, and clarity.
Respond only as JSON: {"score": <0.0-1.0>, "reasoning": "<brief reason>", "confidence": <0.0-1.0>}
5

Set the pass threshold

Scores ≥ threshold count as a pass. Default: 0.70 (on a 0–1 scale).
6

Select the judge provider and model

Choose the provider and enter the model ID. A capable model produces more reliable grading. The default is anthropic.claude-3-5-haiku-20241022-v1:0 via Bedrock (cost-effective and fast).
7

Test and activate

Enter a sample input and output in the inline tester, run the test, and click Activate once a test passes.

Scoring

Score rangeMeaning
0.9–1.0Excellent — fully meets the judge prompt criteria
0.7–0.89Good — minor issues
0.5–0.69Acceptable — below default pass threshold
0.1–0.49Poor — significant issues
0.0Fails completely or judge parse error
The judge’s reasoning and confidence are stored for each run and are visible in the run detail view.

Cost implications

Each LLM judge eval triggers an additional API call to the judge provider. For high-volume eval captures, this can be significant. Consider:
  • Using a smaller/cheaper model for the judge (e.g. claude-3-5-haiku or gpt-4o-mini)
  • Enabling Local-Only Mode in Settings → Evals to suspend LLM judge evals when not needed
Edit this page — Open a pull request