When to use
- Outputs are subjective or long-form (essays, reports, explanations)
- Exact match and semantic similarity are too coarse for your quality bar
- You need nuanced grading with a written explanation per result
How it works
Each judge call combines:
- Your system prompt (context for the judge)
- Your judge prompt with the original input and engine output wrapped in XML tags
- Instructions to return JSON:
{"score": <0.0–1.0>, "reasoning": "<explanation>", "confidence": <0.0–1.0>}
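The JSON contract above can be handled defensively on the caller's side. The sketch below is illustrative (the function name `parse_judge_response` is not part of the product); it falls back to a score of 0.0 on a parse error, matching the Scoring section.

```python
import json

def parse_judge_response(raw: str) -> dict:
    """Parse the judge's JSON reply; a malformed reply scores 0.0 (see Scoring)."""
    try:
        data = json.loads(raw)
        score = float(data["score"])
        confidence = float(data["confidence"])
        if not (0.0 <= score <= 1.0 and 0.0 <= confidence <= 1.0):
            raise ValueError("score/confidence out of range")
        return {
            "score": score,
            "reasoning": str(data["reasoning"]),
            "confidence": confidence,
        }
    except (json.JSONDecodeError, KeyError, ValueError, TypeError):
        # A reply the judge fails to format as valid JSON fails the eval completely.
        return {"score": 0.0, "reasoning": "judge parse error", "confidence": 0.0}
```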
Supported judge providers
LLM judge evals support all four providers: AWS Bedrock (default), Anthropic, OpenAI, and Ollama. The judge provider can differ from the primary engine provider.
| Provider | Credential |
|---|---|
| AWS Bedrock | AWS credentials (Bedrock / AWS tab) |
| Anthropic | ANTHROPIC_API_KEY |
| OpenAI | OPENAI_API_KEY |
| Ollama | None (local) |
Configuring an LLM judge eval
Enter a name and system prompt
Give the eval a name. The System Prompt sets the judge's persona (for example, an expert reviewer in your domain).
Write the judge prompt
The Judge Prompt describes what to grade. Instruct the judge to return JSON with score (0.0–1.0), reasoning, and confidence.
Select the judge provider and model
Choose the provider and enter the model ID. A capable model produces more reliable grading.
The default is anthropic.claude-3-5-haiku-20241022-v1:0 via Bedrock (cost-effective and fast).
Scoring
| Score range | Meaning |
|---|---|
| 0.9–1.0 | Excellent — fully meets the judge prompt criteria |
| 0.7–0.89 | Good — minor issues |
| 0.5–0.69 | Acceptable — below default pass threshold |
| 0.1–0.49 | Poor — significant issues |
| 0.0 | Fails completely or judge parse error |
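The score bands in the table can be expressed as a small helper. This is an illustrative sketch, not product code; the 0.7 pass threshold is inferred from the table's note that 0.5–0.69 is "below default pass threshold".

```python
PASS_THRESHOLD = 0.7  # assumed default, inferred from the Scoring table

def score_band(score: float) -> str:
    """Map a judge score to the band described in the Scoring table."""
    if score >= 0.9:
        return "excellent"
    if score >= 0.7:
        return "good"
    if score >= 0.5:
        return "acceptable"
    if score > 0.0:
        return "poor"
    return "fail"  # 0.0: fails completely or judge parse error

def passes(score: float) -> bool:
    """True if the result meets the (assumed) default pass threshold."""
    return score >= PASS_THRESHOLD
```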
reasoning and confidence are stored for each run and are visible in the run detail view.
Cost implications
Each LLM judge eval triggers an additional API call to the judge provider. For high-volume eval captures, this cost can be significant. Consider:
- Using a smaller/cheaper model for the judge (e.g. claude-3-5-haiku or gpt-4o-mini)
- Enabling Local-Only Mode in Settings → Evals to suspend LLM judge evals when not needed
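A quick back-of-the-envelope estimate can help decide whether the extra judge calls matter. The sketch below is hypothetical (the function name and all numbers are placeholders); plug in your provider's current per-million-token prices.

```python
def estimate_judge_cost(num_results: int,
                        avg_input_tokens: int,
                        avg_output_tokens: int,
                        price_in_per_mtok: float,
                        price_out_per_mtok: float) -> float:
    """Rough USD cost of the extra judge calls; prices are per million tokens."""
    cost_per_call = (avg_input_tokens * price_in_per_mtok +
                     avg_output_tokens * price_out_per_mtok) / 1_000_000
    return num_results * cost_per_call

# Example with placeholder prices (check your provider's pricing page):
# estimate_judge_cost(10_000, 1_500, 200, 0.8, 4.0)
```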