LLM as a Judge - Albert Masoliver's learning site

## Definition

**LLM as a judge** (also called AI as a judge) is the practice of using a language model to automatically evaluate the outputs of another AI system. The evaluating model is called the AI judge. Rather than relying solely on human annotators or hand-designed metrics, an AI judge is prompted with an evaluation criterion and asked to score or compare responses.

The approach became practically viable around 2020 with GPT-3, and by 2023–2024 it had become the dominant automatic evaluation strategy in production AI applications. LangChain's 2023 State of AI report noted that 58% of evaluations on their platform were AI-judge-based.

## Modes of Judgment

An AI judge can operate in three primary modes:

1. **Pointwise scoring.** Given a question and a single response, the judge outputs a score (e.g., 1–5) or a binary label (good/bad). Useful for monitoring a single system over time.
2. **Reference comparison.** Given a question, a reference answer, and a generated answer, the judge decides whether the generated answer matches the reference. An alternative to hand-designed similarity metrics.
3. **Pairwise (comparative) judgment.** Given a question and two candidate responses, the judge decides which is better. This is the core mechanism behind [[Comparative Evaluation]] and is used to generate preference data for [[RLHF]].

## Prompting an AI Judge

The judge prompt should specify:

- The **task**: what quality dimension is being evaluated (relevance, factual consistency, coherence, toxicity, etc.).
- The **criteria**: a detailed description of what constitutes a good vs. bad response.
- The **scoring system**: classification (good/bad), discrete scale (1–5), or continuous (0–1). Discrete scales with few values (1–5) outperform continuous scoring empirically; wider discrete ranges degrade reliability.
- **Examples** (few-shot): including annotated examples of each score level improves consistency from ~65% to ~77.5% for GPT-4 (Zheng et al., 2023).

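
The four prompt ingredients listed above (task, criteria, scoring system, examples) can be sketched as a small prompt builder for pointwise scoring. This is a minimal illustration, not a tested or recommended prompt: the template wording and the names `JUDGE_TEMPLATE`, `build_judge_prompt`, and `parse_score` are all hypothetical, and the call to the judge model itself is deliberately left out because it depends on the provider.

```python
import re
from typing import Optional, Sequence, Tuple

# Hypothetical template: the wording is illustrative only.
JUDGE_TEMPLATE = """You are evaluating the {task} of a response.

Criteria:
{criteria}

Scoring: reply with a single integer from {low} to {high}, where {low} is worst and {high} is best.
{examples}
Question: {question}
Response: {response}
Score:"""


def build_judge_prompt(
    task: str,
    criteria: str,
    question: str,
    response: str,
    examples: Sequence[Tuple[str, int]] = (),
    low: int = 1,
    high: int = 5,
) -> str:
    """Assemble a pointwise judge prompt from task, criteria, scale, and few-shot examples."""
    # Each few-shot example is an (example response, annotated score) pair.
    shots = "".join(
        f"\nExample response: {text}\nScore: {score}\n" for text, score in examples
    )
    return JUDGE_TEMPLATE.format(
        task=task,
        criteria=criteria,
        low=low,
        high=high,
        examples=shots,
        question=question,
        response=response,
    )


def parse_score(judge_output: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in the judge's reply if it lies in [low, high], else None."""
    match = re.search(r"-?\d+", judge_output)
    if match is None:
        return None
    score = int(match.group())
    return score if low <= score <= high else None
```

The returned string would be sent to whichever judge model is in use, and the reply passed through `parse_score`. Returning `None` for out-of-range or missing scores keeps malformed judge outputs visible instead of silently coercing them into the scale.
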
## Reliability and Agreement with Humans

Studies show that strong judge models can match human annotators closely. Zheng et al. (2023) found GPT-4-as-judge achieved 85% agreement with humans on the MT-Bench dataset, exceeding human–human agreement (81%). AlpacaEval's AI judges correlated 0.98 with LMSYS Chatbot Arena's human-voted leaderboard.

However, AI judge scores are not standardised across tools. MLflow, Ragas, and LlamaIndex define "faithfulness" differently and use incompatible scoring ranges, so their scores are not directly comparable. This criteria ambiguity is a core limitation.

## Limitations

- **Inconsistency.** A probabilistic model may produce different scores on identical inputs across runs. Setting temperature to 0 reduces but does not eliminate variance.
- **Criteria ambiguity.** Scores depend on the model, the prompt, and the scoring rubric. Criteria with the same name across tools are not equivalent.
- **Cost and latency.** Each evaluation is an additional model call. Judging N criteria with one call each brings the total to N+1 model calls per response (one to generate, N to judge), so API costs grow roughly (N+1)-fold when the same model is used for both. Spot-checking (evaluating a random subset) is a common mitigation.
- **Biases.** See [[AI Judge Biases]] for self-bias, position bias, and verbosity bias.
- **Evolving judges.** The judge is itself an AI application and can change over time, making longitudinal tracking unreliable unless the exact judge model and prompt are frozen.

## Specialised Judge Types

| Judge type | Input | Output |
|---|---|---|
| Reward model | (prompt, response) | Scalar quality score |
| Reference-based judge | (prompt, response, reference) | Similarity or quality score |
| Preference model | (prompt, response A, response B) | Which response users prefer |

Small, specialised judges trained on specific criteria can outperform general-purpose large judges for targeted tasks (Zheng et al., 2023).

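
The input and output columns of the table above can be read as three call signatures. The sketch below writes them as Python `Protocol` types. The class and function names are hypothetical, and the two concrete functions are toy stand-ins that only exercise the signatures: a token-overlap score is not a trained judge, and comparing two pointwise scores is just one simple way to get a preference, not how a trained preference model necessarily works.

```python
from typing import Literal, Protocol


class RewardModel(Protocol):
    """(prompt, response) -> scalar quality score."""
    def __call__(self, prompt: str, response: str) -> float: ...


class ReferenceBasedJudge(Protocol):
    """(prompt, response, reference) -> similarity or quality score."""
    def __call__(self, prompt: str, response: str, reference: str) -> float: ...


class PreferenceModel(Protocol):
    """(prompt, response A, response B) -> which response is preferred."""
    def __call__(
        self, prompt: str, response_a: str, response_b: str
    ) -> Literal["A", "B"]: ...


def token_overlap_judge(prompt: str, response: str, reference: str) -> float:
    """Toy reference-based judge: fraction of reference tokens present in the response."""
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & set(response.lower().split())) / len(ref_tokens)


def preference_from_reward(reward: RewardModel) -> PreferenceModel:
    """Toy preference model: prefer whichever response the pointwise scorer rates higher."""
    def prefer(prompt: str, response_a: str, response_b: str) -> Literal["A", "B"]:
        # Ties go to "A" here, which is an arbitrary choice for the sketch.
        return "A" if reward(prompt, response_a) >= reward(prompt, response_b) else "B"
    return prefer
```

Writing the three types out this way makes it explicit which inputs each judge needs at evaluation time. In particular, only the reference-based judge requires a reference answer.
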
## Related

- [[AI Judge Biases]]
- [[Comparative Evaluation]]
- [[RLHF]]
- [[Hallucination]]
- [[Evaluation-Driven Development]]
- [[Prompt Engineering]]
- [[Temperature]]

## Sources

- [[AI Engineering - Chip Huyen]]