AI Judge Biases - Albert Masoliver's learning site

## Definition

**AI judge biases** are systematic distortions in the scores or preference signals produced by an [[LLM as a Judge|LLM-as-a-judge]] system that are unrelated to the actual quality of the evaluated response. Like human evaluators, AI judges are not neutral: their outputs reflect both the statistical patterns in their training data and the structural properties of how comparisons are presented.

## Three Core Biases

### Self-Bias (Egocentric Bias)

A model tends to score its own outputs higher than outputs from other models. The same internal mechanism that makes a token likely to be generated also makes that token look "good" to the model. In Zheng et al. (2023), GPT-4 favoured its own responses with a 10% higher win rate, while Claude-v1 favoured itself with a 25% higher win rate.

Mitigation: use a different model as the judge than the one generating responses, or average scores across multiple judges.

### Position Bias (First-Position Bias)

In pairwise comparisons, AI judges systematically favour the response that appears first in the prompt. This is the inverse of human evaluators, who tend to favour the response they read last (recency bias).

Mitigation: repeat each comparison twice with the candidates in reversed order and average the results. Carefully crafted prompts that de-emphasise ordering can also reduce the effect.

### Verbosity Bias

Many AI judges prefer longer responses regardless of quality. Wu and Aji (2023) found that both GPT-4 and Claude-1 preferred longer responses (~100 words) containing factual errors over shorter, correct responses (~50 words). When the length difference is sufficiently large (e.g., one response twice as long as the other), the judge almost always prefers the longer one (Saito et al., 2023). GPT-4 exhibits less verbosity bias than GPT-3.5, suggesting the bias weakens in more capable models, but it remains present.

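Two of the checks described so far can be sketched in a few lines of Python: the order-swap mitigation for position bias, and a rough measure of how often a judge prefers the longer response. This is a minimal sketch under stated assumptions, not a reference implementation. The `Judge` callable is a hypothetical interface (it is assumed to return the judge's preference for the first-listed response as a number in [0, 1]), and response length is assumed to be measured in whitespace-separated words.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface (an assumption for this sketch): given
# (prompt, first_response, second_response), return the judge's preference
# for the response shown FIRST as a number in [0, 1].
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, prompt: str, a: str, b: str) -> float:
    """Preference for `a` over `b`, averaged over both presentation orders."""
    a_first = judge(prompt, a, b)         # a shown in first position
    a_second = 1.0 - judge(prompt, b, a)  # b shown first, so invert the score
    return (a_first + a_second) / 2.0


def longer_response_win_rate(
    judge: Judge, comparisons: Iterable[Tuple[str, str, str]]
) -> float:
    """Fraction of comparisons in which the judge prefers the longer response.

    Length is counted in whitespace-separated words (an assumption); pairs of
    equal length are skipped. Each pair is scored in both orders and averaged,
    so position bias does not leak into the estimate.
    """
    wins = 0
    total = 0
    for prompt, a, b in comparisons:
        len_a, len_b = len(a.split()), len(b.split())
        if len_a == len_b:
            continue
        longer, shorter = (a, b) if len_a > len_b else (b, a)
        total += 1
        if debiased_preference(judge, prompt, longer, shorter) > 0.5:
            wins += 1
    if total == 0:
        raise ValueError("no comparisons with differing response lengths")
    return wins / total
```

A judge that always prefers whichever response it sees first ends up with a debiased preference of exactly 0.5, which is the point of the swap. For the second function, a win rate well above 0.5 on pairs of comparable quality would be consistent with verbosity bias, though it cannot by itself separate length from genuine quality differences.
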
Mitigation: normalise response length before presenting to the judge, or explicitly instruct the judge to ignore length as a criterion.

## Other Biases

- **Sycophancy toward the user's framing.** If the prompt implies a preferred answer, the judge may align with that framing regardless of content quality.
- **IP and privacy.** Using a proprietary judge requires sending data to the model provider, raising data-lineage concerns (see [[LLM as a Judge]]).
- **Training-data alignment.** A judge trained on a specific cultural corpus will inherit the biases of that corpus when scoring for attributes like tone, style, or appropriateness.

## Practical Implications

Bias awareness is necessary for interpreting AI judge scores correctly:

- Scores are relative to the specific judge model and prompt; they cannot be meaningfully compared across different judges without normalisation.
- Longitudinal tracking of an application using AI judge scores requires freezing both the judge model and the judge prompt.
- When using comparative evaluation for model selection, position bias should be mitigated by symmetric pair ordering before drawing conclusions.

## Related

- [[LLM as a Judge]]
- [[Comparative Evaluation]]
- [[Alignment]]
- [[RLHF]]

## Sources

- [[AI Engineering - Chip Huyen]]