Comparative Evaluation - Albert Masoliver's learning site

## Definition

**Comparative evaluation** (also called pairwise evaluation) is a model ranking methodology in which models are assessed by comparing their outputs directly against each other rather than by scoring each model independently. An evaluator (human or AI) is presented with outputs from two models for the same prompt and selects the preferred response. Rankings are then computed from the aggregate comparison results using a rating algorithm.

This contrasts with **pointwise evaluation**, in which each model is scored in isolation (e.g., with a Likert scale) and then ranked by score.

Comparative evaluation was reportedly first used to rank AI models by Anthropic in 2021. It powers the LMSYS Chatbot Arena leaderboard, which became a widely trusted community benchmark.

## Why Comparative Over Pointwise?

For subjective outputs, it is psychologically easier to say which of two responses is better than to assign an absolute score. As models surpass human-expert-level performance, annotators may be unable to give reliable absolute scores but can still detect quality differences between responses. Comparative evaluation therefore remains useful even when the task exceeds human capability (Touvron et al., 2023, Llama 2 paper).

Comparative evaluation is also harder to game: a model cannot climb the ranking by being tuned to a fixed metric or reference set; it has to outperform competitors in head-to-head matches.

## Process

1. For each prompt, two models are selected and each generates a response.
2. An evaluator (human or [[LLM as a Judge]]) selects the winner, or declares a tie.
3. The outcome is recorded as a **match**: (Model A, Model B, winner).
4. A **rating algorithm** processes all matches to compute a score per model, then ranks models by score.

Common rating algorithms adapted from sports and games:

- **Elo** (original Chatbot Arena approach): sensitive to match ordering.
- **Bradley–Terry**: less sensitive to ordering; adopted by Chatbot Arena after Elo's limitations were found.
- **TrueSkill**: a Bayesian rating system that handles ties and tracks the uncertainty of each rating.

A ranking is considered correct if, for any pair, the higher-ranked model wins more than 50% of head-to-head matches.

## Win Rate and Transitivity

The win rate of model A over model B is the fraction of their matches in which A is preferred. For many-model tournaments, transitivity is typically assumed (if A > B and B > C, then A > C) to reduce the number of required comparisons. However, human preference is not necessarily transitive, and different model pairs are evaluated on different prompts, so this assumption does not always hold.

## Comparison to A/B Testing

Comparative evaluation differs from A/B testing: in A/B testing a user sees only one model's output at a time; in comparative evaluation the user sees both simultaneously. Comparative evaluation captures relative preference directly; A/B testing captures absolute behavioural outcomes (e.g., click-through rate).

## Challenges

- **Scalability.** The number of model pairs grows quadratically ($\binom{n}{2} = n(n-1)/2$ pairs for $n$ models). LMSYS used 244,000 comparisons to rank 57 models (early 2024); 57 models form 1,596 pairs, so this averages only about 153 comparisons per pair. Efficient matching algorithms that prioritise uncertain pairs reduce this burden.
- **Lack of standardisation.** Open crowdsourced leaderboards receive low-quality or off-topic prompts (e.g., "hello" was submitted 180 times in one LMSYS dataset). Enforcing prompt quality reduces coverage; relaxing it introduces noise.
- **Relative vs. absolute.** Comparative evaluation tells you which model is better, not whether either model is good enough. A model can win comparisons while still being inadequate for a specific application's requirements.
- **Private model evaluation.** Comparing a private internal model against public models requires either publishing its outputs or paying for a private evaluation service.

## Preference Models as Scalable Judges

Collecting comparative signals from humans is expensive.
**Preference models** (small, specialised classifiers trained to predict human preference given a prompt, response A, and response B) aim to make pairwise evaluation scalable. Examples include PandaLM (Wang et al., 2023) and JudgeLM (Zhu et al., 2023). They also generate preference data for alignment training without requiring new human annotations.

## Related

- [[LLM as a Judge]]
- [[AI Judge Biases]]
- [[RLHF]]
- [[Evaluation-Driven Development]]
- [[Confusion Matrix]]
- [[Precision and Recall]]

## Sources

- [[AI Engineering - Chip Huyen]]
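
## Example: Rating Models from Matches

The Process and Win Rate sections describe recording matches, computing win rates, and fitting a rating algorithm. The following is a minimal sketch of those steps, not Chatbot Arena's implementation. The match log, the function names, and the choice to count a tie as half a win for each side are all illustrative assumptions. The Bradley–Terry fit uses the standard minorization-maximization (MM) update and assumes every model has at least one win.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical match log: (model_a, model_b, winner); winner is None for a tie.
matches = [
    ("A", "B", "A"), ("A", "B", "A"), ("A", "B", "B"),
    ("B", "C", "B"), ("B", "C", "B"), ("B", "C", "C"),
    ("A", "C", "A"), ("A", "C", "A"), ("A", "C", None),
]

def pairwise_wins(matches):
    """wins[x][y] = matches x won against y (assumption: a tie is half a win each)."""
    wins = defaultdict(lambda: defaultdict(float))
    for a, b, winner in matches:
        if winner is None:
            wins[a][b] += 0.5
            wins[b][a] += 0.5
        else:
            loser = b if winner == a else a
            wins[winner][loser] += 1.0
    return wins

def win_rate(wins, a, b):
    """Fraction of the a-vs-b matches in which a was preferred (NaN if they never met)."""
    total = wins[a][b] + wins[b][a]
    return wins[a][b] / total if total else float("nan")

def bradley_terry(matches, iterations=200):
    """Estimate Bradley-Terry strengths with MM updates; higher means stronger."""
    wins = pairwise_wins(matches)
    models = sorted({m for a, b, _ in matches for m in (a, b)})
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            total_wins = sum(wins[i][j] for j in models if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in models if j != i
            )
            updated[i] = total_wins / denom if denom else strength[i]
        # Rescale so strengths average 1; only their ratios are meaningful.
        norm = sum(updated.values())
        strength = {m: s * len(models) / norm for m, s in updated.items()}
    return strength

def ranking_is_consistent(ranking, wins):
    """True if every higher-ranked model beats every lower-ranked one more than 50% of the time.
    Pairs that never met have a NaN win rate and therefore fail the check."""
    return all(win_rate(wins, hi, lo) > 0.5 for hi, lo in combinations(ranking, 2))

wins = pairwise_wins(matches)
strength = bradley_terry(matches)
ranking = sorted(strength, key=strength.get, reverse=True)

print(ranking)                               # ['A', 'B', 'C']
print(round(win_rate(wins, "A", "B"), 3))    # 0.667
print(ranking_is_consistent(ranking, wins))  # True
```

Unlike an online Elo update, this fit uses only aggregate win counts, so shuffling the match log does not change the result. This is the sense in which Bradley–Terry is less sensitive to match ordering.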