How to Build a Verifier: A Short Survey
Preface
As is well known, verifiers play a crucial role in at least two scenarios. First, in agent architectures, a stop condition is required to exit the loop. Second, in RLVR (Reinforcement Learning with Verifiable Rewards) architectures, a verifier is needed to determine rewards within the RL environment.
Thus, in this blog post, I present a survey of verifiers in the LLM era.
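Before going further, here is a minimal sketch of these two scenarios. The `llm` and `verify` callables are hypothetical placeholders, not APIs from any of the works surveyed below; the point is only to show where a verifier sits in each loop.

```python
from typing import Callable

def agent_loop(task: str, llm: Callable[[str], str],
               verify: Callable[[str, str], bool], max_steps: int = 8) -> str:
    """Scenario 1: the verifier acts as the stop condition of an agent loop."""
    answer = ""
    for _ in range(max_steps):
        answer = llm(f"Task: {task}\nPrevious attempt: {answer}\nImprove the answer.")
        if verify(task, answer):  # exit the loop once the verifier accepts the answer
            return answer
    return answer

def rlvr_reward(task: str, answer: str,
                verify: Callable[[str, str], bool]) -> float:
    """Scenario 2: the verifier determines the reward inside an RL environment."""
    return 1.0 if verify(task, answer) else 0.0
```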
Industry
-
Baichuan-M2
The overall process is outlined below.

The most interesting component is the Clinical Rubrics Generator, which consists of three parts:
-
Prompt Collection and Processing
This work constructs prompts from three major sources: medical record-driven prompts, knowledge base-driven prompts, and synthetic scenario prompts.
-
Rubric Collection
Based on these prompts, LLMs are used to construct rubrics, yielding actionable, quantitative evaluation criteria.
-
Training of Rubrics Generator
After training, the Rubrics Generator can produce dynamic evaluation standards in real time, providing AI physicians with continuous, reliable feedback while keeping computational costs manageable (a minimal sketch of this generate-then-score loop follows below).
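As a rough illustration of the generate-then-score idea, the sketch below first asks an LLM to turn a clinical prompt into a checklist-style rubric, then grades a candidate response item by item. The prompts, the `llm` callable, and the weighting scheme are my own placeholders, not Baichuan-M2's actual implementation.

```python
import json
from typing import Callable

def generate_rubric(clinical_prompt: str, llm: Callable[[str], str]) -> list[dict]:
    """Ask an LLM to produce checklist-style rubric items for one clinical prompt."""
    request = (
        "You are a medical evaluator. For the consultation below, list the key "
        "criteria a good answer must satisfy as a JSON array of "
        '{"criterion": str, "weight": float} objects.\n\n' + clinical_prompt
    )
    return json.loads(llm(request))

def score_with_rubric(response: str, rubric: list[dict],
                      llm: Callable[[str], str]) -> float:
    """Grade a response item by item and return a weighted score in [0, 1]."""
    total, earned = 0.0, 0.0
    for item in rubric:
        verdict = llm(
            f"Criterion: {item['criterion']}\nResponse: {response}\n"
            "Does the response satisfy the criterion? Answer yes or no."
        )
        total += item["weight"]
        earned += item["weight"] * float(verdict.strip().lower().startswith("yes"))
    return earned / total if total else 0.0
```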
-
QuarkMed Medical Foundation Model Technical Report
In this work, a General Verifier is trained as an instruction-following model. Given explicit evaluation principles, the model scores responses according to how well they adhere to these rules.
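A minimal sketch of what such an instruction-following verifier call might look like is shown below. The prompt wording and the 1-10 scale are assumptions for illustration, not taken from the QuarkMed report.

```python
GENERAL_VERIFIER_PROMPT = """You are a strict evaluator.
Evaluation principles:
{principles}

Question:
{question}

Response to evaluate:
{response}

Score the response from 1 to 10 according to how well it follows the principles.
Return only the integer score."""

def build_verifier_input(principles: str, question: str, response: str) -> str:
    """Fill the scoring prompt; the verifier model is expected to return a number."""
    return GENERAL_VERIFIER_PROMPT.format(
        principles=principles, question=question, response=response
    )
```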
Academia
-
Compass Verifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
The contributions of this paper are two-fold: VerifierBench (a benchmark) and CompassVerifier (a learnable model).
The VerifierBench pipeline consists of three stages (a minimal voting sketch follows the list):
- Stage 1: Multi-expert voting
- Stage 2: Multi-prompt voting
- Stage 3: Annotation and analysis
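The sketch below illustrates the voting idea behind Stages 1 and 2: several expert models, each queried with several prompt phrasings, vote on whether a response is correct, and only high-agreement cases are kept (the rest go to human annotation). The function name, consensus threshold, and reply parsing are my own simplification, not the paper's exact procedure.

```python
from collections import Counter
from typing import Callable, Sequence

def vote_on_label(question: str, response: str, gt: str,
                  experts: Sequence[Callable[[str], str]],
                  prompts: Sequence[str],
                  min_agreement: float = 0.8) -> str | None:
    """Multi-expert, multi-prompt voting; return the consensus label or None."""
    votes = []
    for expert in experts:            # Stage 1: several expert models
        for template in prompts:      # Stage 2: several prompt phrasings
            reply = expert(template.format(question=question,
                                           response=response, gt=gt)).lower()
            votes.append("incorrect" if "incorrect" in reply else "correct")
    label, count = Counter(votes).most_common(1)[0]
    # Low-agreement cases are routed to human annotation and analysis (Stage 3).
    return label if count / len(votes) >= min_agreement else None
```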
To obtain the model, CompassVerifier employs three data augmentation methods: complex formula augmentation, error-driven adversarial augmentation, and generalizability augmentation, the last of which enhances data diversity. A training sample is structured as follows:
[Question, Response, GT]
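One way such a [Question, Response, GT] triple could be serialized into a verifier training example is sketched below; the field names and the correct/incorrect target are assumptions, and the paper's exact format may differ.

```python
from dataclasses import dataclass

@dataclass
class VerifierSample:
    question: str   # the original problem
    response: str   # the model response to be judged
    gt: str         # the ground-truth answer
    label: str      # consensus judgment, e.g. "correct" / "incorrect"

def to_training_pair(sample: VerifierSample) -> dict:
    """Turn one triple into an instruction-tuning pair (input text, target)."""
    prompt = (
        f"Question: {sample.question}\n"
        f"Ground-truth answer: {sample.gt}\n"
        f"Candidate response: {sample.response}\n"
        "Is the candidate response correct? Answer 'correct' or 'incorrect'."
    )
    return {"input": prompt, "target": sample.label}
```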
This model is compared with xVerify; both works suggest that a learnable verifier is a viable approach.
-
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
This work is similar to the one above and likewise emphasizes data augmentation. Ultimately, it yields a 0.5B model that outperforms competing methods.
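To see why a learned answer verifier is useful at all, consider how brittle rule-based answer matching is. The sketch below is a generic baseline, not xVerify's method: it tries exact matching and then symbolic equivalence, and still misses the free-form paraphrases that a small trained verifier is meant to handle.

```python
import sympy

def answers_match(predicted: str, reference: str) -> bool:
    """Rule-based baseline: exact match, then symbolic equivalence via SymPy."""
    if predicted.strip().lower() == reference.strip().lower():
        return True
    try:
        # Handles cases such as "0.5" vs "1/2" or "2*x + 2" vs "2*(x + 1)".
        diff = sympy.simplify(sympy.sympify(predicted) - sympy.sympify(reference))
        return diff == 0
    except (sympy.SympifyError, TypeError, SyntaxError):
        # Free-form answers ("about one half", "x plus one, doubled") fall through,
        # which is exactly the gap a trained answer verifier is meant to close.
        return False
```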
-
CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards
This work follows a similar paradigm to the aforementioned studies.
-
EasyJudge: An Easy-to-use Tool for Comprehensive Response Evaluation of LLMs
This work is based on pointwise and pairwise evaluation methods.
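For readers unfamiliar with the two modes, here is a minimal illustration of the difference (the signatures are generic, not EasyJudge's API): pointwise evaluation scores a single response on an absolute scale, while pairwise evaluation picks the better of two responses.

```python
from typing import Callable

def pointwise(question: str, response: str, judge: Callable[[str], str]) -> int:
    """Pointwise: rate one response on an absolute 1-5 scale."""
    reply = judge(f"Rate the response to '{question}' from 1 to 5.\n"
                  f"Response: {response}\nReturn only the number.")
    return int(reply.strip())

def pairwise(question: str, a: str, b: str, judge: Callable[[str], str]) -> str:
    """Pairwise: decide which of two responses answers the question better."""
    reply = judge(f"Question: {question}\nResponse A: {a}\nResponse B: {b}\n"
                  "Which response is better? Answer 'A', 'B', or 'tie'.")
    return reply.strip().upper()
```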
-
Generative Judge For Evaluating Alignment
Similar to the works above, this paper focuses on generality, flexibility, and interpretability.
-
Generative Verifiers: Reward Modeling As Next-Token Prediction (ICLR 2025)
This paper differs from the other works: while LLM-based verifiers are typically trained as discriminative classifiers, it trains the verifier with the standard next-token prediction objective, so a verification score can be read off as the probability the model assigns to a verdict token such as "Yes".
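In the spirit of that approach, the sketch below reads out a generative reward with Hugging Face Transformers by taking the probability of a "Yes" token after a verification question. The prompt wording and model name are placeholders for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def generative_verifier_score(question: str, response: str,
                              model_name: str = "Qwen/Qwen2.5-0.5B-Instruct") -> float:
    """Score = P('Yes') for the next token after a verification question."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = (f"Question: {question}\nProposed answer: {response}\n"
              "Is the proposed answer correct? Answer Yes or No.\nAnswer:")
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]          # next-token distribution
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    # Normalize over the two verdict tokens to get a score in [0, 1].
    return (probs[yes_id] / (probs[yes_id] + probs[no_id])).item()
```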

-
JudgeLM: Fine-Tuned Large Language Models Are Scalable Judges
This work follows a similar paradigm.
-
M-Prometheus: A Suite of Open Multilingual LLM Judges
As a branch of “LLM-as-a-Judge,” this paper trains a multilingual model using a learning paradigm similar to the works mentioned above.

This paper is related to "Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models." The interesting part is "score rubric mining," a method similar to one found in Baichuan-M2.
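For context, a Prometheus-style score rubric pairs an evaluation criterion with a description for each score level. The example below is an illustrative rubric I wrote in that format; it is not one mined by M-Prometheus or Baichuan-M2.

```python
score_rubric = {
    "criterion": "Does the response give medically sound and actionable advice?",
    "score_descriptions": {
        1: "Advice is incorrect or potentially harmful.",
        2: "Advice is mostly generic and misses key clinical considerations.",
        3: "Advice is reasonable but incomplete or partially unsupported.",
        4: "Advice is sound, relevant, and covers the main considerations.",
        5: "Advice is sound, complete, prioritized, and clearly justified.",
    },
}
```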
-
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
Similar to the works above, PandaLM is trained on pairwise comparison data: given an instruction and two candidate responses, the judge model is prompted to decide which response is better and to explain its decision. A paraphrased sketch of this training format is given below.
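The following is an illustrative, paraphrased version of such a pairwise training prompt and example; it is not the verbatim PandaLM prompt, and the field names are placeholders.

```python
PAIRWISE_JUDGE_TEMPLATE = """Below are two responses to the same instruction.
Decide which response is better, or declare a tie, and briefly explain why.

Instruction:
{instruction}

Response 1:
{response_1}

Response 2:
{response_2}
"""

def build_pairwise_example(instruction: str, response_1: str, response_2: str,
                           verdict: str, reason: str) -> dict:
    """Pack one pairwise comparison into an (input, target) fine-tuning pair."""
    return {
        "input": PAIRWISE_JUDGE_TEMPLATE.format(
            instruction=instruction, response_1=response_1, response_2=response_2),
        "target": f"Better response: {verdict}\nReason: {reason}",
    }
```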

Conclusion
Finally, two questions are worth considering: Can we build a truly general evaluator? And is an explicit rubric necessary in the evaluation process?
Related Materials
-
VERIF: Verification Engineering for Reinforcement Learning in Instruction Following
-
Variation in Verification: Understanding Verification Dynamics in Large Language Models
-
Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards
-
From DeepSeek V3 to V3.2: Architecture, Sparse Attention, and RL Updates
This blog post discusses the verifier in DeepSeekMath in more detail, including the meta-verifier.