
Preface

Verifiers play a crucial role in at least two scenarios. First, agent architectures need a stop condition to exit the loop. Second, RLVR (Reinforcement Learning with Verifiable Rewards) pipelines need a verifier to assign rewards within the RL environment.

Thus, in this blog post, I present a survey of verifiers in the LLM era.

Industry

  • Baichuan-M2: Scaling Medical Capability with Large Language Models

[Figure: overall pipeline of the verifier system]

The most interesting component is the Clinical Rubrics Generator, which consists of three parts (a minimal sketch follows the list):

  • Prompt Collection and Processing

    This work constructs prompts from three major sources: medical record-driven prompts, knowledge base-driven prompts, and synthetic scenario prompts.

  • Rubric Collection

    Based on these prompts, LLMs construct rubrics that serve as actionable, quantitative metrics.

  • Training of Rubrics Generator

    After training, the Rubrics Generator produces dynamic evaluation standards in real time, providing AI physicians with continuous, reliable feedback while keeping computational costs manageable.
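
The generate-then-grade loop this enables can be sketched as follows, assuming an OpenAI-compatible API; the model names (`rubrics-generator`, `verifier`) and prompt wording are illustrative placeholders, not Baichuan-M2's actual setup.

```python
# Minimal sketch: generate rubrics for a case, then grade an answer against
# them. Model names and prompts are illustrative, not Baichuan-M2's.
from openai import OpenAI

client = OpenAI()

def generate_rubrics(case_prompt: str) -> str:
    """Ask the rubrics generator for actionable, quantitative criteria."""
    resp = client.chat.completions.create(
        model="rubrics-generator",  # hypothetical fine-tuned generator
        messages=[
            {"role": "system", "content": (
                "Produce a numbered list of verifiable, quantitative rubrics "
                "for grading an AI physician's answer to the case below.")},
            {"role": "user", "content": case_prompt},
        ],
    )
    return resp.choices[0].message.content

def score_against_rubrics(case_prompt: str, answer: str, rubrics: str) -> str:
    """Grade the answer item by item against the generated rubrics."""
    resp = client.chat.completions.create(
        model="verifier",  # hypothetical grading model
        messages=[
            {"role": "system", "content": (
                "Mark each rubric item as satisfied (1) or not (0), "
                "then report the total score.")},
            {"role": "user", "content": (
                f"Case:\n{case_prompt}\n\nRubrics:\n{rubrics}\n\n"
                f"Answer:\n{answer}")},
        ],
    )
    return resp.choices[0].message.content
```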

  • QuarkMed Medical Foundation Model Technical Report

In this work, a General Verifier is trained as an instruction-following model: given explicit evaluation principles, it scores responses by their adherence to those rules.
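
Such a principle-guided verifier can be driven by an ordinary prompt template. The sketch below is an assumption about what that template might look like, not QuarkMed's actual prompt; `parse_score` is a hypothetical helper for reading off the final score.

```python
# Hypothetical principle-guided verification template; wording is illustrative.
PRINCIPLE_VERIFIER_TEMPLATE = """You are a strict verifier.

Evaluation principles:
{principles}

Question:
{question}

Response to evaluate:
{response}

Check the response against each principle, then output a final score from
1 to 10 on the last line, formatted exactly as `Score: <n>`."""

def parse_score(verifier_output: str) -> int:
    """Extract the trailing `Score: <n>` line emitted by the verifier."""
    for line in reversed(verifier_output.strip().splitlines()):
        if line.startswith("Score:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no score line found in verifier output")
```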

Academia

  • CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

The contributions of this paper are two-fold: VerifierBench (a benchmark) and CompassVerifier (a learnable model).

The VerifierBench pipeline consists of three stages (a minimal voting sketch follows the list):

  • Stage 1: Multi-expert voting
  • Stage 2: Multi-prompt voting
  • Stage 3: Annotation and analysis
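
The first two stages amount to keeping only labels that survive agreement across expert models and prompt templates. Below is a minimal sketch of that voting idea, assuming a unanimity rule and a hypothetical `Judge` callable; the paper's actual aggregation may differ.

```python
from collections import Counter
from typing import Callable

# Hypothetical judge interface: (question, response, ground_truth) -> label,
# e.g. "correct" / "incorrect". Each judge may be a different expert model
# (Stage 1) or the same model under a different prompt template (Stage 2).
Judge = Callable[[str, str, str], str]

def consensus(judges: list[Judge], question: str, response: str, gt: str) -> str:
    """Keep a label only on unanimous agreement; route the rest to Stage 3."""
    votes = Counter(judge(question, response, gt) for judge in judges)
    label, count = votes.most_common(1)[0]
    return label if count == len(judges) else "needs-human-annotation"
```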

To train the model, CompassVerifier employs three data augmentation methods: complex formula augmentation, error-driven adversarial augmentation, and generalizability augmentation, the last of which enhances data diversity. A training sample is structured as follows:

[Question, Response, GT]
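
Rendered as data, one sample might look like the following; the field names are my illustration, not the paper's exact schema.

```python
# Illustrative training sample; field names are assumptions, not the paper's schema.
sample = {
    "question": "What is the derivative of x^2?",
    "response": "By the power rule, the derivative is 2x.",  # output to verify
    "gt": "2x",                                              # ground-truth answer
    "label": "correct",  # supervision target for the verifier
}
```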

This model is compared with xVerify; both works suggest that a learnable verifier is viable.

  • xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

This work is similar to the one above and shares its emphasis on data augmentation. It ultimately yields a 0.5B model that outperforms competing methods.

  • CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards

This work follows a similar paradigm to the aforementioned studies.

  • EasyJudge: An Easy-to-use Tool for Comprehensive Response Evaluation of LLMs

This tool supports both pointwise and pairwise evaluation, as sketched below.
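
The two modes differ only in what the judge is asked: to score one response, or to pick between two. The templates below are illustrative, not EasyJudge's shipped prompts.

```python
# Illustrative templates for the two evaluation modes.
POINTWISE = (
    "Rate the response to the instruction on a 1-10 scale.\n"
    "Instruction: {instruction}\nResponse: {response}\nScore:"
)

PAIRWISE = (
    "Compare the two responses to the instruction and answer 'A', 'B', or 'Tie'.\n"
    "Instruction: {instruction}\nResponse A: {a}\nResponse B: {b}\nWinner:"
)
```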

  • Generative Judge For Evaluating Alignment

Similar to the works above, this paper focuses on generality, flexibility, and interpretability.

  • Generative Verifiers: Reward Modeling As Next-Token Prediction (ICLR 2025)

This paper differs from the others: while LLM-based verifiers are typically trained as discriminative classifiers, it trains a generative verifier with a next-token prediction objective, which lets verification reuse the model's text-generation abilities.
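
Concretely, the generative verifier is asked a yes/no question about the candidate solution, and the reward is read off as the probability of the "Yes" token. A minimal sketch with Hugging Face transformers follows; the model name and prompt wording are placeholders.

```python
# Minimal GenRM-style scoring: reward = p("Yes" | verification prompt).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL)

def genrm_score(question: str, answer: str) -> float:
    prompt = (f"Question: {question}\nProposed answer: {answer}\n"
              "Is the answer correct (Yes/No)? ")
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]       # distribution over the next token
    probs = torch.softmax(logits, dim=-1)
    yes_id = tok.encode("Yes", add_special_tokens=False)[0]
    return probs[yes_id].item()              # reward = p("Yes")
```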

  • JudgeLM: Fine-Tuned Large Language Models Are Scalable Judges

This work follows a similar paradigm.

  • M-Prometheus: A Suite of Open Multilingual LLM Judges

As a branch of “LLM-as-a-Judge,” this paper trains a multilingual model using a learning paradigm similar to the works mentioned above.

This paper builds on "Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models." The interesting part is "score rubric mining," a method similar to the rubric generation found in Baichuan-M2.

  • PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

Similar to the works above, PandaLM is trained to compare two candidate responses to the same instruction and judge which is better.
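
A hedged reconstruction of its training format: PandaLM sees an instruction together with two candidate responses and is trained to output which one wins ('1', '2', or 'Tie'), a brief reason, and a reference answer. The exact wording below is illustrative.

```python
# Hedged reconstruction of a PandaLM-style training example; wording illustrative.
example = {
    "input": (
        "Below are two responses to the same instruction. "
        "Judge which one is better.\n"
        "### Instruction:\n{instruction}\n"
        "### Response 1:\n{response_1}\n"
        "### Response 2:\n{response_2}\n"
        "### Evaluation:\n"
    ),
    # Target: winner ('1', '2', or 'Tie'), a brief reason, and a reference answer.
    "output": "2\n\nReason: Response 2 follows the instruction more faithfully.\n\nReference: ...",
}
```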

Conclusion

Finally, two questions are worth considering: Can we build a truly general evaluator? And is an explicit rubric necessary during evaluation?

An earlier blog post discusses the verifier in DeepSeekMath in more detail, including the meta-verifier.
