CriticLeanBench: A Benchmark for Evaluating Mathematical Formalization Critics
Introduction
| Dataset | CriticLeanBench |
|---|---|
| Modalities | Text, Image |
| Formats | parquet |
| Languages | English |
| Size | 393 kB |
| Release Date | 2025-07-08 |
| Domain | Mathematical Reasoning, Formal Verification |
| License | Apache License 2.0 |
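Because the benchmark ships as a single small parquet file, it can be inspected with standard data tooling. The snippet below is a minimal sketch, assuming a local copy named `criticleanbench.parquet` and a hypothetical `label` column marking each pair as correct or incorrect; the actual file name and schema of the release may differ.

```python
import pandas as pd

# Load a local copy of the benchmark (file name is an assumption).
df = pd.read_parquet("criticleanbench.parquet")

# `label` is a hypothetical column name; check df.columns against the
# released schema before relying on it.
print(df.shape)                    # expect 500 rows
print(df["label"].value_counts())  # expect a 250/250 correct/incorrect split
```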
CriticLeanBench is a specialized benchmark for evaluating the critical reasoning of AI models on the task of validating translations of natural-language mathematics into formal Lean 4 theorem statements.
- Its core purpose is to assess a model's ability to act as a "critic": determining whether a piece of Lean 4 code faithfully captures the semantic and logical intent of the original mathematical problem, going beyond simple syntactic correctness (see the Lean sketch after this list).
- The dataset consists of 500 human-verified pairs, evenly split between 250 correct formalizations and 250 incorrect ones exhibiting common, representative error patterns.
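To make the critic task concrete, the Lean 4 sketch below shows a hypothetical pair (not drawn from the dataset) in the spirit of the benchmark's error patterns: both statements type-check, but only the first preserves the problem's hypotheses.

```lean
import Mathlib

-- Natural-language problem (hypothetical example):
-- "Show that for every real number x > 0, we have x + 1/x ≥ 2."

-- Faithful formalization: the positivity hypothesis is kept.
theorem amgm_faithful (x : ℝ) (hx : 0 < x) : x + 1 / x ≥ 2 := by
  sorry

-- Semantically wrong formalization: dropping `0 < x` changes the claim
-- (it is false for negative x), yet Lean still elaborates the statement
-- without complaint, exactly the kind of error a syntax check misses.
theorem amgm_unfaithful (x : ℝ) : x + 1 / x ≥ 2 := by
  sorry
```

A critic model is judged on whether it can flag the second statement as unfaithful even though it compiles.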