CriticLeanBench: A Benchmark for Evaluating Mathematical Formalization Critics

Introduction

Dataset: CriticLeanBench
Modalities: Text, Image
Formats: parquet
Languages: English
Size: 393 kB
Release Date: 2025-07-08
Domain: Mathematical Reasoning, Formal Verification
License: Apache License 2.0

CriticLeanBench is a specialized benchmark for evaluating the critical reasoning of AI models on the task of validating translations of natural-language mathematics into formal Lean 4 theorem statements.

  • Its core purpose is to assess a model's ability to act as a "critic": determining whether a piece of Lean 4 code faithfully captures the semantic and logical intent of the original mathematical problem, going beyond simple syntactic correctness.
  • The dataset consists of 500 human-verified pairs, evenly balanced between 250 correct formalizations and 250 incorrect ones exhibiting common, representative error patterns (a hypothetical pair of this kind is sketched after this list).
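
To make the critic task concrete, the sketch below shows the kind of correct/incorrect pair the benchmark targets. The problem statement and both theorems are hypothetical illustrations, not actual dataset samples; only the theorem statements matter to a critic, so the proofs are left as `sorry`.

```lean
import Mathlib

-- Natural-language problem (hypothetical, for illustration only):
--   "Show that for every real number x with x > 1, we have x ^ 2 > x."

-- A faithful formalization: the quantifier, hypothesis, and conclusion
-- all match the prose. A good critic labels this pair "correct".
theorem faithful (x : ℝ) (hx : 1 < x) : x < x ^ 2 := by
  sorry

-- An unfaithful formalization: it type-checks, but silently weakens the
-- hypothesis to x > 0, so it no longer states the same claim (indeed it
-- is false for 0 < x < 1). A good critic labels this pair "incorrect".
theorem unfaithful (x : ℝ) (hx : 0 < x) : x < x ^ 2 := by
  sorry
```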

Data Sample