CriticLeanBench: A Benchmark for Evaluating Mathematical Formalization Critics

Introduction

Dataset

CriticLeanBench

Modalities

Text, Image

Formats

parquet

Languages

English

Size

393kB

Release Date

2025-07-08

Domain

Mathematical Reasoning, Formal Verification

License

Apache license 2.0

CriticLeanBench is a specialized benchmark designed to evaluate the critical reasoning of AI models, specifically on the task of validating the translation of natural language mathematics into formal Lean 4 theorem statements.

  • Its core purpose is to assess a model's ability to act as a "critic," determining whether a piece of Lean 4 code faithfully captures the semantic and logical intent of the original mathematical problem, going beyond simple syntactic correctness.

  • The dataset consists of 500 human-verified pairs, critically balanced between 250 correct formalizations and 250 incorrect ones that feature common and representative error patterns.

Data Sample

Designed by 2077AI Team