CriticLeanBench: A Benchmark for Evaluating Mathematical Formalization Critics

Introduction

Dataset	CriticLeanBench
Modalities	Text, Image
Formats	parquet
Languages	English
Size	393kB
Release Date	2025-07-08
Domain	Mathematical Reasoning, Formal Verification
License	Apache license 2.0

CriticLeanBench is a specialized benchmark designed to evaluate the critical reasoning of AI models, specifically on the task of validating the translation of natural language mathematics into formal Lean 4 theorem statements.

Its core purpose is to assess a model's ability to act as a "critic," determining whether a piece of Lean 4 code faithfully captures the semantic and logical intent of the original mathematical problem, going beyond simple syntactic correctness.
The dataset consists of 500 human-verified pairs, critically balanced between 250 correct formalizations and 250 incorrect ones that feature common and representative error patterns.

About

Mission

Events

News

Opportunities

Partnerships

Research

Datasets

Projects

EVA

Campus Program

Challenges

Ventures

CriticLeanBench

A Benchmark for Evaluating Mathematical Formalization Critics

CriticLeanBench: A Benchmark for Evaluating Mathematical Formalization Critics

Introduction

Data Sample