FormalMATH: A Large-Scale Benchmark for Formal Mathematical Reasoning in Lean4
Introduction
Dataset | FormalMATH |
---|---|
Modalities | Text |
Formats | JSON |
Languages | English, Chinese |
Size | 5.26 MB |
Release Date | 2025-05-05 |
Domain | Mathematics |
License | MIT |
FormalMATH is a large-scale benchmark designed to evaluate and advance the capabilities of large language models (LLMs) in the challenging domain of formal mathematical reasoning. It contains 5,560 formally verified problems, each expressed as a theorem statement in the Lean4 proof assistant.
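To make the format concrete, the snippet below is a minimal, hypothetical sketch of a Lean4 theorem statement in the general shape such an entry takes; it is illustrative only, not drawn from FormalMATH, and assumes a Mathlib-based setup with the proof left for the prover to supply.

```lean
import Mathlib

/-- Hypothetical Olympiad-style statement, illustrating the general shape of a
    formal entry: a named theorem whose proof the evaluated prover must supply. -/
theorem sum_of_squares_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  sorry
```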
- The dataset is comprehensive in scope, featuring problems that range from high-school Olympiad challenges to undergraduate-level theorems. It covers diverse mathematical fields, including algebra, applied mathematics, calculus, number theory, and discrete mathematics. Each entry consists of a natural-language problem paired with its corresponding formal Lean4 statement, along with solutions and other metadata; a sketch of how such entries might be loaded follows this list.
- Created using a novel human-in-the-loop autoformalization pipeline, FormalMATH serves as a robust and difficult testbed for current and future LLM-based theorem provers. Evaluations show that even state-of-the-art models struggle with this benchmark, highlighting significant room for improvement in automated reasoning.
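Because the benchmark is distributed as JSON, the sketch below shows one way the file might be read. The file name and the field names `informal_statement` and `formal_statement` are assumptions for illustration and are not confirmed by this card; adjust them to the actual release layout.

```python
import json
from pathlib import Path

# Hypothetical path and field names; adapt to the actual FormalMATH release.
DATA_PATH = Path("formalmath.json")

with DATA_PATH.open(encoding="utf-8") as f:
    problems = json.load(f)  # assumed: a list of problem records

for entry in problems[:3]:
    # Assumed keys: the natural-language problem and its Lean4 formalization.
    print(entry.get("informal_statement", "")[:80])
    print(entry.get("formal_statement", "")[:80])
    print("-" * 40)
```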