FormalMATH: A Formal Mathematics Benchmark for Pushing the Limits of AI
Homepage: https://spherelab.ai/FormalMATH/
Github: https://github.com/Sphere-AI-Lab/FormalMATH-Bench
Hugging Face: https://huggingface.co/SphereLab
As large language models (LLMs) have made breakthroughs in tasks like natural language processing and code generation, formalized mathematics has become crucial for testing their logical reasoning limits.
Unlike natural language processing and code generation, formalized mathematics demands absolute logical rigor, making it a paradigm for testing LLMs' reasoning. It requires converting mathematical propositions into a precise formal language and deriving them within a strict axiomatic system; any semantic ambiguity or reasoning gap leads to proof failure. Unlike traditional tasks, verification is not subjective: formal tools check symbolic consistency. This zero-tolerance structure exposes model flaws in multi-step reasoning, abstract symbol management, and proof-strategy planning. Building benchmarks that incorporate advanced logical reasoning, cross-domain knowledge transfer, and counterintuitive proof exploration is now key to overcoming the limits of LLMs' mathematical reasoning.
On May 5, the 2077AI open-source community, together with institutions including The Chinese University of Hong Kong, Westlake University, M-A-P, Zhejiang University, and the Max Planck Institute, released FormalMATH. It is the largest and most extensive Lean4 formal mathematics benchmark to date, containing 5,560 verified mathematical propositions that cover Olympiad and undergraduate topics across 12 areas, including algebra, calculus, number theory, and discrete mathematics.
The team's systematic evaluation reveals that even state-of-the-art theorem-proving LLMs achieve merely a 16.46% Pass@32 (success rate with up to 32 attempts) on this benchmark. This highlights the performance limitations of current automated theorem-proving systems when handling complex, multi-domain problems.
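As a point of reference, Pass@k is commonly computed with the standard unbiased estimator over n sampled attempts per problem. The sketch below is only illustrative: FormalMATH's exact evaluation protocol may differ, and the per-theorem numbers are hypothetical, not actual results.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn from
    n generated attempts (of which c were verified correct) succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-theorem results: (attempts generated, proofs verified by Lean4).
results = [(32, 0), (32, 3), (32, 1), (32, 0)]
benchmark_pass_at_32 = sum(pass_at_k(n, c, 32) for n, c in results) / len(results)
print(f"Pass@32 = {benchmark_pass_at_32:.2%}")  # 50.00% here: 2 of 4 theorems solved
```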
1. Challenges in Formal Mathematics Automated Reasoning
In the quest for general-purpose mathematical reasoning, several benchmarks for evaluating LLMs' automated theorem-proving performance have emerged. But as models rapidly chase scores on these benchmarks, their limitations in scale and difficulty have become evident. This has driven research teams to revisit the design criteria for automated reasoning evaluation, especially cross-domain generalization and the handling of complex logical structures.
Formal Mathematical Reasoning (FMR) demands that models rigorously formalize and automatically prove mathematical propositions within strict logical systems like Lean4. This task imposes high demands on AI systems' reasoning, generalization, and symbolic manipulation abilities.
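As a concrete illustration (a hypothetical example in the spirit of the task, not an item from FormalMATH), the informal claim "for every real number a, a² + 1 is positive" becomes a Lean4 theorem that the compiler type-checks and that a prover must close with a machine-verified proof:

```lean
import Mathlib

-- Informal statement: "For every real number a, a² + 1 > 0."
-- The prover must supply a tactic script that Lean4 accepts without gaps.
theorem sq_add_one_pos (a : ℝ) : 0 < a ^ 2 + 1 := by
  positivity
```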
Existing benchmarks like MiniF2F (244 problems) and ProofNet (186 problems) are widely used but have the following shortcomings:
- Limited domain coverage: They mainly focus on basic areas like Olympiad math and number theory, offering little coverage of other mathematical fields and failing to comprehensively test models' knowledge structures.
- Small scale: The limited number of problems makes it easy for models to "learn routines" and boost scores, without effectively assessing their broad generalization ability for new problems.
- Performance saturation: Some models already achieve success rates above 80% on these benchmarks, so more challenging evaluations are urgently needed to drive further technological breakthroughs.
2. FormalMATH Benchmark Design and Data Integration
To overcome the inherent limitations of traditional benchmarks in scale, problem types, and difficulty, the FormalMATH project was designed from the outset to be a high-quality, large-scale, cross-domain benchmark. In its construction strategy, FormalMATH breaks away from the traditional paradigm of manual sampling and labeling and instead builds a multi-stage AI-automated pipeline. This pipeline first mines high-value mathematical propositions, generates formalized statements, and assembles a preliminary benchmark library. In parallel, an expert review process scrutinizes the logical rigor and notational correctness of the AI-generated content. Through this human-machine collaborative model, the project ensures both the efficiency of benchmark construction and the accuracy of its content: each proposition meets the strict standards of formalized mathematics, achieving a dual breakthrough in efficiency and quality.
2.1 Data Sources and Domain Distribution
During construction, one of the biggest challenges the team faced was expanding to thousands of formalized proposition samples with high quality and consistency. To this end, the development team designed an automated generation pipeline that integrates LLM drafting, logical verification, cross-checking by multiple models, and a disproof (proof-by-contradiction) filter, supplemented by expert manual review, which greatly improves data-processing efficiency and overall consistency. With this mechanism in place, the team built a clearly structured, well-stratified problem bank. To cover the breadth of formal mathematics tasks, the dataset was systematically designed and balanced in both problem count and domain distribution, aiming to reflect realistic and diverse reasoning scenarios.

- Number of Problems: Contains 5,560 Lean4 formalized propositions, approximately 22.8 times the size of the classic MiniF2F benchmark.
- Domain Coverage: Spans 12 major areas, including algebra, applied mathematics, calculus, number theory, precalculus, discrete structures, and geometry. The difficulty ranges from Olympiad-level to advanced undergraduate courses.
- Data Sources: The sources are diverse, including international mathematical resources and competitions such as Omni-MATH, Numina-Olympiad, AIME, BlueMO, U-Math, HardMath, and DEMIMATH.
2.2 Multi-stage Automation and Human-Assisted AI Pipeline
To support a problem set that is wide-ranging and diverse in difficulty, the team not only faced the engineering challenge of scaling up processing but also needed to ensure that each proposition retained logical rigor and semantic consistency during formalization. To this end, the research team designed and implemented a multi-stage data generation and verification pipeline that deeply integrates automation and human review, ensuring quality and semantic accuracy throughout the journey from natural language to Lean4 propositions.

Core stages and key mechanisms of the pipeline:
- Customized autoformalization models: The project uses general-purpose LLMs to generate formalized statements and keeps only those that the Lean4 compiler accepts. Every candidate proposition must be automatically verified by the compiler, ensuring strict type and syntax correctness and filtering out ill-formed expressions. The retained data is then used to train customized autoformalization models.
- Multi-model LLM semantic consistency judgment: Using several large models such as o1-mini and Claude-3.5-Sonnet, the project employs a chain-of-thought back-translation verification strategy to cross-check the semantics of each proposition across models. A sample passes only if every model judges the natural-language statement and the Lean4 statement to be semantically equivalent. Experimental data shows that this step directly filters out over 60% of samples that are syntactically correct but semantically inconsistent; in high-difficulty areas the consistency rate is even lower, underscoring the necessity of this step in complex scenarios.
- Negation-based proof-by-contradiction filtering: For each candidate proposition, the system automatically constructs its negation and asks existing LLM provers to prove it. If the negation is successfully proven, the original proposition is automatically removed. This further reduces human workload and improves the validity of the final set (a simplified sketch of these filtering stages follows this list).
- Olympiad/professional-level human final review: Twelve experts with Olympiad gold medals or strong mathematical backgrounds conduct strict semantic consistency reviews of the remaining samples. The final pass rate is 72.09%, significantly improving labeling efficiency.
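The sketch below outlines the automated filtering stages in simplified form. It is a minimal illustration, not the project's actual code: the compiler, back-translation, judging, negation, and prover hooks are passed in as hypothetical callables, and the negation construction is deliberately abstracted away.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    informal: str   # natural-language statement
    formal: str     # Lean4 statement produced by the autoformalizer

def keep_candidate(
    c: Candidate,
    judges: list[str],                               # names of judge LLMs
    lean_compiles: Callable[[str], bool],            # wraps the Lean4 compiler
    back_translate: Callable[[str, str], str],       # (formal, judge) -> informal
    same_meaning: Callable[[str, str, str], bool],   # (a, b, judge) -> equivalent?
    negate: Callable[[str], str],                    # builds the negated statement
    prover_succeeds: Callable[[str], bool],          # LLM prover + Lean4 check
) -> bool:
    # Stage 1: the statement must compile (type- and syntax-correct Lean4).
    if not lean_compiles(c.formal):
        return False
    # Stage 2: every judge model must agree, via chain-of-thought back-translation,
    # that the Lean4 statement means the same thing as the informal one.
    for judge in judges:
        recovered = back_translate(c.formal, judge)
        if not same_meaning(c.informal, recovered, judge):
            return False
    # Stage 3: if an existing prover can prove the negation, the statement is
    # false or ill-posed and is discarded.
    if prover_succeeds(negate(c.formal)):
        return False
    # Survivors proceed to expert human review.
    return True
```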
This hybrid generation strategy not only significantly reduces expert labeling costs but also provides a replicable reference for building high-quality benchmarks for other formal-system tasks in the future.
3. Quantitative Evaluation of Large Language Models on FormalMATH
After building the benchmark, the research team systematically evaluated the performance of current mainstream LLM provers. To reflect how these models handle advanced mathematical tasks in real-world applications, all models were tested under identical settings, and their general-purpose reasoning was observed across different fields and difficulty levels.
3.1 Overall Performance
On the full FormalMATH dataset, the overall pass rate of mainstream LLM automatic theorem provers is far lower than on traditional benchmarks. Even with larger sampling budgets and combined methods, the relative improvement in pass rate is limited and far from meeting practical standards.

3.2 Domain-specific Accuracy Analysis
Although the overall evaluation shows that mainstream LLMs perform poorly on FormalMATH as a whole, their performance varies across mathematical subfields. Using FormalMATH's fine-grained domain labels, the team computed success rates for each subfield independently, attempting to reveal the "local limits" and "structural weaknesses" of model capabilities and to guide subsequent model optimization and training strategies.
- Algebra/Applied Mathematics: On some high-school and undergraduate algebra and applied mathematics problems, models reach success rates of 10%-20%.
- Advanced Calculus/Discrete Structures/Higher-order Number Theory: In these fields, the performance of almost all models drops sharply, with many tasks showing success rates below 5%; some areas (such as advanced analysis and complex analysis) are close to 0%.
This pronounced domain bias stems from training data that is concentrated on lower-difficulty problem types, while data for advanced fields remains extremely sparse. The analysis further indicates that current large models are not completely "ignorant" of specific subfields; rather, they gradually "become unstable" under the combined pressure of training distribution, symbolic complexity, and structural depth. This reflects a structural bias in the design of training data and suggests that a leap in cross-domain generalization will require systematically introducing learning mechanisms and representation paradigms suited to high-level reasoning tasks.
3.3 Main Error Types and Model “Failure” Mechanisms
The experiments further revealed that, in complex task scenarios, large models exhibit a series of systematic failure modes, indicating that current reasoning architectures and training paradigms have not yet adapted to the high standards formal systems impose on logical chains, detailed expression, and semantic consistency. The team performed a qualitative attribution and structured classification of the models' failure samples on FormalMATH tasks and summarized the following main failure modes:

- Incomplete reasoning chains: A large proportion of the generation process is missing steps or interrupted, failing to provide a complete proof process.
- Overuse of automation tactics: Models tend to call automated tactics such as aesop and linarith too frequently, incorrectly "compressing" multi-step derivations into a single step and omitting key information or sub-goal analysis (see the Lean sketch after this list).
- Introduction of redundant or incorrect assumptions: For example, redundantly re-defining variables and hypotheses, or introducing irrelevant information that disrupts the proof structure.
- Insufficient ability to handle complex inequalities and higher-order reasoning: Automated inequality-proving tactics fail on multivariable, higher-order inequalities and symmetric structures.
- Inconsistent semantics between the target and the original task: During autoformalization, key information is often lost or the logic is simplified, leading to mismatches between the formal statement and the semantics of the original problem.
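The contrast below illustrates the tactic-compression pattern on a toy inequality (a hypothetical example, not a benchmark item): the first proof hands the entire argument to a single automation call, which tends to fail with no partial progress on harder multivariable or symmetric inequalities, while the second spells out the intermediate steps.

```lean
import Mathlib

-- Automation-heavy: the whole argument is compressed into one tactic call.
example (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  nlinarith [sq_nonneg (a - b)]

-- Explicit chain: each intermediate fact is stated, mirroring the informal proof.
example (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  have h : 0 ≤ (a - b) ^ 2 := sq_nonneg (a - b)
  have h' : (a - b) ^ 2 = a ^ 2 - 2 * a * b + b ^ 2 := by ring
  rw [h'] at h
  linarith
```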
These failure mechanisms point to the structural difficulties that current LLMs face in the mapping process from natural language to formal language. They also provide clear breakthrough points for future reasoning capability modeling and training set optimization.
3.4 Reasoning Prompt Experiment Results
To further explore ways to improve the model's reasoning performance, the research team introduced different forms of reasoning prompts and compared their impact on the final proof quality. The results show that the choice of prompting strategy has a significant impact on the model's logical flow and semantic consistency, making the prompt design itself an important variable:
- Naive CoT (chain-of-thought step-by-step reasoning): Performs best and better guides the model to produce fine-grained reasoning.
- NL-augmented CoT (natural-language-augmented reasoning): Surprisingly, this decreases the model's pass rate and increases reasoning confusion.
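For concreteness, the two strategies roughly correspond to prompt shapes like the following. These are paraphrased placeholders, not the team's actual prompts; `{formal_statement}` and `{informal_solution}` are hypothetical template slots.

```python
# Naive CoT: ask the model to reason step by step directly toward the Lean4 proof.
NAIVE_COT = (
    "Think step by step, then complete the following Lean 4 theorem.\n"
    "Output only the finished Lean proof.\n\n{formal_statement}"
)

# NL-augmented CoT: additionally feed an informal, human-style solution sketch.
NL_AUGMENTED_COT = (
    "Here is an informal solution sketch:\n{informal_solution}\n\n"
    "Using it as guidance, think step by step and complete the following "
    "Lean 4 theorem. Output only the finished Lean proof.\n\n{formal_statement}"
)
```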
The preliminary conclusion is that there is a mismatch between humans' abstract problem-solving sketches and the strict, step-by-step logical deduction required by formal systems: vague natural-language guidance can break, or even derail, the model's logical chain. These prompting experiments not only reveal the structural gap between "human thinking" and "formal logical deduction" but also point toward how prompt engineering should be optimized. General prompting paradigms from traditional natural language tasks are no longer effective for symbolic reasoning tasks; future prompt design needs to be deeply integrated with the logical characteristics of formalized mathematics, building dedicated strategies that respect the strict rules of the symbolic system and the requirements of structured reasoning. Only then can models' potential in mathematical reasoning be fully activated and the leap from "natural language understanding" to "formal logical deduction" be realized.
4. Summary and Outlook
The FormalMATH project has probed, along multiple dimensions, the capability limits of existing automatic theorem-proving systems on high-level, multi-domain mathematical tasks. It not only provides a stricter standard for evaluation but also clearly identifies key research gaps in reasoning strategies, data distribution, and symbolic expression. The release of FormalMATH gives the academic and industrial communities a rigorous, new-generation testing platform, driving the continuous evolution of AI's automated mathematical reasoning toward broader and deeper knowledge challenges. This collaboration also reflects 2077AI's ongoing commitment and mission to advance the foundational capabilities of open-source AI. In the future, 2077AI will continue to promote the deep integration of formalized mathematics and large-model research, laying the groundwork for AI systems to reach higher levels of logical capability and scientific cognition.