Breaking Traditional Knowledge Dependency: KOR-Bench for Evaluating Intrinsic Reasoning Abilities of Models

1. Breaking Cognitive Boundaries: A Knowledge-Orthogonal Framework for Evaluating Models' Reasoning Abilities

In the field of AI evaluation, models' reasoning abilities have long been obscured by the "noise" of pre-trained knowledge. Existing benchmarks, which lean heavily on accumulated prior knowledge, often fail to distinguish whether a model truly possesses reasoning capability or is merely repeating patterns from its training data.

In March 2025, KOR-Bench, developed under the leadership of the M-A-P research team and jointly open-sourced with organizations such as 2077AI, introduced the innovative concept of "knowledge orthogonality," addressing the long-standing difficulty benchmarks have had in assessing models' reasoning abilities. Knowledge orthogonality ensures that KOR-Bench's evaluation tasks are independent of pre-trained knowledge, so models must rely on their understanding of newly introduced rules and pure reasoning to solve the problems.

Overview of KOR-Bench

Through its meticulously designed rule system, KOR-Bench not only establishes a testing environment for accurately evaluating models' intrinsic reasoning abilities, but also pioneers a new paradigm for assessing artificial intelligence capabilities.

2. Deep Deconstruction and Reconstruction: A Precise Evaluation Framework Spanning Five Dimensions

KOR-Bench constructs a comprehensive evaluation system covering five core dimensions, each meticulously designed to test different aspects of reasoning abilities:

The five core evaluation dimensions of KOR-Bench

  1. Operation: Mathematical symbols and rules are redefined to test the model's abstract computational ability. For example, a new operator "※" is defined such that, when a is a multiple of b, a※b = ba + 2, with a different rule applied otherwise (see the sketch after this list).
  2. Logic: Novel logical symbol systems and inference rules are introduced to examine the model's formal reasoning ability, spanning complex propositional logic, predicate logic, and modal logic.
  3. Cipher: Entirely new encryption and decryption rules are designed to test the model's ability to apply rules and transform information, ranging from simple substitution to complex multi-step encryption algorithms.
  4. Puzzle: Complex problems requiring multi-step reasoning are constructed to assess the model's problem-solving and strategy-planning abilities, including Sudoku variants, mazes, and combinatorial optimization problems.
  5. Counterfactual: Virtual scenarios and rules are created to test the model's reasoning in hypothetical situations, with a particular focus on whether it can break free from the constraints of real-world knowledge.
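
To make the rule-based character of these tasks concrete, here is a minimal Python sketch in the spirit of the Operation dimension. The function names, the second branch of the operator, and the reading of "a※b = ba + 2" as b * a + 2 are all assumptions made for illustration; the benchmark's actual rule texts may differ.

```python
# Hypothetical illustration of a KOR-Bench-style Operation task.
# The branch rules of the "※" operator below are assumptions made for
# demonstration; the benchmark's actual rule definitions may differ.

def custom_op(a: int, b: int) -> int:
    """Apply the made-up operator ※ under two explicit rule branches."""
    if a % b == 0:           # branch 1: a is a multiple of b
        return b * a + 2     # assumed reading of "a※b = ba + 2"
    return a + b             # branch 2: placeholder alternative rule

def check_answer(a: int, b: int, model_answer: int) -> bool:
    """Score a model's answer purely against the stated rule, not prior knowledge."""
    return model_answer == custom_op(a, b)

if __name__ == "__main__":
    # 6 is a multiple of 3, so 6※3 = 3*6 + 2 = 20 under the assumed reading.
    print(custom_op(6, 3))         # -> 20
    print(check_answer(6, 3, 20))  # -> True
```

The point of such a rule is that no amount of memorized arithmetic trivia helps: the model can only answer correctly by reading the newly stated rule and applying it step by step.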

3. Innovative Evaluation Methods and In-Depth Performance Analysis

KOR-Bench ensures the independence of evaluation tasks from pre-trained knowledge through rigorous mathematical definitions and experimental verification. In the evaluation framework, the research team introduced the Knowledge Impact Factor (β) to quantify the degree of knowledge interference. The purity of the evaluation is ensured through rule-knowledge decoupling measurement and rule centrality verification. This innovative evaluation method not only focuses on the accuracy of task completion, but also deeply analyzes the rationality of the reasoning process, the depth of rule understanding, and the innovativeness of the solution strategy. Through multi-level performance analysis, KOR-Bench can comprehensively evaluate the model's rule learning efficiency, reasoning chain integrity, and result reliability.
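
The paper's exact formulation of the Knowledge Impact Factor is not reproduced in this article, so the following sketch should be read only as an assumed proxy: it compares a model's accuracy when the rule text is provided against its accuracy when the rule is withheld, with values near zero suggesting the task cannot be solved from pre-trained knowledge alone. All function names here are illustrative.

```python
# Hypothetical sketch of a knowledge-impact measurement in the spirit of the
# Knowledge Impact Factor (beta) mentioned above. Assumption: beta compares
# accuracy with the rule text provided versus withheld; the paper's exact
# definition may differ.

def accuracy(results: list) -> float:
    """Fraction of correctly answered items (results are booleans)."""
    return sum(results) / len(results) if results else 0.0

def knowledge_impact_factor(with_rule: list, without_rule: list) -> float:
    """Assumed proxy: share of rule-given accuracy that survives when the rule
    is withheld. Values near 0 indicate the task is knowledge-orthogonal,
    i.e. it cannot be solved from pre-trained knowledge alone."""
    acc_with = accuracy(with_rule)
    acc_without = accuracy(without_rule)
    return acc_without / acc_with if acc_with > 0 else 0.0

# Toy usage: 80% accuracy with the rule shown, 10% without it.
beta = knowledge_impact_factor([True] * 8 + [False] * 2,
                               [True] * 1 + [False] * 9)
print(f"beta = {beta:.2f}")  # beta = 0.12 -> answers depend on the stated rule
```

The actual framework also reports rule-knowledge decoupling measurement and rule centrality verification, as described above; the sketch only captures the accuracy-comparison intuition.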

The data construction process of KOR-Bench

In practical evaluation, the then-strongest models, o1-Preview and o1-Mini (the latest models available as of the paper's release on October 9, 2024), achieved accuracy rates of 72.88% and 70.16%, respectively, while the performance of Claude-3.5-Sonnet (58.96%) and GPT-4o (58.00%) revealed the limitations of existing technology. In high-difficulty tasks such as cipher and puzzle reasoning in particular, even top-tier models showed significant capability bottlenecks. These results not only quantify the boundaries of current AI systems' reasoning capabilities but also point the way for future improvements.

KOR-Bench provides a unified standard for reasoning capability evaluation and a reproducible evaluation process, offering a reliable basis for performance comparison among models. In terms of technological development, KOR-Bench helps researchers accurately identify model capability weaknesses, guiding algorithm optimization and effectively promoting the improvement of pure reasoning capabilities. Meanwhile, its potential applications are gradually emerging in model selection decisions, educational training evaluations, and academic research innovations.

Looking to the future, KOR-Bench will continue to evolve. By expanding the scale and diversity of the dataset, introducing parametric rule generation, and deepening the evaluation of reasoning layers, it will continuously enhance its evaluation capabilities. As multimodal evaluation capabilities are developed, KOR-Bench will extend its assessment value to a wider range of fields.

As a participant in this pioneering project, 2077AI played a significant role in the construction and validation of the evaluation framework. Our technical team was deeply involved in the formulation and optimization of evaluation standards, making important contributions, especially in verifying model performance and analyzing results. By open-sourcing and sharing this innovative achievement, 2077AI looks forward to working with the entire AI community to further advance reasoning capability evaluation and contribute to building more powerful artificial intelligence systems.