
A Novel Paradigm for Model Evaluation: The Innovative Multi-source Document Parsing Evaluation Framework OmniDocBench

The performance of different models in handling various text types on OmniDocBench.

Document content extraction, one of the fundamental tasks in computer vision, plays a crucial role in acquiring training data for large language models (LLMs) and in retrieval-augmented generation. However, we observed that while LLMs rely heavily on data from academic papers and journals, high-quality documents with complex formats, such as newspapers and magazines, remain underutilized. To address this gap, Shanghai AI Laboratory, in collaboration with 2077AI and other institutions, has open-sourced the OmniDocBench project. By constructing a multi-source document parsing evaluation benchmark covering nine document types (academic papers, textbooks, exam papers, magazines, books, notes, financial reports, newspapers, and slides), the project overcomes the limitations of existing evaluation systems in both document-type diversity and coverage of assessment dimensions. This evaluation framework not only provides a reliable standard for the development of document parsing technologies but also pioneers a new paradigm for document intelligence evaluation.

Comparison with Related Work

Deep Deconstruction and Reconstruction: The Meticulous Design of an All-Dimensional Evaluation System

OmniDocBench pioneers a new paradigm for document parsing evaluation through a systematic data construction process.

During the data acquisition phase, the project team started from an initial pool of 200,000 PDF documents. Using ResNet-50 visual features with Faiss clustering-based sampling, they selected 6,000 pages with a reasonable, diverse distribution, so that the most representative data would form the candidate evaluation set. These pages were then annotated by professional annotators and underwent rigorous screening and balancing, yielding a high-quality evaluation dataset of 981 pages. The dataset covers nine document types, from academic papers to exam papers.
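
The sampling step lends itself to a short illustration. The sketch below assumes torchvision's ResNet-50 as the page-image encoder and Faiss k-means for clustering; the cluster count and per-cluster quota are illustrative placeholders (chosen so that 100 × 60 = 6,000 pages), not the project's published hyperparameters.

```python
# Minimal sketch: embed rendered PDF pages with ResNet-50, cluster the
# embeddings with Faiss k-means, and keep the pages nearest each centroid
# so every visual "style" of document is represented in the sample.
import faiss
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet-50 with its classification head removed, used as a page encoder.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_pages(page_paths):
    """Encode rendered page images into 2048-d float32 feature vectors."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                         for p in page_paths])
    return encoder(batch).squeeze(-1).squeeze(-1).numpy().astype("float32")

def diverse_sample(features, n_clusters=100, per_cluster=60):
    """Cluster pages and return indices of the pages nearest each centroid."""
    kmeans = faiss.Kmeans(d=features.shape[1], k=n_clusters, niter=20, seed=0)
    kmeans.train(features)
    index = faiss.IndexFlatL2(features.shape[1])
    index.add(features)
    _, ids = index.search(kmeans.centroids, per_cluster)  # (k, per_cluster)
    return sorted(set(ids.ravel().tolist()))
```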

An Overview of the Document Types and Annotation Information in OmniDocBench

In the design of the annotation system, OmniDocBench constructs an unusually rich multi-level annotation framework. At the layout detection level, the project not only provides bounding-box annotations for 19 region categories but also introduces annotations for layout attributes, reading order, and hierarchical relationships. This multi-dimensional annotation scheme enables the dataset to comprehensively evaluate model performance across scenarios. For content recognition in particular, each type of region receives a format-appropriate annotation: plain text regions are annotated with the text itself, formulas are annotated in LaTeX, and tables are annotated in both HTML and LaTeX to ensure the evaluation is complete and accurate.
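
To make the multi-level scheme concrete, here is a hypothetical example of what a single annotated page might look like, using only the fields named above (bounding boxes, reading order, hierarchy, and per-category content formats). The field names are illustrative, not the dataset's actual schema.

```python
# Hypothetical page annotation illustrating the levels described above.
page_annotation = {
    "page_attributes": {"doc_type": "newspaper", "language": "en"},
    "regions": [
        {
            "category": "text_block",            # one of 19 region categories
            "bbox": [72.0, 90.5, 523.4, 180.2],  # x1, y1, x2, y2
            "order": 0,                          # position in reading order
            "parent": None,                      # hierarchical relationship
            "content": "Plain text is annotated directly ...",
        },
        {
            "category": "equation",
            "bbox": [120.0, 200.0, 470.0, 240.0],
            "order": 1,
            "parent": None,
            "content": r"E = mc^2",              # formulas annotated as LaTeX
        },
        {
            "category": "table",
            "bbox": [60.0, 260.0, 540.0, 420.0],
            "order": 2,
            "parent": None,
            # Tables carry both formats so either representation can be scored.
            "content_html": "<table><tr><td>cell</td></tr></table>",
            "content_latex": r"\begin{tabular}{c} cell \end{tabular}",
        },
    ],
}
```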

To ensure annotation quality, OmniDocBench has implemented a rigorous three-tier quality control mechanism. Initially, advanced AI models are used for intelligent pre-annotation, including LayoutLMv3 fine-tuned for layout detection, and PaddleOCR, UniMERNet, and GPT-4o for recognizing text, formulas, and tables, respectively. Subsequently, a professional annotation team conducts a comprehensive review of the pre-annotation results, refining each detection box and supplementing annotations for reading order and hierarchical relationships. In the final expert quality inspection phase, the project innovatively employs CDM rendering technology to identify non-renderable elements, and three domain experts conduct the final review to ensure the utmost reliability of the dataset.
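
As a schematic summary, the three tiers can be read as a simple pipeline. The sketch below uses hypothetical stand-in functions for the named models and for the CDM render check; none of these are real library APIs, and the real pipeline's interfaces will differ.

```python
# Schematic three-tier QC flow; all model calls are canned stand-ins.
from dataclasses import dataclass

@dataclass
class Region:
    category: str            # "text", "equation", or "table"
    content: str = ""
    verified: bool = False

# Tier 1: AI pre-annotation (LayoutLMv3 for layout; PaddleOCR, UniMERNet,
# and GPT-4o for text, formula, and table recognition in the real pipeline).
RECOGNIZERS = {
    "text": lambda r: "recognized text",
    "equation": lambda r: r"\alpha + \beta",
    "table": lambda r: "<table><tr><td>cell</td></tr></table>",
}

def detect_layout(page) -> list[Region]:
    return [Region("text"), Region("equation")]   # stand-in for LayoutLMv3

def pre_annotate(page) -> list[Region]:
    regions = detect_layout(page)
    for r in regions:
        r.content = RECOGNIZERS[r.category](r)
    return regions

# Tier 2: annotators refine boxes and add reading order / hierarchy.
def human_review(regions: list[Region]) -> list[Region]:
    return regions                                 # placeholder for manual work

# Tier 3: CDM rendering flags non-renderable formulas for expert review.
def renders_ok(latex: str) -> bool:
    return "\\error" not in latex                  # stand-in for a render check

def expert_check(regions: list[Region]) -> list[Region]:
    for r in regions:
        r.verified = r.category != "equation" or renders_ok(r.content)
    return regions

final = expert_check(human_review(pre_annotate("page_001.png")))
```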

Innovative Evaluation Methods and In-depth Performance Analysis

Innovation and systematic design are the standout strengths of OmniDocBench's evaluation framework.

The extraction module implements a complete processing workflow. The preprocessing stage standardizes the details, handling basic tasks such as removing images and normalizing Markdown tags. Special components are then extracted in a carefully designed order so that each content type is identified and extracted accurately. For inline formulas in particular, the framework converts outputs to a unified Unicode representation, resolving the inconsistent formats produced by different models; a sketch of this normalization follows below.
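
As a concrete illustration of that preprocessing, the sketch below strips image tags, unwraps Markdown emphasis, and converts inline LaTeX formulas to Unicode text. Using pylatexenc for the Unicode step is an assumption for illustration; the project's own converter may differ.

```python
# Minimal preprocessing sketch: drop images, normalize Markdown emphasis,
# and render inline $...$ formulas into a unified Unicode form so outputs
# from different models become directly comparable.
import re
from pylatexenc.latex2text import LatexNodes2Text

_latex_to_text = LatexNodes2Text()

def normalize_markdown(md: str) -> str:
    md = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", md)   # remove image tags
    md = re.sub(r"\*\*([^*]+)\*\*", r"\1", md)     # unwrap bold
    md = re.sub(r"\*([^*]+)\*", r"\1", md)         # unwrap italics
    # Convert inline formulas to a unified Unicode rendering.
    return re.sub(r"\$([^$]+)\$",
                  lambda m: _latex_to_text.latex_to_text(m.group(1)), md)

print(normalize_markdown(r"A **bold** claim: $\alpha + \beta$"))
# -> "A bold claim: α + β"
```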

Performance Evaluation of Various Document Parsing and Recognition Methods Based on OmniDocBench

In practical evaluations, OmniDocBench demonstrates a strong ability to differentiate models. Different models show their respective strengths on specific tasks: DocLayout-YOLO excels at layout detection across diverse documents, RapidTable stands out for the language adaptability of its table recognition, and PaddleOCR maintains a leading position on traditional OCR tasks. In the highly challenging task of formula recognition, the strong showings of GPT-4o, Mathpix, and UniMERNet on the CDM metric reflect breakthrough progress in this field.

As an active member of the open-source community, 2077AI has been deeply involved in the development of the OmniDocBench project, contributing to dataset construction, evaluation framework design, and result validation. Looking ahead, OmniDocBench will continue to expand its evaluation dimensions and application scenarios: introducing methods such as parametric rule generation and deepening reasoning-level assessment will further round out the evaluation system. As its multimodal evaluation capabilities develop, OmniDocBench is expected to play a broader role across more fields. 2077AI will continue to work hand in hand with the open-source community to advance document intelligence technology and contribute to building more capable artificial intelligence systems.