
A New Paradigm for Agent Evaluation: VeriGUI, a Verifiable Long-Chain GUI Benchmark

The world of artificial intelligence is undergoing a profound revolution. In the past, we were accustomed to using static benchmarks like MMLU or GPQA as a measure of an AI's "IQ." For large language models, these static benchmarks were like closed-book exams, effectively assessing how much knowledge a model "knew." However, as AI has moved beyond just "knowing" and has begun to take the form of "agents"—autonomously perceiving, reasoning, planning, and interacting with the real world—a more fundamental question has emerged: what is the actual value an agent can create?

This shift from "knowledge reserves" to "action capabilities" has given rise to a new generation of dynamic, interactive benchmarks. Rather than sets of static test questions, they are noisy, real-world simulators designed to evaluate an agent's true execution capabilities in complex, open, and unpredictable environments.

For an agent to be truly "capable," it must overcome several major challenges:

  • Long-Horizon Planning: Real-world tasks are rarely "one-and-done." An agent needs to handle complex tasks spanning hundreds or thousands of steps, which requires strategic abilities like long-term memory, dynamic plan adjustments, and recovery from errors.
  • Robust Tool Use: An agent's power lies in its ability to call upon various tools (browsers, code interpreters, APIs). The challenge isn't just "knowing how to use them," but also selecting the right tool at the right time, handling failures, and coordinating multiple tools within a single workflow.
  • Operation in Open Worlds: Agents must leave the "sandbox" and operate in the constantly changing internet or on a user's desktop full of "surprises." This demands a high degree of generalization and adaptability.

To address these challenges, a number of groundbreaking benchmarks have emerged from both academia and industry, including influential works like BrowseComp and GAIA. Against this backdrop, VeriGUI—a new open-source project led by the 2077AI Open Source Foundation—was created. It aims to provide a novel solution that is more complex, verifiable, generalizable, and closer to real-world value.

Overview of VeriGUI Dataset

VeriGUI's Paradigm Shift: From "Process Mimicry" to "Goal-Oriented"

VeriGUI is not just another incremental improvement; it is a fundamental re-imagining of how we evaluate agents. It is built on two core pillars: Long-Chain Complexity and Subtask-Level Verifiability.

Many current benchmarks are essentially tests of an agent's information retrieval capabilities. The agent must find and verify a pre-existing, static fact within the vast, unstructured information of the internet through continuous and complex searching. The essence of these tasks is not to create or change a state, but to use a series of operations to know a piece of information.

For example, a typical BrowseComp problem is: "Find the title of a scientific paper published at the EMNLP conference between 2018 and 2023, whose first author graduated from Dartmouth College and whose fourth author graduated from the University of Pennsylvania." This problem can be formally deconstructed as:

Proposition P: The paper was published at EMNLP between 2018 and 2023.
Proposition Q: The paper's first author has a bachelor's degree from Dartmouth College.
Proposition R: The paper's fourth author has a bachelor's degree from the University of Pennsylvania.

The agent's goal is to find the unique paper that satisfies $P \land Q \land R$. This is purely an information retrieval and filtering task, involving no programmatic modification of the environment's state. It's like an informational treasure hunt where the goal is to know a static fact; while the path may be complex, the destination is unique and predetermined.
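
To make the "retrieval and filtering" nature concrete, here is a minimal Python sketch of the same task reduced to a predicate check. The candidate pool, its field names, and the dummy paper are purely illustrative assumptions, not BrowseComp's actual data or evaluation code.

```python
# A minimal sketch of the task above as pure filtering: the environment's
# state is only queried, never modified. All field names are hypothetical.
candidate_papers = [
    {
        "title": "An Example EMNLP Paper",
        "venue": "EMNLP",
        "year": 2021,
        "author_alma_maters": [
            "Dartmouth College", "MIT",
            "Stanford University", "University of Pennsylvania",
        ],
    },
]

def satisfies_all(paper: dict) -> bool:
    p = paper["venue"] == "EMNLP" and 2018 <= paper["year"] <= 2023          # Proposition P
    q = paper["author_alma_maters"][0] == "Dartmouth College"                # Proposition Q
    r = (len(paper["author_alma_maters"]) >= 4
         and paper["author_alma_maters"][3] == "University of Pennsylvania") # Proposition R
    return p and q and r

answers = [paper["title"] for paper in candidate_papers if satisfies_all(paper)]
print(answers)  # a unique result if P, Q, and R jointly pin down exactly one paper
```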

However, a hallmark of human intelligence is its flexibility. Faced with the same goal, different people will take different paths but still achieve the objective. Traditional evaluation methods limit this kind of creativity.

VeriGUI vs. Existing GUI Benchmarks

VeriGUI is the first benchmark to systematically integrate this "many paths, one destination" philosophy into its evaluation. It no longer forces the agent to follow a single "correct path," but instead defines a series of clear, verifiable intermediate goals (subtasks).

The VeriGUI paradigm has the following key characteristics:

  • Long-chain tasks with step-by-step verification
  • Measurable process and attributable failures

As long as the agent achieves these goals, regardless of the strategy it uses, it is considered successful.

The Core Design Philosophy of VeriGUI

VeriGUI Dataset Design Overview

1. Subtask-Level Verifiability: Making Reward Signals More Effective

VeriGUI's design philosophy is "divergent processes, convergent outcomes." To achieve this, it introduces its most revolutionary feature: subtask-level verifiability.

A complete task $\tau$ is decomposed into a sequence of $K$ subtasks: $\tau = \tau^{(1)} \circ \tau^{(2)} \circ \cdots \circ \tau^{(K)}$. For each subtask $\tau^{(k)}$, we define a programmatically verifiable objective function $G^{(k)}$.

We only check whether the agent has achieved a meaningful intermediate result, not the specific operations it used (see the sketch after the list below). This is profoundly significant for reinforcement learning (RL) because it enables:

  • Structured Exploration: Decomposing tasks into clear intermediate sub-goals guides the agent to explore large state spaces systematically, avoiding blind trial-and-error.
  • Stable and Verifiable Feedback: By using programmatically verifiable intermediate states as the basis for rewards, it provides the RL agent with a more stable, low-noise learning signal, accelerating convergence.
  • Enhanced Robustness and Adaptability: By learning skills for various sub-tasks independently and then combining them, the agent demonstrates greater adaptability and robustness against task variations and environmental disturbances.
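
Below is a minimal Python sketch, not VeriGUI's actual API, of how subtask-level objective functions could translate into a dense reward signal for RL. The `Subtask` class, `verify` callbacks, and `dense_rewards` helper are illustrative names; the example checks reuse figures from Case 1 later in this article (Netflix, 38.64 million new subscribers).

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Subtask:
    """One verifiable intermediate goal tau^(k) with its objective function G^(k)."""
    description: str
    verify: Callable[[Any], bool]  # programmatic check of the agent's intermediate result

# Hypothetical subtasks from the streaming-research case study:
# only the outcome of each step is checked, never the click sequence.
subtasks = [
    Subtask(
        description="Identify the streaming provider with the most new subscribers",
        verify=lambda result: result.get("provider") == "Netflix",
    ),
    Subtask(
        description="Report the exact subscriber growth figure",
        verify=lambda result: abs(result.get("growth_millions", 0) - 38.64) < 1e-6,
    ),
]

def dense_rewards(agent_results: list[dict]) -> list[float]:
    """Return one reward per subtask instead of a single end-of-task score."""
    return [
        1.0 if task.verify(res) else 0.0
        for task, res in zip(subtasks, agent_results)
    ]

# Example: the agent nails subtask 1 but reports a rounded figure for subtask 2.
print(dense_rewards([{"provider": "Netflix"}, {"growth_millions": 39.0}]))  # [1.0, 0.0]
```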

2. Unprecedented Long-Chain Complexity

One of VeriGUI's core design principles is its unprecedented task complexity. This complexity is reflected not just in the length of the tasks, but also in their inherent logical dependencies.

VeriGUI's tasks are meticulously designed to consist of 2 to 15 interdependent subtasks. On average, human experts require 214.4 GUI operations to complete a single task—a number that far exceeds existing mainstream benchmarks, which typically only require a dozen or so steps. This means that completing a single VeriGUI subtask can be more complex than completing an entire task in other benchmarks.

3. Sticking to a Real Environment

VeriGUI insists on testing in a completely real, non-simplified environment (a genuine operating system and browser). This ensures that research findings are directly applicable to the real world. Agents must handle dynamically loaded web pages, the varying UI styles of different applications, and even desktop applications that lack structured information.

VeriGUI Task Domain Distribution

4. Objective and Strict Verifiability

To guarantee objective and reproducible evaluation, another cornerstone of VeriGUI is its extremely strict definition of "correctness."

Each subtask's "verifiable objective" is defined by human experts based on objective facts. For instance, in a financial analysis task, the goal might be to extract a precise numerical value from a financial report. In a desktop task, the objective might be to successfully move and rename a file to a specific location.

The verification standard is outcome-oriented and objective. It checks:

  • Did the agent extract verifiably correct information?
  • Did the agent achieve a verifiably correct system state?

This is fundamentally different from benchmarks that simply check if an agent "visited a certain URL" or "clicked a certain button." This strict mechanism completely eliminates the ambiguity of "mostly correct" or "almost there," providing a gold standard for agent evaluation.
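
As an illustration of outcome-oriented checking, consider a minimal Python sketch of the desktop example above; the file names and paths are hypothetical, and this is not VeriGUI's verification code. The check inspects only the resulting system state, never the sequence of clicks or drags that produced it.

```python
from pathlib import Path

def verify_move_and_rename(expected_path: Path) -> bool:
    """Outcome-oriented subtask check: the file must exist at the required
    location under the required name; how it got there is irrelevant."""
    return expected_path.is_file()

# Hypothetical desktop subtask: "move report.pdf into ~/Archive as report_2023.pdf".
# An action-based check ("did the agent click Rename?") would reward the path,
# not the result, and is exactly what this design avoids.
print(verify_move_and_rename(Path.home() / "Archive" / "report_2023.pdf"))
```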

5. Flexible Trajectory Execution Framework

VeriGUI provides a powerful, built-in mechanism to control task difficulty and enable targeted training through its flexible trajectory execution framework.

Any subtask node in the dataset can serve as a starting point for training or evaluation: researchers can provide the agent with the ground-truth results for the first $k-1$ subtasks and then require it to complete the remaining chain starting from the $k$-th subtask. This curriculum-learning design (a minimal sketch follows the list below) offers clear advantages:

  • Smoothly Adjusting Difficulty: Start with a single subtask and gradually transition to the full long-chain task, implementing a curriculum learning approach.
  • Performing Fine-Grained Attribution: Instead of just knowing an agent "failed," you can pinpoint that it failed on "the 3rd of 5 subtasks."
  • Focusing on Specific Capabilities: You can skip initial data collection steps and directly train the agent on data synthesis and comprehension.
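
The following Python sketch illustrates this execution scheme; the `task.ground_truth`, `task.subtasks`, `agent.solve`, and `verify` interfaces are assumed names for illustration, not VeriGUI's actual code.

```python
def run_from_subtask(task, agent, start_k: int) -> list[float]:
    """Execute a task starting at subtask `start_k` (1-indexed).

    Ground-truth results for subtasks 1..start_k-1 are handed to the agent as
    established context, so only the remaining chain is actually exercised.
    """
    context = [task.ground_truth[i] for i in range(start_k - 1)]  # assumed per-subtask ground truth
    scores = []
    for k in range(start_k - 1, len(task.subtasks)):
        result = agent.solve(task.subtasks[k], context=context)   # hypothetical agent interface
        scores.append(1.0 if task.subtasks[k].verify(result) else 0.0)
        context.append(result)  # later subtasks build on earlier outcomes
    return scores

# Curriculum schedule: begin with only the final subtask, then widen the window
# until the agent runs the full long-horizon chain from subtask 1.
# for start_k in range(num_subtasks, 0, -1):
#     run_from_subtask(task, agent, start_k)
```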

Case Study: Agent Performance in Real-World Tasks

The effectiveness of any theoretical advancement must ultimately be proven through practical application. VeriGUI's tasks are challenging and compelling not just because they involve long processes and many operations, but because their subtasks are broken down according to a rigorous internal logic—what we call the Principle of Minimal Sufficiency.

This principle ensures the logical rigor of the task decomposition and the effectiveness of the evaluation. It includes three core conditions:

  • Necessity: Every subtask is an indispensable step toward the final conclusion. Missing any link would break the entire task chain.
  • Sufficiency: The combination of all subtask results is entirely sufficient to answer the final question completely, no more and no less.
  • Atomicity: The subtask division has reached the smallest possible independent logical unit. Any further subdivision would compromise its integrity as a complete thought step.

This principle guarantees that every VeriGUI task is not a random collection of steps but a logically tight, interconnected chain of reasoning. With this design framework in mind, let's look at two specific case studies to understand the challenges VeriGUI presents in real-world scenarios.

Case Analysis

Case 1: Streaming Service Provider Research

Task Instruction: Find the streaming service with the most new subscribers between Q1 2022 and Q4 2023. Identify the original series with the highest per-episode production budget and list seven associated pieces of information.

Subtask Breakdown:

  1. Gather subscriber growth data for all streaming providers to identify the leader.
    • Conclusion: Netflix: 38.64 million; Paramount+: 27.90 million; HBO Max: 20.80 million; Disney+: 20.40 million. Netflix added the most new subscribers.
  2. Find all original series released by Netflix during this period, collect their per-episode production budgets, and identify the series with the highest budget.
    • Conclusion: Series: Stranger Things Season 4. Cost: $30 million per episode.
  3. Find the names of the showrunners or creators (e.g., directors or head writers) for Stranger Things Season 4.
    • Conclusion: The Duffer Brothers (Matt Duffer and Ross Duffer).
  4. Identify the main filming location for Stranger Things Season 4 and the country where it's located.
    • Conclusion: Georgia, USA.
  5. Count how many episodes each of the VFX companies contributed to Stranger Things Season 4 and find the company that contributed to the most episodes.
    • Conclusion: Rodeo FX, Important Looking Pirates (ILP), Digital Domain, DNEG, Lola VFX, Crafty Apes, and Scanline VFX.

Agent Performance (OpenAI-o3 DeepResearch): 70% completion rate. The agent successfully identified Netflix and Stranger Things but made two typical errors:

  • Misinformation: It reported the new subscriber count as "around 39 million" instead of the precise 38.64 million, incorrectly citing a rounded figure from a news report.
  • Incomplete Result: When listing the VFX companies, it mentioned only one, missing the other six.

Case 2: Movie Data Comparison and Analysis

Task Instruction: Among movies that grossed over $1 billion from 2020 to 2024, find the one with the highest return on production cost and list relevant information.

Subtask Breakdown:

  1. Collect all movies with global box office revenue exceeding $1 billion between 2020 and 2024, and record their box office data.
    • Conclusion: List of movies: Avatar: The Way of Water: $2,320,250,281; Inside Out 2: $1,698,863,816; Spider-Man: No Way Home: $1,922,598,800; Top Gun: Maverick: $1,495,696,292; Barbie: $1,447,038,421; The Super Mario Bros. Movie: $1,360,847,665; Deadpool & Wolverine: $1,338,073,645; Moana 2: $1,059,242,164.
  2. Search for the production cost of each movie and calculate the box office-to-production cost ratio to find the movie with the highest return.
    • Conclusion: The movie with the highest return on production cost is The Super Mario Bros. Movie, with a return rate of 13.61.
  3. Find the director's names, specific production cost, and global box office for The Super Mario Bros. Movie.
    • Conclusion: Directors: Aaron Horvath and Michael Jelenic. Production cost: $100,000,000. Global box office: $1,360,847,665.
  4. Search for the main filming location of The Super Mario Bros. Movie.
    • Conclusion: Main filming location: Paris, France.
  5. Find all film awards won by The Super Mario Bros. Movie, identify the highest-level award, and find the city where the ceremony was held.
    • Conclusion: The highest award won by The Super Mario Bros. Movie is "Festival Film Bandung - Film Impor Terpuji / Commendable Imported Film," and the ceremony was held in Bandung, Indonesia.

Agent Performance (Browser-Use with GPT-4o): Task failed. The agent correctly found all movies with over $1 billion in box office revenue, but it made an Analysis Error in the subsequent steps. It incorrectly assumed that the highest-grossing film, Avatar: The Way of Water, also had the highest return rate, failing to account for its high production cost. This led to an incorrect final conclusion. The correct answer should have been The Super Mario Bros. Movie.
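
For reference, the correct conclusion follows directly from the figures in subtasks 2 and 3:

$\text{return on production cost} = \dfrac{\$1{,}360{,}847{,}665}{\$100{,}000{,}000} \approx 13.61$

which matches the ground-truth conclusion of subtask 2; the highest-grossing film is not automatically the one with the highest return on cost.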

Experimental Results: The Capability Boundary of Today's Top Agents

We conducted a comprehensive test on a range of top-performing AI agents, and the results confirm the high level of challenge posed by VeriGUI.

Agent Performances on VeriGUI

As can be seen from the table, even the most advanced models have generally low success rates (SR). Through in-depth analysis of failed cases, we identified several typical failure modes:

  • Incorrect information and retrieval failures: The most common type of failure, where agents extract wrong data or fail to find information.
  • Shallow search behavior: Many agents tend to "scratch the surface," giving up after only a few search attempts even though they have not yet obtained a complete answer.
  • Inadequate planning capabilities: In multi-step dependent tasks, agents often "lose their way" and fail to follow the correct logical sequence.
  • Irrelevant results: The returned information does not match the precise goal of the subtask.

Notably, the "task completion rate (CR)" for most tasks is greater than zero, meaning these tasks are sufficiently challenging for current models but not entirely impossible to make progress on.

Conclusion: A New Starting Point for Inspiring Agent Innovation

VeriGUI's low scores are not a discouraging endpoint but a starting point for inspiring innovation. It clearly exposes the bottlenecks of current agent technology and provides clear optimization directions and measurable progress ladders for future research. Through the design of long-chain complexity, it forces agents to develop long-term planning capabilities; through subtask-level verifiability, it offers unprecedented dense supervision signals, transforming the paradigms of evaluation and reinforcement learning.

We believe that VeriGUI paves the way for more powerful and general-purpose GUI agents, driving the entire field to truly shift from focusing on "what models know" to "what models can do." We have fully open-sourced VeriGUI to the community and sincerely invite developers, researchers, and technology enthusiasts worldwide to join us in exploring, tackling challenges, and innovating based on it, collectively defining and building the future of artificial general intelligence.
