VeriGUI: Verifiable Long-Chain GUI Dataset
Introduction
| Dataset | VeriGUI |
| --- | --- |
| Modalities | Text, Video |
| Formats | JSON |
| Languages | English |
| Size | 41.6 kB |
| Release Date | 2025-08-06 |
| Domain | GUI Agent, Benchmark |
| License | apache-2.0 |
VeriGUI is a large-scale benchmark designed to evaluate and advance the capabilities of GUI agents in the challenging domain of long-horizon, verifiable task automation. It features tasks across both web and desktop environments and introduces subtask-level verification to move beyond traditional outcome-only validation.
- The dataset is the most complex and fine-grained of its kind to date, containing 130 tasks that comprise 27,873 GUI steps (an average of 214 steps per task) and 587 verifiable subtasks. This decomposition supports evaluating an agent at every stage of a task, not only at its final outcome.
- Created as a rigorous and difficult testbed, VeriGUI exposes the severe limitations of current state-of-the-art models. Evaluations show that even agents built on leading models such as GPT-4o fail to exceed a 10% success rate, and score 0% on more than 80% of tasks. This highlights an urgent need for stronger long-horizon reasoning and planning capabilities in GUI agents.
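As a rough illustration of the subtask-level verification idea described above, the sketch below loads task records from the dataset's JSON files and scores each subtask independently, alongside an outcome-only view of whole-task success. The field names used here (`subtasks`, `id`, `answer`, `steps`) and the exact-match check are assumptions for illustration only; consult the actual VeriGUI schema and verification rules before relying on this.

```python
import json


def load_tasks(path: str) -> list[dict]:
    """Load VeriGUI task records from a JSON file (schema assumed, not verified)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def score_task(task: dict, predictions: dict[str, str]) -> dict:
    """Score one task at the subtask level and at the whole-task level.

    `predictions` maps a (hypothetical) subtask id to the agent's answer string.
    Exact string match is used purely for illustration; real verification may
    require normalization or task-specific checkers.
    """
    subtasks = task.get("subtasks", [])
    correct = sum(
        1
        for st in subtasks
        if predictions.get(st["id"], "").strip().lower() == st["answer"].strip().lower()
    )
    return {
        "subtask_accuracy": correct / len(subtasks) if subtasks else 0.0,  # fine-grained view
        "task_success": bool(subtasks) and correct == len(subtasks),       # outcome-only view
        "num_gui_steps": len(task.get("steps", [])),                       # trajectory length
    }
```

Scored this way, an agent that completes most subtasks but fails the final one still receives partial credit at the subtask level, which is exactly the signal that outcome-only validation discards.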