VeriGUI: Verifiable Long-Chain GUI Dataset

Introduction

Dataset: VeriGUI
Modalities: Text, Video
Formats: JSON
Languages: English
Size: 41.6 kB
Release Date: 2025-08-06
Domain: GUI Agent, Benchmark
License: Apache-2.0

VeriGUI is a large-scale benchmark designed to evaluate and advance the capabilities of GUI agents in the challenging domain of long-horizon, verifiable task automation. It features tasks across both web and desktop environments and introduces subtask-level verification to move beyond traditional outcome-only validation.

  • The dataset is the most complex and fine-grained of its kind to date, containing 130 tasks that comprise 27,873 GUI steps (averaging 214 steps per task) and 587 verifiable subtasks. Its detailed structure provides a multi-stage framework for evaluating an agent's performance throughout the entire task lifecycle.
  • Designed as a difficult testbed, VeriGUI exposes severe limitations in current state-of-the-art models. In evaluations, even leading agents such as GPT-4o do not exceed a 10% success rate, and more than 80% of tasks have a 0% success rate. This highlights an urgent need to improve the long-horizon reasoning and planning capabilities of GUI agents.
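
To illustrate why subtask-level verification is more informative than outcome-only validation, here is a minimal sketch that scores the same agent runs both ways. The `TaskResult` record, its field names, and the two metric functions are hypothetical and are not taken from the VeriGUI schema or its evaluation code. They only show the general idea under the assumption that each subtask and each final outcome receives a pass/fail verdict.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record; field names are not taken from the VeriGUI schema.
    task_id: str
    subtask_passed: list[bool]  # one verdict per verifiable subtask, in order
    final_passed: bool          # verdict on the final task outcome only


def outcome_success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks whose final outcome was verified as correct."""
    return sum(r.final_passed for r in results) / len(results)


def subtask_completion_rate(results: list[TaskResult]) -> float:
    """Fraction of all subtasks, pooled across tasks, verified as correct."""
    total = sum(len(r.subtask_passed) for r in results)
    passed = sum(sum(r.subtask_passed) for r in results)
    return passed / total


# Two made-up runs: neither task succeeds overall, but some subtasks do.
results = [
    TaskResult("t1", [True, True, False, False], final_passed=False),
    TaskResult("t2", [True, False, False, False], final_passed=False),
]
print(outcome_success_rate(results))     # 0.0
print(subtask_completion_rate(results))  # 0.375
```

In this toy example, outcome-only scoring reports 0% for both runs, while the subtask view shows that three of eight intermediate goals were reached, which is the kind of partial progress a multi-stage evaluation can surface.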

Data Sample