[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"dataset":3},{"title":4,"desc":5,"bannerImg":6,"date":7,"orgImgLinks":8,"bannerLinks":9,"category":10,"weight":11,"description":5,"content":12,"metaBannerImg":253},"VeriWeb: Evaluating Long-Chain Web Agents with Subtask Verification","Discover VeriWeb, a pioneering benchmark for long-horizon web agents. It offers a reproducible environment and 302 real-world tasks with subtask-level verification, advancing research in complex information-seeking.","\u002Fdatasets-banner-images\u002Fveriweb-banner.jpg","2025-09-03","[{\"logourl\":\"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002Ficons\u002F2077ai.png\",\"orgname\":\"2077AI\",\"url\":\"https:\u002F\u002Fwww.2077ai.com\u002F\"},{\"logourl\":\"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002Ficons\u002Fntu.png\",\"orgname\":\"Nanyang Technological University\",\"url\":\"https:\u002F\u002Fwww.ntu.edu.sg\u002F\"},{\"logourl\":\"\",\"orgname\":\"\",\"url\":\"\"}]","{\"Blog\":\"https:\u002F\u002Fwww.2077ai.com\u002Fblog\u002Fverigui-benchmark-ai-agents\",\"HuggingFace\":\"https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2508.04026\"}","Agent",5,{"data":13,"body":16,"toc":250},{"title":14,"description":15},"Introduction","VeriWeb is a pioneering benchmark designed to evaluate and advance the capabilities of web agents in long-horizon reasoning and information-seeking tasks. Unlike existing benchmarks that often rely on intermediate HTML states or are limited to specific domains, VeriWeb is constructed using a fully-functional, open-source search engine, searxng, which aggregates results from multiple real-world search engines. 
This approach provides a realistic, diverse, and reproducible web environment, mitigating issues like link decay and content changes that plague traditional web benchmarks.",{"type":17,"children":18},"root",[19,34,44,52,61,70,118,127,134,144,153,160,167,174,180,186,196,204,214,220,230,243],{"type":20,"tag":21,"props":22,"children":26},"element","h1",{"className":23,"id":25},[24],"heading__h1","introduction",[27],{"type":20,"tag":28,"props":29,"children":31},"span",{"style":30},"white-space: pre-wrap;",[32],{"type":33,"value":14},"text",{"type":20,"tag":35,"props":36,"children":39},"p",{"className":37},[38],"doxhub-editor-paragraph",[40],{"type":20,"tag":28,"props":41,"children":42},{"style":30},[43],{"type":33,"value":15},{"type":20,"tag":35,"props":45,"children":47},{"className":46},[38],[48],{"type":20,"tag":49,"props":50,"children":51},"br",{},[],{"type":20,"tag":35,"props":53,"children":55},{"className":54},[38],[56,58],{"type":33,"value":57},"::DoxhubImage{src=\"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251125\u002Fveriweb_overview.png\" caption=\"VeriWeb Overview\"} \n::",{"type":20,"tag":49,"props":59,"children":60},{},[],{"type":20,"tag":35,"props":62,"children":64},{"className":63},[38],[65],{"type":20,"tag":28,"props":66,"children":67},{"style":30},[68],{"type":33,"value":69},"VeriWeb distinguishes itself with the following key features:",{"type":20,"tag":71,"props":72,"children":75},"ul",{"className":73},[74],"doxhub-editor-ul",[76,88,98,108],{"type":20,"tag":77,"props":78,"children":82},"li",{"value":79,"className":80},"1",[81],"doxhub-editor-list-item",[83],{"type":20,"tag":28,"props":84,"children":85},{"style":30},[86],{"type":33,"value":87},"Complex, Long-Chain Tasks: Tasks are decomposed into multiple subtasks, requiring agents to execute hundreds of steps, perform intricate reasoning, and synthesize information from various 
sources.",{"type":20,"tag":77,"props":89,"children":92},{"value":90,"className":91},"2",[81],[93],{"type":20,"tag":28,"props":94,"children":95},{"style":30},[96],{"type":33,"value":97},"Subtask-Level Verification: Each task is structured as a sequence of subtasks with corresponding verified answers (subtask-answer items). This granular structure enables intermediate evaluation, providing insights into agent performance at each stage and preventing \"reward sparsity\" for long tasks.",{"type":20,"tag":77,"props":99,"children":102},{"value":100,"className":101},"3",[81],[103],{"type":20,"tag":28,"props":104,"children":105},{"style":30},[106],{"type":33,"value":107},"Reproducible & Controllable Environment: By using searxng and caching search results, VeriWeb offers a stable and reproducible testbed, free from the dynamic nature of the live web.",{"type":20,"tag":77,"props":109,"children":112},{"value":110,"className":111},"4",[81],[113],{"type":20,"tag":28,"props":114,"children":115},{"style":30},[116],{"type":33,"value":117},"Diverse Domain Coverage: The dataset spans a wide range of domains, including Science, Finance, Technology, Arts, and Social, ensuring a comprehensive evaluation of agent capabilities.",{"type":20,"tag":35,"props":119,"children":121},{"className":120},[38],[122],{"type":20,"tag":28,"props":123,"children":124},{"style":30},[125],{"type":33,"value":126},"VeriWeb provides a challenging and realistic platform to push the boundaries of web agents, fostering the development of systems that can effectively navigate, reason, and gather information on the web over extended periods.",{"type":20,"tag":35,"props":128,"children":130},{"className":129},[38],[131],{"type":20,"tag":49,"props":132,"children":133},{},[],{"type":20,"tag":21,"props":135,"children":138},{"className":136,"id":137},[24],"dataset-overview",[139],{"type":20,"tag":28,"props":140,"children":141},{"style":30},[142],{"type":33,"value":143},"Dataset 
Overview",{"type":20,"tag":35,"props":145,"children":147},{"className":146},[38],[148],{"type":20,"tag":28,"props":149,"children":150},{"style":30},[151],{"type":33,"value":152},"VeriWeb is composed of 302 high-quality tasks collected from real-world scenarios. These tasks are distributed across five major domains, ensuring broad coverage of web-based activities. Each task is further decomposed into multiple subtasks, with human demonstrations collected for each. The dataset is characterized by its long-horizon nature, with tasks requiring an average of 272.5 steps and 4.3 subtasks to complete. This demonstrates the dataset's focus on complex, multi-step reasoning.",{"type":20,"tag":35,"props":154,"children":156},{"className":155},[38],[157],{"type":20,"tag":49,"props":158,"children":159},{},[],{"type":20,"tag":35,"props":161,"children":163},{"className":162},[38],[164],{"type":20,"tag":49,"props":165,"children":166},{},[],{"type":20,"tag":35,"props":168,"children":170},{"className":169},[38],[171],{"type":20,"tag":49,"props":172,"children":173},{},[],{"type":20,"tag":35,"props":175,"children":177},{"className":176},[38],[178],{"type":33,"value":179},"::DoxhubDonutChart{title=\"Distribution of tasks across different domains\" data=\"Arts,89,Social,65,Finance,57,Technology,54,Scientific,37\"} \n::",{"type":20,"tag":35,"props":181,"children":183},{"className":182},[38],[184],{"type":33,"value":185},"::DoxhubDonutChart{title=\"Distribution of GUI actions\" data=\"scroll,19991,left_click,19780,drag,17732,key_down,15793,input,7966,result_state,778,right_click,245\"} \n::",{"type":20,"tag":21,"props":187,"children":190},{"className":188,"id":189},[24],"data-samples",[191],{"type":20,"tag":28,"props":192,"children":193},{"style":30},[194],{"type":33,"value":195},"Data 
Samples",{"type":20,"tag":197,"props":198,"children":203},"iframe",{"frameBorder":199,"allowFullScreen":200,"loading":201,"src":202},"0",true,"lazy","https:\u002F\u002Fdataset.data4o.xyz\u002Fshare\u002Fcgat\u002Fpreview?datasetId=68c7ddc24bb3791abac314ed&env=en",[],{"type":20,"tag":21,"props":205,"children":208},{"className":206,"id":207},[24],"leaderboard",[209],{"type":20,"tag":28,"props":210,"children":211},{"style":30},[212],{"type":33,"value":213},"Leaderboard",{"type":20,"tag":35,"props":215,"children":217},{"className":216},[38],[218],{"type":33,"value":219},"::DoxhubMultiCategoryGroupedScatterPlot{defaultSelection=\"\"  data=\"\"} \n::",{"type":20,"tag":21,"props":221,"children":224},{"className":222,"id":223},[24],"bibtex",[225],{"type":20,"tag":28,"props":226,"children":227},{"style":30},[228],{"type":33,"value":229},"BibTeX",{"type":20,"tag":231,"props":232,"children":237},"pre",{"className":233,"code":235,"language":223,"meta":236},[234],"language-bibtex","@article{verigui2025,\n  title   =   {VeriGUI: Verifiable Long-Chain GUI Dataset},\n  author  =   {Shunyu Liu, Minghao Liu, Huichi Zhou, Zhenyu Cui, Yang Zhou, Yuhao Zhou, Wendong Fan, Ge Zhang, Jiajun Shi, Weihao Xuan, Jiaxing Huang, Shuang Luo, Fang Wu, Heli Qi, Qingcheng Zeng, Ziqi Ren, Jialiang Gao, Jindi Lv, Junjie Wang, Aosong Feng, Heng Zhou, Wangchunshu Zhou, Zhenfei Yin, Wenlong Zhang, Guohao Li, Wenhao Yu, Irene Li, Lei Ma, Lei Bai, Qunshu Lin, Mingli Song, Dacheng Tao},\n  journal =   {arXiv preprint arXiv:2508.04026},\n  year    =   {2025}\n}\n","",[238],{"type":20,"tag":239,"props":240,"children":241},"code",{"__ignoreMap":236},[242],{"type":33,"value":235},{"type":20,"tag":35,"props":244,"children":246},{"className":245},[38],[247],{"type":20,"tag":49,"props":248,"children":249},{},[],{"title":236,"searchDepth":251,"depth":251,"links":252},2,[],"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002FBanner_dataset\u002Fdatasets_verigui.png"]