[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"blog":3},{"title":4,"desc":5,"bannerImg":6,"date":7,"links":8,"description":5,"content":9,"tag1":417,"tag2":418,"resLinks":423},"IWR-Bench: Can AI Rebuild an Interactive Website Just by Watching a Video?","Today’s AI can turn screenshots into code, but what about dynamic, interactive websites? Introducing IWR-Bench, a new benchmark that tests if AI can reconstruct functional websites from a video of user interactions. Discover the surprising results.","https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002FBanner_blog\u002Fbanner_iwrbench.png","2025-11-05","{\"github\":\"https:\u002F\u002Fgithub.com\u002FL-O-I\u002FIWR-Bench\",\"huggingface\":\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FIWR-Bench\u002FIWR-Bench\", \"arxiv\":\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.26346\",\"homepage\":\"https:\u002F\u002Fl-o-i.github.io\u002FIWR-Bench\u002F\"}",{"data":10,"body":12,"toc":409},{"title":4,"description":11},"We’ve seen incredible progress in AI’s ability to understand the visual world. Large Vision-Language Models (LVLMs) can look at a static screenshot of a webpage and generate the corresponding HTML and CSS with surprising accuracy. It’s a bit like a graphic designer who can perfectly replicate a layout they’ve seen.",{"type":13,"children":14},"root",[15,23,28,41,54,81,88,93,110,115,140,146,151,168,173,235,246,251,257,269,281,304,309,315,325,341,359,376,381,393,399,404],{"type":16,"tag":17,"props":18,"children":20},"element","h1",{"id":19},"iwr-bench-can-ai-rebuild-an-interactive-website-just-by-watching-a-video",[21],{"type":22,"value":4},"text",{"type":16,"tag":24,"props":25,"children":26},"p",{},[27],{"type":22,"value":11},{"type":16,"tag":24,"props":29,"children":30},{},[31,33,39],{"type":22,"value":32},"But a modern website is defined by more than its static appearance. 
Its core value lies in its dynamic components and interactive functionalities, like menus and complex user flows, which a simple screenshot cannot represent. This raises a much bigger, more exciting question: Can an AI move beyond static replication and act like a true front-end developer? Can it watch a video of someone ",{"type":16,"tag":34,"props":35,"children":36},"em",{},[37],{"type":22,"value":38},"using",{"type":22,"value":40}," a website and rebuild the entire interactive experience from scratch?",{"type":16,"tag":24,"props":42,"children":43},{},[44,46,52],{"type":22,"value":45},"To answer this, we’re introducing ",{"type":16,"tag":47,"props":48,"children":49},"strong",{},[50],{"type":22,"value":51},"IWR-Bench",{"type":22,"value":53},", the first benchmark designed to test this next frontier of AI capability. The process challenges an AI to act like a developer, taking visual inputs and producing functional code that is then rigorously tested.",{"type":16,"tag":55,"props":56,"children":62},"div",{"className":57,"style":61},[58,59,60],"img-wrap","has-caption","center","width: 100%; position: relative; margin-bottom: 62px",[63,65,72,73],{"type":22,"value":64},"\n  ",{"type":16,"tag":66,"props":67,"children":71},"img",{"src":68,"alt":69,"style":70},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251105\u002FIWR01.webp","The IWR-Bench workflow","width: 100%; max-height: 60vh; object-fit: contain; background: #141414; border-radius: 8px",[],{"type":22,"value":64},{"type":16,"tag":24,"props":74,"children":78},{"className":75,"style":77},[76],"img-text","position: absolute; top: calc(100% + 16px); left: 0; right: 0;text-align: center; overflow: hidden; white-space: nowrap; text-overflow: ellipsis; line-height: 22px; color: #A1A1A1; font-size: 14px",[79],{"type":22,"value":80},"\n    The IWR-Bench workflow\n  ",{"type":16,"tag":82,"props":83,"children":85},"h2",{"id":84},"the-leap-to-true-functionality",[86],{"type":22,"value":87},"The Leap to 
True Functionality",{"type":16,"tag":24,"props":89,"children":90},{},[91],{"type":22,"value":92},"Going from a static screenshot to a dynamic video is like going from understanding a photograph of a car to understanding how its engine works. Unlike previous benchmarks that focused on static images or lacked the necessary assets, IWR-Bench creates a truly realistic test environment that presents two massive challenges for today’s AI.",{"type":16,"tag":55,"props":94,"children":96},{"className":95,"style":61},[58,59,60],[97,98,103,104],{"type":22,"value":64},{"type":16,"tag":66,"props":99,"children":102},{"src":100,"alt":101,"style":70},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251105\u002FIWR02.webp","Comparison with existing benchmarks",[],{"type":22,"value":64},{"type":16,"tag":24,"props":105,"children":107},{"className":106,"style":77},[76],[108],{"type":22,"value":109},"\n    Comparison with existing benchmarks\n  ",{"type":16,"tag":24,"props":111,"children":112},{},[113],{"type":22,"value":114},"IWR-Bench is the first benchmark of its kind, bridging the gap between static webpage reconstruction and general video understanding benchmarks.",{"type":16,"tag":116,"props":117,"children":118},"ol",{},[119,130],{"type":16,"tag":120,"props":121,"children":122},"li",{},[123,128],{"type":16,"tag":47,"props":124,"children":125},{},[126],{"type":22,"value":127},"Seeing Isn't Enough—It Has to Understand Logic:",{"type":22,"value":129}," The AI can’t just perceive pixels; it must perform multi-modal reasoning. It needs to watch the video, see a mouse click on a button, observe the UI change that follows, and connect that cause-and-effect to the specific button in the provided static assets. 
It’s a complex detective game of inferring the hidden logic behind the visuals.",{"type":16,"tag":120,"props":131,"children":132},{},[133,138],{"type":16,"tag":47,"props":134,"children":135},{},[136],{"type":22,"value":137},"From Logic to Living Code:",{"type":22,"value":139}," After inferring the logic, the model must translate it into functional code (HTML, CSS, and event-driven JavaScript). This is trivial for a simple \"About Us\" page but becomes incredibly difficult for complex applications like e-commerce filtering, multi-step booking forms, or even web-based games like 2048, which are included in our benchmark.",{"type":16,"tag":82,"props":141,"children":143},{"id":142},"building-a-fair-and-challenging-test",[144],{"type":22,"value":145},"Building a Fair and Challenging Test",{"type":16,"tag":24,"props":147,"children":148},{},[149],{"type":22,"value":150},"To truly test these capabilities, IWR-Bench was built from the ground up through a meticulous, multi-stage curation process to be realistic and demanding.",{"type":16,"tag":55,"props":152,"children":154},{"className":153,"style":61},[58,59,60],[155,156,161,162],{"type":22,"value":64},{"type":16,"tag":66,"props":157,"children":160},{"src":158,"alt":159,"style":70},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251105\u002FIWR03.webp","Benchmark construction overview",[],{"type":22,"value":64},{"type":16,"tag":24,"props":163,"children":165},{"className":164,"style":77},[76],[166],{"type":22,"value":167},"\n    Benchmark construction overview\n  ",{"type":16,"tag":24,"props":169,"children":170},{},[171],{"type":22,"value":172},"Our four-stage construction pipeline ensures that every task in IWR-Bench is realistic, verifiable, and of high quality.",{"type":16,"tag":174,"props":175,"children":176},"ul",{},[177,187,204],{"type":16,"tag":120,"props":178,"children":179},{},[180,185],{"type":16,"tag":47,"props":181,"children":182},{},[183],{"type":22,"value":184},"Real-World 
Scenarios:",{"type":22,"value":186}," We curated 113 tasks from 100 real-world websites, from e-commerce and booking sites to productivity tools and games.",{"type":16,"tag":120,"props":188,"children":189},{},[190,195,197,202],{"type":16,"tag":47,"props":191,"children":192},{},[193],{"type":22,"value":194},"Complete Information:",{"type":22,"value":196}," Unlike other benchmarks, we provide everything a real developer would have: the user interaction video ",{"type":16,"tag":34,"props":198,"children":199},{},[200],{"type":22,"value":201},"and",{"type":22,"value":203}," all the static assets (images, icons, etc.) crawled from the site.",{"type":16,"tag":120,"props":205,"children":206},{},[207,212,214,219,221,226,228,233],{"type":16,"tag":47,"props":208,"children":209},{},[210],{"type":22,"value":211},"Multi-Dimensional Difficulty:",{"type":22,"value":213}," We taxonomized tasks across three axes: ",{"type":16,"tag":47,"props":215,"children":216},{},[217],{"type":22,"value":218},"visual complexity",{"type":22,"value":220}," (is it a simple blog or a data-dense dashboard?), ",{"type":16,"tag":47,"props":222,"children":223},{},[224],{"type":22,"value":225},"interaction complexity",{"type":22,"value":227}," (is it just scrolling, or complex game logic?), and ",{"type":16,"tag":47,"props":229,"children":230},{},[231],{"type":22,"value":232},"application domain",{"type":22,"value":234},".",{"type":16,"tag":55,"props":236,"children":239},{"className":237,"style":238},[58,60],"width: 100%; position: relative",[240,241],{"type":22,"value":64},{"type":16,"tag":66,"props":242,"children":245},{"src":243,"alt":244,"style":70},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251105\u002FIWR04.webp","",[],{"type":16,"tag":24,"props":247,"children":248},{},[249],{"type":22,"value":250},"Tasks in IWR-Bench are meticulously organized by domain, visual complexity, and interaction logic to ensure a comprehensive evaluation of model 
capabilities.",{"type":16,"tag":82,"props":252,"children":254},{"id":253},"the-ultimate-referee-agent-as-a-judge",[255],{"type":22,"value":256},"The Ultimate Referee: \"Agent-as-a-Judge\"",{"type":16,"tag":24,"props":258,"children":259},{},[260,262,267],{"type":22,"value":261},"How do you grade something this complex? You can't just compare pixels. You have to actually ",{"type":16,"tag":34,"props":263,"children":264},{},[265],{"type":22,"value":266},"use",{"type":22,"value":268}," the website.",{"type":16,"tag":24,"props":270,"children":271},{},[272,274,279],{"type":22,"value":273},"So, we built an automated evaluation framework we call the ",{"type":16,"tag":47,"props":275,"children":276},{},[277],{"type":22,"value":278},"\"agent-as-a-judge.\"",{"type":22,"value":280}," This AI agent acts like a QA tester. It programmatically visits the webpage generated by the model and attempts to perform the exact same sequence of actions shown in the original video. It then calculates two key scores:",{"type":16,"tag":174,"props":282,"children":283},{},[284,294],{"type":16,"tag":120,"props":285,"children":286},{},[287,292],{"type":16,"tag":47,"props":288,"children":289},{},[290],{"type":22,"value":291},"Interactive Functionality Score (IFS):",{"type":22,"value":293}," A measure of what works. Did the button clicks, text inputs, and other actions execute correctly?",{"type":16,"tag":120,"props":295,"children":296},{},[297,302],{"type":16,"tag":47,"props":298,"children":299},{},[300],{"type":22,"value":301},"Visual Fidelity Score (VFS):",{"type":22,"value":303}," A measure of what it looks like. Does the generated page visually match the original at key checkpoints?",{"type":16,"tag":24,"props":305,"children":306},{},[307],{"type":22,"value":308},"The final score is a weighted combination of these two, with a heavy emphasis on functionality. 
After all, a beautiful button that doesn't work isn't very useful.",{"type":16,"tag":82,"props":310,"children":312},{"id":311},"results",[313],{"type":22,"value":314},"Results",{"type":16,"tag":24,"props":316,"children":317},{},[318,320],{"type":22,"value":319},"We tested 28 of the world's leading LVLMs, including proprietary models like GPT-5 and Claude-4, and the results were stunning. The main takeaway: ",{"type":16,"tag":47,"props":321,"children":322},{},[323],{"type":22,"value":324},"reconstructing interactive functionality is still an incredibly difficult challenge for AI.",{"type":16,"tag":55,"props":326,"children":328},{"className":327,"style":61},[58,59,60],[329,330,334,335],{"type":22,"value":64},{"type":16,"tag":66,"props":331,"children":333},{"src":243,"alt":332,"style":70},"Performance of 10 representative models",[],{"type":22,"value":64},{"type":16,"tag":24,"props":336,"children":338},{"className":337,"style":77},[76],[339],{"type":22,"value":340},"\n    Performance of 10 representative models\n  ",{"type":16,"tag":24,"props":342,"children":343},{},[344,346,351,353,358],{"type":22,"value":345},"The performance of top AI models on IWR-Bench highlights the immense difficulty of the task, with the best-performing model, GPT-5, achieving an overall score of only 36.35 out of 100. 
More telling, as the detailed results show, is the massive gap between looks and logic: while top models achieved a respectable ",{"type":16,"tag":47,"props":347,"children":348},{},[349],{"type":22,"value":350},"64.25 on Visual Fidelity",{"type":22,"value":352},", their score for ",{"type":16,"tag":47,"props":354,"children":355},{},[356],{"type":22,"value":357},"Functional Correctness was a mere 24.39",{"type":22,"value":234},{"type":16,"tag":55,"props":360,"children":362},{"className":361,"style":61},[58,59,60],[363,364,369,370],{"type":22,"value":64},{"type":16,"tag":66,"props":365,"children":368},{"src":366,"alt":367,"style":70},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251105\u002FIWR05.webp","Full results table",[],{"type":22,"value":64},{"type":16,"tag":24,"props":371,"children":373},{"className":372,"style":77},[76],[374],{"type":22,"value":375},"\n    Full results table\n  ",{"type":16,"tag":24,"props":377,"children":378},{},[379],{"type":22,"value":380},"A detailed breakdown of the results reveals a critical insight: models are far better at replicating visual fidelity than they are at implementing functional logic.",{"type":16,"tag":24,"props":382,"children":383},{},[384,386,391],{"type":22,"value":385},"This is the core finding of our work. 
In simple terms: ",{"type":16,"tag":47,"props":387,"children":388},{},[389],{"type":22,"value":390},"AI is getting good at painting the car, but it's still struggling to build the engine.",{"type":22,"value":392}," Models can replicate the static appearance of a webpage, but synthesizing the event-driven logic that makes it work remains a major hurdle.",{"type":16,"tag":82,"props":394,"children":396},{"id":395},"roadmap-for-the-future-of-ai",[397],{"type":22,"value":398},"Roadmap for the Future of AI",{"type":16,"tag":24,"props":400,"children":401},{},[402],{"type":22,"value":403},"By highlighting this critical gap between appearance and functionality, IWR-Bench provides a clear and challenging new direction for vision-language research.",{"type":16,"tag":24,"props":405,"children":406},{},[407],{"type":22,"value":408},"The next great leap for LVLMs will be to move beyond static perception and master temporal reasoning, dynamic logic, and the synthesis of functional code. By open-sourcing IWR-Bench, we hope to provide the community with the tools to measure progress and accelerate innovation on this exciting frontier.",{"title":244,"searchDepth":410,"depth":410,"links":411},2,[412,413,414,415,416],{"id":84,"depth":410,"text":87},{"id":142,"depth":410,"text":145},{"id":253,"depth":410,"text":256},{"id":311,"depth":410,"text":314},{"id":395,"depth":410,"text":398},"benchmark",[419,420,421,422],"video","agent","coding","multimodal",{"homepage":244,"arxiv":424,"github":425,"huggingface":426},"https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24709","https:\u002F\u002Fgithub.com\u002FSIGMME\u002FIWR-Bench","https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FIWR-Bench\u002FIWR-Bench"]