[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"papers-list":3},[4,20,36,45,59,67,76,90,105,119,130,142,155,164,177,190,203,217,231,241,253,266],{"title":5,"date":6,"description":7,"tags":8,"recognizedBy":11,"highlighted":-1,"link":14,"slug":15,"resLinks":16},"Justified or Just Convincing? Why \"Show Your Work\" Is No Longer Enough","2026-04-17","Explores Error Verifiability in LLMs, revealing why “show your work” is no longer enough and how DPO\u002FRLHF improve accuracy while weakening auditability, introducing the$$\\mathscr{v}_{bal}$$metric. ",[9,10],"model","llm",[12,13],"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002Fdocs-hub\u002F2077ai\u002Forg-logo\u002Fsouth-california.png","https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002Fdocs-hub\u002F2077ai\u002Forg-logo\u002Fcmu.png","\u002Fresearch\u002Ferror_verifiability","error_verifiability",{"homepage":17,"arxiv":18,"github":19,"huggingface":17},"","https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.04418","https:\u002F\u002Fgithub.com\u002Fxyzhu123\u002FVerifiability",{"title":21,"date":22,"description":23,"tags":24,"recognizedBy":28,"highlighted":-1,"link":29,"slug":30,"resLinks":31},"Human-Aligned Reward Modeling for AI: EditReward's 200K-Pair Dataset","2026-1-16","EditReward advances instruction-guided image editing (IGIE) with a generative reward model. 
See how it outperforms GPT-4o as a judge, improves dataset quality, and sets a new SOTA for human-AI alignment in editing tasks.",[25,26,27],"dataset","image","multimodal",[],"\u002Fresearch\u002Feditreward-the-power-of-human-aligned-reward-modeling","editreward-the-power-of-human-aligned-reward-modeling",{"homepage":32,"arxiv":33,"github":34,"huggingface":35},"https:\u002F\u002Ftiger-ai-lab.github.io\u002FEditReward\u002F","https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.26346","https:\u002F\u002Fgithub.com\u002FTIGER-AI-Lab\u002FEditReward","https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FTIGER-Lab\u002Feditreward",{"title":37,"date":38,"description":39,"tags":40,"recognizedBy":41,"highlighted":-1,"link":42,"slug":43,"resLinks":44},"Data Curation Beats Scaling: Why 20K High-Quality Samples Outperform 46K Noisy Ones in AI Image Editing","2026-01-14","Break the \"scaling trap\" in generative AI. This article details how the 2077AI team used the EditReward reward model and a meticulous multi-dimensional scoring rubric to curate high-fidelity data. Learn how this \"Digital Data Curator\" enables automatic synthesis pipelines, proving that quality is the new scale for building state-of-the-art, open-source image editing models.",[25,26,27],[],"\u002Fresearch\u002Fedit-reward-data-curation","edit-reward-data-curation",{"homepage":32,"arxiv":33,"github":34,"huggingface":35},{"title":46,"date":47,"description":48,"tags":49,"recognizedBy":51,"highlighted":-1,"link":52,"slug":53,"resLinks":54},"Beyond Crowdsourcing: How SuperGPQA Uses PhD Experts to Solve LLM Data Leakage","2025-12-23","Evaluation of graduate-level AI requires graduate-level expertise. 
Learn how 2077AI's SuperGPQA benchmark utilizes 80+ PhD experts to eliminate data leakage, refine complex distractors, and identify systematic AI hallucinations across 26,000+ questions.",[50,10],"benchmark",[],"\u002Fresearch\u002Fexpert-driven-benchmark-supergpqa","expert-driven-benchmark-supergpqa",{"homepage":55,"arxiv":56,"github":57,"huggingface":58},"https:\u002F\u002Fsupergpqa.github.io\u002F","https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14739","https:\u002F\u002Fgithub.com\u002FSuperGPQA\u002FSuperGPQA","https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fm-a-p\u002FSuperGPQA",{"title":60,"date":47,"description":61,"tags":62,"recognizedBy":63,"highlighted":-1,"link":64,"slug":65,"resLinks":66},"Have LLMs Hit a Ceiling? Why SuperGPQA Proves the AGI Journey is Just Beginning","Experts thought LLMs were maxing out, but SuperGPQA says otherwise. Learn why state-of-the-art models only score 74% on this new expert-level test and what this means for the future of AI reasoning and specialized knowledge.",[50,10],[],"\u002Fresearch\u002Fsupergpqa-ai-boundary","supergpqa-ai-boundary",{"homepage":55,"arxiv":56,"github":57,"huggingface":58},{"title":68,"date":69,"description":70,"tags":71,"recognizedBy":72,"highlighted":-1,"link":73,"slug":74,"resLinks":75},"GPT-5 Series vs. Gemini 3 Pro: The Verdict from SuperGPQA","2025-12-16","Detailed benchmark results from SuperGPQA revealing how Google's Gemini 3 Pro compares to OpenAI's GPT-5.2 and GPT-5.1-Thinking across 285 graduate-level disciplines.",[50,10],[],"\u002Fresearch\u002Fgpt-5-series-vs-gemini-3-pro-supergpqa-verdict","gpt-5-series-vs-gemini-3-pro-supergpqa-verdict",{"homepage":55,"arxiv":56,"github":57,"huggingface":58},{"title":77,"date":78,"description":79,"tags":80,"recognizedBy":81,"highlighted":-1,"link":83,"slug":84,"resLinks":85},"Scaling Test-Time Compute: How CriticLean Anticipated DeepSeekMath","2025-12-15","DeepSeekMath-V2's success confirms 2077AI's vision. 
Explore how CriticLean pioneered self-verification and the shift from outcome rewards to System 2 reasoning.",[9,10],[82],"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002Fdocs-hub\u002F2077ai\u002Forg-logo\u002Fbytedance-seedream.png","\u002Fresearch\u002Fcriticlean-deepseekmath-self-verification","criticlean-deepseekmath-self-verification",{"homepage":86,"arxiv":87,"github":88,"huggingface":89},"https:\u002F\u002Fwww.2077ai.com\u002Fdatasets\u002Fdataset-criticlean","https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.06181","https:\u002F\u002Fgithub.com\u002Fmultimodal-art-projection\u002FCriticLean","https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fm-a-p\u002FCriticLeanInstruct",{"title":91,"date":92,"description":93,"tags":94,"recognizedBy":96,"highlighted":-1,"link":98,"slug":99,"resLinks":100},"Meet VideoScore2: The AI Film Critic That Thinks Before It Scores","2025-11-11","As AI-generated video explodes, how do we judge it? Discover VideoScore2, a new framework that acts like an expert film critic, providing detailed reasoning before its final verdict, and setting a new standard for AI evaluation.",[9,95,27],"video",[97],"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002Fdocs-hub\u002F2077ai\u002Forg-logo\u002FUniversity_of_Illinois_at_Urbana-Champaign_Wordmark.png","\u002Fresearch\u002Fvideoscore2-the-ai-film-critic","videoscore2-the-ai-film-critic",{"homepage":101,"arxiv":102,"github":103,"huggingface":104},"https:\u002F\u002Ftiger-ai-lab.github.io\u002FVideoScore2\u002F","https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.22799","https:\u002F\u002Fgithub.com\u002FTIGER-AI-Lab\u002FVideoScore2\u002F","https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FTIGER-Lab\u002FVideoFeedback2",{"title":106,"date":107,"description":108,"tags":109,"recognizedBy":112,"highlighted":-1,"link":113,"slug":114,"resLinks":115},"IWR-Bench: Can AI Rebuild an Interactive Website Just by Watching a Video?","2025-11-05","Today’s AI can turn screenshots into code, but 
what about dynamic, interactive websites? Introducing IWR-Bench, a new benchmark that tests if AI can reconstruct functional websites from a video of user interactions. Discover the surprising results.",[50,95,110,111,27],"agent","coding",[],"\u002Fresearch\u002Fiwr-bench-ai","iwr-bench-ai",{"homepage":17,"arxiv":116,"github":117,"huggingface":118},"https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24709","https:\u002F\u002Fgithub.com\u002FSIGMME\u002FIWR-Bench","https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FIWR-Bench\u002FIWR-Bench",{"title":120,"date":121,"description":122,"tags":123,"recognizedBy":124,"highlighted":126,"link":127,"slug":128,"resLinks":129},"Introducing EDITREWARD: The AI Judge That’s Closing the Gap in Open-Source Image Editing","2025-10-31","Discover how EditReward, a new human-aligned reward model, is solving the biggest bottleneck in AI image editing. Learn how 2077AI’s high-fidelity data is empowering open-source models to compete with giants like GPT-5.",[9,26,27],[125],"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002Fdocs-hub\u002F2077ai\u002Forg-logo\u002Fwaterloo.png",2,"\u002Fresearch\u002Fintroducing-editreward-human-aligned-ai-for-image-editing","introducing-editreward-human-aligned-ai-for-image-editing",{"homepage":32,"arxiv":33,"github":34,"huggingface":35},{"title":131,"date":132,"description":133,"tags":134,"recognizedBy":135,"highlighted":-1,"link":136,"slug":137,"resLinks":138},"Unlocking Deeper Multimodal Understanding: Introducing PIN-200M, A Massive Dataset for Next-Gen LMMs","2025-09-30","Discover PIN, a novel data format for training powerful Large Multimodal Models. Explore PIN-200M, our new 200-million-document dataset designed to eliminate perceptual and reasoning errors in AI. 
Open-source and ready for research.",[25,27,26],[],"\u002Fresearch\u002Fintroducing-pin-200m-multimodal-dataset","introducing-pin-200m-multimodal-dataset",{"homepage":139,"arxiv":140,"github":17,"huggingface":141},"https:\u002F\u002Fwww.2077ai.com\u002Fdatasets\u002Fdataset-pin200","https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.13923","https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fm-a-p\u002FPIN-14M",{"title":143,"date":144,"description":145,"tags":146,"recognizedBy":147,"highlighted":-1,"link":148,"slug":149,"resLinks":150},"Introducing Chain-of-Agents: A New Paradigm for Agent Foundation Model","2025-09-03","Discover Chain-of-Agents (CoA), a breakthrough framework for training powerful and efficient Agent Foundation Models. Learn how AFMs achieve state-of-the-art results on 20+ benchmarks while cutting costs. Explore the open-source models and code.",[9,110,10],[],"\u002Fresearch\u002Fchain-of-agents-foundation-models","chain-of-agents-foundation-models",{"homepage":151,"arxiv":152,"github":153,"huggingface":154},"https:\u002F\u002Fchain-of-agents-afm.github.io\u002F","https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.13167","https:\u002F\u002Fgithub.com\u002FOPPO-PersonalAI\u002FAgent_Foundation_Models","https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FPersonalAILab\u002Fafm-datasets",{"title":156,"date":157,"description":158,"tags":159,"recognizedBy":160,"highlighted":-1,"link":161,"slug":162,"resLinks":163},"Unveiling GPT-5’s Two Faces: SuperGPQA Benchmark Analysis","2025-08-19","SuperGPQA benchmark reveals GPT-5 base model leads at 66.7% accuracy vs ChatGPT's weakened 58.2%. See why GPT-5 Mini outperforms chat version. 
Join the open evaluation.",[25,10],[],"\u002Fresearch\u002Fgpt-5-performance-supergpqa-test","gpt-5-performance-supergpqa-test",{"homepage":55,"arxiv":56,"github":57,"huggingface":58},{"title":165,"date":157,"description":166,"tags":167,"recognizedBy":168,"highlighted":-1,"link":170,"slug":171,"resLinks":172},"VeriGUI: The Open-Source Benchmark Testing AI Agents Real-World Capabilities","VeriGUI, an open-source benchmark by 2077AI, evaluates AI agents' real-world execution through long-chain tasks (up to 15 subtasks), subtask-level verifiability, and real-environment testing (OS\u002Fbrowser). Test planning, tool use, and adaptability with GitHub and Hugging Face integration.",[50,110,27],[169],"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002Fdocs-hub\u002F2077ai\u002Forg-logo\u002Fntu.png","\u002Fresearch\u002Fverigui-benchmark-ai-agents","verigui-benchmark-ai-agents",{"homepage":173,"arxiv":174,"github":175,"huggingface":176},"https:\u002F\u002Fwww.2077ai.com\u002Fdataset\u002Fdataset-veriweb","https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.04026","https:\u002F\u002Fgithub.com\u002FVeriGUI-Team\u002FVeriWeb","https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002F2077AIDataFoundation\u002FVeriWeb",{"title":178,"date":179,"description":180,"tags":181,"recognizedBy":182,"highlighted":-1,"link":183,"slug":184,"resLinks":185},"Creative Writing Dataset with Thought Processes: Unleashing Human-like Creativity in AI","2025-07-07","Introducing M-A-P and 2077AI's latest open-source work: A High-Quality Chinese Creative Writing with Thought Process Dataset. 
This revolutionary project is designed to help language models move beyond a generic \"sense of AI\" and truly capture the depth and nuance of human creativity.",[25,10],[],"\u002Fresearch\u002Fcoig-writer_dataset","coig-writer_dataset",{"homepage":186,"arxiv":187,"github":188,"huggingface":189},"https:\u002F\u002Fcoig-writer.github.io\u002F","https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.14763","https:\u002F\u002Fgithub.com\u002FCOIG-Writer\u002FCOIG-Writer","https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fm-a-p\u002FCOIG-Writer",{"title":191,"date":192,"description":193,"tags":194,"recognizedBy":195,"highlighted":-1,"link":196,"slug":197,"resLinks":198},"FormalMATH Benchmark: A Formal Mathematics Benchmark for Pushing the Limits of AI","2025-05-29","As large language models (LLMs) have made breakthroughs in tasks like natural language processing and code generation, formalized mathematics has become crucial for testing their logical reasoning limits.",[50,10],[],"\u002Fresearch\u002Fformalmath-benchmark","formalmath-benchmark",{"homepage":199,"arxiv":200,"github":201,"huggingface":202},"https:\u002F\u002Fspherelab.ai\u002FFormalMATH\u002F","https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.02735","https:\u002F\u002Fgithub.com\u002FSphere-AI-Lab\u002FFormalMATH-Bench","https:\u002F\u002Fhuggingface.co\u002FSphereLab",{"title":204,"date":205,"description":206,"tags":207,"recognizedBy":208,"highlighted":-1,"link":210,"slug":211,"resLinks":212},"Breaking Traditional Knowledge Dependency: KOR-Bench for Evaluating Intrinsic Reasoning Abilities of Models","2025-05-07","The knowledge orthogonality of the KOR-Bench dataset ensures that evaluation tasks are independent of pre-trained knowledge, requiring models to rely on their understanding of new rules and pure reasoning capabilities to solve 
problems.",[50,10],[209],"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002Fdocs-hub\u002F2077ai\u002Forg-logo\u002FMAP.png","\u002Fresearch\u002Fkor-bench","kor-bench",{"homepage":213,"arxiv":214,"github":215,"huggingface":216},"https:\u002F\u002Fkor-bench.github.io\u002F","https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06526","https:\u002F\u002Fgithub.com\u002FKOR-Bench\u002FKOR-Bench","https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2410.06526",{"title":218,"date":219,"description":220,"tags":221,"recognizedBy":222,"highlighted":223,"link":224,"slug":225,"resLinks":226},"A Novel Paradigm for Model Evaluation: The Innovative Multi-source Document Parsing Evaluation Framework OmniDocBench","2025-04-10","This innovative evaluation framework not only provides a reliable standard for the development of document parsing technologies but also pioneers a new paradigm for document intelligence evaluation.",[50,10,27],[],3,"\u002Fresearch\u002Fomnidocbench","omnidocbench",{"homepage":227,"arxiv":228,"github":229,"huggingface":230},"https:\u002F\u002Fwww.2077ai.com\u002Fdatasets\u002Fdataset-omnidocbench","https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.07626","https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FOmniDocBench","https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2412.07626",{"title":232,"date":233,"description":234,"tags":235,"recognizedBy":236,"highlighted":237,"link":238,"slug":239,"resLinks":240},"SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines","2025-02-20","SuperGPQA redefines AI benchmarking with 26,529 specialized questions across 285 graduate-level subjects. 
Discover how DeepSeek-R1, GPT-4o, and 51 mainstream models perform when pushed to the boundaries of human knowledge and interdisciplinary reasoning.",[50,10],[82],1,"\u002Fresearch\u002F2077ai-supergpqa","2077ai-supergpqa",{"homepage":55,"arxiv":56,"github":57,"huggingface":58},{"title":242,"date":243,"description":244,"tags":245,"recognizedBy":246,"highlighted":-1,"link":247,"slug":248,"resLinks":249},"OmniHD-Scenes: A Next-Generation Multimodal Dataset for Autonomous Driving","2025-01-23","The 2077AI Foundation has released a new generation of datasets for autonomous driving. These datasets provide large-scale, more modern, and more realistic data that offer perspectives not previously available.",[25,27,26],[],"\u002Fresearch\u002Fomni-hd-scenes","omni-hd-scenes",{"homepage":250,"arxiv":251,"github":252,"huggingface":17},"https:\u002F\u002Fwww.2077ai.com\u002Fdatasets\u002Fdataset-omnihdscenes","https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.10734","https:\u002F\u002Fgithub.com\u002FTJRadarLab\u002FOmniHD-Scenes",{"title":254,"date":255,"description":256,"tags":257,"recognizedBy":258,"highlighted":-1,"link":259,"slug":260,"resLinks":261},"Matrix Dataset: A Revolutionary Bilingual AI Pre-training Corpus","2025-01-01","As pioneers in AI data standardization and advancement, we are committed to unlocking AI potential through high-quality data, accelerating AI development, and nurturing an efficient, thriving AI data ecosystem. 
The Matrix Dataset is a crucial component of this vision.",[25,10],[],"\u002Fresearch\u002Fmatrix-dataset","matrix-dataset",{"homepage":262,"arxiv":263,"github":264,"huggingface":265},"https:\u002F\u002Fmap-neo.github.io\u002F","https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.19327","https:\u002F\u002Fgithub.com\u002Fmultimodal-art-projection\u002FMAP-NEO","https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fm-a-p\u002FMatrix",{"title":267,"date":268,"description":269,"tags":270,"recognizedBy":271,"highlighted":-1,"link":272,"slug":273,"resLinks":274},"PIN Dataset: A Unified Paradigm for Multimodal Learning","2024-12-01","2077AI Foundation is proud to introduce our new project, the PIN Multimodal Dataset Document. ",[25,27,26],[209],"\u002Fresearch\u002Fpin-dataset","pin-dataset",{"homepage":139,"arxiv":140,"github":17,"huggingface":141}]