[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"blog":3},{"title":4,"desc":5,"bannerImg":6,"date":7,"orgImgLinks":8,"bannerLinks":9,"blogCategory":10,"category":11,"weight":12,"externalUrl":13,"links":14,"description":5,"content":15,"tag1":565,"tag2":566,"resLinks":568},"Have LLMs Hit a Ceiling? Why SuperGPQA Proves the AGI Journey is Just Beginning","Experts thought LLMs were maxing out, but SuperGPQA says otherwise. Learn why state-of-the-art models only score 74% on this new expert-level test and what this means for the future of AI reasoning and specialized knowledge.","https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002FBanner_blog\u002Fbanner_gpqaseo.png","2025-12-23","[]","{}","Research","undefined",0,"","{\"homepage\":\"\",\"github\":\"\",\"huggingface\":\"\",\"x\":\"\",\"discord\":\"\",\"arxiv\":\"\"}",{"data":16,"body":19,"toc":554},{"title":17,"description":18},"Have LLMs Hit Their Capability Ceiling? What SuperGPQA Reveals About AGI Boundaries","In the rapid ascent of Large Language Models (LLMs), we have reached a pivotal moment. Models like GPT-5 and Gemini 3 have begun to \"max out\" traditional benchmarks, achieving scores on tests like MMLU that rival human experts. This performance suggests they have mastered the requirements of the current human benchmark test, but also leads to a growing whisper in the AI community: Have we hit the ceiling? 
Is this as smart as they get?",{"type":20,"children":21},"root",[22,37,48,57,66,78,87,96,105,115,124,133,158,167,184,193,210,233,241,256,266,283,290,302,314,331,340,347,359,366,383,392,409,418,428,437,446,453,465,475,484,502,511,547],{"type":23,"tag":24,"props":25,"children":29},"element","h1",{"className":26,"id":28},[27],"heading__h1","have-llms-hit-their-capability-ceiling-what-supergpqa-reveals-about-agi-boundaries",[30],{"type":23,"tag":31,"props":32,"children":34},"span",{"style":33},"white-space: pre-wrap;",[35],{"type":36,"value":17},"text",{"type":23,"tag":38,"props":39,"children":43},"p",{"className":40,"style":42},[41],"doxhub-editor-paragraph","text-align: left;",[44],{"type":23,"tag":31,"props":45,"children":46},{"style":33},[47],{"type":36,"value":18},{"type":23,"tag":38,"props":49,"children":51},{"className":50},[41],[52],{"type":23,"tag":31,"props":53,"children":54},{"style":33},[55],{"type":36,"value":56},"The answer, according to the 2077AI research team, is a resounding no. The ceiling hasn't been reached; the measuring stick was simply too short to test the true capacity of AI.",{"type":23,"tag":38,"props":58,"children":60},{"className":59},[41],[61],{"type":23,"tag":31,"props":62,"children":63},{"style":33},[64],{"type":36,"value":65},"Enter SuperGPQA, a comprehensive benchmark developed by the 2077AI research team. 
Designed to probe the upper bounds of model intelligence, SuperGPQA reveals that while LLMs are knowledgeable, a significant gap remains between current capabilities and true Artificial General Intelligence (AGI).",{"type":23,"tag":67,"props":68,"children":72},"h2",{"className":69,"style":42,"id":71},[70],"heading__h2","the-problem-with-current-benchmarks",[73],{"type":23,"tag":31,"props":74,"children":75},{"style":33},[76],{"type":36,"value":77},"The Problem with Current Benchmarks",{"type":23,"tag":38,"props":79,"children":81},{"className":80},[41],[82],{"type":23,"tag":31,"props":83,"children":84},{"style":33},[85],{"type":36,"value":86},"Existing LLM benchmarks have served us well, but they are facing a saturation crisis. Models have demonstrated remarkable proficiency in mainstream subjects like mathematics, physics, and computer science. However, human knowledge extends far beyond these core STEM fields into over 200 specialized disciplines, from light industry and agriculture to specific service-oriented fields.",{"type":23,"tag":38,"props":88,"children":90},{"className":89},[41],[91],{"type":23,"tag":31,"props":92,"children":93},{"style":33},[94],{"type":36,"value":95},"Because models already score so highly on older datasets, these benchmarks are losing their value as \"challenging frontiers\".",{"type":23,"tag":38,"props":97,"children":99},{"className":98},[41],[100],{"type":23,"tag":31,"props":101,"children":102},{"style":33},[103],{"type":36,"value":104},"To truly test whether an AI can reason like a human expert, we need harder benchmarks.",{"type":23,"tag":67,"props":106,"children":109},{"className":107,"style":42,"id":108},[70],"enter-supergpqa-the-new-litmus-test",[110],{"type":23,"tag":31,"props":111,"children":112},{"style":33},[113],{"type":36,"value":114},"Enter SuperGPQA: The New Litmus 
Test",{"type":23,"tag":38,"props":116,"children":118},{"className":117},[41],[119],{"type":23,"tag":31,"props":120,"children":121},{"style":33},[122],{"type":36,"value":123},"SuperGPQA represents a massive leap in evaluation scale and taxonomic depth. Unlike previous \"hard\" benchmarks like GPQA which has only 448 questions, SuperGPQA contains 26,529 questions spanning 285 graduate-level subfields from 13 disciplines.",{"type":23,"tag":38,"props":125,"children":127},{"className":126},[41],[128],{"type":23,"tag":31,"props":129,"children":130},{"style":33},[131],{"type":36,"value":132},"Here is how SuperGPQA redefines the standard for discriminative AI benchmarking:",{"type":23,"tag":134,"props":135,"children":138},"ul",{"className":136},[137],"doxhub-editor-ul",[139],{"type":23,"tag":140,"props":141,"children":145},"li",{"value":142,"className":143},"1",[144],"doxhub-editor-list-item",[146],{"type":23,"tag":147,"props":148,"children":149},"b",{},[150],{"type":23,"tag":151,"props":152,"children":155},"strong",{"className":153,"style":33},[154],"text__bold",[156],{"type":36,"value":157},"Graduate-Level Depth",{"type":23,"tag":38,"props":159,"children":161},{"className":160},[41],[162],{"type":23,"tag":31,"props":163,"children":164},{"style":33},[165],{"type":36,"value":166},"Every question in SuperGPQA is designed to evaluate graduate-level knowledge and reasoning capabilities, moving beyond general trivia. 
The raw questions prioritize example problems with solutions from textbooks and calculation\u002Freasoning-needed problems from verified websites.",{"type":23,"tag":134,"props":168,"children":170},{"className":169},[137],[171],{"type":23,"tag":140,"props":172,"children":174},{"value":142,"className":173},[144],[175],{"type":23,"tag":147,"props":176,"children":177},{},[178],{"type":23,"tag":151,"props":179,"children":181},{"className":180,"style":33},[154],[182],{"type":36,"value":183},"Higher Difficulty Ceiling",{"type":23,"tag":38,"props":185,"children":187},{"className":186},[41],[188],{"type":23,"tag":31,"props":189,"children":190},{"style":33},[191],{"type":36,"value":192},"SuperGPQA raises the difficulty bar significantly. While traditional tests often use 4 options per question, SuperGPQA averages 9.67 options per question, which significantly reduces the chance of a model guessing the right answer by luck. Adding to the challenge, 42.33% of all questions require mathematical calculations or formal reasoning, testing complex logical deduction over simple factual recall.",{"type":23,"tag":134,"props":194,"children":196},{"className":195},[137],[197],{"type":23,"tag":140,"props":198,"children":200},{"value":142,"className":199},[144],[201],{"type":23,"tag":147,"props":202,"children":203},{},[204],{"type":23,"tag":151,"props":205,"children":207},{"className":206,"style":33},[154],[208],{"type":36,"value":209},"Rigorous Quality Control",{"type":23,"tag":38,"props":211,"children":213},{"className":212},[41],[214,219,228],{"type":23,"tag":31,"props":215,"children":216},{"style":33},[217],{"type":36,"value":218},"The dataset's integrity and difficulty are guaranteed by a novel Human-LLM Collaborative System involving over 80 expert annotators. 
This system uses a ",{"type":23,"tag":147,"props":220,"children":221},{},[222],{"type":23,"tag":151,"props":223,"children":225},{"className":224,"style":33},[154],[226],{"type":36,"value":227},"rigorous three-stage pipeline",{"type":23,"tag":31,"props":229,"children":230},{"style":33},[231],{"type":36,"value":232}," of source screening, transcription, and quality inspection, in which expert annotators source high-quality material so that the benchmark can reliably discriminate between models.",{"type":23,"tag":38,"props":234,"children":236},{"className":235},[41],[237],{"type":23,"tag":238,"props":239,"children":240},"br",{},[],{"type":23,"tag":242,"props":243,"children":244},"figure",{},[245,251],{"type":23,"tag":246,"props":247,"children":250},"img",{"src":248,"alt":249},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251223%EF%BC%881%EF%BC%89\u002FThe%20collaborative%20data%20collection%20process%20of%20SuperGPQA.webp","The collaborative data collection process of SuperGPQA",[],{"type":23,"tag":252,"props":253,"children":254},"figcaption",{},[255],{"type":36,"value":249},{"type":23,"tag":67,"props":257,"children":260},{"className":258,"style":42,"id":259},[70],"the-verdict-the-ceiling-is-still-far-away",[261],{"type":23,"tag":31,"props":262,"children":263},{"style":33},[264],{"type":36,"value":265},"The Verdict: The \"Ceiling\" is Still Far Away",{"type":23,"tag":38,"props":267,"children":269},{"className":268},[41],[270,275,278],{"type":23,"tag":31,"props":271,"children":272},{"style":33},[273],{"type":36,"value":274},"So, how do the world's best models perform on this new standard? 
The results offer a reality check for the industry.",{"type":23,"tag":238,"props":276,"children":277},{},[],{"type":23,"tag":31,"props":279,"children":280},{"style":33},[281],{"type":36,"value":282},"While models often score 80-90% on older benchmarks, the highest accuracy achieved on SuperGPQA by a state-of-the-art model (gemini-3-pro-preview) was only 73.75%. This gap shows that considerable room for improvement remains before we reach AGI.",{"type":23,"tag":38,"props":284,"children":286},{"className":285},[41],[287],{"type":23,"tag":238,"props":288,"children":289},{},[],{"type":23,"tag":242,"props":291,"children":292},{},[293,298],{"type":23,"tag":246,"props":294,"children":297},{"src":295,"alt":296},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251223%EF%BC%881%EF%BC%89\u002FBenchmark%20Comparison.webp","Benchmark Comparison",[],{"type":23,"tag":252,"props":299,"children":300},{},[301],{"type":36,"value":296},{"type":23,"tag":303,"props":304,"children":308},"h3",{"className":305,"style":42,"id":307},[306],"heading__h3","key-findings-from-the-leaderboard",[309],{"type":23,"tag":31,"props":310,"children":311},{"style":33},[312],{"type":36,"value":313},"Key Findings from the Leaderboard",{"type":23,"tag":134,"props":315,"children":317},{"className":316},[137],[318],{"type":23,"tag":140,"props":319,"children":321},{"value":142,"className":320},[144],[322],{"type":23,"tag":147,"props":323,"children":324},{},[325],{"type":23,"tag":151,"props":326,"children":328},{"className":327,"style":33},[154],[329],{"type":36,"value":330},"Reasoning is King",{"type":23,"tag":38,"props":332,"children":334},{"className":333},[41],[335],{"type":23,"tag":31,"props":336,"children":337},{"style":33},[338],{"type":36,"value":339},"The best-performing models were those specialized in reasoning. The top two, gemini-3-pro-preview and gpt-5.2-pro, scored 73.75% and 67.13% respectively, outperforming standard chat models. 
This proves that rote knowledge recall is no longer enough; AGI requires complex logical deduction.",{"type":23,"tag":38,"props":341,"children":343},{"className":342},[41],[344],{"type":23,"tag":238,"props":345,"children":346},{},[],{"type":23,"tag":242,"props":348,"children":349},{},[350,355],{"type":23,"tag":246,"props":351,"children":354},{"src":352,"alt":353},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251223%EF%BC%881%EF%BC%89\u002FLatest%20performance%20evaluation%20of%20AI%20models%20on%20SuperGPQA.webp","Latest performance evaluation of AI models on SuperGPQA",[],{"type":23,"tag":252,"props":356,"children":357},{},[358],{"type":36,"value":353},{"type":23,"tag":38,"props":360,"children":362},{"className":361},[41],[363],{"type":23,"tag":238,"props":364,"children":365},{},[],{"type":23,"tag":134,"props":367,"children":369},{"className":368},[137],[370],{"type":23,"tag":140,"props":371,"children":373},{"value":142,"className":372},[144],[374],{"type":23,"tag":147,"props":375,"children":376},{},[377],{"type":23,"tag":151,"props":378,"children":380},{"className":379,"style":33},[154],[381],{"type":36,"value":382},"The \"Hard\" Question Cliff",{"type":23,"tag":38,"props":384,"children":386},{"className":385},[41],[387],{"type":23,"tag":31,"props":388,"children":389},{"style":33},[390],{"type":36,"value":391},"When we break down the questions by difficulty, the weakness of current LLMs becomes glaring. On \"Easy\" and \"Middle\" difficulty questions, chat-oriented models (like Doubao-1.5-pro) perform admirably, showing they have mastered factual knowledge. However, in the \"Hard\" split—which tests deep reasoning—their performance collapses. 
Only reasoning-specialized models maintained competence in the hard category.",{"type":23,"tag":134,"props":393,"children":395},{"className":394},[137],[396],{"type":23,"tag":140,"props":397,"children":399},{"value":142,"className":398},[144],[400],{"type":23,"tag":147,"props":401,"children":402},{},[403],{"type":23,"tag":151,"props":404,"children":406},{"className":405,"style":33},[154],[407],{"type":36,"value":408},"Instruction Tuning Matters",{"type":23,"tag":38,"props":410,"children":412},{"className":411},[41],[413],{"type":23,"tag":31,"props":414,"children":415},{"style":33},[416],{"type":36,"value":417},"Instruction-tuned models significantly outperformed their base counterparts. For example, Qwen2.5-72B-Instruct (40.75%) showed a notable improvement over Qwen2.5-72B (34.33%).",{"type":23,"tag":67,"props":419,"children":422},{"className":420,"style":42,"id":421},[70],"discrimination-power-humanities-vs-stem",[423],{"type":23,"tag":31,"props":424,"children":425},{"style":33},[426],{"type":36,"value":427},"Discrimination Power: Humanities vs. STEM",{"type":23,"tag":38,"props":429,"children":431},{"className":430},[41],[432],{"type":23,"tag":31,"props":433,"children":434},{"style":33},[435],{"type":36,"value":436},"Interestingly, SuperGPQA reveals that the path to AGI isn't just about solving harder math problems. Our analysis of \"Disciplinary Discrimination Power\" shows that humanities disciplines, particularly history and law, are actually better at differentiating between top-tier models than STEM fields.",{"type":23,"tag":38,"props":438,"children":440},{"className":439},[41],[441],{"type":23,"tag":31,"props":442,"children":443},{"style":33},[444],{"type":36,"value":445},"Why? STEM questions often rely on standardized problem-solving patterns that models can memorize. 
Humanities questions, however, require context-dependent reasoning, interpretation of nuance, and the synthesis of real-world knowledge, areas where even the best models still struggle.",{"type":23,"tag":38,"props":447,"children":449},{"className":448},[41],[450],{"type":23,"tag":238,"props":451,"children":452},{},[],{"type":23,"tag":242,"props":454,"children":455},{},[456,461],{"type":23,"tag":246,"props":457,"children":460},{"src":458,"alt":459},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251223%EF%BC%881%EF%BC%89\u002FComprehensive%20Discrimination%20Analysis%20with%20Several%20Key%20Evaluation%20Metrics%20in%20which%20law%20and%20history%20show%20strong%20differentiation.webp","Comprehensive Discrimination Analysis with Several Key Evaluation Metrics in which law and history show strong differentiation",[],{"type":23,"tag":252,"props":462,"children":463},{},[464],{"type":36,"value":459},{"type":23,"tag":67,"props":466,"children":469},{"className":467,"style":42,"id":468},[70],"conclusion-a-roadmap-to-agi",[470],{"type":23,"tag":31,"props":471,"children":472},{"style":33},[473],{"type":36,"value":474},"Conclusion: A Roadmap to AGI",{"type":23,"tag":38,"props":476,"children":478},{"className":477},[41],[479],{"type":23,"tag":31,"props":480,"children":481},{"style":33},[482],{"type":36,"value":483},"SuperGPQA shows that LLMs have not hit a capability ceiling; they resemble students who have just graduated from high school and are now grappling with their PhDs.",{"type":23,"tag":38,"props":485,"children":487},{"className":486},[41],[488,493],{"type":23,"tag":31,"props":489,"children":490},{"style":33},[491],{"type":36,"value":492},"By scaling evaluation across 285 subfields in 13 disciplines and forcing models to confront long-tail, expert-level scenarios, SuperGPQA provides the roadmap we need. 
It moves the goalposts from \"Can AI pass a test?\" to ",{"type":23,"tag":147,"props":494,"children":495},{},[496],{"type":23,"tag":151,"props":497,"children":499},{"className":498,"style":33},[154],[500],{"type":36,"value":501},"\"Can AI think like an expert in any field?\"",{"type":23,"tag":38,"props":503,"children":505},{"className":504},[41],[506],{"type":23,"tag":31,"props":507,"children":508},{"style":33},[509],{"type":36,"value":510},"The 73.75% top score is not a discouragement; it is a challenge. It delineates the boundary of the unknown that the next generation of AI models must conquer.",{"type":23,"tag":38,"props":512,"children":514},{"className":513},[41],[515,520,538],{"type":23,"tag":31,"props":516,"children":517},{"style":33},[518],{"type":36,"value":519},"If you want to learn more about SuperGPQA, read our previous blog post, ",{"type":23,"tag":521,"props":522,"children":528},"a",{"href":523,"rel":524,"className":526},"https:\u002F\u002Fwww.2077ai.com\u002Fblog\u002F2077AI-SuperGPQA?utm_source=officialwebsite&utm_medium=blog&utm_campaign=2077ai&utm_id=supergpqaseo2",[525],"noreferrer",[527],"text__link",[529],{"type":23,"tag":147,"props":530,"children":531},{},[532],{"type":23,"tag":151,"props":533,"children":535},{"className":534,"style":33},[154],[536],{"type":36,"value":537},"the general introduction to 
SuperGPQA",{"type":23,"tag":147,"props":539,"children":540},{},[541],{"type":23,"tag":151,"props":542,"children":544},{"className":543,"style":33},[154],[545],{"type":36,"value":546},".",{"type":23,"tag":38,"props":548,"children":550},{"className":549},[41],[551],{"type":23,"tag":238,"props":552,"children":553},{},[],{"title":13,"searchDepth":555,"depth":555,"links":556},2,[557,558,559,563,564],{"id":71,"depth":555,"text":77},{"id":108,"depth":555,"text":114},{"id":259,"depth":555,"text":265,"children":560},[561],{"id":307,"depth":562,"text":313},3,{"id":421,"depth":555,"text":427},{"id":468,"depth":555,"text":474},"benchmark",[567],"llm",{"homepage":569,"arxiv":570,"github":571,"huggingface":572},"https:\u002F\u002Fsupergpqa.github.io\u002F","https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14739","https:\u002F\u002Fgithub.com\u002FSuperGPQA\u002FSuperGPQA","https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fm-a-p\u002FSuperGPQA"]