[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"blog":3},{"title":4,"desc":5,"bannerImg":6,"date":7,"orgImgLinks":8,"bannerLinks":9,"blogCategory":10,"category":11,"weight":12,"externalUrl":13,"links":14,"description":5,"content":15,"tag1":487,"tag2":488,"resLinks":490},"Beyond Crowdsourcing: How SuperGPQA Uses PhD Experts to Solve LLM Data Leakage","Evaluation of graduate-level AI requires graduate-level expertise. Learn how 2077AI's SuperGPQA benchmark utilizes 80+ PhD experts to eliminate data leakage, refine complex distractors, and identify systematic AI hallucinations across 26,000+ questions.","https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002FBanner_blog\u002Fbanner_gpqaseo2.png","2025-12-23","[]","{}","Research","undefined",0,"","{\"homepage\":\"\",\"github\":\"\",\"huggingface\":\"\",\"x\":\"\",\"discord\":\"\",\"arxiv\":\"\"}",{"data":16,"body":19,"toc":479},{"title":17,"description":18},"Raising the Bar in AI Annotation: Why SuperGPQA Prioritizes Experts for the External Validation of AI Models","SuperGPQA is one of 2077AI's comprehensive benchmarks, evaluating graduate-level capabilities across 285 disciplines, including Science, Engineering, Medicine, and Humanities.
It contains over 26,000 questions and reveals that the strongest model to date, Gemini 3 Pro, achieves ~73% accuracy, while the runner-up, GPT-5.2 Pro, reaches only ~67%.",{"type":20,"children":21},"root",[22,37,55,67,76,102,111,119,134,143,152,162,180,198,230,237,249,259,276,293,310,320,364,394,425,432,444,454,463,472],{"type":23,"tag":24,"props":25,"children":29},"element","h1",{"className":26,"id":28},[27],"heading__h1","raising-the-bar-in-ai-annotation-why-supergpqa-prioritizes-experts-for-the-external-validation-of-ai-models",[30],{"type":23,"tag":31,"props":32,"children":34},"span",{"style":33},"white-space: pre-wrap;",[35],{"type":36,"value":17},"text",{"type":23,"tag":38,"props":39,"children":43},"p",{"className":40,"style":42},[41],"doxhub-editor-paragraph","text-align: left;",[44],{"type":23,"tag":45,"props":46,"children":47},"i",{},[48],{"type":23,"tag":49,"props":50,"children":53},"em",{"className":51,"style":33},[52],"text__italic",[54],{"type":36,"value":18},{"type":23,"tag":56,"props":57,"children":61},"h2",{"className":58,"id":60},[59],"heading__h2","the-challenge-of-evaluating-graduate-level-ai",[62],{"type":23,"tag":31,"props":63,"children":64},{"style":33},[65],{"type":36,"value":66},"The Challenge of Evaluating Graduate-Level AI",{"type":23,"tag":38,"props":68,"children":70},{"className":69},[41],[71],{"type":23,"tag":31,"props":72,"children":73},{"style":33},[74],{"type":36,"value":75},"As Large Language Models (LLMs) reach new heights of capability, traditional benchmarks are facing a saturation point: models are scoring so high that the tests are losing their ability to differentiate between a \"good\" model and a \"super\" model.
Moreover, traditional benchmarks fail to cover the diverse and long-tail knowledge accumulated by humans.",{"type":23,"tag":38,"props":77,"children":79},{"className":78},[41],[80,85,97],{"type":23,"tag":31,"props":81,"children":82},{"style":33},[83],{"type":36,"value":84},"However, discipline diversity and long-tail knowledge are critical for measuring the true capacity and real-world utility of LLMs. Failure to evaluate performance in diverse disciplines—particularly",{"type":23,"tag":86,"props":87,"children":88},"b",{},[89],{"type":23,"tag":90,"props":91,"children":94},"strong",{"className":92,"style":33},[93],"text__bold",[95],{"type":36,"value":96}," long-tail fields",{"type":23,"tag":31,"props":98,"children":99},{"style":33},[100],{"type":36,"value":101}," like agriculture and light industry—means we are fundamentally mismeasuring the real-world utility and readiness of our models for specialized tasks. To accurately measure the upper bounds of model performance, we need rigorous external validation of AI models that goes beyond general knowledge.",{"type":23,"tag":38,"props":103,"children":105},{"className":104},[41],[106],{"type":23,"tag":31,"props":107,"children":108},{"style":33},[109],{"type":36,"value":110},"SuperGPQA represents a significant step in this direction. It is a massive-scale benchmark covering 285 graduate-level disciplines, from light industry to specialized medical fields. 
The development of SuperGPQA also reveals a critical insight for the AI community: evaluating graduate-level reasoning requires graduate-level expertise.",{"type":23,"tag":38,"props":112,"children":114},{"className":113},[41],[115],{"type":23,"tag":116,"props":117,"children":118},"br",{},[],{"type":23,"tag":120,"props":121,"children":122},"figure",{},[123,129],{"type":23,"tag":124,"props":125,"children":128},"img",{"src":126,"alt":127},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251216\u002FVisualization%20of%20SuperGPQA%20question%20sampling%20across%20diverse%20disciplines.webp","Visualization of SuperGPQA question sampling across diverse disciplines",[],{"type":23,"tag":130,"props":131,"children":132},"figcaption",{},[133],{"type":36,"value":127},{"type":23,"tag":38,"props":135,"children":137},{"className":136},[41],[138],{"type":23,"tag":31,"props":139,"children":140},{"style":33},[141],{"type":36,"value":142},"The 2077AI research team initially employed a mix of crowdsourced annotators (undergraduate and master’s students) and experts (PhD holders). Through this process, we identified three key limitations of general crowdsourcing for high-level data collection.
These findings are redefining what a data annotation task is when applied to advanced reasoning.",{"type":23,"tag":38,"props":144,"children":146},{"className":145},[41],[147],{"type":23,"tag":31,"props":148,"children":149},{"style":33},[150],{"type":36,"value":151},"Here are the three crucial lessons we drew from building SuperGPQA:",{"type":23,"tag":56,"props":153,"children":156},{"className":154,"id":155},[59],"ensuring-source-credibility",[157],{"type":23,"tag":31,"props":158,"children":159},{"style":33},[160],{"type":36,"value":161},"Ensuring Source Credibility",{"type":23,"tag":38,"props":163,"children":165},{"className":164},[41],[166,175],{"type":23,"tag":86,"props":167,"children":168},{},[169],{"type":23,"tag":90,"props":170,"children":172},{"className":171,"style":33},[93],[173],{"type":36,"value":174},"The Trap: ",{"type":23,"tag":31,"props":176,"children":177},{"style":33},[178],{"type":36,"value":179},"In the initial phases of the AI annotation workflow and data collection, the research team relied on crowdsourcing annotators to identify potential questions. However, they discovered that students without deep domain expertise struggled to distinguish between \"credible resources\" and less reliable ones.",{"type":23,"tag":38,"props":181,"children":183},{"className":182},[41],[184,193],{"type":23,"tag":86,"props":185,"children":186},{},[187],{"type":23,"tag":90,"props":188,"children":190},{"className":189,"style":33},[93],[191],{"type":36,"value":192},"The Reality: ",{"type":23,"tag":31,"props":194,"children":195},{"style":33},[196],{"type":36,"value":197},"Crowdsourcing annotators frequently sourced questions from online exercise websites. This presented a significant validity issue: many state-of-the-art models at the time, such as GPT-4o and Gemini-flash, frequently output the exact erroneous answers found on these websites.
This indicates that models may have already memorized this public data (data leakage), rendering those questions ineffective for testing reasoning capabilities.",{"type":23,"tag":38,"props":199,"children":201},{"className":200},[41],[202,211,216,225],{"type":23,"tag":86,"props":203,"children":204},{},[205],{"type":23,"tag":90,"props":206,"children":208},{"className":207,"style":33},[93],[209],{"type":36,"value":210},"The Expert Advantage: ",{"type":23,"tag":31,"props":212,"children":213},{"style":33},[214],{"type":36,"value":215},"To guarantee the reliability and difficulty of the questions, the team implemented a strict protocol where only ",{"type":23,"tag":86,"props":217,"children":218},{},[219],{"type":23,"tag":90,"props":220,"children":222},{"className":221,"style":33},[93],[223],{"type":36,"value":224},"expert annotators",{"type":23,"tag":31,"props":226,"children":227},{"style":33},[228],{"type":36,"value":229}," (PhDs) were permitted to screen sources. They shifted focus toward textbooks and verified academic materials, ensuring the benchmark tested actual knowledge rather than the model's ability to recall internet content.",{"type":23,"tag":38,"props":231,"children":233},{"className":232},[41],[234],{"type":23,"tag":116,"props":235,"children":236},{},[],{"type":23,"tag":120,"props":238,"children":239},{},[240,245],{"type":23,"tag":124,"props":241,"children":244},{"src":242,"alt":243},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251216\u002FThe%20complex%20rewriting%20process%20of%20correct%20and%20incorrect%20judgment%20questions%20in%20SuperGPQA.webp","The complex rewriting process of correct and incorrect judgment questions in 
SuperGPQA",[],{"type":23,"tag":130,"props":246,"children":247},{},[248],{"type":36,"value":243},{"type":23,"tag":56,"props":250,"children":253},{"className":251,"id":252},[59],"the-complexity-of-transcription-and-distractor-generation",[254],{"type":23,"tag":31,"props":255,"children":256},{"style":33},[257],{"type":36,"value":258},"The Complexity of Transcription and Distractor Generation",{"type":23,"tag":38,"props":260,"children":262},{"className":261},[41],[263,271],{"type":23,"tag":86,"props":264,"children":265},{},[266],{"type":23,"tag":90,"props":267,"children":269},{"className":268,"style":33},[93],[270],{"type":36,"value":174},{"type":23,"tag":31,"props":272,"children":273},{"style":33},[274],{"type":36,"value":275},"Creating a multiple-choice question involves more than just identifying the correct answer; it requires generating plausible incorrect options (distractors). The difficulty showed up clearly in practice: crowdsourcing annotators often had \"low accuracy in judging generated distractors\".",{"type":23,"tag":38,"props":277,"children":279},{"className":278},[41],[280,288],{"type":23,"tag":86,"props":281,"children":282},{},[283],{"type":23,"tag":90,"props":284,"children":286},{"className":285,"style":33},[93],[287],{"type":36,"value":192},{"type":23,"tag":31,"props":289,"children":290},{"style":33},[291],{"type":36,"value":292},"For question types that ask the reader to select correct or incorrect options, it is easy for crowdsourcing annotators to generate flawed distractors. Since rigorous distractors are critical for testing true LLM reasoning and preventing simple fact retrieval, this limitation in crowdsourced judgment posed a fundamental threat to the benchmark's integrity. Even advanced LLMs like Claude-3.5 and GPT-4o faced difficulties in generating suitable confounders for complex \"statement selection\" questions.
Flawed distractors can make a hard question unintentionally easy or confusingly ambiguous.",{"type":23,"tag":38,"props":294,"children":296},{"className":295},[41],[297,305],{"type":23,"tag":86,"props":298,"children":299},{},[300],{"type":23,"tag":90,"props":301,"children":303},{"className":302,"style":33},[93],[304],{"type":36,"value":210},{"type":23,"tag":31,"props":306,"children":307},{"style":33},[308],{"type":36,"value":309},"Experts have the depth required to recognize common misconceptions in their specific fields. To put that knowledge to use, the team instituted a \"Transcription\" stage where questions were standardized and reviewed. For complex formats, such as selecting correct\u002Fincorrect statements, expert intervention was necessary to ensure the options were rigorous and discriminating.",{"type":23,"tag":56,"props":311,"children":314},{"className":312,"id":313},[59],"identifying-systematic-errors-through-expert-review",[315],{"type":23,"tag":31,"props":316,"children":317},{"style":33},[318],{"type":36,"value":319},"Identifying Systematic Errors through Expert Review",{"type":23,"tag":38,"props":321,"children":323},{"className":322},[41],[324,332,337,346,351,359],{"type":23,"tag":86,"props":325,"children":326},{},[327],{"type":23,"tag":90,"props":328,"children":330},{"className":329,"style":33},[93],[331],{"type":36,"value":174},{"type":23,"tag":31,"props":333,"children":334},{"style":33},[335],{"type":36,"value":336},"How do you judge if a question is bad? A typical approach is to see if the AI gets it wrong.
But what happens when every",{"type":23,"tag":45,"props":338,"children":339},{},[340],{"type":23,"tag":49,"props":341,"children":343},{"className":342,"style":33},[52],[344],{"type":36,"value":345}," ",{"type":23,"tag":31,"props":347,"children":348},{"style":33},[349],{"type":36,"value":350},"top AI model picks the same",{"type":23,"tag":45,"props":352,"children":353},{},[354],{"type":23,"tag":49,"props":355,"children":357},{"className":356,"style":33},[52],[358],{"type":36,"value":345},{"type":23,"tag":31,"props":360,"children":361},{"style":33},[362],{"type":36,"value":363},"wrong answer?",{"type":23,"tag":38,"props":365,"children":367},{"className":366},[41],[368,376,381,389],{"type":23,"tag":86,"props":369,"children":370},{},[371],{"type":23,"tag":90,"props":372,"children":374},{"className":373,"style":33},[93],[375],{"type":36,"value":192},{"type":23,"tag":31,"props":377,"children":378},{"style":33},[379],{"type":36,"value":380},"The 2077AI team highlights a striking phenomenon: \"Questions where LLMs choose the same incorrect option are highly suspicious\". This usually implies the models are hallucinating in unison or recalling the same incorrect fact from a low-quality training source.
Crowdsourcers generally lack the depth of knowledge to identify why",{"type":23,"tag":45,"props":382,"children":383},{},[384],{"type":23,"tag":49,"props":385,"children":387},{"className":386,"style":33},[52],[388],{"type":36,"value":345},{"type":23,"tag":31,"props":390,"children":391},{"style":33},[392],{"type":36,"value":393},"the models are converging on a wrong answer.",{"type":23,"tag":38,"props":395,"children":397},{"className":396},[41],[398,406,411,420],{"type":23,"tag":86,"props":399,"children":400},{},[401],{"type":23,"tag":90,"props":402,"children":404},{"className":403,"style":33},[93],[405],{"type":36,"value":210},{"type":23,"tag":31,"props":407,"children":408},{"style":33},[409],{"type":36,"value":410},"SuperGPQA implemented a ",{"type":23,"tag":86,"props":412,"children":413},{},[414],{"type":23,"tag":90,"props":415,"children":417},{"className":416,"style":33},[93],[418],{"type":36,"value":419},"3-Stage Quality Inspection process",{"type":23,"tag":31,"props":421,"children":422},{"style":33},[423],{"type":36,"value":424},". Expert annotators were tasked with reviewing these suspicious questions with unrestricted access to the web. 
This \"Human-LLM collaborative filtering\" allowed experts to validate whether a question was legitimately challenging or contained fundamental errors that needed correction.",{"type":23,"tag":38,"props":426,"children":428},{"className":427},[41],[429],{"type":23,"tag":116,"props":430,"children":431},{},[],{"type":23,"tag":120,"props":433,"children":434},{},[435,440],{"type":23,"tag":124,"props":436,"children":439},{"src":437,"alt":438},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251216\u002FThe%20data%20collection%20process%20of%20SuperGPQA.webp","The data collection process of SuperGPQA",[],{"type":23,"tag":130,"props":441,"children":442},{},[443],{"type":36,"value":438},{"type":23,"tag":56,"props":445,"children":448},{"className":446,"id":447},[59],"the-growing-importance-of-specialized-human-expertise-knowledge",[449],{"type":23,"tag":31,"props":450,"children":451},{"style":33},[452],{"type":36,"value":453},"The Growing Importance of Specialized Human Expertise Knowledge",{"type":23,"tag":38,"props":455,"children":457},{"className":456},[41],[458],{"type":23,"tag":31,"props":459,"children":460},{"style":33},[461],{"type":36,"value":462},"The methodology behind SuperGPQA highlights a shift in data annotation strategies. While crowdsourcing remains effective for general tasks, probing the boundaries of \"graduate-level knowledge and reasoning capabilities\" requires a higher tier of human oversight.",{"type":23,"tag":38,"props":464,"children":466},{"className":465},[41],[467],{"type":23,"tag":31,"props":468,"children":469},{"style":33},[470],{"type":36,"value":471},"The SuperGPQA results demonstrate that involving over 80 expert annotators was not just a quality assurance measure but a necessity for creating a benchmark that accurately differentiates top models from the rest.
For research teams targeting specialized domains, prioritizing subject matter expertise over volume is likely to yield more robust evaluations.",{"type":23,"tag":38,"props":473,"children":475},{"className":474},[41],[476],{"type":23,"tag":116,"props":477,"children":478},{},[],{"title":13,"searchDepth":480,"depth":480,"links":481},2,[482,483,484,485,486],{"id":60,"depth":480,"text":66},{"id":155,"depth":480,"text":161},{"id":252,"depth":480,"text":258},{"id":313,"depth":480,"text":319},{"id":447,"depth":480,"text":453},"benchmark",[489],"llm",{"homepage":491,"arxiv":492,"github":493,"huggingface":494},"https:\u002F\u002Fsupergpqa.github.io\u002F","https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14739","https:\u002F\u002Fgithub.com\u002FSuperGPQA\u002FSuperGPQA","https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fm-a-p\u002FSuperGPQA"]