Fostering Transparent Evaluation to Advance the AI Ecosystem
The much-anticipated GPT-5 has finally lifted a corner of its veil. At 2077AI, we led the development of SuperGPQA, a new-generation large-model benchmark. Unlike traditional benchmarks, SuperGPQA focuses on graduate-level qualifying-exam questions across nearly 300 disciplines, rigorously testing a model's deep professional knowledge and complex reasoning abilities.
For more information about SuperGPQA, please visit our blog: https://www.2077ai.com/blog/2077AI-SuperGPQA
Now, let's dive into the true performance of GPT-5, as revealed by the evaluation from SuperGPQA.
Overall Performance: The Gap Between the Champion and an 'Average Student'
Judging by overall accuracy, the strength of the GPT-5 base model (gpt-5) is beyond doubt. It leads the pack with an impressive 66.7%, securely claiming the top spot.
However, when we turn to the version integrated into ChatGPT (gpt-5-chat), the results are sobering. Its accuracy drops to just 58.2%, falling to around 10th place. It not only trails the base model by a wide margin but is also surpassed by models such as Gemini-2.5-pro and Claude-Opus-4. This gap of roughly 8.5 percentage points is a chasm in the competition among top models.

Difficulty Breakdown: The Greater the Challenge, the More Pronounced the Gap
This performance discrepancy is particularly striking across questions of varying difficulty levels.
On the most challenging "Hard" questions, the GPT-5 base model demonstrates exceptional reasoning and knowledge, standing as the clear leader. In contrast, gpt-5-chat's performance drops sharply: on hard questions, gpt-5 maintains an accuracy of around 48%, while gpt-5-chat plummets to 37%.
This indicates that the "optimization" process from base model to chat product seems to come at the cost of the core capability to handle complex, specialized problems.
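For readers who want to reproduce this kind of breakdown, here is a minimal sketch of how overall and per-difficulty accuracy can be computed from per-question results. The file name and the "model", "difficulty", and "is_correct" fields are hypothetical placeholders, not the actual SuperGPQA output format.

```python
import json
from collections import defaultdict

def accuracy_by_difficulty(path: str) -> dict:
    """Per-model accuracy, overall and per difficulty bucket.

    Assumes each line of the file at `path` is a JSON object with
    hypothetical fields: "model", "difficulty" (e.g. "easy", "middle",
    "hard"), and "is_correct" (boolean).
    """
    correct = defaultdict(int)  # (model, bucket) -> number answered correctly
    total = defaultdict(int)    # (model, bucket) -> number of questions
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            for bucket in (rec["difficulty"], "overall"):
                key = (rec["model"], bucket)
                total[key] += 1
                correct[key] += int(rec["is_correct"])
    return {key: correct[key] / total[key] for key in total}

if __name__ == "__main__":
    acc = accuracy_by_difficulty("supergpqa_results.jsonl")  # hypothetical path
    gap = acc[("gpt-5", "overall")] - acc[("gpt-5-chat", "overall")]
    print(f"Overall gap, gpt-5 vs gpt-5-chat: {gap:.1%}")
```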

Disciplinary Performance: All-Round Leadership vs. 'Uneven Performance'
The box plot of per-discipline accuracy gives a more intuitive view of how stable each model's performance is.
- The accuracy box for gpt-5 sits higher than that of every other model, demonstrating its strength and stability across the vast majority of fields.
- The box for gpt-5-chat, however, sits in the middle of the pack, with both its median and quartiles lagging behind.

Whether in the disciplines where models perform best (e.g., Engineering, Science) or worst (e.g., Law, Military Science), we observe a similar trend: gpt-5-chat cannot keep pace with the base model, even on favorable ground, and its performance drop is more severe in its weaker areas.
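For illustration, a box plot of this kind can be produced with a few lines of matplotlib. The accuracy values below are placeholders for layout purposes only; in a real run there would be one accuracy per discipline (nearly 300 values per model).

```python
import matplotlib.pyplot as plt

# Illustrative placeholder values only: one accuracy per discipline per model.
per_discipline_acc = {
    "gpt-5":      [0.74, 0.70, 0.66, 0.58, 0.52],
    "gpt-5-mini": [0.66, 0.62, 0.58, 0.50, 0.44],
    "gpt-5-chat": [0.65, 0.60, 0.56, 0.47, 0.40],
}

fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(list(per_discipline_acc.values()))
ax.set_xticklabels(list(per_discipline_acc.keys()))
ax.set_ylabel("Accuracy per discipline")
ax.set_title("Distribution of disciplinary accuracy (illustrative data)")
fig.tight_layout()
fig.savefig("discipline_boxplot.png", dpi=150)
```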
A Pleasant Surprise: The Standout GPT-5 Mini
It is worth noting that the mid-sized member of the GPT-5 family, gpt-5-mini, was a highlight of this test. With an average accuracy of 59%, it ranked 7th, outperforming gpt-5-chat. This may suggest that gpt-5-mini strikes a finer balance between capability and efficiency, offering a highly attractive option for developers and enterprises.

Conclusion: Driving Open Evaluation to Co-build a Trustworthy AI Future
The evaluation results of GPT-5 clearly reveal the vast gap between the "technical ceiling" and the "product floor" in the field of AI. The potential of the base model is stunning, but what users ultimately experience is a "domesticated" version that has been subject to performance trade-offs.
This finding highlights the critical importance of establishing objective, in-depth, and open evaluation standards for the healthy development of the entire industry. This is the core motivation behind 2077AI's investment in developing open-source projects like SuperGPQA. We believe that only through transparent and reproducible evaluation can the community truly understand the strengths and limitations of current technology, thereby driving meaningful innovation.
We hereby extend a sincere invitation to AI researchers, developers, and enthusiasts worldwide:
You are welcome to use, contribute to, and improve the SuperGPQA benchmark. Join us in pushing the boundaries of AI technology to co-build a more open and trustworthy AI future.