GPT-5 Series vs. Gemini 3 Pro: The Verdict from SuperGPQA

The release of OpenAI’s GPT-5.2 Pro has reignited the race for AI supremacy, promising significant leaps in reasoning and professional capabilities. But how does it actually perform when tested against the world's hardest domain-specific questions?

We put the leading frontier models, including Google’s Gemini 3 Pro Preview, GPT-5.2 Pro, and GPT-5.1-Thinking, to the test on SuperGPQA, our gold-standard benchmark for graduate-level knowledge. Covering 285 specialized disciplines, from Quantum Mechanics to Agronomy, SuperGPQA bypasses surface-level internet knowledge to evaluate deep reasoning.

The results are in, and they signal a shift in the hierarchy of "hard science" capabilities.

Discipline accuracy distribution by model

The Verdict: Gemini 3 Pro Leads in Specialized Knowledge

Contrary to the expectation that newer is always better, our data shows that Gemini 3 Pro Preview currently holds the edge in complex, high-stakes scientific domains. While the GPT-5 series demonstrates impressive reasoning, Gemini's underlying knowledge density in specialized fields appears superior.

Gemini 3 Pro Preview outperforms GPT-5 variants in overall graduate-level accuracy on SuperGPQA

In the Physics domain alone, which aggregates over 2,000 graduate-level questions, the performance gap is distinct. Gemini 3 Pro consistently ranks at the top, outperforming the GPT-5 series in subfields that require precise physical intuition and calculation.

Discipline Deep Dive: Where the Models Diverge

The aggregate scores tell only half the story. The true test of an expert model is its performance in "long-tail" disciplines—subjects that aren't just reasoning puzzles, but require deep, memorized professional knowledge.

Hard Physics: The Reasoning Test

Relativity is one of the most conceptually demanding subfields in our benchmark. Here, Gemini 3 Pro achieved a commanding 79.75% accuracy. In comparison, OpenAI's specialized reasoning model, GPT-5.1-Thinking, scored 74.68%, while the new GPT-5.2 Pro trailed at 70.89%. This suggests that for theoretical physics, Gemini's internal world model is more robust.

Specialized Agriculture: The Knowledge Test

In Aquaculture, a niche field often overlooked by general benchmarks, the difference is even starker. Gemini 3 Pro maintained 62.50% accuracy, while GPT-5.2 Pro struggled significantly, achieving only 48.21%: a gap of over 14 percentage points.
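The gaps quoted in this deep dive can be checked with a few lines of arithmetic. The sketch below uses only the accuracy figures stated in this article. The dictionary layout and the `gap_to_leader` helper are illustrative assumptions, not part of any SuperGPQA tooling.

```python
# Accuracy figures (percent) exactly as quoted in this article.
ACCURACY = {
    "Relativity": {
        "Gemini 3 Pro": 79.75,
        "GPT-5.1-Thinking": 74.68,
        "GPT-5.2 Pro": 70.89,
    },
    "Aquaculture": {
        "Gemini 3 Pro": 62.50,
        "GPT-5.2 Pro": 48.21,
    },
}


def gap_to_leader(scores):
    """Hypothetical helper: return the top-scoring model and each other
    model's deficit to it, in percentage points."""
    leader = max(scores, key=scores.get)
    gaps = {
        model: round(scores[leader] - score, 2)
        for model, score in scores.items()
        if model != leader
    }
    return leader, gaps


for discipline, scores in ACCURACY.items():
    leader, gaps = gap_to_leader(scores)
    for model, gap in gaps.items():
        print(f"{discipline}: {leader} leads {model} by {gap:.2f} points")
```

These are differences in percentage points, not relative improvements. Dividing a gap by the trailing model's score would give the relative figure instead.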

Gemini 3 Pro demonstrates greater breadth than GPT-5.2 Pro across diverse scientific disciplines

Conclusion

For developers and enterprises choosing between these frontier models, the SuperGPQA verdict is clear:

GPT-5.1-Thinking is a powerful tool for logic-heavy tasks, showing strong improvements over base models in reasoning-intensive questions.
However, Gemini 3 Pro currently reigns supreme in domain expertise. If your application requires handling specialized, graduate-level knowledge, from theoretical physics to agricultural science, Gemini 3 Pro is the statistical leader.

As the AI landscape evolves, SuperGPQA will continue to serve as the unbiased arena for measuring true machine intelligence.

Explore the Full Leaderboard ->

Learn more about SuperGPQA ->
