Blog/Featured
Featured Content
Latest
Content

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines - Exploring the Real Proficiency Boundaries of LLM

FormalMATH Benchmark: A Formal Mathematics Benchmark for Pushing the Limits of AI

Breaking Traditional Knowledge Dependency: KOR-Bench for Evaluating Intrinsic Reasoning Abilities of Models
