MMAR: A Benchmark for Deep Audio Reasoning
Introduction
Dataset | MMAR |
---|---|
Modalities | Text, Audio |
Formats | json |
Languages | English, Chinese, etc. (16 total) |
Size | 168 kB |
Release Date | 2025-05-19 |
Domain | Audio Processing, Speech Recognition |
License | cc-by-nc-4.0 |
MMAR (Massive Multi-disciplinary Audio Reasoning) is a new and challenging benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs).
- It consists of 1,000 meticulously curated audio-question-answer triplets sourced from real-world internet videos. Each task requires multi-step deep reasoning that goes far beyond surface-level perception.
- A key feature of MMAR is its diverse coverage of modalities: not only traditional speech, audio, and music, but also complex mixtures of them. The benchmark is also deliberately difficult, with some questions requiring graduate-level perceptual skills and domain-specific knowledge to answer correctly.
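As an illustration of how audio-question-answer triplets like these might be consumed, the following is a minimal Python sketch that parses a JSON array of records and scores model predictions, overall and per modality. The record fields (`audio_path`, `question`, `answer`, `modality`) and the exact-match scoring rule are assumptions for illustration only, not the official MMAR schema or evaluation protocol; check the released JSON files before adapting it.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical record layout: these field names are assumptions,
# not the official MMAR schema.
@dataclass
class Triplet:
    audio_path: str
    question: str
    answer: str
    modality: str


def load_triplets(json_text: str) -> list[Triplet]:
    """Parse a JSON array of records into Triplet objects."""
    records = json.loads(json_text)
    return [
        Triplet(r["audio_path"], r["question"], r["answer"], r["modality"])
        for r in records
    ]


def _matches(reference: str, prediction: str) -> bool:
    # Assumed scoring rule: case- and whitespace-insensitive exact match.
    return reference.strip().lower() == prediction.strip().lower()


def accuracy(triplets: list[Triplet], predictions: list[str]) -> float:
    """Fraction of predictions that match the reference answer."""
    if len(triplets) != len(predictions):
        raise ValueError("exactly one prediction per triplet is required")
    if not triplets:
        return 0.0
    correct = sum(_matches(t.answer, p) for t, p in zip(triplets, predictions))
    return correct / len(triplets)


def accuracy_by_modality(
    triplets: list[Triplet], predictions: list[str]
) -> dict[str, float]:
    """Accuracy computed separately for each modality label."""
    if len(triplets) != len(predictions):
        raise ValueError("exactly one prediction per triplet is required")
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for t, p in zip(triplets, predictions):
        totals[t.modality] += 1
        hits[t.modality] += _matches(t.answer, p)
    return {m: hits[m] / totals[m] for m in totals}
```

Breaking accuracy down by modality is one way to see whether a model handles mixed-audio questions differently from pure speech, audio, or music questions.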