M-A-P Matrix: A Massive Bilingual Dataset for LLM Pretraining
Introduction
| Dataset | m-a-p/Matrix |
|---|---|
| Modalities | Text |
| Formats | JSON |
| Languages | English, Chinese |
| Size | 3.19 GB |
| Release Date | 2024-05-29 |
| Tags | Mixed Domain |
| License | Apache License 2.0 |
Matrix is an open-source pretraining dataset containing approximately 4.7 trillion tokens of bilingual English and Chinese text. It was created as the foundational training data for the MAP-Neo series of large language models, which are designed to be both highly capable and transparent.
- The dataset draws on a wide range of high-quality corpora across several domains: web text from Common Crawl, technical data from code and patent documents, academic language from papers, literary text from books, and factual information from Wikipedia articles.
- Its scale and multi-domain composition make Matrix a resource for researchers and developers who want to pretrain generalist bilingual LLMs from scratch.
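
Because the data is distributed as JSON, a corpus of this size is usually read one record at a time rather than loaded into memory at once. The sketch below illustrates that pattern under stated assumptions: it assumes each shard is JSON Lines (one JSON object per line) and that the document text is stored under a key named `text`. Neither detail is confirmed by this page, so the hypothetical `iter_texts` helper and its `text_field` parameter should be adjusted to the actual schema of the files.

```python
import io
import json
from typing import Iterator, TextIO


def iter_texts(lines: TextIO, text_field: str = "text") -> Iterator[str]:
    """Yield the text of each JSON Lines record, skipping blank lines.

    `text_field` is an assumed key name, not a documented part of the
    Matrix schema; inspect a real shard before relying on it.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record[text_field]


# Tiny in-memory stand-in for a shard: one English and one Chinese record.
sample = io.StringIO(
    '{"text": "Hello, world."}\n'
    "\n"
    '{"text": "你好，世界。"}\n'
)
texts = list(iter_texts(sample))
```

To read a real shard, pass an open file handle instead of the `StringIO` object, for example `open(path, encoding="utf-8")`, where `path` points to a locally downloaded file. Reading lazily through a generator keeps memory use flat regardless of shard size.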