M-A-P Matrix: A Massive Bilingual Dataset for LLM Pretraining

Introduction

Dataset: M-A-P Matrix
Modalities: Text, Video
Formats: json
Languages: English, Chinese
Size: 3.19GB
Release Date: 2024-05-29
Domain: Mixed Domain
License: Apache License 2.0

Matrix is an open-source pretraining dataset containing approximately 4.7 trillion tokens of bilingual text in English and Chinese. It was created as the foundational training data for the MAP-Neo series of transparent large language models.

  • The dataset draws on a wide range of corpora: web text from Common Crawl, technical data from code and patent documents, academic writing from papers, literary text from books, and factual content from Wikipedia articles.

  • Its scale and multi-domain composition make Matrix a resource for researchers and developers who want to pretrain generalist bilingual LLMs from scratch.
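Because the composition spans several source corpora, a first step when working with a shard is often to tally records per source. The sketch below is illustrative only: it assumes one JSON object per line (the card states only "json"), and the field names "text" and "source", the source labels, and the in-memory sample are hypothetical stand-ins, not the published schema.

```python
import io
import json
from collections import Counter


def count_by_field(lines, field):
    """Count records per value of `field` in an iterable of JSON Lines strings."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # Records lacking the field are grouped under "unknown".
        counts[record.get(field, "unknown")] += 1
    return counts


# Hypothetical in-memory stand-in for a shard; real field names may differ.
sample_shard = io.StringIO(
    '{"text": "Hello world", "source": "wikipedia"}\n'
    '{"text": "你好，世界", "source": "common_crawl"}\n'
    '{"text": "def f(): pass", "source": "code"}\n'
    '{"text": "Another page", "source": "common_crawl"}\n'
)

source_counts = count_by_field(sample_shard, "source")
print(source_counts.most_common())
```

For a real file, the same function accepts an open file handle in place of the in-memory sample, since it reads one line at a time and never loads the whole shard into memory.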

Data Sample

Designed by 2077AI Team