
M-A-P Matrix: A Massive Bilingual Dataset for LLM Pretraining

Introduction

Dataset: m-a-p/Matrix
Modalities: Text
Formats: json
Languages: English, Chinese
Size: 3.19GB
Release Date: 2024-05-29
Tags: Mixed Domain
License: Apache License 2.0

Matrix is a massive, open-source pretraining dataset containing approximately 4.7 trillion tokens of bilingual text in English and Chinese. It was created to serve as the foundational training data for the MAP-Neo series of highly capable and transparent large language models.

  • The dataset is distinguished by its comprehensive and diverse composition, sourced from a wide range of high-quality corpora. Key components include web text from Common Crawl, technical data from code and patent documents, academic language from papers, literary text from books, and factual information from Wikipedia articles.
  • With its immense scale and rich, multi-domain composition, Matrix provides a crucial resource for researchers and developers aiming to pretrain powerful, generalist bilingual LLMs from the ground up.
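Because the corpus is published as JSON shards on the Hugging Face Hub (repository m-a-p/Matrix), it can be consumed incrementally rather than downloaded in full. The snippet below is a minimal sketch using the `datasets` library in streaming mode; the "train" split name and the "text" field are assumptions about the repository layout rather than documented guarantees, so verify them against the repo's file structure before relying on this.

```python
# Minimal sketch: stream a few records from the Matrix corpus without
# downloading the full multi-terabyte dataset. Assumes the repository
# exposes a "train" split and that each JSON record has a "text" field.
from datasets import load_dataset

matrix = load_dataset("m-a-p/Matrix", split="train", streaming=True)

for i, record in enumerate(matrix):
    # Print the first 200 characters of each record's text, if present.
    print(record.get("text", "")[:200])
    if i >= 4:
        break
```

Streaming mode is the pragmatic choice here: it iterates over shards lazily, so composition checks or tokenizer warm-ups can start immediately while a full local mirror is prepared separately for actual pretraining runs.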

Data Sample