M-A-P Matrix: A Massive Bilingual Dataset for LLM Pretraining

Introduction

Dataset	M-A-P Matrix
Modalities	Text, Video
Formats	json
Languages	English, Chinese
Size	3.19GB
Release Date	2024-05-29
Domain	Mixed Domain
License	Apache License 2.0

Matrix is an open-source pretraining dataset containing approximately 4.7 trillion tokens of bilingual English and Chinese text. It was created as the foundational training data for the MAP-Neo series of transparent large language models.

The dataset draws on a wide range of corpora: web text from Common Crawl, technical text from code and patent documents, academic writing from papers, literary text from books, and factual content from Wikipedia articles.
Its scale and multi-domain composition make Matrix a useful resource for researchers and developers who want to pretrain generalist bilingual LLMs from scratch.
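
Because the dataset is listed as JSON, a first look at a downloaded shard could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes one JSON object per line (JSON Lines) and a text field named `text`, neither of which is confirmed on this page. The helper names `iter_records` and `summarize` are hypothetical.

```python
import io
import json
from typing import Dict, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-blank line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


def summarize(lines: Iterable[str], text_field: str = "text") -> Dict[str, int]:
    """Count records and characters one line at a time.

    `text_field` is an assumed field name; adjust it to the real schema.
    """
    stats = {"records": 0, "chars": 0}
    for record in iter_records(lines):
        stats["records"] += 1
        stats["chars"] += len(record.get(text_field, ""))
    return stats


# Hypothetical in-memory sample standing in for a real shard file.
sample = io.StringIO(
    '{"text": "Hello world"}\n'
    '\n'
    '{"text": "你好，世界"}\n'
)
print(summarize(sample))  # → {'records': 2, 'chars': 16}
```

For a real shard, an open file handle can be passed in place of the sample, for example `with open(path, encoding="utf-8") as f: summarize(f)`. Reading line by line keeps memory use flat, which matters at this dataset's scale.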
