COIG-P: A Large-Scale Chinese Preference Dataset for LLM Alignment

Introduction

Dataset	COIG-P
Modalities	Text
Formats	parquet
Languages	Chinese
Size	803MB
Release Date	2025-04-07
Domain	Chat, Code, Math, Logic, Novel, Role
License	-

COIG-P (Chinese Open Instruction Generalist - Preference) is a high-quality, large-scale Chinese preference dataset designed for aligning Large Language Models (LLMs) with human values. It contains over one million chosen-rejected preference pairs, making it a substantial resource for the research community.

A key innovation of COIG-P is that it was built with a fully LLM-based annotation pipeline, with no direct human intervention, which addresses the scalability limits of human-annotated datasets. The pipeline used 15 mainstream LLMs to generate and score response pairs for over 92,000 high-quality, filtered Chinese queries.

The dataset covers six domains: Chat, Code, Math, Logic, Novel, and Role. Training on COIG-P has been shown to yield significant performance improvements for various LLM series, demonstrating its effectiveness for preference alignment tasks.
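
The annotation pipeline described above (several LLMs answer each query, the answers are scored, and chosen-rejected pairs are formed) can be illustrated with a minimal sketch. This is not the authors' implementation: the `ScoredResponse` type, the `build_preference_pairs` function, the record keys, the numeric score scale, and the `min_margin` threshold are all assumptions made for illustration.

```python
from itertools import combinations
from typing import NamedTuple


class ScoredResponse(NamedTuple):
    model: str    # hypothetical label for the generating LLM
    text: str     # the response text
    score: float  # hypothetical judge score, higher is better


def build_preference_pairs(query, responses, min_margin=2.0):
    """Form chosen/rejected records for one query.

    Assumption: a pair is kept only when the score gap is at least
    `min_margin`, so near-ties do not become noisy training pairs.
    """
    pairs = []
    for a, b in combinations(responses, 2):
        # The higher-scored response becomes "chosen".
        chosen, rejected = (a, b) if a.score >= b.score else (b, a)
        margin = chosen.score - rejected.score
        if margin < min_margin:
            continue  # skip pairs whose quality gap is too small
        pairs.append({
            "query": query,
            "chosen": chosen.text,
            "rejected": rejected.text,
            "margin": margin,
        })
    return pairs


# Toy data: three hypothetical model responses to one query.
responses = [
    ScoredResponse("model_a", "answer one", 9.0),
    ScoredResponse("model_b", "answer two", 6.5),
    ScoredResponse("model_c", "answer three", 6.0),
]
pairs = build_preference_pairs("example query", responses)
```

With the toy scores above, the two pairs that involve `model_a` pass the threshold, while the `model_b`/`model_c` pair is dropped because its scores are too close. Filtering on a margin trades dataset size for a cleaner preference signal; the actual threshold and scoring scheme used for COIG-P are not stated in this section.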

Data Sample