COIG-P: A Large-Scale Chinese Preference Dataset for LLM Alignment

Introduction

Dataset: COIG-P
Modalities: Text
Formats: parquet
Languages: Chinese
Size: 803 MB
Release Date: 2025-04-07
Domain: Chat, Code, Math, Logic, Novel, Role
License: -

COIG-P (Chinese Open Instruction Generalist - Preference) is a high-quality, large-scale Chinese preference dataset designed for aligning Large Language Models (LLMs) with human values. It contains over one million chosen-rejected preference pairs, making it a substantial resource for the research community.

  • A key innovation of COIG-P is that it was built with a fully LLM-based annotation pipeline, with no direct human intervention, which addresses the scalability limits of human-annotated datasets. The pipeline used 15 mainstream LLMs to generate and score response pairs for over 92,000 filtered, high-quality Chinese queries.

  • The dataset offers broad coverage across six diverse domains: Chat, Code, Math, Logic, Novel, and Role. Training on COIG-P has been shown to yield significant performance improvements for various LLM series, demonstrating its effectiveness for preference alignment tasks.
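The pairing step described above (several models answer a query, the answers are scored, and chosen-rejected pairs are formed) can be illustrated with a minimal sketch. This is not the authors' pipeline: the class names, the `build_pairs` helper, the score scale, and the `min_gap` threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class ScoredResponse:
    """One model's answer to a query plus an LLM-assigned score (hypothetical schema)."""
    model: str
    text: str
    score: float


@dataclass(frozen=True)
class PreferencePair:
    """A chosen-rejected pair for one query (hypothetical schema)."""
    domain: str
    query: str
    chosen: str
    rejected: str


def build_pairs(domain, query, responses, min_gap=2.0):
    """Form chosen-rejected pairs from scored responses to a single query.

    Assumption: a pair is kept only when the two scores differ by at least
    `min_gap`, so that the preference signal is reasonably clear.
    """
    pairs = []
    for a, b in combinations(responses, 2):
        if abs(a.score - b.score) < min_gap:
            continue  # scores too close to call one response clearly better
        chosen, rejected = (a, b) if a.score > b.score else (b, a)
        pairs.append(PreferencePair(domain, query, chosen.text, rejected.text))
    return pairs


if __name__ == "__main__":
    # Toy data; the model names and scores are made up.
    responses = [
        ScoredResponse("model_a", "Detailed, correct answer", 9.0),
        ScoredResponse("model_b", "Partially correct answer", 6.0),
        ScoredResponse("model_c", "Off-topic answer", 5.0),
    ]
    for pair in build_pairs("Math", "1 + 1 等于几？", responses):
        print(pair.chosen, ">", pair.rejected)
```

A score-gap filter is one simple way to avoid noisy pairs; a real pipeline might instead aggregate scores from several judge models or apply a different selection rule.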

Designed by 2077AI Team