COIG-P: A Large-Scale Chinese Preference Dataset for LLM Alignment
Introduction
Dataset | COIG-P |
---|---|
Modalities | Text |
Formats | parquet |
Languages | Chinese |
Size | 803MB |
Release Date | 2025-04-07 |
Domain | Chat, Code, Math, Logic, Novel, Role |
License | - |
COIG-P (Chinese Open Instruction Generalist - Preference) is a large-scale Chinese preference dataset for aligning Large Language Models (LLMs) with human values. It contains over one million chosen-rejected preference pairs, making it a substantial resource for preference-alignment research.
- COIG-P is built with a fully LLM-based annotation pipeline that involves no direct human intervention, which avoids the scalability limits of human-annotated datasets. The pipeline used 15 mainstream LLMs to generate and score response pairs for over 92,000 high-quality, filtered Chinese queries.
- The dataset covers six domains: Chat, Code, Math, Logic, Novel, and Role. Training on COIG-P has been shown to yield significant performance improvements across several LLM series, which supports its usefulness for preference alignment.
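To make the idea of a chosen-rejected pair concrete, the following is a minimal sketch of how scored responses to one query could be turned into preference pairs. It is not the COIG-P pipeline itself: the `PreferencePair` fields, the `build_pairs` helper, the 0-10 score scale, and the `min_gap` threshold are all hypothetical choices made for illustration, since the description above does not specify how pairs are selected.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferencePair:
    # Field names are assumptions, not the dataset's actual schema.
    query: str
    chosen: str
    rejected: str
    domain: str


def build_pairs(query, domain, scored_responses, min_gap=2.0):
    """Turn (response, score) tuples for one query into preference pairs.

    A pair is kept only when the two scores differ by at least `min_gap`,
    so that the preference is reasonably clear. The threshold value is an
    assumption for this sketch.
    """
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored_responses, 2):
        if abs(score_a - score_b) < min_gap:
            continue  # scores too close to express a clear preference
        if score_a > score_b:
            chosen, rejected = resp_a, resp_b
        else:
            chosen, rejected = resp_b, resp_a
        pairs.append(PreferencePair(query, chosen, rejected, domain))
    return pairs


# Example: three LLM responses to one Math query, with judge scores.
scored = [("答案 A", 9.0), ("答案 B", 6.5), ("答案 C", 8.0)]
pairs = build_pairs("1+1等于几？", "Math", scored)
for p in pairs:
    print(p.chosen, ">", p.rejected)
```

With these made-up scores, only the A/B combination clears the gap, so a single pair is produced. Filtering on a score gap is one common way to keep noisy, near-tied comparisons out of a preference dataset; a different selection rule would slot into the same loop.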