[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"blog":3},{"title":4,"desc":5,"bannerImg":6,"date":7,"links":8,"description":5,"content":9,"tag1":537,"tag2":538,"resLinks":541},"Unlocking Deeper Multimodal Understanding: Introducing PIN-200M, A Massive Dataset for Next-Gen LMMs","Discover PIN, a novel data format for training powerful Large Multimodal Models. Explore PIN-200M, our new 200-million-document dataset designed to eliminate perceptual and reasoning errors in AI. Open-source and ready for research.","https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002FBanner_blog\u002Fblog_pin200m.png","2025-09-30","{ \"github\":\"\",\"huggingface\":\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fm-a-p\u002FPIN-200M\", \"arxiv\":\"https:\u002F\u002Farxiv.org\u002Fhtml\u002F2406.13923v2\" }",{"data":10,"body":12,"toc":527},{"title":4,"description":11},"The era of Large Multimodal Models (LMMs) is here, empowering AI to understand and reason across a blend of text and images. Yet, despite their rapid progress, even the most advanced models struggle with persistent challenges. 
At 2077AI, we observed two critical hurdles limiting their true potential:",{"type":13,"children":14},"root",[15,23,28,54,73,80,85,108,134,161,167,172,231,236,242,252,269,281,324,341,346,352,357,374,386,419,436,441,447,452,471,504,510,515],{"type":16,"tag":17,"props":18,"children":20},"element","h1",{"id":19},"unlocking-deeper-multimodal-understanding-introducing-pin-200m-a-massive-dataset-for-next-gen-lmms",[21],{"type":22,"value":4},"text",{"type":16,"tag":24,"props":25,"children":26},"p",{},[27],{"type":22,"value":11},{"type":16,"tag":29,"props":30,"children":31},"ul",{},[32,44],{"type":16,"tag":33,"props":34,"children":35},"li",{},[36,42],{"type":16,"tag":37,"props":38,"children":39},"strong",{},[40],{"type":22,"value":41},"Perceptual Errors:",{"type":22,"value":43}," Models often fail to correctly interpret complex visual data like intricate charts, tables, and scientific diagrams.",{"type":16,"tag":33,"props":45,"children":46},{},[47,52],{"type":16,"tag":37,"props":48,"children":49},{},[50],{"type":22,"value":51},"Reasoning Errors:",{"type":22,"value":53}," Models struggle to deduce the relationships between text and images, especially when understanding document flow and context.",{"type":16,"tag":24,"props":55,"children":56},{},[57,59,64,66,71],{"type":22,"value":58},"These are not trivial issues; they are fundamental barriers to creating truly knowledge-intensive AI. We believe the solution lies not just in better models, but in better data. 
That's why we are excited to introduce ",{"type":16,"tag":37,"props":60,"children":61},{},[62],{"type":22,"value":63},"PIN (Paired and INterleaved multimodal documents)",{"type":22,"value":65},", a novel data format, and announce the release of ",{"type":16,"tag":37,"props":67,"children":68},{},[69],{"type":22,"value":70},"PIN-200M",{"type":22,"value":72},", a massive-scale dataset designed to address these challenges head-on.",{"type":16,"tag":74,"props":75,"children":77},"h2",{"id":76},"the-problem-with-existing-multimodal-data",[78],{"type":22,"value":79},"The Problem with Existing Multimodal Data",{"type":16,"tag":24,"props":81,"children":82},{},[83],{"type":22,"value":84},"Current multimodal datasets largely fall into two categories, each with its own limitations:",{"type":16,"tag":29,"props":86,"children":87},{},[88,98],{"type":16,"tag":33,"props":89,"children":90},{},[91,96],{"type":16,"tag":37,"props":92,"children":93},{},[94],{"type":22,"value":95},"Image-Text Pairs:",{"type":22,"value":97}," These datasets match an image with a short caption (like alt-text). While useful for basic object recognition, the text often lacks the rich context needed for deep reasoning.",{"type":16,"tag":33,"props":99,"children":100},{},[101,106],{"type":16,"tag":37,"props":102,"children":103},{},[104],{"type":22,"value":105},"Interleaved Documents:",{"type":22,"value":107}," These formats interleave images and text, which is a step in the right direction. 
However, they are scarce, primarily focused on web content, and crucially, they lose the overall visual layout of the document, which is vital for understanding context.",{"type":16,"tag":24,"props":109,"children":110},{},[111,113,118,120,125,127,132],{"type":22,"value":112},"To truly advance LMMs, we need a data format that is ",{"type":16,"tag":37,"props":114,"children":115},{},[116],{"type":22,"value":117},"knowledge-intensive",{"type":22,"value":119},", ",{"type":16,"tag":37,"props":121,"children":122},{},[123],{"type":22,"value":124},"scalable",{"type":22,"value":126},", and ",{"type":16,"tag":37,"props":128,"children":129},{},[130],{"type":22,"value":131},"versatile",{"type":22,"value":133}," enough to support diverse training strategies.",{"type":16,"tag":135,"props":136,"children":142},"div",{"className":137,"style":141},[138,139,140],"img-wrap","has-caption","center","width: 100%; position: relative; margin-bottom: 62px",[143,145,152,153],{"type":22,"value":144},"\n  ",{"type":16,"tag":146,"props":147,"children":151},"img",{"src":148,"alt":149,"style":150},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20250929\u002FPIN200-01.webp","A comparison of traditional multimodal formats (left) with the PIN format (right)","width: 100%; max-height: 60vh; object-fit: contain; background: #141414; border-radius: 8px",[],{"type":22,"value":144},{"type":16,"tag":24,"props":154,"children":158},{"className":155,"style":157},[156],"img-text","position: absolute; top: calc(100% + 16px); left: 0; right: 0;text-align: center; overflow: hidden; white-space: nowrap; text-overflow: ellipsis; line-height: 22px; color: #A1A1A1; font-size: 14px",[159],{"type":22,"value":160},"\n    A comparison of traditional multimodal formats (left) with the PIN format (right)\n  ",{"type":16,"tag":74,"props":162,"children":164},{"id":163},"the-pin-format-a-richer-vision-for-data",[165],{"type":22,"value":166},"The PIN Format: A Richer Vision for 
Data",{"type":16,"tag":24,"props":168,"children":169},{},[170],{"type":22,"value":171},"The PIN format was created to capture a deeper, more holistic understanding of a document. Instead of choosing between semantic content and visual layout, PIN provides both. Each entry in a PIN dataset consists of two synchronized components:",{"type":16,"tag":29,"props":173,"children":174},{},[175,214],{"type":16,"tag":33,"props":176,"children":177},{},[178,183,185,190,192,198,200,205,207,212],{"type":16,"tag":37,"props":179,"children":180},{},[181],{"type":22,"value":182},"A Semantically Rich Markdown File:",{"type":22,"value":184}," This isn't just plain text. We preserve the document's original structure—headings, ",{"type":16,"tag":37,"props":186,"children":187},{},[188],{"type":22,"value":189},"bold",{"type":22,"value":191}," and ",{"type":16,"tag":193,"props":194,"children":195},"em",{},[196],{"type":22,"value":197},"italic",{"type":22,"value":199}," text, lists, and even code blocks. Images are embedded inline, maintaining the natural, interleaved flow of content. This component captures the ",{"type":16,"tag":193,"props":201,"children":202},{},[203],{"type":22,"value":204},"what",{"type":22,"value":206}," and the ",{"type":16,"tag":193,"props":208,"children":209},{},[210],{"type":22,"value":211},"how",{"type":22,"value":213}," of the information.",{"type":16,"tag":33,"props":215,"children":216},{},[217,222,224,229],{"type":16,"tag":37,"props":218,"children":219},{},[220],{"type":22,"value":221},"A Paired Overall Image:",{"type":22,"value":223}," This is a high-resolution rendering of the entire document or page. It provides the complete visual context — the layout, the spatial relationship between elements, and the overall design. 
This component captures ",{"type":16,"tag":193,"props":225,"children":226},{},[227],{"type":22,"value":228},"where",{"type":22,"value":230},".",{"type":16,"tag":24,"props":232,"children":233},{},[234],{"type":22,"value":235},"By combining a structured, interleaved Markdown file with a holistic overall image, the PIN format provides a complete, knowledge-rich representation that helps models learn both fine-grained details and high-level context simultaneously.",{"type":16,"tag":74,"props":237,"children":239},{"id":238},"from-14m-to-200m-a-monumental-leap-in-scale",[240],{"type":22,"value":241},"From 14M to 200M: A Monumental Leap in Scale",{"type":16,"tag":24,"props":243,"children":244},{},[245,247,251],{"type":22,"value":246},"We first released PIN-14M, a 14-million-document dataset that validated the PIN format. Today, we are taking a much larger step with the release of ",{"type":16,"tag":37,"props":248,"children":249},{},[250],{"type":22,"value":70},{"type":22,"value":230},{"type":16,"tag":135,"props":253,"children":255},{"className":254,"style":141},[138,139,140],[256,257,262,263],{"type":22,"value":144},{"type":16,"tag":146,"props":258,"children":261},{"src":259,"alt":260,"style":150},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20250929\u002FPIN200-02.webp","A statistical overview comparing the composition of PIN-14M and PIN-200M",[],{"type":22,"value":144},{"type":16,"tag":24,"props":264,"children":266},{"className":265,"style":157},[156],[267],{"type":22,"value":268},"\n    A statistical overview comparing the composition of PIN-14M and PIN-200M\n  ",{"type":16,"tag":24,"props":270,"children":271},{},[272,274,279],{"type":22,"value":273},"This new dataset is more than ten times the size of PIN-14M, containing approximately ",{"type":16,"tag":37,"props":275,"children":276},{},[277],{"type":22,"value":278},"200 million multimodal documents",{"type":22,"value":280},". 
As the chart above shows, this isn't just a quantitative jump; it's a qualitative transformation. PIN-200M is built from a diverse range of sources:",{"type":16,"tag":29,"props":282,"children":283},{},[284,294,304,314],{"type":16,"tag":33,"props":285,"children":286},{},[287,292],{"type":16,"tag":37,"props":288,"children":289},{},[290],{"type":22,"value":291},"Scientific Documents:",{"type":22,"value":293}," Articles from arXiv and PubMed Central.",{"type":16,"tag":33,"props":295,"children":296},{},[297,302],{"type":16,"tag":37,"props":298,"children":299},{},[300],{"type":22,"value":301},"Web Content:",{"type":22,"value":303}," A massive collection of web pages from sources like OBELICS.",{"type":16,"tag":33,"props":305,"children":306},{},[307,312],{"type":16,"tag":37,"props":308,"children":309},{},[310],{"type":22,"value":311},"Technical Documentation:",{"type":22,"value":313}," Content from communities like Linux-CN and datasets like LeetCode.",{"type":16,"tag":33,"props":315,"children":316},{},[317,322],{"type":16,"tag":37,"props":318,"children":319},{},[320],{"type":22,"value":321},"Literature:",{"type":22,"value":323}," Long-form text from books via the PG19 dataset.",{"type":16,"tag":135,"props":325,"children":327},{"className":326,"style":141},[138,139,140],[328,329,334,335],{"type":22,"value":144},{"type":16,"tag":146,"props":330,"children":333},{"src":331,"alt":332,"style":150},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20250929\u002FPIN200-03.webp","Samples from various subsets in the PIN-200M dataset",[],{"type":22,"value":144},{"type":16,"tag":24,"props":336,"children":338},{"className":337,"style":157},[156],[339],{"type":22,"value":340},"\n    Samples from various subsets in the PIN-200M dataset\n  ",{"type":16,"tag":24,"props":342,"children":343},{},[344],{"type":22,"value":345},"This massive scale is critical. 
It provides the data volume needed to train foundation LMMs that generalize across domains, understand nuanced context, and make fewer of the perceptual and reasoning errors that have held back progress.",{"type":16,"tag":74,"props":347,"children":349},{"id":348},"designed-for-researchers-quality-signals-and-flexibility",[350],{"type":22,"value":351},"Designed for Researchers: Quality Signals and Flexibility",{"type":16,"tag":24,"props":353,"children":354},{},[355],{"type":22,"value":356},"We didn't just build a massive dataset; we built a usable one. A standardized pipeline ensures that all data, whether from new sources or existing datasets, is converted into the unified PIN format.",{"type":16,"tag":135,"props":358,"children":360},{"className":359,"style":141},[138,139,140],[361,362,367,368],{"type":22,"value":144},{"type":16,"tag":146,"props":363,"children":366},{"src":364,"alt":365,"style":150},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20250929\u002FPIN200-04.webp","An overview of the standardized data processing pipeline",[],{"type":22,"value":144},{"type":16,"tag":24,"props":369,"children":371},{"className":370,"style":157},[156],[372],{"type":22,"value":373},"\n    An overview of the standardized data processing pipeline\n  ",{"type":16,"tag":24,"props":375,"children":376},{},[377,379,384],{"type":22,"value":378},"Furthermore, inspired by projects like RedPajama-Data-v2, PIN datasets come equipped with ",{"type":16,"tag":37,"props":380,"children":381},{},[382],{"type":22,"value":383},"quality_signals",{"type":22,"value":385},". 
These are per-document metadata fields that report metrics such as:",{"type":16,"tag":29,"props":387,"children":388},{},[389,399,409],{"type":16,"tag":33,"props":390,"children":391},{},[392,397],{"type":16,"tag":37,"props":393,"children":394},{},[395],{"type":22,"value":396},"Image-Text Interleaving Frequency (ITIF):",{"type":22,"value":398}," How often do images and text alternate?",{"type":16,"tag":33,"props":400,"children":401},{},[402,407],{"type":16,"tag":37,"props":403,"children":404},{},[405],{"type":22,"value":406},"Token Counts & Document Length:",{"type":22,"value":408}," Basic but essential metrics for filtering.",{"type":16,"tag":33,"props":410,"children":411},{},[412,417],{"type":16,"tag":37,"props":413,"children":414},{},[415],{"type":22,"value":416},"Markup Statistics:",{"type":22,"value":418}," How many bold, italic, or heading tags are present?",{"type":16,"tag":135,"props":420,"children":422},{"className":421,"style":141},[138,139,140],[423,424,429,430],{"type":22,"value":144},{"type":16,"tag":146,"props":425,"children":428},{"src":426,"alt":427,"style":150},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20250929\u002FPIN200-05.webp","Detailed statistics for key quality signals across the PIN-200M subsets",[],{"type":22,"value":144},{"type":16,"tag":24,"props":431,"children":433},{"className":432,"style":157},[156],[434],{"type":22,"value":435},"\n    Detailed statistics for key quality signals across the PIN-200M subsets\n  ",{"type":16,"tag":24,"props":437,"children":438},{},[439],{"type":22,"value":440},"These signals let researchers filter the 200 million documents into specialized subsets for their specific needs, enabling targeted and efficient model training. 
The rich diversity across subsets is also evident in their statistical distributions, ensuring the dataset can support a wide range of applications.",{"type":16,"tag":74,"props":442,"children":444},{"id":443},"fueling-the-next-wave-of-ai-training",[445],{"type":22,"value":446},"Fueling the Next Wave of AI Training",{"type":16,"tag":24,"props":448,"children":449},{},[450],{"type":22,"value":451},"The true power of the PIN format lies in its flexibility. It not only supports existing training methods but also opens the door to entirely new research paradigms.",{"type":16,"tag":24,"props":453,"children":454},{},[455,457,462,464,469],{"type":22,"value":456},"While you can use PIN for standard training, like ",{"type":16,"tag":37,"props":458,"children":459},{},[460],{"type":22,"value":461},"contrastive learning",{"type":22,"value":463}," (pairing the overall image with the Markdown file) or ",{"type":16,"tag":37,"props":465,"children":466},{},[467],{"type":22,"value":468},"next-token prediction",{"type":22,"value":470}," (using the interleaved markdown), it also enables novel, exciting tasks:",{"type":16,"tag":29,"props":472,"children":473},{},[474,484,494],{"type":16,"tag":33,"props":475,"children":476},{},[477,482],{"type":16,"tag":37,"props":478,"children":479},{},[480],{"type":22,"value":481},"Multimodal Document Rendering (MDR):",{"type":22,"value":483}," Imagine training a model to generate a complete, visually coherent webpage or document layout from only its Markdown source.",{"type":16,"tag":33,"props":485,"children":486},{},[487,492],{"type":16,"tag":37,"props":488,"children":489},{},[490],{"type":22,"value":491},"Knowledge Extraction (KE):",{"type":22,"value":493}," The reverse is also possible—training a model to parse a document image and generate a perfectly structured Markdown file, effectively performing \"holistic 
OCR.\"",{"type":16,"tag":33,"props":495,"children":496},{},[497,502],{"type":16,"tag":37,"props":498,"children":499},{},[500],{"type":22,"value":501},"Pagination Prediction (PP):",{"type":22,"value":503}," Models can learn to understand the logical flow of multi-page documents, predicting page breaks and structure.",{"type":16,"tag":74,"props":505,"children":507},{"id":506},"the-future-is-open-and-multimodal",[508],{"type":22,"value":509},"The Future is Open and Multimodal",{"type":16,"tag":24,"props":511,"children":512},{},[513],{"type":22,"value":514},"We believe that the release of PIN-200M represents a significant milestone for the AI community. By providing a new data format and a massive, high-quality, and easy-to-use dataset, we are laying the groundwork for the next generation of more capable and reliable Large Multimodal Models.",{"type":16,"tag":24,"props":516,"children":517},{},[518,520,525],{"type":22,"value":519},"This entire project is ",{"type":16,"tag":37,"props":521,"children":522},{},[523],{"type":22,"value":524},"fully open-source",{"type":22,"value":526},". We invite every researcher, developer, and innovator to dive into PIN-200M, explore its potential, and build upon our work. Let's work together to push the boundaries of what's possible in multimodal AI.",{"title":528,"searchDepth":529,"depth":529,"links":530},"",2,[531,532,533,534,535,536],{"id":76,"depth":529,"text":79},{"id":163,"depth":529,"text":166},{"id":238,"depth":529,"text":241},{"id":348,"depth":529,"text":351},{"id":443,"depth":529,"text":446},{"id":506,"depth":529,"text":509},"dataset",[539,540],"multimodal","image",{"homepage":542,"arxiv":543,"github":528,"huggingface":544},"https:\u002F\u002Fwww.2077ai.com\u002Fdatasets\u002Fdataset-pin200","https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.13923","https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fm-a-p\u002FPIN-14M"]