[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"dataset":3},{"title":4,"desc":5,"bannerImg":6,"date":7,"bannerLinks":8,"orgImgLinks":9,"weight":10,"category":11,"description":5,"content":12,"metaBannerImg":182},"PIN Dataset: 200M Paired Multimodal Documents for LMMs","Discover PIN, a new data format and two large-scale datasets (PIN-200M & PIN-14M) designed to help LMMs understand complex, knowledge-intensive multimodal documents.","\u002Fdatasets-banner-images\u002Fpin-dataset-banner.jpg","2025-11-18","{ \"Blog\":\"https:\u002F\u002Fwww.2077ai.com\u002Fblog\u002Fintroducing-pin-200m-multimodal-dataset\",\"Paper\":\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.13923\", \"Dataset\":\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fm-a-p\u002FPIN-200M\" }","[{\"orgName\": \"m-a-p\", \"url\": \"https:\u002F\u002Fm-a-p.ai\u002F\", \"logoUrl\": \"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002Ficons\u002Fm-a-p%20normal.png\"},{\"orgName\": \"tsinghua_university\", \"url\": \"https:\u002F\u002Fwww.tsinghua.edu.cn\u002Fen\u002F\", \"logoUrl\": \"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002Ficons\u002Fthu-normal.png\"},{\"orgName\": \"01.ai\", \"url\": \"https:\u002F\u002Fwww.01.ai\u002F\", \"logoUrl\": \"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002Ficons\u002F01.AI-normal.png\"}]",4,"VLM",{"data":13,"body":16,"toc":179},{"title":14,"description":15},"Introduction","PIN (Paired and INterleaved multimodal documents) is a novel data format and large-scale dataset designed to solve persistent perceptual and reasoning errors in knowledge-intensive Large Multimodal Models (LMMs). 
Current LMMs often fail when interpreting complex visual data (like tables and charts) or deducing the relationships between images and text, largely because existing datasets separate these information streams.",{"type":17,"children":18},"root",[19,27,32,59,64,79,84,101,106,123,128,134,139,146,152,161,167],{"type":20,"tag":21,"props":22,"children":24},"element","h1",{"id":23},"introduction",[25],{"type":26,"value":14},"text",{"type":20,"tag":28,"props":29,"children":30},"p",{},[31],{"type":26,"value":15},{"type":20,"tag":33,"props":34,"children":40},"div",{"className":35,"style":39},[36,37,38],"img-wrap","has-caption","center","width: 100%; position: relative; margin-bottom: 62px",[41,43,50,51],{"type":26,"value":42},"\n  ",{"type":20,"tag":44,"props":45,"children":49},"img",{"src":46,"alt":47,"style":48},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251112\u002Fprevious%20vs%20PIN.png","Previous Multimodal Formats vs PIN Format","width: 100%; max-height: 60vh; object-fit: contain; background: #141414; border-radius: 8px",[],{"type":26,"value":42},{"type":20,"tag":28,"props":52,"children":56},{"className":53,"style":55},[54],"img-text","position: absolute; top: calc(100% + 16px); left: 0; right: 0;text-align: center; overflow: hidden; white-space: nowrap; text-overflow: ellipsis; line-height: 22px; color: #A1A1A1; font-size: 14px",[57],{"type":26,"value":58},"\n    Previous Multimodal Formats vs PIN Format\n  ",{"type":20,"tag":28,"props":60,"children":61},{},[62],{"type":26,"value":63},"The PIN format directly addresses this by fostering a deeper, synergistic integration of visual and textual knowledge. 
Each document in the dataset combines:",{"type":20,"tag":65,"props":66,"children":67},"ul",{},[68,74],{"type":20,"tag":69,"props":70,"children":71},"li",{},[72],{"type":26,"value":73},"A Semantically Rich Markdown File: This preserves the fine-grained textual structure, including headings, lists, and tables.",{"type":20,"tag":69,"props":75,"children":76},{},[77],{"type":26,"value":78},"A Holistic Overall Image: This captures the complete document layout, providing the crucial spatial and visual context.\nBy training on this dual representation, LMMs can learn to connect detailed text with its overarching visual layout, correcting the errors that plague current models.",{"type":20,"tag":28,"props":80,"children":81},{},[82],{"type":26,"value":83},"To empower new research, we are releasing our large-scale, open-source dataset built on this format, compiled from diverse web and scientific sources (in English and Chinese). The dataset was first constructed at a scale of PIN-14M (~14 million documents) and, while maintaining the same high-quality standard, has now been successfully scaled up to PIN-200M (~200 million documents).",{"type":20,"tag":33,"props":85,"children":87},{"className":86,"style":39},[36,37,38],[88,89,94,95],{"type":26,"value":42},{"type":20,"tag":44,"props":90,"children":93},{"src":91,"alt":92,"style":48},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251112\u002FStatistical%20Overview.png","Statistical Overview of our PIN-14M and PIN-200M datasets.",[],{"type":26,"value":42},{"type":20,"tag":28,"props":96,"children":98},{"className":97,"style":55},[54],[99],{"type":26,"value":100},"\n    Statistical Overview of our PIN-14M and PIN-200M datasets.\n  ",{"type":20,"tag":28,"props":102,"children":103},{},[104],{"type":26,"value":105},"We process 9 subsets: PIN-arXiv, PIN-PMC, DocLayNet, Linux-CN, chinese-markdown, OBELICS, MMC4, leetcode, and PG19. 
(Note: We do not release the PIN-arXiv subset in the preview version.)",{"type":20,"tag":33,"props":107,"children":109},{"className":108,"style":39},[36,37,38],[110,111,116,117],{"type":26,"value":42},{"type":20,"tag":44,"props":112,"children":115},{"src":113,"alt":114,"style":48},"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002F20251112\u002Fpin%20subsets.png","Dataset Structure",[],{"type":26,"value":42},{"type":20,"tag":28,"props":118,"children":120},{"className":119,"style":55},[54],[121],{"type":26,"value":122},"\n    Dataset Structure\n  ",{"type":20,"tag":28,"props":124,"children":125},{},[126],{"type":26,"value":127},"Equipped with detailed quality signals for easy filtering, the PIN datasets provide a foundational resource for developing and pre-training the next generation of powerful, knowledge-intensive LMMs.",{"type":20,"tag":21,"props":129,"children":131},{"id":130},"dataset-overview",[132],{"type":26,"value":133},"Dataset Overview",{"type":20,"tag":28,"props":135,"children":136},{},[137],{"type":26,"value":138},"Overall, the PIN-200M dataset comprises nearly 200 million documents, with a mean ITIF of 3.24 and a high prevalence of knowledge-intensive attributes. 
These characteristics indicate its nature as a large-scale, knowledge-intensive resource.",{"type":20,"tag":140,"props":141,"children":145},"donut-chart",{"data":142,"description":143,"title":144},"Documents,194977272,Overall images,269050145,Content images,230718460","","Data Distribution",[],{"type":20,"tag":21,"props":147,"children":149},{"id":148},"data-samples",[150],{"type":26,"value":151},"Data Samples",{"type":20,"tag":153,"props":154,"children":160},"iframe",{"src":155,"style":156,"frameBorder":157,"allowFullScreen":158,"loading":159},"https:\u002F\u002Fdataset.data4o.xyz\u002Fshare\u002Fdataset\u002Fpreview?datasetId=690b03823f23cb1188df1ee6&env=zh","width: 100%; height: 800px;","0",true,"lazy",[],{"type":20,"tag":21,"props":162,"children":164},{"id":163},"bibtex",[165],{"type":26,"value":166},"BibTeX",{"type":20,"tag":168,"props":169,"children":173},"pre",{"className":170,"code":172,"language":163,"meta":143},[171],"language-bibtex","@article{DBLP:journals\u002Fcorr\u002Fabs-2406-13923,\n  author    = {Junjie Wang and\n               Yuxiang Zhang and\n               Minghao Liu and\n               Yin Zhang and\n               Yatai Ji and\n               Weihao Xuan and\n               Nie Lin and\n               Kang Zhu and\n               Zhiqiang Lin and\n               Yiming Ren and\n               Chunyang Jiang and\n               Yiyao Yu and\n               Zekun Wang and\n               Tiezhen Wang and\n               Wenhao Huang and\n               Jie Fu and\n               Qunshu Lin and\n               Yujiu Yang and\n               Ge Zhang and\n               Ruibin Yuan and\n               Bei Chen and\n               Wenhu Chen},\n  title     = {{PIN:} {A} Knowledge-Intensive Dataset for Paired and Interleaved\n               Multimodal Documents},\n  journal   = {CoRR},\n  volume    = {abs\u002F2406.13923},\n  year      = 
{2024}\n}\n",[174],{"type":20,"tag":175,"props":176,"children":177},"code",{"__ignoreMap":143},[178],{"type":26,"value":172},{"title":143,"searchDepth":180,"depth":180,"links":181},2,[],"https:\u002F\u002Fdoxhub.s3.us-east-1.amazonaws.com\u002F2077ai\u002FBanner_dataset\u002Fdataset_pin.png"]