MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.

Python

Updated June 12, 2026

Meta-rater

★195

[ACL 2025 Best Theme Paper] This is the official implementation for the paper: "Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models"

Python

Updated June 11, 2026

LOKI

★179

[ICLR 2025 Spotlight] The official implementation of the paper “LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models”

Python

Updated June 3, 2026

MinerU-Popo

★164

No description is provided for this repository.

Python

Updated June 13, 2026

Earth-Agent

★158

[ICLR 2026] The official implementation of the paper “Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents”

Python

Updated June 11, 2026

labelU-Kit

★154

Data annotation component library, provided as NPM packages

TypeScript

Updated June 10, 2026

FakeVLM

★150

[NeurIPS 2025 🔥] FakeVLM: Advancing Synthetic Image Detection through Explainable Multimodal Models and Fine-Grained Artifact Analysis

Python

Updated June 9, 2026

opendatalab-datasets

★145

datasets resource

Unknown language

Updated June 12, 2026

mineru-vl-utils

★128

A Python package for interacting with the MinerU Vision-Language Model.

Python

Updated June 11, 2026

laion5b-downloader

★121

No description is provided for this repository.

Python

Updated April 19, 2026

MinerU-Ecosystem

★118

No description is provided for this repository.

Python

Updated June 9, 2026

VHM

★117

VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

Python

Updated June 8, 2026

HA-DPO

★104

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

Python

Updated May 24, 2026

OHR-Bench

★102

(ICCV 2025) OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

Python

Updated June 3, 2026

VIGC

★97

AAAI 2024: Visual Instruction Generation and Correction

Python

Updated May 4, 2026

MLS-BRN

★89

[CVPR 2024] 3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions

Python

Updated June 1, 2026

Vis3

★87

Data browser based on S3. A visualization tool for data stored in S3 (json / jsonl / parquet / html / md, etc.). 👇 Try online.

TypeScript

Updated June 1, 2026

skydiffusion

★81

[ICCV 2025] The official implementation of the paper “Street-to-Satellite Image Synthesis with Diffusion Models and BEV Paradigm”

Python

Updated May 29, 2026

LEGION

★78

[ICCV25 Highlight] The official implementation of the paper "LEGION: Learning to Ground and Explain for Synthetic Image Detection"

Python

Updated May 27, 2026

CiteVQA

★68

No description is provided for this repository.

Python

Updated June 7, 2026

CLIP-Parrot-Bias

★66

[ECCV 2024] Parrot Captions Teach CLIP to Spot Text

Python

Updated November 19, 2025

opendatalab-python-sdk

★60

SDK of OpenDataLab - https://opendatalab.org.cn

Python

Updated May 9, 2026

MLLM-DataEngine

★48

MLLM-DataEngine: An Iterative Refinement Approach for MLLM

Python

Updated September 11, 2025

WanJuan3.0

★46

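WanJuan3.0 ("万卷·丝路") is a comprehensive plain-text corpus that collects publicly available web information, literature, patents, and other materials from multiple countries and regions. Its total data size exceeds 1.2 TB and its total token count exceeds 300B, which the authors describe as internationally leading. The first open-source release consists mainly of five subsets (Thai, Russian, Arabic, Korean, and Vietnamese), each exceeding 150 GB.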

Unknown language

Updated April 24, 2026

dsdl-docs

★46

Data Set Description Language Specification (DSDL, a next-generation description language for AI datasets)

HTML

Updated February 22, 2026

ProverGen

★45

[ICLR 2025] This is the official implementation for the paper: "Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation"

Python

Updated April 26, 2026

CHARM

★45

[ACL 2024 Main Conference] Chinese commonsense benchmark for LLMs

Python

Updated April 8, 2026

UrBench

★37

[AAAI 2025] This repo contains evaluation code for the paper “UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios”

Python

Updated April 8, 2026

TRivia

★34

(CVPR 2026) TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

Python

Updated June 4, 2026

REST

★34

No description is provided for this repository.

Python

Updated May 19, 2026

image-downloader

★30

No description is provided for this repository.

Python

Updated May 13, 2026

Sciverse-Agent-Tools

★27

Standardized tool schemas and SDKs that expose Sciverse Open Platform retrieval capabilities to LLM agents.

Python

Updated June 11, 2026

RxnCaption

★25

[CVPR 2026] SOTA Chemical Reaction Diagram Parsing Framework

Python

Updated May 13, 2026

labelbee

★25

No description is provided for this repository.

TypeScript

Updated August 7, 2024

Miner-PDF-Benchmark

★24

MPB (Miner-PDF-Benchmark) is an end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios.

Python

Updated February 2, 2026

mineru-tutorials

★18

MinerU Training Camp course materials and tutorials

Unknown language

Updated June 1, 2026

awesome-markdown-ebooks

★17

No description is provided for this repository.

Unknown language

Updated May 14, 2026

WebMainBench

★16

WebMainBench is a high-precision benchmark for evaluating web main content extraction.

Python

Updated April 20, 2026

CrossViewDiff

★16

The official implementation of the paper "CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis"

JavaScript

Updated January 31, 2026

WanJuan2.0-WanJuan-CC

★14

WanJuan-CC is a high-quality dataset built from CommonCrawl through data extraction, rule-based cleaning, deduplication, safety filtering, and quality cleaning.

Unknown language

Updated May 26, 2026

PM4Bench

★14

No description is provided for this repository.

Python

Updated July 24, 2025

dsdl-sdk

★13

No description is provided for this repository.

Jupyter Notebook

Updated May 29, 2024

OpenHuEval

★12

No description is provided for this repository.

Python

Updated May 12, 2026

MinerU-Webkit

★10

No description is provided for this repository.

HTML

Updated June 9, 2026

labelU-frontend

★9

LabelU front-end library

TypeScript

Updated January 27, 2026

awesome-mineru

★8

🕶️ A curated list of awesome things related to MinerU

Python

Updated June 4, 2026

allz

★7

A universal command line tool for compression and decompression

Python

Updated June 3, 2026

MolRecBench-Wild

★4

No description is provided for this repository.

Python

Updated June 5, 2026

CRaFT

★4

[AAAI25] Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning

Python

Updated July 3, 2025

GRAIT

★3

[NAACL25 findings] Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation

Python

Updated August 16, 2025

OmniDocLayout

★2

[CVPR26 Highlight] The official implementation of the paper "OmniDocLayout: Towards Diverse Document Layout Generation via Coarse-to-Fine LLM Learning"

Unknown language

Updated June 12, 2026