AICC Dataset

Processing pipeline

An advanced semantic-aware pipeline for large-scale web data processing

High-quality main content

High-fidelity main content extracted from diverse Common Crawl pages, including challenging types like forums, Q&A sites, and pages with tables or formulas.

Precise structured elements

High-fidelity extraction of code blocks, mathematical formulas, and complex tables from real-world web pages, preserving syntax, formatting, and structural integrity.
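As a rough illustration of what preserving table structure in an HTML-to-markdown conversion can involve, here is a minimal Python sketch built only on the standard library. It is not the AICC pipeline: the names TableToMarkdown and html_table_to_markdown are hypothetical, and the sketch assumes a single simple table with no rowspan/colspan and no nesting.

```python
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collect the cells of one simple HTML table (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse whitespace and escape pipes so the cell stays on one line.
            text = " ".join("".join(self._cell).split())
            self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_markdown(html: str) -> str:
    """Render the table found in `html` as a pipe-style markdown table."""
    parser = TableToMarkdown()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    # Pad short rows so every row has the same number of columns.
    width = max(len(row) for row in parser.rows)
    rows = [row + [""] * (width - len(row)) for row in parser.rows]
    lines = ["| " + " | ".join(rows[0]) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = ("<table><tr><th>Name</th><th>Score</th></tr>"
              "<tr><td>a|b</td><td> 1 </td></tr></table>")
    print(html_table_to_markdown(sample))
```

The first row is treated as the header, which is an assumption of this sketch; real pages often need heuristics for header detection, merged cells, and tables used purely for layout.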

Proven downstream effectiveness

Pretraining a language model on AICC yields higher accuracy across diverse benchmarks than pretraining on datasets extracted with other methods.

Data samples

See how web data transforms into markdown-formatted content

Mathematical Content
Code Blocks
Forum Discussions
Structured Tables

Try It Yourself

Experience real-time web-data processing with your own HTML

Please upload a single HTML file