Hi folks, today we are launching Fineweb-Edu-Fortified, an open dataset of unique documents of high-quality educational content from the web, augmented with embeddings.
https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified
Fineweb is one of the largest open datasets of crawled content from the web.
Fineweb-Edu is a subset of it filtered for high-quality educational content. We further processed it to remove duplicate entries (72%) and augmented it with embeddings.
• Exact-match deduplication across all crawls
• Embeddings for each row using the TaylorAI/bge-micro model
• Count column indicating duplication frequency
• Includes data from 95 Common Crawl crawls (2013-2024)
• Rows have been reduced from 1.279B to 0.324B after deduplication
• Contains ~375B tokens (down from 1,320B in Fineweb-Edu)
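For anyone curious what exact-match deduplication with a count column looks like in practice, here is a minimal sketch. It is not our actual pipeline: the `dedupe_exact` helper, the `text` and `crawl` field names, and the choice of SHA-256 hashing are assumptions made for illustration. It keeps the first occurrence of each distinct text and records how many times that text appeared across all rows.

```python
import hashlib
from collections import Counter


def dedupe_exact(rows, text_key="text"):
    """Keep the first row for each distinct text and attach a duplicate count.

    Hypothetical helper: field names and hashing scheme are illustrative only.
    """
    counts = Counter()
    first_seen = {}
    for row in rows:
        # Hash the full text so only byte-identical documents collide.
        digest = hashlib.sha256(row[text_key].encode("utf-8")).hexdigest()
        counts[digest] += 1
        # setdefault keeps the first row seen for this digest.
        first_seen.setdefault(digest, row)
    # Emit one row per distinct text, with a "count" of how often it occurred.
    return [{**row, "count": counts[digest]} for digest, row in first_seen.items()]


# Toy rows standing in for documents from different crawls.
rows = [
    {"text": "Photosynthesis converts light into chemical energy.", "crawl": "crawl-A"},
    {"text": "Photosynthesis converts light into chemical energy.", "crawl": "crawl-B"},
    {"text": "Prime numbers have exactly two divisors.", "crawl": "crawl-B"},
]

deduped = dedupe_exact(rows)
for row in deduped:
    print(row["count"], row["crawl"], row["text"])
```

Hashing keeps memory bounded by the number of distinct documents rather than their total size, which matters at this scale; at 1.279B rows a real run would also need to be sharded or distributed, which this sketch does not attempt.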
You can explore and download the dataset on Hugging Face, or visualize a 500k sub-sample in the Airtrain dataset explorer.