# 07-self-promotion
Hi folks, today we are launching Fineweb-Edu-Fortified, an open dataset of unique documents of high-quality educational content from the web, augmented with embeddings.
https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified

Fineweb is the largest open dataset of crawled content from the web. Fineweb-Edu is a subset of it filtered for high-quality educational content. We further processed it to remove duplicate entries (72%) and augmented it with embeddings.

• Exact-match deduplication across all crawls
• Embeddings for each row, computed with the TaylorAI/bge-micro model
• A count column indicating how often each document was duplicated
• Data from 95 Common Crawl crawls (2013-2024)
• Rows reduced from 1.279B to 0.324B after deduplication
• ~375B tokens (down from 1,320B in Fineweb-Edu)

You can explore and download the dataset on Hugging Face, or visualize a 500k sub-sample in the Airtrain dataset explorer.
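For anyone curious what exact-match deduplication with a count column can look like, here is a minimal Python sketch. It is not the pipeline used to build the dataset: the `dedupe_exact` helper, the `text` and `dump` field names, and the sample rows are all hypothetical and chosen only for illustration. It hashes each document's text, keeps the first occurrence, and counts how many times that exact text was seen.

```python
import hashlib


def dedupe_exact(rows):
    """Collapse rows with byte-identical text, keeping the first occurrence.

    Each surviving row gains a "count" field with the number of times its
    exact text appeared in the input. Field names are assumptions made for
    this sketch, not the dataset's confirmed schema.
    """
    seen = {}  # hash of text -> surviving row (dicts keep insertion order)
    for row in rows:
        key = hashlib.sha256(row["text"].encode("utf-8")).hexdigest()
        if key in seen:
            seen[key]["count"] += 1
        else:
            seen[key] = {**row, "count": 1}
    return list(seen.values())


# Illustrative rows only; the crawl labels are placeholders for this example.
rows = [
    {"text": "Photosynthesis converts light into chemical energy.", "dump": "CC-MAIN-2013-20"},
    {"text": "Photosynthesis converts light into chemical energy.", "dump": "CC-MAIN-2024-10"},
    {"text": "A prime number has exactly two divisors.", "dump": "CC-MAIN-2019-35"},
]

deduped = dedupe_exact(rows)
for row in deduped:
    print(row["count"], row["dump"], row["text"])
```

Hashing the text instead of storing it as the dictionary key keeps memory bounded per document, which matters at the scale of billions of rows; at that scale the same idea would normally run as a distributed group-by rather than an in-memory loop.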