Cerebral Valley

Hi folks, today we are launching Fineweb-Edu-Fortified, an open dataset of unique documents of high-quality educational content from the web, augmented with embeddings.

<https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified>

<https://huggingface.co/datasets/HuggingFaceFW/fineweb|Fineweb> is the largest open dataset of crawled content from the web. <https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu|Fineweb-Edu> is a subset of it filtered for high-quality educational content. We further processed Fineweb-Edu to remove duplicate entries (72%) and augmented the remaining rows with embeddings.

• Exact-match deduplication across all crawls
• Embeddings for each row using the <https://huggingface.co/TaylorAI/bge-micro|TaylorAI/bge-micro> model
• Count column indicating duplication frequency
• Includes data from 95 Common Crawl crawls (2013-2024)
• Rows have been reduced from 1.279B to 0.324B after deduplication
• It comprises ~375B tokens (down from 1,320B in Fineweb-Edu)
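
As a quick sanity check on the figures in the list above, the reduction ratios can be computed directly. This is only back-of-the-envelope arithmetic on the rounded numbers quoted here; the variable names are illustrative.

```python
# Figures as quoted in the bullets above (rounded, so results are approximate).
rows_before, rows_after = 1.279e9, 0.324e9
tokens_before, tokens_after = 1320e9, 375e9

# Fraction removed = 1 - (what is left / what we started with).
row_reduction = 1 - rows_after / rows_before
token_reduction = 1 - tokens_after / tokens_before

print(f"rows removed:   {row_reduction:.1%}")
print(f"tokens removed: {token_reduction:.1%}")
```

With these inputs, roughly 74.7% of rows and 71.6% of tokens are removed, so the 72% quoted earlier appears to line up with the token count rather than the row count.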

You can explore and download the dataset on Hugging Face, or visualize a 500k sub-sample in the Airtrain <https://app.airtrain.ai/dataset/c232b33f-4f4a-49a7-ba55-8167a5f433da/null/1/0|dataset explorer>.
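
For anyone curious what exact-match deduplication with a count column looks like in practice, here is a minimal sketch. It is not the pipeline we ran at scale: the field names `text`, `dump`, and `count`, the sample rows, and the choice of SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib


def dedup_exact(rows):
    """Keep the first occurrence of each distinct text and count its repeats.

    `rows` is an iterable of dicts with a "text" field (hypothetical schema).
    Returns a list of dicts, each with an added "count" field.
    """
    seen = {}  # hash of text -> first row seen with that text
    for row in rows:
        # Hash the full text so only byte-identical documents collide.
        key = hashlib.sha256(row["text"].encode("utf-8")).hexdigest()
        if key in seen:
            seen[key]["count"] += 1
        else:
            seen[key] = {**row, "count": 1}
    # Dicts preserve insertion order, so first-seen order is kept.
    return list(seen.values())


# Made-up sample: the same document appears in two different crawls.
rows = [
    {"text": "Photosynthesis converts light into chemical energy.", "dump": "crawl-A"},
    {"text": "A prime number has exactly two divisors.", "dump": "crawl-A"},
    {"text": "Photosynthesis converts light into chemical energy.", "dump": "crawl-B"},
]

unique_rows = dedup_exact(rows)
for r in unique_rows:
    print(r["count"], r["dump"], r["text"])
```

Because the match is exact, documents that differ by even one character are kept as separate rows; near-duplicate detection would need a different technique.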