# 07-self-promotion
👋 I’m excited to share Datalab, a linter for ML datasets. We recently published a blog post introducing Datalab, along with an open-source Python implementation that is easy to use for all data types (image, text, tabular, audio, etc.). For data scientists, I’ve made a quick Jupyter tutorial for running Datalab on your own data. All of us who have dealt with real-world data know it’s full of issues like label errors, outliers, (near) duplicates, drift, etc. One line of open-source code
datalab.find_issues()
automatically detects all of these issues.

In Software 2.0, data is the new code, models are the new compiler, and manually defined data validation is the new unit test. Datalab combines any ML model with novel data quality algorithms to provide a linter for this Software 2.0 stack that automatically analyzes a dataset for “bugs”. Unlike data validation, which runs checks that you define manually from domain knowledge, Datalab adaptively checks for the issues that most commonly occur in real-world ML datasets, without you having to specify their potential form. Whereas traditional dataset checks are based on simple statistics/histograms, Datalab’s checks consider all the pertinent information learned by your trained ML model.

Hope Datalab helps you automatically check your dataset for issues that may negatively impact subsequent modeling. It’s so easy to use that you have no excuse not to 😛 LMK your thoughts!
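To make the validation-vs-linter contrast concrete, here’s a rough toy sketch in plain Python. It is not Datalab’s actual algorithm or API: the function names, the `threshold` value, and the tiny dataset are all made up for illustration. The first check is a hand-written rule you have to think of in advance; the second uses a model’s predicted probabilities to flag examples whose given label the model finds very unlikely.

```python
# Toy illustration only: NOT Datalab's actual algorithm or API.
# All names (manual_validation, model_based_label_check, threshold) are hypothetical.

def manual_validation(rows):
    """Hand-written rule: you must already know what kind of problem to look for."""
    return [i for i, row in enumerate(rows) if not (0 <= row["age"] <= 120)]


def model_based_label_check(labels, pred_probs, threshold=0.2):
    """Flag examples where the model gives the provided label a very low probability."""
    flagged = []
    for i, (label, probs) in enumerate(zip(labels, pred_probs)):
        if probs[label] < threshold:
            flagged.append(i)
    return flagged


# Made-up data: four examples, two classes (0 and 1).
rows = [{"age": 34}, {"age": 51}, {"age": 29}, {"age": 400}]
labels = [0, 1, 1, 0]
pred_probs = [
    [0.90, 0.10],
    [0.15, 0.85],
    [0.95, 0.05],  # labeled 1, but the model is almost sure it is class 0
    [0.60, 0.40],
]

manual_flags = manual_validation(rows)
model_flags = model_based_label_check(labels, pred_probs)
print("Flagged by the manual rule:", manual_flags)
print("Flagged by the model-based check:", model_flags)
```

The manual rule only catches the out-of-range age it was written for, while the model-based check surfaces a likely label error that nobody wrote a rule for. The real checks are more sophisticated than a fixed threshold; this only shows the general idea.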
🔥 9
My startup Cleanlab will soon start doing events in SF, so stay tuned!
🎤 1
🎧 2
@Jonas Mueller Please keep me posted for Cleanlab events 🙂
👍 1
❤️ 1