I’m excited to share Datalab, a linter for ML datasets (Cerebral Valley, #07-self-promotion)


Jonas Mueller

05/17/2023, 7:50 PM

👋 I’m excited to share Datalab, a linter for ML datasets. We recently published a blog post introducing Datalab and an open-source Python implementation that is easy to use for all data types (image, text, tabular, audio, etc.). For data scientists, I’ve made a quick Jupyter tutorial for running Datalab on your own data. All of us who have dealt with real-world data know it’s full of issues like label errors, outliers, (near) duplicates, drift, etc. One line of open-source code

datalab.find_issues()

automatically detects all of these issues. In Software 2.0, data is the new code, models are the new compiler, and manually defined data validation is the new unit test. Datalab combines any ML model with novel data quality algorithms to provide a linter for this Software 2.0 stack that automatically analyzes a dataset for “bugs”. Unlike data validation, which runs checks that you define manually from domain knowledge, Datalab adaptively checks for the issues that most commonly occur in real-world ML datasets, without you having to specify their potential form. Whereas traditional dataset checks are based on simple statistics/histograms, Datalab’s checks consider all the pertinent information learned by your trained ML model. Hope Datalab helps you automatically check your dataset for issues that may negatively impact subsequent modeling. It’s so easy to use that you have no excuse not to 😛 LMK your thoughts!
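To make the idea of model-based dataset checks concrete, here is a minimal sketch in plain Python. It is not Datalab's algorithm or API: the function names, the `threshold` cutoff, and the toy data are all assumptions made for illustration. It flags examples whose given label receives a low predicted probability from a trained classifier, and finds exact duplicate rows as a crude stand-in for (near) duplicate detection.

```python
def flag_likely_label_errors(labels, pred_probs, threshold=0.5):
    """Return indices of examples whose given label looks suspicious.

    labels: given class index for each example.
    pred_probs: per-example list of class probabilities from any trained model.
    threshold: hypothetical cutoff (an assumption, not a Datalab parameter).
    """
    flagged = []
    for i, (label, probs) in enumerate(zip(labels, pred_probs)):
        # Probability the model assigns to the label the dataset claims.
        self_confidence = probs[label]
        if self_confidence < threshold:
            flagged.append((i, self_confidence))
    # Least confident examples first, so they get reviewed first.
    flagged.sort(key=lambda item: item[1])
    return [i for i, _ in flagged]


def find_exact_duplicates(rows):
    """Return (first_index, duplicate_index) pairs for identical rows."""
    seen = {}
    duplicates = []
    for i, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            duplicates.append((seen[key], i))
        else:
            seen[key] = i
    return duplicates


# Toy data (made up): example 2 is labeled class 1, but the model
# assigns that class only 5% probability.
labels = [0, 1, 1, 0]
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.95, 0.05], [0.6, 0.4]]
suspect = flag_likely_label_errors(labels, pred_probs)

rows = [[1.0, 2.0], [3.0, 4.0], [1.0, 2.0]]
dupes = find_exact_duplicates(rows)
```

The point of the sketch is the contrast drawn in the message: neither check needs a hand-written rule about what a bad example looks like, only the dataset and a model's outputs.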

🔥 9

Jonas Mueller

05/17/2023, 7:51 PM

My startup Cleanlab will soon start doing events in SF, so stay tuned!

🎤 1

🎧 2

James Le

05/24/2023, 4:31 PM

@Jonas Mueller Please keep me posted for Cleanlab events 🙂

👍 1

❤️ 1
