# 06-technical-discussion
A question for those of you doing text classification: I'm looking to classify around 50,000 HTML pages into around half a dozen distinct categories (including an 'unknown' or 'uncategorized' category), based on a tagged set of around 1,000 pages. To my eye, it looks like most of the classification signal is in the non-text parts of the pages (classes, scripts, links, comments, etc.). My first thought was to use a naive Bayes classifier, but that led to two questions: (1) is naive Bayes still a good approach to this, or are there better approaches you'd try? And (2) if I were to use naive Bayes, it's not obvious to me how to split the messy non-text parts of the HTML into fragments I could run the inner loop of the classifier on, the way I would if I were following the classic spam classifier example. I'm likely good with either a code-based or SaaS-based solution, if anyone has any favorite text classifiers or techniques to suggest.
Naive Bayes is a pretty bad option and I'd stay away from it if you can. It makes a very strict and kind of weird assumption about your data: that every feature is independent of the others given the class. That tends to break down when you have lots of correlated features. It's maybe a fine first pass but not really something I'd rely on for something as (seemingly) complex as the task you have in mind.
(I'm doing a bit more research on what you should do, back in a sec)
Thanks - I've been starting to conclude I need to do some pretty heavy pre-processing of the data (extracting classes, linked domains in `a` tags and `img` tags, etc.) to reduce the documents down to something that should be richer in signal and less full of noise
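For what it's worth, here's a rough sketch of that kind of pre-processing using only the Python standard library. The names `StructuralTokenizer` and `structural_tokens` and the `class:` / `<tag>_domain:` / `comment:` token prefixes are made up for illustration, not anything from an existing tool, and which attributes actually carry signal is an assumption you'd want to check against your tagged pages.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class StructuralTokenizer(HTMLParser):
    """Collects non-text tokens from a page: CSS classes, linked domains, comment words."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # One token per CSS class name (attribute values can be None, hence the `or ""`).
        for cls in (attrs.get("class") or "").split():
            self.tokens.append(f"class:{cls}")
        # href covers <a>/<link>, src covers <img>/<script>; keep only the host part.
        for key in ("href", "src"):
            url = attrs.get(key)
            if url:
                host = urlparse(url).netloc.lower()
                if host:  # relative URLs have no host and are skipped
                    self.tokens.append(f"{tag}_domain:{host}")

    def handle_comment(self, data):
        # Comments sometimes contain vendor banners, so keep their words too.
        for word in data.split():
            self.tokens.append(f"comment:{word.lower()}")


def structural_tokens(html):
    """Return the list of structural tokens found in an HTML string."""
    parser = StructuralTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    page = '<div class="hero"><img src="//cdn.example.com/i.png"></div>'
    print(structural_tokens(page))
```

Joining the tokens with spaces gives one "document" per page that any bag-of-words style classifier can consume, which also answers the "what do I run the inner loop on" question.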
Yeah okay that's what I was going to suggest lol
Once you've got that you can do all your standard stuff -- random forest/SVM/etc. I'd usually go for random forests
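A minimal sketch of the random forest route with scikit-learn, assuming each page has already been boiled down to a whitespace-separated string of tokens. The toy training data, the helper names `build_classifier` and `predict_with_unknown`, and the 0.6 threshold are all placeholders. Falling back to 'unknown' when the top class probability is low is just one way to handle the uncategorized bucket, not something established above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline


def build_classifier():
    """Bag-of-tokens features feeding a random forest."""
    return make_pipeline(
        # Documents are pre-tokenized strings, so split on whitespace only.
        CountVectorizer(analyzer=str.split, binary=True),
        RandomForestClassifier(n_estimators=200, random_state=0),
    )


def predict_with_unknown(model, docs, threshold=0.6):
    """Predict a label per doc, returning 'unknown' when the model is not confident."""
    probs = model.predict_proba(docs)
    best = probs.argmax(axis=1)
    return [
        str(model.classes_[i]) if row[i] >= threshold else "unknown"
        for i, row in zip(best, probs)
    ]


if __name__ == "__main__":
    # Toy, made-up training data: one token string per tagged page.
    train_docs = [
        "class:shopify-section script_domain:cdn.shopify.com",
        "class:shopify-section img_domain:cdn.shopify.com",
        "class:wp-block link_domain:s.w.org",
        "class:wp-block script_domain:s.w.org",
    ]
    train_labels = ["shopify", "shopify", "wordpress", "wordpress"]

    model = build_classifier().fit(train_docs, train_labels)
    print(predict_with_unknown(model, ["class:wp-block", "class:something-else"]))
```

With ~1,000 tagged pages it's worth holding some out (or cross-validating) to pick the threshold rather than guessing it.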
It's a big problem you've got here, curious what it's for!
Thanks for that tip and the sanity-checking comments
Classic "scrape large number of retailers" kind of problem, with the observation that lots use, say, shopify or magento or wordpress or any of a handful of other vendors.
Oh dope, love it
Just a small note, if it's helpful: not sure about the categories, but if they have something to do with visual cues, the GPT-4 vision API does a great job of categorising or finding information.