A question for those of you doing text classification: I'm looking to classify around 50,000 HTML pages into around half a dozen distinct categories (including an 'unknown' or 'uncategorized' category), based on a tagged set of around 1,000 pages. To my eye, it looks like most of the classification signal is in the non-text parts of the pages (classes, scripts, links, comments, etc.). My first thought was to use a naive Bayes classifier, but that led to two questions: (1) is naive Bayes still a good approach to this, or are there better approaches you'd try? And (2) if I were to use naive Bayes, it's not obvious to me how to split the messy non-text parts of the HTML into fragments I could run the inner loop of the classifier on, the way I would if I were following the classic spam classifier example. I'm likely good with either a code-based or SaaS-based solution, if anyone has any favorite text classifiers or techniques to suggest.
Naive Bayes is a pretty bad option and I'd stay away from it if you can. It makes a very strict and kind of weird assumption about your data: that every feature is independent of the others given the class. That breaks down when you have lots of correlated features, which messy HTML will give you. It's maybe a fine first pass but not really something I'd rely on for something as (seemingly) complex as the task you have in mind.
Thanks - I've been starting to conclude I need to do some pretty heavy pre-processing of the data (extracting classes, linked domains in a tags, image tags, etc.) to reduce the documents down to something that should be richer in signal and less full of noise.
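For what it's worth, here's a rough sketch of what that kind of pre-processing could look like, using only Python's standard library. It turns a page into a flat list of tokens for class names, linked or embedded domains, and comment words. The `StructureTokens` class, the `page_tokens` helper, and the token prefixes are made up for illustration; which signals are actually worth keeping is a guess you'd need to check against the tagged pages.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class StructureTokens(HTMLParser):
    """Collect non-text signals from a page as a flat list of string tokens.

    Hypothetical helper: the token prefixes below are an assumption,
    not an established convention.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # One token per CSS class name on the tag.
        for cls in (attrs.get("class") or "").split():
            self.tokens.append("class:" + cls)
        # One token per external domain referenced via href or src
        # (covers a, link, img, script); relative URLs have no domain.
        url = attrs.get("href") or attrs.get("src")
        if url:
            host = urlparse(url).netloc
            if host:
                self.tokens.append(tag + "-domain:" + host.lower())

    def handle_comment(self, data):
        # Lowercased comment words, prefixed to keep them apart from classes.
        for word in data.split():
            self.tokens.append("comment:" + word.lower())


def page_tokens(html):
    """Return the list of structural tokens for one HTML document."""
    parser = StructureTokens()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Each page then becomes a "bag of tokens", which is the same shape of input the classic spam-filter example works on, just with structural tokens instead of words.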
Once you've got that you can do all your standard stuff -- random forest/SVM/etc. I'd usually go for random forests
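As a minimal sketch of that step, assuming scikit-learn is available and each page has already been reduced to a list of string tokens: the toy token lists and the labels (`shopify`, `wordpress`) below are invented placeholders, not real data, and the hyperparameters are arbitrary starting points.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Invented stand-ins for the ~1,000 tagged pages: one token list per page.
train_tokens = [
    ["class:product-grid", "script-domain:cdn.shopify.com"],
    ["class:cart", "script-domain:cdn.shopify.com"],
    ["class:wp-block", "comment:wordpress"],
    ["class:wp-post", "comment:wordpress"],
]
train_labels = ["shopify", "shopify", "wordpress", "wordpress"]

model = make_pipeline(
    # The pages are already tokenized, so the analyzer just passes
    # each token list through and the vectorizer only counts tokens.
    CountVectorizer(analyzer=lambda tokens: tokens),
    RandomForestClassifier(n_estimators=200, random_state=0),
)
model.fit(train_tokens, train_labels)

# Tokens never seen during training are silently ignored at predict time.
new_tokens = [
    ["class:cart", "script-domain:cdn.shopify.com", "class:never-seen"],
    ["class:wp-block", "comment:wordpress"],
]
predictions = model.predict(new_tokens)
probabilities = model.predict_proba(new_tokens)
```

One possible way to handle the 'unknown' category is to look at `probabilities` and fall back to 'unknown' when no class is confident enough, but where to put that cutoff would have to be tuned on held-out tagged pages.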
Classic "scrape a large number of retailers" kind of problem, with the observation that lots of them use, say, Shopify or Magento or WordPress or any of a handful of other vendors.
Just a small note, if it's helpful: I'm not sure about the categories, but if they have something to do with visual cues, then the GPT-4 vision API does a great job of categorising or finding information.