A question for those of you doing text classification: I'm looking to classify around 50,000 HTML pages into around half a dozen distinct categories (including an 'unknown' or 'uncategorized' category), based on a hand-tagged training set of around 1,000 pages. To my eye, it looks like most of the classification signal is in the non-text parts of the pages (CSS classes, scripts, links, comments, etc.).
My first thought was to use a naive Bayes classifier, but that led to two questions: (1) is naive Bayes still a good approach here, or are there better approaches you'd try? And (2) if I do use naive Bayes, it's not obvious to me how to split the messy non-text parts of the HTML into tokens I could run the inner loop of the classifier on, the way I would with words if I were following the classic spam-filter example.
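To make (2) concrete, here's the kind of thing I was imagining: a rough, stdlib-only sketch where the "words" fed to the Bayes inner loop are structural tokens (tag names, CSS classes, link/script targets, comment text) instead of visible text. All the class names and sample pages below are made up for illustration, and the classifier is a bare-bones multinomial naive Bayes with Laplace smoothing, not anything tuned:

```python
import math
from collections import Counter, defaultdict
from html.parser import HTMLParser

class StructureTokenizer(HTMLParser):
    """Emit one token per structural feature, ignoring visible text."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"tag:{tag}")
        for name, value in attrs:
            if value is None:
                continue
            if name == "class":
                # each CSS class becomes its own token
                self.tokens.extend(f"class:{c}" for c in value.split())
            elif name in ("href", "src"):
                # drop query strings so URLs aggregate better
                self.tokens.append(f"{name}:{value.split('?')[0]}")

    def handle_comment(self, data):
        self.tokens.extend(f"comment:{w}" for w in data.split())

def tokenize(html):
    parser = StructureTokenizer()
    parser.feed(html)
    return parser.tokens

class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one smoothing."""
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            toks = tokenize(doc)
            self.token_counts[label].update(toks)
            self.vocab.update(toks)

    def predict(self, doc):
        toks = tokenize(doc)
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_lp = None, float("-inf")
        for label, n in self.class_counts.items():
            lp = math.log(n / total)  # class prior
            denom = sum(self.token_counts[label].values()) + vocab_size
            for t in toks:
                lp += math.log((self.token_counts[label][t] + 1) / denom)
            if lp > best_lp:
                best_label, best_lp = label, lp
        return best_label

# Toy training set; labels and markup are invented for illustration.
docs = [
    '<div class="product price"><script src="/js/cart.js"></script></div>',
    '<div class="post byline"><a href="/blog/x">read</a></div>',
]
labels = ["shop", "blog"]
nb = NaiveBayes()
nb.fit(docs, labels)
```

With this setup `nb.predict('<span class="price"></span>')` leans "shop" because the `class:price` token was only seen under that label. The open question for me is whether tokens this coarse carry enough signal, or whether the URL and class strings need smarter splitting (e.g. on hyphens and path segments).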
I'm likely good with either a code-based or a SaaS-based solution, if anyone has favorite text classifiers or techniques to suggest.