A question for those of you doing text classification: I'm looking to classify around 50,000 HTML pages into around half a dozen distinct categories (including an 'unknown' or 'uncategorized' category), based on a hand-tagged training set of around 1,000 pages. To my eye, it looks like most of the classification signal is in the non-text parts of the pages (CSS classes, scripts, links, comments, etc.).
My first thought was to use a naive Bayes classifier, but that led to two questions: (1) is naive Bayes still a good approach here, or are there better approaches you'd try? And (2) if I do use naive Bayes, it's not obvious to me how to split the messy non-text parts of the HTML into tokens I could run the inner loop of the classifier on, the way I would with words if I were following the classic spam-filter example.
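To make (2) concrete, here's the kind of thing I was imagining: a rough, stdlib-only sketch where the "words" fed to the Bayes inner loop are structural tokens (tag names, CSS classes, link/script targets, comment text) instead of visible text. All the class names and sample pages below are made up for illustration, and the classifier is a bare-bones multinomial naive Bayes with Laplace smoothing, not anything tuned:

```python
import math
from collections import Counter, defaultdict
from html.parser import HTMLParser

class StructureTokenizer(HTMLParser):
    """Emit one token per structural feature, ignoring visible text."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"tag:{tag}")
        for name, value in attrs:
            if value is None:
                continue
            if name == "class":
                # each CSS class becomes its own token
                self.tokens.extend(f"class:{c}" for c in value.split())
            elif name in ("href", "src"):
                # drop query strings so URLs aggregate better
                self.tokens.append(f"{name}:{value.split('?')[0]}")

    def handle_comment(self, data):
        self.tokens.extend(f"comment:{w}" for w in data.split())

def tokenize(html):
    parser = StructureTokenizer()
    parser.feed(html)
    return parser.tokens

class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one smoothing."""
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            toks = tokenize(doc)
            self.token_counts[label].update(toks)
            self.vocab.update(toks)

    def predict(self, doc):
        toks = tokenize(doc)
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_lp = None, float("-inf")
        for label, n in self.class_counts.items():
            lp = math.log(n / total)  # class prior
            denom = sum(self.token_counts[label].values()) + vocab_size
            for t in toks:
                lp += math.log((self.token_counts[label][t] + 1) / denom)
            if lp > best_lp:
                best_label, best_lp = label, lp
        return best_label

# Toy training set; labels and markup are invented for illustration.
docs = [
    '<div class="product price"><script src="/js/cart.js"></script></div>',
    '<div class="post byline"><a href="/blog/x">read</a></div>',
]
labels = ["shop", "blog"]
nb = NaiveBayes()
nb.fit(docs, labels)
```

With this setup `nb.predict('<span class="price"></span>')` leans "shop" because the `class:price` token was only seen under that label. The open question for me is whether tokens this coarse carry enough signal, or whether the URL and class strings need smarter splitting (e.g. on hyphens and path segments).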
I'm likely good with either a code-based or a SaaS-based solution, if anyone has favorite text classifiers or techniques to suggest.