RSS Aggregators Should Offer Bayesian Classifiers
Posted on October 12th, 2005
I was just reading Ben Kamen's paper on applying Bayesian sorting techniques to document categorization beyond spam versus not-spam. This is a step forward as far as utility goes, but it's still a step within the world of email. What if the next step forward was really a step from the world of email to the world of RSS and ATOM?
I can't think of any RSS application offhand that considers the potential of Bayesian sorting. You might say that it's unnecessary in the first place, since subscribing to a syndicated feed presumes interest and trust in the source. The source of an ATOM or RSS feed is known from the get-go and stays constant; the source of an email may or may not be immediately apparent, what with the tricks spammers use to monkey around with the To: and From: fields of their messages.
I think this is naive on two levels. For one thing, a feed from a known "good" source may fluctuate between interesting and uninteresting content. The user makes that distinction much the way you pick and choose which stories to read in a print newspaper: by considering headlines and maybe reading the first few sentences. When you monitor hundreds of feeds and start to build up a backlog of unread stories, answering “Is this worth reading?” becomes a real chore.
For another, feeds aren't necessarily from known sources. A PubSub feed may pull in matches from who knows where. You can flex your Boolean logic muscles all you want, but at some point you'll run into the tradeoff between accuracy and comprehensiveness: tighten the query and you miss relevant posts, loosen it and you pull in noise. More importantly, an accurate match may not be an interesting match.
Kamen's paper describes how Fog Creek's FogBugz software sorts incoming email into categories like Technical Questions and Customer Service Questions. The generality is important. Bayesian techniques distinguish between the two in a more sophisticated way than just asking whether a particular word is present. You'd want to preserve that generality when you established categories for your ATOM and RSS reading. The first ones that come to mind are Interesting and Not Interesting. This could simplify the reading process quite a bit: no more hopping from unread entries in one feed to unread entries in another feed. You'd just have two buckets.
Another approach might involve categories based on priority, where High, Medium, and Low represent declining levels of actionable information. Low might capture spam content as well, either blatant spam or incidental spam.
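To make the idea concrete, here is a minimal sketch in Python of the kind of classifier an aggregator might bundle. It is plain multinomial naive Bayes with add-one smoothing, which is my assumption about a reasonable starting point, not the algorithm FogBugz uses or anything taken from Kamen's paper. The class name NaiveBayesSorter, the tokenizer, and the sample headlines are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSorter:
    """Hypothetical sorter: multinomial naive Bayes over any set of labels."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.entry_counts = Counter()            # label -> trained entries
        self.vocabulary = set()                  # every word seen in training

    def train(self, text, label):
        """Record one entry the reader has filed under `label`."""
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.entry_counts[label] += 1
        self.vocabulary.update(words)

    def scores(self, text):
        """Return an unnormalized log-probability for each trained label."""
        words = tokenize(text)
        total_entries = sum(self.entry_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for label, n_entries in self.entry_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Prior: how often the reader has used this label so far.
            score = math.log(n_entries / total_entries)
            # Likelihood: add-one smoothing keeps unseen words from
            # driving the probability to zero.
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            result[label] = score
        return result

    def classify(self, text):
        """Pick the highest-scoring label (needs at least one trained entry)."""
        scores = self.scores(text)
        return max(scores, key=scores.get)


# Made-up training data: the labels are whatever buckets the reader chooses.
sorter = NaiveBayesSorter()
sorter.train("New release fixes parser bug in feed reader", "Interesting")
sorter.train("Bayesian filter tutorial for feed readers", "Interesting")
sorter.train("Cheap watches buy now limited offer", "Not Interesting")
sorter.train("Win a free cruise click now", "Not Interesting")

print(sorter.classify("Tutorial on Bayesian feed filter"))  # Interesting
print(sorter.classify("Free cruise offer now"))             # Not Interesting
```

Nothing in the sketch is tied to two buckets. Training it with High, Medium, and Low labels instead works the same way, since every label just gets its own word counts. A real aggregator would need far more training data than four headlines, and probably the entry body as well as the title.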
With categorizations like these in hand and adequate filter training, an unread entry would take on fresh significance based on how it had been categorized, just as an unread email in your Inbox is more significant than an unread message in your spam folder. The closest thing you have right now when reading feeds is the ability to put “important” feeds in one aggregator folder and secondary feeds in another. That sorts by feed, not by entry, so there's no way to distinguish the occasional inane post coming out of an “important” feed from the occasional important post coming out of a secondary feed.