Tue, 11 Jun 2013
De-spammed this blog (with Naive Bayes)
This morning, I was trying to decrease the amount of email in my inbox. I had a few messages with subjects like:
- comment on http://www.asheesh.org/note/debian/freed-software
- comment on http://www.asheesh.org/note/sysop/comments
- comment on http://www.asheesh.org/note/debian/freed-software
But all the comments in this case were spam. I'm using an Akismet API plugin for pyblosxom, but that has a few shortcomings. Like any spam filter, it misses some spam; more importantly, it doesn't help me find and remove old spam comments in bulk.
My pattern with email is basically to ignore it for a while, and then deal with it in bulk, sometimes missing messages from the past. The result is that I have often missed these comment notifications, and it was a bit of a drag to figure out which comments I had dealt with already.
So I wrote a small tool this morning. Here is how it works:
- It loops over the comments directory.
- A helper script reads each comment and prints it to standard output in mailbox format, and that output is piped to spambayes for processing.
- The main script shows me spambayes' guess as to whether the message is spam, along with spambayes' certainty, and asks me to confirm. If I confirm it is spam, it asks if I want to delete it. (If it notices spambayes got it wrong, it retrains spambayes.)
- After I have dealt with the comment, it creates a stamp file next to the comment so that it won't ask me about that comment the next time I run the tool.
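The loop above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository, and it rests on several assumptions: comments are taken to be files ending in `.cmt`, the stamp file is given a hypothetical `.reviewed` suffix, and the names `pending_comments`, `to_mbox`, `stamp_path` and `review` are made up for this example. The classifier, the confirmation prompt and the retraining step are passed in as plain functions, so nothing here depends on how spambayes is actually invoked. For brevity, the sketch also folds the separate "delete it?" question into the confirmation.

```python
from email.message import EmailMessage
from pathlib import Path

COMMENT_SUFFIX = ".cmt"     # assumption: how comment files are named on disk
STAMP_SUFFIX = ".reviewed"  # hypothetical suffix for the stamp file


def stamp_path(comment):
    """Path of the stamp file that sits next to a comment file."""
    return comment.with_name(comment.name + STAMP_SUFFIX)


def pending_comments(comments_dir):
    """Comment files under comments_dir that have no stamp file yet."""
    return sorted(
        p for p in Path(comments_dir).rglob("*" + COMMENT_SUFFIX)
        if not stamp_path(p).exists()
    )


def to_mbox(subject, body):
    """Render a comment as a single mailbox-format message."""
    msg = EmailMessage()
    msg["From"] = "comment@localhost"  # placeholder sender
    msg["Subject"] = subject
    msg.set_content(body)
    # The leading "From " line is what marks a message in an mbox file.
    msg.set_unixfrom("From comment@localhost Tue Jun 11 00:00:00 2013")
    return msg.as_string(unixfrom=True)


def review(comments_dir, classify, ask, retrain):
    """Run one moderation pass and return the comments that were deleted.

    classify(message) -> (is_spam, certainty)   stand-in for the spam filter
    ask(comment, guess, certainty) -> bool      the human verdict (True = spam)
    retrain(message, is_spam)                   called when the guess was wrong
    """
    deleted = []
    for comment in pending_comments(comments_dir):
        text = comment.read_text(encoding="utf-8", errors="replace")
        message = to_mbox(comment.name, text)
        guess, certainty = classify(message)
        verdict = ask(comment, guess, certainty)
        if verdict != guess:
            retrain(message, verdict)  # the filter got it wrong
        if verdict:
            comment.unlink()           # confirmed spam: delete it
            deleted.append(comment)
        else:
            stamp_path(comment).touch()  # reviewed: don't ask again
    return deleted
```

In a real run, `classify` would hand the message to the spam filter and parse its answer, and `ask` would prompt at the terminal. Keeping them as parameters makes the stamp-file bookkeeping easy to try out on a scratch directory first.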
Voilà! A spam moderation queue with artificial intelligence.
You can find it here, on my GitHub account: https://github.com/paulproteus/spambayes-pyblosxom
Permission to re-use the code is granted under the terms of CC Zero or Apache License 2.0, at your option.
Moreover, now I believe there are zero spam comments left lying around this blog!