Development Seed Blog

Sp*m: What We're Doing About It

Over the past few months, spam has become something like a daily conversation topic for us. It's now a nearly ubiquitous part of internet experience, showing up most prominently in email inboxes, website comment threads and discussion boards. Since we can’t seem to avoid it (and frequently get questions about it), we thought it would be worth addressing how we're thinking about spam in general and then sharing some of our plans for dealing with it.

One of the common questions we get is, “Why do spammers even bother?” It’s an honest question, since (most likely) neither you nor anyone you know has clicked on a link from a spammer in a long time. So why do they keep at it? Where could the money possibly be?

The answer also explains another trend: why spammers are increasingly focusing on websites instead of email. It's simple: search engine optimization. The number of links to a website can have a significant impact on that site’s ranking in search results, so spammers try to distribute as many links to their sites as possible in an attempt to trick search engines into putting those sites at the top of search results. To do this, spammers often use networks of “bots” – little virus-like programs that infect computers and lie dormant until they’re called up to duty.

Once activated, bots can work to spam websites simultaneously, and they use up their host computers’ resources at the same time. Wired published a fascinating article late last year that linked large and well-organized bot networks (“botnets”) to everything from destroying businesses to compromising national security for several countries. A large botnet crashed Typepad (one of the largest blog services), and a similar, well-coordinated attack could crash Google or Yahoo!. The graph below is from Subtraction.com and maps the influx of spam comments caught by Akismet:

Spam was once a concern to bloggers and site owners because it disrupts conversations and puts nasty content on their sites. Would that this were still the big concern... Since spammers benefit from large volumes of links being distributed around the web, spam can easily take sites completely offline as botnets attempt to post dozens (or hundreds, or thousands) of comments at once. Smaller organizations that rely on the economics of shared servers are particularly vulnerable to this problem. Even if they are well protected, their sites can be adversely affected by spam attacks against other websites hosted on the same server. Since we have several clients in this category, we’ve made this kind of spam our concern.

Luckily, most of these attacks are targeted at specific sites. Sometimes they are motivated directly by things like politics: Sen. Joseph Lieberman’s campaign site was crashed on Election Day 2006 by an as-yet-unknown assailant. In other cases, spammers discover loopholes in websites that make them friendly to spam and decide to keep coming back, or even increase the scale of their abuse of that site. The first and best line of defense, if it can be maintained, is definitely keeping your site off spammers’ radar.

Defending sites against spam used to be easy. It required little more than forcing users to sign up with valid email addresses before they could post comments. But with each new defense, spammers find a new loophole to keep their businesses running. We’ve seen spam bots recently that have learned how to create authenticated user accounts on Drupal sites in order to post spam.

So what do we do about it? Along with others in the Drupal community, Development Seed is trying to look at spam from every angle that we can in order to help keep our own sites and our clients’ sites up and running:

1) Block it at the router level:


This can be done at any point in order to block traffic from a given IP address, but it’s most helpful during an actual spam attack. During an attack, a host can directly block the offending IP addresses (if there are only one or two, rather than a large number of random IPs). Blocking IPs outright can inadvertently block legitimate traffic too.
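
A real router or firewall does this with its own rules, but the decision it makes can be sketched in a few lines. The following is only an illustration of the logic, not a router configuration: the addresses come from ranges reserved for documentation, and the names `BLOCKED_NETWORKS` and `is_blocked` are made up for this example.

```python
import ipaddress

# Hypothetical blocklist. One entry is a single address, the other a whole
# range; both are taken from blocks reserved for documentation.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.7/32"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(remote_addr: str) -> bool:
    """Return True if the address falls inside any blocked network."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for candidate in ("203.0.113.7", "198.51.100.42", "192.0.2.1"):
        print(candidate, "blocked" if is_blocked(candidate) else "allowed")
```

The range entry shows where the collateral damage comes from: every address in that /24 is refused, whether or not it ever sent spam.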

2) Block it at the Apache level:

mod_security is a module for Apache that uses rule sets to look for anomalies in requests to the web server in order to judge the legitimacy of the traffic. It then determines whether to serve pages from a web application, like Drupal, to the requester. http://www.modsecurity.org/

3) Block it at the application layer:

badbehavior is a module that will log HTTP requests for a website. It works with Drupal and other dynamic web platforms. Badbehavior compares HTTP requests against known valid and invalid user agents (these are configurable), and then decides whether or not to serve the requested page. If a request fails this check, the heavier, database-driven page request for Drupal (or whatever you’re running) will not be served. Instead, an error page displays explaining to the bot, or the mistaken user, that they do not appear legitimate. It provides the user with contact information to argue the ban against them in case it happens by mistake.
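
The general idea of screening a request before the expensive page is built can be sketched as follows. This is not badbehavior's actual rule set or code. The agent fragments, the contact address, the function name `screen_request`, and the rule that a missing user agent is refused are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical list of user-agent fragments treated as invalid.
BLOCKED_AGENT_FRAGMENTS = ("examplespambot", "bulk-poster")

# Hypothetical contact address shown to anyone blocked by mistake.
CONTACT = "webmaster@example.org"


def screen_request(headers: Dict[str, str]) -> Tuple[int, str]:
    """Decide cheaply whether to hand the request to the full application.

    Returns an HTTP status code and a short message. Only a 200 result
    should go on to the heavier, database-driven page build.
    """
    agent = headers.get("User-Agent", "").strip().lower()
    refusal = "Your request does not appear legitimate. Contact " + CONTACT

    # Assumption for this sketch: a request with no user agent is refused.
    if not agent:
        return 403, refusal

    if any(fragment in agent for fragment in BLOCKED_AGENT_FRAGMENTS):
        return 403, refusal

    return 200, "ok"


if __name__ == "__main__":
    print(screen_request({"User-Agent": "ExampleSpamBot/1.0"}))
    print(screen_request({"User-Agent": "Mozilla/5.0"}))
```

The point of the early return is cost: a refused request never touches the database, which matters most when a botnet is sending hundreds of requests at once.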

4) Block before posting:


You know those little image recognition quizzes on some sites that are used to prove that you’re human? This is an important step in the process that uses a module called Captcha. The one downside of Captcha is usability – users with vision problems or other access difficulties can be hindered by image recognition. (Not to mention people with perfect vision – have you seen some of the images that get served to users?) We would recommend very clean images be used with this method.

5) Block after posting:

Pre-configured spam filters scan IP addresses and compare them with databases of offending sites. Other types of spam filters use contextual filtering to analyze the content of a post, and then delete it if it’s recognized as spam. The Akismet service is currently the largest and most popular of these filters, and there is a Drupal module for integrating with Akismet.

6) Block long term:


These are our least favorite solutions, but are necessary in some cases. Options include turning comments off entirely, turning them off on older posts, or replacing the comment field with an email address or a general contact form (in which case email programs have to deal with the spam).

Overall, we've found that a mixed approach works best, meaning that the best defense is implementing all of these methods. Doing this well requires the time to monitor your site and respond to people who may write you because they were mistaken for a bot and cannot access your site, or cannot post a particular comment on your site. Even so, the time saved over dealing with constant spam is usually well worth it.

Comments
Good rundown!

But I'd vote for expanding 4 a bit more. I've had luck with a variety of non-captcha obfuscation techniques that can be taken -- JS obfuscation, rotating form targets, decoy form fields -- although I implemented most of them in Movable Type, not Drupal. But yeah, nothing beats Akismet.

Hey Tom,

Good to see you at the meetup.

We saw someone talking about this over here - http://www.subtraction.com/archives/2007/0211_loose_commen.php
w/ the comment on Mon 12 Feb 2007 at 01:23 AM

Is Neil's approach along the lines of the techniques you've used w/ Movable Type? Drupal could use a module working along these parameters. I think #4 is a little under-covered in approaches right now, esp. w/ captcha pretty broken in 4.7, though looks like it's finally working in 5, and can now be easily latched onto any form field.

Ian W.

Yup, the JS obfuscation he refers to is similar to what I've tried. Using some Javascript to fill in the form's action attribute can go a long way to discouraging spam. I wrote up an example here. It definitely helps, but some spambots are now running pages through a JS virtual machine, so this technique's running out of steam.

The more exciting technique I've implemented in Movable Type involved actually renaming the mt-comments.cgi script randomly via a cron job that runs every 15 minutes (leaving two copies of the script at any one time so that users who have the cron run between page load and submission aren't left with 404s). The FTP-based approach I took obviously wouldn't be relevant to Drupal -- it was written for a shared-host audience -- but the general idea could certainly be adapted.

Is there a way for a module to rewrite other module's hook_menu entry points? My guess is that the answer is no. But it's not too hard to imagine implementing a similar system on a module-by-module basis.

It's far from a silver bullet, of course, but in my experience it does limit the duration of any given spam outbreak to whatever the cron interval is.
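
One way the rotation idea might be adapted without renaming files is to derive the submission endpoint's name from the current time window and accept the current and previous names, which mirrors keeping two copies of the script alive. This is a different mechanism from the FTP and cron approach described above, sketched under assumptions: `SECRET`, `endpoint_name`, `current_endpoint`, and `is_valid_endpoint` are hypothetical names, and wiring this into Drupal's menu system is not shown.

```python
import hashlib
import hmac
import time

# Hypothetical private value; anyone who knows it can predict the names.
SECRET = b"replace-with-a-private-value"

# Same interval as the cron job in the comment above: 15 minutes.
WINDOW_SECONDS = 15 * 60


def endpoint_name(window_index: int) -> str:
    """Derive an unpredictable endpoint name for one time window."""
    message = str(window_index).encode("ascii")
    digest = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return "comment-" + digest[:16]


def current_endpoint(now: float) -> str:
    """Name to embed in the form's action when the page is rendered."""
    return endpoint_name(int(now // WINDOW_SECONDS))


def is_valid_endpoint(name: str, now: float) -> bool:
    """Accept the current name and the previous one.

    Accepting two names means a visitor who loaded the page just before
    a rotation can still submit, instead of getting a 404.
    """
    window = int(now // WINDOW_SECONDS)
    return name in (endpoint_name(window), endpoint_name(window - 1))


if __name__ == "__main__":
    now = time.time()
    name = current_endpoint(now)
    print(name, is_valid_endpoint(name, now))
```

A bot that harvested a form target stops getting through after at most two windows, which matches the observation that an outbreak is limited to roughly the rotation interval.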

One other idea for Drupal implementation from the MT spambusting community: blank form fields hidden with CSS. Spambots will tend to fill them in, in which case you simply deny the submission. That one would probably be pretty easy to implement in Drupal.
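
The decoy-field check is small enough to sketch in full. This is a generic illustration, not a Drupal module: the field name `website_url`, the CSS class in the form snippet, and the function `is_probable_bot` are hypothetical.

```python
from typing import Dict

# Hypothetical decoy field name. It should look tempting to a bot while
# meaning nothing to the real form.
HONEYPOT_FIELD = "website_url"

# Hypothetical markup: the field is hidden from people with CSS, so a
# human visitor leaves it blank.
FORM_SNIPPET = """
<style>.extra-field { display: none; }</style>
<input type="text" name="website_url" class="extra-field" value="">
"""


def is_probable_bot(form: Dict[str, str]) -> bool:
    """Return True if the hidden decoy field came back filled in."""
    return bool(form.get(HONEYPOT_FIELD, "").strip())


if __name__ == "__main__":
    human = {"comment": "Good rundown!", "website_url": ""}
    bot = {"comment": "Buy now", "website_url": "http://spam.example"}
    print(is_probable_bot(human), is_probable_bot(bot))
```

Submissions flagged this way would simply be denied. One caveat worth weighing: some browser autofill tools and screen readers may interact with hidden fields, so a rejected visitor should be given a way to get in touch.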