Development Seed Blog

Drupal, Meet Python

Speeding Up Aggregation in Drupal with a Daemon

Recently we’ve been working to get our team aggregator and media analyzer, Managing News, running faster; we want it to fly. To do this we’ve had to really push what Drupal and the LAMP stack can do. Aron and Alex have done great things with aggregation and feed parsing to extend the volume of content that can be collected, which is essential for Managing News since it can aggregate tens of thousands of articles every day. But we still wanted to be able to aggregate content faster.

A major thing that slows down adding content to Managing News is its semantic analysis: every piece of content that the system pulls in is automatically analyzed and given tags that describe it. To do this we mostly use third-party web services like Yahoo's term extraction API. Waiting for Yahoo to process the text of each incoming article can add a significant amount of time to the process of adding content. It keeps cron running longer, which increases the chances of a bad cron run, and it ties up the system’s resources just waiting.

To get around this we looked for ways to move the semantic analysis elsewhere. We settled on the idea of using a small external program to talk to tagging web services and do its own analysis of content in our Drupal system. We looked at a couple of options, and quickly decided to write a Python daemon. (A daemon is a program that runs in the background on a server. It waits for certain events and then takes specified actions. In our case the daemon waits for new content, and then tags and processes it as it becomes available.)

We chose Python over PHP for two main reasons:

  • We wanted to get away from the LAMP stack. Apache threads serving Drupal pages can consume upwards of 30 MB of memory. The Python script we've got uses less than 4 MB. Potentially we could run 7 or 8 instances of the Python script with the same memory footprint as a single Drupal page load.
  • Daemonizing a process with Python is easy.
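
To give a feel for the second point, here is a minimal sketch of the classic Unix double-fork recipe plus a simple polling loop. It is an illustration only, not the Managing News code: the names daemonize and run_loop, the callables passed in, and the polling interval are all made up for this example, and it assumes a Unix-like system.

```python
import os
import time


def daemonize():
    """Detach the current process using the classic Unix double fork."""
    if os.fork() > 0:
        os._exit(0)      # first parent exits; the child keeps running
    os.setsid()          # start a new session, dropping the controlling terminal
    if os.fork() > 0:
        os._exit(0)      # session leader exits; the grandchild cannot regain a tty
    os.chdir("/")        # avoid keeping any directory busy
    os.umask(0)
    # Point stdin, stdout and stderr at /dev/null.
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)
    if devnull > 2:
        os.close(devnull)


def run_loop(fetch_batch, handle_batch, keep_running, interval=5.0, sleep=time.sleep):
    """Poll for work until keep_running() returns False.

    fetch_batch, handle_batch and keep_running are hypothetical callables
    supplied by the caller. Returns the number of items handled.
    """
    handled = 0
    while keep_running():
        batch = fetch_batch()
        if batch:
            handle_batch(batch)
            handled += len(batch)
        else:
            sleep(interval)  # nothing queued; wait before polling again
    return handled
```

A script built this way would call daemonize() once at startup and then hand run_loop() whatever functions fetch and process its work.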

Our current version consists of a Drupal module and Python scripts. The Drupal module serves three functions: 1) it queues up new nodes for the daemon to process, 2) it allows you to configure the daemon's behavior, 3) it creates tables inside the Drupal database that the daemon needs.
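
The queueing idea can be sketched in a few lines. This is a hedged illustration, not our schema: the table name daemon_queue, its columns, and the functions create_queue, enqueue and claim_batch are invented for the example, and it uses Python's built-in sqlite3 as a stand-in for the Drupal database so that it runs anywhere.

```python
import sqlite3


def create_queue(conn):
    """Create the (hypothetical) queue table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daemon_queue ("
        " nid INTEGER PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'queued')"
    )


def enqueue(conn, nid):
    """Queue a node ID; queueing the same node twice is a no-op."""
    conn.execute("INSERT OR IGNORE INTO daemon_queue (nid) VALUES (?)", (nid,))


def claim_batch(conn, size):
    """Mark up to `size` queued nodes as claimed and return their IDs."""
    rows = conn.execute(
        "SELECT nid FROM daemon_queue WHERE status = 'queued' ORDER BY nid LIMIT ?",
        (size,),
    ).fetchall()
    nids = [row[0] for row in rows]
    conn.executemany(
        "UPDATE daemon_queue SET status = 'claimed' WHERE nid = ?",
        [(nid,) for nid in nids],
    )
    conn.commit()
    return nids
```

In this sketch the Drupal side would only ever insert rows, and the daemon side would only ever claim them, which keeps the two halves loosely coupled.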

The Python programs consist of a very thin database wrapper, a minimal interface to Drupal for loading nodes and working with taxonomy terms, a daemonizing script, and finally a set of plugins to process nodes. Each time a new story from an RSS feed is added to the system, it’s put in a queue for the daemon. The daemon then grabs batches of nodes from this queue and processes them together, first giving the plugins a chance to act on each node individually and then as a batch.
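
The two-pass plugin flow described above might look something like the following. Treat it as a sketch under assumptions: the Plugin base class, the process_node and process_batch hooks, the two example plugins, and the dictionary shape used for nodes are all hypothetical and are not taken from the actual daemon.

```python
class Plugin:
    """Base class: plugins override whichever hook they need."""

    def process_node(self, node):
        pass

    def process_batch(self, nodes):
        pass


class KeywordTagger(Plugin):
    """Per-node plugin: tags a node with every keyword found in its body."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]

    def process_node(self, node):
        text = node["body"].lower()
        for keyword in self.keywords:
            if keyword in text and keyword not in node["tags"]:
                node["tags"].append(keyword)


class TagCounter(Plugin):
    """Batch plugin: counts how often each tag appears across the batch."""

    def __init__(self):
        self.counts = {}

    def process_batch(self, nodes):
        for node in nodes:
            for tag in node["tags"]:
                self.counts[tag] = self.counts.get(tag, 0) + 1


def run_plugins(nodes, plugins):
    """First let every plugin see each node, then let each see the whole batch."""
    for node in nodes:
        for plugin in plugins:
            plugin.process_node(node)
    for plugin in plugins:
        plugin.process_batch(nodes)
    return nodes
```

Splitting the hooks this way lets a plugin that calls a web service work one node at a time, while a plugin that compares nodes with each other waits until the whole batch has been tagged.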

Already this has been a great addition to Managing News. Aggregation is lighter and faster, and the system itself requires fewer resources, which means we can aggregate more. We've got lots of ideas on other things we can do with our little daemon and are very excited to see what we can use it for.

And yes, we do intend to release the code. But as total Python newbs, we want to go over it with a fine-toothed comb to make sure we're not doing anything too offensive to any Pythonistas hidden within the Drupal community :)

Comments
Drupy in Python is Almost Here

I am the founder and lead developer for the Drupy Project, which can be found at http://drupy.net . Our team is working hard to bring Drupal to Python. Drop by our IRC channel and say hi :)

python and php-cli

Why not write the program in php command line? The benefit of php-cli being that you can include the bootstrap.inc, bootstrap up to the database, and then use the normal Drupal API that you love rather than having to write your own system. This requires far fewer resources than a full LAMP+Drupal process while still being easy (just as easy, afaik) to daemonize.

If you want to venture into Python, great, but personally I prefer not mixing my scripting languages.

We talked about using php

We talked about using php for this and bootstrapping Drupal, but the way we are using our Python scripts is very focused and doesn't actually touch many Drupal tables. We didn't really see a need for the full Drupal API. Most of the work the daemon is doing is just contacting third-party services, or running some light text analysis. All of the data this yields is stored in either taxonomy or custom tables; we aren't creating or deleting nodes.

But you do have a point: for example, if we started working with CCK nodes in Managing News it would be a lot of work to load them, and it would make much more sense to use php-cli. But we don't have any plans to use CCK on Managing News, so it's not likely that particular issue will bite us.

With Python we were able to get a proof of concept working very quickly, and we've been able to make it more extensible and powerful while keeping its interaction with Drupal very clean. So for us it looks like it'll work well.

-jeff

Programming Collective Intelligence

A book you should definitely check out (if you haven't already) is Programming Collective Intelligence. It details techniques for collaborative filtering, detecting groups of similar items in large datasets, and other "machine intelligence" type activities. I've read through it and am itching for a project where I can apply its techniques.

As a bonus, all the sample code is in Python!

We've got a few copies of

We've got a few copies of that book here in the DS office, and it's a great read. Highly recommended.

-jeff

More joy from Python?

I'm not much of a programmer, but I do know Python much better than I know PHP. I was just curious, did you find Python any more enjoyable than PHP? Note, I didn't say better or more useful for Web development.

I've always felt that something about the structure of Python programming makes it easy and comfortable to follow...even when it's not your own code. With PHP...I've always had to take a step back...and spend more time figuring out what I'm looking at.

Personally I like working in

Personally I like working in Python; the syntax is nice and clean, and using the interpreter is just fun. There are a few things about Python that took some adjustment after spending so long with PHP; Unicode string handling was the really unexpected one. Surprisingly, though, it was nice to work in a language that has stronger type handling than PHP.

-jeff
