Development Seed Blog
Unclogging the Pipes: Improving Managing News' Feed Load
Getting wget and cron onto the same page
Managing News is starting to move past proof of concept, and we’ve officially begun working to improve its core performance. Earlier this week RH Reality Check, the latest organization we’ve customized the tool for, told us that Managing News wasn’t pulling in new content from RSS feeds as often as it should. In some cases feeds hadn’t been updated in a week, and an aggregator with week-old news certainly doesn’t enhance an organization’s communications capacity. Here's an image of one of its feeds "thinking."
When Alex looked into the issue, he found another one – duplicate feeds. Luckily what at first seemed like two big problems was fixed with just one tiny change (and a lot of Alex’s brainpower, of course).
These two issues, un-updated feeds and duplicate feeds, were related. Let me explain. Cron automatically checks for new content on the feeds entered in Managing News and is set up to check a certain number of feeds every ten minutes. It uses wget to request cron.php, which triggers the functions that bring new content into the Managing News system (we are using the leech module as the aggregator). Checking for new feed items was sometimes so slow that the functions in cron.php did not return early enough to deliver a valid page to wget in time. By default wget makes up to 20 tries per URL, so it just kept on requesting cron.php. We wound up with multiple cron runs executing concurrently, which caused Managing News to blip: it produced the same feed items from the same feeds over and over again and only slowly moved on to new feeds. The process also created a lot of useless server load.
The fix to this issue was tiny. Alex added "-t 1" to the wget call in the cron script to tell it to try each URL only once instead of the default 20 times. It worked. RH Reality Check’s Managing News is back to regularly updating all 580 of their feeds with no duplicates. The finding did call for a “heavy” patch for core ;). The patch for Drupal 6.x is already in, the backport for 5.x is pending, and a 4.7.x patch is in the works.
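For anyone who wants to see the change in context, here is a minimal sketch of a cron wrapper script. It is not the exact script from the core patch: the function name run_cron, the script path, and the example.com URL are placeholders made up for illustration. Only the "-t 1" flag comes from the fix itself; "-O -" and "-q" are standard wget options that send the response to stdout and silence progress output.

```shell
#!/bin/sh
# Sketch of a cron wrapper script (hypothetical; the URL and path are placeholders).

# Request cron.php exactly once.
#   -O -  write the response to stdout instead of saving a file
#   -q    suppress progress output
#   -t 1  one try only (wget's default is 20 tries)
run_cron() {
  wget -O - -q -t 1 "$1"
}

# Only fire the request when a URL is passed as the first argument,
# for example from a crontab entry that runs every ten minutes:
#   */10 * * * * /path/to/this-script.sh http://www.example.com/cron.php
if [ -n "${1:-}" ]; then
  run_cron "$1" > /dev/null
fi
```

With "-t 1", a slow cron.php run results in one failed request rather than a string of repeat requests piling up behind it.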
This is great news for us because, in addition to these problems, the bug was quite literally killing the server. Cron takes a lot out of a server at any time, and for Managing News it’s set to run more frequently than for typical websites. In RH Reality Check’s case, it runs every ten minutes and updates 580 feeds more than once a day. A typical website runs cron once an hour at most, for a few feeds. With any Managing News system, the server is already sprinting every ten minutes. This bug was turning those short sprints into mile-long sprints run back to back. This wear on the server is one of the potential reasons why wget wasn’t getting the content it should have on its first try. We’re currently working on optimizing the performance of the server and the Managing News system so it can be less of a burden on a server and run faster and more efficiently.
If you’re interested in more technical details on this and the chat that led to Alex’s discovery, check out the conversation on Drupal here, here, and here.

Comments
still hitting apache
Hi themegarden.org, here is another reason to use the flag: if the server is under heavy load and does not serve the request, without it you're still hitting Apache with a growing number of requests that it is unable to serve.
themegarden:
Thanks for clarifying this. Since I was looking at 4.7 systems here, the -t 1 option for wget is a lot more important there than for 5.x and 6.x installations.
That's why I am going to roll this patch for 4.7.x now :)
I'm confused a bit.
As far as I know, re-running the cron job (by wget-ing cron.php) while it is already running shouldn't do any harm: as of Drupal 5, cron has a semaphore that prevents the scenario you are describing. So there is no need for the "-t 1" wget option, though it doesn't hurt either.
In the case of http://www.rhrealitycheck.org things are different: they are running Drupal 4.7, where cron has no semaphore.