Development Seed Blog
Analyzing News with Graphs and Tag Clouds
Tools That Show What's Happening in a Glance
One of the features that sets the Managing News system apart from other aggregators is that it makes it easy to act on information. It does this by showing you the big picture of what’s going on within the system over different timeframes.
The latest graph we plan to integrate into Managing News is a line graph showing the terms that have risen or fallen fastest over a period of time, that is, the terms with the highest and lowest rates of change. This will quickly show you the newest hot topics and what is already old news. A big benefit of this graph is that it will filter out issues that are always talked about in your field.
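As a rough illustration of the idea (not the Managing News implementation), the sketch below compares term counts from two periods and ranks terms by how much they changed. The function name `fastest_movers` and the sample counts are made up for this example, and using a raw difference in mentions as the "rate of change" is an assumption.

```python
def fastest_movers(previous, current, top_n=3):
    """Rank terms by how much their mention count changed between two periods."""
    terms = set(previous) | set(current)
    change = {t: current.get(t, 0) - previous.get(t, 0) for t in terms}
    # Largest increase first; ties broken alphabetically for stable output.
    ranked = sorted(change.items(), key=lambda item: (-item[1], item[0]))
    risers = [(t, d) for t, d in ranked if d > 0][:top_n]
    fallers = [(t, d) for t, d in reversed(ranked) if d < 0][:top_n]
    return risers, fallers


# Made-up counts for illustration only.
last_week = {"iraq": 120, "budget": 40, "election": 10}
this_week = {"iraq": 118, "budget": 5, "election": 65}

risers, fallers = fastest_movers(last_week, this_week, top_n=1)
print(risers)   # [('election', 55)]
print(fallers)  # [('budget', -35)]
```

Note that the term with by far the most mentions in both periods barely moves, so it drops out of both lists. That is the filtering effect described above.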
Currently we have a bar graph on Managing News that shows the most talked about topics today, this week, this month, and in the past six months. Here’s a screenshot of it on the reproductive health organization RH Reality Check’s Managing News.
As you can see in the above graphs, Iraq is consistently talked about, as it has been since the war began four years ago. But that’s likely not the news that’s going to influence your daily communications and outreach (unless there’s a change or policy shift). That’s why we’re planning to switch to the first graph: to home in on the topics that are spiking up and down.
It’s also easy to see what topics are most significant to the team or organization running Managing News. Just look at the tag cloud.
This shows the 20 most talked about terms from the sources in your Managing News. Currently you can look at this information over four different time periods: day, week, month, and six months. However, we plan to drop the different time periods, at least temporarily, because computing them slows down how quickly the graphs can update. Instead of showing tags over the time periods, we plan to show the most popular tags of all time.
As we use Managing News on larger scales, we’ve seen that the graphs are taking longer to load and update. Currently it can take up to a few minutes for the graphs to update with the latest information, and that’s way too long. One reason for this slowdown is simply the sheer volume of content in Managing News systems, and that’s the nature of any aggregator. RH Reality Check’s system is brand new and already has 86,000 articles in it, and it’s only going to grow.
The second reason is that the process to pull information from Managing News into the graphs is complicated. Currently it works like this:
Step 1: Look for articles in the required time period in a potentially huge pool of entries (86,000 in the case of RH Reality Check) to determine the most popular terms.
Step 2: Add up all term relations (the pairing of an article with each of its tags, currently 640,000 for RH Reality Check) in these articles and figure out which ones are the most popular.
Step 3: Plot the information in the graphs.
That’s for one time period. Our current graphs each have four. Simply put, this process takes a while, and we don’t want to wait for it to run its course.
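The three steps above can be sketched in a few lines. This is only an in-memory approximation under stated assumptions: the `articles` list stands in for the real article and term-relation tables, and the field names, dates, and `top_terms` function are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical stand-in for the stored articles and their tags.
articles = [
    {"published": datetime(2007, 6, 2), "tags": ["iraq", "health"]},
    {"published": datetime(2007, 6, 3), "tags": ["iraq", "policy"]},
    {"published": datetime(2007, 6, 5), "tags": ["iraq", "health"]},
    {"published": datetime(2007, 5, 1), "tags": ["policy"]},
]


def top_terms(articles, start, end, limit=20):
    # Step 1: scan the whole pool for articles in the time period.
    in_period = [a for a in articles if start <= a["published"] < end]
    # Step 2: add up the term relations (article-tag pairs) in those articles.
    counts = Counter(tag for a in in_period for tag in a["tags"])
    # Step 3 would plot this ranking; here it is simply returned.
    return counts.most_common(limit)


week = top_terms(articles, datetime(2007, 6, 1), datetime(2007, 6, 8), limit=2)
print(week)  # [('iraq', 3), ('health', 2)]
```

Step 1 touches every article on every run, and the whole function runs once per time period, which is why the cost grows with both the size of the pool and the number of periods shown.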
This week Alex will be working on implementing an idea he has to simplify this process. He’ll set up Managing News to keep content cached in chronological order by publish date as articles are pulled into the system. This means that when the graphs are pulling information, they can skip step 1. This change should make the two new graphs we’re adding to Managing News work quickly.
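The post does not describe how the cache is built, so the following is only one way the idea could work, sketched under assumptions: entries are kept sorted by publish date as they arrive, so a time period becomes a slice found by binary search instead of a scan of every article. The class name `ChronologicalIndex` and its methods are hypothetical.

```python
import bisect
from collections import Counter
from datetime import datetime


class ChronologicalIndex:
    """Keeps tag lists ordered by publish date (an illustrative in-memory cache)."""

    def __init__(self):
        self._dates = []  # publish dates, always sorted
        self._tags = []   # tag lists, parallel to _dates

    def add(self, published, tags):
        # Insert at the sorted position as each article is pulled in.
        pos = bisect.bisect_right(self._dates, published)
        self._dates.insert(pos, published)
        self._tags.insert(pos, list(tags))

    def top_terms(self, start, end, limit=20):
        # Binary search replaces the full scan for articles in the period.
        lo = bisect.bisect_left(self._dates, start)
        hi = bisect.bisect_left(self._dates, end)
        counts = Counter(tag for tags in self._tags[lo:hi] for tag in tags)
        return counts.most_common(limit)


index = ChronologicalIndex()
index.add(datetime(2007, 6, 5), ["iraq", "health"])
index.add(datetime(2007, 5, 1), ["policy"])
index.add(datetime(2007, 6, 2), ["iraq"])
print(index.top_terms(datetime(2007, 6, 1), datetime(2007, 6, 8)))
# [('iraq', 2), ('health', 1)]
```

In a database-backed system the same effect would more likely come from an index or a denormalized table keyed on publish date; the list-based version here only shows the principle.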
What do you think of the new graphing ideas? We’d love to hear your thoughts on what information would most help your team and what you think of our plan to forgo different time periods to improve speed.



Comments
Graphs
Graphing is a great idea.... I am a visual person, and seeing the curves helps a lot
distribution of terms ?
How about the distribution of the terms ?
I have a table with 2000+ terms created by yahoo! (no limits on the number of terms per node), but less than 15% of them are used on several nodes. It might be a safe bet to ignore all terms that are used only once, reducing the input data vastly.
Good call. Especially for yahoo terms that are intended to give you a rough idea about what's going on in your mass of blog posts, it could also make sense to expire terms after a while if they only occur once or twice.
Maybe I'm just repeating what Alex has in mind, but this might work to improve performance:
Since past data does not change, you can prepare and cache data about the most popular tags for the past.
Every hour a cron is run to get and cache the most popular tags for the most recent time period (the past hour); the first run after 00:00h creates and caches (in a specially created table) the most popular tags for 1 day.
When you need the most popular tags for a certain period (7 days, 6 months), it is calculated from the most popular tags of the past 7 (or 6*31) days.
Make sure you keep more than just the 20 most popular tags for a certain period, because the most popular tags of a broader time period are not just the sum of the most popular tags per day (a tag that is at position 51 every day might reach the top 20 on a monthly scale).
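A small sketch of the rollup described in this comment, with made-up counts, shows why truncating each day's cache too aggressively gives the wrong answer for longer periods. The function names `rollup` and `truncate` are hypothetical, and the per-day counts are assumed to be cached as simple tag-to-count mappings.

```python
from collections import Counter


def rollup(daily_counts, limit=20):
    """Sum cached per-day tag counts into one ranking for a longer period."""
    total = Counter()
    for day in daily_counts:
        total.update(day)
    return total.most_common(limit)


def truncate(day, keep):
    """Keep only the `keep` most popular tags from one day's counts."""
    return Counter(dict(day.most_common(keep)))


# Made-up data: a different tag spikes each day, while "steady" is always second.
days = [
    Counter({"a": 10, "steady": 6}),
    Counter({"b": 10, "steady": 6}),
    Counter({"c": 10, "steady": 6}),
]

full = rollup(days, limit=1)
lossy = rollup([truncate(d, keep=1) for d in days], limit=1)
print(full)  # [('steady', 18)]
print("steady" in dict(lossy))  # False
```

The tag that never leads on any single day wins over the whole period, but only if the daily cache kept it.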
You are reading my mind.
The problem is that even doing it that way is getting too slow in cases where we have, say, 10 groups with 500 feeds total and 8 stats per group.
I'll do some more testing and coding here over the next few days. If you're interested, I can keep you posted.
Yes, that might be interesting. I'll keep an eye on the posts on this site, or you can contact me directly via http://drupal.org/user/8777/contact .