Big data requires good data janitors. With a mission to end air inequality, OpenAQ has collected over 170 million air quality measurements, and this number grows every 10 minutes. Development Seed and OpenAQ are now working together to build a data preparation tool for analysis-ready data, which you can use on any dataset.
OpenAQ provides their data untouched, so that users can decide for themselves how to validate and analyze the raw data. But data is messy. Raw air quality measurements often include error codes, negatives, and repeating values. These can make the data difficult to visualize in meaningful ways, or to aggregate for downstream policy and scientific applications.
When OpenAQ stakeholders were asked "What’s on the OpenAQ Community Wish List?", an open-source, extensible tool to flag problematic data was at the top of the list.
Development Seed, working with the OpenAQ community, has built openaq-quality-checks, a command-line tool for configuring flags for suspect data and optionally removing it. When you’re ready to jump in, check out the README or watch the short intro video.
Starting with user needs
In designing openaq-quality-checks, we asked users: "What is your OpenAQ data quality experience?" We were excited by the rich feedback we received. Respondents presented example API calls for problematic data types and discussed whether problematic stations should also be flagged.
We used the responses to gather some data on the data:
- 0.5% of the measurements are codes which signify missing data, e.g. -999.
- 1.4% are negative values which are not error codes.
- 4.6% are exact zeros, which are technically possible but highly unlikely, and suggest a malfunctioning instrument or a misreported value.
In addition to these problematic values, a measurement value might be repeated exactly. While an exact repeat is possible, it is unlikely that an instrument will measure identical hourly values for days on end. It could signal a stuck instrument or another unknown error in reporting the data. The tool flags repeating values as well as negatives and error codes.
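To make these checks concrete, here is a minimal Python sketch of flagging error codes, negatives, and repeating values in a series of measurements. It is an illustration only, not the openaq-quality-checks implementation: the function name, the flag names, and the repeat threshold of three are assumptions, and -999 is simply the example error code mentioned above.

```python
from typing import Iterable, List

# Assumed set of missing-data codes; -999 is the example given in the text.
ERROR_CODES = frozenset({-999.0})
# Assumed threshold: a run of at least this many identical values is flagged.
MIN_REPEATS = 3


def flag_measurements(values: List[float],
                      error_codes: Iterable[float] = ERROR_CODES,
                      min_repeats: int = MIN_REPEATS) -> List[List[str]]:
    """Return one list of flag names per measurement, in input order."""
    codes = set(error_codes)
    flags: List[List[str]] = [[] for _ in values]

    # Per-value checks: error codes first, so -999 is not also called "negative".
    for i, value in enumerate(values):
        if value in codes:
            flags[i].append("error_code")
        elif value < 0:
            flags[i].append("negative")

    # Run-length check: find stretches of identical consecutive values.
    start = 0
    for i in range(1, len(values) + 1):
        run_ended = i == len(values) or values[i] != values[start]
        if run_ended:
            if i - start >= min_repeats:
                for j in range(start, i):
                    flags[j].append("repeat")
            start = i

    return flags


if __name__ == "__main__":
    series = [12.0, -999.0, -3.5, 7.0, 7.0, 7.0, 8.1]
    for value, value_flags in zip(series, flag_measurements(series)):
        print(value, value_flags)
```

Keeping the flags separate from the values means you can decide later whether to drop flagged measurements or just mark them, which mirrors the "flag, and optionally remove" approach described above.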
Other data quality issues are more complex. 17% of locations (identified by name) share a latitude and longitude with another location. Further, a small number of sensors (72 out of 7342, or about 1%) have coordinates less than 1 yard from another sensor, making it likely that the two are the same sensor.
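One way to look for such near-duplicate sensors is to compare great-circle distances between every pair of coordinates. The sketch below is a hypothetical illustration, not the method used to produce the numbers above: the function names and the `(name, latitude, longitude)` tuple layout are assumptions, and the pairwise loop is quadratic, so it is only meant for modest numbers of sensors.

```python
import math
from itertools import combinations
from typing import List, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters
YARD_M = 0.9144               # one yard in meters

# Hypothetical record layout: (name, latitude, longitude) in decimal degrees.
Sensor = Tuple[str, float, float]


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def near_duplicates(sensors: List[Sensor],
                    max_distance_m: float = YARD_M) -> List[Tuple[str, str]]:
    """Return name pairs of sensors that are closer than max_distance_m."""
    pairs = []
    for (name1, lat1, lon1), (name2, lat2, lon2) in combinations(sensors, 2):
        if haversine_m(lat1, lon1, lat2, lon2) < max_distance_m:
            pairs.append((name1, name2))
    return pairs


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    example = [
        ("A", 40.0, -105.0),
        ("B", 40.0, -105.0),       # identical to A
        ("C", 40.000005, -105.0),  # roughly half a meter north of A
        ("D", 41.0, -105.0),       # far away
    ]
    print(near_duplicates(example))
```

A pair that comes back from a check like this is a candidate for review rather than proof of a duplicate, since co-located instruments can legitimately share a site.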
Beyond Air Quality
Openaq-quality-checks is configurable and modular. It was designed with OpenAQ in mind but is flexible enough to work with other data sources. For example, if you want to analyze aggregated world news using Reddit’s worldnews subreddit, you might want to flag posts from unknown news organizations. You can find details on implementing this example in the README.
Development Seed and OpenAQ are excited to see how the tool helps the community and grows with it!