Jonathan Coney

MRes Student in Climate and Atmospheric Science 2019-20


Computer Project

For a general introduction to my Master's project, see the Index page. My project compares quality control methods for temperature data collected from Netatmo weather stations in the UK, which requires collecting data from all of the UK stations. My code forms a pipeline: first it collects the MAC addresses of each Netatmo weather station in the UK and its sensors, then it splits this long list into more manageable chunks, and finally it uses these MAC addresses to collect the previous day's temperature data via the API, working through the chunks over the course of the day. The code also collects other parameters, such as pressure, rainfall and wind, in the hope that future work can make use of these data. To judge the quality control methods, I need a sizeable archive of Netatmo temperature data, both to compare against data from Met Office stations and to be confident in my conclusions.

There are three main pieces of code to collect data from Netatmo weather stations:

  1. Code to collect the MAC addresses of each UK station. A .csv file containing a list of known Netatmo weather stations is read, then the UK is checked for new stations; any new stations are added to a dictionary before the .csv file is updated.
  2. Code to split the MAC addresses into chunks, so as not to hit the API limits, and to remove stations that are not in the UK. The stations_modules.csv file is read and the stations are split across some 40 files, each requiring roughly 250 API calls (see the sketch after this list).
  3. Code to act on the stations in each chunk and retrieve the previous day's data from each station. This code runs once every half hour; it reads a given .csv file and creates a netCDF file of data for each Netatmo weather station in that list.
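
As a rough illustration of step 2, the sketch below splits a list of station records into .csv files of at most 250 rows each. The column names, bounding-box check and file layout are assumptions for illustration, not the exact logic of nationsplit.py.

```python
# Hypothetical sketch of the chunking step: split a list of station
# MAC addresses into .csv files small enough that each file needs
# at most ~250 API calls. Column names are placeholders.
import csv

CALLS_PER_FILE = 250

def split_stations(in_path, out_prefix):
    with open(in_path, newline="") as f:
        stations = [row for row in csv.DictReader(f)]

    # Keep only UK stations (this bounding-box check is a placeholder).
    uk = [s for s in stations
          if 49.8 <= float(s["lat"]) <= 60.9
          and -8.7 <= float(s["lon"]) <= 1.8]

    # One API call per station, so at most CALLS_PER_FILE rows per file.
    for i in range(0, len(uk), CALLS_PER_FILE):
        chunk = uk[i:i + CALLS_PER_FILE]
        out_path = f"{out_prefix}{i // CALLS_PER_FILE}.csv"
        with open(out_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=chunk[0].keys())
            writer.writeheader()
            writer.writerows(chunk)
```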

Flowchart showing the files input and output by each program

Requirements

All my code has been written in Python 3 (specifically 3.7.4). To run my code, you need the following core modules and external packages installed; I have included the specific versions I have tested the code on. The webpage for each script lists the packages that script requires.

Core modules:

Packages:

Some scripts require a Netatmo Developer Account in order to retrieve data from the API.
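
For reference, a minimal sketch of obtaining an access token is below. The client ID, client secret and login credentials are placeholders for values from a Netatmo Developer Account, and it uses the password grant the Netatmo API offered at the time; check the current developer documentation before relying on it.

```python
# Hypothetical sketch: obtain a Netatmo API access token.
# client_id, client_secret, username and password are placeholders
# for the values from your Netatmo Developer Account.
import requests

TOKEN_URL = "https://api.netatmo.com/oauth2/token"

def get_access_token(client_id, client_secret, username, password):
    resp = requests.post(TOKEN_URL, data={
        "grant_type": "password",   # grant type offered at the time
        "client_id": client_id,
        "client_secret": client_secret,
        "username": username,
        "password": password,
        "scope": "read_station",
    })
    resp.raise_for_status()
    return resp.json()["access_token"]
```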

Running the scripts

I run the scripts from the command line, and use cron to schedule them to run at the correct times. The specific syntax for each script (i.e. which arguments it requires) is written at the top of that script.
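
As an illustration, a crontab along these lines would reproduce the schedule described in the Background section below; the paths are placeholders, not my actual crontab.

```
# Hypothetical crontab sketch (paths are placeholders).
# 00:00 - update the station list with new stations
0 0 * * *       python3 /path/to/get_ids_modules.py
# 00:30 - split the station list into per-chunk .csv files
30 0 * * *      python3 /path/to/nationsplit.py
# every half hour from 01:00 - retrieve data for one chunk
0,30 1-23 * * * python3 /path/to/gethistoric_netCDF_JASMIN.py
```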

Background

Netatmo weather stations all upload their data to Netatmo's servers, and the data can be retrieved through the Netatmo Weather API; the same data is displayed on the Netatmo website via their weathermap. The API has rate limits: no user may request data more than 50 times in 10 seconds, or more than 500 times per hour, and the code has been written to stay within these limits. The API has three commands: getpublicdata, which returns the public stations in a given area; getstationsdata, which returns data for a user's own stations; and getmeasure, which returns historical measurements for a given device.
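
To stay under those limits, something like the sketch below can pace the calls. This is a generic throttling illustration under the two limits named above, not the exact mechanism in my scripts.

```python
# Hypothetical sketch: pace API calls to stay under Netatmo's limits
# (50 calls per 10 seconds and 500 calls per hour).
import time
from collections import deque

class RateLimiter:
    """Blocks until a call can be made without breaching either limit."""

    def __init__(self):
        self.calls = deque()  # timestamps of recent calls

    def wait(self):
        while True:
            now = time.monotonic()
            # Discard timestamps older than an hour.
            while self.calls and now - self.calls[0] > 3600:
                self.calls.popleft()
            in_last_10s = sum(1 for t in self.calls if now - t <= 10)
            if in_last_10s < 50 and len(self.calls) < 500:
                self.calls.append(now)
                return
            time.sleep(0.5)  # back off briefly, then re-check
```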

Each Netatmo weather station, and each component of a weather station, has a MAC address (or ID). Accessing data from any part of a weather station requires these MAC addresses. Accessing data from a particular sensor, for example the rain gauge, requires the MAC addresses of both the indoor sensor and the rain gauge, since the outdoor sensors send their measurements wirelessly to the Netatmo indoor sensor, which then uploads all the data to the API. My code runs throughout the day on LOTUS, the batch compute cluster of JASMIN, a data analysis facility run by the Centre for Environmental Data Analysis (CEDA). LOTUS prioritises jobs so that all users get a fair share of the queue, which means jobs can be terminated before they finish. Given this, and the API limits discussed above, the code is designed to retrieve as much as possible of the previous day's observations.
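
As an illustration of how a module's data might be requested once the MAC addresses are known, here is a sketch using the getmeasure endpoint. The parameter values are assumptions based on the public API documentation, not a copy of my script.

```python
# Hypothetical sketch: fetch yesterday's temperature readings for one
# outdoor module via the Netatmo getmeasure endpoint. device_id is the
# MAC address of the indoor sensor, module_id that of the outdoor module.
import time
import requests

MEASURE_URL = "https://api.netatmo.com/api/getmeasure"

def get_yesterday_temperature(token, device_id, module_id):
    end = int(time.time()) // 86400 * 86400  # midnight UTC today
    begin = end - 86400                      # midnight UTC yesterday
    resp = requests.get(MEASURE_URL, params={
        "access_token": token,
        "device_id": device_id,
        "module_id": module_id,
        "scale": "max",          # every measurement the station recorded
        "type": "Temperature",
        "date_begin": begin,
        "date_end": end,
    })
    resp.raise_for_status()
    return resp.json()["body"]
```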

Diagram showing the MAC addresses of each component of a weather station, and the public data recorded by each component. Images from Netatmo branding kit.

The daily schedule is as follows:

  00:00: get_ids_modules.py is run to update the data with new stations. More information on the Get IDs page.
  00:30: nationsplit.py is run to split the data into smaller files, so that the API is called fewer than 250 times for each resulting .csv file. More information on the Split up IDs page.
  Every half hour from 01:00 onwards: gethistoric_netCDF_JASMIN.py is run, calling the file named x.csv, where x is a function of t, the number of seconds since midnight, so that a different .csv file of MAC addresses is selected on each run. Then, for each station in the file, the previous day's data is retrieved from the Netatmo API and saved to a netCDF file named uol-netatmo-MACADDRESS_STATIONLOCATION_YYYYMMDD_surfacemet_v1.5.nc. One such example can be found here (you may need a viewer, such as Panoply, to open this file).
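
The file selection and output naming could look something like the sketch below. The mapping x = t/1800 - 1 is my reconstruction of the half-hourly pattern (file 1 at 01:00, file 2 at 01:30, and so on), and the function names are placeholders, not the actual contents of gethistoric_netCDF_JASMIN.py.

```python
# Hypothetical sketch of the half-hourly file selection. The mapping
# x = t/1800 - 1 is an assumption: it picks file 1 at 01:00 and
# increments by one every half hour.
from datetime import datetime

def current_chunk_name(now=None):
    now = now or datetime.utcnow()
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    t = int((now - midnight).total_seconds())  # seconds since midnight
    x = t // 1800 - 1
    return f"{x}.csv"

def output_filename(mac_address, location, date):
    # Naming scheme from the text:
    # uol-netatmo-MACADDRESS_STATIONLOCATION_YYYYMMDD_surfacemet_v1.5.nc
    return (f"uol-netatmo-{mac_address}_{location}_"
            f"{date:%Y%m%d}_surfacemet_v1.5.nc")
```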

The map to the right shows which stations are retrieved and at which time. Since nationsplit.py is run every day, the stations that are retrieved at later times tend to be stations that have been added to the list of stations more recently.

More information on this script is on the Get Netatmo weather data webpage.

Future work will focus on using the data collected to analyse the quality control methods mentioned previously, and on assessing whether Netatmo weather stations are a reliable source of mesoscale observations. To futureproof the code, and to handle instances where the script does not capture all the stations it should (usually because of issues with JASMIN), a script could be added to check which stations are missing for a particular day and then retrieve their data from the API where possible. However, this may run into the limits imposed by Netatmo (see above). One possible solution is to operate on a 48 hour cycle rather than the current 24 hour cycle, retrieving two days of data at a time rather than one; this would free up time for retrieving more data, and for error-correcting along the way. Another improvement would be to retrieve weather data in a more "random" order, rather than regionally as it does at the minute, to ensure some reasonable UK-wide coverage if LOTUS is not functioning properly for some part of a given day.