Completed Work
Below are pieces of work that were completed to support data freshness.
HDX Changes
It was determined that a new field was needed on resources in HDX: the revision_last_updated field mentioned in the introduction. This field shows the last time the resource was updated. It has been implemented and released to production (HDX-4254).
Critical to data freshness is having an indication of the update frequency of the dataset. Hence, the data_update_frequency field was made mandatory instead of optional, and its name was changed to sound less onerous by adding "expected", i.e. expected update frequency (HDX-4919). It was confirmed that this field should stay at dataset level, as our recommendation to data providers is that a dataset whose resources have different update frequencies should be divided into multiple datasets. The field has been implemented as a dropdown with the values: every day, every week, every two weeks, every month, every three months, every six months, every year, never.
To cover datasets that are updated continually by systems and datasets that are updated sporadically rather than according to any schedule, we introduced new "live" and "adhoc" expected update frequencies respectively. The database was patched so that "never" gets value -1 (instead of 0), "adhoc" gets value -2 and "live" gets value 0.
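The value scheme above can be sketched as a small lookup. This is only an illustration: the values for "live", "never" and "adhoc" come from the text, while the day counts for the time-period options and the function names are assumptions made for the example, not the values or names HDX necessarily uses.

```python
# Special values as described in the text above.
SPECIAL_FREQUENCIES = {"live": 0, "never": -1, "adhoc": -2}

# Assumed day counts for the time-period options (not stated in this section).
PERIOD_FREQUENCIES = {
    "every day": 1,
    "every week": 7,
    "every two weeks": 14,
    "every month": 30,
    "every three months": 90,
    "every six months": 180,
    "every year": 365,
}


def frequency_value(label):
    """Return the stored integer for an expected update frequency label."""
    if label in SPECIAL_FREQUENCIES:
        return SPECIAL_FREQUENCIES[label]
    return PERIOD_FREQUENCIES[label]


def has_time_period(value):
    """True when the value is a positive number of days, i.e. a real schedule."""
    return value > 0
```

Under this scheme, only positive values describe a schedule that a dataset can fall behind; zero and negative values mark the special cases.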
However, since these would be enticing options to pick as they require less thought than determining a time period, we also ensured that data contributors could not choose them in the UI (HDX-5046). Instead, we wait for contributors to ask, or for us to identify their dataset as stale and contact them about it. Only administrators can set these values in the UI, although they can be set directly by API. Similarly, "never" is now only available to administrators, because contributors may pick it simply because they don't want to commit to an expected update frequency. It is probable that they will pick every year instead; when that timeframe passes and their dataset has not been updated, we will contact them about it and they can then tell us it will never be updated.
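The restriction described above amounts to filtering the dropdown by role. A minimal sketch, assuming a simple boolean admin flag; the function and constant names are hypothetical and this is not the actual HDX UI code:

```python
from typing import List

# Options named in the text; the ordering is an assumption.
ALL_OPTIONS = [
    "every day",
    "every week",
    "every two weeks",
    "every month",
    "every three months",
    "every six months",
    "every year",
    "live",
    "adhoc",
    "never",
]

# Options that only administrators may select in the UI.
ADMIN_ONLY = {"live", "adhoc", "never"}


def selectable_frequencies(is_admin: bool) -> List[str]:
    """Return the dropdown options offered to a user with the given role."""
    if is_admin:
        return list(ALL_OPTIONS)
    return [option for option in ALL_OPTIONS if option not in ADMIN_ONLY]
```

Note that this only models the UI; as stated above, the values can still be set directly by API.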
Other Work
It was necessary to prepare a server for running the various freshness-related Docker containers. The Data Systems server was cleared of other deprecated work and set up for this purpose.
A trigger has been created for Google spreadsheets that will automatically update the resource last modified date when the spreadsheet is edited. This helps with monitoring the freshness of toplines and other resources held in Google spreadsheets and we can encourage data contributors to use this where appropriate. Consideration has been given to doing something similar with Excel spreadsheets, but support issues could become burdensome.
Running data freshness has shown that there are many datasets with an update frequency of never. This is understandable because for a long time, it was the default option in the UI. As the data freshness database holds organisation information, steps have been taken to compile a list and contact organisations who have datasets with update frequency never and encourage them to put in a time period.
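Compiling such a list could look roughly like the sketch below. The record layout (keys "organisation", "name" and "update_frequency") is hypothetical and does not reflect the actual freshness database schema; the value -1 for "never" is the one given earlier in this section.

```python
from collections import defaultdict

NEVER = -1  # stored value for "never", as described earlier


def never_datasets_by_organisation(datasets):
    """Group the names of datasets with update frequency never by organisation."""
    grouped = defaultdict(list)
    for dataset in datasets:
        if dataset["update_frequency"] == NEVER:
            grouped[dataset["organisation"]].append(dataset["name"])
    return dict(grouped)


# Illustrative records only, not real HDX data.
example = [
    {"organisation": "org-a", "name": "dataset-1", "update_frequency": -1},
    {"organisation": "org-a", "name": "dataset-2", "update_frequency": 365},
    {"organisation": "org-b", "name": "dataset-3", "update_frequency": -1},
]
contact_list = never_datasets_by_organisation(example)
```

Each key in the result is an organisation to contact, with the datasets to mention in the message.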
A collaboration was started with a team at Vienna University who are considering the issue of data freshness from an academic perspective. We learnt a few things from them but proceeded with a more basic and practical approach than the one they envisaged. Specifically, they are looking at estimating the next change time for a resource based on its previous update history. This is at an early stage of research, so it is not yet ready for use in a real-life system.