Introduction

We want to determine how fresh the data on HDX is, for two purposes. Firstly, we want to encourage data contributors to update their data regularly where applicable. Secondly, we want to tell users of HDX how up to date the datasets they are interested in are.

Important fields

Field	Description	Purpose
data_update_frequency	Dataset expected update frequency	Shows how often the data is expected to be updated or at least checked to see if it needs updating
last_modified	Resource last modified date	Indicates the last time the resource was updated irrespective of whether it was a major or minor change
dataset_date	Dataset date	The date referred to by the data in the dataset. It changes when data for a new date comes to HDX so may not need to change for minor updates

There are two dates that data can have, and this can cause confusion, so we define them clearly here:

1. Date of update: The last time the data was looked at to confirm it is up to date. Ideally, the history of update dates corresponds with the selected expected update frequency. This is last_modified.
2. Date of data: The actual date of the data. An update could consist of just confirming that the data has not changed. This is dataset_date.

When we talk about "update time", we are referring to option 1, the date of update.

The method of determining whether a resource has been updated depends upon where the file is hosted. If it is hosted by HDX, then the update time is recorded, but if it is hosted externally, it can be difficult to determine whether the file at a URL has been updated.

...

[Diagram: HDX structure]

Once we have an update time for a dataset's resources, we can calculate its age. Combining the age with the update frequency, we can ascertain the freshness of the dataset.

...

Dataset Aging Methodology

A resource's age can be measured using today's date - last update time. For a dataset, we take the lowest age of all its resources. This value can be compared with the update frequency to determine an age status for the dataset.
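The age calculation described above can be sketched as follows. This is a minimal illustration, not the freshness process's actual code: the function names are hypothetical, and measuring age in whole days is an assumption made here.

```python
from datetime import datetime, timezone


def resource_age_days(last_modified, now):
    """Age of one resource in whole days: today's date minus last update time."""
    return (now.date() - last_modified.date()).days


def dataset_age_days(resource_update_times, now):
    """Age of a dataset: the lowest age of all its resources.

    Raises ValueError if the dataset has no resources.
    """
    return min(resource_age_days(ts, now) for ts in resource_update_times)


# Hypothetical dataset with two resources, updated on 1 and 15 March 2017.
now = datetime(2017, 3, 20, tzinfo=timezone.utc)
update_times = [
    datetime(2017, 3, 1, tzinfo=timezone.utc),
    datetime(2017, 3, 15, tzinfo=timezone.utc),
]
print(dataset_age_days(update_times, now))  # 5
```

Taking the minimum means a dataset counts as recently updated as soon as any one of its resources has been updated.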

Thought had previously gone into classifying the age of datasets. Reviewing that work, the statuses used (up to date, due, overdue and delinquent) and the formulae for calculating those statuses are sound, so they have been used as a foundation. It is important that we distinguish between what we report to our users and data providers and what we need for our automated processing. For the purposes of reporting, the terminology we use is simply fresh or not fresh. For contacting data providers, we must give them some leeway from the due date (technically the date after which the data is no longer fresh): the automated email would be sent on the overdue date rather than the due date. The delinquent date would also be used in an automated process that tells us it is time to contact the data providers manually to see if they have any problems with updating their data that we can help with.

Update Frequency	Dataset age state thresholds (how old must a dataset be for it to have this status)
	Fresh		Not Fresh
	Up-to-date	Due	Overdue	Delinquent
Daily	0 days old	1 day old due_age = f	2 days old overdue_age = f + 1	3 days old delinquent_age = f + 2
Weekly	0 - 6 days old	7 days old due_age = f	14 days old overdue_age = f + 7	21 days old delinquent_age = f + 14
Fortnightly	0 - 13 days old	14 days old due_age = f	21 days old overdue_age = f + 7	28 days old delinquent_age = f + 14
Monthly	0 - 29 days old	30 days old due_age = f	44 days old overdue_age = f + 14	60 days old delinquent_age = f + 30
Quarterly	0 - 89 days old	90 days old due_age = f	120 days old overdue_age = f + 30	150 days old delinquent_age = f + 60
Semiannually	0 - 179 days old	180 days old due_age = f	210 days old overdue_age = f + 30	240 days old delinquent_age = f + 60
Annually	0 - 364 days old	365 days old due_age = f	425 days old overdue_age = f + 60	455 days old delinquent_age = f + 90
Never	Always	Never	Never	Never
Live	Always	Never	Never	Never
As Needed	Always	Never	Never	Never
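The thresholds in the table can be applied with a simple lookup, sketched below. The day values are taken from the table; the names THRESHOLDS, ALWAYS_UP_TO_DATE and age_status are hypothetical and are not taken from the freshness code.

```python
# (due, overdue, delinquent) ages in days, copied from the table above.
THRESHOLDS = {
    "Daily": (1, 2, 3),
    "Weekly": (7, 14, 21),
    "Fortnightly": (14, 21, 28),
    "Monthly": (30, 44, 60),
    "Quarterly": (90, 120, 150),
    "Semiannually": (180, 210, 240),
    "Annually": (365, 425, 455),
}

# These update frequencies are always up to date according to the table.
ALWAYS_UP_TO_DATE = {"Never", "Live", "As Needed"}


def age_status(update_frequency, age_days):
    """Return the age status of a dataset given its update frequency and age."""
    if update_frequency in ALWAYS_UP_TO_DATE:
        return "Up-to-date"
    due, overdue, delinquent = THRESHOLDS[update_frequency]
    # Check the oldest threshold first so each age maps to exactly one status.
    if age_days >= delinquent:
        return "Delinquent"
    if age_days >= overdue:
        return "Overdue"
    if age_days >= due:
        return "Due"
    return "Up-to-date"


print(age_status("Monthly", 29))  # Up-to-date
print(age_status("Monthly", 44))  # Overdue
```

An unknown update frequency raises a KeyError in this sketch. A real process would need to decide how to treat such datasets.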

Here is a presentation about data freshness from January 2017 that provides a good introduction.

Data Freshness Architecture

Data freshness consists of a database, REST API, freshness process and freshness emailer.

There is a Docker container hosting the Postgres database (https://hub.docker.com/r/unocha/alpine-postgres/ - 201703-PR116), with a port open to allow connections from external database clients (hdxdatateam.xyz:5432). Another Docker container (https://hub.docker.com/r/mcarans/alpine-haskell-postgrest/) exposes a REST API to the database (http://hdxdatateam.xyz:3000/); the Docker setup for this is here: https://github.com/OCHA-DAP/alpine-haskell-postgrest. The freshness process and freshness emailer also run within their own Docker containers. The docker-compose that brings all these containers together is here: https://github.com/OCHA-DAP/hdx-data-freshness-docker.
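Since the REST API is served by PostgREST, a query against it is an HTTP GET with filters of the form column=operator.value. The sketch below only builds such a URL and makes no request. The table name "datasets" and the column names are hypothetical, as the database schema is not described here.

```python
from urllib.parse import urlencode

# Base URL of the REST API as given in the text above.
BASE_URL = "http://hdxdatateam.xyz:3000"


def build_query_url(table, filters, columns=None):
    """Build a PostgREST-style query URL.

    filters maps a column name to an (operator, value) pair, for example
    {"update_frequency": ("eq", 0)} becomes update_frequency=eq.0.
    """
    params = {col: f"{op}.{value}" for col, (op, value) in filters.items()}
    if columns:
        # PostgREST selects specific columns with select=col1,col2
        params["select"] = ",".join(columns)
    return f"{BASE_URL}/{table}?{urlencode(params, safe=',')}"


# Hypothetical table and columns, for illustration only.
url = build_query_url("datasets", {"update_frequency": ("eq", 0)}, ["id", "name"])
print(url)
# http://hdxdatateam.xyz:3000/datasets?update_frequency=eq.0&select=id,name
```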

Here is an overall view of the architecture:

...

Data Freshness Process

...

Data Freshness Emailer

...

More information on the Data Freshness emailer can be found by clicking the above link.

Progress

...

Jira issue: HDX-4919

...

It was determined that a new field was needed on resources in HDX. This field shows the last time the resource was updated and has been implemented and released to production (HDX-4254). Related to that is ongoing work to make the field visible in the UI (HDX-4894).

A trigger has been created for Google spreadsheets that will automatically update the resource last modified date when the spreadsheet is edited. This helps with monitoring the freshness of toplines and other resources held in Google spreadsheets and we can encourage data contributors to use this where appropriate. Consideration has been given to doing something similar with Excel spreadsheets, but support issues could become burdensome.

A collaboration has been started with a team at Vienna University who are considering the issue of data freshness from an academic perspective. We will see what we can learn from them but will likely proceed with a more basic and practical approach than what they envisage. Specifically, they are looking at estimating the next change time for a resource based on previous update history, which is in an early stage of research so not ready for use in a real life system just yet.
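To illustrate the general idea of estimating the next change time from previous update history, here is a deliberately naive baseline that adds the mean interval between past updates to the most recent update. It is not the Vienna team's method, and the function name estimate_next_change is hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def estimate_next_change(update_times):
    """Estimate the next change as last update + mean gap between past updates.

    Returns None when there are fewer than two updates, because no interval
    can be computed.
    """
    times = sorted(update_times)
    if len(times) < 2:
        return None
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
    return times[-1] + timedelta(seconds=mean(gaps))


# Hypothetical history of a resource updated weekly.
history = [datetime(2017, 1, 1), datetime(2017, 1, 8), datetime(2017, 1, 15)]
print(estimate_next_change(history))  # 2017-01-22 00:00:00
```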

Running data freshness has shown that there are many datasets with an update frequency of never. This is understandable because for a long time, it was the default option in the UI. As the data freshness database holds organisation information, steps have been taken to compile a list of organisations that have datasets with update frequency never, and to contact them to encourage them to put in a time period.

To cover datasets that are updated but not according to any schedule (i.e. ad hoc), and datasets that are updated continually by systems (i.e. live), we introduced new "live" and "adhoc" expected update frequencies. However, since these would be enticing options to pick as they do not require much thought, we also ensured that data contributors could not choose them in the UI. Instead we wait for them to ask us or for us to identify their dataset as stale and contact them about it (HDX-5046).

Only administrators can set these in the UI, but they can be set directly programmatically. Similarly, "never" is now only available to administrators because contributors may pick it simply because they don't want to commit to an expected update frequency. It is probable that they will pick every year instead. When that timeframe passes and their dataset has not been updated, we would contact them about it and they could then tell us it will never be updated.

The question of where data freshness should run and where it should output has been addressed. The Data Systems server has been cleared of other deprecated work and freshness runs on there.

Next Steps

As data freshness collects a lot of metadata, it could be used for more general reporting. If needed, the list of metadata collected could be extended.

Even for datasets which have an update frequency of "never", there could be an argument for a very rare mail reminder just to confirm data really is static.

For the case where data is unchanged and we have sent an overdue email, we should give the option for contributors to respond directly to the automated mail to say so (perhaps by clicking a button in the message).

The number of datasets that are hosted outside of HDX is growing rapidly, and these represent a problem for data freshness if their update time is not available. Rather than ignore them, the easiest solution is to send a reminder to users according to the update frequency. The problem is that the reminder would be sent irrespective of whether they have already updated, and so could be annoying.

Another way is to provide guidance to data contributors so that as they consider how to upload resources, we steer them towards a particular technological solution that is helpful to us, e.g. using a Google spreadsheet with our update trigger added. We could investigate a fuller integration between HDX and Google spreadsheets so that if a data provider clicks a button in HDX, it creates a resource pointing to a spreadsheet in Google Drive with the trigger set up, which opens automatically once they enter their Google credentials. We may need to investigate other platforms, for example creating document alerts in OneDrive for Business and/or macros in Excel spreadsheets (although as noted earlier, this might create a support headache).

Number of Files Locally and Externally Hosted

...

22%

...

26%

...

27%

...

2%

...

24%

...

100%

Completed Work

Data Freshness Roadmap

Statistics

References

Using the Update Frequency Metadata Field and Last_update CKAN field to Manage Dataset Freshness on HDX:

https://docs.google.com/document/d/1g8hAwxZoqageggtJAdkTKwQIGHUDSajNfj85JkkTpEU/edit#

Dataset Aging service:

https://docs.google.com/document/d/1wBHhCJvlnbCI1152Ytlnr0qiXZ2CwNGdmE1OiK7PLzo/edit

https://github.com/luiscape/hdx-monitor-ageing-service

...

University of Vienna paper on methodologies for estimating next change time for a resource based on previous update history:

https://www.adequate.at/wp-content/uploads/2016/04/neumaier2016ODFreshness.pdf

University of Vienna presentation of data freshness:

freshness_hdx.pdf

...
