Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Introduction

...

Drawio
baseUrlhttps://humanitarian.atlassian.net/wiki
diagramNameHDX structure
width631
pageId20742158
height481
revision2

Once we have an update time for a dataset's resources, we can calculate its age and combined with the update frequency, we can ascertain the freshness of the dataset. 
 

Dataset Aging Methodology

A resource's age can be measured using today's date - last update time. For a dataset, we take the lowest age of all its resources. This value can be compared with the update frequency to determine an age status for the dataset.


Thought had previously gone into classification of the age of datasets. Reviewing that work, the statuses used (up to date, due, overdue and delinquent) and formulae for calculating those statuses are sound so they have been used as a foundation. It is important that we distinguish between what we report to our users and data providers with what we need for our automated processing. For the purposes of reporting, then the terminology we use is simply fresh or not fresh. For contacting data providers, we must give them some leeway from the due date (technically the date after which the data is no longer fresh): the automated email would be sent on the overdue date rather than the due date. The delinquent date would also be used in an automated process that tells us it is time for us to manually contact the data providers to see if they have any problems we can help with regarding updating their data.


Update Frequency

Dataset age state thresholds

(how old must a dataset be for it to have this status)

FreshNot Fresh

Up-to-date

Due

Overdue

Delinquent

Daily

0 days old

1 day old

due_age = f

2 days old

overdue_age = f + 2

3 days old

delinquent_age = f + 3

Weekly

0 - 6 days old

7 days old

due_age = f

14 days old

overdue_age = f + 7

21 days old

delinquent_age = f + 14

Fortnightly

0 - 13 days old

14 days old

due_age = f

21 days old

overdue_age = f + 7

28 days old

delinquent_age = f + 14

Monthly

0 -29 days old

30 days old

due_age = f

44 days old

overdue_age = f + 14

60 days old

delinquent_age = f + 30

Quarterly

0 - 89 days old

90 days old

due_age = f

120 days old

overdue_age = f + 30

150 days old

delinquent_age = f + 60

Semiannually

0 - 179 days old

180 days old

due_age = f

210 days old

overdue_age = f + 30

240 days old

delinquent_age = f + 60

Annually

0 - 364 days old

365 days old

due_age = f

425 days old

overdue_age = f + 60

455 days old

delinquent_age = f + 90


NeverAlwaysNeverNeverNever


Here is a presentation about data freshness from January 2017 that provides a good introduction.

Data Freshness Architecture

Data freshness consists of a database, REST API, freshness process and freshness emailer.

There is a docker container hosting the Postgres database (https://hub.docker.com/r/unocha/alpine-postgres/ - 201703-PR116) and a port is open on there to allow connection from external database clients (hdxdatateam.xyz:5432). There is a another Docker container (https://hub.docker.com/r/mcarans/alpine-haskell-postgrest/) that exposes a REST API to the database (http://hdxdatateam.xyz:3000/) - the docker setup for this is here: https://github.com/OCHA-DAP/alpine-haskell-postgrest. The freshness process and freshness emailer are also within their own Docker containers. The docker-compose that brings all these containers together is here: https://github.com/OCHA-DAP/hdx-data-freshness-docker.

Here is an overall view of the architecture:

Drawio
baseUrlhttps://humanitarian.atlassian.net/wiki
diagramNameFreshness Architecture
width748
zoom1
pageId20742158
lbox1
height401
revision3

More information on the process can be found by clicking the above link.

 

Data Freshness Emailer

More information on the Data Freshness emailer can be found by clicking the above link.

Next Steps

Related to that is ongoing work to make the field visible in the UI 

Jira Legacy
serverJIRA (humanitarian.atlassian.net)
serverIdefab48d4-6578-3042-917a-8174481cd056
keyHDX-4894
.

As data freshness collects a lot of metadata, it could be used for more general reporting. If needed, the list of metadata collected could be extended. 

Even for datasets which have an update frequency of "never", there could be an argument for a very rare mail reminder just to confirm data really is static.

For the case where data is unchanged and we have sent an overdue email, we should give the option for contributors to respond directly to the automated mail to say so (perhaps by clicking a button in the message).

The amount of datasets that are hosted outside of HDX is growing rapidly and these represent a problem for data freshness if their update time is not available. Rather than ignore them, the easiest solution is to send a reminder to users according to the update frequency  - the problem is that this would be irrespective of whether they have already updated and so potentially annoying.

Another way is to provide guidance to data contributors so that as they consider how to upload resources, we steer them towards a particular technological solution that is helpful to us eg. using a Google spreadsheet with our update trigger added.  We could investigate a fuller integration between HDX and Google spreadsheets so that if a data provider clicks a button in HDX, it will create a resource pointing to a spreadsheet in Google Drive with the trigger set up that opens automatically once they enter their Google credentials. We may need to investigate other platforms for example creating document alerts in OneDrive for Business and/or macros in Excel spreadsheets (although as noted earlier, this might create a support headache). 

References

Using the Update Frequency Metadata Field and Last_update CKAN field to Manage Dataset Freshness on HDX:

https://docs.google.com/document/d/1g8hAwxZoqageggtJAdkTKwQIGHUDSajNfj85JkkTpEU/edit#

Dataset Aging service:

https://docs.google.com/document/d/1wBHhCJvlnbCI1152Ytlnr0qiXZ2CwNGdmE1OiK7PLzo/edit

https://github.com/luiscape/hdx-monitor-ageing-service


University of Vienna paper on methodologies for estimating next change time for a resource based on previous update history:
University of Vienna presentation of data freshness:
View file
namefreshness_hdx.pdf
height250