Introduction

We would like to be able to determine how fresh the data on HDX is, for two purposes. Firstly, we want to be able to encourage data contributors to make regular updates of their data where applicable, and secondly, we want to be able to tell users of HDX how up to date the datasets they are interested in are.

Progress

It was determined that a new field was needed on resources in HDX. This field shows the last time the resource was updated and has been implemented and released to production (HDX-4254). Related to that is ongoing work to make the field visible in the UI (HDX-4894).

Critical to data freshness is having an indication of the update frequency of the dataset. Hence, it was proposed to make the data_update_frequency field mandatory instead of optional and to change its name to make it sound less onerous by adding "expected", i.e. dataset expected update frequency (HDX-4919). It was confirmed that this field should stay at dataset level, as our recommendation would be that if a dataset has resources with different update frequencies, it should be divided into multiple datasets. Assuming the field is a dropdown, it could have the values: daily, weekly, fortnightly, monthly, quarterly, semiannually, annually, never. It would be good to have something pop up if the user chooses "never", making it clear that this is for datasets for which the data is static. We will have to audit datasets where people pick this option, as we don't want people choosing "never" because they don't want to commit to an expected update frequency. The expected update frequency requires further thought, particularly on the issue of static datasets, following which there will be interface design and development effort.

A trigger has been created for Google spreadsheets that will automatically update the resource last modified date when the spreadsheet is edited. This helps with monitoring the freshness of toplines and other resources held in Google spreadsheets and we can encourage data providers to use this where appropriate. Consideration has been given to doing something similar with Excel spreadsheets, but support issues could become burdensome.

Important fields


Field | Description | Purpose
----- | ----------- | -------
data_update_frequency | Dataset expected update frequency | Shows how often the data is expected to be updated or at least checked to see if it needs updating
last_modified | Resource last modified date | Indicates the last time the resource was updated irrespective of whether it was a major or minor change
dataset_date | Dataset ... time period | The date referred to by the data in the dataset. It changes when data for a new date comes to HDX so may not need to change for minor updates

Approach

  1. Determine the scope of our problem by calculating how many datasets are locally and externally hosted. Hopefully we can use HDX to calculate this number.
  2. Collect frequency of updates based on the interns' work?
  3. Define the age of datasets by calculating: today's date - last modified date
  4. Compare age with frequency and define the logic: how do we define an outdated dataset?

...

There are two dates that data can have and this can cause confusion, so we define them clearly here as they pertain to datasets:

  1. Date of update: The last time any resource in the dataset was modified or the dataset was manually confirmed as up to date. The ideal is that the time between updates corresponds with what is selected in the expected update frequency. This is last_modified.

  2. Time period of data: The earliest start date and latest end date across all resources included in the dataset. This is dataset_date.


The method of determining whether a resource is updated depends upon where the file is hosted. If it is hosted by HDX, then the

...

If it is hosted externally, then it is not so simple. It may be possible to use an HTTP request to get the Last-Modified header of the file, depending upon whether the server supports it or not.
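As a rough illustration of the HTTP approach for externally hosted files, the sketch below sends a HEAD request and reads the Last-Modified response header. The function names are hypothetical and are not taken from the HDX codebase. It assumes the server answers HEAD requests and populates the header; many servers do neither, in which case None is returned or an error is raised.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.request import Request, urlopen


def parse_last_modified(header_value: Optional[str]) -> Optional[datetime]:
    """Convert a Last-Modified header value to a UTC datetime.

    Returns None when the header is missing or cannot be parsed.
    """
    if not header_value:
        return None
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # Treat a value without timezone information as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def fetch_last_modified(url: str, timeout: float = 10.0) -> Optional[datetime]:
    """Ask the server for headers only and read Last-Modified if present.

    urlopen raises an error if the server rejects the request.
    """
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return parse_last_modified(response.headers.get("Last-Modified"))
```

A None result only means the server did not report a date, so it cannot be read as "unchanged".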

Number of Files Locally and Externally Hosted

...

22%

...

26%

...

27%

...

2%

...

24%

...

100%

Classifying the Age of Datasets

...

last modified date is recorded, but if externally, then there can be challenges in determining if a URL has been updated or not.

...

  

Dataset Aging Methodology

Once we have the last modified dates for all of a dataset's resources and, if available, the last date the dataset was manually confirmed as updated in the UI, we can calculate the latest of all of them, which we refer to as the “last modified date” from here on. This is used to calculate the dataset’s age which, combined with the update frequency, lets us ascertain the freshness of the dataset.

A dataset's age can be measured using today's date - last modified date. This value can be compared with the update frequency to determine an age status for the dataset.
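The calculation described above can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming the dates are already available as Python datetime objects. It is not the production implementation.

```python
from datetime import datetime
from typing import Iterable, Optional


def dataset_last_modified(
    resource_dates: Iterable[datetime],
    confirmed_date: Optional[datetime] = None,
) -> datetime:
    """Latest of the resources' last modified dates and, if given,
    the date the dataset was manually confirmed as up to date."""
    candidates = list(resource_dates)
    if confirmed_date is not None:
        candidates.append(confirmed_date)
    if not candidates:
        raise ValueError("a dataset needs at least one date to compute its age")
    return max(candidates)


def dataset_age_days(last_modified: datetime, today: datetime) -> int:
    """Age in whole days: today's date - last modified date."""
    return (today - last_modified).days
```

For example, a dataset whose resources were modified on 1 and 10 January 2017 and which was manually confirmed on 15 January 2017 has a last modified date of 15 January, so on 20 January its age is 5 days.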


Thought had previously gone into classification of the age of datasets. Reviewing that work, the statuses used (up to date, due, overdue and delinquent) and the formulae for calculating those statuses are sound, so they have been used as a foundation. It is important that we distinguish between what we report to our users and data providers and what we need for our automated processing. For the purposes of reporting, the terminology we use is simply fresh or not fresh. For contacting data providers, we must give them some leeway from the due date (technically the date after which the data is no longer fresh): the automated email would be sent on the overdue date rather than the due date. The delinquent date would also be used in an automated process that tells us it is time for us to manually contact the data providers to see if they have any problems we can help with regarding updating their data.


Dataset age state thresholds (how old must a dataset be for it to have this status). Statuses are reported under two headings, Fresh and Not Fresh. In the formulae, f is the update frequency in days.

Update Frequency | Up-to-date | Due | Overdue | Delinquent
---------------- | ---------- | --- | ------- | ----------
Daily | 0 days old | 1 day old (due_age = f) | 2 days old (overdue_age = f + 1) | 3 days old (delinquent_age = f + 2)
Weekly | 0 - 6 days old | 7 days old (due_age = f) | 14 days old (overdue_age = f + 7) | 21 days old (delinquent_age = f + 14)
Fortnightly | 0 - 13 days old | 14 days old (due_age = f) | 21 days old (overdue_age = f + 7) | 28 days old (delinquent_age = f + 14)
Monthly | 0 - 29 days old | 30 days old (due_age = f) | 44 days old (overdue_age = f + 14) | 60 days old (delinquent_age = f + 30)
Quarterly | 0 - 89 days old | 90 days old (due_age = f) | 120 days old (overdue_age = f + 30) | 150 days old (delinquent_age = f + 60)
Semiannually | 0 - 179 days old | 180 days old (due_age = f) | 210 days old (overdue_age = f + 30) | 240 days old (delinquent_age = f + 60)
Annually | 0 - 364 days old | 365 days old (due_age = f) | 425 days old (overdue_age = f + 60) | 455 days old (delinquent_age = f + 90)
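To make the thresholds concrete, here is a small sketch that turns a dataset's age into one of the four statuses. The day values are copied from the table above. The dictionary and function names are hypothetical. It also assumes that an age between two thresholds keeps the status of the last threshold reached (for example, a weekly dataset that is 10 days old is still "due"), which the table implies but does not state.

```python
from typing import Dict, Tuple

# (due_age, overdue_age, delinquent_age) in days, per update frequency.
AGE_THRESHOLDS: Dict[str, Tuple[int, int, int]] = {
    "daily": (1, 2, 3),
    "weekly": (7, 14, 21),
    "fortnightly": (14, 21, 28),
    "monthly": (30, 44, 60),
    "quarterly": (90, 120, 150),
    "semiannually": (180, 210, 240),
    "annually": (365, 425, 455),
}


def age_status(update_frequency: str, age_days: int) -> str:
    """Classify a dataset's age against the thresholds for its frequency.

    Raises KeyError for frequencies that have no row in AGE_THRESHOLDS.
    """
    due_age, overdue_age, delinquent_age = AGE_THRESHOLDS[update_frequency]
    if age_days >= delinquent_age:
        return "delinquent"
    if age_days >= overdue_age:
        return "overdue"
    if age_days >= due_age:
        return "due"
    return "up to date"
```

Keeping the thresholds in a lookup table rather than computing them from f mirrors the way the leeway differs from one frequency to the next.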

...

Thoughts

There are two aspects of data freshness:
  1. Date of update: The last time the data was looked at to confirm it is up to date, i.e. it must be examined according to the update frequency
  2. Date of data: The actual date of the data - an update could consist of just confirming that the data has not changed
We should send an automated mail reminder to data contributors if the update frequency time window is missed by a certain amount. Perhaps we should give contributors the option to respond directly to that mail to say that the data is unchanged, so that they don't even need to log into HDX in that case; otherwise we provide the link to their dataset that needs updating.
The number of datasets that are hosted outside of HDX is growing. I think we should try to handle this situation now. The simple but perhaps annoying solution is to send a reminder to users according to the update frequency (irrespective of whether they have already updated, as we cannot tell).
Another way to do so is to provide guidance to users so that, as they consider how to upload resources, we steer them towards a particular technological solution that is helpful to us, e.g. Google spreadsheet with update trigger, document alerts in OneDrive for Business, macro in Excel spreadsheet. I don't know if this is possible, but complete automation would be if they could click something in HDX that creates a resource pointing to a spreadsheet in Google Drive, with the trigger set up, that opens automatically once they enter their Google credentials.

Actions

Update frequency needs to be mandatory: HDX-4919

...


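Update Frequency | Up-to-date | Due | Overdue | Delinquent
---------------- | ---------- | --- | ------- | ----------
Never | Always | Never | Never | Never
Live | Always | Never | Never | Never
As Needed | Always | Never | Never | Never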


Here is a presentation about data freshness from January 2017 that provides a good introduction.

Data Freshness Architecture

Data freshness consists of a database, REST API, freshness process and freshness emailer.

There is a Docker container hosting the Postgres database (https://hub.docker.com/r/unocha/alpine-postgres/ - 201703-PR116) and a port is open on it to allow connections from external database clients (hdxdatateam.xyz:5432). There is another Docker container (https://hub.docker.com/r/mcarans/alpine-haskell-postgrest/) that exposes a REST API to the database (http://hdxdatateam.xyz:3000/); the Docker setup for this is here: https://github.com/OCHA-DAP/alpine-haskell-postgrest. The freshness process and freshness emailer also run within their own Docker containers. The docker-compose file that brings all these containers together is here: https://github.com/OCHA-DAP/hdx-data-freshness-docker.
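As an illustration of how a client might read from the REST API, the sketch below builds a PostgREST-style query URL, where filters take the form column=operator.value, and fetches the rows as JSON. The table name "datasets" and column name "update_frequency" used in the usage note are hypothetical, since the database schema is not described on this page. Whether the endpoint is reachable from a given network is also an assumption.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "http://hdxdatateam.xyz:3000"  # REST API address given above


def build_query_url(table: str, filters: Dict[str, str],
                    api_root: str = API_ROOT) -> str:
    """Build a PostgREST query URL, e.g. {"col": "eq.7"} -> ?col=eq.7"""
    query = urlencode(filters)
    if query:
        return f"{api_root}/{table}?{query}"
    return f"{api_root}/{table}"


def fetch_rows(url: str, timeout: float = 10.0) -> List[Dict[str, Any]]:
    """Fetch the JSON array of rows that PostgREST returns for a table."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For instance, build_query_url("datasets", {"update_frequency": "eq.7"}) yields http://hdxdatateam.xyz:3000/datasets?update_frequency=eq.7, which could then be passed to fetch_rows.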

Here is an overall view of the architecture:

...


Data Freshness Process

Data Freshness Emailer 

Completed Work

Data Freshness Roadmap

Statistics


References

Using the Update Frequency Metadata Field and Last_update CKAN field to Manage Dataset Freshness on HDX:

https://docs.google.com/document/d/1g8hAwxZoqageggtJAdkTKwQIGHUDSajNfj85JkkTpEU/edit#

...

https://github.com/luiscape/hdx-monitor-ageing-service


University of Vienna paper on methodologies for estimating next change time for a resource based on previous update history:

https://www.adequate.at/wp-content/uploads/2016/04/neumaier2016ODFreshness.pdf

University of Vienna presentation of data freshness:

freshness_hdx.pdf

...