Introduction

We would like to be able to determine how fresh the data on HDX is, for two purposes. Firstly, we want to encourage data contributors to make regular updates of their data where applicable, and secondly, we want to be able to tell users of HDX how up to date the datasets they are interested in are.

There are two dates that data can have, and this can cause confusion, so we define them clearly here:

  1. Date of update: The last time the data was looked at to confirm it is up to date. Ideally, the history of update dates corresponds with the selected expected update frequency.
  2. Date of data: The actual date of the data. An update could consist of just confirming that the data has not changed.

When we talk about "update time", we are referring to option 1.
The method of determining whether a resource has been updated depends upon where the file is hosted. If it is hosted by HDX, then the update time is recorded, but if it is hosted externally, then there can be challenges in determining whether a url has been updated. Early research work exists on inferring an update frequency, and other approaches are being explored for obtaining and/or creating an update time.
 
Once we have an update time for a dataset's resources, we can calculate its age and, combined with the update frequency, ascertain the freshness of the dataset.
 
Progress

The implementation of HDX freshness in Python reads all the datasets from HDX (using the HDX Python library) and then goes through a sequence of steps. Firstly, it gets the dataset's update frequency if it has one. If that update frequency is Never, then the dataset is always fresh. If not, it checks whether the dataset and resource metadata have changed - this qualifies as an update from a freshness perspective. It compares the difference between the current time and the update time with the update frequency and sets a status: fresh, due, overdue or delinquent. If the dataset is not fresh based on metadata, then the urls of the resources are examined. If they are internal urls (data.humdata.org, manage.hdx.rwlabs.org), then the HDX metadata is updated whenever the files pointed to by those urls change, so there is no further checking to be done. If they are urls with an adhoc update frequency (proxy.hxlstandard.org, ourairports.com), then freshness cannot be determined. Currently, there is no mechanism in HDX to specify adhoc update frequencies, but there is a proposal to add this to the update frequency options. For now, the freshness value for adhoc datasets is based on whatever has been set for update frequency, which may not reflect how often the data actually changes.
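The classification step above can be sketched roughly as follows. This is a minimal illustration, not the actual HDX freshness code: the helper name and its threshold arguments are hypothetical, and the threshold ages are taken from the table later on this page.

```python
from datetime import datetime

def freshness_status(last_update, now, due_age, overdue_age, delinquent_age):
    """Classify a dataset by its age in days against the threshold ages.

    A due_age of None stands in for an update frequency of Never,
    which makes the dataset always fresh.
    """
    if due_age is None:
        return "fresh"
    age = (now - last_update).days
    if age < due_age:
        return "fresh"
    if age < overdue_age:
        return "due"
    if age < delinquent_age:
        return "overdue"
    return "delinquent"

# A weekly dataset last updated 9 days ago is past due (7) but not overdue (14)
status = freshness_status(datetime(2024, 1, 1), datetime(2024, 1, 10), 7, 14, 21)
# -> "due"
```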

It was determined that a new field was needed on resources in HDX. This field shows the last time the resource was updated; it has been implemented and released to production (HDX-4254). Related to that is ongoing work to make the field visible in the UI (HDX-4894).



Field | Description | Purpose
data_update_frequency | Dataset expected update frequency | Shows how often the data is expected to be updated, or at least checked to see if it needs updating
revision_last_updated | Resource last modified date | Indicates the last time the resource was updated, irrespective of whether it was a major or minor change
dataset_date | Dataset date | The date referred to by the data in the dataset. It changes when data for a new date comes to HDX, so may not need to change for minor updates


Dataset Aging Methodology

A resource's age can be measured using today's date - last update time. For a dataset, we take the lowest age of all its resources. This value can be compared with the update frequency to determine an age status for the dataset.
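A minimal sketch of that age calculation (the helper is hypothetical; it assumes each resource's last update time is available as a datetime):

```python
from datetime import datetime

def dataset_age_days(resource_update_times, now):
    """The dataset's age is the lowest age across its resources,
    i.e. today's date minus the most recent resource update time."""
    return min((now - t).days for t in resource_update_times)

# Two resources, last updated 24 and 5 days ago -> dataset age is 5 days
age = dataset_age_days([datetime(2024, 1, 1), datetime(2024, 1, 20)],
                       datetime(2024, 1, 25))
# -> 5
```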

 

Thought has previously gone into classification of the age of datasets. Reviewing that work, the statuses used (up to date, due, overdue and delinquent) and the formulae for calculating those statuses are sound, so they have been used as a foundation. It is important that we distinguish between what we report to our users and data providers and what we need for our automated processing. For the purposes of reporting, the terminology we would use is simply fresh or not fresh. For contacting data providers, we must give them some leeway from the due date (technically the date after which the data is no longer fresh): the automated email would be sent on the overdue date rather than the due date (but in the email we would tell the data provider that we think their data is not fresh and needs to be updated, rather than referring to states like overdue). The delinquent date would be used in an automated process that tells us it is time to manually contact the data providers to see if they have any problems we can help with regarding updating their data.


Dataset age state thresholds (how old must a dataset be for it to have this status):

Update Frequency | Up-to-date (Fresh) | Due (Not Fresh) | Overdue (Not Fresh) | Delinquent (Not Fresh)
Daily | 0 days old | 1 day old (due_age = f) | 2 days old (overdue_age = f + 2) | 3 days old (delinquent_age = f + 3)
Weekly | 0 - 6 days old | 7 days old (due_age = f) | 14 days old (overdue_age = f + 7) | 21 days old (delinquent_age = f + 14)
Fortnightly | 0 - 13 days old | 14 days old (due_age = f) | 21 days old (overdue_age = f + 7) | 28 days old (delinquent_age = f + 14)
Monthly | 0 - 29 days old | 30 days old (due_age = f) | 44 days old (overdue_age = f + 14) | 60 days old (delinquent_age = f + 30)
Quarterly | 0 - 89 days old | 90 days old (due_age = f) | 120 days old (overdue_age = f + 30) | 150 days old (delinquent_age = f + 60)
Semiannually | 0 - 179 days old | 180 days old (due_age = f) | 210 days old (overdue_age = f + 30) | 240 days old (delinquent_age = f + 60)
Annually | 0 - 364 days old | 365 days old (due_age = f) | 425 days old (overdue_age = f + 60) | 455 days old (delinquent_age = f + 90)
Never | Always | Never | Never | Never
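For automated processing, the thresholds above could be expressed as a lookup. This is an illustrative sketch: the dict layout and function are assumptions, with the ages copied from the table.

```python
# (due_age, overdue_age, delinquent_age) in days, copied from the table above
THRESHOLDS = {
    "Daily": (1, 2, 3),
    "Weekly": (7, 14, 21),
    "Fortnightly": (14, 21, 28),
    "Monthly": (30, 44, 60),
    "Quarterly": (90, 120, 150),
    "Semiannually": (180, 210, 240),
    "Annually": (365, 425, 455),
    # "Never" is deliberately absent: such datasets are always fresh
}

def age_status(frequency, age_days):
    """Map an update frequency and a dataset age in days to an age status."""
    if frequency not in THRESHOLDS:
        return "up to date"  # Never -> always fresh
    due, overdue, delinquent = THRESHOLDS[frequency]
    if age_days < due:
        return "up to date"
    if age_days < overdue:
        return "due"
    if age_days < delinquent:
        return "overdue"
    return "delinquent"

# A 45-day-old Monthly dataset is past overdue_age (44) but not delinquent (60)
# age_status("Monthly", 45) -> "overdue"
```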

[Drawio diagram: Emails and Alerts]

Number of Files Locally and Externally Hosted

Type | Number of Resources | Percentage
File Store | 2,102 | 22%
CPS | 2,459 | 26%
HXL Proxy | 2,584 | 27%
ScraperWiki | 162 | 2%
Others | 2,261 | 24%
Total | 9,568 | 100%


Determining if a Resource is Updated

The method of determining whether a resource is updated depends upon where the file is hosted. If it is in HDX, i.e. in the file store, then the update time is readily available. If it is hosted externally, then it is not as straightforward to find out if the file pointed to by a url has changed. It is possible to use the last_modified field returned from an HTTP GET request, provided the hosting server supports it. (A GET request downloads the body, i.e. the whole file, whereas a HEAD request only returns the header information; for performance reasons we open a GET stream, so that we have the option to read only the header information rather than the entire file.) If the url is a link to a file on a server like Apache or Nginx, then the field often exists, but if it is a url that generates a result on the fly, then it does not. The usefulness of the field needs to be determined by examining the percentage of datasets it correctly covers; according to Vienna University the figure could be as high as 60%, but this must be verified.
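That header check could be sketched with only the standard library as follows. The function names are illustrative; real servers vary in whether they supply the header at all, so callers must handle a None result.

```python
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen

def parse_last_modified(header):
    """Parse an HTTP Last-Modified header value into a datetime, or None."""
    return parsedate_to_datetime(header) if header else None

def url_last_modified(url):
    """Open a GET stream and inspect only the headers, not the body.

    Returns the server's Last-Modified time, or None if the server
    does not provide one (common for dynamically generated urls).
    """
    response = urlopen(Request(url), timeout=10)
    try:
        return parse_last_modified(response.headers.get("Last-Modified"))
    finally:
        response.close()  # release the connection without downloading the body
```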
 
An alternative approach discussed with the researchers is to download all the external urls, hash the files and compare the hashes. The Vienna team had done some calculations and asserted that it would be too resource intensive to hash all of the files mainly due to the time to download all of the urls (and we would need to consider datasets like WorldPop which are huge). However, we can apply logic so that we do not need to download all of the files every day (except on first run) based on the due date described earlier. We could restrict the load further if necessary by checking datasets with a lower update frequency less often than nightly.
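A sketch of the hashing approach, chunked so that very large files (such as WorldPop's) never need to be held fully in memory; the helper is illustrative:

```python
import hashlib

def file_hash(stream, chunk_size=1024 * 1024):
    """Hash a file-like object in 1 MB chunks and return the hex digest.

    Comparing the digest with the one stored from the previous run
    reveals whether the file behind an external url has changed.
    """
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```

In the nightly run, the hash would only need recomputing for resources whose due date has passed, limiting the download load as described above.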
 
The flowchart below represents the logical flow for each resource in HDX and would occur nightly (unless we need to reduce the load):
[Drawio diagram: Hashing Flowchart]
  

References

freshness_hdx.pdf (attached file)