Purpose
We want to determine how fresh the data on HDX is, for two purposes. First, we want to encourage data providers to update their data regularly where applicable. Second, we want to tell HDX users how up to date the datasets they are interested in are.
Progress
It was determined that a new field was needed on resources in HDX. This field shows the last time the resource was updated; it has been implemented and released to production (HDX-4254). Related work to make the field visible in the UI is ongoing (HDX-4894).
Critical to data freshness is having an indication of the update frequency of the dataset. Hence, it was proposed to make the data_update_frequency field mandatory instead of optional, and to rename it to sound less onerous by adding "expected", i.e. dataset expected update frequency (HDX-4919). It was confirmed that this field should stay at dataset level: our recommendation is that a dataset whose resources have different update frequencies should be divided into multiple datasets.
Assuming the field is a dropdown, it could have the values: daily, weekly, fortnightly, monthly, quarterly, semiannually, annually, never. If the user chooses "never", something should pop up making it clear that this option is for datasets whose data is static. We will have to audit datasets where people pick this option, as we don't want people choosing "never" simply because they don't want to commit to an expected update frequency. The expected update frequency requires further thought, particularly on the issue of static datasets, after which there will be interface design and development effort.
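As an illustration only, the proposed dropdown values and the "never" confirmation could be modelled as below. The day counts are taken from the thresholds table later on this page; the names EXPECTED_UPDATE_FREQUENCY_DAYS, validate_update_frequency and confirmed_static are hypothetical and are not part of HDX or CKAN.

```python
from typing import Optional

# Assumed mapping from dropdown label to expected update interval in days.
# None marks static datasets ("never").
EXPECTED_UPDATE_FREQUENCY_DAYS = {
    "daily": 1,
    "weekly": 7,
    "fortnightly": 14,
    "monthly": 30,
    "quarterly": 90,
    "semiannually": 180,
    "annually": 365,
    "never": None,
}


def validate_update_frequency(
    value: Optional[str], confirmed_static: bool = False
) -> Optional[int]:
    """Return the interval in days, or None for a confirmed static dataset.

    Raises ValueError if the field is missing, unknown, or "never" was
    chosen without the user confirming that the data is static.
    """
    if value is None or not value.strip():
        raise ValueError("expected update frequency is mandatory")
    key = value.strip().lower()
    if key not in EXPECTED_UPDATE_FREQUENCY_DAYS:
        raise ValueError(f"unknown update frequency: {value!r}")
    if key == "never" and not confirmed_static:
        raise ValueError("'never' is only for static datasets; confirmation required")
    return EXPECTED_UPDATE_FREQUENCY_DAYS[key]
```

Here the confirmed_static flag stands in for the pop-up confirmation; datasets stored with "never" would still need the audit described above.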
A trigger has been created for Google spreadsheets that will automatically update the resource last modified date when the spreadsheet is edited. This helps with monitoring the freshness of toplines and other resources held in Google spreadsheets and we can encourage data providers to use this where appropriate. Consideration has been given to doing something similar with Excel spreadsheets, but support issues could become burdensome.
Important fields
Field | Description | Purpose |
---|---|---|
data_update_frequency | Dataset expected update frequency | Shows how often the data is expected to be updated or at least checked to see if it needs updating |
revision_last_updated | Resource last modified date | Indicates the last time the resource was updated irrespective of whether it was a major or minor change |
dataset_date | Dataset date | The date referred to by the data in the dataset. It changes when data for a new date comes to HDX so may not need to change for minor updates |
Approach
- Determine the scope of our problem by calculating how many datasets are locally and externally hosted. Hopefully we can use HDX itself to calculate this number.
- Collect the frequency of updates, based on the interns' work?
- Define the age of a dataset by calculating: today's date - last modified date
- Compare age with frequency and define the logic: how do we define an outdated dataset?
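The age calculation and the comparison with frequency could be sketched as follows. The rule used here (a dataset is outdated once its age reaches its expected update frequency) is only one possible answer to the open question above, and the function names are hypothetical.

```python
from datetime import date


def dataset_age_days(last_modified: date, today: date) -> int:
    """Age of a dataset in whole days: today's date - last modified date."""
    return (today - last_modified).days


def is_outdated(last_modified: date, frequency_days: int, today: date) -> bool:
    """Assumed rule: outdated once the age reaches the expected frequency."""
    return dataset_age_days(last_modified, today) >= frequency_days


# Usage: a monthly dataset (30 days) last modified on 1 January.
age = dataset_age_days(date(2016, 1, 1), date(2016, 1, 31))
outdated = is_outdated(date(2016, 1, 1), 30, date(2016, 1, 31))
```

Passing today explicitly, rather than calling date.today() inside the functions, keeps the calculation reproducible when run over a stored snapshot of metadata.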
Determining if a Resource is Updated
Number of Files Locally and Externally Hosted
Type | Number of Resources | Percentage | Example |
---|---|---|---|
File Store | 2,102 | 22% | |
CPS | 2,459 | 26% | |
HXL Proxy | 2,584 | 27% | |
ScraperWiki | 162 | 2% | |
Others | 2,261 | 24% | |
Total | 9,568 | 100% | |
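The percentages in the table can be reproduced from the resource counts with a short calculation. This is a sketch for checking the figures; the function name hosting_percentages is hypothetical.

```python
# Resource counts copied from the table above.
RESOURCE_COUNTS = {
    "File Store": 2102,
    "CPS": 2459,
    "HXL Proxy": 2584,
    "ScraperWiki": 162,
    "Others": 2261,
}


def hosting_percentages(counts: dict) -> dict:
    """Return each type's share of all resources, rounded to a whole percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


total_resources = sum(RESOURCE_COUNTS.values())
percentages = hosting_percentages(RESOURCE_COUNTS)
```

Because each share is rounded separately, the rounded values (22, 26, 27, 2 and 24) add up to 101 rather than 100; the 100% in the Total row refers to the unrounded total.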
Classifying the Age of Datasets
Thought has previously gone into classifying the age of datasets. Having reviewed that work, we find that the statuses used (up to date, due, overdue and delinquent) and the formulae for determining them are sound, so we will build on that foundation.
Dataset age state thresholds (how old a dataset must be to have each status; f is the update frequency in days):
Update Frequency | Up-to-date | Due | Overdue | Delinquent |
---|---|---|---|---|
Daily | 0 days old | 1 day old due_age = f | 2 days old overdue_age = f + 1 | 3 days old delinquent_age = f + 2 |
Weekly | 0 - 6 days old | 7 days old due_age = f | 14 days old overdue_age = f + 7 | 21 days old delinquent_age = f + 14 |
Fortnightly | 0 - 13 days old | 14 days old due_age = f | 21 days old overdue_age = f + 7 | 28 days old delinquent_age = f + 14 |
Monthly | 0 - 29 days old | 30 days old due_age = f | 44 days old overdue_age = f + 14 | 60 days old delinquent_age = f + 30 |
Quarterly | 0 - 89 days old | 90 days old due_age = f | 120 days old overdue_age = f + 30 | 150 days old delinquent_age = f + 60 |
Semiannually | 0 - 179 days old | 180 days old due_age = f | 210 days old overdue_age = f + 30 | 240 days old delinquent_age = f + 60 |
Annually | 0 - 364 days old | 365 days old due_age = f | 425 days old overdue_age = f + 60 | 455 days old delinquent_age = f + 90 |
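The table above can be turned into a small lookup that classifies a dataset from its age and expected update frequency. This is a minimal sketch: it assumes each day count in the table is the lower bound of its status (for example, a weekly dataset is "due" from 7 to 13 days old), and the names AGE_THRESHOLDS and freshness_status are hypothetical.

```python
# frequency in days -> (due_age, overdue_age, delinquent_age), from the table.
AGE_THRESHOLDS = {
    1: (1, 2, 3),          # daily
    7: (7, 14, 21),        # weekly
    14: (14, 21, 28),      # fortnightly
    30: (30, 44, 60),      # monthly
    90: (90, 120, 150),    # quarterly
    180: (180, 210, 240),  # semiannually
    365: (365, 425, 455),  # annually
}


def freshness_status(age_days: int, frequency_days: int) -> str:
    """Classify a dataset as up to date, due, overdue or delinquent.

    Assumption: each threshold is an inclusive lower bound for its status.
    """
    if age_days < 0:
        raise ValueError("age cannot be negative")
    due_age, overdue_age, delinquent_age = AGE_THRESHOLDS[frequency_days]
    if age_days >= delinquent_age:
        return "delinquent"
    if age_days >= overdue_age:
        return "overdue"
    if age_days >= due_age:
        return "due"
    return "up to date"
```

Storing the day counts directly, rather than recomputing them from the offsets, keeps the sketch in step with the table whichever way the formulae are written. Static datasets ("never") have no entry here and would need to be handled separately.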
Thoughts
- Date of update: The last time the data was looked at to confirm that it is up to date, i.e. it must be examined according to the update frequency
- Date of data: The actual date of the data - an update could consist of just confirming that the data has not changed
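One way to picture the distinction between the two dates is a record that tracks them separately. This is an illustrative sketch only; the class DatasetDates and its method names are hypothetical and do not correspond to existing HDX fields.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetDates:
    last_checked: date  # date of update: when the data was last confirmed current
    data_date: date     # date of data: the date the data itself refers to

    def confirm_unchanged(self, today: date) -> None:
        """An update that only confirms the data has not changed."""
        self.last_checked = today

    def record_new_data(self, today: date, data_date: date) -> None:
        """An update that brings data for a new date."""
        self.last_checked = today
        self.data_date = data_date


# Usage: the provider checks in February and finds nothing has changed.
dates = DatasetDates(last_checked=date(2016, 1, 1), data_date=date(2015, 12, 31))
dates.confirm_unchanged(date(2016, 2, 1))
```

After the confirmation, last_checked moves forward while data_date stays the same, so freshness would be judged on the former.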
Actions
References
Using the Update Frequency Metadata Field and Last_update CKAN field to Manage Dataset Freshness on HDX:
https://docs.google.com/document/d/1g8hAwxZoqageggtJAdkTKwQIGHUDSajNfj85JkkTpEU/edit#
Dataset Aging service:
https://docs.google.com/document/d/1wBHhCJvlnbCI1152Ytlnr0qiXZ2CwNGdmE1OiK7PLzo/edit
https://github.com/luiscape/hdx-monitor-ageing-service