Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Introduction

...


FieldDescriptionPurpose
data_update_frequencyDataset expected update frequencyShows how often the data is expected to be updated or at least checked to see if it needs updating
revision_last_updatedResource last modified dateIndicates the last time the resource was updated irrespective of whether it was a major or minor change
dataset_dateDataset dateThe date referred to by the data in the dataset. It changes when data for a new date comes to HDX so may not need to change for minor updates

Dataset Aging Methodology

A resource's age can be measured using today's date - last update time. For a dataset, we take the lowest age of all its resources. This value can be compared with the update frequency to determine an age status for the dataset.

Thought has previously gone into classification of the age of datasets. Reviewing that work, the statuses used (up to date, due, overdue and delinquent) and formulae for calculating those statuses seems sound so we will use them as a foundation and see how well they work:


Update Frequency

Dataset age state thresholds

(how old must a dataset be for it to have this status)

Up-to-date

Due

Overdue

Delinquent

Daily

0 days old

1 day old

due_age = f

2 days old

overdue_age = f + 2

3 days old

delinquent_age = f + 3

Weekly

0 - 6 days old

7 days old

due_age = f

14 days old

overdue_age = f + 7

21 days old

delinquent_age = f + 14

Fortnightly

0 - 13 days old

14 days old

due_age = f

21 days old

overdue_age = f + 7

28 days old

delinquent_age = f + 14

Monthly

0 -29 days old

30 days old

due_age = f

44 days old

overdue_age = f + 14

60 days old

delinquent_age = f + 30

Quarterly

0 - 89 days old

90 days old

due_age = f

120 days old

overdue_age = f + 30

150 days old

delinquent_age = f + 60

Semiannually

0 - 179 days old

180 days old

due_age = f

210 days old

overdue_age = f + 30

240 days old

delinquent_age = f + 60

Annually

0 - 364 days old

365 days old

due_age = f

425 days old

overdue_age = f + 60

455 days old

delinquent_age = f + 90



Determining if a Resource is Updated

The method of determining whether a resource is updated depends upon where the file is hosted. If it is in HDX ie. in the file store, then the update time is readily available. If it is hosted externally, then it can be problematic to find out if the file pointed to by a url has changed. It may be possible to use the last_modified field that is returned from an HTTP get the last_modified field request depending upon whether the hosting server supports it or not. We speculate that if it is hosted a link to a file on a server like Apache or Nginx that then the field will exist, but if it is a url that generates a result on the fly, that the field will eitehr either not exist or just contain today's date.  Hence, the usefulness of the field needs to be determined by examining the percentage of datasets it correctly covers. According to Vienna University the figure could be as high as 60%, but this must be verified.

Number of Files Locally and Externally Hosted

TypeNumber of ResourcesPercentageExample
File Store                                  2,102
22%

CPS                                  2,459
26%

HXL Proxy                                  2,584
27%

ScraperWiki                                     162
2%

Others                                  2,261
24%

Total                                  9,568
100%

Thoughts

Actions
Update frequency needs to be mandatory: 
Jira Legacy
serverJIRA (humanitarian.atlassian.net)
serverIdefab48d4-6578-3042-917a-8174481cd056
keyHDX-4919
Investigate http get last modification date field - 60% in HDX have this according to UofV.

References

Using the Update Frequency Metadata Field and Last_update CKAN field to Manage Dataset Freshness on HDX:

...