Important fields


Field                 | Description                        | Purpose
data_update_frequency | Dataset suggested update frequency | Shows how often the data is expected to be updated, or at least checked to see whether it needs updating
revision_last_updated | Resource last modified date        | Indicates the last time the resource was updated, irrespective of whether it was a major or minor change
dataset_date          | Dataset date                       | The date referred to by the data in the dataset. It changes when data for a new date comes to HDX, so it may not need to change for minor updates
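
For illustration, this is roughly how those fields might appear in a dataset's and a resource's metadata as returned by the HDX (CKAN) API; the values and exact formats below are assumptions:

    # Illustrative metadata for one dataset and one of its resources (values are assumptions)
    dataset_metadata = {
        "data_update_frequency": "7",    # suggested update frequency in days (weekly)
        "dataset_date": "04/30/2016",    # the date the data itself refers to
    }
    resource_metadata = {
        "revision_last_updated": "2016-05-01T12:00:00",  # last time the resource was modified
    }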

Approach

  1. Determine the scope of the problem by calculating how many datasets are locally and externally hosted. Hopefully we can use HDX itself to calculate this number.
  2. Collect the frequency of updates based on the interns' work?
  3. Define the age of a dataset by calculating: today's date - last modified date (see the sketch after this list).
  4. Compare age with frequency and define the logic: how do we decide that a dataset is outdated?
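
As a minimal sketch of step 3, assuming the last modified date is available as an ISO 8601 string (the field name and date format here are assumptions):

    from datetime import datetime

    def dataset_age_days(last_modified_iso):
        # Age in days: today's date minus the resource's last modified date
        last_modified = datetime.fromisoformat(last_modified_iso)
        return (datetime.now() - last_modified).days

    # e.g. a resource last modified on 1 May and checked on 10 May is 9 days old
    age = dataset_age_days("2016-05-01T00:00:00")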

Determining if a Resource is Updated

The method for determining whether a resource has been updated depends upon where the file is hosted. If it is in the file store, the update time is clear.
If it is hosted externally, it is not so simple: it may be possible to issue an HTTP request and read the Last-Modified header, depending upon whether the server supports it.
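
For externally hosted files, a minimal sketch using the Python requests library (the URL is hypothetical; many servers do not supply the header, in which case this returns None):

    import requests

    def external_last_modified(url):
        # Ask the hosting server for its Last-Modified header via an HTTP HEAD request
        response = requests.head(url, allow_redirects=True, timeout=30)
        return response.headers.get("Last-Modified")

    # Hypothetical example; may print e.g. 'Sun, 01 May 2016 12:00:00 GMT' or None
    print(external_last_modified("https://example.org/data/resource.csv"))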

Number of Files Locally and Externally Hosted

Type        | Number of Resources | Percentage | Example
File Store  | 2,102               | 22%        |
CPS         | 2,459               | 26%        |
HXL Proxy   | 2,584               | 27%        |
ScraperWiki | 162                 | 2%         |
Others      | 2,261               | 24%        |
Total       | 9,568               | 100%       |

Classifying the Age of Datasets


Thought has previously gone into classifying the age of datasets. Reviewing this work, the statuses used (up to date, due, overdue and delinquent) and the formulae for determining those statuses are sound. Hence, building on that work, we have:


Dataset age state thresholds (how old must a dataset be for it to have this status), where f is the dataset's update frequency expressed in days:

Update Frequency | Up-to-date        | Due                         | Overdue                              | Delinquent
Daily            | 0 days old        | 1 day old (due_age = f)     | 2 days old (overdue_age = f + 2)     | 3 days old (delinquent_age = f + 3)
Weekly           | 0 - 6 days old    | 7 days old (due_age = f)    | 14 days old (overdue_age = f + 7)    | 21 days old (delinquent_age = f + 14)
Fortnightly      | 0 - 13 days old   | 14 days old (due_age = f)   | 21 days old (overdue_age = f + 7)    | 28 days old (delinquent_age = f + 14)
Monthly          | 0 - 29 days old   | 30 days old (due_age = f)   | 44 days old (overdue_age = f + 14)   | 60 days old (delinquent_age = f + 30)
Quarterly        | 0 - 89 days old   | 90 days old (due_age = f)   | 120 days old (overdue_age = f + 30)  | 150 days old (delinquent_age = f + 60)
Semiannually     | 0 - 179 days old  | 180 days old (due_age = f)  | 210 days old (overdue_age = f + 30)  | 240 days old (delinquent_age = f + 60)
Annually         | 0 - 364 days old  | 365 days old (due_age = f)  | 425 days old (overdue_age = f + 60)  | 455 days old (delinquent_age = f + 90)
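
A minimal sketch of applying these thresholds, assuming the dataset's age is already expressed in days and the three threshold values are taken from the row matching its update frequency:

    def freshness_status(age_days, due_age, overdue_age, delinquent_age):
        # Compare a dataset's age against the thresholds from the table above
        if age_days >= delinquent_age:
            return "Delinquent"
        if age_days >= overdue_age:
            return "Overdue"
        if age_days >= due_age:
            return "Due"
        return "Up to date"

    # Weekly dataset (due_age = 7, overdue_age = 14, delinquent_age = 21) that is 10 days old
    print(freshness_status(10, 7, 14, 21))  # -> Due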



Thoughts

There are two aspects of data freshness:

...