Data Freshness Part 1: Adding a New Metadata Field

The Humanitarian Data Exchange (HDX) is adding a new mandatory metadata field called Expected Update Frequency. It replaces a previous optional field, Update Frequency, and its purpose is to tell us how often datasets shared through the site are likely to be updated.

How will it affect you?

When you create a new dataset or update an existing one in HDX, either on the website or through the API, you will be required to complete a new mandatory field: Expected Update Frequency. It is a rough estimate of how often the data will be updated. It can be changed later, so a best guess is fine to start.

Why are we doing this?

We are introducing a new set of features to HDX based on the concept of "Data Freshness". We are interested in assessing how current the data within each dataset is because we want the portal to become more useful for consumers of the data: one important metric we can give them is how up to date the datasets are.

Imagine a large walk-in freezer in a restaurant. Delivery staff fill it with new products, much as contributors add new datasets to HDX. Cooks look inside for the items they need and mix them in various tasty ways. Analogously, users find datasets in HDX and combine the data for analysis. Foodstuffs can be safely stored in the freezer for different periods of time. If no one checks, the caterers may use stale ingredients, so there needs to be a method to keep track of the contents and, if anything is too old, to order replacements.

Given the choice, chefs would like to use the freshest produce, and similarly we want users to have access to the most up-to-date data in HDX, particularly since it holds over 4,000 datasets. We want to help data providers oversee their data, particularly where update processes are manual, and make it easy for people to find data that is actively maintained.

How do we determine Data Freshness?

Once we know how often the data will be updated (for example every day, week, month or year), we can calculate a dataset’s age as the difference between the current time and the last time the file was updated. If this age is less than the expected update frequency, we consider the dataset fresh. An update does not mean that the data has to change; it could simply confirm that the data is still the same. Suppose the expected update frequency is specified as every week, but the dataset is only updated every month: the dataset will then be considered fresh for only one week each month. If that same dataset were updated weekly or daily, it would always be up to date. If we look at the history of update times, the intervals between them should correspond to what was selected in the expected update frequency field.
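The freshness rule described above can be sketched in a few lines of Python. This is only an illustration and not the HDX implementation: the frequency labels, the number of days assigned to each and the function names are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from an expected update frequency to a number of days.
# The labels and values are assumptions for illustration, not HDX's own list.
FREQUENCY_DAYS = {"daily": 1, "weekly": 7, "monthly": 30, "yearly": 365}


def dataset_age(last_updated, now):
    """Age of the dataset: current time minus the last update time."""
    return now - last_updated


def is_fresh(last_updated, frequency, now):
    """A dataset is fresh if its age is less than the expected update interval."""
    expected_interval = timedelta(days=FREQUENCY_DAYS[frequency])
    return dataset_age(last_updated, now) < expected_interval


# A dataset expected weekly and last updated on 1 March
last_updated = datetime(2017, 3, 1)
print(is_fresh(last_updated, "weekly", datetime(2017, 3, 5)))   # 4 days old
print(is_fresh(last_updated, "weekly", datetime(2017, 3, 10)))  # 9 days old
```

With these assumed values, the dataset counts as fresh on 5 March and no longer fresh on 10 March, which matches the weekly-versus-monthly example in the text.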

Fields in HDX related to Data Freshness

Field | Description | Purpose
data_update_frequency | Dataset expected update frequency | Shows how often the data is expected to be updated or at least checked to see if it needs updating
revision_last_updated | Resource last modified date | Indicates the last time the resource was updated irrespective of whether it was a major or minor change
dataset_date | Dataset date | The date referred to by the data in the dataset. It changes when data for a new date comes to HDX so may not need to change for minor updates

What challenges do we face?

The method of determining whether a resource has been updated depends on where the file is hosted. If it is in HDX, the update time is readily available. If it is hosted externally, it is not as straightforward to find out whether the file pointed to by a URL has changed. We are investigating some approaches to this, including examining the headers returned by HTTP requests for a Last-Modified field, and downloading files on a regular basis, calculating their hashes and comparing them with stored hashes to detect changes.
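As a rough sketch of the two approaches under investigation, the Python below reads a Last-Modified header from an HTTP HEAD request and compares a file's hash with a previously stored one. The function names, the choice of SHA-256 and the use of HEAD rather than GET are assumptions for illustration. They do not describe how HDX will actually do it, and many servers do not return a Last-Modified header at all.

```python
import hashlib
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def parse_last_modified(headers):
    """Return the Last-Modified header as a datetime, or None if absent or invalid."""
    value = headers.get("Last-Modified")
    if value is None:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


def fetch_last_modified(url, timeout=30):
    """Send a HEAD request and look for a Last-Modified header (hypothetical helper)."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return parse_last_modified(response.headers)


def content_hash(data):
    """Hash the downloaded bytes; SHA-256 is an assumed choice."""
    return hashlib.sha256(data).hexdigest()


def has_changed(data, stored_hash):
    """Compare the hash of a fresh download with the hash stored last time."""
    return content_hash(data) != stored_hash
```

The header check is cheap but depends on the server reporting the date honestly, while the hash comparison works for any host at the cost of downloading the whole file each time.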

We are drawing on research being done on data freshness at Vienna University. Specifically, the researchers are looking at estimating the next change time for a resource based on previous update history and applying a Markov chain approach. The research is still ongoing but we hope to learn from their results to enhance HDX.

Let us know what you think of this approach. Send feedback to hdx@un.org.

Watch this space! There will be more coming on the subject of Data Freshness.