Test and Adjust Freshness Process

Freshness is exposed in the interface by way of a green leaf symbol which indicates that a dataset is up to date - this means that there has been an update to the metadata or the data in the dataset within the expected update frequency plus some leeway. In producing this document, I have examined whether our definition of freshness makes sense and looked at how users react to it. In particular, I have identified some cases where the freshness process needs adjustment in order to avoid misleading users. Below I outline the most pervasive problems with our freshness feature and then give proposals for a solution, which includes renaming and clearly defining the date of dataset, a new Last Modified metadata field for datasets and resources, and three options for how to present freshness in the UI.

The Problem

Issues that were already identified 

The following issues were found prior to the start of this investigation:

  • Exclude "Live", "As Needed" and "Never" datasets from no touch if already fresh rule - DONE
  • Discount edits made by HDX (as these edits cause datasets to be marked as fresh)
  • Restrict which metadata changes count as updates for freshness
  • Offer an "archived" icon in addition to "fresh" to indicate a dataset that is old, up-to-date, and no longer being updated. At the moment these are being called fresh, which is technically true, but tends to present a lot of old data to users.
  • Date of Dataset is used in different ways (captured later in this document)

Discount edits made by HDX

Edits by HDX staff are typically made to fix issues and have no bearing on how up to date the data is, so they should be ignored by freshness; however, we need to consider what to do about edits to datasets maintained by HDX.

The edits that have been performed on a dataset can be seen by looking at package_revision_list. One complication is that we must go through the full history of edits because someone outside HDX could make an update, followed not long after by someone in HDX. A naive implementation that only looked at the latest edit would miss the earlier one, which should count towards freshness.
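
A minimal sketch of how this walk could look, assuming the revisions come back newest first from package_revision_list and that is_hdx_user is a helper we would write (eg. checking the revision author against a list of HDX staff accounts):

    def last_non_hdx_update(revisions, is_hdx_user):
        """Return the timestamp of the most recent edit not made by HDX
        staff, or None if every edit in the history was made by HDX."""
        for revision in revisions:  # assumed newest first
            if not is_hdx_user(revision["author"]):
                return revision["timestamp"]
        return None

A naive version that inspected only revisions[0] would classify a dataset as HDX-edited whenever an HDX fix followed an outside update, losing the earlier edit that should count towards freshness.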

Restrict which metadata changes count as updates for freshness

Currently any dataset metadata change counts as an update from a freshness perspective. Our assumption is:

  • Such changes are taken as signifying that the dataset maintainer has thought about the data and checked it
  • If they had newer data, then we would expect them to put it into HDX while updating the dataset metadata
  • The fact they haven't means the data is as up to date as possible

This proposal limits our assumption to certain fields - it becomes:

  • Changes in certain metadata fields are taken as signifying that the dataset maintainer has thought about the data and checked it
  • If they had newer data, then we would expect them to put it into HDX while updating these specific dataset metadata fields
  • The fact they haven't means the data is as up to date as possible

The criterion for choosing the fields should be whether they directly affect the underlying data or the freshness calculation:

  • Expected update frequency
  • Dataset date
  • Location?
  • Source?

Note that if the number of fields is severely limited, this may render discounting edits by HDX unnecessary.

Points to consider:

  • Expected update frequency is used to calculate freshness, but then if someone changes it from yearly to monthly, that doesn't indicate anything about the data having changed. If the dataset was delinquent with yearly update frequency, it should still be delinquent with monthly.
  • Why should someone changing the dataset description be any less of an update from a freshness perspective than changing the dataset date?
  • There doesn't seem to be a compelling reason to do a partial restriction of metadata changes counting for freshness - it's really all or nothing:
    • either we regard any metadata change as someone indicating that the data is as fresh as it can be (as we originally envisaged)
    • or we simply disregard metadata changes altogether from the determination of freshness and rely solely on data changes - note that detecting file store changes specifically would need to be investigated (see the sketch after this list)
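
To illustrate the "data changes only" option, here is a minimal sketch, assuming CKAN's resource level last_modified field is set on filestore uploads and holds ISO 8601 timestamps (externally hosted resources would need a different check, eg. an HTTP Last-Modified header):

    def data_changed_since(dataset, since_iso):
        """Return True if any resource's file (not its metadata) was
        modified after since_iso. ISO 8601 timestamps compare correctly
        as strings when they share the same timezone convention."""
        for resource in dataset.get("resources", []):
            last_modified = resource.get("last_modified") or ""
            if last_modified > since_iso:
                return True
        return False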

Offer an "archived" icon in addition to "fresh"

The data in some datasets refers to or covers a date or date period which is far in the past, but the data itself is as up to date as it could be and will not be updated again. For these cases, it makes sense to offer an archived icon instead of fresh (ie. archived would replace the fresh icon currently shown for an expected update frequency of "never").

Date of Dataset is used in different ways

The problem with Date of Dataset is twofold:

  1. What date(s) is it trying to represent?
  2. How should it be updated / how should contributors be encouraged to update it?

More on point 1 below in Confusing concepts related to Date of Dataset.

Discovering Other Issues

To discover other possible issues with how freshness is understood, the following strategy was applied:

  1. Take a random sample of datasets, ensuring that among them are fresh, due, overdue, and delinquent datasets and that they represent a cross section of different organisations' datasets (a sampling sketch follows this list)
  2. Evaluate what fresh and not fresh mean
  3. Determine if it is clear to users
  4. Collect any cases where the fresh label (or lack of it) is misleading
  5. Categorise misleading cases
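
For step 1, the sampling could look something like the sketch below, where update_status is an assumed field holding the status computed by the freshness process; a second pass grouping by organisation would give the cross section mentioned above:

    import random

    def stratified_sample(datasets, per_status=10):
        """Randomly sample datasets while guaranteeing that fresh, due,
        overdue and delinquent datasets are all represented."""
        sample = []
        for status in ("fresh", "due", "overdue", "delinquent"):
            matching = [d for d in datasets
                        if d.get("update_status") == status]
            sample.extend(random.sample(matching,
                                        min(per_status, len(matching))))
        return sample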

With an overview of the misleading cases, we can reconsider the terminology we use, such as fresh and not fresh, so that it accounts for the misleading cases and provides clarity to users.

Misleading cases

The misleading cases are documented in the Google spreadsheet here, and the resources for those datasets were all frozen and stored in GitHub for further analysis. From the full analysis, a subset of examples of specific cases was picked and coloured in red.

Some highlights from the red coloured examples are detailed below:

https://data.humdata.org/dataset/fts-requirements-and-funding-data-for-guatemala

The dataset date here is the date the data was taken from FTS, but it could be the date period the data in the dataset covers. The dataset is fresh.

https://data.humdata.org/dataset/wfp-food-prices-for-sri-lanka

This dataset has a weekly update frequency and its data and metadata were updated recently, so it is fresh, but the data covers a period ending March 2018.

https://data.humdata.org/dataset/philippines-haima-house-damage-pcoded-ndrrmc-sitrep-9

This is one of many examples of a dataset that is delinquent because it is crisis specific: it was updated regularly during a crisis, but updates stopped once the crisis was over. These need to be detected and their expected update frequency set to Never.

https://data.humdata.org/dataset/4w-central-sulawesi-earthquake-and-tsunami-2018

A good example of the problem of dataset date not being updated. The dataset is regularly being updated with new dated resources and is fresh, but the dataset date remains as it was when the dataset was created.

https://data.humdata.org/dataset/iom-dtm-iraq-master-lists-july-2017

The data is from 2017 and does not look like it has been updated, but the dataset activity list shows that Kashif updated it recently - this is likely an example of a metadata update, perhaps a licence change, in which case it should be ignored.

https://data.humdata.org/dataset/nepal-openstreetmap-extracts-buildings

An example of a dataset being delinquent and superseded (by https://data.humdata.org/dataset/hotosm_npl_buildings) - how would we detect that the new dataset is a newer version?

https://data.humdata.org/dataset/irc-ethiopia-all-ongoing-emergency-responses-by-sector-implementing-partners-28-feb-2018

This dataset had a problem so could not be opened. Fixing the problem meant that the dataset was touched and is now fresh - an example of how edits by HDX should be discounted.

https://data.humdata.org/dataset/response-plan-coverage-nepal-earthquake

This dataset was frozen as part of the ScraperWiki migration. After this dataset was chosen for this exercise, the DP team updated it and set its update frequency to Never. Previously it was overdue.

https://data.humdata.org/dataset/ethiopia-healthsites

This dataset is delinquent but may still get new updates in which case its expected update frequency is set to too short a period.

https://data.humdata.org/dataset/lake-chad-basin-baseline-population

The system which had been producing the data for this dataset was decommissioned so this dataset became delinquent.

https://data.humdata.org/dataset/manufacturing-companies-2006-072011-12

This dataset is fresh and was incorrectly touched by freshness due to a bug that was fixed on the first day freshness touching was turned on.

https://data.humdata.org/dataset/worldpop-cameroon

The dataset is marked as fresh. The data is as up to date as it can be, so it qualifies as fresh both on that criterion and on the fact that it was updated within the expected update frequency of one year. The dataset date is 2010 and WorldPop's "date of production" says 2013 - would users regard data covering 2010 as fresh?

If an update was run on this dataset too late, it would be marked as overdue and then delinquent if still not updated. This would correctly reflect the current definition of freshness, but would not be correct when considering the criterion of the data in the dataset being as up to date as it can be (which it would still be).

https://data.humdata.org/dataset/water-sanitation-drinking-water-sanitation-and-hygiene-database

The last update value in the spreadsheet resource and the dataset date are July 2017, but the metadata was updated 11 months ago so it is fresh. It looks like the data is actually from 2015 and earlier.

https://data.humdata.org/dataset/cayman-islands-administrative-level-0-nation-and-1-district-boundaries

In this overdue dataset, the licence column contains the actual date of the data while the dataset date is the creation date of the dataset.

https://data.humdata.org/dataset/guadeloupe-basse-terre-and-grande-terre-islands-populated-places

In this overdue dataset, the methodology column contains the actual date of the data while the dataset date is the creation date of the dataset.

Confusing concepts related to Date of Dataset

The "Date of Dataset" was meant to capture the concept:

  • What date or date period does the data cover

However, it is apparent that people are or could be using this date field to represent a range of things:

  1. What date or date period is the data covering?
  2. When was the data in the dataset last modified?
  3. When was the dataset created? (this may be more of a problem with updating the field)
  4. When was the data reviewed (which could be implied by a metadata update) regardless of whether the data was updated?
  5. When was the data downloaded or taken from a system?
  6. When was the data uploaded into HDX?

Confusing concepts related to freshness

The following are possible dates freshness could use:

  • What date or date period does the data in the dataset cover
  • The date the data in the dataset was last modified
    • Was the update significant or minor?
  • The date the metadata of the dataset was last modified
    • Was the change significant or relevant to any dates we report?

Freshness currently captures the concept:

  • How recently the data was updated within its update frequency

However people may think it is about:

  1. How current is the data in the dataset
  2. How recently the data was updated (regardless of update frequency)

In essence, there are two main concepts that users probably think are captured in freshness when in reality only one of them is:

  1. Was the data updated recently? - this is the current freshness criterion "how recently the data was updated within its update frequency" but with the proviso that for a long update frequency like one year, if something was updated 11 months ago, although it is within the update frequency, users may not regard that as being updated "recently"
  2. Is the data current? - is the date or date period that the data in the dataset covers recent?

The Solution

Proposed Solution for Date of Dataset

Multiple ideas are being represented by "Date of Dataset", rendering the field less valuable than it might be, since there is no easy way to determine which idea is being represented. Consequently, the best approach is to capture the more useful ideas in separate, clearly named and described fields. Some of these ideas are already captured elsewhere and so need only be exposed in the user interface.

The ideas we should see in the UI are:

  1. "Date Coverage" indicating what date or date period does the data in the dataset cover
  2. "Last Modified or Reviewed" showing when the data (not the metadata) was last modified or reviewed
  3. "Dataset Created on HDX" showing when the dataset was created (which is useful to DP team for tracking) - Not sure how useful this is to users, but helps in clarifying what 1 and 2 are not and highlights if the coverage field has not been set correctly (as currently it is often stuck on the dataset creation date).

Clarifying the fields in the UI should hopefully encourage their update.

The way in which this could be implemented is as follows:

  1. Rename "Date of Dataset" to "Date Coverage" and keep the current intended usage of indicating what date or date period does the data in the dataset cover (which can include being a singular moment in time like a 3W). Does this account for all cases or does "Date Coverage" not make sense for some datasets? 
  2. The underlying "Date Coverage" metadata field needs to allow the date or end date to be the current day (rolling forwards each day), eg. by allowing the value "DATE" - hence data that is being added to with each update can be set to a fixed start date and a floating end date, while a download of live current data can just have a floating date (a parsing sketch follows this list). Maybe it is better to take this opportunity to split the dataset_date field into two fields for the start and end rather than messing with the existing one?
  3. Use the last_modified metadata field on resources - I tested it and it indicates when the data was updated, not the resource's metadata. Add a new last_modified field to the dataset metadata. The latest of the resource level last_modified fields should be automatically copied to the dataset level last_modified field, but not the other way round, ie. changing the dataset level last_modified field should not affect the resource level fields (see the propagation sketch below). The dataset list/search UI should show this new field rather than the metadata_modified field it currently shows, and the new field should be added to the dataset page. Freshness will need to be modified to set this field instead of touching resources.
  4. Introduce the concept of "Reviewed" (or "Data is up to date"?) by having a new button in the contributors' (not users') UI, both inside and outside the dataset form, which the maintainer of the dataset or an organisation administrator can click to indicate they have reviewed the dataset's data and agree it is as up to date as it can be. When the pointer hovers over the "Reviewed" button, a popup could ask the contributor to ensure the "Date Coverage" field is correct before clicking the button.
  5. Rather than introduce another new metadata field for the concept of "Reviewed", the dataset level last_modified metadata field can be updated when the "Reviewed" button is clicked (regardless of whether any resource's data has actually been modified). Since we have the resource level last_modified fields, we can determine if the dataset has been reviewed or data has actually changed. Freshness will need to check this dataset level field.
  6. The "Dataset Created" field already exists in the metadata

One advantage of the new "Last Modified" field is that it would be possible to edit that field to set it back to an earlier date in the event of a mistaken update without messing up CKAN's existing field.
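
For points 3 and 5, a minimal sketch of the one-way propagation and the "Reviewed" update, assuming dict-shaped dataset metadata with ISO 8601 timestamps (which compare chronologically as strings):

    from datetime import datetime, timezone

    def propagate_last_modified(dataset):
        """Copy the latest resource level last_modified up to the
        dataset level field, never the other way round."""
        resource_dates = [r["last_modified"]
                          for r in dataset.get("resources", [])
                          if r.get("last_modified")]
        if resource_dates:
            latest = max(resource_dates)
            if latest > dataset.get("last_modified", ""):
                dataset["last_modified"] = latest

    def mark_reviewed(dataset):
        """The "Reviewed" button only touches the dataset level field,
        so the resource level fields still tell us whether the data
        really changed or was merely reviewed."""
        dataset["last_modified"] = datetime.now(timezone.utc).isoformat()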

One benefit of the "Reviewed" button is that we no longer need to infer that a dataset's data has been checked by looking at metadata changes. 

For HXLated data on HDX, if there is an appropriate #date column (or #date+start and #date+end), we could have an option to allow "Date Coverage" to be automatically populated from it (sketched below). This would solve the problem (for a subset of our datasets) of contributors neglecting to update this field. (Similar logic could be applied to "Location" in the HDX UI, which could optionally be autopopulated from #country+code.) Issues like whether to do this on the fly when a user views a dataset, with or without caching, or daily would need to be determined.
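
A sketch of the autopopulation, assuming a CSV with the HXL hashtag row directly below the headers and ISO 8601 dates (so string min/max works); a real implementation would use a proper HXL library and handle the #date+start and #date+end attributes:

    import csv

    def date_coverage_from_hxl(path):
        """Scan a HXLated CSV for #date columns and return (earliest,
        latest) as a candidate "Date Coverage", or None if no dates."""
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            next(reader)             # header row
            hashtags = next(reader)  # HXL tag row
            date_cols = [i for i, tag in enumerate(hashtags)
                         if tag.startswith("#date")]
            values = [row[i] for row in reader
                      for i in date_cols if i < len(row) and row[i]]
        return (min(values), max(values)) if values else None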

Proposed Solution for freshness

Given the different concepts being conflated in the term fresh, it makes sense to me to separate them. Instead of having just one icon, "fresh", with a range of meanings to users, we should have multiple icons that make clear the concepts we are trying to get across. I propose the following categories (a sketch of how they could be derived from the dataset fields follows the list):

  1. "Active" for data that has been recently updated or reviewed (where we need to decide what recent means eg. last 2 weeks) - using the "Last Modified" dataset field mentioned above. 
  2. "New" for newly created datasets (where we need to decide what new means eg. last 2 weeks) - using the "Dataset Created" dataset field. 
  3. "Up to date" for data that has been updated or reviewed within its expected update frequency (or is Live) -  using the "Last Modified" and "Expected update frequency" dataset fields. 
  4. "Current" for data that covers a date or date period close to the present (where we need to decide what close to the present means which might be different depending upon the length of the date period eg. last 2 weeks) - using the "Date Coverage" metadata field(s) described earlier. 
  5. "Archived" for data that covers a date or date period far from the present (where it is up to the maintainer or HDX sys admins to decide what far from the present means), will not be updated again (expected update frequency=Never) and is as up to date as it can be - using the "Date Coverage", "Last Modified" and "Expected update frequency" dataset fields. 
  6. "Superceded" for data that covers a date or date period far from the present (where it is up to the maintainer or HDX sys admins to decide what far from the present means), will not be updated again (expected update frequency=Never) and for which another dataset exists with more current data)- using the "Date Coverage", "Last Modified" and "Expected update frequency" dataset fields. While useful, is it feasible to identify these? Let's leave this for now.

The categories Current and Archived are obviously mutually exclusive. It makes sense for "Active" and "New" not to be used together ie. if a dataset is newly created, there is no need to identify it as "Active" as well. 

There is a draft document on Archiving/Versioning Best Practices here. We need a process whereby, when a dataset becomes delinquent, it is examined, and if it looks like it is for a past crisis or is basically as up to date as it can be, its "Expected update frequency" will be set to "Never" and it will be labelled as "Archived" - in practice this means there needs to be a metadata field that can be set to indicate the dataset is archived. For datasets that are already delinquent, we will have to go through them all to do this. For datasets that become delinquent in future, should we email maintainers, or is the current process, where someone in DP looks at a dataset when it becomes delinquent, sufficient?

Depending upon how the data is structured, it may be possible to determine whether there is another dataset that is the next in the data series, in which case a dataset that would have been marked "Archived" could instead be labelled "Superseded", and it would be a nice feature to link to the newer dataset in the UI. While it might be technically possible to automate suggesting candidates for newer data, it would likely require manual curation to ensure that the correct dataset is identified.
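
One plausible heuristic for suggesting candidates, sketched here using simple title similarity (the field names and the difflib approach are assumptions; as noted, a human would still confirm the match):

    from difflib import SequenceMatcher

    def successor_candidates(archived, datasets, threshold=0.6):
        """Suggest datasets that might supersede an archived one: similar
        title but more recent date coverage, best matches first."""
        scored = []
        for d in datasets:
            if d["id"] == archived["id"]:
                continue
            similarity = SequenceMatcher(None, archived["title"].lower(),
                                         d["title"].lower()).ratio()
            if (similarity >= threshold and
                    d["coverage_end"] > archived["coverage_end"]):
                scored.append((similarity, d))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [d for _, d in scored]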

Other than these exceptions, the categories can be assigned together, ie. a dataset can be in multiple categories. However, if icons are displayed for all the categories a dataset is in, this could lead to too many being displayed at once (at least on the dataset list/search page) - we may need to experiment to see if this causes confusion and/or looks ugly. If so, I can think of three possibilities:

  1. Instead of using icons, use colours or similar on the dates that show in the dataset page and list/search page eg. "Updated January 7, 2019 | Dataset date: Jan 1, 2014".
    Assuming today's date is January 10, 2019, the dataset was created 3 days ago and has not been updated since (ie. "Last Modified" is the same day as the dataset creation date), it covers a date range up to today and it is updated daily, then the text would now be:
    "Created January 7, 2019 | Date coverage: Jan 1, 2014-date | Updates are Every day" - although it was created in the last two weeks and the data is current, it is already delinquent. Other examples: 
    "Updated October 10, 2018 | Date coverage: Jun 1-31, 2018 | Updates are Every year" - this dataset is Up to date, but not Active or Current
    "Updated October 10, 2015 | Date coverage: Jan 1, 2014 | Updates are Never" - this dataset is Archived
  2. We could consider limiting the dataset list/search page to show the most relevant category which means that there must be a priority for when multiple categories apply, for example for a dataset that is "New", "Current" and "Up to date", show "New" but not "Current" or "Up to date". The dataset page would show all the categories as it has more space.
  3. We group the categories under fewer icons, for example the "New" icon would actually mean the dataset is in categories "New", "Current" and "Up to date" and the icon "Active" would mean it is "Active", "Current" and "Up to date". We would have to do this on both the list/search page and the dataset page for consistency.

In the dataset list/search UI, it should be possible to sort by these categories. We need to decide a priority, perhaps using a points-based system such that "New" is at the top, then "Active"+"Current"+"Up to date" is next, etc. Additionally, it would be very useful to be able to filter by these different categories.
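
A sketch of the points-based ordering, with illustrative weights chosen so that "New" alone outranks "Active"+"Current"+"Up to date" (8 > 4+2+1); the actual values would need discussion:

    POINTS = {"New": 8, "Active": 4, "Current": 2, "Up to date": 1,
              "Archived": 0, "Superseded": 0}

    def priority(categories):
        """Sum the points for a dataset's categories to get sort order."""
        return sum(POINTS.get(category, 0) for category in categories)

    # Usage: datasets.sort(key=lambda d: priority(d["categories"]),
    #                      reverse=True)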

I do not suggest distinguishing major and minor updates because it is probably impossible to detect automatically.

By using the new last_modified field, metadata updates will not be counted - we are only concerned with updates to or reviews of data. We leverage the existing resource level last_modified field, which is updated by filestore updates. Given the "Reviewed" button, we do not need to look at metadata changes to infer that a dataset's data has been checked. Ignoring all metadata changes also negates the need to "discount edits by HDX" as a byproduct.