Now that freshness is exposed in the interface, we need to examine whether what it says makes sense and look at how users react to it. In particular, there are cases where the freshness process needs adjustment to avoid misleading users. Below I outline the Problem and then propose a Solution.

The Problem

Issues that were already identified

The following issues were found prior to the start of this investigation:

Exclude "Live", "As Needed" and "Never" datasets from no touch if already fresh rule - DONE
Discount edits made by HDX
Restrict which metadata changes count as updates for freshness
Offer an "archived" icon in addition to "fresh" to indicate a dataset that is old, up-to-date, and no longer being updated. At the moment these are being called fresh, which is technically true, but tends to present a lot of old data to users.
What to do with Date of Dataset, which is used in different ways

Discount edits made by HDX

Edits by HDX staff are typically made to fix issues and have no bearing on how up to date the data is, so freshness should ignore them. However, we need to consider what to do about edits to datasets that are maintained by HDX itself.

The edits that have been performed on a dataset can be seen by looking at package_revision_list. One complication is that we must go through the history of edits because someone outside HDX could make an update, followed not long after by someone in HDX. A naive implementation could miss the first edit which should count towards freshness.
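The history walk described above could be sketched as follows. This is only an illustration: it assumes each revision is a dict with "author" and "timestamp" keys (as a revision list might supply), and the HDX_STAFF set and the usernames in it are hypothetical.

```python
from datetime import datetime

# Hypothetical set of HDX staff usernames; the real list would come from HDX.
HDX_STAFF = {"hdx_staff_a", "hdx_staff_b"}


def last_external_edit(revisions, staff=HDX_STAFF):
    """Return the time of the most recent edit not made by HDX staff, or None.

    Every revision is inspected, so an external edit that was followed
    shortly afterwards by an HDX edit is still found.
    """
    external = [
        datetime.fromisoformat(rev["timestamp"])
        for rev in revisions
        if rev["author"] not in staff
    ]
    return max(external, default=None)


# Assumed revision shape: newest first, with "author" and "timestamp" keys.
revisions = [
    {"author": "hdx_staff_a", "timestamp": "2018-03-02T10:00:00"},
    {"author": "org_maintainer", "timestamp": "2018-03-01T09:00:00"},
    {"author": "org_maintainer", "timestamp": "2017-01-15T12:00:00"},
]
latest = last_external_edit(revisions)
```

Looking only at the newest revision would see the HDX edit and discard it; scanning the whole history picks up the maintainer's edit from the day before.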

Restrict which metadata changes count as updates for freshness

Currently any dataset metadata change counts as an update from a freshness perspective. Our assumption is:

Such changes are taken as signifying that the dataset maintainer has thought about the data and checked it
If they had newer data, then we would expect them to put it into HDX while updating the dataset metadata
The fact they haven't means the data is as up to date as possible

This proposal limits our assumption to certain fields - it becomes:

Changes in certain metadata fields are taken as signifying that the dataset maintainer has thought about the data and checked it
If they had newer data, then we would expect them to put it into HDX while updating these specific dataset metadata fields
The fact they haven't means the data is as up to date as possible

The criteria for choosing the fields should be those that directly affect the underlying data or freshness calculation:

Expected update frequency
Dataset date
Location?
Source?

Note that if the number of fields is severely limited, this may render discounting edits by HDX unnecessary.
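A minimal sketch of how such a restriction might be checked, assuming the old and new dataset metadata are available as dicts. The field keys below are hypothetical stand-ins for the candidate fields listed above, not the actual HDX field names.

```python
# Hypothetical keys standing in for the candidate fields listed above.
FRESHNESS_FIELDS = {"update_frequency", "dataset_date", "location", "source"}


def changed_fields(old, new):
    """Return the set of keys whose values differ between two metadata dicts."""
    keys = set(old) | set(new)
    return {key for key in keys if old.get(key) != new.get(key)}


def counts_as_update(old, new, fields=FRESHNESS_FIELDS):
    """True if at least one changed field is on the freshness allow list."""
    return bool(changed_fields(old, new) & fields)


before = {"description": "Old text", "dataset_date": "2017-01-01"}
description_edit = {"description": "New text", "dataset_date": "2017-01-01"}
date_edit = {"description": "Old text", "dataset_date": "2018-01-01"}
```

With this allow list, counts_as_update(before, description_edit) is False while counts_as_update(before, date_edit) is True.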

Points to consider:

Expected update frequency is used to calculate freshness, but then if someone changes it from yearly to monthly, that doesn't indicate anything about the data having changed. If the dataset was delinquent with yearly update frequency, it should still be delinquent with monthly.
Why should someone changing the dataset description be any less of an update from a freshness perspective than changing the dataset date?
There doesn't seem to be a compelling reason to do a partial restriction of metadata changes counting for freshness - it's really all or nothing:
- either we regard any metadata change as someone indicating that the data is as fresh as it can be (as we originally envisaged)
- or we simply disregard metadata changes altogether from determination of freshness and rely solely on data changes - note that detecting file store changes specifically would need to be investigated
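The two alternatives can be contrasted in a short sketch. It assumes that a metadata modification time and a data modification time are both known for a dataset; the function and parameter names are illustrative only.

```python
from datetime import datetime


def last_update_for_freshness(metadata_modified, data_modified, count_metadata):
    """Pick the date that freshness is measured from.

    count_metadata=True:  any metadata change counts (the original assumption).
    count_metadata=False: only changes to the data itself count.
    """
    if count_metadata:
        return max(metadata_modified, data_modified)
    return data_modified


data_changed = datetime(2017, 1, 10)
frequency_changed = datetime(2018, 6, 1)  # e.g. yearly changed to monthly
```

Under the second option, changing the expected update frequency does not move the reference date, so a dataset that was delinquent stays delinquent.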

Offer an "archived" icon in addition to "fresh"

The data in some datasets refers to or covers a date or date period which is far in the past, but the data itself is as up to date as it could be and will not be updated again. For these cases, it makes sense to offer an archived icon instead of fresh (which would be the icon used at present for an expected update frequency of "never").
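One way the icon choice might be expressed is sketched below. The threshold for "far in the past" is not defined in this document, so ARCHIVE_AFTER_DAYS is a purely hypothetical value, as are the function name and the returned labels.

```python
from datetime import date

# Hypothetical threshold for "far in the past"; no value has been agreed.
ARCHIVE_AFTER_DAYS = 365


def icon_for(update_frequency, coverage_end, today, is_fresh):
    """Choose between "archived", "fresh" and "not fresh" for a dataset."""
    if not is_fresh:
        return "not fresh"
    age_in_days = (today - coverage_end).days
    if update_frequency == "never" and age_in_days > ARCHIVE_AFTER_DAYS:
        return "archived"
    return "fresh"


today = date(2018, 6, 1)
old_static = icon_for("never", date(2010, 12, 31), today, is_fresh=True)
recent_static = icon_for("never", date(2018, 3, 31), today, is_fresh=True)
```

Here a dataset that will never be updated and covers a period ending in 2010 gets the archived icon, while one covering early 2018 is still shown as fresh.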

...

Date of Dataset is used in different ways

The problem with Date of Dataset is twofold:

What date(s) is it trying to represent?
How should it be updated / how should contributors be encouraged to update it?

More on point 1 below in Confusing concepts related to Date of Dataset.

Discovering Other Issues

To discover other possible issues with how freshness is understood, the following strategy was applied:

Take a random sample of datasets ensuring that among them are fresh, due, overdue, and delinquent datasets and that they represent a cross section of different organisations' datasets
Evaluate what fresh and not fresh mean
Determine if it is clear to users
Collect any cases where the fresh label (or lack of it) is misleading
Categorise misleading cases

With an overview of the misleading cases, we can consider what to do about the terminology we use such as fresh and not fresh that accounts for the misleading cases and provides clarity to users.
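The sampling step in the strategy above might look like the following sketch. It assumes each dataset is a dict with "status" and "organisation" keys, which are illustrative names; it is not necessarily how the sample for this investigation was drawn.

```python
import random
from collections import defaultdict


def stratified_sample(datasets, per_group, seed=0):
    """Draw up to per_group random datasets for every (status, organisation) pair."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    groups = defaultdict(list)
    for dataset in datasets:
        groups[(dataset["status"], dataset["organisation"])].append(dataset)
    sample = []
    for key in sorted(groups):
        members = groups[key]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


statuses = ["fresh", "due", "overdue", "delinquent"]
datasets = [
    {"name": f"{status}-{org}-{i}", "status": status, "organisation": org}
    for status in statuses
    for org in ["org_a", "org_b"]
    for i in range(5)
]
sample = stratified_sample(datasets, per_group=2)
```

Grouping by both status and organisation guarantees that every combination present in the catalogue appears in the sample, which a plain random draw would not.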

Misleading cases

The misleading cases are documented in the Google spreadsheet here, and the resources for those datasets were all frozen and stored in GitHub for further analysis. From the full analysis, a subset of specific example cases was picked and coloured in red.

...

In this overdue dataset, the methodology column contains the actual date of the data while the dataset date is the creation date of the dataset.

Confusing concepts related to Date of Dataset

The "Date of Dataset" was meant to capture the concept:

...

"Date Coverage" indicating what date or date period does the data in the dataset cover
"Last Modified or Reviewed" showing when the data (not the metadata) was last modified or reviewed
"Dataset Created on HDX" showing when the dataset was created - Not sure how useful this is to users, but helps in clarifying what 1 and 2 are not and highlights if the coverage field has not been set correctly (as currently it is often stuck on the dataset creation date).

Clarifying the fields in the UI should hopefully encourage their update.
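As a small illustration of how the third field can highlight an unset coverage date, the check below flags datasets whose coverage is still equal to the creation date. The function and parameter names are assumptions for the sketch, not existing HDX code.

```python
from datetime import date


def coverage_looks_unset(coverage_start, coverage_end, created_on):
    """True if the coverage dates are still stuck on the dataset creation date."""
    return coverage_start == created_on and coverage_end == created_on


created = date(2016, 5, 4)
stuck = coverage_looks_unset(created, created, created)
properly_set = coverage_looks_unset(date(2015, 1, 1), date(2015, 12, 31), created)
```

A match is only a hint, since data could legitimately cover the day on which the dataset was created.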

The way in which this could be implemented is as follows:

...
