Introduction

We would like to be able to determine how fresh the data on HDX is, for two purposes. Firstly, we want to encourage data contributors to make regular updates of their data where applicable, and secondly, we want to be able to tell users of HDX how up to date the datasets they are interested in are.

There are two dates that data can have and this can cause confusion, so we define them clearly here:

  1. Date of update: The last time the data was looked at to confirm it is up to date, i.e. the data must be examined according to the update frequency
  2. Date of data: The actual date of the data - an update could consist of just confirming that the data has not changed

The method of determining whether a resource is updated depends upon where the file is hosted. If it is hosted by HDX, then the update time is recorded, but if it is hosted externally, then there can be challenges in determining whether a url has been updated or not. Early research work exists on inferring an update frequency, and other approaches are being explored for obtaining and/or creating an update time.

Once we have an update time for a dataset's resources, we can calculate its age and, combined with the update frequency, ascertain the dataset's freshness.

Progress

It was determined that a new field was needed on resources in HDX. This field shows the last time the resource was updated; it has been implemented and released to production (HDX-4254). Related to that is ongoing work to make the field visible in the UI (HDX-4894).

Critical to data freshness is having an indication of the update frequency of the dataset. Hence, it was proposed to make the data_update_frequency field mandatory instead of optional, and to make its name sound less onerous by adding "expected", i.e. "dataset expected update frequency" (HDX-4919). It was confirmed that this field should stay at dataset level, as our recommendation to data providers would be that if a dataset has resources with different update frequencies, it should be divided into multiple datasets. Assuming the field is a dropdown, it could have the values: daily, weekly, fortnightly, monthly, quarterly, semiannually, annually, never. It would be good to have something pop up if the user chooses "never", making it clear that this option is for datasets whose data is static. We will have to audit datasets where people pick this option, as we don't want people choosing "never" simply because they don't want to commit to an expected update frequency.
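
To make the dropdown concrete, here is a minimal sketch of how each expected update frequency could be mapped to a number of days (the "f" used in the aging thresholds later in this document). The encoding of "never" as None and the validator name are assumptions for illustration, not an agreed design.

```python
# Hypothetical mapping from the proposed data_update_frequency dropdown
# values to an expected number of days between updates (the "f" used in
# the dataset aging thresholds below). Encoding "never" as None is an
# assumption for illustration.
UPDATE_FREQUENCIES = {
    "daily": 1,
    "weekly": 7,
    "fortnightly": 14,
    "monthly": 30,
    "quarterly": 90,
    "semiannually": 180,
    "annually": 365,
    "never": None,  # static dataset: no update is ever expected
}

def validate_update_frequency(value):
    """Reject anything outside the dropdown; a UI would additionally pop
    up a confirmation when "never" is chosen."""
    if value not in UPDATE_FREQUENCIES:
        raise ValueError("data_update_frequency must be one of %s, got %r"
                         % (sorted(UPDATE_FREQUENCIES, key=str), value))
    return UPDATE_FREQUENCIES[value]
```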

A trigger has been created for Google spreadsheets that will automatically update the resource last modified date when the spreadsheet is edited. This helps with monitoring the freshness of toplines and other resources held in Google spreadsheets, and we can encourage data contributors to use this where appropriate. Consideration has been given to doing something similar with Excel spreadsheets, but the support issues could become burdensome.

A collaboration has been started with a team at Vienna University who are considering the issue of data freshness from an academic perspective. We will see what we can learn from them, but will likely proceed with a more basic and practical approach than what they envisage. Specifically, they are looking at estimating the next change time for a resource based on its previous update history, which is at an early stage of research and so not ready for use in a real-life system just yet.

Next Steps

The expected update frequency field requires further thought, particularly on the issue of static datasets, following which there will be interface design and development effort.

Once the field is in place, there are some simple improvements we can make that will have a positive impact on data freshness. For example, we should send an automated mail reminder to data contributors if the update frequency time window for any of their datasets is missed by a certain amount. Even for datasets with an update frequency of "never", there could be an argument for a very rare mail reminder just to confirm that the data really is static. For the case where data is unchanged, we should give contributors the option to respond directly to the automated mail to say so (perhaps by clicking a button in the message). Where data has changed, we would provide the link to the dataset that needs updating. We should consider if/how we batch emails when many datasets from one organisation need updating, so that contributors are not bombarded.
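
As a sketch of the batching idea, the helper below drafts one consolidated mail per organisation contact rather than one mail per dataset. The shape of the overdue-dataset records and the reminder wording are assumptions for illustration; actually sending the mail is left to whatever mail service HDX uses.

```python
from collections import defaultdict

def draft_reminders(overdue_datasets):
    """Group overdue datasets by (organisation, contact email) and draft
    one consolidated reminder per contact. Each record is assumed to be a
    dict with 'organization', 'contact_email', 'title' and 'url' keys
    (hypothetical shape)."""
    by_contact = defaultdict(list)
    for ds in overdue_datasets:
        by_contact[(ds["organization"], ds["contact_email"])].append(ds)
    drafts = []
    for (org, email), datasets in sorted(by_contact.items()):
        links = "\n".join("- %s: %s" % (ds["title"], ds["url"])
                          for ds in datasets)
        body = ("The following %s datasets look overdue for an update.\n"
                "If the data is unchanged, reply to confirm; otherwise "
                "please update them here:\n%s" % (org, links))
        drafts.append((email, body))
    return drafts
```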

The number of datasets that are hosted outside of HDX is growing rapidly, and these represent a problem for data freshness as their update time may not be available. Rather than ignore them and concentrate only on HDX-hosted files, it was decided to work out what we can do to handle this situation. The easiest solution is to send a reminder to users according to the update frequency - the problem is that this would be irrespective of whether they have already updated, and so potentially annoying.

Another way is to provide guidance to data contributors so that, as they consider how to upload resources, we steer them towards a particular technological solution that is helpful to us, e.g. using a Google spreadsheet with our update trigger added. We could investigate a fuller integration between HDX and Google spreadsheets, so that if a data contributor clicks a button in HDX, it creates a resource pointing to a spreadsheet in Google Drive with the trigger set up, which opens automatically once they enter their Google credentials. We may need to investigate other platforms, for example creating document alerts in OneDrive for Business and/or macros in Excel spreadsheets (although, as noted earlier, this might create a support headache).

Exploration is currently under way into the headers returned by HTTP requests. Sometimes these headers contain a Last-Modified field. The percentage of externally hosted resources for which this field is usefully populated needs to be measured.
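
A minimal sketch of that measurement using the requests library; the url list is a placeholder, and a fallback to GET for servers that reject HEAD is omitted for brevity.

```python
import requests

def last_modified(url, timeout=30):
    """Return the Last-Modified header for a url, or None if the server
    does not supply one (we expect None for urls that generate their
    output on the fly)."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=timeout)
        return response.headers.get("Last-Modified")
    except requests.RequestException:
        return None

# Placeholder list: in practice this would be every externally hosted
# resource url on HDX.
urls = ["http://example.com/data.csv"]
usable = sum(1 for url in urls if last_modified(url))
print("%.0f%% of urls have a Last-Modified header"
      % (100.0 * usable / len(urls)))
```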

Important fields


Field | Description | Purpose
data_update_frequency | Dataset expected update frequency | Shows how often the data is expected to be updated, or at least checked to see if it needs updating
revision_last_updated | Resource last modified date | Indicates the last time the resource was updated, irrespective of whether it was a major or minor change
dataset_date | Dataset date | The date referred to by the data in the dataset. It changes when data for a new date comes to HDX, so may not need to change for minor updates
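
Since HDX is built on CKAN, these fields should be readable through the standard CKAN action API. A minimal sketch, assuming the field names above appear as-is in the package_show response (the dataset id is a placeholder):

```python
import requests

PACKAGE_SHOW = "https://data.humdata.org/api/3/action/package_show"

def freshness_fields(dataset_id):
    """Fetch one dataset and pull out the fields relevant to freshness."""
    response = requests.get(PACKAGE_SHOW, params={"id": dataset_id},
                            timeout=30)
    dataset = response.json()["result"]
    return {
        "data_update_frequency": dataset.get("data_update_frequency"),
        "dataset_date": dataset.get("dataset_date"),
        "resources_last_modified": [
            resource.get("revision_last_updated")
            for resource in dataset.get("resources", [])
        ],
    }

print(freshness_fields("a-dataset-id"))  # placeholder id
```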

Approach

  • Determine the scope of our problem by calculating how many datasets are locally and externally hosted. Hopefully we can use HDX itself to calculate this number.
  • Collect frequency of updates based on the interns' work?
  • Define the age of datasets by calculating: today's date - last update time (see the Dataset Aging Methodology below).
  • Compare age with frequency and define the logic: how do we define an outdated dataset?

Dataset Aging Methodology

A resource's age can be measured using today's date - last update time. For a dataset, we take the lowest age of all its resources. This value can be compared with the update frequency to determine an age status for the dataset.
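
A minimal sketch of this calculation (timestamps are assumed to be timezone-aware datetimes):

```python
from datetime import datetime, timezone

def resource_age_days(last_update_time):
    """A resource's age: today's date minus its last update time."""
    return (datetime.now(timezone.utc) - last_update_time).days

def dataset_age_days(resource_update_times):
    """A dataset's age is the lowest age of all its resources, i.e. a
    dataset is as fresh as its most recently updated resource."""
    return min(resource_age_days(t) for t in resource_update_times)
```
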
Determining if a Resource is Updated

The method of determining whether a resource is updated depends upon where the file is hosted. If it is in HDX, i.e. in the file store, then the update time is readily available. If it is hosted externally, then it can be problematic to find out whether the file pointed to by a url has changed. It may be possible to use an HTTP GET to read the last_modified field, depending upon whether the server supports it or not. We speculate that if the file is hosted on a server like Apache or Nginx the field will exist, but if the url generates its result on the fly, the field will either not exist or just contain today's date.

Number of Files Locally and Externally Hosted

Type | Number of Resources | Percentage
File Store | 2,102 | 22%
CPS | 2,459 | 26%
HXL Proxy | 2,584 | 27%
ScraperWiki | 162 | 2%
Others | 2,261 | 24%
Total | 9,568 | 100%
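
The split above could be re-measured against live HDX via the CKAN search API. In stock CKAN, filestore uploads carry url_type = "upload"; that classification rule, and its applicability to HDX, are assumptions in the sketch below.

```python
import requests

PACKAGE_SEARCH = "https://data.humdata.org/api/3/action/package_search"

def count_hosting(page_size=1000):
    """Page through all datasets and count resources in the HDX filestore
    versus externally hosted ones, using url_type == "upload" as the
    (assumed) marker of a filestore upload."""
    counts = {"filestore": 0, "external": 0}
    start = 0
    while True:
        result = requests.get(PACKAGE_SEARCH,
                              params={"rows": page_size, "start": start},
                              timeout=60).json()["result"]
        for dataset in result["results"]:
            for resource in dataset.get("resources", []):
                if resource.get("url_type") == "upload":
                    counts["filestore"] += 1
                else:
                    counts["external"] += 1
        start += page_size
        if start >= result["count"]:
            return counts
```
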
Classifying the Age of Datasets

Thought has previously gone into classification of the age of datasets. Reviewing that work, the statuses used (up to date, due, overdue and delinquent) and the formulae for calculating those statuses seem sound, so we will use them as a foundation and see how well they work:


Dataset age state thresholds (how old must a dataset be for it to have this status); f is the expected update frequency in days.

Update Frequency | Up-to-date | Due | Overdue | Delinquent
Daily | 0 days old | 1 day old (due_age = f) | 2 days old (overdue_age = f + 1) | 3 days old (delinquent_age = f + 2)
Weekly | 0-6 days old | 7 days old (due_age = f) | 14 days old (overdue_age = f + 7) | 21 days old (delinquent_age = f + 14)
Fortnightly | 0-13 days old | 14 days old (due_age = f) | 21 days old (overdue_age = f + 7) | 28 days old (delinquent_age = f + 14)
Monthly | 0-29 days old | 30 days old (due_age = f) | 44 days old (overdue_age = f + 14) | 60 days old (delinquent_age = f + 30)
Quarterly | 0-89 days old | 90 days old (due_age = f) | 120 days old (overdue_age = f + 30) | 150 days old (delinquent_age = f + 60)
Semiannually | 0-179 days old | 180 days old (due_age = f) | 210 days old (overdue_age = f + 30) | 240 days old (delinquent_age = f + 60)
Annually | 0-364 days old | 365 days old (due_age = f) | 425 days old (overdue_age = f + 60) | 455 days old (delinquent_age = f + 90)
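
A sketch of the status logic the table implies, with f in days and the overdue/delinquent offsets transcribed from the rows above (for daily, the offsets follow the day counts, i.e. f + 1 and f + 2):

```python
# Offsets (overdue, delinquent) per expected update frequency f in days,
# transcribed from the thresholds table above.
STATUS_OFFSETS = {
    1: (1, 2),      # daily
    7: (7, 14),     # weekly
    14: (7, 14),    # fortnightly
    30: (14, 30),   # monthly
    90: (30, 60),   # quarterly
    180: (30, 60),  # semiannually
    365: (60, 90),  # annually
}

def age_status(age_days, f):
    """Classify a dataset age against its expected update frequency."""
    overdue_offset, delinquent_offset = STATUS_OFFSETS[f]
    if age_days < f:
        return "up to date"
    if age_days < f + overdue_offset:     # due_age = f
        return "due"
    if age_days < f + delinquent_offset:  # overdue_age = f + overdue_offset
        return "overdue"
    return "delinquent"                   # delinquent_age and beyond
```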



Actions

  • Update frequency needs to be mandatory (HDX-4919).
  • Investigate the HTTP GET last modification date field - 60% of resources in HDX have this according to the Vienna University team.
