HDX Resource Date Coverages, Timelines and Fixed URLs
Consider how a timeline and resource date coverages would work for the following datasets (taken from HDX Enhancements):
- Dataset containing data in xlsx and csv formats as separate resources eg. https://data.humdata.org/dataset/afghan-voluntary-repatriation
We need to encourage the removal of 2019 from the dataset title. How?
The 2 resources would have the same coverage date range (2019).
The latest url would work ok as there are 2 formats so:
https://data.humdata.org/dataset/afghan-voluntary-repatriation/latest/download/data.xlsx
https://data.humdata.org/dataset/afghan-voluntary-repatriation/latest/download/data.csv
- Dataset with rolling updates of resource (ie. dataset end date should be DATE) eg. https://data.humdata.org/dataset/inso-key-data-dashboard, https://data.humdata.org/dataset/indonesia-monthly-humanitarian-update
The resource will have a coverage from the start date to an end date of DATE
The timeline would reflect the rolling end date.
The latest url would work ok.
- Dataset with metadata in resource eg. https://data.humdata.org/dataset/global-airports
Two of the resources will have a coverage from the start date to an end date of DATE as well as a latest url.
The timeline would reflect the rolling end date.
One resource (metadata) won't have coverage and won't have a latest url.
- Dataset with tiff in a zip: https://data.humdata.org/dataset/malawi_national_vulnerability_index_2015 (note the 2015 in the url is incorrect as it is current)
This resource updates each year (it is overwritten). Do we:
  - Rely on the contributor remembering to update the coverage end date when they update the resource each year (which doesn't work well at the moment)
  - On overwriting a resource, prompt contributors to consider the date coverage period (however this won't help for remote urls that get updated)
  - Have a value of DATE that is updated automatically but which, due to the annual nature of the data, means "up until the end of the previous year"
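The third option above could be sketched roughly as follows. This is an illustration only: the function name `resolve_coverage_end`, the `annual` flag and the use of the string "DATE" as the rolling marker are assumptions made for the example, not existing HDX behaviour.

```python
from datetime import date

# Hypothetical marker standing in for the rolling end date (written DATE in these notes).
ROLLING_END = "DATE"


def resolve_coverage_end(end, today, annual=False):
    """Turn a stored coverage end into a concrete date.

    end    -- a datetime.date, or ROLLING_END for a rolling end date
    today  -- the date on which the coverage is being evaluated
    annual -- True if the resource is overwritten once a year
    """
    if end != ROLLING_END:
        # A fixed end date set by the contributor is used unchanged.
        return end
    if annual:
        # Annual data: rolling means "up until the end of the previous year".
        return date(today.year - 1, 12, 31)
    # Other rolling data: coverage runs up to the evaluation date.
    return today
```

With a rule like this, the contributor would not need to edit the end date after each yearly overwrite, but the resource would need some indication that it is annual (the `annual` flag here).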
- Dataset with pdfs, zips (on OneDrive and filestore), mbtiles, tiff: https://data.humdata.org/dataset/iom-npm-cox-bazar-uav-imagery
There are many resources in this dataset.
Many have dates in filenames - should we discourage this somehow (same problem at dataset level)?
If filename gets auto populated, it may well fill in with a date - do we already need to consider changing filename to resource title?
As many are maps, the coverage date is more like a date of validity as it is a single day.
There will be many single day coverage periods in the timeline - how to make it easy to understand?
- Dataset with JSON feed, HXLated JSON feed and xlsx (from automated output): https://data.humdata.org/dataset/migrant-deaths-by-month
Resource coverage periods will be 2017. There should be new resources for 2018 and 2019.
Latest url won't work as there are 2 JSON files.
If we change filename to resource title there is a greater chance of consistent naming enabling latest to differentiate between resources.
- Disaggregate by country into datasets and by indicator into resources eg. https://data.humdata.org/dataset/who-data-for-barbados
Issue same as 4. Latest url is not relevant for this dataset.
- Disaggregate by date into datasets eg. https://data.humdata.org/dataset/syria-idp-flow-and-returnee-data-october-2018, https://data.humdata.org/dataset/syria-idp-flow-and-returnee-data-september-2018
These should now be in one dataset with new resources for each month (which is the date coverage).
- Disaggregate by date into resources within one dataset eg. https://data.humdata.org/dataset/nigeria-humanitarian-needs-overview
Instead of dates in the filename, there will be coverage dates.
- Disaggregate by indicator into datasets eg. https://data.humdata.org/dataset/gender-development-index-female-to-male-ratio-of-hdi
Coverage date is 2013.
- Disaggregate by country into datasets and by date and region into resources eg. https://data.humdata.org/dataset/drc-displacement-data-baseline-assessment-iom-dtm
There are dates in the resource filenames which would become coverage dates.
Latest url would correspond to a region rather than being the latest for all regions. Until we can group by region, nothing much can be done.
- Disaggregate by country into datasets and by round into resources eg. https://data.humdata.org/dataset/nigeria-baseline-data-iom-dtm
Each round corresponds to a date coverage period.
Latest url is latest round.
- Disaggregate by country and emergency into datasets and by round into resources eg. https://data.humdata.org/dataset/indonesia-displacement-data-sulawesi-earthquake-site-assessment-iom-dtm
No problems with this one.
- Map data for a country at different admin levels for various dates eg. https://data.humdata.org/dataset/administrative-boundaries-of-bangladesh-as-of-2015 (note the 2015 in the url is incorrect as it is current)
Issues same as 5.
- Map and population data for a country with varying file formats and metadata in a pdf eg. https://data.humdata.org/dataset/bhutan-administrative-level-0-1-population-statistics
Date should be removed from filename.
Latest url should work because of different file types.
- National and subnational data per set of indicators per country eg. https://feature-data.humdata.org/dataset/dhs-data-for-democratic-republic-of-the-congo
As with other scraper-made datasets, the scraper will need updating to calculate coverage dates per resource rather than per dataset.
Latest url will be a problem for this dataset.
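
To make the latest url cases above concrete, here is a minimal sketch of one possible resolution rule: among the resources of the requested format that have a coverage period, pick the one whose coverage ends last, and treat a tie as ambiguous. The `Resource` class, the `resolve_latest` function and the rule itself are hypothetical and are not a description of how HDX resolves urls.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Resource:
    name: str
    file_format: str
    # None for resources without a coverage period (e.g. a metadata file).
    coverage_end: Optional[date]


def resolve_latest(resources: List[Resource], file_format: str) -> Optional[Resource]:
    """Return the resource a latest url for file_format would point to.

    Returns None if no resource of that format has a coverage period.
    Raises ValueError if several resources of that format share the
    latest coverage end, since the latest url cannot choose between them.
    """
    wanted = file_format.lower()
    candidates = [
        r for r in resources
        if r.coverage_end is not None and r.file_format.lower() == wanted
    ]
    if not candidates:
        return None
    newest = max(r.coverage_end for r in candidates)
    latest = [r for r in candidates if r.coverage_end == newest]
    if len(latest) > 1:
        names = ", ".join(r.name for r in latest)
        raise ValueError(f"latest is ambiguous for format {wanted}: {names}")
    return latest[0]
```

Under this rule the xlsx plus csv case and the one-resource-per-round case resolve cleanly, a metadata resource without coverage gets no latest url, and the case with two JSON files of the same coverage is reported as ambiguous. Datasets split by region or indicator would still need an extra grouping key before a rule like this could help.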