Dataset Update Best Practices
Data contributors who upload datasets manually often create a new dataset each time they have new data, rather than updating the existing one. One such example is here: https://data.humdata.org/dataset/afghanistan-who-does-what-where-october-to-december-2016
Challenges
The creation of new datasets for updated data presents multiple problems:
- The contributor will almost certainly forget to change the expected update frequency of the old dataset to "Never"
- The provider may not name the new dataset consistently with the old one
- Data freshness will eventually report the freshness of the old dataset as "delinquent"
- None of the statistics for the old dataset (e.g. number of downloads) are transferred to the new one
- Any visuals that rely on the URL of a resource in the old dataset must be updated, otherwise they will fail
- Automated systems lose predictability about where to find the current dataset
- As we move towards quality rather than quantity metrics of success, old unmaintained datasets become an issue
Solutions
While we could try to find an automated way to detect new datasets being created for updated data, this would be technically difficult because the link between the old and new dataset may not be obvious, at least to a machine. For example, if the provider structures the dataset name and title inconsistently, an automated solution will likely fail. However, it might be worth having some basic checks in the UI during the dataset creation process, as per Jira ticket HDX-5117.
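As an illustration of what such a basic check might look like, here is a minimal sketch that compares a proposed dataset title with existing titles after removing time-period words. It is not an existing HDX feature: the function names, the list of period tokens and the 0.9 threshold are all assumptions made for this example, and it only uses the Python standard library.

```python
import re
from difflib import SequenceMatcher

# Assumption: years, quarter labels and month names are the "time period" tokens.
_PERIOD_TOKEN = re.compile(
    r"\b(?:(?:19|20)\d{2}|q[1-4]|jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|"
    r"june?|july?|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|"
    r"dec(?:ember)?)\b"
)


def strip_period(title):
    """Lowercase a title, turn punctuation into spaces and drop period tokens."""
    text = re.sub(r"[^a-z0-9]+", " ", title.lower())
    text = _PERIOD_TOKEN.sub(" ", text)
    return " ".join(text.split())


def likely_same_series(new_title, existing_titles, threshold=0.9):
    """Return existing titles that look like earlier editions of new_title.

    The threshold is a hypothetical starting point, not a tuned value.
    """
    base = strip_period(new_title)
    if not base:
        return []
    matches = []
    for title in existing_titles:
        ratio = SequenceMatcher(None, base, strip_period(title)).ratio()
        if ratio >= threshold:
            matches.append(title)
    return matches


existing = [
    "Afghanistan Who Does What Where October to December 2016",
    "Somalia Food Prices",
]
print(likely_same_series(
    "Afghanistan - Who does What Where - January to March 2017", existing
))
```

A match would only justify prompting the contributor to consider updating the existing dataset instead. As noted above, inconsistent naming will defeat a check like this, so it cannot be relied on to catch every case.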
The data freshness overdue email could be expanded to include information about best practices (and also a reminder to update the dataset date if needed). Similarly, best practices information could be made available in the FAQ, and it would probably be good to put it on the dataset creation page as well. We need to highlight the ability to sort resources (manually, by drag and drop on the edit page). Could sorting be made more obvious in the UI?
Best Practices
Our best practices are as follows:
- We recommend providing data disaggregated by country (i.e. one dataset per country with all indicators for that country). This is helpful to country offices, crisis teams etc.
- In addition, we recommend providing data disaggregated by indicator set or indicator (i.e. one dataset per indicator/indicator set with data for all countries). This is helpful for headquarters.
- We discourage providing data disaggregated by time period (i.e. one dataset per time period)
- For each dataset, the whole time period should, if possible, be contained in a single resource that is continually updated
- If that is not possible, the data in that dataset can be disaggregated into resources within that dataset:
  - If the dataset is for a country, then its resources could be per indicator set or indicator
  - If the dataset is for an indicator set or indicator, then its resources could be per country
- If the above is not possible or still results in unwieldy resources, then the resources can be further subdivided by time period within that dataset
- Resources for time periods should be sorted with latest at the top
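To make the last point concrete, the sketch below orders the resources of one dataset so that the most recent time period comes first. The resource names and the `period_end` field are made up for illustration and are not an HDX schema. On HDX itself the ordering is done manually by drag and drop on the edit page.

```python
from datetime import date


def sort_latest_first(resources):
    """Return resources ordered by period_end, most recent first."""
    return sorted(resources, key=lambda r: r["period_end"], reverse=True)


# Hypothetical resources within a single country dataset, split by time period.
resources = [
    {"name": "3W January to March 2016", "period_end": date(2016, 3, 31)},
    {"name": "3W October to December 2016", "period_end": date(2016, 12, 31)},
    {"name": "3W April to June 2016", "period_end": date(2016, 6, 30)},
]

for resource in sort_latest_first(resources):
    print(resource["name"])
```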
Previously we came up with dataset naming conventions to be used internally. Should these be expanded to cover the above best practices and also to include resource naming? Should they be made available to data providers?
Next Steps
- Agree on what our best practices are
- Decide if dataset/resource naming conventions should form part of these best practices
- Clean them up into a form suitable for publication/distribution
- Consider UI improvements