
Rationale

If we want to get to QuickMaps and QuickDash, properly support QuickCharts, support curation of data, and increase use of the API for retrieving data (as per the brief discussion in the team call), then I see the proposal set out below as a necessary first stage and hence something we should consider for our roadmap.

Currently, users of data in our resources may find that the data stops being updated without warning because the dataset contributor wishes to publish their most current data and does so by creating a new resource or dataset rather than updating the existing resource in the dataset. The unfortunate consequence is that some organisations' data cannot be used in automated processes, not because the data is in some way bad, but simply because of the way they choose to update it.

If we want more people to use direct URLs to retrieve data for automated reports, visuals and systems (as opposed to clicking the Download button), then they need URLs that do not have to change each time the contributor adds new data. The same applies to any dashboards, maps and charts we wish to build and maintain. In fact, it is not just automated processes that would benefit: people manually compiling regular reports would no longer have to search for the most current URL each time.

I'll give a use case: an HDX user starts building a Quick Dashboard (or compiling a manual monthly report) based on multiple datasets in HDX. The dashboard (or user) pulls data from the URLs of resources in those datasets. A contributor then comes along to add current data to their dataset, which happens to be one used in the dashboard (or report). Unaware that their dataset is being used by the dashboard (or user), the contributor decides to do one (or more) of the following:

  1. add the most current data in a new dataset
  2. add the most current data in a new resource within the dataset
  3. delete the dataset
  4. change the format of the resource
  5. change the types of columns in the resource

Options 1 and 2 mean that the user's dashboard (or report) will silently show old data. Options 3, 4 and 5 will cause some sort of failure in the dashboard (or errors the next time the user tries to rebuild the report from the same URLs).

What Needs to be Investigated

Hence finding a way to have fixed data URLs is critical to a number of possible future goals. This was previously referred to as the "stable API" alluded to in the team chat, but "fixed data URLs" better reflects that a URL may simply point to a file, e.g. a CSV, rather than to an API endpoint. Here is an example: https://data.humdata.org/dataset/6a60da4e-253f-474f-8683-7c9ed9a20bf9/resource/45dc4269-405a-433d-9011-d1ae23d624a5/download/fts_requirements_funding_cluster_afg.csv

Fixing data URLs may sound simple, but it requires delving into, or at least considering, a number of issues. Fortunately many of them have already appeared on the brainstorming ideas Trello board; now we have a common thread that ties them together.

  • we need to be able to distinguish data resources from auxiliary ones - the joint top brainstorming idea from the last meeting was to do exactly that and show it in the UI
  • resources can't keep growing indefinitely - we need a way to archive non-current data (different to dataset archiving, which is handled in freshness)
  • newly added data may contain errors, so it may be helpful to be able to fall back to a previous version of the data, e.g. if a dashboard cannot load latest/xxx.csv, it could try 1/xxx.csv (the versioning brainstorming idea)
  • finding data by API needs to be simpler. Currently the limitation there is the capability of the CKAN API search. It can be helped by adding more metadata to the dataset, for example the list of fields and HXL tags in the data (as in another brainstorming idea)
  • a system whereby automated users (and perhaps human users as well) can register to receive important information about a dataset they are using, e.g. a breaking change to the format, or that it is no longer being updated
  • a workflow that tries to alert a contributor when an update they are making to a resource has unexpected field names, data type changes, etc.
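To make the versioning fallback idea above concrete: a dashboard client could try a "latest" URL first and then walk back through numbered versions. The URL scheme and helper names below are hypothetical, assuming versions are exposed as path segments:

```python
# Sketch of a client-side version fallback, assuming a hypothetical
# latest/<file>, 3/<file>, 2/<file>, 1/<file> URL scheme.
import urllib.error
import urllib.request

def candidate_urls(base_url: str, filename: str, max_back: int = 3) -> list:
    """Latest first, then numbered versions from newest to oldest."""
    return [f"{base_url}/latest/{filename}"] + [
        f"{base_url}/{n}/{filename}" for n in range(max_back, 0, -1)
    ]

def fetch_with_fallback(base_url: str, filename: str) -> bytes:
    """Return the first version of the file that can actually be retrieved."""
    for url in candidate_urls(base_url, filename):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.URLError:
            continue  # that version is missing or unreachable; try the next
    raise RuntimeError(f"no retrievable version of {filename}")
```

A dashboard using this would degrade to slightly older data instead of failing outright when the newest upload is broken.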
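The contributor-alert workflow in the last bullet could start with something as simple as comparing header rows between the existing resource and the newly uploaded file. This is a sketch under that assumption; the function name is hypothetical and real data-type checks would go further:

```python
# Sketch: detect added/removed columns between an old and a new CSV upload,
# so the contributor can be warned before the change is saved.
import csv
import io

def header_changes(old_csv: str, new_csv: str) -> dict:
    """Compare the header rows of two CSV texts and report column changes."""
    old = next(csv.reader(io.StringIO(old_csv)))
    new = next(csv.reader(io.StringIO(new_csv)))
    return {
        "added": [c for c in new if c not in old],
        "removed": [c for c in old if c not in new],
    }

changes = header_changes("date,org,amount", "date,organisation,amount")
# A non-empty result could trigger a warning to the contributor on upload.
```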