Estimating the number of Data Series on HDX

Introduction

Many datasets on HDX contain the same type of information but are split by location for the convenience of our users. For the purposes of this document, we refer to such a group of datasets as a “data series”. In some cases the contributor provides the datasets already split by location. In other cases HDX disaggregates by country via script when the datasets are created or updated.

CKAN doesn’t contain a “data series” concept, but it is useful to us to have some idea of how many data series there are. Some day, we may want to introduce the data series concept in CKAN as it might simplify searching, or allow automated aggregations of datasets by region, world, etc.

Definition of a data series

For the first estimation of data series, geographic aggregation was used. In other words:

Datasets from the same contributor about the same thing but covering different locations are part of the same data series. Conversely, datasets from the same contributor and about the same location but covering different topics are part of different data series.

A “singular dataset” is one that has only itself in its data series.

Some examples:

  • Same Data Series (same theme, different location)

    • HOTOSM Bangladesh (southwest) Points of Interest (OpenStreetMap Export)

    • HOTOSM Bangladesh (northwest) Points of Interest (OpenStreetMap Export)

  • Different Data Series (different theme, same location)

    • HOTOSM Bangladesh (southwest) Points of Interest (OpenStreetMap Export)

    • HOTOSM Bangladesh (southwest) Waterways (OpenStreetMap Export)

First Estimation

This first estimation of data series was executed on 4-Jun-2021. The process is described below, but here are the results:

Number of datasets on HDX: 19,069

Number of data series on HDX (estimated): 6,092

Number of singular datasets on HDX: 5,735

Number of non-singular data series (with >1 dataset): 357

% of datasets on HDX that are part of non-singular data series: 70%

The takeaway is that 357 data series on HDX account for 70% of the datasets on HDX.
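The figures above are related by simple arithmetic, which can be checked with a short sketch. It assumes, per the definition earlier in this document, that every singular dataset counts as its own data series; the variable names are illustrative only.

```python
# Figures reported for the 4-Jun-2021 estimate.
total_datasets = 19_069
total_series = 6_092
singular_datasets = 5_735  # each one is also a data series of size 1

# Data series that contain more than one dataset.
non_singular_series = total_series - singular_datasets

# Datasets that belong to a non-singular data series.
datasets_in_non_singular = total_datasets - singular_datasets
share = datasets_in_non_singular / total_datasets

print(non_singular_series)       # 357
print(datasets_in_non_singular)  # 13334
print(f"{share:.1%}")            # 69.9%
```

The unrounded share is 69.9%, which is reported as 70%.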

There are 54 data series with more than 100 datasets in them and these account for 40% of the datasets.

In the chart below, you can see that most of the non-singular data series have between 2 and 12 datasets.

The data for this analysis can be found here: https://drive.google.com/drive/folders/1hvQ-BOOVIY8CfNsgt1SOQ3JU0W0py1D7

Methodology

Datasets were grouped into data series by a script (attached below) that applies the following test: datasets from the same contributor whose names are identical once all words associated with country names are removed are highly likely to be part of the same data series.

The script uses these steps to implement that test:

  1. Take each dataset name and strip out all punctuation.

  2. Remove words from the name that appear in any country name in HDX Python Country. This includes alternate names as well as names in Russian, Chinese, Arabic, French, and Spanish.

  3. Concatenate the contributing organization name with the “cleaned” dataset name.

  4. Any datasets having the same resulting concatenated string are considered to be in the same data series.
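The four steps above can be sketched roughly as follows. This is not the attached script: the function names, the sample organization and dataset names, and the tiny `COUNTRY_WORDS` set (a stand-in for the word list derived from HDX Python Country) are all assumptions made for illustration. Lowercasing and replacing punctuation with spaces are also assumptions about how "strip out all punctuation" might be done.

```python
import string
from collections import defaultdict

# Hypothetical stand-in for the words taken from all country names.
COUNTRY_WORDS = {"bangladesh", "kenya", "nepal"}

# Maps every ASCII punctuation character to a space.
PUNCTUATION_TO_SPACE = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)


def clean_name(name, country_words):
    """Steps 1 and 2: drop punctuation, then drop country-name words."""
    words = name.lower().translate(PUNCTUATION_TO_SPACE).split()
    return " ".join(w for w in words if w not in country_words)


def group_into_series(datasets, country_words):
    """Steps 3 and 4: key each dataset by organization + cleaned name."""
    series = defaultdict(list)
    for organization, name in datasets:
        key = f"{organization}|{clean_name(name, country_words)}"
        series[key].append(name)
    return dict(series)


# Hypothetical (organization, dataset name) pairs.
datasets = [
    ("hot", "Kenya Waterways (OpenStreetMap Export)"),
    ("hot", "Nepal Waterways (OpenStreetMap Export)"),
    ("hot", "Nepal Points of Interest (OpenStreetMap Export)"),
]

series = group_into_series(datasets, COUNTRY_WORDS)
for key, members in series.items():
    print(key, len(members))
```

In this sample the two waterways datasets share a key and form one data series, while the points-of-interest dataset remains singular.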

Note that this logic will result in some false positives, especially among the large number of data series with only 2 or 3 datasets. For example, these two datasets have different sources, but because the dataset names are identical (except for location) and both are shared by HDX, they are counted as a data series consisting of two datasets:

https://data.humdata.org/dataset/haiti-covid-19-subnational-cases
https://data.humdata.org/dataset/europe-covid-19-subnational-cases

False negatives are possible, especially in cases where a data series is created manually. For example, the IOM DTM datasets are in many (all?) cases created manually, so differences in naming are likely, resulting in those datasets being treated as separate data series instead of a single data series.

Possible improvements to the process

The script for this was written quickly, and there are several ways to improve it.

  • Use batch_id as a first cut at grouping data series, though this should be tested to be sure that datasets on different themes that are being produced by the same script are not grouped into the same data series.

  • Add additional words to the “country name words” list to be stripped out of the dataset names. A quick browse of the results from above shows that sub-national location identifiers like “center” or “north1” are generating additional data series (mainly for the HOT datasets).

  • Include both dataset name and dataset title in the analysis, which might marginally reduce false positives.
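As a rough illustration of the second improvement, the sketch below compares two of the HOT dataset names from the examples earlier in this document, before and after sub-national words are added to the list of words to strip. The `clean_name` helper, the one-word `COUNTRY_WORDS` set, and the contents of `SUBNATIONAL_WORDS` are hypothetical; only “center” and “north1” come from the observation above.

```python
import string

# Maps every ASCII punctuation character to a space.
PUNCTUATION_TO_SPACE = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)


def clean_name(name, stop_words):
    """Lowercase, drop punctuation, and remove the given stop words."""
    words = name.lower().translate(PUNCTUATION_TO_SPACE).split()
    return " ".join(w for w in words if w not in stop_words)


# Hypothetical word lists for illustration only.
COUNTRY_WORDS = {"bangladesh"}
SUBNATIONAL_WORDS = {"center", "north1", "southwest", "northwest"}

a = "HOTOSM Bangladesh (southwest) Points of Interest (OpenStreetMap Export)"
b = "HOTOSM Bangladesh (northwest) Points of Interest (OpenStreetMap Export)"

# With country words only, the two names still differ.
same_before = clean_name(a, COUNTRY_WORDS) == clean_name(b, COUNTRY_WORDS)

# With sub-national words added, they collapse to the same string.
extended = COUNTRY_WORDS | SUBNATIONAL_WORDS
same_after = clean_name(a, extended) == clean_name(b, extended)

print(same_before, same_after)  # False True
```

A longer word list merges more datasets, so it could also raise the false-positive rate; any added words would need to be checked against the results.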