This document collects procedures and resources to guide the curation of Data Completeness instances (henceforth, Data Grids), which a sysadmin can activate for any location page on HDX. This document, and others linked from it, should evolve to capture best practices and any other useful information learned as the data grid curators do their work.
Once activated for a given location page, the Data Grid appears and is filled using a default recipe (based on tags). However, tags are seldom enough to accurately gauge whether a dataset meets the requirements of a given data grid. Curation, then, is the process of customizing a specific location's data grid so that the datasets included in it meet the defined requirements for each subcategory. That customization is done by editing the recipe YAML file (YAML is a format that is friendly to both humans and machines).
The basic curation process is outlined below:
There may be more on the feature server for testing purposes, but the ones listed below should be the only active ones on the production server.
Country | Production Data Grid | Feature Server Data Grid | Curator(s) | Last check date |
---|---|---|---|---|
Yemen | Production: yem | Feature: yem | Amadu | 26 April 2019 |
Sudan | Production: sdn | Feature: sdn | Meti | |
Indonesia | Production: idn | Feature: idn | Faizal | 26 April 2019 |
Somalia | Production: som | Feature: som | Meti | 26 April 2019 |
Colombia | Production: col | Feature: col | Amadu | |
Philippines | Production: phl | Feature: phl | Amadu | 26 April 2019 |
Afghanistan | Production: afg | Feature: afg | Meti | |
Bangladesh | Production: bgd | Feature: bgd | Faizal | |
Chad | Production: tcd | Feature: tcd | Nafi | |
Mozambique | Production: moz | Feature: moz | Obadah | 26 April 2019 |
Venezuela | Production: ven | Feature: ven | Joseph | |
Democratic Republic of the Congo | Production: cod | Feature: cod | Joseph | |
Central African Republic | Production: caf | Feature: caf | Nafi | |
Myanmar | Production: mmr | Feature: mmr | Obadah | |
Each dataset that is a candidate for the data grid has to be evaluated to determine whether it fully meets the requirements to be included, partially meets them, or does not meet them at all. The outcome determines what actions have to be taken in the YAML file to include or exclude the dataset, and what comments should be recorded so that users understand where the dataset falls short. Below the process diagram, you will find more details on each quality check.
This one is straightforward. National level statistics are by definition excluded from data grid. The extent to which it needs to be subnational is handled further down this list.
Look at the sub-category definition and determine if the definition is met fully, partially, or not at all. If a lack of clear field names and/or a data dictionary make it hard to be sure, the dataset can be excluded or included as partially meeting the requirements.
Note: if complete coverage can be obtained by combining several datasets (for example, several different 3Ws, one for each cluster, with all clusters being covered), then all the datasets can be included but marked as "incomplete" with the same comment. The logic here is that someone should be combining these datasets.
Suggested comment language for a partial fit:
For tabular data, the dataset should be tidy in the sense that field names and data rows should be easy to determine. There shouldn't be subtotal rows interspersed with data rows. For a format like xls or xlsx, the required data for a single sub-category should be on the same tab, and if not, this should be noted in the comments. For tabular data with coordinates, the x and y values (usually longitude and latitude) should be in decimal degree format and separated into two columns, and if not, this should be noted in the comments.
For geographic data, the data should be a zipped shapefile, geojson, or geodatabase. Other somewhat common formats (kml, kmz, but not raster formats) could be accepted, with a comment.
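The coordinate rule above can be spot-checked with a short script. The following is only a minimal sketch: the column names `latitude` and `longitude`, the function names, and the sample rows are assumptions made for illustration, and a real dataset may label its columns differently.

```python
import csv
import io


def is_decimal_degree(value, limit):
    """Return True if value parses as a plain decimal number within +/- limit."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Missing values and strings such as 15°21'N end up here.
        return False
    return -limit <= number <= limit


def check_coordinates(csv_text, lat_col="latitude", lon_col="longitude"):
    """Return the 1-based data row numbers whose coordinates are not decimal degrees."""
    bad_rows = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        lat_ok = is_decimal_degree(row.get(lat_col), 90)
        lon_ok = is_decimal_degree(row.get(lon_col), 180)
        if not (lat_ok and lon_ok):
            bad_rows.append(row_number)
    return bad_rows


# Hypothetical sample: the second data row uses degrees and minutes.
sample = (
    "name,latitude,longitude\n"
    "Clinic A,15.3694,44.1910\n"
    "Clinic B,15°21'N,44°12'E\n"
)
print(check_coordinates(sample))
```

Any row numbers returned are candidates for a comment noting that the coordinates are not in decimal degrees. If the expected columns are absent (for example, both values sit in one combined column), every row is reported.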
Suggested comment language for a partial fit:
This one is trickier than it sounds. There are two ways to assess completeness:
Suggested comment language for a partial fit:
If a dataset contains references to location, are those locations defined in the dataset (such as latitude and longitude columns)? If not, then do p-codes or some other identifier make it possible to join this dataset to a location reference that is available in data grid (such as a COD admin boundary or a facilities list)? Datasets with partially successful joins should be included in data grid as incomplete with a comment.
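One way to judge whether a join is fully or only partially successful is to compare the dataset's location codes against the reference list. The sketch below assumes the codes have already been extracted into plain lists; the function name and the p-codes shown are hypothetical and used only for illustration.

```python
def join_rate(dataset_codes, reference_codes):
    """Return (share of dataset codes found in the reference, sorted unmatched codes)."""
    # Normalise case and whitespace so trivial differences do not count as failures.
    reference = {code.strip().upper() for code in reference_codes}
    cleaned = [code.strip().upper() for code in dataset_codes]
    unmatched = sorted({code for code in cleaned if code not in reference})
    if not cleaned:
        return 0.0, unmatched
    matched = sum(1 for code in cleaned if code in reference)
    return matched / len(cleaned), unmatched


# Hypothetical p-codes, for illustration only.
reference_pcodes = ["YE11", "YE12", "YE13"]
dataset_pcodes = ["YE11", "ye12", "YE99", "YE13"]

rate, missing = join_rate(dataset_pcodes, reference_pcodes)
print(rate, missing)
```

A rate below 1.0 suggests a partially successful join, and the unmatched codes can be quoted in the comment recorded for the dataset.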
Suggested comment language for a partial fit:
If most of the data for the country uses admin level 3, but the dataset in question is only disaggregated to admin level 1, the dataset can be included as "incomplete" with a comment.
Suggested comment language for a partial fit:
Ourairports data relies on user contributions and may not be comprehensive.
OpenStreetMap data relies on user contributions and may not be comprehensive. Dataset does not always contain data about practicability of a road.
Dataset contains data about Interaction member organizations but not other organizations working in the country. Dataset is disaggregated geographically, but inconsistently.
There can be a large number of these datasets, many of them out of date. Only include these datasets if the "date of dataset" is recent (probably within 6 months, but that could vary depending on the nature of the crisis). Generally speaking, these datasets cover only a small area (a single city or town per dataset) and therefore should have this comment:
Dataset covers a limited area.
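The recency rule above can be expressed as a simple date comparison. This is a sketch under stated assumptions: the 183-day threshold approximates "within 6 months" and should be adjusted to the nature of the crisis, and the function name and example dates are illustrative rather than taken from any HDX tool.

```python
from datetime import date, timedelta


def is_recent(dataset_date, today, max_age_days=183):
    """Return True if dataset_date is no more than max_age_days before today."""
    return (today - dataset_date) <= timedelta(days=max_age_days)


# Illustrative dates only.
check_day = date(2019, 4, 26)
print(is_recent(date(2019, 1, 15), check_day))
print(is_recent(date(2018, 6, 1), check_day))
```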
In the normal practice of curation, only the include rules, exclude rules and metadata overrides need to be edited. Metadata overrides can only apply to datasets that are included based on the include rules.
Here's an annotated example for the Baseline Population subcategory:
#BASELINE POPULATION
- name: baseline_population
  title: Baseline Population
  description: Total population aggregated by administrative division.
  rules:
    include: #adds to the data grid any dataset that matches the country of the data grid (yemen in this case), one or more of the specified tags, and is subnational
      - (tags:"population" AND subnational:1)
      - (tags:"population statistics" AND subnational:1)
      - (tags:"demographics" AND subnational:1)
    exclude: # removes from the data grid any datasets matching these tags. Note the last one in this list is not based on tags (see comment in line)
      - (tags:"people in need")
      - (tags:"people affected")
      - (tags:"displaced people")
      - (tags:"interally displaced people")
      - (tags:"hno")
      - (tags:"humanitarian needs overview")
      - (tags:"humanitarian needs overview - hno")
      - (organization:"worldpop" AND title:" - Population" AND res_format:"zipped geotiff") # removes a specific type of worldpop dataset based on the org name,
                                                                                            # resource format, and a pattern in the dataset title
  metadata_overrides: # this section modifies how a particular dataset is displayed in the data grid
    - dataset_name: yemen-admin1-combined-food-insecurity-phase-2017-and-population-estimates-2015-2020 #specifies which dataset to modify the display of. It
                                                                                                         # must already be included from the 'rules' section above, otherwise this is ignored.
      display_state: incomplete # forces the dataset to be displayed with the "hashed" color, not solid blue.
      comments: The dataset is only available in a GIS format, and tabular data is preferred for this subcategory. # These comments will appear when
                                                                                                                   # someone hovers over the dataset in data grid.
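Because an override is ignored unless its dataset is already included by the rules, it can help to list overrides that will have no effect. The sketch below only models that stated behaviour and is not HDX's implementation: the function name is an assumption, the list of included datasets would have to be gathered by the curator, and the second dataset name is hypothetical.

```python
def apply_overrides(included_names, overrides):
    """Split overrides into those that take effect and those that would be ignored."""
    included = set(included_names)
    applied, ignored = {}, []
    for override in overrides:
        name = override["dataset_name"]
        if name in included:
            applied[name] = {
                "display_state": override.get("display_state"),
                "comments": override.get("comments"),
            }
        else:
            # The dataset is not matched by the include rules, so the override does nothing.
            ignored.append(name)
    return applied, ignored


included_datasets = [
    "yemen-admin1-combined-food-insecurity-phase-2017-and-population-estimates-2015-2020",
]
overrides = [
    {
        "dataset_name": "yemen-admin1-combined-food-insecurity-phase-2017-and-population-estimates-2015-2020",
        "display_state": "incomplete",
        "comments": "The dataset is only available in a GIS format.",
    },
    # Hypothetical dataset that no include rule matches.
    {"dataset_name": "hypothetical-dataset-not-in-grid", "display_state": "incomplete"},
]

applied, ignored = apply_overrides(included_datasets, overrides)
print(sorted(applied), ignored)
```

Any name in the ignored list points to an override that needs either a matching include rule or removal from the recipe.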
The YAML file for each country, like the default YAML file from which it is derived, defines a hierarchy:
In the normal practice of curation, only the include rules, exclude rules and metadata overrides need to be edited.