Data Quality Checks process on HDX
Data quality checks are an important aspect of HDX. They are carried out twice daily (morning and evening) on datasets that are new to HDX or that have been changed. Creating a dataset or changing an existing dataset on HDX triggers a notification through the Pushbullet application to an HDX notifications channel. The notification includes the dataset title, a link to the dataset on HDX and an indicator showing which HDX user made the change. Additionally, one can follow the HDX data link on the portal to find newly added datasets.
A candidate dataset for quality checking is normally taken through a series of six checks. If any of the criteria is not satisfied, the dataset is made private and the person who uploaded it is notified by e-mail of the areas that were found to have issues.
For quality checks, a dataset is checked to see if:
1. Dataset contains PII
A check is done to see if the uploaded dataset contains personally identifiable information (PII). This is information such as names of people, e-mail addresses, telephone numbers and any other information that can be used to track the individual or individuals whom the data is about.
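As an illustration only, a first-pass screen for the PII described above could look like the following Python sketch. The function name, the header keywords and the e-mail pattern are assumptions made for this example and are not part of any HDX tooling. Columns flagged this way are candidates for manual review, not confirmed PII, and the sketch does not attempt to detect telephone numbers.

```python
import re

# Assumed pattern for e-mail addresses; deliberately simple.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumed header keywords that hint at personal data.
PII_HEADER_HINTS = ("name", "email", "e-mail", "phone", "telephone")


def flag_possible_pii(rows):
    """Return the sorted column names that may hold PII.

    `rows` is a list of dicts, for example as produced by csv.DictReader.
    A column is flagged if its header contains a hint keyword or if any
    of its values looks like an e-mail address.
    """
    flagged = set()
    for row in rows:
        for column, value in row.items():
            if any(hint in column.lower() for hint in PII_HEADER_HINTS):
                flagged.add(column)
            elif EMAIL_RE.search(str(value)):
                flagged.add(column)
    return sorted(flagged)
```

A header keyword such as "name" will also match harmless columns (for example a place name), which is why the result is only a list of columns for a person to inspect.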
2. Dataset metadata missing or incomplete
A check is done on whether the metadata added to HDX to accompany the dataset is complete. This includes checking that the data sources, the date of the data, the location that the data is about, and the methodology used to collect the data are all indicated.
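The completeness check above can be sketched in Python as follows. The field names ("source", "date", "location", "methodology") are hypothetical keys chosen to mirror the four items listed in the text; they are not taken from the actual HDX metadata schema.

```python
# Hypothetical field names mirroring the four items named in the text.
REQUIRED_FIELDS = ("source", "date", "location", "methodology")


def missing_metadata(metadata):
    """Return the required fields that are absent or empty in `metadata`.

    `metadata` is a plain dict. Empty strings, whitespace-only strings,
    empty lists and None all count as missing.
    """
    missing = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if isinstance(value, str):
            value = value.strip()
        if not value:
            missing.append(field)
    return missing
```

An empty result would mean that all four fields are filled in; any names returned could be quoted in the e-mail sent to the uploader.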
3. Dataset has no resources (files or links) or link is broken
A check is done on whether the created dataset can be reached and downloaded by users. All datasets added to HDX should be downloadable by users, depending on the privacy settings applied to them by the uploading organisation.
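A minimal Python sketch of this reachability check is shown below, using only the standard library. The function names and the choice of a HEAD request are assumptions for illustration. Some servers reject HEAD requests, so a link reported as broken here would still need to be opened by hand before the dataset is made private.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for a HEAD request, or None if unreachable."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404.
        return error.code
    except (urllib.error.URLError, ValueError, OSError):
        # Malformed URL, DNS failure, refused connection or timeout.
        return None


def check_resources(resource_urls, get_status=http_status):
    """Return a list of problems found for a dataset's resource URLs."""
    if not resource_urls:
        return ["dataset has no resources"]
    problems = []
    for url in resource_urls:
        status = get_status(url)
        if status is None or status >= 400:
            problems.append(f"broken link: {url}")
    return problems
```

The `get_status` parameter exists so that the logic can be exercised without network access, by passing in a stand-in function.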
4. Dataset contains non-humanitarian data, test data or sensitive data
The created dataset is checked to see if it is just test data; for example, users trying out the system for the first time might upload a test dataset. A dataset is also checked to see if it contains sensitive data (which could be data on attacks, terrorism or opinions) or any other inappropriate or otherwise objectionable content.
5. Unrelated gallery item or outdated gallery item
Some datasets are uploaded to HDX together with gallery items. Gallery items are supposed to give a summary or an at-a-glance pictorial view of what the data is about. A check is done on the accompanying gallery item to ensure that it matches the uploaded data and that it is up to date.
6. Data is not of the right format
Datasets uploaded to HDX should be in an easily and universally accessible and usable format. This is mostly comma-separated values (CSV) or Excel files. Geographical data uploaded to HDX should be in shapefile or GeoJSON format.
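As a rough illustration, the format check could start from the file extension of each resource, as in the Python sketch below. The mapping from the formats named above to specific extensions is an assumption made for this example, and an extension alone does not prove that a file is valid. Archives such as zipped shapefiles are not recognised by this sketch and would need to be opened and inspected.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed extensions for CSV, Excel, shapefile and GeoJSON resources.
ACCEPTED_EXTENSIONS = {".csv", ".xls", ".xlsx", ".shp", ".geojson"}


def has_accepted_format(resource_url):
    """Return True if the URL path ends in one of the accepted extensions."""
    path = urlparse(resource_url).path  # drops any query string or fragment
    return PurePosixPath(path).suffix.lower() in ACCEPTED_EXTENSIONS
```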