Review of Scrapers
Introduction
Scrapers collect data from websites and push it into HDX, or, in the case of the DAP scrapers, into CPS as indicators and from there into HDX. The data is extracted programmatically from spreadsheets or HTML. The status of the DAP scrapers is reviewed separately; this document considers the others.
Current Status on ScraperWiki
Name | Last run | Created |
---|---|---|
DataStore: Mali Topline Figures | a few seconds ago | a year ago |
DataStore: WFP Topline Figures | a few seconds ago | a year ago |
DataStore: Colombia Topline Figures | a few seconds ago | a year ago |
Guinea 3W Validation + DataStore | a few seconds ago | a year ago |
Nepal Earthquake Topline Figures | a few seconds ago | a year ago |
Somalia NGO Topline Figures | a few seconds ago | a year ago |
OCHA Afghanistan Topline Figures | a few seconds ago | 9 months ago |
UNOSAT Topline Figures | a few seconds ago | 9 months ago |
DataStore: UNHCR Topline Figures | a few seconds ago | 10 months ago |
Alert: HDX Dataset Revision | a few seconds ago | a year ago |
DataStore: Topline Ebola Outbreak Figures | a few seconds ago | a year ago |
Fiji Topline Figures | 4 minutes ago | a month ago |
Ebola Data Lister | 21 minutes ago | a year ago |
FTS Ebola | 23 minutes ago | a year ago |
FTS Ebola Coverage | 23 minutes ago | a year ago |
HDX Source Count Tracker | an hour ago | 6 months ago |
HDX ebola-cases-2014 Data for Visualisation | an hour ago | a year ago |
HDX Registered Users Stats | an hour ago | 9 months ago |
HDX Management Statistics | 3 hours ago | a year ago |
OpenNepal Scraper | 5 hours ago | 6 months ago |
Dragon - FTS | 6 hours ago | 2 years ago |
UNOSAT Product Scraper | 7 hours ago | a year ago |
Dragon - ReliefWeb R | 7 hours ago | 2 years ago |
FTS Emergency Collector | 7 hours ago | a year ago |
FTS Appeals Collector | 7 hours ago | a year ago |
OCHA CERF Collector | 7 hours ago | 5 months ago |
WHO GAR | 7 hours ago | a year ago |
UNHCR Real-time API | 7 hours ago | 2 years ago |
HDX Repo Analytics | 7 hours ago | 2 years ago |
WHO Ebola SitRep Scraper | 19 hours ago | a year ago |
ACLED Africa Collector | a day ago | 15 days ago |
WFP mVAM Collector | 2 days ago | 2 months ago |
UN Iraq Casualty Figures | 2 days ago | a year ago |
IDMC Global Figures | 2 days ago | 2 years ago |
UNOSAT Flood Portal (shows as successful in ScraperWiki but runs with errors) | 6 days ago | 9 months ago |
UPS blog posts | a month ago | a month ago |
IFPRI Dataverse Collector | 2 months ago | 4 months ago |
World Bank Climate Collector | 5 months ago | 5 months ago |
UNDP Climate Collector | 5 months ago | 5 months ago |
FAO Collector | 5 months ago | 6 months ago |
HDRO Collector | 5 months ago | 7 months ago |
WHO Health | 5 months ago | 5 months ago |
UNHCR Mediterranean Collector | 6 months ago | 6 months ago |
WFP VAM API Scraper | 7 months ago | a year ago |
OCHA ORS Scraper | 7 months ago | a year ago |
Datastore: Feature Organization Page | 7 months ago | 7 months ago |
FTS: Nepal Earthquake Coverage | 8 months ago | a year ago |
FTS Collector | 9 months ago | 9 months ago |
OCHA Syria Key Humanitarian Figures | 9 months ago | a year ago |
UN Habitat | 9 months ago | 9 months ago |
CAP Appeals Scraper | 9 months ago | 2 years ago |
CPSer | 9 months ago | 2 years ago |
WFP Food Prices (full) | 9 months ago | a year ago |
NGO AidMap Ebola Projects | 10 months ago | a year ago |
ReliefWeb Scraper | 10 months ago | 2 years ago |
HealthMap Ebola News (geo) | 10 months ago | 2 years ago |
Colombia SIDIH | 11 months ago | a year ago |
DataStore: Nutrition SMART Survey | a year ago | a year ago |
OCHA ROWCA ORS | a year ago | 2 years ago |
DataStore: Colombia Page Indicators | a year ago | a year ago |
HDX ebola-cases-2014 dataset updater | a year ago | a year ago |
NCDC / NOAA Precipitation Collector | Never | 4 months ago |
WorldPop Collector | Never | 3 months ago |
Scraper: Violations Documentation Center | Never | 7 months ago |
Status (colour key for the table above) | Number of Scrapers |
---|---|
Working | 32 |
Working but not run for > 4 months | 22 |
Failing | 7 |
Not run | 3 |
Notes:
Scrapers prefixed with “Dragon -” were created by Dragon; all others are attributed to “UN OCHA”, so identifying the specific author requires finding and then reading the source code.
Working but not in HDX
A number of FTS scrapers work but their output cannot be seen in HDX. FTS refers to the OCHA Financial Tracking Service. It is "a centralized source of curated, continuously updated, fully downloadable data and information on humanitarian funding flows."
The FTS Appeals and FTS Emergency collectors are working, but it is not obvious where their output goes. There are similarly named datasets in HDX but they do not have any resources - see https://data.hdx.rwlabs.org/dataset/fts-appeals and https://data.hdx.rwlabs.org/dataset/fts-emergencies. FTS Appeals in fact outputs to a number of datasets whose names end in “humanitarian contributions”, e.g. Senegal 2015 humanitarian contributions, according to https://docs.google.com/spreadsheets/d/1RdHFRn3s8uxRfbmJ6aZCghpt85mFk3aR8eDZAgwLRKs/edit?usp=sharing. This is not readily apparent from the code. Since these collectors are coded in R and rely on CPS, they should be rewritten in Python and their dependence on CPS removed.
The FTS Ebola and FTS Ebola Coverage scrapers are working but rely on CPS to get data to HDX. These collectors are written in R and should be recoded in Python without the reliance on CPS. FTS Ebola Coverage (https://data.hdx.rwlabs.org/dataset/fts-ebola-coverage) was last updated in HDX four months ago. It is not clear which dataset FTS Ebola is updating - from the code there should be one called "FTS Ebola Indicators" but it does not come up in a search on HDX.
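As an illustration of what removing the CPS dependency could look like, here is a minimal sketch of pushing a scraped CSV straight to an HDX dataset through the CKAN API using the ckanapi library. The dataset name, resource name, file path and API key are placeholders, and it assumes the HDX CKAN instance supports resource_patch; this is not the existing FTS code.

```python
# Minimal sketch: push a scraped CSV straight to HDX via the CKAN API instead
# of going through CPS. The dataset name, resource name, file path and API key
# are placeholders, not the real FTS configuration.
from ckanapi import RemoteCKAN

HDX_SITE = "https://data.hdx.rwlabs.org"
API_KEY = "REPLACE-WITH-HDX-API-KEY"  # per-user key from the HDX user profile

def update_resource(dataset_name, resource_name, csv_path):
    ckan = RemoteCKAN(HDX_SITE, apikey=API_KEY)
    dataset = ckan.action.package_show(id=dataset_name)
    # Reuse an existing resource with this name if one is already attached.
    matches = [r for r in dataset["resources"] if r["name"] == resource_name]
    with open(csv_path, "rb") as upload:
        if matches:
            ckan.action.resource_patch(id=matches[0]["id"], upload=upload)
        else:
            ckan.action.resource_create(package_id=dataset["id"],
                                        name=resource_name,
                                        format="CSV",
                                        upload=upload)

if __name__ == "__main__":
    update_resource("fts-appeals", "FTS appeals data", "fts_appeals.csv")
```

Writing the collectors against the CKAN API in this way would also make the scraper-to-dataset link explicit in the code.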
Working but not run recently
It is not easy to find out the frequency with which scrapers are supposed to run, as this information is not in the ScraperWiki GUI. The frequency tells us whether a scraper should have run in a given period. For example, if a scraper has not run for 5 months but its frequency is annual, it is probably fine, whereas if it has not run for a week and its frequency is daily, it is broken.
The table above uses the arbitrary cutoff of 4 months to try to point to scrapers that should be examined in more detail.
UNHCR Mediterranean Collector
The UNHCR Mediterranean collector had not run for 6 months. Connecting by ssh to the machine where it resides and running it manually worked fine, and afterwards it displayed as recently run on ScraperWiki.
The reason the scraper had not run is that its scheduling was broken, and the same may be true of others. As there was no cron job, it probably tries to use a Python scheduler, but experiments with this approach suggest that such a scheduler is killed on logging out. To fix the Mediterranean collector's scheduling, setting up a crontab entry is likely all that is required.
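For reference, a crontab entry along the following lines would run the collector daily and keep a log for debugging; the schedule and paths are illustrative placeholders, not taken from the actual box.

```
# Illustrative crontab entry (edit with `crontab -e`); paths are placeholders.
# Run the collector daily at 02:00 and append stdout/stderr to a log file.
0 2 * * * /bin/bash /home/<user>/tool/bin/run.sh >> /home/<user>/tool/run.log 2>&1
```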
The scraper takes data from http://data.unhcr.org/api/stats/mediterranean/monthly_arrivals_by_country.json, puts it into a ScraperWiki sqlite database and relies on CPS to get it into HDX. It is not clear which dataset is being updated in HDX; this would be clearer if the scraper were rewritten to use the CKAN API to HDX directly.
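To make the data flow explicit, below is a minimal sketch of the fetch step rewritten in Python: it pulls the monthly arrivals JSON and writes a CSV that could then be pushed to HDX with a CKAN API call like the one sketched above, bypassing the ScraperWiki database and CPS. It assumes the endpoint returns a list of flat records; the real response structure would need to be checked.

```python
# Minimal sketch: fetch the UNHCR monthly arrivals JSON and write it to a CSV
# ready for upload to HDX (e.g. via the CKAN API sketch shown earlier).
# Assumes the API returns a list of flat records; column names are taken from
# the keys of the first record rather than hard-coded.
import csv
import requests

URL = ("http://data.unhcr.org/api/stats/mediterranean/"
       "monthly_arrivals_by_country.json")

def fetch_to_csv(csv_path="monthly_arrivals_by_country.csv"):
    response = requests.get(URL, timeout=60)
    response.raise_for_status()  # fail loudly rather than silently
    records = response.json()
    if not records:
        raise RuntimeError("UNHCR API returned no records")
    fieldnames = list(records[0].keys())
    with open(csv_path, "w") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames,
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
    return csv_path

if __name__ == "__main__":
    print(fetch_to_csv())
```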
FTS Nepal Earthquake Collector
The FTS Nepal Earthquake collector has not run for eight months but since it was designed to collect data around a specific crisis, this is most likely not a problem. The dataset in HDX is here: https://data.hdx.rwlabs.org/dataset/response-plan-coverage-nepal-earthquake
FTS Collector
The FTS collector has not run for nine months. It does not have any scheduling set up and the datasets in HDX do not have any resources (https://data.hdx.rwlabs.org/dataset/fts-clusters, https://data.hdx.rwlabs.org/dataset/fts-appeals and https://data.hdx.rwlabs.org/dataset/fts-emergencies). It was likely designed to replace the two scrapers written in R with the same output datasets in HDX. It resides in a GitHub repository in Reuben's public area and should be moved to a central well known location if it is to be maintained (assuming it works).
Failed Scrapers
UNHCR Real-time API
This failed with the error: “Error in library(sqldf) : there is no package called ‘sqldf’”
Sqldf is an R library that appears to be missing. The code for this scraper is written in R and placed directly in ScraperWiki (“code in your browser”). It would be advisable to rewrite it in Python, since supporting multiple languages is a maintenance headache, and to make it standalone (“code your own tool”) in ScraperWiki.
HDX Repo Analytics
This scraper has the same issue as the UNHCR Real-Time API.
UNOSAT Product Scraper
This scraper failed because one of the URLs that it scrapes from a webpage was a dead link and it has no check for this possibility. The fault was difficult to find because the scraper is written in R with limited logging and error handling. After fixing it, although the scraper claims to have run successfully, the datasets on HDX seem to remain unchanged, hopefully because the data is assessed as unmodified; this requires further study. The metadata modification date is generally 24/11/2015, but a few datasets are newer (from Luis's spreadsheet). Ultimately, the scraper should be rewritten in Python with detailed logging and error handling.
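As an illustration of the kind of defensive check and logging a Python rewrite could include, the sketch below skips and logs dead links instead of failing on them; the function names are hypothetical and not taken from the existing scraper.

```python
# Illustrative sketch of defensive link handling for a Python rewrite: check
# each scraped URL before using it, log the dead ones and carry on instead of
# aborting the whole run. Function names are hypothetical.
import logging
import requests

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("unosat-products")

def url_is_alive(url, timeout=30):
    """Return True if the URL responds with a non-error status."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=timeout)
        if response.status_code == 405:  # some servers reject HEAD requests
            response = requests.get(url, stream=True, timeout=timeout)
        return response.status_code < 400
    except requests.RequestException as err:
        logger.warning("Request for %s failed: %s", url, err)
        return False

def filter_dead_links(urls):
    """Split scraped URLs into live and dead ones, logging the dead ones."""
    live = []
    for url in urls:
        if url_is_alive(url):
            live.append(url)
        else:
            logger.error("Dead link skipped: %s", url)
    return live
```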
UNOSAT Flood Portal
Running this scraper produced this output:
lgfoslr@cobalt-u:~$ bash tool/bin/run.sh
→ Cleaning table `unprocessed_data`.
→ Table `unprocessed_data` cleaned successfully.
ERROR: Failed to store record in database.
(sqlite3.OperationalError) database is locked [SQL: "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"]
→ Processing dates.
→ Identifying countries.
→ Identifying file extension.
→ Cleaning title.
→ Cleaning table `processed_data`.
ERROR: Failed to clean table `processed_data`.
(sqlite3.OperationalError) database is locked [SQL: "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"]
ERROR: Failed to store record in database.
(sqlite3.OperationalError) database is locked [SQL: "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"]
SUCCESS: Successfully patched 1485 records.
SUCCESS: Successfully fetched 1485 records from the UNOSAT Flood Portal.
→ Exporting Datasets JSON to disk.
(sqlite3.OperationalError) database is locked [SQL: "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"]
Although it shows as successful on ScraperWiki, given these errors it is not clear whether it is working, and further analysis is required, including checking the metadata modification dates of the corresponding datasets on HDX.
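The “database is locked” errors point to concurrent access to the sqlite file or a connection holding a lock open. If the tool keeps sqlite, one possible mitigation, sketched below under the assumption that it uses SQLAlchemy (as the error format suggests), is to share a single engine with a generous busy timeout and group writes into one transaction; whether this fits the tool's actual structure would need to be checked.

```python
# Hedged sketch of one way to avoid "database is locked": a single shared
# SQLAlchemy engine with a generous sqlite busy timeout, with writes grouped
# into one transaction. Table and column names are illustrative only.
from sqlalchemy import create_engine, text

# timeout: seconds sqlite waits for a lock before raising OperationalError.
engine = create_engine("sqlite:///data.sqlite", connect_args={"timeout": 60})

def store_records(records):
    """records: a list of dicts with 'title' and 'url' keys (illustrative)."""
    with engine.begin() as conn:  # one connection, one transaction
        conn.execute(text(
            "CREATE TABLE IF NOT EXISTS processed_data (title TEXT, url TEXT)"))
        conn.execute(
            text("INSERT INTO processed_data (title, url) VALUES (:title, :url)"),
            records)
```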
IFPRI Dataverse Collector
When run, this scraper produces the error: “Service Temporarily Unavailable. The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.”
It tries to connect to this URL: https://dataverse.harvard.edu/api/dataverses/IFPRI/contents, which fails in the same way in a browser.
This seems to be a problem with the website and needs to be followed up with the maintainer.
WFP VAM API Scraper
This scraper runs for several hours and then fails. It posts a form to a number of URLs of the form http://reporting.vam.wfp.org/API/Get_CSI.aspx along with a set of parameters, and it is likely that one of these requests is timing out. The script is quite complicated, so it will take some time to debug.
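Pinning down which request is failing would be easier if each POST carried an explicit timeout, retries and logging; a sketch of that pattern is shown below. The endpoint and payload are placeholders, not the scraper's real configuration.

```python
# Illustrative pattern for the VAM requests: explicit timeout, a few retries
# and logging of exactly which URL and attempt failed. The endpoint and
# payload are placeholders, not the scraper's real configuration.
import logging
import time
import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("wfp-vam")

def post_with_retry(url, data, timeout=120, retries=3, backoff=30):
    for attempt in range(1, retries + 1):
        try:
            response = requests.post(url, data=data, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException as err:
            logger.warning("POST %s (attempt %d/%d) failed: %s",
                           url, attempt, retries, err)
            if attempt == retries:
                raise
            time.sleep(backoff * attempt)

# Usage sketch with a placeholder payload:
# response = post_with_retry("http://reporting.vam.wfp.org/API/Get_CSI.aspx",
#                            data={"placeholder": "value"})
```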
Overall Condition of Scrapers
Looking at the scrapers as a whole rather than individually, an assessment can be made of their general health. The unfortunate conclusion is that they are not in a good state, even those which are currently running regularly. The DAP scraper review, written before the recent departure of a long-serving data scientist, already warned that there is “significant key man risk as knowledge about them is not distributed and they are built and looked after in varying ways by different authors. As the number of scrapers grows, it will be difficult to maintain and if there are staff changes, this will present serious challenges.” That is the situation we are in now.
Because the collectors have not been built with maintenance and support in mind, a great deal of time must be spent understanding each one. There is virtually no documentation, nor do they follow a common approach or standard. They have not been coded in a consistent way, or even in one language, particularly the older ones. Where code appears at first glance similar, closer inspection reveals that it has been copied, pasted and modified. There are no example web pages or files to indicate what each script is built to parse, and mostly no tests. The collectors often do not log useful output for debugging, and when they fail there is frequently no notification, or only a cryptic error message.
Compounding these problems are infrastructural ones. Having a separate GitHub repository for each scraper is good practice, but it makes reviewing many scrapers at once laborious. Identifying the repository corresponding to a collector in ScraperWiki involves a manual search by name and is complicated by the fact that the repository could be in someone's personal space, even that of an individual who has left the team.
A major headache is that there is no link to go from a data collector in ScraperWiki to its corresponding dataset in HDX. This means that one must search on HDX by name for a dataset (which is unreliable) or look it up in the scraper’s source code. This becomes even more opaque once CPS enters the equation - it seems to be very difficult to see the flow of data through CPS in its user interface.
The update frequency of a dataset in HDX is not tied to the schedule of its scraper in ScraperWiki, which is set up using crontab. For example, a collector may run every week in ScraperWiki while its corresponding dataset in HDX has update frequency “never” (particularly if CPS sits in between), so determining whether data in HDX is current is not straightforward: the same data may be being downloaded and then not updated in HDX. Whether CPS refreshes datasets in HDX when the files have not changed is not clear, and it needs to be confirmed that the activity stream in HDX always shows when public datasets are updated. Some scrapers are run manually on demand, e.g. WFP food prices, and those should be documented.
Next Steps
As was recommended for the DAP scrapers, those that are not used, or whose data is judged unimportant, should be deleted so that it is clear what needs to be properly maintained and tested.
We need a policy (if there is not one already) on how long a dataset stays private and/or how long a scraper goes without running (e.g. because it was for a one-off event) before we delete it from ScraperWiki.
Scrapers need to be raised to a common standard covering style and approach, documentation, acceptance tests with data snapshots, recording the last known good data on each successful scrape, and use of a common library.
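As one illustration of the “last known good data” point, a shared helper along the following lines (entirely hypothetical, not an existing library) could be called by every scraper at the end of a successful run to snapshot its output and record a status file:

```python
# Hypothetical shared helper illustrating "record last known good data on each
# successful scrape": copy the scraper's output aside with a timestamp and
# write a small status file, so failures can be diagnosed and data rolled back.
import json
import shutil
from datetime import datetime
from pathlib import Path

SNAPSHOT_DIR = Path("last_known_good")  # placeholder location

def record_last_known_good(scraper_name, output_path):
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    stamp = datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    snapshot = SNAPSHOT_DIR / ("%s_%s_%s" % (scraper_name, stamp,
                                             Path(output_path).name))
    shutil.copy2(output_path, snapshot)
    status = {"scraper": scraper_name, "run": stamp, "snapshot": str(snapshot)}
    with open(SNAPSHOT_DIR / ("%s_status.json" % scraper_name), "w") as f:
        json.dump(status, f, indent=2)
    return snapshot
```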
The link from each scraper in ScraperWiki to its dataset in HDX, possibly via CPS, needs to be fully enumerated; this will likely require database access, as the user interfaces are not sufficient to extract this information.