Review of Scrapers

Introduction

Scrapers collect data from websites and push it into HDX or, in the case of the DAP scrapers, into CPS as indicators and from there into HDX. The data is extracted programmatically from spreadsheets or HTML. The status of the DAP scrapers is reviewed here. This document considers the others.


Current Status on ScraperWiki

 

Name | Last run | Created
DataStore: Mali Topline Figures | a few seconds ago | a year ago
DataStore: WFP Topline Figures | a few seconds ago | a year ago
DataStore: Colombia Topline Figures | a few seconds ago | a year ago
Guinea 3W Validation + DataStore | a few seconds ago | a year ago
Nepal Earthquake Topline Figures | a few seconds ago | a year ago
Somalia NGO Topline Figures | a few seconds ago | a year ago
OCHA Afghanistan Topline Figures | a few seconds ago | 9 months ago
UNOSAT Topline Figures | a few seconds ago | 9 months ago
DataStore: UNHCR Topline Figures | a few seconds ago | 10 months ago
Alert: HDX Dataset Revision | a few seconds ago | a year ago
DataStore: Topline Ebola Outbreak Figures | a few seconds ago | a year ago
Fiji Topline Figures | 4 minutes ago | a month ago
Ebola Data Lister | 21 minutes ago | a year ago
FTS Ebola | 23 minutes ago | a year ago
FTS Ebola Coverage | 23 minutes ago | a year ago
HDX Source Count Tracker | an hour ago | 6 months ago
HDX ebola-cases-2014 Data for Visualisation | an hour ago | a year ago
HDX Registered Users Stats | an hour ago | 9 months ago
HDX Management Statistics | 3 hours ago | a year ago
OpenNepal Scraper | 5 hours ago | 6 months ago
Dragon - FTS | 6 hours ago | 2 years ago
UNOSAT Product Scraper | 7 hours ago | a year ago
Dragon - ReliefWeb R | 7 hours ago | 2 years ago
FTS Emergency Collector | 7 hours ago | a year ago
FTS Appeals Collector | 7 hours ago | a year ago
OCHA CERF Collector | 7 hours ago | 5 months ago
WHO GAR | 7 hours ago | a year ago
UNHCR Real-time API | 7 hours ago | 2 years ago
HDX Repo Analytics | 7 hours ago | 2 years ago
WHO Ebola SitRep Scraper | 19 hours ago | a year ago
ACLED Africa Collector | a day ago | 15 days ago
WFP mVAM Collector | 2 days ago | 2 months ago
UN Iraq Casualty Figures | 2 days ago | a year ago
IDMC Global Figures | 2 days ago | 2 years ago
UNOSAT Flood Portal (shows as successful in ScraperWiki but runs with errors) | 6 days ago | 9 months ago
UPS blog posts | a month ago | a month ago
IFPRI Dataverse Collector | 2 months ago | 4 months ago
World Bank Climate Collector | 5 months ago | 5 months ago
UNDP Climate Collector | 5 months ago | 5 months ago
FAO Collector | 5 months ago | 6 months ago
HDRO Collector | 5 months ago | 7 months ago
WHO Health | 5 months ago | 5 months ago
UNHCR Mediterranean Collector | 6 months ago | 6 months ago
WFP VAM API Scraper | 7 months ago | a year ago
OCHA ORS Scraper | 7 months ago | a year ago
Datastore: Feature Organization Page | 7 months ago | 7 months ago
FTS: Nepal Earthquake Coverage | 8 months ago | a year ago
FTS Collector | 9 months ago | 9 months ago
OCHA Syria Key Humanitarian Figures | 9 months ago | a year ago
UN Habitat | 9 months ago | 9 months ago
CAP Appeals Scraper | 9 months ago | 2 years ago
CPSer | 9 months ago | 2 years ago
WFP Food Prices (full) | 9 months ago | a year ago
NGO AidMap Ebola Projects | 10 months ago | a year ago
ReliefWeb Scraper | 10 months ago | 2 years ago
HealthMap Ebola News (geo) | 10 months ago | 2 years ago
Colombia SIDIH | 11 months ago | a year ago
DataStore: Nutrition SMART Survey | a year ago | a year ago
OCHA ROWCA ORS | a year ago | 2 years ago
DataStore: Colombia Page Indicators | a year ago | a year ago
HDX ebola-cases-2014 dataset updater | a year ago | a year ago
NCDC / NOAA Precipitation Collector | Never | 4 months ago
WorldPop Collector | Never | 3 months ago
Scraper: Violations Documentation Center | Never | 7 months ago

 


 

Status (Colour is the Key for above Table) | Number of Scrapers
Working | 32
Working but not run for > 4 months | 22
Failing | 7
Not run | 3

 


Notes:


Scrapers with “Dragon -” as a prefix were created by Dragon; all others were created by “UN OCHA”. Finding out specifically who wrote each of these requires locating and then reading the source code.

Working but not in HDX

A number of FTS scrapers work but their output cannot be seen in HDX. FTS refers to the OCHA Financial Tracking Service. It is "a centralized source of curated, continuously updated, fully downloadable data and information on humanitarian funding flows."


The FTS Appeals and FTS Emergency collectors are working, but it is not obvious where they output. There are similarly named datasets in HDX but they do not have any resources: see https://data.hdx.rwlabs.org/dataset/fts-appeals and https://data.hdx.rwlabs.org/dataset/fts-emergencies. FTS Appeals in fact outputs to a number of datasets ending “humanitarian contributions”, e.g. Senegal 2015 humanitarian contributions, according to https://docs.google.com/spreadsheets/d/1RdHFRn3s8uxRfbmJ6aZCghpt85mFk3aR8eDZAgwLRKs/edit?usp=sharing. This is not readily apparent in the code. Since these collectors are coded in R and rely on CPS, they should be rewritten in Python and their dependence on CPS removed.

The FTS Ebola and FTS Ebola Coverage scrapers are working but rely on CPS to get data to HDX. These collectors are written in R and should be recoded in Python without the reliance on CPS. FTS Ebola Coverage (https://data.hdx.rwlabs.org/dataset/fts-ebola-coverage) was last updated in HDX four months ago. It is not clear which dataset FTS Ebola is updating - from the code there should be one called "FTS Ebola Indicators" but it does not come up in a search on HDX.

Working but not run recently

It is not easy to find out how often each scraper is supposed to run, as this information is not in the ScraperWiki GUI. The intended frequency tells us whether a scraper should have run in a given period. For example, if a scraper has not run for 5 months but its frequency is annual, it is probably OK, whereas if it has not run for a week and its frequency is daily, it is broken.
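The reasoning above can be sketched in a few lines of Python. This is only an illustration: the frequency labels, the `MAX_GAP` mapping and the `grace` factor are assumptions, since the intended frequency of each scraper is not currently recorded anywhere.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a declared frequency to the longest expected gap
# between runs. None of these labels come from ScraperWiki itself.
MAX_GAP = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=31),
    "annual": timedelta(days=366),
}


def is_overdue(last_run, frequency, now, grace=1.5):
    """Return True if the scraper has gone longer than expected without running.

    `grace` (an assumed tolerance) stretches the allowed gap so that a run
    which is merely a little late is not reported as broken.
    """
    allowed = MAX_GAP[frequency] * grace
    return now - last_run > allowed


now = datetime(2016, 3, 1)
print(is_overdue(now - timedelta(days=150), "annual", now))  # not run for 5 months
print(is_overdue(now - timedelta(days=7), "daily", now))     # not run for a week
```

With a recorded frequency per scraper, a check like this could replace the arbitrary cutoff with one that depends on each scraper's schedule.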


The table above uses the arbitrary cutoff of 4 months to try to point to scrapers that should be examined in more detail.


UNHCR Mediterranean collector

The UNHCR Mediterranean collector had not run for 6 months. Connecting to the machine where it resides by ssh and running it worked fine, and afterwards it displayed as recently run on ScraperWiki.


The scraper had not run because its scheduling was broken, and the same may be true of others. As there was no cron job, it probably tries to use a Python scheduler, but experiments with this approach suggest that the scheduler is killed on logging out. Setting up crontab is likely all that is required to fix the Mediterranean collector's scheduling.


The scraper takes its data from http://data.unhcr.org/api/stats/mediterranean/monthly_arrivals_by_country.json, puts it into a ScraperWiki sqlite database and relies on CPS to get it into HDX. It is not clear which dataset is being updated in HDX. This would be clearer if the scraper were rewritten to use the CKAN API to HDX.
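As a rough illustration of what using the CKAN API directly could look like, the sketch below builds (but does not send) a request to CKAN's `resource_update` action. The resource id, data URL and API key are placeholders, and whether `resource_update` is the most suitable action for HDX's CKAN version is an assumption that would need checking.

```python
import json
import urllib.request

# Site URL taken from the dataset links in this review.
HDX_URL = "https://data.hdx.rwlabs.org"


def build_resource_update(resource_id, data_url, api_key, base_url=HDX_URL):
    """Build a CKAN Action API request that points a resource at new data.

    All arguments are placeholders supplied by the caller; nothing is sent
    until the request is passed to urllib.request.urlopen().
    """
    payload = json.dumps({"id": resource_id, "url": data_url}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/3/action/resource_update",
        data=payload,
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_resource_update(
    "hypothetical-resource-id", "http://example.org/arrivals.csv", "MY-API-KEY"
)
print(request.get_method(), request.full_url)
```

Because the target resource is named explicitly in the scraper, the dataset being updated is visible in the code rather than hidden behind CPS.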


FTS Nepal Earthquake Collector


The FTS Nepal Earthquake collector has not run for eight months but since it was designed to collect data around a specific crisis, this is most likely not a problem. The dataset in HDX is here: https://data.hdx.rwlabs.org/dataset/response-plan-coverage-nepal-earthquake

FTS Collector


The FTS collector has not run for nine months. It does not have any scheduling set up and the datasets in HDX do not have any resources (https://data.hdx.rwlabs.org/dataset/fts-clusters, https://data.hdx.rwlabs.org/dataset/fts-appeals and https://data.hdx.rwlabs.org/dataset/fts-emergencies). It was likely designed to replace the two scrapers written in R, which have the same output datasets in HDX. It resides in a GitHub repository in Reuben's public area and should be moved to a central, well-known location if it is to be maintained (assuming it works).

Failed Scrapers

UNHCR Real-time API


This failed with the error: “Error in library(sqldf) : there is no package called ‘sqldf’”


sqldf is an R library that appears to be missing. The code for this scraper is written in R and put directly in ScraperWiki (“code in your browser”). It would be advisable to rewrite it in Python, as supporting multiple languages is a maintenance headache, and to make it standalone (“code your own tool”) in ScraperWiki.


HDX Repo Analytics


This scraper has the same issue as the UNHCR Real-Time API.


UNOSAT Product Scraper

This scraper failed because one of the URLs that it scrapes from a webpage was a dead link and it has no check for this possibility. The fault was difficult to find because the scraper is written in R with limited logging and error handling. After the fix, the scraper claims to have run successfully, but the datasets on HDX seem to remain unchanged, hopefully because the data is assessed as unmodified. This requires further study. The metadata modification date is generally 24/11/2015 but a few datasets are newer (according to Luis's spreadsheet). Ultimately, the scraper should be rewritten in Python with detailed logging and error handling.
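A minimal sketch of the kind of check that was missing, written in Python as the suggested target language. The logger name and function names are illustrative and are not taken from the existing R code.

```python
import logging
import urllib.request

logger = logging.getLogger("unosat_product_scraper")  # hypothetical logger name


def fetch(url, timeout=30):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch):
    """Fetch every URL, logging and skipping dead links instead of aborting.

    urllib's URLError and HTTPError, as well as socket timeouts, are all
    subclasses of OSError, so catching OSError covers them.
    """
    results, failed = {}, []
    for url in urls:
        try:
            results[url] = fetcher(url)
        except OSError as exc:
            logger.warning("Skipping dead link %s: %s", url, exc)
            failed.append(url)
    return results, failed
```

Returning the list of failed URLs alongside the results lets the caller decide whether a few dead links are tolerable or whether the run should be reported as failed.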


UNOSAT Flood Portal

Running this scraper produced this output:

lgfoslr@cobalt-u:~$ bash tool/bin/run.sh

→ Cleaning table `unprocessed_data`.

→ Table `unprocessed_data` cleaned successfully.


ERROR: Failed to store record in database.

(sqlite3.OperationalError) database is locked [SQL: "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"]

→ Processing dates.

→ Identifying countries.

→ Identifying file extension.

→ Cleaning title.

→ Cleaning table `processed_data`.

ERROR: Failed to clean table `processed_data`.

(sqlite3.OperationalError) database is locked [SQL: "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"]

ERROR: Failed to store record in database.

(sqlite3.OperationalError) database is locked [SQL: "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"]

SUCCESS: Successfully patched 1485 records.

SUCCESS: Successfully fetched 1485 records from the UNOSAT Flood Portal.


→ Exporting Datasets JSON to disk.

(sqlite3.OperationalError) database is locked [SQL: "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"]


Although it shows as successful on ScraperWiki, given these errors, it is not clear if it is working and further analysis is required. The metadata modification dates
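SQLite reports “database is locked” when another connection holds a lock on the file for longer than the busy timeout. Whether that comes from an overlapping scheduled run or from a connection left open inside the scraper has not been established here. The sketch below shows one common mitigation, assuming the scraper could be changed to use short-lived connections with an explicit timeout. The table name `processed_data` comes from the output above, while the column names are invented for illustration.

```python
import sqlite3
from contextlib import closing


def store_records(db_path, records, timeout=30.0):
    """Write (id, title) pairs and return the resulting row count.

    `timeout` is how long SQLite waits for a competing lock to clear before
    raising "database is locked". The connection is always closed on exit,
    so it cannot itself block a later step.
    """
    with closing(sqlite3.connect(db_path, timeout=timeout)) as conn:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS processed_data "
                "(id TEXT PRIMARY KEY, title TEXT)"  # hypothetical columns
            )
            conn.executemany(
                "INSERT OR REPLACE INTO processed_data (id, title) VALUES (?, ?)",
                records,
            )
        return conn.execute("SELECT COUNT(*) FROM processed_data").fetchone()[0]
```

If two runs of the scraper can overlap, a longer timeout only hides the problem, so the scheduling would also need to be checked.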


IFPRI Dataverse Collector


When running this scraper, the error it produces is: “Service Temporarily Unavailable. The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.”


It tries to connect to this URL: https://dataverse.harvard.edu/api/dataverses/IFPRI/contents, which fails similarly in a browser.


This seems to be a problem with the website and needs to be followed up with the maintainer.


WFP VAM API Scraper


This scraper runs for several hours and then fails. It posts a form, along with a set of parameters, to a number of URLs of this form: http://reporting.vam.wfp.org/API/Get_CSI.aspx. It is likely that one of these requests is timing out. The script is quite complicated, so it will require some time to debug.


Overall Condition of Scrapers


Looking at the scrapers as a whole rather than individually, an assessment can be made of their general health. The unfortunate conclusion is that they are not in a good state, even those which are currently running regularly. The DAP scraper review, written before the recent departure of a long-serving data scientist, warned that there is “significant key man risk as knowledge about them is not distributed and they are built and looked after in varying ways by different authors. As the number of scrapers grows, it will be difficult to maintain and if there are staff changes, this will present serious challenges.” That is the situation we are in now.


Because the collectors have not been built with maintenance and support in mind, a great deal of time is needed to understand each one. There is virtually no documentation and they do not follow a common approach or standard. They have not been coded in a consistent way or even in one language, particularly the older ones. Where code appears similar at first glance, closer inspection reveals that it has been copied, pasted and modified. There are no example web pages or files to indicate what each script is built to parse, and mostly no tests. The collectors often do not log useful output for debugging, and when they fail there is frequently either no notification or a cryptic error message.


Compounding these problems are infrastructural ones. Having a separate GitHub repository for each scraper is good practice, but it makes reviewing many scrapers at once difficult. Identifying the repository corresponding to a collector in ScraperWiki involves a manual search by name and is complicated by the fact that the repository could be in someone’s personal space, even that of an individual who has left the team.


A major headache is that there is no link to go from a data collector in ScraperWiki to its corresponding dataset in HDX. This means that one must search on HDX by name for a dataset (which is unreliable) or look it up in the scraper’s source code. This becomes even more opaque once CPS enters the equation - it seems to be very difficult to see the flow of data through CPS in its user interface.


The update frequency of a dataset in HDX is not tied to the schedule of scrapers in ScraperWiki, which are set up using crontab. For example, a collector may run every week in ScraperWiki while its corresponding dataset in HDX has update frequency “never” (particularly if CPS sits in between). Determining whether data in HDX is current is therefore not straightforward, as the same data may be downloaded repeatedly without the dataset in HDX being updated. Whether CPS refreshes datasets in HDX when the files have not changed is not clear, and it needs to be confirmed that the activity stream in HDX always shows when public datasets are updated. Some scrapers are run manually on demand, e.g. WFP food prices, and those should be documented.


Next Steps

As was mentioned for the DAP scrapers, those that are not used, or whose data is judged to be unimportant, should be deleted so that it is clear what needs to be properly maintained and tested.


We need a policy (if there is not one already) on how long a dataset can remain private and/or how long a scraper can go without running (e.g. because it is for a one-off event) before we delete it from ScraperWiki.


Scrapers need to be raised to a common standard covering style and approach, documentation, acceptance tests with data snapshots, recording of the last known good data on each successful scrape, and use of a common library.
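Two of these points, the acceptance test against a data snapshot and the recording of last known good data, can be sketched as follows. The parser, the CSV layout and the figures in the snapshot are all invented for illustration. In practice the snapshot would be a saved copy of the real source page or file.

```python
import csv
import io
import json
import os


def parse_figures(csv_text):
    """Hypothetical parser: turn indicator,value rows into a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["indicator"]: int(row["value"]) for row in reader}


def scrape(fetch, good_path):
    """Fetch and parse; overwrite the last known good file only on success."""
    data = parse_figures(fetch())
    if not data:
        raise ValueError("scrape produced no rows; keeping previous good data")
    tmp_path = good_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, sort_keys=True)
    os.replace(tmp_path, good_path)  # swap in one step so no partial file is left
    return data


# Made-up snapshot standing in for a stored copy of the source data.
SNAPSHOT = "indicator,value\naffected,1200\ndisplaced,300\n"
EXPECTED = {"affected": 1200, "displaced": 300}


def test_snapshot():
    """Acceptance test: the parser must still understand the stored snapshot."""
    assert parse_figures(SNAPSHOT) == EXPECTED
```

When a source changes its layout, the snapshot test keeps passing while the live scrape fails, which separates “the website changed” from “the code broke”.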


The link from each scraper in ScraperWiki to its dataset in HDX, possibly via CPS, needs to be fully enumerated. This will likely require database access, as the user interfaces are not sufficient to extract this information.
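One possible shape for such an enumeration is a small registry kept alongside the scrapers. The structure below is only a suggestion. The entries are limited to links mentioned in this review, with `None` marking anything that has not been traced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScraperLink:
    scraper: str                # name as shown in ScraperWiki
    hdx_dataset: Optional[str]  # HDX dataset identifier, None if not yet traced
    via_cps: Optional[bool]     # None means unknown


# Illustrative entries only, based on what this review found.
REGISTRY = [
    ScraperLink("FTS Ebola Coverage", "fts-ebola-coverage", True),
    ScraperLink("FTS Ebola", None, True),
    ScraperLink("UNHCR Mediterranean Collector", None, True),
    ScraperLink(
        "FTS: Nepal Earthquake Coverage",
        "response-plan-coverage-nepal-earthquake",
        None,
    ),
]


def untraced(registry):
    """Return the scrapers whose HDX dataset is still unknown."""
    return [link.scraper for link in registry if link.hdx_dataset is None]


print(untraced(REGISTRY))
```

Extra fields such as the repository location or the intended run frequency could be added to the same record once they are known.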