Review of Scrapers
Introduction
Scrapers collect data from websites and push it into HDX, or, in the case of the DAP scrapers, into CPS as indicators and from there into HDX. The data is extracted programmatically from spreadsheets or HTML. The status of the DAP scrapers is reviewed separately; this document considers the others.
Current Status on ScraperWiki
Name | Last run | Created |
---|---|---|
DataStore: Mali Topline Figures | a few seconds ago | a year ago |
DataStore: WFP Topline Figures | a few seconds ago | a year ago |
DataStore: Colombia Topline Figures | a few seconds ago | a year ago |
Guinea 3W Validation + DataStore | a few seconds ago | a year ago |
Nepal Earthquake Topline Figures | a few seconds ago | a year ago |
Somalia NGO Topline Figures | a few seconds ago | a year ago |
OCHA Afghanistan Topline Figures | a few seconds ago | 9 months ago |
UNOSAT Topline Figures | a few seconds ago | 9 months ago |
DataStore: UNHCR Topline Figures | a few seconds ago | 10 months ago |
Alert: HDX Dataset Revision | a few seconds ago | a year ago |
DataStore: Topline Ebola Outbreak Figures | a few seconds ago | a year ago |
Fiji Topline Figures | 4 minutes ago | a month ago |
Ebola Data Lister | 21 minutes ago | a year ago |
FTS Ebola | 23 minutes ago | a year ago |
FTS Ebola Coverage | 23 minutes ago | a year ago |
HDX Source Count Tracker | an hour ago | 6 months ago |
HDX ebola-cases-2014 Data for Visualisation | an hour ago | a year ago |
HDX Registered Users Stats | an hour ago | 9 months ago |
HDX Management Statistics | 3 hours ago | a year ago |
OpenNepal Scraper | 5 hours ago | 6 months ago |
Dragon - FTS | 6 hours ago | 2 years ago |
UNOSAT Product Scraper | 7 hours ago | a year ago |
Dragon - ReliefWeb R | 7 hours ago | 2 years ago |
FTS Emergency Collector | 7 hours ago | a year ago |
FTS Appeals Collector | 7 hours ago | a year ago |
OCHA CERF Collector | 7 hours ago | 5 months ago |
WHO GAR | 7 hours ago | a year ago |
UNHCR Real-time API | 7 hours ago | 2 years ago |
HDX Repo Analytics | 7 hours ago | 2 years ago |
WHO Ebola SitRep Scraper | 19 hours ago | a year ago |
ACLED Africa Collector | a day ago | 15 days ago |
WFP mVAM Collector | 2 days ago | 2 months ago |
UN Iraq Casualty Figures | 2 days ago | a year ago |
IDMC Global Figures | 2 days ago | 2 years ago |
UNOSAT Flood Portal (shows as successful in ScraperWiki but runs with errors) | 6 days ago | 9 months ago |
UPS blog posts | a month ago | a month ago |
IFPRI Dataverse Collector | 2 months ago | 4 months ago |
World Bank Climate Collector | 5 months ago | 5 months ago |
UNDP Climate Collector | 5 months ago | 5 months ago |
FAO Collector | 5 months ago | 6 months ago |
HDRO Collector | 5 months ago | 7 months ago |
WHO Health | 5 months ago | 5 months ago |
UNHCR Mediterranean Collector | 6 months ago | 6 months ago |
WFP VAM API Scraper | 7 months ago | a year ago |
OCHA ORS Scraper | 7 months ago | a year ago |
Datastore: Feature Organization Page | 7 months ago | 7 months ago |
FTS: Nepal Earthquake Coverage | 8 months ago | a year ago |
FTS Collector | 9 months ago | 9 months ago |
OCHA Syria Key Humanitarian Figures | 9 months ago | a year ago |
UN Habitat | 9 months ago | 9 months ago |
CAP Appeals Scraper | 9 months ago | 2 years ago |
CPSer | 9 months ago | 2 years ago |
WFP Food Prices (full) | 9 months ago | a year ago |
NGO AidMap Ebola Projects | 10 months ago | a year ago |
ReliefWeb Scraper | 10 months ago | 2 years ago |
HealthMap Ebola News (geo) | 10 months ago | 2 years ago |
Colombia SIDIH | 11 months ago | a year ago |
DataStore: Nutrition SMART Survey | a year ago | a year ago |
OCHA ROWCA ORS | a year ago | 2 years ago |
DataStore: Colombia Page Indicators | a year ago | a year ago |
HDX ebola-cases-2014 dataset updater | a year ago | a year ago |
NCDC / NOAA Precipitation Collector | Never | 4 months ago |
WorldPop Collector | Never | 3 months ago |
Scraper: Violations Documentation Center | Never | 7 months ago |
Status (colour key for the table above) | Number of Scrapers |
---|---|
Working | 32 |
Working but not run for > 4 months | 22 |
Failing | 7 |
Not run | 3 |
Notes:
Scrapers prefixed with “Dragon -” were created by Dragon; all others are attributed to “UN OCHA”, so identifying the specific author requires finding and then reading the source code.
Working but not in HDX
A number of FTS scrapers work but their output cannot be seen in HDX. FTS refers to the OCHA Financial Tracking Service. It is "a centralized source of curated, continuously updated, fully downloadable data and information on humanitarian funding flows."
The FTS Appeals and FTS Emergency collectors are working, but it is not obvious where their output goes. There are similarly named datasets in HDX but they do not have any resources - see https://data.hdx.rwlabs.org/dataset/fts-appeals and https://data.hdx.rwlabs.org/dataset/fts-emergencies. FTS Appeals in fact outputs to a number of datasets whose names end in “humanitarian contributions”, e.g. Senegal 2015 humanitarian contributions, according to https://docs.google.com/spreadsheets/d/1RdHFRn3s8uxRfbmJ6aZCghpt85mFk3aR8eDZAgwLRKs/edit?usp=sharing. This is not readily apparent from the code. Since these collectors are coded in R and rely on CPS, they should be rewritten in Python and their dependence on CPS removed.
The FTS Ebola and FTS Ebola Coverage scrapers are working but rely on CPS to get data to HDX. These collectors are written in R and should be recoded in Python without the reliance on CPS. FTS Ebola Coverage (https://data.hdx.rwlabs.org/dataset/fts-ebola-coverage) was last updated in HDX four months ago. It is not clear which dataset FTS Ebola is updating - from the code there should be one called "FTS Ebola Indicators" but it does not come up in a search on HDX.
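As an illustration of what removing the CPS dependency could look like, here is a minimal sketch of pushing a scraped CSV straight to an HDX dataset through the CKAN API using the ckanapi library. The dataset name, resource name, file path and API key are placeholders, and it assumes the HDX CKAN instance supports resource_patch; this is not the existing FTS code.

```python
# Minimal sketch: push a scraped CSV straight to HDX via the CKAN API instead
# of going through CPS. The dataset name, resource name, file path and API key
# are placeholders, not the real FTS configuration.
from ckanapi import RemoteCKAN

HDX_SITE = "https://data.hdx.rwlabs.org"
API_KEY = "REPLACE-WITH-HDX-API-KEY"  # per-user key from the HDX user profile

def update_resource(dataset_name, resource_name, csv_path):
    ckan = RemoteCKAN(HDX_SITE, apikey=API_KEY)
    dataset = ckan.action.package_show(id=dataset_name)
    # Reuse an existing resource with this name if one is already attached.
    matches = [r for r in dataset["resources"] if r["name"] == resource_name]
    with open(csv_path, "rb") as upload:
        if matches:
            ckan.action.resource_patch(id=matches[0]["id"], upload=upload)
        else:
            ckan.action.resource_create(package_id=dataset["id"],
                                        name=resource_name,
                                        format="CSV",
                                        upload=upload)

if __name__ == "__main__":
    update_resource("fts-appeals", "FTS appeals data", "fts_appeals.csv")
```

Writing the collectors against the CKAN API in this way would also make the scraper-to-dataset link explicit in the code.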
Working but not run recently
It is not easy to find out the frequency with which scrapers are supposed to run, as this information is not in the ScraperWiki GUI. The frequency tells us whether a scraper should have run in a given period. For example, if a scraper has not run for 5 months but its frequency is annual, it is probably fine, whereas if it has not run for a week and its frequency is daily, it is broken.
The table above uses the arbitrary cutoff of 4 months to try to point to scrapers that should be examined in more detail.
UNHCR Mediterranean Collector
The UNHCR Mediterranean collector had not run for 6 months. Connecting by ssh to the machine where it resides and running it manually worked fine, and afterwards it displayed as recently run on ScraperWiki.
The reason the scraper had not run is that its scheduling was broken, and the same may be true of others. As there was no cron job, it probably tries to use a Python scheduler, but experiments with this approach suggest that such a scheduler is killed on logging out. To fix the Mediterranean collector's scheduling, setting up a crontab entry is likely all that is required.
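For reference, a crontab entry along the following lines would run the collector daily and keep a log for debugging; the schedule and paths are illustrative placeholders, not taken from the actual box.

```
# Illustrative crontab entry (edit with `crontab -e`); paths are placeholders.
# Run the collector daily at 02:00 and append stdout/stderr to a log file.
0 2 * * * /bin/bash /home/<user>/tool/bin/run.sh >> /home/<user>/tool/run.log 2>&1
```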
The scraper takes data from http://data.unhcr.org/api/stats/mediterranean/monthly_arrivals_by_country.json, puts it into a ScraperWiki sqlite database and relies on CPS to get it into HDX. It is not clear which dataset is being updated in HDX; this would be clearer if the scraper were rewritten to use the CKAN API to HDX directly.
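To make the data flow explicit, below is a minimal sketch of the fetch step rewritten in Python: it pulls the monthly arrivals JSON and writes a CSV that could then be pushed to HDX with a CKAN API call like the one sketched above, bypassing the ScraperWiki database and CPS. It assumes the endpoint returns a list of flat records; the real response structure would need to be checked.

```python
# Minimal sketch: fetch the UNHCR monthly arrivals JSON and write it to a CSV
# ready for upload to HDX (e.g. via the CKAN API sketch shown earlier).
# Assumes the API returns a list of flat records; column names are taken from
# the keys of the first record rather than hard-coded.
import csv
import requests

URL = ("http://data.unhcr.org/api/stats/mediterranean/"
       "monthly_arrivals_by_country.json")

def fetch_to_csv(csv_path="monthly_arrivals_by_country.csv"):
    response = requests.get(URL, timeout=60)
    response.raise_for_status()  # fail loudly rather than silently
    records = response.json()
    if not records:
        raise RuntimeError("UNHCR API returned no records")
    fieldnames = list(records[0].keys())
    with open(csv_path, "w") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames,
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
    return csv_path

if __name__ == "__main__":
    print(fetch_to_csv())
```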
FTS Nepal Earthquake Collector
The FTS Nepal Earthquake collector has not run for eight months but since it was designed to collect data around a specific crisis, this is most likely not a problem. The dataset in HDX is here: https://data.hdx.rwlabs.org/dataset/response-plan-coverage-nepal-earthquake
FTS Collector
The FTS collector has not run for nine months. It does not have any scheduling set up and the datasets in HDX do not have any resources (https://data.hdx.rwlabs.org/dataset/fts-clusters, https://data.hdx.rwlabs.org/dataset/fts-appeals and https://data.hdx.rwlabs.org/dataset/fts-emergencies). It was likely designed to replace the two scrapers written in R with the same output datasets in HDX. It resides in a GitHub repository in Reuben's public area and should be moved to a central well known location if it is to be maintained (assuming it works).
Failed Scrapers
UNHCR Real-time API
This failed with the error: “Error in library(sqldf) : there is no package called ‘sqldf’”
Sqldf is an R library that appears to be missing. The code for this scraper is written in R and placed directly in ScraperWiki (“code in your browser”). It would be advisable to rewrite it in Python, since supporting multiple languages is a maintenance headache, and to make it standalone (“code your own tool”) in ScraperWiki.
HDX Repo Analytics
This scraper has the same issue as the UNHCR Real-Time API.
UNOSAT Product Scraper
This scraper failed because one of the URLs that it scrapes from a webpage was a dead link and it has no check for this possibility. The fault was difficult to find because the scraper is written in R with limited logging and error handling. After fixing it, although the scraper claims to have run successfully, the datasets on HDX seem to remain unchanged, hopefully because the data is assessed as unmodified; this requires further study. The metadata modification date is generally 24/11/2015, but a few datasets are newer (from Luis's spreadsheet). Ultimately, the scraper should be rewritten in Python with detailed logging and error handling.
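As an illustration of the kind of defensive check and logging a Python rewrite could include, the sketch below skips and logs dead links instead of failing on them; the function names are hypothetical and not taken from the existing scraper.

```python
# Illustrative sketch of defensive link handling for a Python rewrite: check
# each scraped URL before using it, log the dead ones and carry on instead of
# aborting the whole run. Function names are hypothetical.
import logging
import requests

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("unosat-products")

def url_is_alive(url, timeout=30):
    """Return True if the URL responds with a non-error status."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=timeout)
        if response.status_code == 405:  # some servers reject HEAD requests
            response = requests.get(url, stream=True, timeout=timeout)
        return response.status_code < 400
    except requests.RequestException as err:
        logger.warning("Request for %s failed: %s", url, err)
        return False

def filter_dead_links(urls):
    """Split scraped URLs into live and dead ones, logging the dead ones."""
    live = []
    for url in urls:
        if url_is_alive(url):
            live.append(url)
        else:
            logger.error("Dead link skipped: %s", url)
    return live
```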
UNOSAT Flood Portal
Running this scraper produced this output:
lgfoslr@cobalt-u:~$ bash tool/bin/run.sh
→ Cleaning table `unprocessed_data`.
→ Table `unprocessed_data` cleaned successfully.
ERROR: Failed to store record in database.
(sqlite3.OperationalError) database is locked [SQL: "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"]
→ Processing dates.
→ Identifying countries.
→ Identifying file extension.
→ Cleaning title.
→ Cleaning table `processed_data`.
ERROR: Failed to clean table `processed_data`.
(sqlite3.OperationalError) database is locked [SQL: "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"]
ERROR: Failed to store record in database.
(sqlite3.OperationalError) database is locked [SQL: "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"]
SUCCESS: Successfully patched 1485 records.
SUCCESS: Successfully fetched 1485 records from the UNOSAT Flood Portal.
→ Exporting Datasets JSON to disk.
(sqlite3.OperationalError) database is locked [SQL: "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"]
Although it shows as successful on ScraperWiki, given these errors it is not clear whether it is working, and further analysis is required, including checking the metadata modification dates of the corresponding datasets on HDX.
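The “database is locked” errors point to concurrent access to the sqlite file or a connection holding a lock open. If the tool keeps sqlite, one possible mitigation, sketched below under the assumption that it uses SQLAlchemy (as the error format suggests), is to share a single engine with a generous busy timeout and group writes into one transaction; whether this fits the tool's actual structure would need to be checked.

```python
# Hedged sketch of one way to avoid "database is locked": a single shared
# SQLAlchemy engine with a generous sqlite busy timeout, with writes grouped
# into one transaction. Table and column names are illustrative only.
from sqlalchemy import create_engine, text

# timeout: seconds sqlite waits for a lock before raising OperationalError.
engine = create_engine("sqlite:///data.sqlite", connect_args={"timeout": 60})

def store_records(records):
    """records: a list of dicts with 'title' and 'url' keys (illustrative)."""
    with engine.begin() as conn:  # one connection, one transaction
        conn.execute(text(
            "CREATE TABLE IF NOT EXISTS processed_data (title TEXT, url TEXT)"))
        conn.execute(
            text("INSERT INTO processed_data (title, url) VALUES (:title, :url)"),
            records)
```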
IFPRI Dataverse Collector
When run, this scraper produces the error: “Service Temporarily Unavailable. The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.”
It tries to connect to this URL: https://dataverse.harvard.edu/api/dataverses/IFPRI/contents, which fails in the same way in a browser.
This seems to be a problem with the website and needs to be followed up with the maintainer.
WFP VAM API Scraper
This scraper runs for several hours and then fails. It posts a form to a number of URLs of the form http://reporting.vam.wfp.org/API/Get_CSI.aspx along with a set of parameters, and it is likely that one of these requests is timing out. The script is quite complicated, so it will take some time to debug.
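Pinning down which request is failing would be easier if each POST carried an explicit timeout, retries and logging; a sketch of that pattern is shown below. The endpoint and payload are placeholders, not the scraper's real configuration.

```python
# Illustrative pattern for the VAM requests: explicit timeout, a few retries
# and logging of exactly which URL and attempt failed. The endpoint and
# payload are placeholders, not the scraper's real configuration.
import logging
import time
import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("wfp-vam")

def post_with_retry(url, data, timeout=120, retries=3, backoff=30):
    for attempt in range(1, retries + 1):
        try:
            response = requests.post(url, data=data, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException as err:
            logger.warning("POST %s (attempt %d/%d) failed: %s",
                           url, attempt, retries, err)
            if attempt == retries:
                raise
            time.sleep(backoff * attempt)

# Usage sketch with a placeholder payload:
# response = post_with_retry("http://reporting.vam.wfp.org/API/Get_CSI.aspx",
#                            data={"placeholder": "value"})
```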
Overall Condition of Scrapers
Looking at the scrapers as a whole rather than individually, an assessment can be made of their general health. The unfortunate conclusion is that they are not in a good state, even those which are currently running regularly. The DAP scraper review, written before the recent departure of a long-serving data scientist, already warned that there is “significant key man risk as knowledge about them is not distributed and they are built and looked after in varying ways by different authors. As the number of scrapers grows, it will be difficult to maintain and if there are staff changes, this will present serious challenges.” That is the situation we are in now.
Because the collectors have not been built with maintenance and support in mind, a great deal of time must be spent understanding each one. There is virtually no documentation, nor do they follow a common approach or standard. They have not been coded in a consistent way, or even in one language, particularly the older ones. Where code appears at first glance similar, closer inspection reveals that it has been copied, pasted and modified. There are no example web pages or files to indicate what each script is built to parse, and mostly no tests. The collectors often do not log useful output for debugging, and when they fail there is frequently no notification, or only a cryptic error message.
Compounding these problems are infrastructural ones. Having a separate GitHub repository for each scraper is good practice, but it makes reviewing many scrapers at once laborious. Identifying the repository corresponding to a collector in ScraperWiki involves a manual search by name and is complicated by the fact that the repository could be in someone's personal space, even that of an individual who has left the team.
A major headache is that there is no link to go from a data collector in ScraperWiki to its corresponding dataset in HDX. This means that one must search on HDX by name for a dataset (which is unreliable) or look it up in the scraper’s source code. This becomes even more opaque once CPS enters the equation - it seems to be very difficult to see the flow of data through CPS in its user interface.
The update frequency of a dataset in HDX is not tied to the schedule of its scraper in ScraperWiki, which is set up using crontab. For example, a collector may run every week in ScraperWiki while its corresponding dataset in HDX has update frequency “never” (particularly if CPS sits in between), so determining whether data in HDX is current is not straightforward: the same data may be being downloaded and then not updated in HDX. Whether CPS refreshes datasets in HDX when the files have not changed is not clear, and it needs to be confirmed that the activity stream in HDX always shows when public datasets are updated. Some scrapers are run manually on demand, e.g. WFP food prices, and those should be documented.
Next Steps
As was recommended for the DAP scrapers, those that are not used, or whose data is judged unimportant, should be deleted so that it is clear what needs to be properly maintained and tested.
We need a policy (if there is not one already) on how long a dataset stays private and/or how long a scraper goes without running (e.g. because it was for a one-off event) before we delete it from ScraperWiki.
Scrapers need to be raised to a common standard covering style and approach, documentation, acceptance tests with data snapshots, recording the last known good data on each successful scrape, and use of a common library.
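As one illustration of the “last known good data” point, a shared helper along the following lines (entirely hypothetical, not an existing library) could be called by every scraper at the end of a successful run to snapshot its output and record a status file:

```python
# Hypothetical shared helper illustrating "record last known good data on each
# successful scrape": copy the scraper's output aside with a timestamp and
# write a small status file, so failures can be diagnosed and data rolled back.
import json
import shutil
from datetime import datetime
from pathlib import Path

SNAPSHOT_DIR = Path("last_known_good")  # placeholder location

def record_last_known_good(scraper_name, output_path):
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    stamp = datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    snapshot = SNAPSHOT_DIR / ("%s_%s_%s" % (scraper_name, stamp,
                                             Path(output_path).name))
    shutil.copy2(output_path, snapshot)
    status = {"scraper": scraper_name, "run": stamp, "snapshot": str(snapshot)}
    with open(SNAPSHOT_DIR / ("%s_status.json" % scraper_name), "w") as f:
        json.dump(status, f, indent=2)
    return snapshot
```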
The link from each scraper in ScraperWiki to its dataset in HDX, possibly via CPS, needs to be fully enumerated; this will likely require database access, as the user interfaces are not sufficient to extract this information.