Background

Scrapers are scripts that read data or URLs from websites or servers, complete dataset metadata and create datasets in HDX. They either read URLs to include as metadata in HDX resources, or download files, do some processing and upload the results into the HDX filestore.

HDX has a RESTful API, largely unchanged from the underlying CKAN API, which can be used from any programming language that supports HTTP GET and POST requests. The HDX Python API provides a simple interface that communicates with HDX using the CKAN Python API, a thin wrapper around the CKAN REST API. It is a mature library that supports Python 2.7 and 3, and its tests have a high level of code coverage. The major goal of the library is to make pushing and pulling data from HDX as simple as possible for the end user. HDX objects, such as datasets and resources, are represented by Python classes. The scrapers discussed here use this library to communicate with HDX.
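To illustrate the kind of HTTP request that sits underneath these libraries, here is a minimal sketch that reads one dataset's metadata through the CKAN action API using only the Python standard library. It is not how the HDX Python API is implemented. The site URL, the helper names and the dataset name are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed values for illustration; check them against the HDX site you use.
HDX_SITE = "https://data.humdata.org"
API_PATH = "/api/3/action/"


def build_action_url(action, **params):
    """Return the URL for a CKAN action such as package_show."""
    url = HDX_SITE + API_PATH + action
    if params:
        url += "?" + urlencode(params)
    return url


def parse_action_response(body):
    """Unwrap a CKAN action response, raising if the call failed."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise RuntimeError(payload.get("error", "unknown CKAN error"))
    return payload["result"]


def read_dataset(name, api_key=None):
    """Fetch one dataset's metadata with an HTTP GET (needs network access)."""
    request = Request(build_action_url("package_show", id=name))
    if api_key:
        # CKAN accepts the user's API key in the Authorization header.
        request.add_header("Authorization", api_key)
    with urlopen(request) as response:
        return parse_action_response(response.read().decode("utf-8"))
```

Calling read_dataset("my-dataset") with a hypothetical dataset name would return the dataset's metadata as a dictionary. The HDX Python API wraps requests of this kind in classes such as datasets and resources, so scrapers do not need to build URLs by hand.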

Current Platform

The current platform is an external service called ScraperWiki. It has a web-based user interface in which the status of scrapers can be viewed. Through the UI, it is possible to create a new environment in which to run new scrapers. The environment is an Amazon Web Services virtual server with various packages, such as Python, preinstalled. The UI allows the user to obtain a URL for the server. The user can then SSH into that server and set up the scraper by, for example, git cloning its code onto the server and setting up a virtualenv with the required Python packages. Cron is used to execute the scrapers according to the desired schedule.

Requirements for New Platform

With the move to the new platform, it was decided to deprecate many old scrapers, so the range of technologies needed has been reduced dramatically. These are the high-level requirements for the new platform:

Choice of platform

Given these requirements, it was decided (subject to approval) to use Jenkins on an OCHA IT server. Jenkins is typically used for running unit tests, but it can schedule test runs and has a user interface for looking at suites of tests. To use Jenkins, we need only treat each scraper like a suite of unit tests. Jenkins is already deployed on OCHA IT infrastructure, which means that the software is already approved in another context and that the expertise to understand and support it exists.
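One way to picture treating a scraper like a suite of unit tests is sketched below using Python's standard unittest module. The run_scraper function and the shape of the data it returns are hypothetical stand-ins, not part of any existing scraper. The idea is only that a failed or empty scraper run shows up as a failed test.

```python
import unittest


def run_scraper():
    """Hypothetical stand-in for a real scraper's entry point."""
    # A real scraper would download data and create datasets in HDX here.
    return [{"name": "example-dataset", "resources": 1}]


class TestScraperRun(unittest.TestCase):
    """Each test checks one aspect of a single scraper run."""

    def test_scraper_produces_datasets(self):
        datasets = run_scraper()
        self.assertTrue(datasets, "scraper produced no datasets")
        for dataset in datasets:
            # Every dataset should carry at least one resource.
            self.assertGreater(dataset["resources"], 0)


def run_suite():
    """Run the suite and return True only if every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestScraperRun)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

A scheduled job could call run_suite() and exit with a non-zero status when it returns False, which a build server would normally report as a failed run.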

Rather than executing within a Python virtualenv as they do currently, the scrapers will each run in a Docker container. The scrapers' Docker images will build upon (inherit) a base image owned by OCHA IT. The draft base image is here. It inherits from unocha/alpine-base:3.8 and contains a Python 3 environment suitable for running scrapers: it includes the HDX Python API library, awesome-slugify and Pandas (including its dependencies on SciPy and NumPy). The libraries that the HDX Python API depends on are all open source. An example scraper that inherits this base image is the FTS scraper.

The scrapers need some private information in order to run. Currently it resides in a private OCHA GitHub repository, but it will be moved to Ansible.

The setup will comply with OCHA IT's Hosting in Shared Infrastructure: Project Requirements.