HDX to Frictionless

The motivation for this project is to take advantage of third-party libraries that have been developed for tasks such as data validation. These libraries were created by the Open Knowledge Foundation and are built to support a data standard called Frictionless Data. If that standard becomes commonplace, HDX supporting it will be an additional benefit.

Uptake of Frictionless

Uptake of the Frictionless standard has been assessed in two ways: by reviewing the case studies outlined in Implementers of Frictionless, and by monitoring interest through GitHub Repo Uptake Statistics. The statistics suggest a steady increase in interest in Frictionless tooling, which may indicate growing adoption of the standard.

Frictionless Technologies

For now, the main interest is in reusing the technologies underpinning Frictionless. The list of available tools expanded significantly between June and November 2017, with libraries added for all major programming languages. The ones we might use or borrow code from are:

Data Curator

Desktop CSV editor to help describe, validate and share usable open data.

goodtables-py

Validate and process tabular data in Python.

Stenci.la (coming soon)

The office suite for reproducible research

Import for Google Spreadsheets (experimental)

Import Tabular Data Packages into Google Spreadsheets.

Data Package Pipelines

Framework for processing data packages in pipelines of modular components.

datapackage-py/js/...

A library for working with Data Packages.

tableschema-py/js/...

A library for working with Table Schema.

tabulator-py

Consistent interface for stream reading and writing tabular data (csv/xls/json/etc).
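To make the idea of validating tabular data against a Table Schema more concrete, here is a minimal sketch using only the Python standard library. It does not use goodtables-py or tableschema-py and covers far less than they do; the field names and sample rows are hypothetical. The only thing borrowed from the standard is the shape of the descriptor: a "fields" list in which each field has a "name" and a "type".

```python
import csv
import io

# A Table Schema style descriptor: a list of fields, each with a name and a type.
# The field names are hypothetical, not taken from an HDX dataset.
SCHEMA = {
    "fields": [
        {"name": "country", "type": "string"},
        {"name": "year", "type": "integer"},
        {"name": "affected", "type": "integer"},
    ]
}

# Casters for the few field types this sketch handles.
CASTERS = {"string": str, "integer": int, "number": float}


def validate(text, schema):
    """Return a list of (row_number, message) errors found in CSV text."""
    errors = []
    reader = csv.reader(io.StringIO(text))
    expected = [field["name"] for field in schema["fields"]]
    headers = next(reader, [])
    if headers != expected:
        errors.append((1, "headers %r do not match schema %r" % (headers, expected)))
        return errors
    # Data rows start on line 2 because line 1 holds the headers.
    for row_number, row in enumerate(reader, start=2):
        if len(row) != len(expected):
            errors.append(
                (row_number, "expected %d cells, got %d" % (len(expected), len(row)))
            )
            continue
        for field, value in zip(schema["fields"], row):
            try:
                CASTERS[field["type"]](value)
            except ValueError:
                errors.append(
                    (row_number, "%s: %r is not a valid %s"
                     % (field["name"], value, field["type"]))
                )
    return errors


# Hypothetical sample data with one bad cell in the last row.
SAMPLE = "country,year,affected\nKenya,2017,1200\nMali,2017,unknown\n"
errors = validate(SAMPLE, SCHEMA)
```

In practice the Frictionless libraries would replace this hand-written check; the sketch only shows the kind of error report such a tool produces.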

Where to use Frictionless

There are a few areas where Frictionless can be used.

HDX Utilities library

Tabulator-py is already used in the HDX Utilities library and, through it, in the HDX Python API (for uploading to the HDX datastore) and in the Chatham House project.

HXL Proxy

Tabulator-py could also replace the stream reading code in the HXL Proxy. The advantage is that improvements made by others to the package would automatically become available to the HXL Proxy. The main disadvantage is the time needed to refactor the HXL Proxy to use it and to identify any missing features.

Migration Tool for Organisations

Import for Google Spreadsheets could help organisations move easily from local Excel spreadsheets to Google Spreadsheets, in which we can embed a trigger that detects whether the data has changed, for freshness purposes.
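One possible way to decide whether data has changed is to compare a fingerprint of the current rows with one stored earlier. The sketch below illustrates that idea in Python with the standard library only. It is an assumption about how such a check might work, not a description of an existing trigger, and a trigger embedded in a Google Spreadsheet would need to be written for that environment instead. The function names are hypothetical.

```python
import hashlib


def fingerprint(rows):
    """Return a SHA-256 hex digest of the rows, taken in order."""
    digest = hashlib.sha256()
    for row in rows:
        # Join cells with a unit separator so ["a", "bc"] and ["ab", "c"] differ.
        digest.update("\x1f".join(str(cell) for cell in row).encode("utf-8"))
        # Mark the end of each row with a record separator.
        digest.update(b"\x1e")
    return digest.hexdigest()


def has_changed(rows, previous_fingerprint):
    """Return True if the rows no longer match the stored fingerprint."""
    return fingerprint(rows) != previous_fingerprint


# Hypothetical data: the fingerprint is stored when the data is first seen.
original = [["country", "affected"], ["Kenya", 1200]]
stored = fingerprint(original)
updated = [["country", "affected"], ["Kenya", 1300]]
```

Here has_changed(original, stored) is False and has_changed(updated, stored) is True, so only the second case would count as a fresh update.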

HDX

datapackage-js could be used to export data packages from HDX.
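As a rough illustration of what an exported data package contains, the sketch below builds a minimal datapackage.json descriptor in Python using only the standard library. It does not use datapackage-js or datapackage-py, the build_descriptor helper is hypothetical, and the dataset and resource names are invented for the example. A Data Package descriptor requires a "resources" list, and each resource here carries a "name" and a "path".

```python
import json


def build_descriptor(name, title, resources):
    """Build a minimal Data Package descriptor as a plain dict."""
    return {
        "name": name,
        "title": title,
        "resources": [
            {
                "name": resource["name"],
                "path": resource["path"],
                "format": resource["format"],
            }
            for resource in resources
        ],
    }


# Hypothetical dataset metadata, not a real HDX dataset.
descriptor = build_descriptor(
    name="example-dataset",
    title="Example Dataset",
    resources=[
        {"name": "affected-people", "path": "affected-people.csv", "format": "csv"}
    ],
)

# The text that would be written to datapackage.json alongside the data files.
datapackage_json = json.dumps(descriptor, indent=2, sort_keys=True)
```

A real export would also need to map HDX dataset metadata onto the descriptor and, ideally, attach a Table Schema to each tabular resource.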

Data Check

goodtables-py, Data Curator and Stenci.la (looks like a cross between Word and Pandas) could provide code and ideas for this tool. This would be the most significant use of