By: Nafissatou Pouye and Metasebya Sahlu

Household surveys, needs assessments and other forms of microdata make up an increasingly significant volume of data in the humanitarian sector. This type of data is critical to determining the needs and perspectives of people affected by crises. Sharing microdata, whether with colleagues, partner organisations or publicly, can improve the effectiveness and efficiency of a response, but it also presents unique risks.

...

Over the past year, our team at HDX have taken a number of steps to ensure that microdata can be shared on the platform in a way that limits the risk to vulnerable populations and communities. In this post, we will describe the steps that we have taken to provide the opportunity for organisations to share this valuable data on HDX in a responsible way.

Our approach to handling microdata

To handle the microdata shared on HDX, we use an open-source software package for Statistical Disclosure Control (SDC) called sdcMicro. The tool was developed by Statistics Austria, the Vienna University of Technology, the International Household Survey Network (IHSN), PARIS21 (OECD), and the World Bank. 

...

When new microdata is shared on HDX, the HDX team follows the process described in the workflow below.

...

The sdcMicro tool is a useful start, but we have observed its limitations. There are only three SDC methods available for categorical variables, which are variables that take values over a finite set (e.g., gender). Also, the process can take longer for large microdata. The risk-utility trade-off between lowering the disclosure risk and limiting information loss is tricky to handle for microdata that has a high risk of disclosure. 

...

  1. New Microdata is Added to the Platform: When a user uploads a resource on the HDX platform, they are asked to indicate if the resource contains microdata. We also manually verify whether a resource contains microdata as part of the standard quality assurance process that we perform on all new resources added to the platform. 

  2. Perform Quality Assurance Checks: Next, we perform a set of quality assurance checks, which includes a review of the data to determine whether the dataset includes microdata or other potentially sensitive information. If the HDX team detects sensitive information, we mark the dataset ‘under review’ and perform a disclosure risk assessment. We also notify the contributor via HDX and email at this stage to let them know that their data will temporarily be unavailable for download while we complete our assessment.

  3. Assess Disclosure Risk: If we determine that there is potentially sensitive data in the file, the next step is the disclosure risk assessment, which we run using sdcMicro. Our team first develops disclosure scenarios, then selects key variables and finally runs the disclosure risk assessment. If the dataset has a ‘global risk’ of less than three percent, it is deemed safe to share on HDX and will be taken out of review and made public on the platform. If the global risk is higher than three percent, we will talk to the contributor about how to proceed. It is important to note that there are a few different methods for quantifying disclosure risk; global risk is just one of them. While our threshold for sharing data on HDX is a global risk of under three percent, we also look at the individual risk scores to ensure that no individual has a particularly high risk of disclosure. You can learn more about these methods and details on all the steps of the process by following our Disclosure Risk Assessment Learning Path.

  4. Inform Contributor: If we determine that the global risk of re-identification is above our threshold, we will contact the contributor via email to share the results of the assessment and discuss how to reduce the disclosure risk. We may have recommendations for how they can use disclosure control techniques to reduce the disclosure risk or, in some cases, may advise that they forgo the SDC process and instead share only the metadata and make the full dataset available ‘by request’.

  5. Applying SDC: There are a number of perturbative and non-perturbative methods that can be used to reduce the risk of disclosure in data. With perturbative methods, the data value is altered in order to create uncertainty about what the true value is. With non-perturbative methods, the goal is to reduce the detail of the data by suppressing individual values or combining values by creating intervals or brackets (e.g., age from 19 → ‘18 - 25’).

  6. Re-Assessing Risk and Quantifying Information Loss: Applying these methods will necessarily result in information loss. Through this process, our goal is to find a balance between limiting risk and maximising the utility of the data. Therefore, after the techniques have been applied and we are sure that we have successfully lowered the risk, we also assess the data utility of the treated data by quantifying the information loss. In some cases, if the steps we need to take to reduce the disclosure risk of the data result in too much information loss, we may advise that you share the original data ‘by request’ rather than share the treated data publicly. 

  7. Sharing Data Via HDX Connect: Finally, if we determine that data cannot be shared publicly, we provide the option to share only the metadata and make the dataset available ‘by request’. This option allows data contributors to control whether and how to share their data.
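
To make steps 3 to 6 more concrete, here is a minimal Python sketch of the assess, treat and re-assess loop on toy data. It does not use sdcMicro and it is not the estimator that sdcMicro applies: as a simplifying assumption, the individual risk of a record is taken to be 1 / (number of records sharing the same key-variable values), with no sampling weights, and the global risk is the average of those values. The toy records, the function names, the age brackets (other than ‘18 - 25’) and the information-loss proxy are all hypothetical and chosen only for illustration.

```python
from collections import Counter

GLOBAL_RISK_THRESHOLD = 0.03  # the three percent threshold from step 3


def individual_risks(records, key_vars):
    """Assumed proxy: 1 / f_k, where f_k is the number of records that
    share the same combination of key-variable values."""
    keys = [tuple(r[v] for v in key_vars) for r in records]
    freq = Counter(keys)
    return [1 / freq[k] for k in keys]


def global_risk(records, key_vars):
    """Average of the individual risk proxies across all records."""
    risks = individual_risks(records, key_vars)
    return sum(risks) / len(risks)


def decide(records, key_vars):
    """Mirror the decision in step 3: publish below the threshold,
    otherwise contact the contributor."""
    if global_risk(records, key_vars) < GLOBAL_RISK_THRESHOLD:
        return "public"
    return "contact contributor"


def recode_age(age, brackets=((0, 17), (18, 25), (26, 40), (41, 120))):
    """Non-perturbative recoding: replace an exact age with its bracket.
    Ages outside every bracket are suppressed (returned as None)."""
    for low, high in brackets:
        if low <= age <= high:
            return f"{low} - {high}"
    return None


def information_loss(original, treated, variable):
    """Crude, hypothetical proxy: share of distinct values of one
    variable that disappear after treatment."""
    before = len({r[variable] for r in original})
    after = len({r[variable] for r in treated})
    return 1 - after / before


# Hypothetical toy microdata: 600 respondents, three key variables.
KEY_VARS = ("age", "gender", "district")
records = [
    {
        "age": 18 + (i % 20),
        "gender": "f" if i % 2 == 0 else "m",
        "district": "A" if i % 3 == 0 else "B",
    }
    for i in range(600)
]

# Step 5: treat the data by recoding exact ages into brackets.
treated = [{**r, "age": recode_age(r["age"])} for r in records]

# Steps 3 and 6: assess risk before and after, then quantify the loss.
for label, data in (("original", records), ("treated", treated)):
    print(label, round(global_risk(data, KEY_VARS), 4), decide(data, KEY_VARS))
print("highest individual risk after treatment:",
      max(individual_risks(treated, KEY_VARS)))
print("information loss on age:", information_loss(records, treated, "age"))
```

On this toy data the global risk proxy falls from roughly 6.7% to roughly 1.3% once ages are recoded, which moves the dataset under the three percent threshold, while the age variable loses most of its detail. That is the risk-utility trade-off described in step 6.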

...

Limitations of sdcMicro & Other Open-Source SDC tools

sdcMicro is a great place to start for humanitarian organisations new to statistical disclosure control. It is an add-on package for R: once you have downloaded and installed it, you are ready to get started. If you are new to R, there will certainly be a learning curve, but for R users it should be quite small. sdcMicro does have a few limitations. For example, there are only three statistical methods available for categorical variables. These are variables that take values over a finite set (e.g., gender). Categorical variables also happen to be more prevalent than continuous variables in humanitarian microdata. Furthermore, the process can take quite a long time for large microdata files. Finally, the risk-utility trade-off between lowering the disclosure risk and limiting information loss can be tricky to manage for microdata that has a high risk of disclosure.

Different research institutions and statistical offices have developed generic or specifically tailored SDC tools and made them openly available to the public. To find alternatives that mitigate the shortcomings we identified with sdcMicro, we explored the ARX Data Anonymization Tool developed by the Technical University of Munich and the μ-ARGUS tool developed by Statistics Netherlands.

...