Household surveys, needs assessments and other forms of microdata make up an increasingly significant volume of data in the humanitarian sector. This type of data is critical to determining the needs and perspectives of people affected by crises. Sharing microdata, whether with colleagues, partner organisations or publicly, can improve the effectiveness and efficiency of a response, but it also presents unique risks.
HDX does not allow data that includes personally identifiable information (PII) to be shared publicly through the platform (read more in our HDX Terms of Service). However, with microdata it may still be possible to infer confidential information, or to re-identify an individual, even after the personally identifiable information has been removed. This is done by combining what are known as ‘key variables’ and matching these combinations of key variables with information in an external file or other available information sources. The risk that individuals could be re-identified, or that other confidential information could be inferred, is called the disclosure risk.
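To make the matching step concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: the records, the variable names, the choice of key variables and the `link` helper are invented for illustration and do not come from any real dataset or from HDX tooling. It only shows the idea that a unique combination of key variables can be matched against an external file that still carries names.

```python
# Hypothetical released microdata: names removed, key variables kept.
released = [
    {"age": 34, "sex": "F", "district": "North", "needs_food_aid": True},
    {"age": 34, "sex": "M", "district": "North", "needs_food_aid": False},
    {"age": 61, "sex": "F", "district": "South", "needs_food_aid": True},
    {"age": 61, "sex": "F", "district": "South", "needs_food_aid": False},
]

# Hypothetical external file that an intruder might hold.
external = [
    {"name": "Person A", "age": 34, "sex": "F", "district": "North"},
    {"name": "Person B", "age": 61, "sex": "F", "district": "South"},
]

# Assumed key variables for this illustration.
KEY_VARIABLES = ("age", "sex", "district")


def key_of(record):
    """Return the combination of key-variable values for one record."""
    return tuple(record[v] for v in KEY_VARIABLES)


def link(released_rows, external_rows):
    """Map each external name to a released record when the match is unique."""
    by_key = {}
    for row in released_rows:
        by_key.setdefault(key_of(row), []).append(row)

    matches = {}
    for person in external_rows:
        candidates = by_key.get(key_of(person), [])
        if len(candidates) == 1:  # only a unique combination re-identifies someone
            matches[person["name"]] = candidates[0]
    return matches


reidentified = link(released, external)
for name, row in reidentified.items():
    print(name, "->", row)
```

In this toy example only the first person is re-identified, because their combination of key variables is unique in the released file. The second person shares a combination with another respondent, so the match is ambiguous.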
Over the past year, our team at HDX have taken a number of steps to ensure that microdata can be shared on the platform in a way that limits the risk to vulnerable populations and communities. In this post, we will describe the steps that we have taken to provide the opportunity for organisations to share this valuable data on HDX in a responsible way.
Our approach to handling microdata
To handle the microdata shared on HDX, we use an open-source software package for Statistical Disclosure Control (SDC) called sdcMicro. The tool was developed by Statistics Austria, the Vienna University of Technology, the International Household Survey Network (IHSN), PARIS21 (OECD), and the World Bank.
The SDC process in sdcMicro is divided into three steps:
Perform a disclosure risk assessment by identifying the key variables.
Apply SDC methods to reduce the risk of disclosing information on individuals.
Re-measure the risk and quantify the information loss.
When new microdata is shared on HDX, the HDX team follows the process described in the workflow below.
New Microdata is Added to the Platform: When a user uploads a resource on the HDX platform, they are asked to indicate if the resource contains microdata. We also manually verify whether a resource contains microdata as part of the standard quality assurance process that we perform on all new resources added to the platform.
Perform Quality Assurance Checks: Next, we perform a set of quality assurance checks, which includes a review of the data to determine whether the dataset includes microdata or other potentially sensitive information. If the HDX team detects sensitive information, we mark the dataset 'under review' and perform a disclosure risk assessment. We also notify the contributor via HDX and email at this stage to let them know that their data will temporarily be unavailable for download while we complete our assessment.
Assess Disclosure Risk: If we determine that there is potentially sensitive data in the file, the next step is the disclosure risk assessment. This is done using sdcMicro. Our team will first develop disclosure scenarios, then select key variables and finally run the disclosure risk assessment. If the dataset has a ‘global risk’ of less than three percent, then it is deemed safe to share on HDX and will be taken out of review and made public on the platform. If the global risk is higher than three percent, we will talk to the contributor about how to proceed. It is important to note that there are a few different methods for quantifying disclosure risk; global risk is just one of them. While our threshold for sharing data on HDX is a global risk of under three percent, we also look at the individual risk scores to ensure that no individual has a particularly high risk of disclosure. You can learn more about these methods and details on all the steps of the process by following our Disclosure Risk Assessment Learning Path.
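The relationship between individual risk, global risk and the three percent threshold can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the estimator that sdcMicro uses: it assumes the file covers the whole population, so each record's risk is taken as one divided by the number of records sharing its combination of key variables, and the global risk is the average of those values. Real survey data usually carries sampling weights, which this sketch ignores. The data, variable names and function names are hypothetical.

```python
from collections import Counter

GLOBAL_RISK_THRESHOLD = 0.03  # the three percent threshold described above


def individual_risks(rows, key_variables):
    """Simplified risk per record: 1 / (records sharing its key combination)."""
    keys = [tuple(row[v] for v in key_variables) for row in rows]
    frequency = Counter(keys)
    return [1.0 / frequency[k] for k in keys]


def global_risk(rows, key_variables):
    """Average of the simplified individual risks (assumes no sampling weights)."""
    risks = individual_risks(rows, key_variables)
    return sum(risks) / len(risks)


# Hypothetical file: two large groups plus two respondents with unique combinations.
rows = (
    [{"age_group": "18 - 25", "sex": "F"}] * 49
    + [{"age_group": "18 - 25", "sex": "M"}] * 49
    + [{"age_group": "65+", "sex": "F"}, {"age_group": "65+", "sex": "M"}]
)
key_variables = ("age_group", "sex")

risk = global_risk(rows, key_variables)
highest = max(individual_risks(rows, key_variables))
below_threshold = risk < GLOBAL_RISK_THRESHOLD

print(f"global risk: {risk:.2%}, highest individual risk: {highest:.0%}")
print("below threshold" if below_threshold else "above threshold")
```

Here the global risk comes out at about four percent, so the file would not pass the threshold. The two respondents with unique combinations also show why individual risk is worth checking separately: their simplified risk is 100 percent even though most records in the file are well protected.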
Inform Contributor: If we determine that the global risk of re-identification is above our threshold, we will contact the contributor via email to share the results of the assessment and discuss how to reduce the disclosure risk. We may have recommendations for how they can use disclosure control techniques to reduce the disclosure risk or, in some cases, may advise that they forgo the SDC process and instead only share the metadata and make the full dataset available ‘by request’.
Applying SDC: There are a number of perturbative and non-perturbative methods that can be used to reduce the risk of disclosure in data. Perturbative methods alter the data values in order to create uncertainty about what the true value is. Non-perturbative methods, on the other hand, reduce the detail of the data by suppressing individual values or by combining values into intervals or brackets (e.g. an age of 19 becomes ‘18 - 25’).
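The difference between the two families of methods can be illustrated with a short plain-Python sketch. It does not use sdcMicro, and the age brackets, variable names, noise range and helper functions are all assumptions made for this example. Recoding and suppression stand in for non-perturbative methods, and adding bounded random noise stands in for a perturbative one.

```python
import random

# Hypothetical age brackets chosen for this illustration.
AGE_BRACKETS = [(0, 17), (18, 25), (26, 40), (41, 64), (65, 120)]


def recode_age(age):
    """Non-perturbative: replace an exact age with the bracket it falls in."""
    for low, high in AGE_BRACKETS:
        if low <= age <= high:
            return f"{low} - {high}"
    return None  # outside every bracket: treat as missing


def suppress(row, variable):
    """Non-perturbative: return a copy of the row with one value blanked out."""
    treated = dict(row)
    treated[variable] = None
    return treated


def add_noise(value, spread, rng):
    """Perturbative: shift a numeric value by a random amount within +/- spread."""
    return value + rng.uniform(-spread, spread)


row = {"age": 19, "district": "North", "household_income": 1200.0}
rng = random.Random(0)  # fixed seed so the sketch is repeatable

recoded = dict(row, age=recode_age(row["age"]))
suppressed = suppress(row, "district")
perturbed = dict(row, household_income=add_noise(row["household_income"], 50.0, rng))

print(recoded)
print(suppressed)
print(perturbed)
```

The recoded and suppressed rows remain truthful but less detailed, while the perturbed row keeps its level of detail but no longer holds the true income value.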
Re-Assessing Risk and Quantifying Information Loss: Applying these methods will necessarily result in information loss. Through this process, our goal is to find a balance between limiting risk and maximising the utility of the data. Therefore, after the techniques have been applied and we are sure that we have successfully lowered the risk, we also assess the data utility of the treated data by quantifying the information loss. In some cases, if the steps we need to take to reduce the disclosure risk of the data result in too much information loss, we may advise that you share the original data ‘by request’ rather than share the treated data publicly.
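There are many ways to quantify information loss, and the right measure depends on how the data will be used. As one very simple illustration, the plain-Python sketch below reports the share of cells that differ between the original and the treated file. The function name, the records and the choice of measure are assumptions made for this example; they are not the specific measures used by sdcMicro or by the HDX team.

```python
def share_of_cells_changed(original, treated, variables):
    """Fraction of (record, variable) cells whose value differs after treatment."""
    if len(original) != len(treated):
        raise ValueError("original and treated files must have the same number of rows")
    total = len(original) * len(variables)
    if total == 0:
        return 0.0
    changed = sum(
        1
        for before, after in zip(original, treated)
        for v in variables
        if before[v] != after[v]
    )
    return changed / total


# Hypothetical original file and its treated version:
# every age has been recoded into a bracket and one district has been suppressed.
original = [
    {"age": 19, "district": "North"},
    {"age": 23, "district": "North"},
    {"age": 37, "district": "South"},
    {"age": 70, "district": "East"},
]
treated = [
    {"age": "18 - 25", "district": "North"},
    {"age": "18 - 25", "district": "North"},
    {"age": "26 - 40", "district": "South"},
    {"age": "65 - 120", "district": None},
]

loss = share_of_cells_changed(original, treated, ("age", "district"))
print(f"share of cells changed: {loss:.1%}")
```

A measure like this treats a recoded age and a suppressed district as equally costly, which is rarely true in practice, so it is best read as a rough first signal rather than a full assessment of data utility.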
Sharing Data Via HDX Connect: Finally, if we determine that data cannot be shared publicly, we provide the option to share only the metadata and make the dataset available ‘by request’. This option allows data contributors to control whether and how to share their data.
Limitations of sdcMicro & Other Open-Source SDC tools
sdcMicro is a great place to start for humanitarian organisations new to statistical disclosure control. It is an add-on package for R: once you have downloaded and installed it, you are ready to get started. If you are new to R, there will certainly be a learning curve, but for existing R users it should be quite small. sdcMicro does have a few limitations. For example, there are only three statistical methods available for categorical variables, which are variables that take values over a finite set (e.g., gender). Categorical variables also happen to be more prevalent than continuous variables in humanitarian microdata. Furthermore, the process can take quite a long time for large microdata files. Finally, the trade-off between lowering the disclosure risk and limiting information loss can be tricky to manage for microdata that has a high risk of disclosure.
Different research institutions and statistical offices have developed generic or specifically tailored SDC tools and made them openly available to the public. To find alternatives that mitigate the shortcomings identified with sdcMicro, we explored the ARX Data Anonymization Tool developed by the Technical University of Munich and the μ-ARGUS tool developed by Statistics Netherlands.
To compare the effectiveness of these tools in terms of computation time and scalability, we selected five microdata files based on their size and complexity and assessed their disclosure risks before and after SDC. The SDC process in ARX is utility-focused. In μ-ARGUS and sdcMicro, since there is no advanced feature for assessing the risk-utility trade-off, more human expertise and effort are required. Unlike in sdcMicro and μ-ARGUS, key variables can be detected automatically in ARX: it provides a method for detecting attributes that must be modified according to the Safe Harbor method of the US Health Insurance Portability and Accountability Act (HIPAA identifiers). Compared to μ-ARGUS and sdcMicro, the risk assessment in ARX also takes less computation time for large microdata files.
For the selected microdata in this test, μ-ARGUS ran fairly quickly as it only required the key variables in the input file. And since it provides an automatic way of combining key variables based on their identification level, it made it simpler to work with varied combinations. However, there is a limitation in assessing disclosure risks when the number of key variables is greater than 10.
Glossary of Terms
Demographically Identifiable Information (DII): Either individual and/or aggregated data points that allow inferences to be drawn that enable the classification, identification, and/or tracking of both named and/or unnamed individuals, groups of individuals, and/or multiple groups of individuals according to ethnicity, economic class, religion, gender, age, health condition, location, occupation, and/or other demographically defining factors.
Disclosure Risk/Re-Identification Risk: Disclosure occurs if an unacceptably narrow estimation of a respondent’s confidential information is possible or if exact disclosure is possible with a high level of confidence. Disclosure risk also refers to the probability that successful disclosure could occur.
Key Variables: Also called “quasi-identifiers”, key variables are a set of variables that, in combination, can be linked to external information to re-identify respondents in the released dataset.
Personally Identifiable Information (PII): Also called “direct identifiers”, PII are variables that reveal directly and unambiguously the identity of a respondent, e.g., names, social identity numbers.
Statistical Disclosure Control (SDC): Statistical Disclosure Control techniques are a set of methods to reduce the risk of disclosing information on individuals, businesses or other organisations.
Acknowledgement: The Centre’s work to develop an improved technical infrastructure for the management of sensitive data on HDX was made possible with support from the Directorate-General for European Civil Protection and Humanitarian Aid Operations (DG ECHO). Development of this technical documentation was supported through the United Kingdom Foreign, Commonwealth and Development Office (FCDO)’s COVIDAction programme.