By: Nafissatou Pouye and Metasabya Sahlu

Humanitarian organizations regularly collect individual survey data to assess people’s needs and respond to crises. Such datasets, called microdata, are shared on HDX only after removing Personally Identifiable Information (PII) in accordance with the HDX Terms of Service. Nevertheless, it is possible to make either a narrow estimation of survey respondents’ confidential information or an exact disclosure by combining what are called ‘key variables’ in the microdata. As this causes ‘disclosure risk’ or ‘re-identification risk’, it is a major concern when working with humanitarian data.

Not all organizations are aware that even if PII (e.g., names and contact information) and Demographically Identifiable Information (e.g., GPS coordinates) are removed from individual survey data, a combination of key variables, such as age, marital status, and location, could point to a specific individual (e.g., a 14-year-old widow in a given camp). Over several months, the Centre team has checked all individual survey data shared on HDX. We have also reached out to organizations to inform them of the potential risk their microdata may present.

Household surveys, needs assessments and other forms of microdata make up an increasingly significant volume of data in the humanitarian sector. This type of data is critical to determining the needs and perspectives of people affected by crises. Sharing microdata, whether with colleagues, partner organisations or publicly, can improve the effectiveness and efficiency of a response, but it also presents unique risks.

HDX does not allow data that includes personally identifiable information (PII) to be shared publicly through the platform (read more in our HDX Terms of Service). However, with microdata it can be possible to infer confidential information, or even to re-identify an individual, after the personally identifiable information has been removed. This is done by combining what are known as ‘key variables’ and matching these combinations with information in an external file or other available information sources. The risk that individuals could be re-identified, or that other confidential information could be inferred, is called the disclosure risk.
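To make the idea of key variables concrete, here is a minimal Python sketch that counts how many records share each combination of key variables and lists the records that are unique on them. The records, the variable names, and the helper functions are all hypothetical and only for illustration; this is not the sdcMicro implementation.

```python
from collections import Counter

# Hypothetical survey records with direct identifiers already removed.
records = [
    {"age": 14, "marital_status": "widowed", "camp": "A"},
    {"age": 32, "marital_status": "married", "camp": "A"},
    {"age": 32, "marital_status": "married", "camp": "A"},
    {"age": 45, "marital_status": "single", "camp": "B"},
    {"age": 45, "marital_status": "single", "camp": "B"},
]

# Assumed key variables for this illustration.
KEY_VARIABLES = ("age", "marital_status", "camp")


def key_of(record, key_variables):
    """Return the combination of key-variable values for one record."""
    return tuple(record[v] for v in key_variables)


def sample_uniques(records, key_variables):
    """Return the records whose key-variable combination occurs only once."""
    counts = Counter(key_of(r, key_variables) for r in records)
    return [r for r in records if counts[key_of(r, key_variables)] == 1]


uniques = sample_uniques(records, KEY_VARIABLES)
print(uniques)
```

A record that is unique on its key variables, like the 14-year-old widow in this toy data, is the kind of record that could be matched against an external source even though no name or contact detail is present.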

Over the past year, our team at HDX have taken a number of steps to ensure that microdata can be shared on the platform in a way that limits the risk to vulnerable populations and communities. In this post, we will describe the steps that we have taken to provide the opportunity for organisations to share this valuable data on HDX in a responsible way.

Our approach to handling microdata

...

The statistical disclosure control (SDC) process in sdcMicro is divided into three steps:

  1. Perform a disclosure risk assessment by identifying the key variables.

  2. Apply SDC methods to reduce the risk of disclosing information on individuals.

  3. Re-measure the risk and quantify the information loss. 
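The three steps above can be sketched in plain Python on a toy dataset. This is only an illustration under simplifying assumptions: risk is approximated by the number of records that are unique on the key variables, the SDC method is a simple recoding of age into brackets (the brackets other than ‘18-25’ are hypothetical), and information loss is approximated by the share of distinct age values that disappear. sdcMicro offers more refined measures for each step.

```python
from collections import Counter


def count_uniques(records, key_variables):
    """Step 1 and 3 proxy: number of records unique on the key variables."""
    keys = [tuple(r[v] for v in key_variables) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1)


def recode_age(age):
    """Step 2: recode an exact age into a bracket (illustrative brackets)."""
    if age < 18:
        return "0-17"
    if age <= 25:
        return "18-25"
    if age <= 59:
        return "26-59"
    return "60+"


# Hypothetical records; 'age' and 'sex' are the assumed key variables.
original = [
    {"age": 19, "sex": "F"},
    {"age": 22, "sex": "F"},
    {"age": 24, "sex": "F"},
    {"age": 31, "sex": "M"},
    {"age": 47, "sex": "M"},
    {"age": 70, "sex": "M"},
]
keys = ("age", "sex")

# Step 1: assess risk on the original data.
risk_before = count_uniques(original, keys)

# Step 2: apply the recoding.
treated = [{**r, "age": recode_age(r["age"])} for r in original]

# Step 3: re-measure risk and quantify information loss (crude proxy).
risk_after = count_uniques(treated, keys)
levels_before = len({r["age"] for r in original})
levels_after = len({r["age"] for r in treated})
information_loss = 1 - levels_after / levels_before

print(risk_before, risk_after, information_loss)
```

In this toy example, recoding reduces the number of unique records but also removes half of the distinct age values, which is the trade-off that has to be weighed for each dataset.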

...

The sdcMicro tool is a useful start, but we have observed its limitations. There are only three SDC methods available for categorical variables, which are variables that take values over a finite set (e.g., gender). Also, the process can take longer for large microdata. The risk-utility trade-off between lowering the disclosure risk and limiting information loss is tricky to handle for microdata that has a high risk of disclosure. 

  1. New Microdata is Added to the Platform: When a user uploads a resource on the HDX platform, they are asked to indicate if the resource contains microdata. We also manually verify whether a resource contains microdata as part of the standard quality assurance process that we perform on all new resources added to the platform. 

  2. Perform Quality Assurance Checks: Next, we perform quality assurance checks, which include assessing whether the dataset contains microdata or other potentially sensitive information. If we determine there is no sensitive information, the data will be made available on HDX. Otherwise, the dataset will remain ‘under review’ while we perform our disclosure risk assessment. 

  3. Assess Disclosure Risk: If we determine that there is potentially sensitive data in the file, the next step is the disclosure risk assessment, which is done using sdcMicro. Our team first develops disclosure scenarios, then selects key variables, and finally runs the disclosure risk assessment. If the dataset has a ‘global risk’ of less than three percent, it is deemed safe to share on HDX and will be taken out of review and made public on the platform. If the global risk is higher than three percent, we will talk to the contributor about how to proceed. It is important to note that there are a few different methods for quantifying disclosure risk; global risk is just one of them. While our threshold for sharing data on HDX is a global risk of under three percent, we also look at the individual risk scores to ensure that no individual has a particularly high risk of disclosure. You can learn more about these methods and details on all the steps of the process by following our Disclosure Risk Assessment Learning Path.

  4. Inform Contributor: If we determine that the global risk of re-identification is above our threshold we will contact the contributor to discuss how to reduce the risk. We may have

...

  1. recommendations for how they can use disclosure control techniques to reduce the disclosure risk, or, in some cases, may advise that they forgo this process to only make the data available ‘by request’. 

  2. Applying SDC: There are a number of perturbative and non-perturbative methods that can be used to reduce the risk of disclosure in data. With perturbative methods, the data value is altered in order to create uncertainty about what the true value is. With non-perturbative methods, on the other hand, the goal is to reduce the detail of the data by suppressing individual values or by combining values into intervals or brackets (e.g., age 19 → ‘18 - 25’). 

  3. Re-Assessing Risk and Quantifying Information Loss: Applying these methods will necessarily result in information loss. Through this process, our goal is to find a balance between limiting risk and maximising the utility of the data. Therefore, after the techniques have been applied and we are sure that we have successfully lowered the risk, we also assess the data utility of the treated data by quantifying the information loss. In some cases, if the steps we need to take to reduce the disclosure risk of the data result in too much information loss, we may advise that you share the original data ‘by request’ rather than share the treated data publicly. 

  4. Sharing Data Via HDX Connect: Finally, if we determine that data cannot be shared publicly, we provide the option to share only the metadata and make the dataset available ‘on request’. This option allows data contributors to control whether and how to share their data.
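The decision in the risk assessment step can be illustrated with a small Python sketch. It assumes a strongly simplified risk model: each record's individual risk is taken as one divided by the number of records sharing its key-variable combination, and the global risk is the average of those values. sdcMicro's own estimates can also account for sampling weights, so the numbers here are not comparable to its output. The three percent threshold comes from the process described above; the `max_individual_risk` cut-off and the function names are hypothetical.

```python
from collections import Counter

GLOBAL_RISK_THRESHOLD = 0.03  # three percent, as described in the post


def individual_risks(records, key_variables):
    """Simplified individual risk: 1 / (records sharing the same key)."""
    keys = [tuple(r[v] for v in key_variables) for r in records]
    counts = Counter(keys)
    return [1.0 / counts[k] for k in keys]


def global_risk(risks):
    """Simplified global risk: the average individual risk."""
    return sum(risks) / len(risks)


def triage(records, key_variables, max_individual_risk=0.5):
    """Return a suggested outcome and the global risk.

    max_individual_risk is a hypothetical cut-off used only to show
    why individual scores are checked alongside the global figure.
    """
    risks = individual_risks(records, key_variables)
    g = global_risk(risks)
    if g < GLOBAL_RISK_THRESHOLD and max(risks) <= max_individual_risk:
        return "public", g
    return "contact contributor", g


keys = ("age_group", "camp")

# 100 hypothetical records that all share one key combination.
common = [{"age_group": "26-59", "camp": "A"} for _ in range(100)]
status, g = triage(common, keys)

# The same data plus one record that is unique on its key variables.
with_outlier = common + [{"age_group": "0-17", "camp": "B"}]
status_outlier, g_outlier = triage(with_outlier, keys)

print(status, round(g, 4))
print(status_outlier, round(g_outlier, 4))
```

In the second case the global risk stays under three percent, yet one record is unique on its key variables. This is why the global figure alone is not enough and the individual risk scores are reviewed as well.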

Exploring other open-source tools

...