...

When new microdata is shared on HDX, the HDX team follows the process described in the workflow below.

...

    1. New Microdata is Added to the Platform: When a user uploads a resource on the HDX platform, they are asked to indicate if the resource contains microdata. We also manually verify whether a resource contains microdata as part of the standard quality assurance process that we perform on all new resources added to the platform. 

    2. Perform Quality Assurance Checks: Next, we perform quality assurance checks, which include assessing whether the dataset contains microdata or other potentially sensitive information. If we determine there is no sensitive information, the data will be made available on HDX. Otherwise, the dataset will remain ‘under review’ while we perform our disclosure risk assessment.

    3. Assess Disclosure Risk: If we determine that there is potentially sensitive data in the file, the next step is the disclosure risk assessment, which we perform using sdcMicro. Our team will first develop disclosure scenarios, then select key variables and finally run the disclosure risk assessment (a minimal R sketch of this step follows the list). If the dataset has a ‘global risk’ of less than three percent, it is deemed safe to share on HDX and will be taken out of review and made public on the platform. If the global risk is higher than three percent, we will talk to the contributor about how to proceed. It is important to note that there are a few different methods for quantifying disclosure risk; global risk is just one of them. While our threshold for sharing data on HDX is a global risk of under three percent, we also look at the individual risk scores to ensure that no individual has a particularly high risk of disclosure. You can learn more about these methods and details on all the steps of the process by following our Disclosure Risk Assessment Learning Path.

    4. Inform Contributor: If we determine that the global risk of re-identification is above our threshold, we will contact the contributor to discuss how to reduce it. We may recommend disclosure control techniques they can use to lower the risk or, in some cases, advise that they forgo this process and instead make the data available ‘by request’ only.

    5. Applying SDC: There are a number of perturbative and non-perturbative methods that can be used to reduce the risk of disclosure in data. Perturbative methods alter data values in order to create uncertainty about the true values. Non-perturbative methods, by contrast, reduce the detail of the data by suppressing individual values or combining values into intervals or brackets (e.g., age 19 → ‘18-25’); see the recoding and suppression sketch after this list.

    6. Re-Assessing Risk and Quantifying Information Loss: Applying these methods will necessarily result in information loss. Through this process, our goal is to find a balance between limiting risk and maximising the utility of the data. Therefore, after the techniques have been applied and we are confident that the risk has been lowered, we also assess the utility of the treated data by quantifying the information loss (see the final sketch after this list). In some cases, if the steps needed to reduce the disclosure risk result in too much information loss, we may advise that the contributor share the original data ‘by request’ rather than share the treated data publicly.

    7. Sharing Data Via HDX Connect: Finally, if we determine that data cannot be shared publicly, we provide the option to share only the metadata and make the dataset available ‘on request’. This option allows data contributors to control whether and how to share their data.
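
    Step 3 in practice: the following is a minimal R sketch of a disclosure risk assessment in sdcMicro. It is illustrative only, not the HDX team's actual script; the file name survey_microdata.csv and the key variable names are hypothetical and would in practice come from the disclosure scenarios developed for the dataset at hand.

        library(sdcMicro)

        # Hypothetical input file and variable names, for illustration only
        survey <- read.csv("survey_microdata.csv")

        # Define the SDC problem: the categorical key variables are those an
        # intruder could plausibly know (chosen from the disclosure scenarios)
        sdc <- createSdcObj(
          dat     = survey,
          keyVars = c("gender", "age", "region", "marital_status")
        )

        # Summary of the disclosure risk measures
        print(sdc, type = "risk")

        # Global risk as a percentage, to compare against the 3% threshold
        # (the 'risk' element is the expected proportion of re-identifications)
        100 * sdc@risk$global$risk

        # Individual risk scores, to check that no record is unusually exposed
        summary(sdc@risk$individual[, "risk"])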
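
    Step 5 in practice: reusing the sdc object from the step 3 sketch, the two non-perturbative treatments described above, global recoding into brackets and local suppression, look roughly like this. The age breaks, the labels and the value of k are illustrative choices, not HDX's standard parameters.

        # Global recoding: collapse the detailed age variable into brackets,
        # e.g. an age of 19 becomes the '18-25' category
        sdc <- globalRecode(
          sdc,
          column = "age",
          breaks = c(0, 17, 25, 40, 60, 120),
          labels = c("0-17", "18-25", "26-40", "41-60", "61+")
        )

        # Local suppression: blank out as few values as possible in the key
        # variables until every key-variable combination occurs at least k times
        sdc <- localSuppression(sdc, k = 3)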
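
    Step 6 in practice: after treatment, the risk measures stored in the sdc object are recalculated, so the same checks as in step 3 can be repeated, and the extent of the suppression gives a simple, illustrative handle on information loss (sdcMicro also provides dedicated utility measures, particularly for continuous variables).

        # Re-assess the global risk of the treated data against the 3% threshold
        print(sdc, type = "risk")
        100 * sdc@risk$global$risk

        # Information loss from local suppression: how many values were blanked
        # out in each key variable
        print(sdc, type = "ls")

        # A simple overall measure: share of key-variable cells suppressed
        # (assumes the original key variables had no missing values)
        sum(is.na(sdc@manipKeyVars)) / prod(dim(sdc@manipKeyVars))

        # If the risk-utility balance is acceptable, extract the treated data
        treated <- extractManipData(sdc)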

    ...

    Limitations of sdcMicro & Other Open-Source SDC Tools

    sdcMicro is a great place to start for humanitarian organisations new to statistical disclosure control. It is an add-on package in R: simply download and install it and you are ready to get started (see the snippet below). If you are new to R, there will certainly be a learning curve, but for existing R users it should be quite small. sdcMicro does have a few limitations. For example, there are only three SDC methods available for categorical variables, which are variables that take values over a finite set (e.g., gender); categorical variables also happen to be more prevalent than continuous variables in humanitarian microdata. Furthermore, the process can take quite a long time for large microdata files. Finally, the risk-utility trade-off between lowering the disclosure risk and limiting information loss can be tricky to manage for microdata that has a high risk of disclosure.
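
    For example, getting set up (assuming R itself is already installed) takes only two commands:

        install.packages("sdcMicro")  # one-time install from CRAN
        library(sdcMicro)             # load the package for the current session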

    Different research institutions and statistical offices have developed generic or specifically tailored SDC tools and made them openly available to the public. To mitigate the shortcomings identified with sdcMicro, we explored the ARX Data Anonymization Tool developed by the Technical University of Munich and the μ-ARGUS tool developed by Statistics Netherlands.

    ...