We recently reviewed how non-identifiable data, such as synthetic data, is regulated in Canadian privacy legislation and interviewed 13 of the 14 provincial, territorial and federal privacy regulators to get their perspectives on the topic. Informed by the results of that study, we examine three specific questions: (a) is data with residual reidentification risk still personal information, (b) who sets reidentification risk-management standards, and (c) should non-identifiable data be regulated to mitigate against inappropriate uses of data and models?

There are multiple privacy-enhancing technologies (PETs) that can be used to generate non-identifiable data, such as synthetic data generation and differential privacy, or to process information in a non-identifiable manner, such as federated analysis. It is important to emphasize that none of these ensures the risk of reidentification is zero; there will always be some residual risk.
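To make the point concrete, below is a minimal sketch of one such PET, differential privacy, using the standard Laplace mechanism (the function name, dataset and parameter values are illustrative assumptions, not drawn from any specific system). For any finite privacy budget epsilon, the released value still carries a small, bounded amount of information about individuals: the residual risk is reduced, never eliminated.

```python
import numpy as np

def dp_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a count with Laplace noise calibrated to a sensitivity of 1."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)  # smaller epsilon = more noise
    return true_count + noise

# Hypothetical example: releasing how many patients have a given diagnosis.
print(f"Noisy count: {dp_count(true_count=42, epsilon=1.0):.1f}")
```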

One perspective on regulating non-identifiable data is to say that because some residual risk remains, the information is still personal information and should be regulated as such, with all of the commensurate obligations. This approach treats any residual risk as unacceptable and follows what is called the “precautionary principle,” which advocates that if the risk is not zero, then an intervention is needed to eliminate the potential harm.

Achieving zero risk is simply not a standard that can be met in practice. Data with zero reidentification risk would have very limited utility, and there would be no incentive for a data custodian to create such data. Furthermore, there are always future unknowns that can increase the risk of reidentification, for example, the future availability of public datasets that can be used to perform linking attacks on currently non-identifiable datasets. These future unknowns mean that there is always a residual risk.
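As a toy illustration of such a linking attack (the data, column names and values below are entirely hypothetical), joining a de-identified dataset to an auxiliary public dataset on shared quasi-identifiers can re-attach identities to records that looked non-identifiable on their own:

```python
import pandas as pd

# De-identified health records: no direct identifiers, only quasi-identifiers.
deidentified = pd.DataFrame({
    "zip3": ["K1A", "M5V", "V6B"],
    "year_of_birth": [1980, 1975, 1990],
    "sex": ["F", "M", "F"],
    "diagnosis": ["asthma", "diabetes", "asthma"],
})

# A public dataset released later (e.g., a voter or professional registry).
public_registry = pd.DataFrame({
    "name": ["A. Smith", "B. Jones"],
    "zip3": ["K1A", "V6B"],
    "year_of_birth": [1980, 1990],
    "sex": ["F", "F"],
})

# Linking on the shared quasi-identifiers re-attaches names to diagnoses.
linked = deidentified.merge(public_registry, on=["zip3", "year_of_birth", "sex"])
print(linked[["name", "diagnosis"]])
```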

However, we often make data protection decisions in the face of future unknowns. In the moment, we use the best available data protection solutions, knowing that these tools will not necessarily be the best ones in the future; otherwise, we would approach a state of paralysis because of unknown future risks. For example, we encrypt datasets today with the expectation that, at some point in the future, quantum computing may make it possible to break the encryption that was used. But no one would argue that data should not be encrypted because a future state represents potentially elevated risk.

Given that the residual risk cannot be zero, can we assess it and decide when it is low enough?

There are reasonable models for quantifying privacy risks, such as reidentification risk, for different types of PETs. These models rest on assumptions, and it is important that those assumptions can be justified by:

  • Ensuring the assumptions are grounded in actual published reidentification attacks. Arguably, the published attacks represent the state of the art in this area and therefore provide a good reference point.
  • Taking into account the types of auxiliary information available today, which would also be available to an anticipated data recipient.
  • Ensuring the assumptions are informed by ongoing motivated intruder attacks.

Strong precedents exist for what are deemed to be acceptable risk thresholds. These precedents reflect what different organizations, courts and regulators have considered reasonable quantitative values to determine when data becomes non-identifiable. Additionally, in practice, multiple layers of protection are implemented by adding security, privacy and contractual controls when processing non-identifiable, non-public datasets.
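As a minimal sketch of how such a quantitative assessment can work in practice (the dataset, quasi-identifiers and threshold below are illustrative assumptions, not prescribed values), one common approach is to group records by their quasi-identifiers, take the maximum record-level reidentification probability as one over the group size, and compare it against a pre-specified threshold; a maximum risk of roughly 0.09, corresponding to a minimum group size of 11, is one frequently cited precedent.

```python
import pandas as pd

def max_reid_risk(df: pd.DataFrame, quasi_identifiers: list) -> float:
    """Maximum record-level reidentification probability: 1 / smallest group size."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return float((1.0 / group_sizes).max())

THRESHOLD = 0.09  # illustrative; the appropriate value depends on context and controls

# Hypothetical de-identified dataset with three quasi-identifiers.
df = pd.DataFrame({
    "zip3": ["K1A", "K1A", "M5V", "M5V", "M5V"],
    "year_of_birth": [1980, 1980, 1975, 1975, 1975],
    "sex": ["F", "F", "M", "M", "M"],
})

risk = max_reid_risk(df, ["zip3", "year_of_birth", "sex"])
print(f"Maximum risk {risk:.2f} vs. threshold {THRESHOLD}: "
      f"{'acceptable' if risk <= THRESHOLD else 'not acceptable'}")
```

In practice, such an estimate would be weighed alongside the security, privacy and contractual controls noted above before concluding that a dataset is non-identifiable.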

An alternate version of the precautionary principle described in this publication is: “if a proposed change has a possible risk of causing harm to people or to the environment, the burden of proving that it is safe (or very unlikely to cause harm) falls on those proposing the change.” In this case, the burden falls on the data custodian to demonstrate that the residual risk is acceptably small.

Placing the burden on data custodians favors larger organizations, which have the resources to acquire and develop the expertise to perform sophisticated risk analyses. Smaller, resource-constrained organizations can benefit from generally accepted codes of practice and standards that make rigorous risk assessment scale down to their level. There was a strong consensus among the regulators we interviewed that codes of practice are important to provide certainty and guidance for organizations about how to assess and manage identifiability risks. This would be beneficial for both the data custodians and the regulators.

The third question concerns the extent to which we need regulations to mitigate the unethical use of non-identifiable data. That is indeed a valid concern: statistical and artificial intelligence models, for example, can be developed with non-identifiable data, and these models can then be used in harmful, discriminatory, or creepy decision-making. In such cases, establishing guardrails around decisions made from data and models would be desirable. The question is: what are the appropriate guardrails?

One approach requires some form of ethics review whenever data, and models built from these data (including non-identifiable data), are used in decision-making. This would allow a case-by-case analysis of the potential harms by individuals who can apply contemporary cultural norms. Another approach is to limit the processing of non-identifiable data to specific purposes and entities. However, the concern here is that unless these are defined broadly, they may be out of date out of the gate or result in unintended consequences (see “Could privacy law reform accelerate medical AI innovations?”).

During our interviews with regulators, there was broad, though not unanimous, alignment on many of the above points. For organizations operating at a national level, it is helpful to have consistency in the interpretation of such basic concepts. The societal and economic opportunities and benefits lost by not being able to use and disclose data for secondary analysis, such as medical AI, are simply too great.
