There is considerable activity today in regulatory development and updates around the privacy rights of individuals in Canada, Europe, the U.S. and elsewhere. This includes guidelines, standards, and regulator orders and opinions. Many of these efforts will need to (re)define what non-identifiable data is and how its development, use and disclosure should be regulated so the great many societal benefits of using data can be realized while still protecting privacy. Here are some key considerations for regulating the generation and processing of non-identifiable data, in both principle and practice. You can read a more in-depth analysis here. Ultimately, this will be up to legislators and regulators to determine.

Recommendations

We first present three relevant principles to consider when regulating non-identifiable data. These principles should guide which practices to prescribe and help ensure undesirable side effects don’t emerge. They are followed by specific practices that have sometimes proven controversial because, in some cases, they have been defined in ways that go against the principles.

1. Reduce uncertainty

Regulations should, in principle, provide answers to the difficult questions, including those raised in many of the points below. The more precise the answers, the better. Precision gives organizations certainty about the rules to follow. Leaving aspects of the non-identifiable data process undefined creates uncertainty, which will often result in no action being taken at all.

2. Create incentives

Regulatory regimes shouldn’t put in place disincentives for good or desirable behaviors. Generating non-identifiable data can be considered a privacy-enhancing technology. However, if meeting regulatory requirements would result in non-identifiable data of very low utility, or if the requirements leave obvious cheaper and faster ways to achieve the same objective, that would act as a disincentive to generating non-identifiable data.

3. Recognize and calibrate the broad benefits of non-identifiable data

Many industries generate, use and disclose non-identifiable data. While the public discourse around uses of personal data focuses on marketing and advertising, there are many other uses, such as health research. Citizens will derive different benefits from each of these uses, so they shouldn’t all be treated the same. Is the benefit of singling out an individual to deliver an ad for a new truck the same as the benefit of singling out a cancer patient to recommend a potentially life-saving clinical trial? We suggest these benefits be considered as part of defining what is deemed acceptable non-identifiable information.

4. Enable the creation of non-identifiable data without consent

A key question is whether individual consent is required for the creation of non-identifiable information. Some statutes explicitly say the creation of non-identifiable information is a permitted use and therefore additional consent isn’t needed. Other statutes are silent on this. Some privacy organizations and advocates have made the case for consent under these circumstances. But consider this: if an organization needs to obtain consent for creating non-identifiable data, it might as well obtain consent for processing the identifiable data, which is another disincentive.

5. Clarify whether destroying original (identifiable) data is necessary

Data is rendered non-identifiable by applying transformations to the original identifiable data. Some guidelines and opinions over the last decade have argued that if a copy of the identifiable data still exists, a dataset cannot be claimed to be non-identifiable. It’s not always clear whether this means identifiable data existing within the same organization or anywhere at all, even outside the organization; the latter would be a much more conservative definition of identifiability. If the original data exists in either sense, does it then need to be destroyed for the transformed data to be legitimately claimed as non-identifiable?

If so, this would be quite challenging to operationalize in practice without undesirable side effects. A lesser requirement would be, for example, that the same individuals cannot process both the identifiable and the non-identifiable information. This creates a concrete separation between those processing identifiable data and those processing non-identifiable data, managing the risk in a reasonable manner while still permitting reasonable processing of non-identifiable data.

6. Risks should be assessed for an anticipated adversary

An individual or entity can potentially launch a reidentification attack on a non-identifiable dataset. That being said, when performing a reidentification risk assessment, the background knowledge of the adversary is an important consideration.

Some guidelines specify that reidentification risk should be low against an “anticipated” adversary. Others say the risk should be low against any adversary. In our experience, the risk levels are very different between these two. The latter is a much higher standard and treats all data as if it’s being disclosed publicly. This is another example of an approach that could have a negative impact on data utility while providing protections against adversaries who are very unlikely to get access to the non-identifiable data.
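To make the difference concrete, here is a minimal sketch in Python. The dataset, quasi-identifiers and metric choices are illustrative assumptions, not a prescribed method; it contrasts maximum risk, the standard typically applied when any adversary could receive the data, with average risk, which is often used when only an anticipated adversary with controlled access is considered.

```python
from collections import Counter

# Toy dataset: each record is the quasi-identifiers an adversary could
# plausibly know about a person, e.g., (age band, sex, region).
records = [
    ("30-39", "F", "East"), ("30-39", "F", "East"),
    ("30-39", "M", "East"), ("40-49", "F", "West"),
    ("40-49", "F", "West"), ("40-49", "F", "West"),
    ("50-59", "M", "West"),
]

# Records identical on the quasi-identifiers form an equivalence class;
# the adversary cannot tell members of a class apart.
class_sizes = Counter(records).values()

# Maximum risk: the chance of reidentifying the single most exposed
# record (1 / smallest class size). The usual standard for public
# releases, where any adversary might attack the data.
max_risk = max(1 / size for size in class_sizes)

# Average risk: the expected chance of reidentifying a randomly chosen
# record. Each record in a class of size f carries risk 1/f, so the
# average works out to (number of classes) / (number of records). Often
# used for non-public releases to an anticipated adversary.
avg_risk = len(class_sizes) / len(records)

print(f"maximum risk: {max_risk:.2f}")  # 1.00 (there are unique records)
print(f"average risk: {avg_risk:.2f}")  # 0.57
```

On the same dataset the two measures can differ substantially, which is why the choice of adversary model has such a large effect on how much the data must be transformed.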

7. Define acceptable thresholds

An important part of the process of creating non-identifiable data is deciding on an acceptable risk threshold. When the measured risk in the data is below the threshold, the data can be considered non-identifiable.

There are many precedents for what is deemed an acceptable threshold. Precise thresholds increase certainty and make it easier to implement methods for creating non-identifiable data. An agreed threshold is a key part of the overall process and of the confidence that authorities will deem the information non-identifiable.
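Once a threshold is agreed, the check itself is simple. The sketch below is illustrative only; the specific values are assumptions loosely modeled on minimum cell-size rules some data custodians have used, not prescribed standards.

```python
# Illustrative thresholds only; real values are set by regulators and
# precedent (minimum cell-size rules of 5, 11 or 20 records translate
# to risk thresholds of 1/5, 1/11 and 1/20 respectively).
THRESHOLDS = {
    "public": 1 / 20,       # strictest: anyone may access the data
    "semi-public": 1 / 11,
    "non-public": 1 / 5,    # strong contractual and security controls
}

def is_non_identifiable(measured_risk: float, release: str) -> bool:
    """The data can be considered non-identifiable for a given release
    context when its measured risk falls below the agreed threshold."""
    return measured_risk < THRESHOLDS[release]

print(is_non_identifiable(0.07, "public"))      # False: 0.07 >= 1/20
print(is_non_identifiable(0.07, "non-public"))  # True:  0.07 <  1/5
```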

8. Require ethics review rather than regulate specific uses of non-identifiable data

Information, whether identifiable or non-identifiable, can be used for good or inappropriate purposes. A machine learning model can be constructed from a dataset and used to make decisions that are beneficial or discriminatory to individuals. The appropriateness of the purposes of data processing is a separate issue from the identifiability status of the data.

Setting up an ethics review process is a way to manage the risk of inappropriate uses of non-identifiable data. This review process would be staffed appropriately and would account for contemporary cultural norms to judge whether a particular data use is appropriate or not.

9. The data processing context should be considered

Common methods for assessing the risk of reidentification involve accounting for the context, such as the security, privacy and contractual controls in place to process the non-identifiable data. If the controls are high, the overall risk is reduced.

That controls are important for managing residual risk is clear. Without them, the data perturbations and transformations required would be so extensive that data utility would diminish, even more so for complex datasets. There needs to be an allowance for using controls to manage residual risk, but the credit given to controls must also be throttled so the overall risk management model retains its credibility.
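One common way to formalize this, sketched below with illustrative control scores, floor and formula rather than a standard model, is to treat overall risk as the probability that an attempt is made, which the controls reduce, multiplied by the probability of reidentification given an attempt, which is measured from the data. A floor on the context term provides the throttling: no matter how strong the controls, they can never drive the overall risk to zero on their own.

```python
# Sketch of context-adjusted reidentification risk. Control scores are
# in [0, 1] (1 = very strong); the values and the floor are illustrative.

def pr_attempt(security: float, privacy: float, contractual: float,
               floor: float = 0.05) -> float:
    """Probability that an adversary attempts reidentification.
    Stronger controls lower it, but the floor throttles the credit
    so the context alone can never claim the risk is zero."""
    raw = (1 - security) * (1 - privacy) * (1 - contractual)
    return max(raw, floor)

# Probability of reidentification given an attempt, measured from the
# data itself (e.g., the average risk of 0.57 computed earlier).
pr_reid_given_attempt = 0.57

# Strong controls reduce overall risk, but only down to the floor.
overall_risk = pr_attempt(0.9, 0.8, 0.9) * pr_reid_given_attempt
print(f"overall risk: {overall_risk:.4f}")  # 0.05 * 0.57 = 0.0285
```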

10. Define the consequences of reidentification attacks

Reidentification attacks on non-identifiable datasets occur on a regular basis. Many are performed by academics and the media. This doesn’t mean others aren’t performing reidentification attacks — just that we don’t necessarily hear about them.

It’s reasonable that reidentification attacks performed without the approval of the data controller or custodian be considered an offense. We think an exception for research purposes (e.g., white-hat attacks) is reasonable, but these should be performed with the approval of a research ethics board. It isn’t clear that all published reidentification attacks have gone through such a review. As with all research on data that is, or could become, identifiable, ethical oversight is important.
