In discussions about privacy and big data and the secondary uses of data, I have observed that a number of distinct concepts are oftentimes confused or treated as if they mean the same thing. Sometimes all of these concepts are lumped together under the "de-identification" or "anonymity" umbrella, when they actually refer to quite different things. This confusion does not help us, as a privacy community, come up with policies and solutions that would allow responsible uses of data and still achieve the societal and business benefits of big data analysis.
I want to examine some of these concepts and argue that they are all orthogonal to each other and therefore must be treated separately. I will call them the dimensions of privacy in big data. To be clear, these are not the only relevant dimensions of privacy. My intention here is only to focus on the ones where more precise definitions and disentanglement would be beneficial.
Four Dimensions of Privacy
The assumption is that we are referring to a dataset with individual level information. The information can be demographics or transactions (e.g., financial transactions or insurance claims). We are also talking about uses and disclosures for secondary purposes rather than the primary purpose for which the data was collected.
Linkability

Linkability is when it is possible to link together all of the events or records that belong to the same data subject. For example, in a health insurance claims database, if all of the records that belong to the same patient carry the same pseudonymized ID, then it is possible to link these records together and construct a longitudinal record of that individual. The individual may or may not be identifiable - identifiability is orthogonal to linkability. Some experts have put linkability under the umbrella of anonymization, as the Article 29 Working Party did in its opinion on anonymization techniques.
By treating linkability as part of anonymity, the Working Party essentially prohibits, or at least discourages, longitudinal records that are de-identified. This approach may have some justification in the context of open data or public data releases. However, a blanket prohibition on de-identified longitudinal data would be extremely detrimental to a significant amount of health research, as well as to analytics in other domains such as financial services, marketing, insurance, and education, to name a few.
Furthermore, it is possible to de-identify longitudinal data. Put another way, de-identified data can be cross-sectional or longitudinal. Linkability makes de-identification more complex, but it is achievable in most cases.
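To make linkability concrete, here is a minimal Python sketch of how a stable pseudonymized ID lets individual claims records be assembled into a longitudinal history without anyone being identified. All field names, pseudonyms, and record contents are hypothetical illustrations:

```python
from collections import defaultdict

# Hypothetical claims records: each row carries a pseudonymized patient ID.
# The IDs identify no one, but a stable pseudonym makes the records linkable.
claims = [
    {"pseudo_id": "a91f", "date": "2023-01-10", "service": "lab test"},
    {"pseudo_id": "c007", "date": "2023-02-02", "service": "x-ray"},
    {"pseudo_id": "a91f", "date": "2023-03-15", "service": "follow-up"},
]

# Group by pseudonym to reconstruct a longitudinal record per data subject.
longitudinal = defaultdict(list)
for record in claims:
    longitudinal[record["pseudo_id"]].append(record)

# Patient "a91f" now has a two-event history, yet remains non-identifiable.
print(len(longitudinal["a91f"]))  # 2
```

If the pseudonym were regenerated for every record instead of kept stable, the same data would become non-linkable, which is exactly the distinction at issue.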
Addressability

This is when you have a pseudonym that can be used to target or address a specific individual (not necessarily an "identifiable individual"). For instance, a pseudonym can be used directly or indirectly to target advertisements to a specific individual or an individual’s device. I may not know the identity of the individual, but I can address that individual anonymously. For example, an advertiser could send the pseudonym and the advertisement to an ISP, which then links the pseudonym to a specific device ID and sends that advertisement to that device. The ISP already knows the identity of the consumer, and the advertiser never gets to know the identity of the consumer. In that case the pseudonym is addressable but not identifiable.
Alternatively, it may be possible to determine the identity of that individual, for example, if the advertiser is the ISP itself. Again, addressability is orthogonal to identifiability. I may address someone electronically but have no capability to determine their identity, or I may be able to determine their identity and address them directly because I know their identity.
The key here is whether the pseudonym can be used to address or target a "specific individual" or an "identifiable individual".
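The advertiser-to-ISP flow described above can be sketched as follows. The mapping, pseudonym, and ad identifiers are all made up for illustration; the point is that only the ISP side holds the pseudonym-to-device mapping:

```python
# Hypothetical ad-targeting flow: the advertiser knows only a pseudonym;
# the ISP holds the pseudonym-to-device mapping and delivers the ad.
isp_mapping = {"p-5521": "device-8842"}  # known only to the ISP

def isp_deliver(pseudonym: str, ad: str) -> str:
    """The ISP resolves the pseudonym and forwards the ad to the device."""
    device = isp_mapping[pseudonym]
    return f"{ad} -> {device}"

# The advertiser addresses a specific individual without ever learning
# who they are: the pseudonym is addressable but not identifiable.
receipt = isp_deliver("p-5521", "ad-123")
print(receipt)  # ad-123 -> device-8842
```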
Identifiability

This is when we are able to correctly assign an event or record to an identifiable or known individual with a high probability. This is the traditional definition of identifiability. De-identification standards that exist today typically address only this specific issue of protecting against identity disclosure.
One can have a non-identifiable, non-addressable, and non-linkable pseudonym. Or a non-identifiable, non-linkable, but addressable pseudonym - when the pseudonym is regenerated repeatedly, but each instance of the pseudonym can still be used to address an individual. In fact, there are eight possible combinations across the three dimensions covered so far.
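Since each of the three dimensions is a yes/no property, the eight combinations can be enumerated mechanically. The sketch below just makes that counting explicit:

```python
from itertools import product

# Three binary dimensions yield 2**3 = 8 possible combinations.
dimensions = ["identifiable", "addressable", "linkable"]
combinations = list(product([False, True], repeat=len(dimensions)))
print(len(combinations))  # 8

# e.g. the non-identifiable, non-linkable, but addressable pseudonym
# described above corresponds to one of the eight:
example = dict(zip(dimensions, (False, True, False)))
```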
Inferences

The ability to draw inferences is important for any kind of data analysis. If we wish to use data, then we need to allow inferences. But first, let us define the concept of inferences more precisely.
Data analysis means building some kind of model. The model can be simple, such as being only descriptive - for example, “80 percent of our customers are female” - or more complex using statistical and machine learning methods to make predictions. A model can be built from identifiable data or de-identified data; from linkable data or non-linkable data; and from addressable data or non-addressable data. Therefore, model building is orthogonal to the three dimensions above.
Once a model is built, it can be used for good purposes or undesirable purposes, such as discrimination against certain groups of individuals. For example, a model that predicts the likelihood of getting cancer by a certain age can be used to introduce wellness programs in high risk communities or to deny bank loans to individuals deemed to have a high risk. Therefore, undesirable inferences are a function of the data uses rather than the models themselves. This means we need to introduce mechanisms to manage the risks from data uses.
Managing Privacy Risks
To manage the four privacy risks noted above, a different set of approaches is needed. A single approach is not going to ensure responsible data uses and disclosures. Below are proposals reflecting some of the current thinking about managing these types of risks. It should be noted that all four dimensions of privacy risk need to be managed adequately before a credible claim can be made that privacy risks are managed.
Risk Management Options

- Linkability: The disclosure of longitudinal data publicly (i.e., open data) should be limited because it is difficult to protect that kind of detailed information within the constraints of a public release. For non-public data releases, this kind of data can be adequately de-identified.
- Addressability: Unless there is express consent or at least some form of meaningful notice, mechanisms must ensure that addressability is performed anonymously. This can be achieved through controls on the workflows.
- Identifiability: Good practices for de-identification exist, as well as standards and certification programs, and these should be followed.
- Inferences: In general, ethics reviews should be performed on data uses. Some have argued that regulations should prohibit or limit certain types or classes of data uses. To what extent should data uses be compatible with the original intent of data collection, especially if the data is de-identified? Arguably, that compatibility link would be substantially weaker for de-identified data than for identifiable data; otherwise there are no incentives for implementing privacy-protective mechanisms such as de-identification.
Let's consider a health example where data is being used to identify patients suitable for clinical trials. Patient recruitment for trials is a significant problem in that it is difficult and expensive to find trial participants. Multiple solutions have been proposed to help identify patients that may meet trial recruitment criteria.
Recruitment criteria can be complex, including medical history and the history of drugs that a patient has been on. Let's say a fictitious company, TrialCo, gathers de-identified information from hospitals and payers, uses secure linking protocols to create a de-identified profile of patients and then matches that profile to trial recruitment criteria. TrialCo then sends a pseudonym back to the hospital for patients who are good candidates for trials. The hospital is able to match the pseudonym to a patient identity, and that patient's physician is notified. The physician then decides whether to approach that patient about participation in the trial.
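One plausible way to sketch such a workflow is with a keyed hash standing in for the secure linking protocol. Everything here (the shared key, record IDs, profile fields, and eligibility rule) is a hypothetical illustration, not TrialCo's actual method:

```python
import hashlib
import hmac

# Illustrative only: a keyed hash stands in for the secure linking protocol.
SHARED_KEY = b"per-study secret held by the linking parties"

def pseudonymize(patient_id: str) -> str:
    """Derive a stable pseudonym; without the key it cannot be reversed."""
    return hmac.new(SHARED_KEY, patient_id.encode(), hashlib.sha256).hexdigest()

# Hospital side: keeps the pseudonym -> identity map private.
hospital_index = {pseudonymize(pid): pid for pid in ["MRN-1001", "MRN-1002"]}

# TrialCo side: matches de-identified profiles against trial criteria and
# returns only the pseudonyms of eligible candidates, never identities.
profiles = {pseudonymize("MRN-1001"): {"age": 54, "prior_drug": "drug-x"}}
eligible = [p for p, prof in profiles.items() if prof["prior_drug"] == "drug-x"]

# Hospital resolves the pseudonym and notifies the treating physician.
print(hospital_index[eligible[0]])  # MRN-1001
```

The design point is that the mapping from pseudonym back to identity lives only at the hospital, so TrialCo can address eligible patients without ever being able to identify them.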
This workflow is beneficial for the patient, who may find out about trials that could improve their treatment or even save their life, and for the trial sponsor, since it accelerates recruitment, which eventually gets drugs to market faster.
Let's examine this workflow based on our risk management grid from above.
- Linkability: The data that is being analyzed by TrialCo is longitudinal. However, that data is not being shared publicly, and it is de-identified at every stage of the process.
- Addressability: TrialCo is able to anonymously address specific eligible patients through the hospital and the physician responsible for each patient. The company does not have access to the identity of the individual patients. Agreements with the hospital prohibit the hospital from revealing the identity associated with a pseudonym back to the company.
- Identifiability: Assuming that appropriate de-identification methods are used, the data that TrialCo works with would have a very small probability of identifying a patient.
- Inferences: As a matter of course, it is expected that this patient recruitment protocol would have gone through an ethics review, most likely by the hospital's IRB itself. If that is not the case, then it would be prudent for TrialCo to have the protocol reviewed by an external IRB, or by an internal one if such a committee is appropriately constituted.