Open data initiatives are increasing everywhere. These are often driven by the public sector aiming to promote data availability to trigger innovation and eventually better services. The basic concept is to make data available publicly with no constraints on who can access the data and what they can do with it. It is important to ensure that these open data initiatives do not release personal information.
Concurrent with open data are freedom of information laws which allow citizens to request access to government information. This is effectively the same as sharing data publicly because the requester can post the information they obtain on a public website, and many media organizations do just that. Unless that information is de-identified, potentially identifying information may be made broadly available.
A key mechanism to be able to share this kind of information publicly without revealing personal information is to de-identify it. Many organizations are unsure about how to de-identify data appropriately, and in some cases end up releasing personally identifiable information. In this post I will provide an outline of a de-identification protocol for preparing data for public release. I am assuming that the data is at the individual level (i.e., it has not been aggregated).
It should be noted that this process is intended to be conservative, but it does not guarantee zero risk. The process is consistent with best practices and is intended to ensure that the risks are very small.
Step 1: Classify Variables
The first step is to determine which variables are direct identifiers, which are indirect identifiers, and which ones we will not consider in the context of de-identification. This is typically performed by the data custodian. A direct identifier would be a Social Security number or a name, for example. An indirect identifier would be a date of birth or death. This classification is important because it determines how the data will be processed. There may be other variables in the data that are neither direct nor indirect identifiers.
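As a minimal sketch of this step, the classification can be recorded as a simple mapping from variable name to role. The variable names and the `variables_with_role` helper are hypothetical and chosen only for illustration; the actual classification is a judgment made by the data custodian.

```python
from enum import Enum


class Role(Enum):
    DIRECT = "direct identifier"
    INDIRECT = "indirect identifier"
    OTHER = "not considered for de-identification"


# Hypothetical variables; a real classification is decided by the data custodian.
VARIABLE_ROLES = {
    "name": Role.DIRECT,
    "ssn": Role.DIRECT,
    "date_of_birth": Role.INDIRECT,
    "date_of_death": Role.INDIRECT,
    "lab_result": Role.OTHER,
}


def variables_with_role(roles, role):
    """Return the variable names assigned to the given role, in sorted order."""
    return sorted(name for name, assigned in roles.items() if assigned is role)


direct = variables_with_role(VARIABLE_ROLES, Role.DIRECT)
indirect = variables_with_role(VARIABLE_ROLES, Role.INDIRECT)
print("direct:", direct)
print("indirect:", indirect)
```

Writing the classification down explicitly makes it easy to see how many indirect identifiers a release would carry before any processing starts.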
Caution
In general you do not want to have more than six to eight indirect identifiers in a public data set. Because no controls can be imposed on a public data set, it will be difficult to adequately ensure very small risk with a large number of indirect identifiers. This also means that in general it will not be possible to share longitudinal data in this public way, as longitudinal data will have multiple transactions per individual and therefore the indirect information per person will likely exceed eight identifiers.
If it is necessary to have more than eight indirect identifiers or to release longitudinal data then a non-public data release, where other contractual, privacy, and security controls can be imposed, should be considered. Alternatively, aggregated data can be released as part of the open data initiative rather than individual level data.
Step 2: Pseudonymize or Remove Direct Identifiers
If it is necessary to link records that belong to the same individual, for example, if the data has records that span multiple tables, then some direct identifiers should be pseudonymized. Pseudonymization should be relatively straightforward, but there are cases where this has not been performed properly. Pseudonymization techniques include one-way hashing and encryption. If a direct identifier is not going to be used to link records then it should be removed.
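The following is a minimal sketch of one-way hashing for pseudonymization, not a definitive implementation. It assumes a keyed hash (HMAC-SHA256 from Python's standard library) with a secret key that the custodian never releases. The keyed variant is my choice for the sketch: a plain unkeyed hash of a low-entropy identifier such as a Social Security number can be reversed by hashing every possible value. The field names `ssn` and `pid` and the two example tables are hypothetical.

```python
import hashlib
import hmac
import secrets

# Assumption: the custodian keeps this key secret and never releases it.
SECRET_KEY = secrets.token_bytes(32)


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Map a direct identifier to a stable pseudonym using a keyed one-way hash."""
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


# Two hypothetical tables that must remain linkable after de-identification.
visits = [{"ssn": "123-45-6789", "visit": "2015-01-10"}]
claims = [{"ssn": "123-45-6789", "claim": 250.0}]

# Replace the direct identifier with the pseudonym in both tables.
for table in (visits, claims):
    for row in table:
        row["pid"] = pseudonymize(row.pop("ssn"))

print(visits[0]["pid"] == claims[0]["pid"])
```

Because the same key is used for both tables, records for the same individual receive the same pseudonym and can still be linked, while the original identifier no longer appears in the data.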
Step 3: K-Anonymize the Indirect Identifiers
The k-anonymity method means that, for every combination of values on the indirect identifiers that appears in the data, there are at least k individuals who have exactly the same values. This method should be used with k>=11 to de-identify public data. The indirect identifiers may be generalized, truncated, or redacted as ways to achieve k-anonymity. The value of k=11 is consistent with current practices for public data release from other organizations, such as CMS.
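To make the criterion concrete, here is a small sketch under stated assumptions. It generalizes two hypothetical indirect identifiers (date of birth to birth year, a five-digit ZIP code to its first three digits), counts how many records share each combination, and redacts whole records in groups smaller than k. The field names, the generalization rules, and the choice to suppress entire records are illustrative assumptions; a real release would tune these to preserve data utility.

```python
from collections import Counter

K = 11  # threshold suggested in the text for public releases
INDIRECT = ("birth_year", "zip3", "sex")  # hypothetical indirect identifiers


def generalize(record):
    """Coarsen indirect identifiers: birth date -> birth year, ZIP -> 3-digit prefix."""
    out = dict(record)
    out["birth_year"] = out.pop("date_of_birth")[:4]
    out["zip3"] = out.pop("zip")[:3]
    return out


def equivalence_key(record):
    """Combination of indirect identifier values that defines a record's group."""
    return tuple(record[name] for name in INDIRECT)


def k_anonymize(records, k=K):
    """Generalize records, then suppress those in groups smaller than k."""
    generalized = [generalize(r) for r in records]
    sizes = Counter(equivalence_key(r) for r in generalized)
    kept = [r for r in generalized if sizes[equivalence_key(r)] >= k]
    return kept, len(generalized) - len(kept)


# Twelve records that fall into one group, and two that fall into a small group.
records = [
    {"date_of_birth": f"1980-01-{day:02d}", "zip": "90210", "sex": "F", "dx": "A"}
    for day in range(1, 13)
] + [
    {"date_of_birth": "1975-06-01", "zip": "10001", "sex": "M", "dx": "B"}
    for _ in range(2)
]

released, suppressed = k_anonymize(records)
smallest_group = min(Counter(equivalence_key(r) for r in released).values())
print(len(released), suppressed, smallest_group)  # 12 2 12
```

The two records in the small group are suppressed, and every remaining combination of indirect identifiers is shared by at least 11 records.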
Step 4: Perform a Motivated Intruder Test
This is a process where the data custodian commissions a re-identification attack on the data from step three. This empirical process tests the assumptions that were made. The assumptions include the choice of indirect identifiers and the choice of the k value. The UK Information Commissioner's Office recommended this approach in their code of practice as a risk-management step when sharing information.
Motivated intruder test
The UK Information Commissioner's Office has provided some useful definitions and guidance on conducting a motivated intruder test. The following is from their code of practice.
The ‘motivated intruder’ is taken to be a person who starts without any prior knowledge but who wishes to identify the individual from whose personal data the anonymised data has been derived. This test is meant to assess whether the motivated intruder would be successful.
The approach assumes that the ‘motivated intruder’ is reasonably competent, has access to resources such as the Internet, libraries, and all public documents, and would employ investigative techniques such as making enquiries of people who may have additional knowledge of the identity of the data subject or advertising for anyone with information to come forward. The ‘motivated intruder’ is not assumed to have any specialist knowledge such as computer hacking skills, or to have access to specialist equipment or to resort to criminality such as burglary, to gain access to data that is kept securely.
Carrying out a motivated intruder test in practice might include:
- carrying out a web search to discover whether a combination of date of birth and postcode data can be used to reveal a particular individual’s identity;
- searching the archives of national or local newspaper to see whether it is possible to associate a victim’s name with crime map data;
- using social networking to see if it is possible to link anonymised data to a user’s profile; or
- using the electoral register and local library resources to try to link anonymised data to someone’s identity.
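One part of such a test can be rehearsed before commissioning it: linking the de-identified release against an external source that an intruder could plausibly obtain. The sketch below is an illustration under stated assumptions, not the ICO's procedure. The `public_register` data, the field names, and the `link` and `unique_matches` helpers are all hypothetical, and the register is assumed to list each name once.

```python
from collections import defaultdict


def link(release, register, keys):
    """For each register entry, list the release rows sharing its indirect identifiers."""
    index = defaultdict(list)
    for position, row in enumerate(release):
        index[tuple(row[k] for k in keys)].append(position)
    return {
        person["name"]: index.get(tuple(person[k] for k in keys), [])
        for person in register
    }


def unique_matches(matches):
    """Names that link to exactly one release row: candidate re-identifications."""
    return sorted(name for name, rows in matches.items() if len(rows) == 1)


# Hypothetical de-identified release (no names) and a hypothetical public register.
release = [
    {"birth_year": "1980", "zip3": "902", "dx": "A"},
    {"birth_year": "1980", "zip3": "902", "dx": "B"},
    {"birth_year": "1975", "zip3": "100", "dx": "C"},
]
public_register = [
    {"name": "Alice Example", "birth_year": "1980", "zip3": "902"},
    {"name": "Bob Example", "birth_year": "1975", "zip3": "100"},
    {"name": "Carol Example", "birth_year": "1990", "zip3": "606"},
]

matches = link(release, public_register, ("birth_year", "zip3"))
at_risk = unique_matches(matches)
print(at_risk)  # ['Bob Example']
```

A unique match is only a candidate: the person in the register may not be in the released data at all, so each candidate still has to be verified. Records that match uniquely are the ones to examine when revising the choice of indirect identifiers or the value of k.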
Step 5: Update the De-identification
A motivated intruder test may not be necessary if a data release is similar to a previous one that has undergone that type of analysis.
Based on the results of the motivated intruder test, the de-identification scheme that was applied in steps 1 to 3 is updated, and the data is released.
Exceptions
There are some exceptions to the above protocol that should be considered. We illustrate these exceptions through some examples:
- If the data release contains only indirect identifiers and the k-anonymity criterion is met, then there may be no need for running the motivated intruder test. The reasoning here is that an adversary would not learn something new through a re-identification because an adversary already has all the indirect identifiers. An example of this occurs in public health reporting of illnesses in the population by age and gender.
- For certain types of longitudinal data the event dates are more or less pre-determined. A good example of that is clinical trial data or cancer treatment data. In such a case the key date is the start date, and it may be possible to create a de-identified data set even though it is longitudinal, as long as the other criteria are met.
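For the longitudinal exception, one possible approach, offered here as an assumption rather than something the protocol prescribes, is to express each event as an offset from the individual's start date. The start date then remains the only date that needs to be treated as an indirect identifier. The `to_relative_days` helper and the example treatment schedule are hypothetical.

```python
from datetime import date


def to_relative_days(events):
    """Replace each event date with the number of days since the first event."""
    start = min(event_date for event_date, _ in events)
    return [
        ((event_date - start).days, label)
        for event_date, label in sorted(events)
    ]


# Hypothetical treatment schedule for one individual, in arbitrary order.
patient_events = [
    (date(2014, 3, 17), "cycle 2"),
    (date(2014, 2, 24), "cycle 1"),
    (date(2014, 4, 7), "cycle 3"),
]

relative = to_relative_days(patient_events)
print(relative)  # [(0, 'cycle 1'), (21, 'cycle 2'), (42, 'cycle 3')]
```

When the schedule is largely pre-determined, as in this example with a fixed 21-day cycle, the offsets are similar across individuals and add little identifying information beyond the start date itself.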
The above is a pragmatic approach to try to balance re-identification risk against the ability to make data more available publicly. Again it should be noted that this does not guarantee zero risk, but attempts to balance multiple societal benefits and individual needs.
Inferences
The above protocol does not address concerns around inappropriate uses of the data. For example, a data user may use the de-identified data to build a model that discriminates against individuals in some decision-making situations. It is difficult to control inappropriate data uses for a public data release without a terms-of-use or other regulatory mechanism. It is possible to modify the data to make it difficult to draw inferences from the data, but that reduces data utility significantly and defeats the whole purpose of open data. If the data custodian believes that the data can potentially be used in inappropriate ways then an open data release may not be the most appropriate option.
Conclusions
If the conditions for creating a public data set cannot be met, then alternative non-public data releases should be considered. This may include using a terms-of-use for those who download the data (i.e., a quasi-public release), having more stringent contractual controls in place, or creating a data enclave.