We are in the midst of a global pandemic, and the need to access COVID-19-related data has become increasingly important to make evidence-based policy decisions, develop effective treatments, and drive operational efficiencies to keep our health care systems afloat. Accessing personal data comes at a risk to privacy, and there are many unfortunate examples of harm coming to individuals diagnosed with COVID-19. These challenging times are likely putting pressure on many privacy professionals the world over who are on the front lines of supporting data sharing and data releases.
One common refrain we’re hearing is to “aggregate” data to make it safe for sharing or release. Although aggregation may seem like a simple approach to creating safe outputs from data, it is fraught with hazards and pitfalls. Like so many things in our profession, the answer to the question of whether aggregation is safe is “it depends.”
The final count down
In fact, the field of statistical disclosure control, used by national statistical organizations, was largely born out of the need to protect information in aggregated or tabular form. Various methods have been developed. Probably the best known is the threshold rule, which requires a minimum number of people to be represented in any categorization of aggregated data. For example, applying the threshold rule of 10 would mean that at least 10 people are represented with the same combination of geographic region, sex and age. This seems simple enough, but things get complicated in a variety of ways.
Let’s assume that aggregated data is represented in a table, with columns representing the identifying attributes of region, sex and age. Another column is added representing the count of people for a combination of these identifying attributes. Applying the threshold rule of 10 would mean removing any counts less than 10, as shown in the table below, where the suppressed count is marked with an asterisk. Oftentimes, however, summaries are included that count down the column to provide a total (known as a marginal in statistics).
Region | Sex | Age | Count |
A | M | 10-19 | * |
A | M | 20-29 | 17 |
A | M | 30-39 | 15 |
A | M | All (10-39) | 39 |
If the summary row (the last row in the table above) is included, we can easily determine that the suppressed count for region A, sex M and age 10–19 is 39 – 15 – 17 = 7. OK, yes, this was an obvious example. But it’s a common problem, and there are known methods to easily reverse aggregation across many dimensions when marginals are included (such as the shuttle algorithm). Those marginals may also come from other published releases, for example, the overall number of positive COVID-19 cases in a region.
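The subtraction can be written out as a minimal sketch. The dictionary layout and the function name `recover_suppressed` are assumptions made for illustration; the counts are the ones from the table above.

```python
# Published cells for region A, sex M; None marks the suppressed cell.
published = {
    "10-19": None,  # suppressed under the threshold rule of 10
    "20-29": 17,
    "30-39": 15,
}
marginal_total = 39  # the "All (10-39)" summary row


def recover_suppressed(cells, total):
    """Recover a single suppressed cell by subtracting the known cells from the marginal."""
    missing = [band for band, count in cells.items() if count is None]
    if len(missing) != 1:
        raise ValueError("plain subtraction only works with exactly one suppressed cell")
    known_sum = sum(count for count in cells.values() if count is not None)
    return missing[0], total - known_sum


age_band, count = recover_suppressed(published, marginal_total)
print(age_band, count)  # 10-19 7
```

With more than one suppressed cell per marginal, plain subtraction no longer gives a unique answer, but it still bounds each suppressed count, which is the starting point for the more general methods mentioned above.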
Is the lesson here not to include summaries or marginals?
Well, summaries of this type can also come up when aggregated data is produced over time. Say the above table was produced for the month of February, in which {region A, sex M, age 30–39} = 15, and for the month of March, an updated table is produced with {region A, sex M, age 30–39} = 16. It’s easy to see that one person was added from one month to the next, which would violate the spirit of the threshold rule since fewer than 10 people would be represented in the difference between counts.
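One way to catch this before release is to compare each new table against the previous one and flag cells whose change is smaller than the threshold. The sketch below is an assumption about how such a check could look, not a standard tool; the function name `small_differences` and the second cell in each table are hypothetical values added for contrast.

```python
THRESHOLD = 10

# Counts keyed by (region, sex, age band). The 30-39 values come from the
# example above; the 20-29 values for March are made up for illustration.
february = {("A", "M", "30-39"): 15, ("A", "M", "20-29"): 17}
march = {("A", "M", "30-39"): 16, ("A", "M", "20-29"): 30}


def small_differences(previous, current, threshold=THRESHOLD):
    """Return cells whose change between two releases is non-zero but below the threshold."""
    flagged = {}
    for key in previous.keys() & current.keys():
        diff = abs(current[key] - previous[key])
        if 0 < diff < threshold:
            flagged[key] = diff
    return flagged


print(small_differences(february, march))  # {('A', 'M', '30-39'): 1}
```

A flagged cell does not say which individuals changed, only that the difference between the two releases describes a group smaller than the threshold allows.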
Fables of the reconstruction
The challenges just described are due to overlapping counts. Things can get more complicated when we consider aggregating numerical data. This is why other methods were developed, such as the dominance rule, which combines the threshold rule with a proportionality measure to ensure that the aggregate doesn’t end up effectively representing only one or two people. Say, for example, the data included household income. Adding up the income for 10 or more people may not provide sufficient protection if one person’s income ends up representing 80% or more of the total income.
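As a rough sketch of a dominance check, the function below tests whether the largest contributors account for a given share of a total. The name `violates_dominance`, its parameters and the income figures are hypothetical; the 80% cut-off is the one used in the example above, and real disclosure control rules vary in how they set these parameters.

```python
def violates_dominance(values, n=1, k=0.80):
    """Return True if the n largest contributors make up a share k or more of the total."""
    total = sum(values)
    if total <= 0:
        return False
    top_share = sum(sorted(values, reverse=True)[:n]) / total
    return top_share >= k


# Eleven households pass a threshold rule of 10, but one income dominates the sum.
skewed = [900_000] + [20_000] * 10
balanced = [50_000] * 11

print(violates_dominance(skewed))    # True  (900,000 of 1,100,000 is about 82%)
print(violates_dominance(balanced))  # False (each household is about 9%)
```

In the skewed case, anyone who sees the published total has a good estimate of the dominant household's income, even though the group is large enough to satisfy the threshold rule.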
This leads us to a problem with the word “aggregation.”
Broadly interpreted, it could mean collating data together (the least fitting to our current discussion), counting people in a group, adding up numerical data of people, or, while we’re at it, calculating any statistic of data about people. Statistics can introduce a whole new realm of complexity (and not just because people dread the study of statistics).
The challenge is that all these numerical forms of aggregation can be used to reconstruct the original data, a result that has come to be known as the database reconstruction theorem. Generally speaking, the more statistics produced from the same underlying data, the more likely it is that the underlying data can be reconstructed from those statistics. This is because there are only so many combinations of data that could have produced those statistics. This risk is the reason the U.S. Census Bureau has invested so heavily in developing new and innovative tools using differential privacy.
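The idea that only so many combinations fit can be illustrated with a toy brute-force search. Everything in this sketch is hypothetical: a group of three people, ages limited to 0–99, and published statistics chosen for illustration (a sum of 120, i.e. a mean of 40, a median of 30 and a maximum of 68). Real reconstruction attacks use far more efficient solvers, but the principle is the same.

```python
from itertools import combinations_with_replacement
from statistics import median

AGES = range(0, 100)  # assumed plausible age range


def candidates(n, total=None, med=None, oldest=None):
    """List sorted age tuples of size n that are consistent with the published statistics."""
    result = []
    for ages in combinations_with_replacement(AGES, n):
        if total is not None and sum(ages) != total:
            continue
        if med is not None and median(ages) != med:
            continue
        if oldest is not None and ages[-1] != oldest:
            continue
        result.append(ages)
    return result


# Each additional statistic shrinks the set of datasets that could have produced it.
only_mean = candidates(3, total=120)
mean_and_median = candidates(3, total=120, med=30)
all_three = candidates(3, total=120, med=30, oldest=68)

print(len(only_mean) > len(mean_and_median))  # True
print(len(mean_and_median))                   # 31
print(all_three)                              # [(22, 30, 68)]
```

With three statistics published about three people, only one dataset remains, so the "aggregate" release has disclosed every individual age.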
One or the other
Deidentification or anonymization (terms that are often used interchangeably) is broadly defined as removing the association between data and people. Whether the data is record-level or aggregated, the same principles apply, as was recognized in the standardization efforts of ISO 20889 on terminology and techniques. It could be argued that applying these techniques to aggregated data is somewhat after the fact, whereas applying them to the record-level data is more akin to the spirit of privacy by design. However, there are methods that build these techniques into the aggregation itself.
Regardless, as privacy professionals, we should not be using the terms deidentified or anonymized when only directly identifying attributes are removed from data. Perhaps it’s time we agree to call this pseudonymized data, given the influence the EU General Data Protection Regulation has had around the world.
While it may seem simple to suggest using aggregated data, things are never as simple as they seem in the world of privacy, and “it depends” is a common refrain. In truth, it may be safer to suggest using aggregated data that has been deidentified or anonymized, or to remove the association between aggregated data and people. But let’s not assume that aggregated data is safe, or we’ll provide a false sense of security in how data outputs are shared or released.
Stuck in the middle
This isn’t to say that producing safe data outputs needs to be overly complex.
In the context of the COVID-19 pandemic, if at no other time, we need to think of ways to produce useful, sufficiently protected data in an efficient and scalable manner. For non-complex data outputs, such as counts, keep it simple: use a standard set of attributes (e.g., region, sex, age), aggregate by those attributes, produce no accompanying summary statistics from the underlying data (so that there’s less risk of overlap that may reveal the underlying counts), and report for a specific period with no overlap with previous reporting periods.
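A minimal sketch of that simple recipe might look like the following. The record layout, field names and the function `aggregate_counts` are assumptions for illustration, and the records themselves are fabricated test data, not real cases. The function counts by a fixed set of attributes within a single reporting period, suppresses cells below the threshold, and returns no totals.

```python
from collections import Counter
from datetime import date

THRESHOLD = 10


def aggregate_counts(records, start, end, threshold=THRESHOLD):
    """Count records per (region, sex, age_band) reported in [start, end); suppress small cells."""
    counts = Counter(
        (r["region"], r["sex"], r["age_band"])
        for r in records
        if start <= r["reported"] < end
    )
    # No marginal totals are produced; suppressed cells are returned as None.
    return {key: (n if n >= threshold else None) for key, n in counts.items()}


# Hypothetical records: 12 in one cell, 3 in another, 1 outside the reporting period.
records = (
    [{"region": "A", "sex": "M", "age_band": "20-29", "reported": date(2020, 3, 5)}] * 12
    + [{"region": "A", "sex": "M", "age_band": "10-19", "reported": date(2020, 3, 9)}] * 3
    + [{"region": "A", "sex": "M", "age_band": "20-29", "reported": date(2020, 2, 27)}]
)

march = aggregate_counts(records, date(2020, 3, 1), date(2020, 4, 1))
print(march[("A", "M", "20-29")])  # 12
print(march[("A", "M", "10-19")])  # None
```

Using a half-open period (start included, end excluded) keeps consecutive reports from counting the same record twice.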
For scenarios in which more detailed data is needed, a little more sophistication may be required, especially for data that is potentially identifying. However, a great deal can also be done besides just changing data outputs through transformations.
The threat landscape can be reduced by only sharing or releasing data with those people who need it for approved purposes and in suitably protected data environments with appropriate technical and organizational controls (see, for example, the Five Safes). A limited context should also limit the data transformations needed to render data outputs safe.
While this is seemingly more complex than suggesting aggregated data, it does speak to our common answer of “it depends.”