25 June 2019

Aggregating over anonymized data

Folks who want to make data-driven decisions naturally want data to make those decisions. They want to slice and dice that data to learn every useful trend. Sometimes, that data is public or otherwise not in need of anonymization. In many cases, however, the data needs to be anonymized, that is, transformed so that the original user cannot be identified, and this is where things turn complicated.

There is a major pitfall in building a data-slicing analytics system over anonymized data: applying the anonymization separately in different views of the dataset. This pitfall applies both to techniques like k-anonymity, which suppress data, and techniques like differential privacy, which add randomly generated noise to the data. I won’t get into the details of the techniques here; just know that these two types exist. Let’s go through how the pitfall shows up for each type and the techniques to avoid it.

Pitfalls in data suppression

Consider the example illustrated below: Some people have answered a survey including a question about whether they are or are not a privacy engineer. For privacy reasons, the survey administrator has chosen to suppress all answers given by fewer than 10 people. This sample, sadly, has fewer than 10 responses from privacy engineers, so their responses are suppressed.

Total responses = 5,000

Privacy engineers	Non-privacy engineers
Suppressed	4,995

A viewer of the data breakdown thus sees that 4,995 people have stated that they are not privacy engineers and that a suppressed number (fewer than 10) are privacy engineers. Zooming back out to the question overall, if suppression of rare responses is performed independently at this level, the viewer sees that 5,000 people responded to the question; because 5,000 is greater than 10, it is not suppressed. The trouble is that the viewer can put this information together to determine that 5,000 minus 4,995, or five, people in the sample are privacy engineers.
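The subtraction a viewer performs can be sketched in a few lines of Python. This is only an illustration: the `suppress` helper and the dictionary keys are hypothetical names, and the raw count of five privacy engineers is the value implied by the example (5,000 minus 4,995).

```python
THRESHOLD = 10  # suppress any answer given by fewer than 10 people

def suppress(count, threshold=THRESHOLD):
    """Return the count, or None if it is too small to publish."""
    return count if count >= threshold else None

# Raw counts implied by the survey example (hypothetical key names).
raw = {"privacy_engineers": 5, "non_privacy_engineers": 4995}

# Pitfall: each view is suppressed independently, straight from the raw data.
breakdown = {answer: suppress(n) for answer, n in raw.items()}
total = suppress(sum(raw.values()))

# The viewer subtracts the visible cells from the visible total.
visible = sum(n for n in breakdown.values() if n is not None)
recovered = total - visible

print(breakdown)         # the privacy-engineer cell is None (suppressed)
print(total, recovered)  # the total is shown, so the hidden cell leaks
```

The suppressed cell never appears in any single view, yet it is fully determined by the two values that are shown.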

Attacks in this vein can be much more complex; the example illustrated here is the simplest possible form. A particularly common, slightly more complex version of this pitfall arises when viewers are given the ability to slice data by time. For example, consider a dataset in which some days have so little data that each would be suppressed if shown on its own, but the system only shows data at the granularity of a week and data suppression is applied at that level. Some viewers, however, want to show weeks starting Sunday, some starting Monday, some even on other days, so the system designers wish to make the week-start configurable. Those two properties, in combination, allow a windowing attack, in which data from overlapping windows can be queried and the answers to those queries compared to obtain data from a single day, data that would be suppressed if properly anonymized.

In this example, the difference in these two queries shows that the total data from both Sundays is 1. Further queries of this sort can be used to further determine per-day data.
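A minimal sketch of such a windowing query pair follows. The daily counts are invented for illustration, and `weekly_total` is a hypothetical stand-in for the system's configurable week view; every individual day is below the threshold, but each week is above it.

```python
THRESHOLD = 10

# Hypothetical daily counts for 14 days; index 0 is a Sunday.
daily = [1, 4, 3, 2, 5, 3, 2,   # week 1: Sun..Sat
         0, 4, 3, 2, 5, 3, 2]   # week 2: Sun..Sat

def weekly_total(start_day):
    """Sum of the 7 days beginning at start_day, suppressed if too small."""
    total = sum(daily[start_day:start_day + 7])
    return total if total >= THRESHOLD else None

sunday_week = weekly_total(0)  # Sunday through Saturday
monday_week = weekly_total(1)  # Monday through the following Sunday

# The six shared days cancel out, leaving only the two Sundays.
difference = sunday_week - monday_week
print(sunday_week, monday_week, difference)
```

The difference equals the first Sunday's count minus the second Sunday's count. A viewer who knows or can guess one of those values (for instance, a day with no activity) learns the other, and shifting the window one more day at a time extends the same comparison to other days.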

Essentially, in these attacks, the system has allowed un-suppressed data to be made implicitly available. Avoiding these attacks requires understanding what data is implicitly available.

Pitfalls in randomness

When applying randomness, a different pitfall applies: the data shown to the viewer is likely to be self-contradictory and confusing. Consider the first example, but with random noise, rather than suppression, applied to each quantity independently. The two answer counts shown below sum to 4,991, yet the total reads 5,012.

Total responses = 5,012

Privacy engineers	Non-privacy engineers
7	4,984
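To see how often independently added noise produces this kind of contradiction, here is a rough simulation. It is a sketch under stated assumptions: the Laplace noise and the scale of 5.0 are arbitrary illustrative choices, not a calibrated differential privacy mechanism, and `noisy_views` is a hypothetical helper.

```python
import random

def laplace_noise(rng, scale):
    """Laplace(0, scale) sample, as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_views(raw, scale, rng):
    """Pitfall: noise is added separately to each cell and to the total."""
    cells = {k: round(v + laplace_noise(rng, scale)) for k, v in raw.items()}
    total = round(sum(raw.values()) + laplace_noise(rng, scale))
    return cells, total

# Raw counts implied by the survey example (hypothetical key names).
raw = {"privacy_engineers": 5, "non_privacy_engineers": 4995}

inconsistent = 0
for seed in range(1000):
    cells, total = noisy_views(raw, scale=5.0, rng=random.Random(seed))
    if sum(cells.values()) != total:
        inconsistent += 1

print(f"{inconsistent} of 1000 runs show a total that disagrees with its parts")
```

Because the three noise draws are unrelated, the displayed total agrees with the displayed parts only by coincidence.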

Safer aggregation over anonymized data

There is a relatively simple solution to all these issues:

1. Identify the finest granularity of data that will be available in your system, whether explicitly or implicitly.
2. Perform anonymization on this finest-grained data.
3. Build aggregated data displays directly from the anonymized data, rather than the un-anonymized dataset. Consider any suppressed data fields to be equivalent to zero.

Let us consider our example again. Note how the higher-level data displayed, the total number of people who took the survey, is now consistent with the fine-grained survey answer data.

Suppression:

Total responses = 4,995

Privacy engineers	Non-privacy engineers
Suppressed	4,995

Randomness:

Total responses = 4,991

Privacy engineers	Non-privacy engineers
7	4,984
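The approach behind both tables can be sketched as follows: anonymize the finest-grained cells first, then compute every higher-level figure from the anonymized cells only, counting suppressed cells as zero. The helper names are hypothetical, and the noisy values are simply copied from the randomness table rather than generated.

```python
THRESHOLD = 10

def suppress(count, threshold=THRESHOLD):
    """Return the count, or None if it is too small to publish."""
    return count if count >= threshold else None

def aggregate(anonymized_cells):
    """Total built from already-anonymized cells; suppressed counts as 0."""
    return sum(v for v in anonymized_cells.values() if v is not None)

# Raw counts implied by the survey example (hypothetical key names).
raw = {"privacy_engineers": 5, "non_privacy_engineers": 4995}

# Suppression: anonymize the cells, then total the anonymized cells.
suppressed_cells = {k: suppress(v) for k, v in raw.items()}
suppressed_total = aggregate(suppressed_cells)

# Randomness: noisy cell values taken from the table above.
noisy_cells = {"privacy_engineers": 7, "non_privacy_engineers": 4984}
noisy_total = aggregate(noisy_cells)

print(suppressed_total)  # matches the suppression table: 4995
print(noisy_total)       # matches the randomness table: 4991
```

Since the total is derived from what the viewer can already see, subtracting the visible cells from it reveals nothing new, and the noisy figures always add up.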

