This is the fourth in a series of Privacy Tech posts focused on privacy engineering and UX design from Humu Chief Privacy Officer Lea Kissner.
Folks who want to make data-driven decisions naturally want data to make those decisions. They want to slice and dice that data to learn every useful trend. Sometimes, that data is public or otherwise not in need of anonymization. However, in many cases, the data needs to be anonymized (transformed so that the original user can in no way be identified), and this is where things turn complicated.
There is a major pitfall in building a data-slicing analytics system over anonymized data: applying the anonymization separately in different views of the dataset. This pitfall applies both to techniques like k-anonymity, which suppress data, and techniques like differential privacy, which add randomly generated noise to the data. I won’t get into the details of the techniques here; just know that these two types exist. Let’s go through those pitfalls and techniques to avoid them.
Pitfalls in data suppression
Consider the example illustrated below: Some people have answered a survey including a question about whether they are or are not a privacy engineer. For privacy reasons, the survey administrator has chosen to suppress all answers given by fewer than 10 people. This sample, sadly, has fewer than 10 responses from privacy engineers, so their responses are suppressed.
| Privacy engineers | Non-privacy engineers | Total responses |
|---|---|---|
| Suppressed (fewer than 10) | 4,995 | 5,000 |
A viewer of the data breakdown thus sees that 4,995 people have stated that they are not privacy engineers and that a suppressed number (fewer than 10) are privacy engineers. Zooming back out to the question overall, if suppression of rare responses is performed independently at this level, the viewer would see that 5,000 people responded to the question; because 5,000 is greater than 10, it is not suppressed. The trouble is that the viewer can put this information together to determine that 5,000 minus 4,995, or exactly 5, people in the sample are privacy engineers.
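To make the arithmetic concrete, here is a minimal Python sketch of this differencing attack. The `suppress` helper, the dictionary keys, and the raw count of 5 privacy engineers are illustrative assumptions chosen to match the example, not part of any real system.

```python
THRESHOLD = 10  # suppress any count given by fewer than 10 people


def suppress(count, threshold=THRESHOLD):
    """Return the count, or None if it is too small to show."""
    return count if count >= threshold else None


# Hypothetical raw survey data matching the example above.
raw = {"privacy_engineer": 5, "not_privacy_engineer": 4995}

# The pitfall: suppression is applied independently to each view.
breakdown = {answer: suppress(n) for answer, n in raw.items()}
total = suppress(sum(raw.values()))

# The viewer subtracts every visible answer from the visible total.
visible = sum(n for n in breakdown.values() if n is not None)
recovered = total - visible

print("Breakdown shown:", breakdown)
print("Total shown:", total)
print("Suppressed count recovered by subtraction:", recovered)
```

The suppressed cell never appears in any single view, yet it falls out of one subtraction across two views.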
Attacks in this vein can be much more complex; the example illustrated here is the simplest possible form. A particularly common, slightly more complex version of this pitfall comes when giving viewers the ability to slice data by time. For example, consider a dataset in which some days have so little data that it would be suppressed if shown on its own, but the system only shows data at the granularity of a week and data suppression is applied at that level. Some viewers, however, want to show weeks starting Sunday, some starting Monday, some even on other days, so the system designers wish to make the week-start configurable. Those two properties, in combination, allow a windowing attack, in which data from overlapping windows can be queried and the answers to those queries compared to obtain data from a single day, data which might be suppressed if properly anonymized.
In this example, the difference in these two queries shows that the total data from both Sundays is 1. Further queries of this sort can be used to further determine per-day data.
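A windowing attack of this kind can be sketched in a few lines of Python. The daily counts, the threshold of 10, and the `weekly_count` function are hypothetical; the only properties that matter are that every single day is below the threshold and that the week start is configurable.

```python
THRESHOLD = 10  # weekly totals below this would be suppressed

# Hypothetical daily counts for 15 consecutive days; index 0 is a Sunday.
# Every individual day is below THRESHOLD and would be suppressed alone.
daily = [1, 4, 3, 5, 2, 4, 3,   # week 1: Sunday through Saturday
         0, 4, 3, 5, 2, 4, 3,   # week 2: Sunday through Saturday
         2]


def weekly_count(start):
    """Seven-day total beginning at index `start`, suppressed if too small."""
    total = sum(daily[start:start + 7])
    return total if total >= THRESHOLD else None


sunday_week = weekly_count(0)  # Sunday through Saturday
monday_week = weekly_count(1)  # Monday through the following Sunday

# The six shared days cancel, leaving (first Sunday) - (second Sunday).
leaked = sunday_week - monday_week
print("Week starting Sunday:", sunday_week)
print("Week starting Monday:", monday_week)
print("Leaked difference between the two Sundays:", leaked)
```

Neither weekly total is suppressed, but their difference depends only on the two boundary days. Sliding the window further yields more such differences, which is how per-day values can be pieced together.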
Essentially, in these attacks, the system has allowed un-suppressed data to be made implicitly available. Avoiding these attacks requires understanding what data is implicitly available.
Pitfalls in randomness
When applying randomness, a different pitfall applies: The data shown to the viewer is likely to be self-contradictory and confusing. Consider the first example but with randomness applied to each quantity independently rather than suppression.
|Privacy engineers||Non-privacy engineers||Total responses = 5,012|
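The inconsistency is easy to reproduce. The sketch below adds an independent noise draw to each answer and to the total; the `laplace_noise` helper and its scale of 2.0 are arbitrary assumptions for illustration and are not calibrated to any privacy guarantee.

```python
import random


def laplace_noise(scale=2.0):
    """Laplace(0, scale) sample: the difference of two exponential draws."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)


def noisy_views(raw, noise=laplace_noise):
    """Add a separate noise draw to each answer and to the overall total."""
    breakdown = {answer: round(n + noise()) for answer, n in raw.items()}
    total = round(sum(raw.values()) + noise())
    return breakdown, total


# Hypothetical raw survey data matching the first example.
raw = {"privacy_engineer": 5, "not_privacy_engineer": 4995}

breakdown, total = noisy_views(raw)
print("Breakdown shown:", breakdown)
print("Total shown:", total)
print("Breakdown sums to:", sum(breakdown.values()))
```

Because the total gets its own noise draw, it will usually disagree with the sum of the answers shown beneath it, which is exactly the kind of contradiction a viewer finds confusing.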
Safer aggregation over anonymized data
There is a relatively simple solution to all these issues:
- Identify the finest granularity of data that will be available in your system, whether explicitly or implicitly.
- Perform anonymization on this finest-grained data.
- Build aggregated data displays directly from the anonymized data, rather than the un-anonymized dataset. Consider any suppressed data fields to be equivalent to zero.
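The three steps above can be sketched as follows. This is a minimal illustration rather than a production design: `anonymize`, `aggregate`, the threshold of 10, and the optional `noise` callable are all hypothetical names and parameters.

```python
def anonymize(fine_counts, threshold=10, noise=None):
    """Anonymize the finest-grained counts once.

    With a `noise` callable, add one noise draw per cell; otherwise
    suppress (set to None) any cell below `threshold`.
    """
    if noise is not None:
        return {k: round(n + noise()) for k, n in fine_counts.items()}
    return {k: (n if n >= threshold else None) for k, n in fine_counts.items()}


def aggregate(anonymized):
    """Build a coarser view from anonymized cells only.

    Suppressed cells are treated as zero; the raw data is never consulted.
    """
    return sum(n for n in anonymized.values() if n is not None)


# Hypothetical finest-grained data: one count per survey answer.
raw = {"privacy_engineer": 5, "not_privacy_engineer": 4995}

cells = anonymize(raw)    # anonymize the finest granularity
total = aggregate(cells)  # derive the total from the anonymized cells

print("Cells shown:", cells)
print("Total shown:", total)
```

Since the total is computed from what the viewer can already see, subtracting the visible answers from it reveals nothing new, and a noisy total always agrees with the noisy cells it was built from.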
Let us consider our example again. Note how the higher-level data displayed, the total number of people who took the survey, is consistent with the fine-grained survey answer data.
| Privacy engineers | Non-privacy engineers | Total responses |
|---|---|---|
| Suppressed (counted as zero) | 4,995 | 4,995 |
|Privacy engineers||Non-privacy engineers||Total responses = 4,991|