Back in June 2016, Apple announced that it would use differential privacy to protect individual privacy for certain data that it collects. Although differential privacy had already been a hot research topic for over a decade, the announcement introduced it to the broader public. Before that announcement, Google had already been using differential privacy to collect Chrome usage statistics. And within the last month, Uber announced that it, too, is using differential privacy.

If you’ve done a little homework on differential privacy, you may have learned that it provides provable guarantees of privacy and concluded that a database that is differentially private is, well, private — in other words, that it protects individual privacy. But that isn’t necessarily the case. When someone says, “a database is differentially private,” they don’t mean that the database is private. Rather, they mean, “the privacy of the database can be measured.”

Really, it is like saying that “a bridge is weight limited.” If you know the weight limit of a bridge, then yes, you can use the bridge safely. But the bridge isn’t safe under all conditions. You can exceed the weight limit and hurt yourself.

The weight limit of a bridge is expressed in tons, kilograms, or number of people. Simplifying a bit, the amount of privacy afforded by a differentially private database is expressed as a single number, by convention labeled ε (epsilon). Lower ε means more private.
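
To give ε a precise meaning: in the standard definition, a randomized query mechanism M is ε-differentially private if, for any two databases D and D′ that differ in only one person’s record, and for any set S of possible answers,

    Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]

That is, adding or removing any one person can change the probability of any answer by at most a factor of e^ε. The smaller ε is, the less any individual’s data can affect what an analyst sees.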

All bridges have a weight limit. Everybody knows this, so it sounds dumb to say, “a bridge is weight limited.” And guess what? All databases are differentially private. Or, more precisely, all databases have an ε. A database with no privacy protections at all has an ε of infinity. It is pretty misleading to call such a database differentially private, but mathematically speaking, it is not incorrect to do so. A database that can’t be queried at all has an ε of zero. Private, but useless.
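
To make this concrete, here is a minimal sketch in Python of the classic Laplace mechanism applied to a counting query (the function and names are illustrative, not anyone’s actual API). The noise has scale sensitivity/ε, so as ε grows toward infinity the noise disappears, and with it the privacy; as ε shrinks toward zero the noise drowns out the answer.

    import numpy as np

    def noisy_count(true_count, epsilon, sensitivity=1.0):
        # A counting query changes by at most 1 when one person is added or
        # removed, so its sensitivity is 1. The noise scale is sensitivity/epsilon:
        # smaller epsilon means more noise and therefore more privacy.
        noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
        return true_count + noise

    print(noisy_count(1000, epsilon=0.1))  # typical error on the order of +/- 10
    print(noisy_count(1000, epsilon=10))   # typical error on the order of +/- 0.1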

In their paper on differential privacy for statistics, Cynthia Dwork and Adam Smith write, "The choice of ε is essentially a social question. We tend to think of ε as, say, 0.01, 0.1, or in some cases, ln 2 or ln 3." The natural logarithm of 3 (ln 3) is around 1.1.
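
Those numbers are easier to interpret through the factor e^ε in the definition above: ε = ln 3 means that no single person’s data can change the probability of any query result by more than a factor of e^(ln 3) = 3, while ε = 0.01 caps that factor at e^0.01 ≈ 1.01, so the results are essentially unaffected by any one individual.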

If someone asks you to cross a bridge they built, you would do well to ask what the weight limit is. If someone tells you they’ve built a differentially private database that you can use, it is important to ask these questions:

  • What is the ε?
  • Under what assumptions does that ε apply?
  • How do you ensure that this ε is not exceeded? (One common safeguard, a running privacy budget, is sketched right after this list.)
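
The privacy budget mentioned above works because of the basic composition property of differential privacy: the ε values of successive queries add up. A system can therefore keep a running total and refuse further queries once the agreed total ε has been spent. A minimal sketch in Python (the class and its names are hypothetical, not any real product’s API):

    class PrivacyBudget:
        # Track cumulative epsilon across queries (basic sequential composition).
        def __init__(self, total_epsilon):
            self.total_epsilon = total_epsilon
            self.spent = 0.0

        def charge(self, epsilon):
            if self.spent + epsilon > self.total_epsilon:
                raise RuntimeError("privacy budget exhausted")
            self.spent += epsilon

    budget = PrivacyBudget(total_epsilon=1.0)
    budget.charge(0.5)  # first query: allowed
    budget.charge(0.5)  # second query: allowed, budget now fully spent
    budget.charge(0.1)  # third query: raises RuntimeError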

If the ε is around 1 or less, then you are probably pretty safe. If the ε is much more than that, then there is potentially some risk of individual privacy loss. You need to ask more questions:

  • Under what conditions could an analyst obtain individual data? For instance, what if an analyst has some prior knowledge about some users in the database? Can the analyst then learn more about those users? (A small sketch of such an attack follows this list.)
  • How do you ensure that those conditions don’t exist for the analysts using the system?
  • Or if the conditions might exist, how do you ensure that the analysts don’t exploit their knowledge?
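
To illustrate the prior-knowledge concern under a deliberately loose ε (the scenario is hypothetical and reuses the noisy_count sketch from above): suppose the analyst already knows everything in the database except whether one person, Alice, has a particular condition. With ε = 10 per query, two counts that differ only in whether Alice is included reveal her record almost every time.

    # Reuses noisy_count from the earlier sketch. With epsilon = 10 the Laplace
    # noise has scale 0.1, so the difference of the two noisy answers is almost
    # always within 0.5 of the true difference, which is 1 if Alice has the
    # condition and 0 if she does not.
    with_alice = noisy_count(true_count=43, epsilon=10)     # count including Alice
    without_alice = noisy_count(true_count=42, epsilon=10)  # same query, Alice excluded
    print(round(with_alice - without_alice))                # very likely prints 1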

Given recent publicity around differential privacy, we may soon see differential privacy incorporated as part of commercial data masking or data anonymization solutions. It is important to remember that "differentially private" doesn't always mean "actually private" and to make sure you understand what your data anonymization vendor is really offering.

photo credit: courtesy of Ben Tate