In a recent article, Daniel Solove illustrates how scientists and the media have been predicting the death of privacy for decades. Thankfully, privacy is still alive. One recent prediction of this kind, however, appeared in an article in Science magazine that may have used skewed metrics to earn a catchy headline.
The Science article analyzes credit card transaction data on 1.1 million individuals from an unnamed OECD country. One of the authors’ key conclusions is that, using only four transactions, 90 percent of the people are unique. More precisely, they say that 90 percent of the people can be re-identified, which ignores the distinction between uniqueness and re-identification.
This conclusion has since been repeated uncritically in both the scientific and the general media.
I have written a critique of that article elsewhere. There are a number of fundamental problems with the assumptions, presentation and technical details of that Science article. Below, however, I will focus on a single point and show why the calculations and statistics in that article are simply wrong under the most likely threat model, and why its primary conclusions are therefore not based on a valid or complete analysis.
A credit card transaction consists of a date and a shop. For example, Sally may have gone to Pharmacy-R-Us on December 26th and Butcher Joe on December 27th. This would be an example of a two-transaction trace for Sally. If Sally’s transaction trace is unique, it means that she is the only person who has that particular trace (i.e., she is the only person who shopped at these two locations on these two days).
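To make the idea of a unique trace concrete, here is a minimal Python sketch. Sally and her two shops come from the example above; Bob, Carol, their shops and the function name `people_with_unique_traces` are made up purely for illustration.

```python
from collections import Counter

# Each person's trace is the set of (shop, date) pairs for their transactions.
# Only Sally's entries come from the example in the text; Bob and Carol are
# hypothetical people added so that one trace is shared.
traces = {
    "Sally": frozenset({("Pharmacy-R-Us", "Dec 26"), ("Butcher Joe", "Dec 27")}),
    "Bob": frozenset({("Corner Grocer", "Dec 26"), ("Butcher Joe", "Dec 27")}),
    "Carol": frozenset({("Corner Grocer", "Dec 26"), ("Butcher Joe", "Dec 27")}),
}


def people_with_unique_traces(traces):
    """Return the names of people whose trace is shared with nobody else."""
    counts = Counter(traces.values())
    return sorted(name for name, trace in traces.items() if counts[trace] == 1)


if __name__ == "__main__":
    print(people_with_unique_traces(traces))
```

In this toy data set Bob and Carol share a trace, so only Sally is unique. Note that this says nothing about whether Sally can be re-identified; it only says nobody else in this particular data set has her trace.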
If an adversary wants to re-identify individuals, she needs to have background information about the data subject being re-identified. The authors seem to assume that the adversary would know when and where the transaction occurred, as well as the price.
There is reason to believe that the 1.1 million people are a sample from a country with approximately 22 million adults who could have credit cards. A number of OECD countries certainly fit that profile. This means that the 1.1 million individuals in the analyzed data represent only five percent of the population. A basic principle in measuring the risk of re-identification is that risk must be measured on the population, not on the sample. A person who looks unique in the sample may share a trace with someone who was not sampled. If 90 percent of the sample is unique, that does not necessarily mean that 90 percent of the population is unique on a trace of four credit card transactions. In fact, the proportion of unique individuals in the population could be much smaller and you could still have 90 percent unique individuals in the 1.1 million sample.
The best way to illustrate the implications of this is with a simulation. I created a population of 22 million people with credit card transactions and drew a five percent sample of 1.1 million people. If my population were 90 percent unique, it is very unlikely that I would see the same 90 percent uniqueness in my 1.1 million person sample. In fact, my population needs less than one percent uniqueness to get 90 percent uniqueness in my sample. A much more likely conclusion from the data is that less than one percent of the population is unique on four credit card transactions. The key point here is that having 90 percent unique individuals in the sample data does not translate directly to 90 percent unique individuals in the population, and the discrepancy can be huge.
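A scaled-down simulation of this kind can be sketched in a few lines of Python. This is not the code behind the numbers above. It assumes, as described in the notes at the end of this post, a population made up only of uniques and doubles, and it shrinks the population to 220,000 people so that it runs quickly. The function name and its parameters are hypothetical.

```python
import random
from collections import Counter


def simulate_sample_uniqueness(pop_size, pop_unique_frac, sample_frac, seed=0):
    """Fraction of sampled people whose trace is unique within the sample.

    Assumption: everyone in the population is either unique or one of
    exactly two people sharing a trace (a "double").
    """
    rng = random.Random(seed)
    n_unique = round(pop_size * pop_unique_frac)
    n_pairs = (pop_size - n_unique) // 2

    # Represent each person by a trace id. Population uniques get their own id;
    # each pair of doubles shares one id.
    population = list(range(n_unique))
    for k in range(n_pairs):
        trace_id = n_unique + k
        population.extend((trace_id, trace_id))

    # Draw a simple random sample of people without replacement.
    sample = rng.sample(population, round(len(population) * sample_frac))

    # A sampled person is sample-unique if nobody else in the sample shares
    # their trace id.
    counts = Counter(sample)
    sample_unique = sum(1 for trace_id in sample if counts[trace_id] == 1)
    return sample_unique / len(sample)


if __name__ == "__main__":
    # 1 percent population uniqueness, 5 percent sample (1/100 of the real scale).
    frac = simulate_sample_uniqueness(220_000, 0.01, 0.05)
    print(f"sample uniqueness: {frac:.1%}")
```

Under these assumptions, a population with only 1 percent uniques produces a sample in which roughly 95 percent of people appear unique, because a person's double is rarely drawn into the same five percent sample. Other population structures (for example, groups of three or more people sharing a trace) would give different numbers.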
The authors of the study drew conclusions based on uniqueness in the sample, which inflates the re-identification risks, especially when the sample is as small as five percent. This is a basic disclosure control principle. The actual risk value needs to be computed from the population. The analysis in that article was incorrect and the estimates very likely exaggerated the re-identification risk.
If we, as a community, are going to have an informative debate about how to share financial and health data in a responsible way, we need to start by using appropriate risk metrics and use cases, and we need to be precise about the methods used. It seems that the hunt for the catchy headline is overriding the requirement to do sound and defensible analysis. The stakes are high. More sophistication and maturity in dealing with privacy issues are needed.
 In this simulated data set the individuals who were not unique in the population were doubles. This means that there were two people with exactly the same transaction trace. Therefore, the simulated populations had only uniques and doubles.
 The caveat is that we can only go by what was in the published article, and the article had no threat modeling. However, our statement holds under the most plausible re-identification attack use case.