Editor's Note: The company profiled in this article, Shroudbase, now operates as LeapYear Technologies and has substantially changed its business model, focusing on a different type of differential privacy. You should therefore understand that the information in this article is no longer valid in terms of the company and its operations.
We have left the article up in modified format, however, as it receives a fair amount of traffic from people looking to learn about differential privacy and we believe it still offers value on that front.
In some corners of the privacy world, de-identification has become something akin to a white whale: always just out of reach. Just this past summer, Princeton’s Arvind Narayanan and Edward Felten made waves with their paper, “No Silver Bullet: De-Identification Still Doesn’t Work.”
And while Privacy Analytics’ Khaled El Emam and Luk Arbuckle countered swiftly that “de-identification is a key solution for sharing data responsibly,” there remains an unease among those looking to use big data analytics for any number of purposes.
Into the breach steps the start-up Shroudbase, a firm based in Philadelphia that is currently pitching “differential privacy” as a service and hopes soon to offer it as a software package.
How does it work? While de-identification involves removing information from a data set, or replacing entries with numbers or hashes, differential privacy is “completely different,” said Shroudbase CEO Ishaan Nerurkar, who studied at the University of Pennsylvania in the Singh Program in Networked and Social Systems Engineering. “We keep almost all the information in the data, but we actually distort the database” via algorithms that introduce random noise. “So, if I ask a question of the database … that answer is going to be slightly wrong, but slightly wrong in such a way that isn’t statistically meaningful. I can tell the difference between two individuals, but in a way that’s not going to matter.”
Obviously, this doesn’t work with a data set of 10 or 20 people, but Nerurkar said the technique works well starting at about 50,000 entries in a structured database, either SQL or Excel.
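To make the idea concrete, here is a minimal Python sketch of the Laplace mechanism, a standard textbook way to achieve differential privacy for counting queries. It is not Shroudbase's algorithm, which the article does not describe, and it perturbs the answer to a query rather than the stored data. The function names, the epsilon value and the synthetic records are all assumptions chosen for illustration, and a production system would need a cryptographically secure noise source rather than Python's `random` module.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Answer 'how many records satisfy predicate?' with Laplace noise added."""
    true_count = sum(1 for r in records if predicate(r))
    # One person joining or leaving changes a count by at most 1 (sensitivity 1),
    # so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    return true_count + laplace_noise(1.0 / epsilon, rng)


def relative_error(n, epsilon, rng):
    """Relative error of one noisy count over n synthetic (made-up) records."""
    records = [{"has_condition": i % 3 == 0} for i in range(n)]
    true_count = sum(1 for r in records if r["has_condition"])
    answer = noisy_count(records, lambda r: r["has_condition"], epsilon, rng)
    return abs(answer - true_count) / true_count


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only so the demo is repeatable
    for n in (20, 50_000):
        print(n, relative_error(n, epsilon=0.1, rng=rng))
```

With epsilon set to 0.1, the noise has a typical magnitude of about 10. That can swamp a count taken over 20 records but is negligible next to a count in the thousands, which is consistent with the point that the approach suits large data sets.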
“The algorithms involved in [differential privacy] make it mathematically impossible to identify any individual in a privatized database and also ensure that aggregate analysis of a dataset is virtually unaffected by this protection,” he said. The company has consulted with lawyers and HIPAA experts to make sure the technique complies with current regulations, and it is now ready to start working in the commercial sphere.
“The aim of differential privacy is future-proofing,” Nerurkar said. “There are a lot of cases where datasets were exploited through legitimate querying, not hacking, and then looking at other data sets with different information to gain insights via a linkage attack.” He noted that Netflix’s data release resulted in this kind of breach, as did Massachusetts’ health info release, when Latanya Sweeney famously re-identified the governor.
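A linkage attack of the kind described here can be sketched in a few lines of Python. The example below is an illustration only: every record, name and field in it is invented, and the `link` function is a hypothetical helper, not a reconstruction of any real attack. It joins a "de-identified" release to a public list on shared quasi-identifiers (here ZIP code, birth date and sex) and reports the rows whose combination matches exactly one named person.

```python
from collections import defaultdict


def link(released, public, keys):
    """Return (name, released_row) pairs where the quasi-identifiers in
    `keys` match exactly one record in the public list."""
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person)

    matches = []
    for row in released:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the row
            matches.append((candidates[0]["name"], row))
    return matches


# Invented data: names were removed from the release, but quasi-identifiers remain.
released = [
    {"zip": "02138", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1982-06-02", "sex": "F", "diagnosis": "asthma"},
]
# Invented public list (for example, a voter roll) that includes names.
public = [
    {"name": "A. Example", "zip": "02138", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "B. Sample", "zip": "02139", "birth_date": "1982-06-02", "sex": "F"},
    {"name": "C. Sample", "zip": "02139", "birth_date": "1982-06-02", "sex": "F"},
]

reidentified = link(released, public, keys=("zip", "birth_date", "sex"))
for name, row in reidentified:
    print(name, "->", row["diagnosis"])
```

In this toy data the first released row is re-identified because only one person in the public list shares its ZIP code, birth date and sex, while the second row stays ambiguous because two people do. Neither data set is "hacked"; the exposure comes purely from combining them.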
“We want to make sure that, regardless of what information they might get in the future, it’s impossible to identify the individual with high probability,” he said.