In the early 2000s, the spread of internet access made it more likely than ever that individuals could be identified from population demographic data. So the U.S. Census Bureau turned to an emerging privacy approach: synthetic data.

Some argue the algorithmic techniques used to develop privacy-secure synthetic datasets go beyond traditional deidentification methods. Today, along with the Census Bureau, clinical researchers, autonomous vehicle system developers and banks use these artificial datasets, which mimic the statistical properties of real data.

In many cases, synthetic data is built from existing data by filtering it through machine learning models. Real data representing real individuals flows in, and fake data mimicking individuals with corresponding characteristics flows out.

When data scientists at the Census Bureau began exploring synthetic data methods, adoption of the internet had made deidentified, open-source data on U.S. residents, their households and businesses more accessible than in the past.

Especially concerning, census-block-level information was now widely available. In rural areas, a census block could represent data associated with as few as one house, so simply stripping names, addresses and phone numbers from that information might not be enough to prevent exposure of individuals.

“There was pretty widespread angst” among statisticians, said John Abowd, the bureau’s associate director for research and methodology and chief scientist. The hand-wringing led to a “gradual awakening” that prompted the agency to begin developing synthetic data methods, he said.

Synthetic data built from the real data preserves privacy while providing information that is still relevant for research purposes, Abowd said: “The basic idea is to try to get a model that accurately produces an image of the confidential data.”

The plan for the 2020 census is to produce a synthetic image of that original data. The bureau also produces OnTheMap, a web-based mapping and reporting application that provides synthetic data showing where workers are employed and where they live, along with reports on age, earnings, industry distributions, race, ethnicity, educational attainment and sex.

Of course, the real census data is still locked away, too, Abowd said: “We have a copy and the National Archives have a copy of the confidential microdata.”

Synthetic data for financial services and AI

Advances in computing power, which enabled the explosion in artificial intelligence development, mean synthetic data can now be produced faster and more cost-efficiently than ever.

“This is really only possible now as opposed to a few years ago,” said Mijail Benitez, head of sales at Cvedia, which is among a growing group of companies specializing in synthetic data.

Rather than building synthetic data from real-world datasets, however, Cvedia’s team of designers and data scientists creates data from scratch where none existed before. It might be used to help train systems that operate autonomous maritime vessels or drones, for instance.

“How do you actually go about collecting that type of data?” asked Rebecca Banks, head of marketing for Cvedia. “The short answer to that is you don’t.”

But many synthetic data companies do produce what is sometimes called “digital twin” data, the stuff made from pre-existing information. Facteus is one. The company takes purchase transactions and other data from its payment processor, credit card and banking clients and turns it into synthetic data.

The synthetic data is used for business insight reports showing where people spent before buying lunch at Panera or after a Starbucks pitstop, for example. Financial services firms also use the service to build synthetic data for internal fraud analysis, ensuring employees cannot defy protocol and obtain personal information while performing security tests.

Traditional data privacy techniques strip names, addresses, ID numbers and other personally identifiable elements from raw data, or they might turn a specific age into an age range or truncate Social Security numbers.
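
The traditional techniques described above can be sketched in a few lines of Python. This is only an illustration, not any agency's or vendor's actual process: the field names (name, address, id_number, age, ssn), the 10-year age bands and the sample record are all assumptions made up for the example.

```python
def deidentify(record, direct_identifiers=("name", "address", "id_number"), age_bin=10):
    """Return a copy of a record with direct identifiers removed,
    age generalized to a range and the SSN truncated to its last four digits.
    All field names here are hypothetical."""
    # Drop fields that identify a person directly.
    cleaned = {k: v for k, v in record.items() if k not in direct_identifiers}

    # Turn a specific age into an age range, e.g. 37 becomes "30-39".
    if "age" in cleaned:
        low = (cleaned["age"] // age_bin) * age_bin
        cleaned["age"] = f"{low}-{low + age_bin - 1}"

    # Keep only the last four digits of the Social Security number.
    if "ssn" in cleaned:
        cleaned["ssn"] = "***-**-" + cleaned["ssn"][-4:]

    return cleaned


# A made-up record; the SSN is deliberately an invalid placeholder.
record = {
    "name": "Jane Doe",
    "address": "1 Main St",
    "id_number": "A123",
    "age": 37,
    "ssn": "000-00-1234",
    "purchase": 7.50,
}
print(deidentify(record))
```

Note that the purchase amount passes through untouched, which is the kind of residual detail the noise-based approaches described next are meant to address.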

Facteus and other synthetic data firms go further by adding what data scientists call “data noise.”

For instance, the Facteus system alters the numerical details of transactions so they cannot be reverse-engineered and potentially linked back to an individual. A purchase of a sandwich for $7.50 at 2:17 p.m. might be changed to show a similar purchase costing $7.62 at 3:32 p.m.
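
A rough sketch of that kind of perturbation follows. It is not Facteus's algorithm, which the article does not describe in detail. The function name, the 5% cap on price changes, the 90-minute cap on time shifts and the sample date are all assumptions chosen so the sandwich example above would fall within range.

```python
import random
from datetime import datetime, timedelta


def perturb_transaction(amount, timestamp, rng, max_amount_pct=0.05, max_minutes=90):
    """Return a noisy copy of a transaction's amount and timestamp.
    The bounds are illustrative assumptions, not real product settings."""
    # Scale the amount by a random factor within +/- max_amount_pct.
    factor = 1 + rng.uniform(-max_amount_pct, max_amount_pct)
    noisy_amount = round(amount * factor, 2)

    # Shift the time by a random whole number of minutes within +/- max_minutes.
    shift = timedelta(minutes=rng.randint(-max_minutes, max_minutes))
    return noisy_amount, timestamp + shift


rng = random.Random(42)  # seeded only so the sketch is repeatable
original = (7.50, datetime(2020, 1, 15, 14, 17))  # hypothetical purchase
noisy = perturb_transaction(original[0], original[1], rng)
print(original, "->", noisy)
```

Bounded random noise like this keeps each record plausible for aggregate analysis, but on its own it carries no formal privacy guarantee.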

New uses of clinical research data

Replica Analytics operates in another highly regulated industry: clinical research. The company’s clients use its machine learning technology to transform sensitive clinical trial data into synthetic data reflecting the real thing.

In some cases, pharmaceutical firms or medical device makers have clinical trial data they want to revisit for secondary analysis in the hopes of finding new uses for their drugs or products. That requires digging into the data to find cohorts of study participants who are relevant to that new analysis. But consent requirements for using the real, identifiable data for new purposes often deter researchers, said Khaled El Emam, CEO of Replica Analytics.

“The obligations in order to reuse that data are difficult to meet,” he said. “You have to re-consent patients for examinations, which is not feasible.”

Part of the firm’s privacy process involves simulating possible attacks that could re-identify its synthetic data. An attack might involve tying information from media reports, or information held internally by law enforcement or insurance companies, to the synthetic data to expose identities. “Think of it as simulating all possible attacks,” El Emam said.
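
One very simple version of such a test can be sketched as a linkage check: take outside records an attacker might hold and count how many line up with exactly one synthetic record on shared attributes. This is an illustration only, not Replica Analytics' method. The function name, the attributes used for matching (zip, age_range, sex) and the sample rows are assumptions.

```python
from collections import Counter


def unique_match_rate(synthetic, external, quasi_identifiers):
    """Share of external records that match exactly one synthetic record
    on the given attributes. A hypothetical, simplified risk signal."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    # Count how many synthetic rows share each combination of attributes.
    counts = Counter(key(row) for row in synthetic)

    # An external record is a "hit" when its combination appears exactly once.
    hits = sum(1 for row in external if counts.get(key(row), 0) == 1)
    return hits / len(external) if external else 0.0


# Made-up synthetic rows and made-up outside information.
synthetic = [
    {"zip": "20740", "age_range": "30-39", "sex": "F"},
    {"zip": "20740", "age_range": "30-39", "sex": "F"},
    {"zip": "20742", "age_range": "60-69", "sex": "M"},
]
external = [
    {"name": "Person A", "zip": "20742", "age_range": "60-69", "sex": "M"},
    {"name": "Person B", "zip": "20740", "age_range": "30-39", "sex": "F"},
]
rate = unique_match_rate(synthetic, external, ["zip", "age_range", "sex"])
print(f"Unique match rate: {rate:.0%}")
```

A unique match does not prove that a synthetic row corresponds to a real person. It only flags combinations rare enough to deserve a closer look.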

Nathan Reitinger, a security and privacy attorney and PhD at the University of Maryland’s Department of Computer Science, said these sorts of adversarial attacks are good practice. Ultimately, however, he said that even though synthetic data approaches improve upon traditional privacy methods, they aren’t fail-safe when it comes to safeguarding confidentiality.

“If you’re wanting a theoretical guarantee, you’re not going to have it,” Reitinger said.
