The application of artificial intelligence and machine learning to solve today’s problems requires access to large amounts of data. One of the key obstacles analysts face is getting access to this data, an issue reflected in reports from the Government Accountability Office and the McKinsey Global Institute.

Synthetic data can help solve this data problem in a privacy-preserving manner.

What is synthetic data?

Data synthesis is an emerging privacy-enhancing technology that can enable access to realistic data: information that is synthetic but retains the statistical properties of an original dataset. At the same time, such information can be used and disclosed with reduced obligations under contemporary privacy statutes. Because synthetic data retains the statistical properties of the original data, there is an increasing number of use cases in which it can serve as a proxy for real data.

Synthetic data is created by taking an original (real) dataset and building a model, called the "synthesizer," that characterizes the distributions and relationships in that data. The synthesizer is typically an artificial neural network or other machine learning technique that learns these characteristics of the original data. Once the model is created, it can be used to generate synthetic data. Because the data is generated from the model, it has no 1:1 mapping to real data, so the likelihood of mapping synthetic records to real individuals would be very small and the data would generally not be considered personal information.
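The fit-then-generate workflow described above can be illustrated with a deliberately simple sketch. It assumes a two-column numeric dataset and swaps the neural network for a plain bivariate Gaussian model; the age and income-like columns, the function names and the sample sizes are all hypothetical, and a production synthesizer would be far more elaborate.

```python
import math
import random
import statistics


def fit_synthesizer(rows):
    """Learn the means, spreads and correlation of a two-column dataset."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return {"mean": (mean_x, mean_y), "sd": (sd_x, sd_y), "corr": cov / (sd_x * sd_y)}


def generate(model, n, rng):
    """Sample n new records from the fitted model, not from the real rows."""
    (mean_x, mean_y), (sd_x, sd_y), corr = model["mean"], model["sd"], model["corr"]
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        # Mix the two noise sources so the learned correlation is reproduced.
        y = mean_y + sd_y * (corr * z1 + math.sqrt(1 - corr ** 2) * z2)
        out.append((x, y))
    return out


# Hypothetical "real" data: an age column and a correlated income-like column.
rng = random.Random(0)
real = []
for _ in range(2000):
    age = rng.gauss(45, 12)
    real.append((age, 1000 * age + rng.gauss(0, 8000)))

model = fit_synthesizer(real)
synthetic = generate(model, 2000, rng)
synthetic_model = fit_synthesizer(synthetic)
print("correlation in real data:     ", round(model["corr"], 2))
print("correlation in synthetic data:", round(synthetic_model["corr"], 2))
```

Only the fitted parameters are passed to the generator, so each synthetic record is a fresh draw rather than a transformed copy of a real row. That is the sense in which there is no 1:1 mapping.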

Many different types of data can be synthesized, including images, video, audio, text and structured data. The main focus in this article is on the synthesis of structured data.

Even though data can be generated in this manner, that does not mean it cannot be personal information. If the synthesizer is overfit to the real data, the generated data will replicate the original real data. The synthesizer therefore has to be constructed in a manner that avoids such overfitting. A formal privacy assurance should also be performed on the synthesized data to validate that there is only a weak mapping between synthetic records and individuals.
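One simple heuristic for spotting an overfit synthesizer is to measure how close each synthetic record sits to its nearest real record. The sketch below is an illustration only, not a formal privacy assurance: the two-attribute records, the imitation of an overfit synthesizer by replaying real rows with tiny noise, and the near-copy threshold are all assumptions made for the example.

```python
import math
import random


def nearest_distance(record, dataset):
    """Euclidean distance from one record to its closest record in dataset."""
    return min(math.dist(record, other) for other in dataset)


def share_of_near_copies(synthetic, real, threshold):
    """Fraction of synthetic records lying within threshold of a real record."""
    close = sum(1 for s in synthetic if nearest_distance(s, real) <= threshold)
    return close / len(synthetic)


rng = random.Random(1)
# Hypothetical real records with two numeric attributes on a comparable scale.
real = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]

# An overfit synthesizer effectively memorizes: it replays real rows with tiny noise.
overfit = [(x + rng.gauss(0, 1e-6), y + rng.gauss(0, 1e-6)) for x, y in real]
# A well-behaved synthesizer draws fresh records from the learned distribution.
fresh = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]

threshold = 1e-3  # assumed cut-off for a "near copy"; would need tuning per dataset
overfit_share = share_of_near_copies(overfit, real, threshold)
fresh_share = share_of_near_copies(fresh, real, threshold)
print("near copies, overfit synthesizer:", overfit_share)
print("near copies, fresh samples:      ", fresh_share)
```

A high share of near copies suggests the generated data is replicating real records. A low share is encouraging but does not by itself establish that the data is non-personal.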

How can synthetic data be used?

Use cases for synthetic data include AI, machine learning and other data science projects that require realistic data for model building and validation, as well as software testing, technology evaluations and open data initiatives. For example, Public Health England has released synthetic cancer data that is used to design clinical studies and for exploratory research. The U.S. Census Bureau is planning to share tabulations from synthetic data for the 2020 census.

Because it is no longer personal data, the use and disclosure of synthetic data would not require an additional legal basis. As with other privacy-enhancing technologies that render information non-personal, producing it would not require additional consent. Under the EU General Data Protection Regulation, data synthesis would benefit from the same reasoning as the case for anonymization. Likewise, under HIPAA, creating synthetic data does not require additional authorization by a covered entity.

Why now?

While the field of data synthesis has been around for a few decades, only recently have deep learning methods enabled the generation of high-utility synthetic data. In practice, this means analytics results generated from synthetic data can now be quite similar to the results from the original data. The ability to handle increasingly complex datasets is also improving at a rapid pace.
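Utility in this sense can be checked by running the same analysis on both datasets and comparing the results. The following is a minimal sketch, assuming the analysis of interest is a simple least-squares slope; the data-generating relationship, the helper names and the use of an independent draw to stand in for a high-utility synthesizer are all hypothetical.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def relative_gap(real_value, synthetic_value):
    """Relative difference between a result on real data and on synthetic data."""
    return abs(synthetic_value - real_value) / abs(real_value)


def make_dataset(n, rng):
    # Hypothetical relationship shared by both datasets: y rises by about 2 per unit of x.
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


rng = random.Random(2)
real_x, real_y = make_dataset(1000, rng)
# A high-utility synthesizer is assumed here and imitated by an independent draw.
synth_x, synth_y = make_dataset(1000, rng)

real_slope = ols_slope(real_x, real_y)
synth_slope = ols_slope(synth_x, synth_y)
gap = relative_gap(real_slope, synth_slope)
print(f"slope on real data:      {real_slope:.3f}")
print(f"slope on synthetic data: {synth_slope:.3f}")
print(f"relative gap:            {gap:.2%}")
```

The smaller the gap on the analyses that matter to a project, the more comfortably the synthetic data can stand in for the original. What counts as an acceptable gap depends on the use case.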

There is also considerable ongoing research to enhance data synthesis methods. The expectation, therefore, is that the utility of synthetic data will improve and that the range of use cases in which data synthesis is a technically and economically attractive solution will grow over time.

Responsible use of synthetic data

Synthetic data does not relieve data consumers of all obligations toward data subjects. Synthetic data will generally retain the biases in the original data. It can also be used in inappropriate ways, and models built from it can be used to make discriminatory decisions. These concerns apply to any real or realistic dataset and mirror the issues with deidentified data.

Therefore, a governance overlay is still needed by organizations that use synthetic data.
