
Privacy Tech | Can synthetic data help organizations respond to 'Schrems II'?

Synthetic data has the potential to help address some of the most intractable privacy and security compliance challenges related to data analytics. Advances in machine learning and the availability of large and detailed datasets create the potential for new scientific breakthroughs and the development of new insights that can have enormous societal benefits. But legitimate privacy concerns, combined with trends toward broader and stricter privacy and data security laws, have made achieving those benefits costly and risky.

One area of privacy regulation that contributes to those costs and risks relates to the transfer of personal data across national borders. Most notably, the EU General Data Protection Regulation imposes restrictions on the export of personal data outside of the European Economic Area. 

Those restrictions were recently highlighted and amplified by a decision of the Court of Justice of the European Union invalidating the EU-U.S. Privacy Shield agreement. Privacy Shield was the basis under which thousands of U.S. and global organizations transferred personal data from the EEA to the U.S. While leaving other lawful bases for transfers largely intact, the CJEU decision called into question the validity of some mechanisms, such as so-called standard contractual clauses (SCCs), under some circumstances. In particular, the use of SCCs will now require a case-by-case assessment of their sufficiency to provide adequate protection, and additional safeguards may be necessary.

As companies, research institutions and other organizations struggle to understand and address the uncertainty created by the CJEU ruling, the creation and use of synthetic data can provide a practical and reliable path forward. 

What is synthetic data?

The creation of a synthetic dataset starts with a “real” dataset, which will typically include personal data of real people. A generative model is built from the real dataset, and that model is then used to generate the synthetic dataset. The resulting synthetic dataset closely reproduces the statistical properties of the real data. But it is not real data. It is not data about or related to any real individual person or people. A single record in a synthetic dataset does not correspond to an individual or record in the real dataset.
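The idea can be illustrated with a deliberately simple sketch. It assumes a toy two-column dataset (age and income, both invented for this example) and uses a bivariate normal distribution as a stand-in generative model. Real synthesizers rely on much richer machine learning models, and none of the function names below come from a particular product.

```python
import math
import random


def fit_gaussian(records):
    """Estimate means, variances and covariance of two-column numeric records."""
    n = len(records)
    mean_x = sum(r[0] for r in records) / n
    mean_y = sum(r[1] for r in records) / n
    var_x = sum((r[0] - mean_x) ** 2 for r in records) / (n - 1)
    var_y = sum((r[1] - mean_y) ** 2 for r in records) / (n - 1)
    cov_xy = sum((r[0] - mean_x) * (r[1] - mean_y) for r in records) / (n - 1)
    return {"mean": (mean_x, mean_y), "var": (var_x, var_y), "cov": cov_xy}


def sample_synthetic(model, count, rng):
    """Draw new records from the fitted model using a 2x2 Cholesky factor."""
    mean_x, mean_y = model["mean"]
    var_x, var_y = model["var"]
    l11 = math.sqrt(var_x)
    l21 = model["cov"] / l11
    l22 = math.sqrt(max(var_y - l21 ** 2, 0.0))
    samples = []
    for _ in range(count):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        samples.append((mean_x + l11 * z1, mean_y + l21 * z1 + l22 * z2))
    return samples


def correlation(model):
    """Pearson correlation implied by a fitted model."""
    return model["cov"] / math.sqrt(model["var"][0] * model["var"][1])


def make_toy_real_data(count, rng):
    """Hypothetical 'real' data: age and an income that depends on age."""
    records = []
    for _ in range(count):
        age = rng.gauss(45, 12)
        income = 20000 + 900 * age + rng.gauss(0, 8000)
        records.append((age, income))
    return records


rng = random.Random(0)
real = make_toy_real_data(2000, rng)

model = fit_gaussian(real)                      # built from the real data
synthetic = sample_synthetic(model, 2000, rng)  # generated from the model only
synthetic_model = fit_gaussian(synthetic)

print("real mean age:", round(model["mean"][0], 2))
print("synthetic mean age:", round(synthetic_model["mean"][0], 2))
print("real correlation:", round(correlation(model), 3))
print("synthetic correlation:", round(correlation(synthetic_model), 3))
```

The synthetic records are drawn from the fitted distribution rather than copied or perturbed from individual rows, which is why the aggregate statistics line up while no synthetic row maps back to a particular real row.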

Further, to ensure that the resulting synthetic dataset does not inadvertently reveal information about a real person from the original dataset, a privacy assurance process evaluates the privacy risk of the synthetic data by comparing the real and synthetic data to assess and remove any such risk. Thus, for any given synthetic dataset, the conclusion that it does not identify or reveal information about any specific individual is testable and verifiable through statistical analysis.
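One simple form such a comparison could take is a distance-to-closest-record check, sketched below. This is an illustrative assumption rather than the specific privacy assurance process the article refers to: the records, the threshold value and the function names are all hypothetical, and a real assessment would scale the features and use more than one risk metric.

```python
import math


def nearest_distance(record, dataset):
    """Euclidean distance from one record to its closest neighbour in dataset."""
    return min(math.dist(record, other) for other in dataset)


def flag_risky_records(real, synthetic, threshold):
    """Indexes of synthetic records lying within `threshold` of a real record."""
    return [
        index
        for index, record in enumerate(synthetic)
        if nearest_distance(record, real) <= threshold
    ]


# Hypothetical records: (age in years, income in thousands).
real = [(34.0, 52.0), (51.0, 88.0), (29.0, 41.0)]
synthetic = [(34.0, 52.0), (40.5, 63.0), (50.9, 88.1), (22.0, 35.0)]

# Assumed threshold; in practice it would be calibrated, not hard-coded.
flagged = flag_risky_records(real, synthetic, threshold=0.5)

# Drop the flagged records before the synthetic dataset is released.
released = [r for i, r in enumerate(synthetic) if i not in set(flagged)]

print("flagged indexes:", flagged)
print("records kept:", released)
```

Here the first synthetic record is an exact copy of a real one and the third is a near copy, so both are flagged and removed, while the remaining two records sit far from every real record.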

These properties of synthetic data mean that it differs from what is traditionally thought of as deidentification of data. Deidentification is a means of altering a dataset to remove, mask or transform direct and indirect identifiers. But the deidentified data is still real data related to real individuals. The process has just made it less likely that any individual in the dataset can be identified from the data. Depending on the method and strength of the deidentification, it can be an excellent risk-mitigation measure. But under some privacy laws, deidentified data may still be treated as personal data subject to the requirements of those laws.

How are organizations responding to the CJEU decision?

For organizations that relied on Privacy Shield as the basis for transferring personal data from Europe to the U.S., the CJEU ruling has left them scrambling to find new ways to comply with European data transfer restrictions. And organizations that did not rely on Privacy Shield are also left with a need to reassess the sufficiency of the data transfer mechanisms they use.

Some organizations are exploring the possibility of storing and processing more of their personal information in Europe or in one of the handful of jurisdictions that the European Commission has deemed to provide adequate protection.

However, while such measures can help to reduce the volume of data transfers and thereby reduce the overall risk under data transfer rules, they will rarely be a complete solution. Most organizations face some scenarios in which there is a need to transfer European data out of Europe or provide individuals located outside of Europe with access to data stored in Europe. Thus, organizations will need to adopt additional measures to ensure that these scenarios are carried out lawfully.

Many organizations are moving to a greater reliance on SCCs to fill gaps left by the invalidation of Privacy Shield. But in doing so, these organizations must take into account the parts of the CJEU ruling that held SCCs must be evaluated on a case-by-case basis to ensure they provide adequate protection for the rights of EU data subjects. If they do not, organizations must adopt additional safeguards. And given that the focus of the CJEU opinion was largely on broad government powers to access personal data for national security and law enforcement purposes, those safeguards should serve to address and mitigate those risks.

Therefore, these organizations that are relying on SCCs should be evaluating whether they need to adopt any such additional safeguards to supplement the protections afforded by the terms of the SCCs. Such safeguards might include stronger contractual assurances regarding how the organization will respond to requests for personal data from law enforcement or other government agencies, using end-to-end encryption so that any data intercepted will not pose a threat to the rights of data subjects, adopting strong deidentification techniques to make it less likely that governments will seek to obtain or use the data, or publishing “transparency reports” to provide more information about the quantity and type of government requests the organization receives. 

All these safeguards can have significant privacy benefits for data subjects. However, none is likely to provide a complete solution or be practicable in all situations. Thus, most organizations will have to adopt a range of data transfer solutions and additional safeguards to ensure that the personal data they export from Europe is transferred lawfully. In this context, organizations should explore adding synthetic data as one of the strategies they employ.

How does synthetic data help organizations respond to 'Schrems II'?

The nature of synthetic data makes it a particularly useful tool to address the legal uncertainties and risks created by the CJEU decision. Because an original set of “real” personal data is used in the creation and evaluation of a synthetic dataset, the process of creating and evaluating a synthetic dataset is within the scope of data protection law.

However, the resulting synthetic dataset does not relate to or reveal information about real individual persons. Thus, a properly created and verified synthetic dataset should not be considered personal data. As a result, such data can be freely distributed, including publicly released, and used globally for analysis and research.

If the generation of a synthetic dataset takes place in Europe or another “adequate” jurisdiction, then there is no need to transfer the real dataset to the U.S. or another non-adequate jurisdiction. Thus, there is no export of personal data that requires the data controller to find a substitute for the invalidated Privacy Shield, use SCCs or rely on another basis for making the data transfer lawful. 

If the generation of the synthetic dataset takes place outside of Europe and not in an “adequate” jurisdiction, then there will be an export of personal data that must be addressed. However, if that export involves SCCs, plus additional safeguards, it could be compatible with the CJEU ruling. Safeguards might include that the export is temporary and data will be retained outside Europe for only as long as it takes to generate and validate the synthetic dataset, that the use outside Europe is limited to the generation of synthetic data, and that such generation takes place in a secure environment. 

Even in the unlikely event that the resulting synthetic dataset could somehow be considered to contain personal data, the export of the synthetic data could still be fully compatible with the CJEU decision. It might be necessary to put in place SCCs to apply to the export, but there would be no question that SCCs would be sufficient in this case. 

The transformation of the data into a synthetic dataset would be an extremely strong example of an “additional safeguard” to supplement the protections of the SCCs. Even under the most expansive view of “personal data” imaginable, data exporters could take great comfort in exporting the synthetic data to the U.S. or anywhere else. 

Conclusion

Synthetic data may not be an appropriate tool in every data export scenario. In some cases, real personal data about real data subjects must be transferred outside of Europe. After all, a global company cannot process payroll for European employees based on synthetic data.

But for scenarios like research and data analytics, synthetic data offers a powerful tool that can protect individual privacy and address the legal uncertainties caused by the CJEU decision. It can allow organizations to limit exposure to real datasets that contain personal data, minimizing or eliminating the need to transfer personal data to “non-adequate” jurisdictions, thereby dramatically reducing their legal risks. At the same time, it allows organizations to more freely share and transfer synthetic datasets that can unlock insights, expand knowledge and generate value without creating those same risks. 

As organizations investigate new compliance approaches and new data transfer mechanisms in response to the CJEU decision, they should consider whether synthetic data can play a part in their data and compliance strategies.

Photo by Lukas Blazek on Unsplash



4 Comments

  • comment Srinivas Avasarala • Sep 15, 2020
    Hi Khaled & Mike, Interesting thoughts. Can you elaborate on the generative model you mention? Curious how the synthetic data can have the same statistical properties, yet not have a 1:1 correspondence between records in the synthetic and real data sets.
  • comment Khaled El Emam • Sep 18, 2020
    There are different types of generative models, but essentially they are machine learning or deep learning models that capture the characteristics of the original data. We have an online tutorial that describes some of the concepts: https://replica-analytics.com/synthesis-tutorials
    
    There is also an introductory book that may provide useful background on the topic: https://learning.oreilly.com/library/view/practical-synthetic-data/9781492072737/
  • comment Khaled El Emam • Sep 18, 2020
    The basic idea is that one builds a machine learning or deep learning model which characterizes the dataset, and then uses it to generate new data. It is the same idea as generating deep fakes (fake images of people, for example). There is more background information in our tutorials here: https://replica-analytics.com/synthesis-tutorials
    
    There is also a book that may provide some additional details on the topic: https://learning.oreilly.com/library/view/practical-synthetic-data/9781492072737/
  • comment Zameer Razack • Sep 26, 2020
    Hi Srinivas, I stumbled across a blog post that explains how you can evaluate the accuracy of these synthetic data generators when it comes to their statistical properties: https://mostly.ai/2020/09/25/the-worlds-most-accurate-synthetic-data-platform/
    
    Let's say that you have a dataset. If you split it into a training dataset and a holdout dataset, and then train the synthetic data generator on the training dataset, ideally the statistical properties of the resulting synthetic data would deviate from the original about as much as those of your holdout dataset do.
    
    So, according to this blog, it's not 100% 1:1, but a perfect synthetic data generator can keep the variance well within this error margin.
    
    Pretty neat find, eh? :-)
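
The holdout comparison described in the last comment can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method used by the linked blog: the single numeric column is invented, the "generator" is just a normal distribution fitted to the training split, and `summary_gap` is a hypothetical helper that compares only the mean and standard deviation.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical "real" column, shuffled and split into training and holdout.
real = [rng.gauss(50, 10) for _ in range(4000)]
rng.shuffle(real)
train, holdout = real[:2000], real[2000:]

# Stand-in generator: a normal distribution fitted to the training split only.
mu = statistics.mean(train)
sigma = statistics.stdev(train)
synthetic = [rng.gauss(mu, sigma) for _ in range(len(holdout))]


def summary_gap(sample_a, sample_b):
    """Absolute gaps in mean and standard deviation between two samples."""
    return (
        abs(statistics.mean(sample_a) - statistics.mean(sample_b)),
        abs(statistics.stdev(sample_a) - statistics.stdev(sample_b)),
    )


# Baseline: how far real, unseen data drifts from the training data.
holdout_gap = summary_gap(train, holdout)
# Candidate: how far the synthetic data drifts from the training data.
synthetic_gap = summary_gap(train, synthetic)

print("holdout vs. training (mean gap, stdev gap):", holdout_gap)
print("synthetic vs. training (mean gap, stdev gap):", synthetic_gap)
```

If the synthetic gaps are of the same order as the holdout gaps, the generator is deviating from the training data about as much as fresh real data would, which is the benchmark the comment describes.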