One of the joys of being an operational privacy professional is getting that random Friday-afternoon Slack message from someone on the product team asking, "Can we [insert questionable action] with our customer data?"
Our responses are thoughtful yet formulaic, often based on what's been outlined in the company's privacy policy. But wouldn't it be nice if we could simply say, "Of course! Go right ahead!" and insert a cute on-trend gif and win the rare praise from teams who often think of compliance as a roadblock?
Well, synthetic data may get us a few steps closer to becoming BFFs with those profit-center stakeholders.
What is synthetic data?
You've likely heard about this privacy-enhancing technology, or PET, in the recent artificial intelligence-centered public and scientific discourse, but you may not be aware that most privacy pros are older than the technology itself, which emerged in the 1990s.
And, unsurprisingly, synthetic data as a PET has only gained traction in the last few years, given advances in generative AI.
I know what you're thinking: "What about deidentification and anonymization?"
Indeed, while these privacy-preserving options have been formalized in the California Consumer Privacy Act and EU General Data Protection Regulation, respectively, few product managers or data scientists have any desire to adopt them, given the decreased utility and statistical relevance of the resulting data.
To better understand how companies can practically operationalize synthetic data tooling, I reached out to Gonçalo Martins Ribeiro, a Portugal native/San Francisco transplant and founder of YData, a start-up pioneering commercial adoption of synthetic data generation, for some elucidation (and to practice my very limited Portuguese). Vamos a isso!
Synthetic data versus real data
As the term suggests, synthetic data isn't, well, real. It's data generated through computer programs instead of data collected from people and events happening in the real world.
That said, from a business perspective, synthetic data is considered the most useful when it's derived from "real" data, as it maintains the same statistical properties as the source data.
And in order to maintain this "fidelity," synthetic data isn't created by any run-of-the-mill computer program.
"The computer programs which generate synthetic data model the original statistical distributions and structure of real datasets. So, even though the output data might be 'fake,' it still maintains its utility for extracting insights and training models," Ribeiro said.
I find this point critical — we know that our stakeholders rely on "real" data to support feature development, analytics, testing, forecasting and other vital business-supporting activities, so ensuring the continued usefulness of data is vital.
Ribeiro concurred. "Recent advances in AI have enabled the creation of even more realistic synthetic datasets, with deep learning models leading the innovation in this field. These models can be automatically trained on available data, and then be used to generate new, unlimited synthetic data samples."
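To make the idea of modeling "the original statistical distributions" concrete, here is a minimal Python sketch. It is my own illustration, not YData's method, and it is far simpler than the deep learning generators Ribeiro describes: it assumes two roughly bell-shaped numeric columns, fits a bivariate Gaussian to them and samples new rows from the fitted model. The "real" data is itself simulated toy data, and the column and function names are invented for the example.

```python
import math
import random
from statistics import mean


def fit_gaussian(rows):
    """Estimate the means, standard deviations and correlation of two columns."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = mean(xs), mean(ys)
    n = len(rows)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return {
        "mean_x": mx, "mean_y": my,
        "sd_x": math.sqrt(var_x), "sd_y": math.sqrt(var_y),
        "rho": cov / math.sqrt(var_x * var_y),
    }


def sample_synthetic(model, n, rng):
    """Draw n new rows from the fitted bivariate Gaussian."""
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mean_x"] + model["sd_x"] * z1
        y = model["mean_y"] + model["sd_y"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)

# Toy stand-in for "real" data: (profile clicks, ad clicks) per user.
real_rows = []
for _ in range(1000):
    profile_clicks = rng.gauss(50, 10)
    ad_clicks = 0.3 * profile_clicks + rng.gauss(0, 2)
    real_rows.append((profile_clicks, ad_clicks))

real_model = fit_gaussian(real_rows)
synthetic_rows = sample_synthetic(real_model, 5000, rng)
synthetic_model = fit_gaussian(synthetic_rows)

print("real correlation:     ", round(real_model["rho"], 3))
print("synthetic correlation:", round(synthetic_model["rho"], 3))
```

The synthetic rows are drawn from the fitted model rather than copied from the source table, yet their means and correlation should track the originals closely. A model this simple only preserves what a Gaussian can express, which is one reason real tools use richer generators.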
Diving a little deeper, I asked Ribeiro for a more straightforward example of synthetic data in the context of privacy compliance.
"Imagine a dataset composed of data you'd typically collect about real social media users, for example the number of times they've clicked on a profile, who they've searched for, what types of ads interest them. These are essentially profiles of real people. Now imagine the ability to build profiles of users who don't actually exist, but that still consist of statistically relevant data with full utility. This is the promise of synthetic data."
Practical use cases and benefits
Given the preserved statistical relevance and utility of synthetic data, its potential applications span most teams that work with data.
Product and engineering teams would no longer need to rely on production data to test, fix and deploy features, for example. Data scientists could use the same synthetic datasets to identify patterns and trends for product improvement and new data products, while marketing teams could better predict which ads drive engagement among real users, using synthetic data derived from those users' records.
In a full-circle fashion, AI-supported techniques to generate synthetic data allow companies to produce immense amounts of data, in turn boosting AI product development and supporting machine learning.
"Above and beyond business function-specific use cases, synthetic data can be used to augment or replace real-world data where data access or the collection of additional data is impossible, impractical, expensive or ethically problematic, thus saving time and money for organizations," Ribeiro said.
While this article isn't meant to focus on the ethics or sensitivities of using synthetic data to support use cases considered "ethically problematic," the potential for synthetic data to improve products and services geared toward children, identify racial bias problems, and replace or augment other more intricate and nuanced data processing certainly exists.
And benefits from a privacy compliance perspective almost seem too good to be true.
"As this data isn't proprietary or considered 'personally identifiable,' it can be shared, stored or even sold without any major blockers or privacy concerns due to legal frameworks," Ribeiro said. "What legal privacy considerations exist for data that's essentially not 'personal' anymore?"
I told him to hold my beer. But let's continue.
Operationalizing synthetic data
So, what's the first step to implementing synthetic data in your organization? It depends on the vendor you're considering.
"Most of the time it's as easy as uploading a CSV file with real customer data, or connecting the vendor's tooling to your own databases," Ribeiro said. "At YData Fabric specifically, we install our product in our customers' infrastructure, so their data never leaves their premises. They can maintain all their security measures."
As with any other vendor, Ribeiro and I agree that a rigorous proof of concept, or "POC," should be conducted. It should include stakeholders from the functions that support implementation of the tool itself, like engineering, legal and cybersecurity, as well as those who consume synthetic data outputs and reap the benefits, like product, data science and marketing.
"When starting a POC, make sure to have the right people involved that will understand the project outcomes and the success criteria," Ribeiro said. "It's still an emerging technology and a very technical subject, so it's critical that in-house technical personnel be available."
A silver bullet?
While the implementation of a tool that can generate data mimicking the qualities of real data seems like a no-brainer, there are, of course, risks.
A report commissioned by the Royal Society in May 2022 identified several risks, including that machine learning models used to generate synthetic data demonstrated the capability to memorize their training inputs, thereby exposing the training data and undermining the privacy benefit.
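One rough way to check for this kind of memorization is to measure how close each synthetic record sits to its nearest training record, a check sometimes called distance to closest record. The Python sketch below is illustrative only. The rows, the threshold and the function names are my own assumptions, it is much weaker than a real membership inference attack, and in practice the columns would need to be scaled to comparable ranges first.

```python
import math


def nearest_distance(row, reference_rows):
    """Euclidean distance from `row` to its closest row in `reference_rows`."""
    return min(math.dist(row, ref) for ref in reference_rows)


def flag_possible_copies(synthetic_rows, training_rows, threshold):
    """Return synthetic rows that lie within `threshold` of a training row."""
    return [
        row for row in synthetic_rows
        if nearest_distance(row, training_rows) <= threshold
    ]


# Invented toy rows: (age, monthly spend).
training_rows = [(34, 120.0), (51, 80.5), (27, 300.0)]
synthetic_rows = [(34, 120.0), (40, 150.2), (29, 250.0)]

flagged = flag_possible_copies(synthetic_rows, training_rows, threshold=1.0)
print(flagged)
```

Here the first synthetic row is an exact copy of a training row, so it is the only one flagged. A generator that produces many such near-duplicates is a signal to investigate further before sharing the output.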
"The concerns about machine learning models becoming liabilities themselves aren't unique to synthetic data generators," Ribeiro said. "Data science techniques, like ensuring there is maximum variability in the training data and conducting inference attacks to test data leakage, can be employed to minimize the potential exposure of sensitive information."
He is also keenly aware of the technology's limitations.
"Synthetic data is not a magical tool that automatically converts sensitive data into private-by-design data," he added. "It works just like any other AI-system: garbage in, garbage out. And, the bigger and more complex the datasets are, the more complex the resulting distributions will be, ultimately limiting the fidelity of the synthetic data."
Regulatory perspectives
In light of these limitations, it's no surprise data protection authorities have more tempered positions when it comes to synthetic data.
In a November 2023 blog post, Spain's DPA, the Agencia Española de Protección de Datos, made it clear that, while it considers the PET a "powerful tool," it would be necessary to "consider the regulatory provisions of the GDPR," given the "creation of synthetic data from real personal data would itself be a processing activity" under the regulation.
Further, in a June 2022 research paper, the U.K. Information Commissioner's Office unsurprisingly stressed that organizations considering use of synthetic data should continue to "align with the … principles of data minimization and purpose limitation," and "only include the properties needed to meet their specific use case, and nothing else."
So, when it comes to legal privacy considerations, it's abundantly clear to privacy pros that the initial generation, continued use and sharing of synthetic data will require the normal privacy due diligence. Think conducting privacy and data protection impact assessments, and perhaps even obtaining consent, unless you deem the processing to be compatible with the data's initial processing purposes.
Smiling in agreement, Ribeiro handed me back my beer.
Conclusion
While use of synthetic data can certainly decrease organizational privacy risk, and while its utility can indeed help you make friends in product, data science and marketing (it does get ever so lonely in compliance), thorough POCs to test use cases and assessments of legal privacy obligations are required.
And, while there aren't yet any standards or official guidelines around synthetic data, Ribeiro said they're coming.
"We're working with the Institute of Electrical and Electronics Engineers to create the very first ones."
In addition, it's worth considering layering synthetic data with other PETs in your privacy program's arsenal, like initially sanitizing the underlying data (anonymization) or adding calibrated noise to the original dataset (differential privacy) that will be used to generate the resulting synthetic data.
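For a sense of what "calibrated" means in the differential privacy case, here is a small Python sketch of the Laplace mechanism applied to a count. The noise scale is tied to two things: how much one person can change the answer (the sensitivity, which is 1 for a count) and a privacy budget, epsilon. The ages, the epsilon value and the function names are assumptions made for illustration, and a real deployment should rely on a vetted differential privacy library rather than hand-rolled sampling like this.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(records, predicate, epsilon, rng):
    """Return a noisy count of matching records.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(7)
ages = [23, 35, 41, 29, 62, 57, 33, 48]  # invented toy values

noisy = dp_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print("noisy count of users aged 40+:", round(noisy, 2))
```

A smaller epsilon means more noise and stronger privacy, while a larger epsilon means answers closer to the true count of 4. The same tradeoff carries over when the noise is applied to the data or model behind a synthetic data generator.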
Overall, ensuring your use cases for considering synthetic data are well thought out and cross-functionally supported is critical.
"You have to make sure you have the right data and the right expectations for the technology," Ribeiro concluded. "Synthetic data is very powerful when used properly, but can be a waste of money if poorly implemented and understood."