Despite executive orders and guidance encouraging the use of privacy-enhancing technologies (PETs) to reduce the risks posed by the rise of big data and artificial intelligence, institutional barriers impede widespread deployment throughout the U.S. government.
Released over the past year, the Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence and the U.S. National Institute of Standards and Technology's Guidelines for Evaluating Differential Privacy Guarantees both explore the adoption of PETs to mitigate increasing privacy risks.
For years, organizations have relied on removing or scrambling personal identifiers to anonymize data and reduce the risks related to public release or leakage. The increasing amount of data collected and made available, paired with advancements in processing, has made it possible to reidentify this "anonymized" data. Reidentification increases the risk that an individual's sensitive information will be linked back to them or revealed.
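To illustrate how such reidentification can happen in practice, the short Python sketch below joins a hypothetical "de-identified" health table with a public-style records table on shared quasi-identifiers; all column names and values are invented for illustration.

```python
import pandas as pd

# "Anonymized" records: direct identifiers removed, quasi-identifiers retained.
health = pd.DataFrame({
    "zip": ["20001", "20002", "20003"],
    "birth_date": ["1980-01-02", "1975-06-30", "1990-11-11"],
    "gender": ["F", "M", "F"],
    "diagnosis": ["diabetes", "asthma", "hypertension"],
})

# Publicly available auxiliary data containing names and the same quasi-identifiers.
voters = pd.DataFrame({
    "name": ["A. Smith", "B. Jones", "C. Lee"],
    "zip": ["20001", "20002", "20003"],
    "birth_date": ["1980-01-02", "1975-06-30", "1990-11-11"],
    "gender": ["F", "M", "F"],
})

# A simple join on quasi-identifiers relinks names to sensitive diagnoses.
reidentified = health.merge(voters, on=["zip", "birth_date", "gender"])
print(reidentified[["name", "diagnosis"]])
```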
A privacy-preserving algorithmic framework, differential privacy is one of the PETs explicitly referenced by these federal directives. Already studied, documented and deployed in the commercial sector, the framework applies to a wide variety of common government use cases but has not yet been broadly adopted in that space.
Differential privacy quantifies the privacy risk to individuals by comparing the output of an algorithm run with an individual's data to its output without that data. It employs quantitative parameters that bound how much an output can change with each additional data point, so by tuning these parameters, practitioners can set the allowable privacy risk of an algorithm. Unlike traditional privacy-protecting practices, such as masking, encryption or data swapping, differential privacy's guarantee is unconditional: it does not depend on assumptions about an attacker's auxiliary knowledge or computing power.
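In formal terms, this comparison is usually expressed through the standard (ε, δ) definition shown below; the notation is the textbook formulation of differential privacy, not language taken from the federal directives themselves.

```latex
% A randomized algorithm M is (epsilon, delta)-differentially private if,
% for every pair of datasets D and D' differing in one individual's record
% and every set of possible outputs S,
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
```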
It is used to make statistics, synthetic datasets and machine-learning models private, ensuring the outputs cannot be reverse engineered to identify individuals in source datasets. Eliminating the risk of reverse engineering enables wider release of data, which can lead to better-informed political and economic decisions across the federal government.
Differential privacy's quantitative parameters also increase product customization and efficiency: an organization can maintain different versions of the same algorithm, each tuned to a level of privacy matching the perceived risk of its audience. For example, if a health agency built an ML model trained on sensitive patient records, the agency could use differential privacy to release a "public" version of the model to third-party physicians, while agency researchers working in a secure environment retain a less private but more accurate version. This approach allows the organization to release assets more widely, increasing usage and overall return on investment.
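As a simplified stand-in for the health agency scenario, the Python sketch below releases the same count statistic at two hypothetical privacy levels using the Laplace mechanism; a real model release would instead rely on a differentially private training method such as DP-SGD, and the epsilon values here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def dp_count(true_count, epsilon):
    """Release a count with Laplace noise calibrated to a count's sensitivity of 1."""
    return true_count + rng.laplace(scale=1.0 / epsilon)

true_count = 1_240  # e.g., patients with a given diagnosis

# Stricter privacy (smaller epsilon) for the externally shared release ...
public_release = dp_count(true_count, epsilon=0.1)    # noisier, safer to share widely
# ... looser privacy (larger epsilon) for researchers in a secure environment.
internal_release = dp_count(true_count, epsilon=2.0)  # more accurate, held internally

print(public_release, internal_release)
```

Both releases are derived from the same underlying statistic; only the privacy parameter, and therefore the amount of noise, differs.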
Another potential use involves personnel at a federal defense organization working on modeling pipelines trained on classified data. These personnel must hold top-secret security clearances because raw top-secret data can be uncovered from the pipelines' derivative products. While requiring these clearances is crucial to security, rapidly growing the staff needed to address mission needs is challenging given the length and rigor of the clearance process.
Using differential privacy when creating the data derivatives eliminates the risk of exposing the top-secret data via reverse engineering. Reducing the classification level of these derivatives enables staff with lower clearances to work on the asset, resulting in quicker and more cost-effective staffing and increasing the amount of AI work the organization can undertake. These use cases are common across the federal government and demonstrate that differential privacy is a leading solution for enhancing privacy and increasing the sharing of data and ML models.
Despite the federal government encouraging wider adoption of differential privacy for a variety of use cases, actual usage in the space remains limited. The lack of broad adoption can be attributed to large-scale deployment challenges, unclear guidance and the advanced expertise required to use the technique.
Deployment challenges
Differential privacy is difficult to use for tasks with multiple goals and intermediate steps. For example, protecting sensitive data in the U.S. Census is challenging because census releases require numerous codependent statistics broken out by characteristics such as race, gender, income and age.
To maintain sufficient precision for data concerning smaller communities, the Census Bureau must greatly increase the privacy budget for the dataset. A larger budget raises the upper bound on how much information can be revealed about any individual, weakening the dataset's privacy protection. Because differential privacy struggles when many statistics are released, its utility in these scenarios is reduced.
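A rough sketch of why the budget grows, assuming the simplest accounting rule (basic sequential composition) and purely illustrative numbers; real deployments typically use tighter composition theorems, but the direction of the effect is the same.

```python
# Basic sequential composition: the epsilons spent on individual releases add up.
# Holding per-statistic accuracy (and therefore per-statistic epsilon) fixed,
# releasing more statistics inflates the total privacy budget.
per_statistic_epsilon = 0.1  # noise level chosen for acceptable per-statistic accuracy
num_statistics = 500         # e.g., counts by race, gender, income, age and geography

total_epsilon = per_statistic_epsilon * num_statistics
print(f"Total privacy budget consumed: {total_epsilon}")  # 50.0, a much weaker guarantee
```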
Unclear guidance
Federal government security and privacy stakeholders, including information system security officers, privacy analysts and system owners, remain skeptical that differential privacy can produce data products, such as dashboards, ML models and synthetic data, that cannot be reverse engineered to reveal unique identifiers.
This skepticism stems from guidance that, while encouraging the investigation of differential privacy, does not explicitly approve its use to reduce the sensitivity level, or classification, of underlying data. Clear guidance that lays out approval paths for specific differential privacy uses in privacy-critical applications would increase adoption and minimize skepticism.
Expertise needed
Differential privacy is a significant step forward from traditional deterministic privacy-preservation techniques, such as anonymization, masking and k-anonymity, which operate as a dataset preprocessing step prior to running algorithms for statistics or ML. These traditional methods are increasingly vulnerable to attack in an era of powerful processing and abundant auxiliary data.
While differential privacy offers protection against these privacy attacks, deploying it requires difficult-to-find personnel with significant expertise in probability and statistics. Because of its statistical nature, producing useful results often requires modifying an entire algorithm rather than preprocessing a dataset. In addition, current implementations of differential privacy expose functional programming application programming interfaces, which can be less accessible to data scientists than more common object-oriented frameworks. Partnering with leaders in differential privacy and investing in educational efforts that upskill federal government data scientists and privacy stakeholders will close this knowledge gap.
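To give a sense of that style, the sketch below mimics the compositional, function-chaining pattern common in open-source differential privacy tooling; it is plain Python written for illustration only and is not the API of any particular library.

```python
import numpy as np

rng = np.random.default_rng()

# Each "constructor" returns a small function; a private release is built by
# chaining these pieces together rather than configuring an object.
def make_clamp(lower, upper):
    return lambda data: np.clip(data, lower, upper)

def make_sum():
    return lambda data: float(np.sum(data))

def make_laplace(sensitivity, epsilon):
    return lambda value: value + rng.laplace(scale=sensitivity / epsilon)

def chain(*steps):
    def pipeline(data):
        for step in steps:
            data = step(data)
        return data
    return pipeline

# Differentially private sum of incomes: clamping to [0, 100000] bounds each
# individual's contribution, so noise is calibrated to sensitivity / epsilon.
dp_sum = chain(
    make_clamp(0, 100_000),
    make_sum(),
    make_laplace(sensitivity=100_000, epsilon=1.0),
)
print(dp_sum(np.array([52_000, 71_500, 38_000, 120_000])))
```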
While the outlined challenges limit the adoption of differential privacy in the federal government, incremental steps encouraged and piloted by the executive branch can alleviate these issues and pave the way for widespread use. Recommended steps include:
- Agencies should emphasize and fund proof-of-concept efforts for common use cases to navigate the challenges of large-scale deployment and create reusable roadmaps.
- Government leaders in the space, such as NIST, should draft and release clearer guidance outlining approved use cases, restrictions and benefits.
- Agencies with identified common use cases should partner with subject matter experts who have conducted deployments in a variety of mission spaces and possess the personnel necessary for training the government's data scientists and engineers.
Wider adoption of differential privacy across the federal government, in both the civilian and defense sectors, reduces risk and increases returns on government investment in data systems and products. Accordingly, the federal government must continue to address deployment challenges and invest in unlocking the significant benefits of differential privacy.
Adam McCormick, CIPP/US, is chief technologist and Amol Khanna is a lead machine learning scientist at Booz Allen Hamilton.