Privacy and responsible AI
https://iapp.org/news/a/privacy-and-responsible-ai

Artificial intelligence and machine learning are advancing at an unprecedented speed. This raises the question: How can AI/ML systems be used in a responsible and ethical way that deserves the trust of users and society?

Regulators, organizations, researchers and practitioners of various disciplines are all working toward answers. Privacy professionals, too, are increasingly getting involved in AI governance. They face the challenge of understanding the complex interplay between privacy regulations and broader developments around the responsible use of AI.

With government authorities stepping up enforcement, rulemaking and legislation in this complex arena, it is critical that organizations understand the privacy requirements that currently apply to AI, those on the horizon and the resources available to build a compliant data protection program for AI and ML applications.

Broader AI governance developments

Over the last few years, numerous good governance guidelines on trustworthy AI have been published. Most of these AI governance frameworks overlap in their definition of basic principles, which include privacy and data governance, accountability and auditability, robustness and security, transparency and explainability, fairness and non-discrimination, human oversight, and promotion of human values.

Some prominent examples of responsible AI frameworks by public organizations include UNESCO’s Recommendation on the Ethics of AI, China’s ethical guidelines for the use of AI, the Council of Europe’s report “Towards Regulation of AI Systems,” the OECD AI Principles, and the Ethics Guidelines for Trustworthy AI by the High-Level Expert Group on AI set up by the European Commission.

Beyond that, one can find countless self-regulatory initiatives by companies. Additionally, industry has joined forces with academia and nonprofits to advance the responsible use of AI, for example within the Partnership on AI or the Global Partnership on AI. Standardization bodies such as ISO/IEC, IEEE and NIST also offer guidance.

Present governance initiatives primarily take the form of declarations and are non-binding. At the same time, various existing privacy laws already regulate the responsible use of AI systems to a considerable extent.

The prominent role of privacy regulators in AI governance is also demonstrated by the release of the Model AI Governance Framework by Singapore’s Personal Data Protection Commission, the U.K. Information Commissioner’s Office’s extensive work on developing an AI Auditing Framework, and the release of the Guidance on the Ethical Development and Use of AI by the Office of the Privacy Commissioner for Personal Data of Hong Kong.

Privacy regulations and responsible AI

One of the regularly mentioned principles of responsible AI refers explicitly to “privacy.” This points to the obligation to apply general privacy principles, which are the backbone of privacy and data protection globally, to AI/ML systems that process personal data. This includes ensuring collection limitation, data quality, purpose specification, use limitation, accountability and individual participation.

Principles of trustworthy AI like transparency and explainability, fairness and non-discrimination, human oversight, robustness and security of data processing can regularly be related to specific individual rights and provisions of corresponding privacy laws.

With regard to the EU GDPR, this is the case for the right to explanation (Articles 1(1), 12, 13, 14, 15(1)(h), 22(3), Recital 71), the fairness principle (Article 5(1)(a), Recital 75), human oversight (Article 22), robustness (Article 5(1)(d)) and security of processing (Articles 5(1)(f), 25 and 32). Other privacy laws like China’s PIPL or the U.K. GDPR include similar provisions that relate to these responsible AI principles.

In the U.S., the Federal Trade Commission holds AI developers and companies using algorithms accountable under Section 5 of the FTC Act, the U.S. Fair Credit Reporting Act and the Equal Credit Opportunity Act. In its 2016 report and its guidelines from 2020 and 2021, the FTC leaves no doubt that the use of AI must be transparent, include explanations of algorithmic decision-making to consumers, and ensure that decisions are fair and empirically sound.

Being unaware of the compliance requirements for AI systems that stem from privacy regulations poses risks not only to affected individuals: companies can face hefty fines and even the forced deletion of data, models and algorithms.

Recent cases

At the end of last year, the Office of the Australian Information Commissioner found Clearview AI in violation of the Australian Privacy Act for the collection of images and biometric data without consent. Shortly after, and based on a joint investigation with Australia’s OAIC, the U.K. ICO announced its intent to impose a potential fine of over 17 million GBP for the same reason. Further, three Canadian privacy authorities as well as France's CNIL ordered Clearview AI to stop processing and delete the collected data.

European data protection authorities pursued several other cases of privacy violations by AI/ML systems in 2021.

In December 2021, the Dutch Data Protection Authority announced a fine of 2.75 million euros against the Dutch Tax and Customs Administration for a GDPR violation: an ML algorithm had processed the nationality of applicants in a discriminatory manner. The algorithm systematically flagged dual citizenship as high-risk, making claims by those individuals more likely to be marked as fraudulent.

In another landmark case from August 2021, Italy’s DPA, the Garante, fined food delivery companies Foodinho and Deliveroo around $3 million each for infringements of the GDPR due to a lack of transparency, fairness and accurate information regarding the algorithms used to manage their riders. The regulator also found the companies’ data minimization, security, and privacy by design and default protections lacking, and their data protection impact assessments missing.

In similar cases in early 2021, Amsterdam’s District Court found ride-sharing companies Uber and Ola Cabs didn't meet the transparency requirements under the GDPR and violated the right to demand human intervention. The investigation by the Dutch DPA is ongoing.

In the U.S., recent FTC orders made clear the stakes are high for not upholding privacy requirements in the development of models or algorithms.

In the matter of Everalbum, the FTC not only focused on the obligation to disclose the collection of biometric information to users and to obtain consent; it also demanded that the illegally obtained data, as well as models and algorithms developed using it, be deleted or destroyed.

With this, the FTC followed the approach of its 2019 Cambridge Analytica order, in which it had also required the deletion or destruction not only of the data in question but also of all work products, including any algorithms or equations that originated in whole or in part from the data.

Challenges in definition and practice

Although organizations can be held liable for failing to implement the responsible AI principles required by regulation, many open questions remain. While there is a lot of legal guidance around consent and appropriately informing users, the legal interpretation and practical implementation of requirements such as AI fairness and explainability are still in their infancy. The common ground is that there is no one-size-fits-all approach for assessing trustworthy AI principles across use cases.

AI explainability or transparency aims at opening the so-called “black box” of ML models. A whole field of AI research around explainable AI has emerged. There are many answers to what it means to explain an ML model. To explain individual predictions to regulators or users, outcome-based post-hoc local models are common. Here, a surrogate model (or metamodel) can be trained on a dataset consisting of samples and outputs of the black-box model to approximate its predictions. Any explanation should be adapted to the understanding of the receiver and include references to design choices of the system, as well as the rationale for deploying it.
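
To make the surrogate idea concrete, here is a minimal sketch in Python using scikit-learn. The dataset, the random-forest "black box" and the feature names are stand-ins chosen for illustration, not a reference to any particular system; it fits a global surrogate for simplicity, whereas tools like LIME fit a local surrogate around a single prediction.

```python
# Minimal sketch of a post-hoc surrogate explanation (assumes scikit-learn is installed).
# The "black box" here is a random forest; in practice it could be any opaque model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Train an interpretable surrogate on the black box's *predictions*, not the true labels,
# so the surrogate approximates the model's behavior rather than the underlying task.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# Fidelity says how faithfully the simple tree mimics the black box on this data;
# a low score means the resulting explanation should not be trusted.
fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
print(f"Surrogate fidelity to the black box: {fidelity:.2%}")
print(export_text(surrogate, feature_names=[f"feature_{i}" for i in range(6)]))
```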

AI fairness is another growing field covering a very complex issue. Bias, discrimination and fairness are highly context-specific. Numerous definitions of fairness exist and vary widely between and within the disciplines of mathematics, computer science, law, philosophy and economics. Some privacy regulators have issued clear guidelines. According to the ICO, fairness means personal data needs to be handled in ways people would reasonably expect and not used in ways that have unjustified adverse effects on them. Similarly, the FTC explains that under the FTC Act, a practice will be considered unfair if it causes more harm than good. Definitions of the fairness principle in the context of the GDPR, on the other hand, are still scarce. At the same time, many organizations are unsure how to avoid bias in practice. In general, bias can be addressed in pre-processing (prior to training the algorithm), in-processing (during model training), and post-processing (bias correction in predictions).
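
As one simplified illustration of how such a definition can be checked in practice, the sketch below computes a demographic parity gap (the difference in positive-decision rates between two groups). The data, group labels and the 10% tolerance are invented for the example and carry no legal weight.

```python
# Toy check of demographic parity: compare positive-decision rates across a protected attribute.
# Data and the 0.10 tolerance are illustrative assumptions, not a legal threshold.
import numpy as np

rng = np.random.default_rng(42)
group = rng.integers(0, 2, size=10_000)                            # protected attribute: 0 or 1
decision = rng.random(10_000) < np.where(group == 1, 0.55, 0.45)   # the model's approvals

rate_0 = decision[group == 0].mean()
rate_1 = decision[group == 1].mean()
parity_gap = abs(rate_1 - rate_0)

print(f"Approval rate group 0: {rate_0:.2%}, group 1: {rate_1:.2%}")
print(f"Demographic parity difference: {parity_gap:.2%}")
if parity_gap > 0.10:
    print("Gap exceeds the illustrative tolerance; investigate before deployment.")
```

Pre-, in- and post-processing mitigations would then aim to shrink such a gap, for example by reweighting training data or adjusting decision thresholds.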

AI explainability and fairness are only two of many rapidly evolving principles in the field of responsible AI. Other areas, such as securing AI/ML algorithms, also require increased awareness and safeguards, as the EU Agency for Cybersecurity (ENISA) emphasized in a recent report. Another challenge is the trade-off between different principles. Tensions may arise between some trustworthiness properties, like transparency and privacy, or privacy and fairness.

Practical assessment and documentation

Legal definitions are not the only component of responsible AI principles that requires greater clarity. The term “Responsible AI Gap” was coined for the challenges companies face when trying to translate trustworthy AI principles into tangible actions.

For privacy professionals, it can be a good start to approach the topic through a data governance and risk management lens to ensure accountability. Co-designing an appropriate AI governance process with other teams, such as computer engineering, data science, security, product development, user experience design, compliance, marketing and the emerging role of AI ethicist, is the baseline for ensuring that AI/ML systems which process personal data take privacy requirements into account throughout the entire ML pipeline.

Internal policies should ensure humans are in the loop to address data quality, data annotation, testing of the training data’s accuracy, (re)validation of algorithms, benchmarked evaluation and external auditing. Using external transparency standards, such as those recently published by IEEE or the U.K. government, can also be considered.

Data protection impact assessments or privacy impact assessments could be augmented with additional questions relating to responsible AI. In this way, risks to rights and freedoms that using AI can pose to individuals can be identified and controlled. Here, any detriment to individuals that could follow from bias or inaccuracy in the algorithms and data sets should be assessed and the proportionality of the use of AI/ML algorithms documented. The PIA can describe trade-offs, for example between statistical accuracy and data minimization, and document the methodology and rationale for any decision made.

Additionally, organizations can consider privacy-preserving machine learning solutions or the use of synthetic data. While these do not replace responsible AI and privacy policies, thorough model risk management, or methods and tools for model interpretability and bias detection, they strengthen a privacy-first approach when designing AI architectures.

The Norwegian DPA highlighted these approaches in a report dedicated to the use of personal data in ML algorithms: “Two new requirements that are especially relevant for organizations using AI, are the requirements privacy by design and DPIA.”

In this context, key questions for responsible AI principles can also be taken into account. Starting points could be the list proposed by the EU’s AI HLEG or the ones compiled by the Partnership on AI. Interdisciplinary discussions and the deployment of toolkits for responsible AI, AI fairness and AI explainability, like LIME, SHAP or LORE, can further contribute to mutual understanding and transparency toward users.

Further non-technical approaches can include the formation of an ethical AI committee, internal training, diversification of team composition, or an analysis of the data collection mechanism to avoid bias. Currently, the public sector is spearheading efforts to inventory all algorithms in use for transparency reasons. Other organizations have begun to release AI explainability statements. Regardless of the approach, organizations must provide consumers with the necessary information in case of adverse actions resulting from AI/ML systems, as well as about the use and consequences of scoring.

More developments on the horizon

Principles for ensuring trustworthy AI and ML will be reflected in a large variety of laws in the upcoming years. On a global level, the OECD counted 700 AI policy initiatives in 60 countries.

With the new EU Artificial Intelligence Act, high-risk AI systems will be explicitly regulated. In the U.S., President Biden’s administration announced the development of an “AI bill of rights.” Alongside the additional $500 million in funding for the FTC on the horizon, the agency has filed for rulemaking authority on privacy and artificial intelligence. Additionally, the new California Privacy Protection Agency will likely be charged with issuing regulations governing AI by 2023, which can be expected to have far-reaching impact.

With increasing enforcement and new regulations underway, ensuring privacy compliance of AI systems will become a minimum requirement for the responsible use of AI. With more to come our way, it is important that AI applications meet privacy requirements now. Alignment of efforts and a good understanding of the AI/ML ecosystem will help tremendously with preparing for those new developments.

Standardization landscape for privacy: Part 1 — The NIST Privacy Framework
https://iapp.org/news/a/standardization-landscape-for-privacy-part-1-the-nist-privacy-framework

Standards and frameworks provide real benefits for privacy management. Standards are established norms to be applied consistently across organizations, while frameworks are a set of basic guidelines to be adapted to an organization's needs. Both can help to fulfill compliance obligations, build trust, benchmark against industry best practices, support strategic planning and evaluation, enable global interoperability, and strengthen an organization's market position.

Just as in information security, the International Organization for Standardization, in cooperation with the International Electrotechnical Commission, and the U.S. National Institute of Standards and Technology are the main players offering general guidance for privacy risk management. ISO and IEC are non-governmental international organizations with all member states of the United Nations having a vote in their standardization processes. NIST is a non-regulatory government agency within the U.S. Department of Commerce. In furtherance of its mission to promote American innovation and industrial competitiveness, NIST provides a wide variety of standards and technology resources, tools, and guidelines for use by U.S. federal agencies as well as by private industry, both domestically and abroad.

On a European level, three distinct private international nonprofit organizations are officially recognized by the EU as being responsible for developing and defining voluntary standards. They also collaborate with ENISA, the EU Agency for Cybersecurity. The European Telecommunications Standards Institute covers a variety of privacy-related sector specific standards. The European Committee for Standardization and the European Committee for Electrotechnical Standardization are currently working on privacy information management systems for a European context.

In Asia, the APEC Privacy Framework provides privacy principles and implementation guidelines, forming the basis for a regional system called the APEC Cross-Border Privacy Rules. A more recent development was the approval of the ASEAN Data Management Framework in January 2021, based on the 2016 ASEAN Framework on Personal Data Protection. Those frameworks used the OECD Privacy Framework – the first international consensus on privacy protection in the context of free flow of personal data – as their key reference.

Another prominent global organization in the field is the Standards Association of the Institute of Electrical and Electronics Engineers which has developed a large number of industry standards for privacy and security architectures. Additionally, the Privacy Community Group of the World Wide Web Consortium is chartered to incubate privacy-focused web features and APIs to improve user privacy on the web. Other groups involved in the developments in standards and frameworks include the Internet Engineering Task Force and OASIS Open.

Apart from that, there are national privacy standards, among them the newly developed standards for data privacy assurance by the Bureau of Indian Standards or the German standard data protection model. Also, national standards organizations like the UK's national standards body BSI or Standards Australia partner closely with ISO or CEN in the field of privacy standards.

Despite the abundance of external standards and frameworks, many companies choose to develop their own. Even in those cases, an organization can benefit greatly from becoming familiar with the concepts and thought processes offered by the aforementioned bodies and initiatives. Those insights can be used to assess and improve an organization's own privacy program. Improvements could include incorporating additional privacy management principles or closing gaps in internal objectives and controls.

Beginning with this article, we provide a general overview of the existing standards and frameworks in the realm of privacy. This article is the first one in a series of three and will focus on NIST's ground-breaking Privacy Framework, released in January 2020.

A first overview of the NIST Privacy Framework

"The NIST Privacy Framework: A Tool for Improving Privacy through Enterprise Risk Management" is a voluntary framework that helps organizations answer fundamental questions: How are we considering the privacy impacts to individuals as we develop our systems, products, and services? How can we manage privacy risks in a consistent way across business units and markets? How do we ensure a quality privacy program that adapts to business needs and new regulatory requirements?

The intention of the NIST Privacy Framework is to support better privacy practices in enterprises of all sizes, all sectors and all jurisdictions. Organizations can rely on the Framework to create a new privacy program from scratch or to improve an existing privacy program.

The Framework's approach to privacy risk is to consider privacy events as potential problems individuals could experience arising from data processing throughout the complete data lifecycle, from collection through disposal. Potential problems range from violating a person's dignity to discrimination, economic loss or physical harm. Privacy risks can arise by means unrelated to cybersecurity risks, which are characterized by a loss of confidentiality, integrity or availability of personal information.

The Framework defines privacy risk as the likelihood that individuals (singly or in groups) will experience privacy problems resulting from data processing, and the impact should those problems occur. While individuals experience the direct impact of privacy events, organizations can experience significant impacts as well, such as noncompliance costs, loss of clients and customers, a decline in sales, and negative brand image.
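
A minimal sketch of that likelihood-and-impact framing is shown below, with invented events and 1-to-5 scales; NIST's companion Privacy Risk Assessment Methodology provides the actual worksheets for doing this rigorously.

```python
# Toy prioritization of privacy risks as likelihood x impact, on invented 1-5 scales.
# Events and scores are illustrative only, not drawn from any NIST worksheet.
problematic_data_actions = {
    "re-identification of pseudonymized records": (2, 5),  # (likelihood, impact)
    "unexpected secondary use of contact data":   (4, 3),
    "over-retention of interaction logs":         (5, 2),
}

ranked = sorted(problematic_data_actions.items(),
                key=lambda kv: kv[1][0] * kv[1][1], reverse=True)

for event, (likelihood, impact) in ranked:
    print(f"{event}: risk score {likelihood * impact}")
```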

Against this backdrop, the NIST Privacy Framework supports ethical decision-making around privacy risk management in the context of enterprise risk management. It enables finding the right balance between building innovative systems, products, and services while protecting individuals' privacy.

NIST acknowledges that privacy risk management is a cross-disciplinary function that requires support and engagement from stakeholders across an organization. Therefore, one of the main purposes of the Framework is to provide a common language for legal, technical, design and product teams to drive internal collaboration. This goal can be achieved whether the Framework is used in a lightweight manner or in the context of more advanced privacy risk management. In any case, using the NIST Privacy Framework as a reference and guideline for cross-organizational dialogue can strengthen accountability for privacy risk management throughout an organization.

The NIST Privacy Framework was modeled after the widely adopted NIST Cybersecurity Framework. However, the adoption of the Privacy Framework is independent from the implementation of the Cybersecurity Framework. Both Frameworks are designed for guidance only and are not auditable.

The three components of the NIST Privacy Framework 

The NIST Privacy Framework consists of three parts, following the structure of the NIST Cybersecurity Framework.

The Core

The first component of the Privacy Framework is called the "Core." The Core consists of a table of Functions, Categories and Subcategories that describe specific privacy activities and outcomes to better manage privacy risks across the whole organization.

The Core first raises a high-level awareness for the different areas of privacy risk management that can be addressed. Those are referred to as "Functions."

Two of the Functions in the Privacy Framework, Identify-P and Protect-P, have the same names as in the Cybersecurity Framework. As a distinction, the Privacy Framework's Functions carry a "-P" at the end.

Assigned to each of the Functions are several key categories. They define general privacy outcomes. In total, the Framework lists 18 categories or privacy outcomes across all five Functions.

The Functions and related categories or privacy outcomes are the following:

  1. Identify-P provides the basis for privacy risk management in an organization. It refers to Inventory and Mapping, Business Environment, Risk Assessment, and Data Processing Ecosystem Risk Management.
  2. Govern-P includes the following outcomes: Governance Policies, Processes, and Procedures; Risk Management Strategy; Awareness and Training; and Monitoring and Review.
  3. Control-P refers to Data Processing Policies, Processes, and Procedures; Data Processing Management; and Disassociated Processing.
  4. Communicate-P points to Communication Policies, Processes, and Procedures, as well as Data Processing Awareness.
  5. Protect-P regards data processing safeguards and is where privacy and cybersecurity risk management overlap, including Data Protection Policies, Processes, and Procedures; Identity Management, Authentication, and Access Control; Data Security; Maintenance; and Protective Technology.

The Categories are then further broken down into 100 Subcategories. The Subcategories are the building blocks between policies and capabilities so that legal, compliance and engineering domains can collaborate on implementation with more specificity to achieve the organization's desired privacy outcomes.

For example, the Function "Identify-P" refers to developing the organizational understanding to manage privacy risk for individuals arising from data processing.

One of the categories of this Function is "Inventory and Mapping," described as: "Data processing by systems, products, or services is understood and informs the management of privacy risk."

To achieve this outcome of "Inventory and Mapping," the following subcategories could be prioritized: "Systems/products/services that process data are inventoried" or "The purposes for the data actions are inventoried."
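
To make the Function, Category and Subcategory layers easier to picture, the snippet below sketches a small slice of the Core as a nested mapping. The wording follows the examples above, with one paraphrased outcome; this is illustrative, not the complete or official NIST listing.

```python
# Illustrative slice of the Privacy Framework Core as a nested mapping:
# Function -> Category -> Subcategory outcomes (selection and wording are illustrative).
core_slice = {
    "Identify-P": {
        "Inventory and Mapping": [
            "Systems/products/services that process data are inventoried",
            "The purposes for the data actions are inventoried",
        ],
        "Risk Assessment": [
            "Privacy risks from data processing are identified and prioritized",  # paraphrased
        ],
    },
}

for function, categories in core_slice.items():
    for category, subcategories in categories.items():
        print(f"{function} / {category}: {len(subcategories)} prioritized outcome(s)")
```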

While the elements of the Core represent best practices for privacy risk management, the Core is not meant to be a checklist. Rather, individual elements of the Core can be selected as priority outcomes and activities according to an organization's needs. Ideally, the selection is made through a constructive dialogue from the executive level to the implementation/operations level.

Finally, it is good to know that the subcategories are not controls. According to the NIST Privacy Framework, controls get selected in the context of privacy capabilities and requirements that describe the "what" and "how" of systems, products, or services helping achieve the desired privacy outcomes in the Framework Core.

As support for determining privacy capabilities, NIST created the privacy engineering objectives of predictability, manageability and disassociability, introduced in NISTIR 8062, An Introduction to Privacy Engineering and Risk Management in Federal Systems, from January 2017. The selection of appropriate controls is addressed in the Crosswalk of NIST Privacy Framework and Cybersecurity Framework to NIST Special Publication 800-53, Revision 5.

Profiles

The second component of the NIST Framework is called "Profiles." Profiles are the next step in helping organizations have a privacy risk management conversation. An organization can use the Core like a menu and select which Functions, Categories, and Subcategories to prioritize to help it manage privacy risk.

An organization can build a Current Profile to reflect where its privacy program is and a Target Profile to reflect where it needs to be. When building its Profiles, the organization takes into account its business objectives, privacy values, risk tolerance, legal and regulatory requirements, industry best practices, priorities and resources, types of data processing, and individuals' privacy needs.

In this step, an organization might want to add its own Functions, Categories and Subcategories to account for unique organizational risks.
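
A toy sketch of how a Current and a Target Profile might be compared to surface gaps for planning follows; the selected outcomes are invented examples rather than recommendations.

```python
# Toy gap analysis between a Current and a Target Profile.
# The outcome statements are illustrative examples, not an official selection.
current_profile = {
    "Systems/products/services that process data are inventoried",
    "Data processing policies are established",
}
target_profile = {
    "Systems/products/services that process data are inventoried",
    "The purposes for the data actions are inventoried",
    "Data processing policies are established",
    "Data elements can be accessed for deletion",
}

gaps = target_profile - current_profile
print("Outcomes to plan and resource for:")
for outcome in sorted(gaps):
    print(f"  - {outcome}")
```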

An organization's definition of the privacy goals it wants to achieve forms the basis for appropriate strategic planning and implementation processes, supported by the third and final part of the Privacy Framework.

Implementation Tiers

The third and final component of the Privacy Framework is called "Implementation Tiers." They serve as a benchmark of current privacy risk management practices. Tiers classify an organization's overall privacy risk management practices and help to determine if an organization has sufficient processes and resources to achieve its Target Profile.

Internal roadblocks like the lack of integration of privacy risk into an organization's enterprise risk management portfolio, insufficient staffing or lack of training could prevent an organization from reaching its Target Profiles. The Implementation Tiers can help start an internal communication process about those or other reasons that challenge an organization's capability to manage privacy.

The Privacy Framework describes four progressing Tiers, from informal, reactive responses to more agile and risk-informed approaches. In Tier 1 ("Partial"), organizations have adopted risk management methods only to some extent. Tier 2 ("Risk Informed") indicates informal risk management methods that do not entail an organization-wide approach. Risk management methods within Tier 3 ("Repeatable") are well-defined and structured, and in Tier 4 ("Adaptive"), organizations actively adapt to evolving privacy risks.

Each of the Implementation Tiers looks at four main components of an organization's privacy program: Privacy Risk Management Process (the degree of development of the privacy risk management in particular), Integrated Privacy Risk Management Program (the extent to which the privacy risk management is integrated in an organization-wide approach), Data Processing Ecosystem Relationships (understanding of the role of the organization in relation to buyers, suppliers, service providers, etc.), and Workforce (roles dedicated to privacy and training of employees).

Tiers help determine whether risk management practices in those different areas are sufficient given their mission, regulatory requirements, and risk appetite or if additional processes or resources are needed. This can support investigating specific aspects of the overall privacy program management or help with budget planning, hiring plans or setting up training curriculums.

At the same time, the Implementation Tiers are not meant to be a comprehensive privacy maturity model. An organization may be at Tier 2, which could be sufficient to manage the types of privacy risks it has. On the other hand, another organization may be at Tier 2 but really need to get to Tier 3 to manage their privacy risks.

Integration of compliance and risk management 

While the NIST Privacy Framework is agnostic to any particular law, it can help organizations fulfill compliance obligations. The Framework enables the creation of a foundational privacy program that can be tailored to different jurisdictions.

To help with this task, NIST provides a variety of so-called "regulatory crosswalks." These crosswalks map the risk management outcomes and activities in the Privacy Framework Core to the requirements of a specific law or regulation. In the Privacy Framework's online Resource Repository, one can already find references regarding the CCPA, CPRA, GDPR, LGPD and the VCDPA.

With the help of the crosswalks, the Framework's privacy risk management building blocks can be mapped back to the organization's compliance obligations to demonstrate the activities and concrete steps taken to meet those obligations. This is meant to bridge the gap between compliance requirements and system, product, and service design or deployment with a forward-looking risk management approach to privacy.

The crosswalks also facilitate the integration of the Privacy Framework and other standards and frameworks. Available mappings can be found for ISO/IEC 27701, NIST Cybersecurity Framework, FIPPs, the IAPP-CIPM, and NIST 800-53, Rev 5, Security and Privacy Controls for Information Systems and Organizations.

The Privacy Framework in action

NIST emphasizes that its Privacy Framework is a flexible and practical tool that is adaptable to any organization's role in the data processing ecosystem.

One example of this is the well-thought-out adoption by Booking Holdings, a Fortune 500 global travel retail company. "The NIST Privacy Framework provided a scope to standardize our privacy obligations throughout our six international brands," explained Global Privacy Senior Manager Michelle Gall.

For this purpose, the categories and subcategories were regrouped and reformulated into internalized objective statements and risk statements, still grouped under the NIST Privacy Framework's Function structure. To further aid the organization in standardizing the expectations for operationalizing and mitigating the risk statements, the company adopted and customized two additional collaborating tools: 1) a people, process, technology framework for measuring maturity and capability across the risks, and 2) an attribute library, which provides descriptive and actionable expectations to give an even clearer picture of the business units' responsibilities for risk mitigation and for operationalizing the objectives in each risk area. "These tools allow us to facilitate standardizing risk mitigation objectives without defining the processes or controls for the brands."

The people, process, technology framework was designed by combining the Capability Maturity Model Integration with the NIST Privacy Framework's Implementation Tiers as a crosswalk. In this way, the granularity of the self-assessment was increased to a quantitative approach allowing for monitoring and measuring success of the global program throughout all affected business units and brands.

"By adapting the NIST Privacy Framework to our company's needs," Michelle summarized, "we could provide our organizational business capabilities with a holistic and clear guidance where they are and where they need to go in terms of their privacy risk management."

Further steps ahead

The NIST Privacy Framework is still relatively new. Over the next few years, we will very likely witness its increased use and adoption. With a growing number of freely available crosswalks mapping the Privacy Framework to international regulations and standards, NIST provides a sophisticated, flexible and far-reaching approach to managing privacy risk. Everyone can contribute to the guidance and tools that support the Framework's usability.

Another part of the broad Roadmap for Advancing the NIST Privacy Framework is the NIST Privacy Workforce Public Working Group. It supports the development of a workforce that has the knowledge and skills to manage privacy risk. In all its efforts, NIST is pursuing an open and accessible approach, seeking dialogue with the private and public sector, academia, and civil society, to collaboratively improve the implementation of the Privacy Framework. The NIST Privacy Workforce Public Working Group, too, welcomes stakeholder inputs from a wide range of roles.

Privacy as code: A new taxonomy for privacy
https://iapp.org/news/a/privacy-as-code-a-new-taxonomy-for-privacy

“Privacy by design” implies putting privacy into practice in system architectures and software development from the very beginning and throughout the system lifecycle. It is required by the EU General Data Protection Regulation in Article 25. In the U.S., the Federal Trade Commission included an entire section on privacy by design in its 2012 report on recommendations for businesses and policymakers. Privacy by design is also covered by India’s PDP Bill and by Australia’s Privacy Management Framework, to name just a few. Privacy by design has come a long way since its original presentation by Ann Cavoukian, former information and privacy commissioner of Ontario, Canada, in 2009.

While privacy by design is conceptually simple, its reduction to practice is not. System developers and privacy engineers responsible for it face simple but hard-to-answer questions: Where is the actual data in the organization? What types of information fall under personal data? How does one set up a data deletion process for structured as well as unstructured data?

Three years ago, Cillian Kieran and his team at Ethyca embarked on a quest to develop a unified solution to those questions. Their vision? Nothing less than privacy-as-code – privacy built into the code itself. This revolutionary approach classifies data in such a way that its privacy attributes are obvious within the code structure.

Their efforts were led by some big questions: How can systems developers describe the types and purposes of personal data in a consistent manner? Is there a way privacy rules and policies can get defined and enforced throughout the software development process? What if configurable tools could uphold data subject rights such as data access, erasure, portability, and retention as a system feature?

Last week, Ethyca celebrated an additional $7.5 million in funding and announced the first release of Fides, named after the Roman goddess of trust.

Fides is an open-source, human-readable description language based on the data-serialization language YAML. Fides allows one to write code with privacy designed in. It is based on common definitions of types, categories and purposes of personal data. Developers who use this language can easily see where privacy-related information is at any point in the software development process. For any given system, engineers should be able to understand at a glance whose data is in the system and what it is being used for.

Ethyca’s goal is to establish consistent standards around personal data processing which can describe the privacy characteristics of applications, datasets and broader tech stacks, and identify privacy risks. For the first time, this creates a standardized and interoperable approach towards privacy engineering from the ground up.

The privacy-related characteristics and behaviors of code and databases are derived from a new privacy taxonomy. While many of us are familiar with Daniel Solove’s pioneering taxonomy of privacy problems in the context of privacy-risk modelling, Fides’ taxonomy is different.

Fides’ privacy taxonomy is used to label and classify data so one can quickly understand what data is processed or shared, whose data it is and why.

The taxonomy distinguishes four levels of hierarchy: data categories, data uses, data subject categories and data qualifiers. Each of those hierarchical levels can be broken down into a variety of subclasses of annotations that allow for the needed granularity.

For example, data categories classify different types of data: account, system or user data. For data uses – describing the purpose for which the data is used – examples of labels could be: personalize, third_party_sharing, or train_ai_system. Data subject categories cover anyone from anonymous_user, customer, employee and more. Finally, data qualifiers express the degree of identification, such as identified, pseudonymized or anonymized data.

Eventually, the description of the data results in a written statement that is concise and easy to understand. For example, the hierarchical notation to label cookie IDs would look like this: user.derived.identifiable.device.cookie_id.

While this is an example of a fully qualified category, going through all the categories, one could also describe the data at a higher level. An example could be the description for all the data that is shared with third parties for the purpose of personalized advertising: third_party_sharing.personalized_advertising. The different degrees of granularity allow for a flexible description of the data, adapted to the specific needs of a project (try it yourself with the Privacy Taxonomy Explorer on GitHub).

Now, how is this done in practice?

The Fides description language uses YAML declaration files to store the data characterization as privacy metadata directly in a project’s Git repositories (an open-source version control system for managing source code history). In this phase, the management tool Fides Control (Fidesctl) is used to implement privacy requirements in the code before moving into production.

But this isn’t all. Many times, privacy engineers get involved only after data collection has already started, its source and use are unclear, and data management must be done manually, requiring joint efforts of legal and data engineering teams. Therefore, in a further step, Fides expands beyond a taxonomy alone into a privacy ontology focused on the production environment. Fides’ privacy ontology describes roles and relationships in a runtime environment, allowing for a variety of trailblazing applications.

Imagine these two examples:

  1. Evaluating risk against policy while code is being written: Predefined privacy policies, which formalize business decisions and regulatory compliance requirements in accordance with the standardized privacy ontology, are stored on the Fides Control server. By using Fidesctl, the YAML declarations (based on the privacy taxonomy) are compared against those policies on the Fides server. Any inconsistencies between the privacy declarations and the policies are flagged as risks automatically. If the conditions of the policies on file aren’t met, the necessary changes can be investigated in the code base and made accordingly. Once the evaluation is passed, the changes can be committed to the code. In this way, Fides can make sure any code that is shipped or merged is compliant with the organization’s policies (a simplified sketch of this kind of check follows the list).
  2. Automated data subject rights requests: Once these YAML declarations are approved in the above step, they form part of a metadata view of where information lives across all business systems. Using Fides Operations (Fidesops), developers and legal teams can write policies for data subject requests to automate access, deletion and other complex procedures consistently across all databases and connected systems. In this way, Fides makes a typically complex and heavily manual process fully automated and a feature of the system.
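
To illustrate the gist of step 1, here is a simplified, hypothetical sketch of evaluating declared data uses against an allowed-uses policy with the hierarchical dot notation described earlier. It is not Fidesctl's actual implementation or API; the policy contents are invented for the example.

```python
# Hypothetical sketch of a Fides-style policy check (not the real Fidesctl implementation):
# a declared data use violates the policy if it equals, or is nested beneath, a forbidden branch.
def falls_under(label: str, branch: str) -> bool:
    """True if `label` equals `branch` or sits below it in the dot hierarchy."""
    return label == branch or label.startswith(branch + ".")

# Forbidden branches of the data-use taxonomy (illustrative policy, not a recommendation).
forbidden_uses = ["third_party_sharing.personalized_advertising", "train_ai_system"]

# Data uses declared for a system in its (hypothetical) YAML privacy declaration.
declared_uses = ["personalize", "third_party_sharing.personalized_advertising"]

violations = [use for use in declared_uses
              if any(falls_under(use, banned) for banned in forbidden_uses)]

if violations:
    print("Policy evaluation failed for:", violations)
else:
    print("Declarations are consistent with the policy.")
```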

This emerging ontology and its many applications might be among the most interesting aspects for both privacy engineers and legal privacy professionals. Complex legal and regulatory requirements can be synthesized into the predefined policies. Once those policies are defined and deployed, developers can rely on the automated control of the data processing against the requirements laid out in the policies. While developers can already compile such policies with Fidesops on an open-source basis, it is on Ethyca’s roadmap to provide an accessible user interface for everyone to write such policies.

Checking code against semantic privacy policies has the potential to be a true game changer. It would not only allow privacy risks to be identified and managed in the development phase, before production, but could also become the basis for generating privacy reports that document the compliance of the code base at the touch of a button, or for handling privacy requests such as access and deletion automatically.

The prospects Ethyca is offering with Fides are astonishing. With Fides, Ethyca seems to have taken the first step toward building privacy into the very language of the code and toward a practical, standardized approach to privacy management in system and software development.

Now it is up to the community to use the tools offered on GitHub and to get engaged in their further development so Fides can become the interoperability tool its creators have envisioned. Community feedback is crucial to achieving privacy by design at the very core of development processes.

Multiparty computation as supplementary measure and potential data anonymization tool
https://iapp.org/news/a/multiparty-computation-as-supplementary-measure-and-potential-data-anonymization-tool

Privacy-enhancing technologies like secure multiparty computation, homomorphic encryption, federated learning, differential privacy, secure enclaves, zero-knowledge proofs or synthetic data are becoming increasingly relevant in practice and increasingly considered by regulators.

Approaching the challenging trade-off between data privacy and data utility across a vast variety of use cases, privacy-enhancing technologies embed important privacy-by-design principles in the data lifecycle. They aim to enable increased collaborative information-sharing while mitigating privacy and security risks in previously unknown ways.

This is particularly true for secure multiparty computation, widely regarded as one of the most influential achievements of modern cryptography.

Previously unimaginable, MPC allows for sharing data insights while keeping the data itself private. Two or more parties can receive an output of a computation based on their combined data without revealing their own data to the other parties. All inputs remain private. At the same time, the data remains protected by encryption while in use. The participants' input data doesn't need to be transferred to a central location but can be processed locally. The trusted third party — which is usually needed for processing data from different sources — is emulated by the cryptographic protocol used for MPC.
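
A toy sketch of the core idea using additive secret sharing, one of the simplest MPC building blocks, follows: three parties learn the sum of their private inputs without any party seeing another's value. Real MPC protocols involve far more machinery and formal security analysis than this.

```python
# Toy additive secret sharing over a prime field: three parties compute the sum of their
# private inputs without revealing them. Illustration only; real MPC protocols are far richer.
import secrets

P = 2**61 - 1  # a large prime; all arithmetic is done modulo P

def share(value: int, n_parties: int = 3) -> list[int]:
    """Split `value` into n random shares that sum to `value` mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

inputs = {"party_A": 120, "party_B": 45, "party_C": 300}   # each party's private value

# Each party shares its input; party i holds the i-th share of every value.
all_shares = {name: share(v) for name, v in inputs.items()}
local_sums = [sum(all_shares[name][i] for name in inputs) % P for i in range(3)]

# Only the recombined result is revealed; the individual inputs never are.
print("Joint sum:", sum(local_sums) % P)   # 465
print("Check:", sum(inputs.values()))
```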

The resulting ability to break through data silos in a private and secure way is increasingly recognized by the market. According to Gartner, half of large organizations will have implemented privacy-enhancing computation for processing data in untrusted environments and multiparty data analytics use cases by 2025.

Multiparty computation can be deployed in many different use cases, ranging from distributed signatures, key management, privacy-preserving machine learning, blockchain and financial fraud detection to digital advertising and medical research.

The first large-scale practical application of MPC took place in Denmark in 2008. In a commodity trading exchange with only one buyer, MPC allowed for an anonymous price setting that was trusted by farmers. Another great example is the annual study by the Boston Women’s Workforce Council on the gender wage gap.

Regulators are catching up

These developments didn't go unnoticed by regulators.

In Europe, increasing legal clarity lays ground for the adoption of multiparty computation.

In June 2021, the European Data Protection Board recognized multiparty computation (or “split processing,” as they call it) as a supplementary technical measure for international personal data transfers from Europe to states that do not offer an adequate level of data protection.

This acknowledgment was not accidental. The European Research Council has been funding the development of practical MPC systems for some time. Correspondingly, the European Union Agency for Cybersecurity lists MPC as an advanced pseudonymization technique in its report from January 2021.

In the U.S., the recognition of privacy-enhancing technologies and MPC is gaining momentum as well.

In February 2021, the Promoting Digital Privacy Technologies Act was introduced in the U.S. Senate. It would provide support for research, broader deployment and standardization of PETs, defining them as “any software solution, technical processes, or other technological means of enhancing the privacy and confidentiality of an individual’s personal data in data or sets of data” which “includes anonymization and pseudonymization techniques, filtering tools, anti-tracking technology, differential privacy tools, synthetic data, and secure multi-party computation.”

Previously, the use of MPC was already suggested explicitly in the Student Right to Know Before You Go Act, designed to protect privacy in a federal student record database. In a report prepared for the U.S. Census Bureau, MPC was identified as secure computation technology with the highest potential. Several projects by the U.S. Defense Advanced Research Projects Agency are focusing on MPC as well (e.g. “Brandeis”).

Solid technology with many protocols

MPC originated from solving a classic computer science problem called the “millionaires' problem,” introduced by Andrew Yao in 1982: Two millionaires would like to know who has more money without revealing how much they each have. The “secure two-party computation” Yao came up with is still the foundation for many of the most efficient cryptographic protocols for multiparty computation known to date.

In general, MPC works either on the basis of garbled Boolean circuits, introduced by Yao, or on the basis of Shamir's secret sharing and arithmetic circuits over a large field. Additionally, oblivious transfer protocols are used in both cases.

Numerous different protocols have been developed and are used in each approach, and the schemes can also be combined. In short:

  • Garbled Boolean circuits are encrypted versions of digital logic circuits, consisting of hardware or programmed wires and logic gates that follow a prescribed logic when computing a function. To “garble” the circuit means encrypting the possible input combinations and possible outputs, described in the so-called truth tables at the logic gates. Each logic gate then outputs cryptographic keys that unlock the output of the next gate, a process repeated until the final result is reached.
  • With Shamir's secret sharing, data (for example, personal data or a machine learning model) is split up into fragments, which in themselves do not contain any usable information. The secret shares are distributed among a set of parties, who perform secure computation over the shares and release the output to a designated party once done (a toy split-and-reconstruct sketch follows this list).
  • Oblivious transfer protocols let a sender transmit two encrypted messages to a receiver in a way that guarantees the messages were sent and received, while the receiver can open only one of them and the sender does not learn which one was chosen.
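
As a companion to the Shamir bullet above, here is a toy split-and-reconstruct sketch using Lagrange interpolation over a prime field; it shows only the sharing primitive, not a full secure-computation protocol, and the parameters are chosen purely for illustration.

```python
# Toy Shamir secret sharing (3-of-5): split a secret into shares and reconstruct it
# from any 3 of them via Lagrange interpolation modulo a prime. Educational sketch only.
import secrets

P = 2**127 - 1  # prime modulus

def split(secret: int, n: int = 5, k: int = 3) -> list[tuple[int, int]]:
    """Return n points of a random degree-(k-1) polynomial with f(0) = secret."""
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(k - 1)]
    def f(x: int) -> int:
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(points: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

shares = split(424242)
print(reconstruct(shares[:3]))   # any 3 shares recover 424242
print(reconstruct(shares[1:4]))  # a different subset works too
```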

MPC protocols vary in their efficiency, security and robustness. The protocols can be set up for different scenarios, depending on how many adversaries operating in the system the solution should remain secure against. Within the definition of security, no adversarial success is tolerated.

To elevate the privacy posture and cover more use cases, MPC is often combined with federated learning, homomorphic encryption and differential privacy.

Increased legal clarity, but open questions remain

Within the scope of the EU General Data Protection Regulation, MPC can be considered a technical safeguard and pseudonymization technique for the processing of personal data. The recognition of MPC as a supplementary measure by the EDPB and its presentation as a pseudonymization technique by ENISA underlines its capacity as a Data Protection by Design and by Default tool in accordance with Article 25 of the GDPR.

Article 25 states that appropriate technical and organizational measures have to be implemented before and while processing personal data. In doing so, the “state-of-the-art” has to be taken into account. Data controllers have to consider the current progress in technology available in the market and stay up-to-date on technological advances.

The requirement of “state of the art” is stressed further in Article 32 of the GDPR, which specifies the technical and organizational measures. In this context, the guidelines on “State-of-the-art technical and organizational measures” by the European Union Agency for Network and Information Security and the German TeleTrusT from February 2021 refer to state of the art as the “best performance available” of an IT security measure “on the market to achieve” an IT security objective.

In general, this is when “existing scientific knowledge and research” reaches market maturity or is at least launched on the market. For evaluating the state of the technology, one has to look at the degree of recognition and proof in practice. Insofar as MPC meets those criteria, it will have to be considered for privacy and security risk mitigation.

But MPC might offer more than being a technical safeguard and technology for pseudonymization in the context of the GDPR.

MPC is also discussed as a tool for anonymizing personal data; data anonymized this way would, as a consequence, fall outside the scope of the GDPR. This is a very practical concern, particularly in the context of international health research impeded by the GDPR's strict requirements.

In general, the GDPR takes a risk-based approach to determining whether data should be considered anonymized or pseudonymized. All means of reidentification that could reasonably likely be used, by the data controller or a third party, to identify an individual must be taken into account.

In reality, the question of where to draw the line between pseudonymization and anonymization is answered very differently across EU institutions and member states. The absolute approach insists that, for data to be classified as anonymous, no remaining risk of reidentification is acceptable. The relative approach accepts that there is always a remaining risk of reidentification; only attempts to reidentify data by the controller itself, with the legitimate help of a known third party, should be taken into consideration.

The legal view that MPC leads to anonymization of personal data can hold particularly true under the relative approach. It stresses that the private inputs or data fragments exchanged during MPC cannot identify any individual on their own. Also, as long as the participants of the MPC don’t have lawful access to the decryption keys, collusion is highly improbable.

Support for MPC as a means to anonymize personal data also comes from SODA, a project funded by the EU’s Horizon 2020 research and innovation program that dived deep into the legal aspects of MPC in the context of the GDPR. Working on practical privacy-preserving analytics for big data using MPC, the project concludes in its legal assessment: “Cryptographic solutions such as multi-party computation have the potential to fulfil the requirements for computational anonymity by creating anonymized data in a way that does not allow the data subjects to be identified with means reasonably likely to be used.”

The definition of MPC the EDPB uses in its recommendations on supplementary measures does not seem to preclude this interpretation. It states that the final result of a computation conducted with MPC “may” constitute personal data:

  1. The data exporter wishes personal data to be processed jointly by two or more independent processors located in different jurisdictions without disclosing the content of the data to them. Prior to transmission, it splits the data in such a way that no part an individual processor receives suffices to reconstruct the personal data in whole or in part. The data exporter receives the result of the processing from each of the processors independently, and merges the pieces received to arrive at the final result which may constitute personal or aggregated data.

The EDPB further specifies that for MPC being considered an effective supplementary measure, the data inputs need to be processed in a way that no information can be revealed about specific data subjects, even when cross-referenced, and no input information is leaked to other participants in the MPC protocol. The data being processed should be located in different jurisdictions and public authorities should not have the legal means of accessing all the necessary shares or input data.

Under all circumstances, it is important to keep in mind that encrypting or anonymizing personal data itself qualifies as “processing” and can only be done on a clear legal basis according to Article 6 of the GDPR.

In conclusion, MPC is rightly considered by many as a privacy-enhancing technology that could transform private and secure information-sharing. The acknowledgment of MPC as an admissible technical safeguard for international data transfers by the EDPB highlights MPC’s potential as a state-of-the-art privacy-by-design tool. This is encouraging for organizations that want to rely on mathematical models for joint processing that go beyond written agreements of private and secure collaboration.

To unlock the full value of multiparty computation and privacy-enhancing technologies, it would be crucial for EU regulators to provide more legal clarity on pseudonymization and anonymization. Scientific research in particular could benefit tremendously from a granular framework for assessing identifiability in the context of ready-to-use privacy-enhancing technologies.

Podcast: The privacy, ethical issues with empathic technology
https://iapp.org/news/a/podcast-the-privacy-ethical-issues-with-empathic-technology

Artificial intelligence and machine learning technologies are rapidly developing across virtually all sectors of the global economy. One nascent field is empathic technology, which, for better or worse, includes emotion detection. It is estimated that emotion-detection technology could be worth $56 billion by 2024. However, judging a person's emotional state is subjective and raises a host of privacy, fairness and ethical questions. Ben Bland has worked in the empathic technology space in recent years and now chairs the Institute of Electrical and Electronics Engineers P7014 Working Group to develop a global standard for the ethics of empathic technology. IAPP Editorial Director Jedidiah Bracy, CIPP, recently caught up with Bland to discuss the pros and cons of the technology and his work with IEEE.
Full Story

2021-05-14 12:45:16
Setting data retention timelines https://iapp.org/news/a/setting-data-retention-timelines https://iapp.org/news/a/setting-data-retention-timelines When setting retention timelines for your data, start with this plan: If you don’t need to have it, then delete it.

However, figuring out the exact timelines you can adhere to is more complicated than you might expect, especially when dealing with large or distributed computer systems.

Keeping data

You probably know all the legally required reasons to keep data, such as financial records compliance, discovery for a court case or HR compliance. There are several more systems- and product-oriented reasons that you might need to keep data.

  • Your user asked you to keep the data. It’s rude to delete someone’s data when they didn’t ask you to and they’re likely to be upset. Let me differentiate here between primary and side-effect data. Primary data is the particular data they asked you to keep, such as their email or baby pictures. Side-effect data is data created as a side effect of interaction with a service, such as logs of those interactions. Side-effect data does not count as something the user asked you to keep unless that request is extremely explicit.
  • You need the data for security/privacy/anti-abuse. If you’re keeping data only for these purposes, then make sure to segregate it from your other data and reduce access so that it is only used for those purposes. Because abusers will do things like rotate their attack patterns and “age” accounts specifically to get beyond your data retention periods, you may want to be cautious about specifying the exact retention period for this particular data.
  • Other necessary business purposes. Those break down largely into two categories:
    • Data necessary to run a system. There are certain types of data that are necessary to keep a system functional. For example, without debugging logs, your engineers are going to be pretty much out of luck when they try to fix their broken system. Every system breaks. Without data, it is extremely difficult to effectively load-test a complex system, especially when it’s being changed or before it’s in use. Without load-tests, systems tend to break at precisely the worst times: when they’re in heavy use. Engineers also need to analyze their system and its use to keep it working in the future. For example, they have no way to know how many more servers they’ll need next week or next month unless they have a fairly accurate picture over time of the system traffic and load.
    • Data used to improve a system. Most users expect a system to improve over time. If you have permission to do so, then you’ll need to keep logs of system interactions over some time frame in order to do this work.

Once data is truly anonymized, it’s no longer user data. For example, it’s important to know approximately how many requests your servers received at a particular time on a particular day, because that’s part of how you understand whether your product is effective and part of how you plan to buy the appropriate amount of server capacity; an aggregate count like that doesn’t need to identify anyone. Don’t just remove the identifiers and call it good, though.

How long to keep data

The obvious bits: If a user or customer asks you to keep certain data, you should keep it as long as they’ve asked you to and delete it promptly when they ask you to delete it. If you have to keep data for legal reasons, keep it for as long as is required and then delete it. Past there, things get fuzzier.

Here are some guidelines I’ve found helpful.

Five weeks is a good default timeframe for data kept for analysis, system maintenance and debugging. A lot of the analyses listed above need to be able to compare what is happening today to what happened a month ago. Daily statistics have a lot of noise, and many systems have swings in usage on a month-over-month basis. Because analysis pipelines take time to run and sometimes fail, five weeks gives time to perform month-over-month analysis even in the face of those failures and to work around holidays — a standard month-over-month analysis may be skewed during a holiday, so you might need to run the analysis over a slightly shifted timeframe.

A more limited set of needs requires analysis over a longer timeframe. For those, I would default to 13 months. Thirteen months gives enough data to provide a year-over-year analysis, with enough wiggle room to work around holidays and system failures.

These longer-timeframe analyses are important in two major cases. The first is systems with strongly year-oriented behavior, which can be extremely difficult to understand without comparing them to the behavior of the system from a year ago; in most other situations, these analyses are not sufficiently valuable to justify keeping data for that long a period. The second is fighting fraud and abuse. Attackers on a system will specifically take advantage of limited data retention by recycling old attacks periodically — when your abuse-fighting system “forgets” certain kinds of spam, spoofing or other forms of abuse, they’ll crop right up again. Longer retention periods reduce the ability of attackers to take advantage of the system.

A huge factor in setting timelines on data retention is how long your system takes to actually delete data completely, which has to be added to your chosen retention period. It’s almost never immediate and, in a complex system, can take a while.

Take into account system factors, such as:

  • How long you keep your backups. Data in backups isn’t deleted.
  • How long do your data centers go down for maintenance? During those times, you won’t be able to delete data.
  • How long does it take to rebuild your machine learning models? A machine learning model that is built with user data almost certainly isn’t anonymous, so it must be rebuilt once that data is deleted (making a model genuinely anonymous generally requires bleeding-edge research techniques or pre-anonymizing your training data).
  • How long does your deletion pipeline take to run? How often does it fail? You should plan on running that deletion pipeline multiple times, even in a generally stable system, because random failures happen.
  • How long does it take between asking your storage system to delete and that deletion actually taking place? Depending on what you’re using, this can be far longer than “immediately.” For example:
    • If you are using a distributed storage system, then registering a deletion with the system does not mean that every distributed replica is deleted immediately. Consult whoever is running your storage system to figure out how long this takes, even in the face of a flaky network or other errors.
    • If you are deleting data off of a hard drive, that doesn’t mean that the data is actually removed from the hard drive, generally. It means that the file system marks the places where that data was stored as available to be overwritten. You have to actually overwrite the data in those areas for it to be reasonably/fully gone. (I’m not going to get into forensic techniques for recovery from different storage media here.) That means you either need to overwrite the data on purpose or figure out how long it takes for normal operation of your system to overwrite that data (with a safety margin built in for things like maintenance periods or holidays).
  • Sadly, few storage systems natively support a “time to live” that you can set on data. If you set a TTL of one week, for example, then data you write is deleted after one week. TTL can be extremely helpful both in reducing errors in data deletion, as you don’t need to track down every last piece of data written by every single intermediate stage of your pipeline, and in speeding up parts of your data deletion process, depending on implementation (see the sketch after this list for what setting a TTL at write time can look like).
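
Where the storage layer does support expiry, setting it at write time is straightforward. A minimal sketch, assuming the redis-py client and a Redis server running locally; the key name and the one-week TTL are illustrative.

```python
import redis  # assumes the redis-py client and a locally running Redis server

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60

r = redis.Redis(host="localhost", port=6379)

# Write a piece of side-effect data (an interaction log entry, say) with a TTL
# so the store deletes it automatically after one week.
r.set("log:interaction:12345", "user clicked suggestion", ex=ONE_WEEK_SECONDS)

# The remaining lifetime can be inspected, which is handy for monitoring.
print(r.ttl("log:interaction:12345"))  # roughly 604800 seconds
```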

Photo by Mika Baumeister on Unsplash

2020-07-15 12:20:38
Data retention in a distributed system https://iapp.org/news/a/data-retention-in-a-distributed-system https://iapp.org/news/a/data-retention-in-a-distributed-system For people who haven’t tackled this, it may be difficult to believe, but deleting data reliably from a large and complicated distributed system is hard. Deleting it within a specified retention time frame is harder. At this point, though, we privacy engineers have spent enough years on this to have worked out the principles.

  1. There can be only one primary data source for every piece of data. Know where it is.

If two different pieces of storage (e.g., files, database tables) claim to be the canonical source for a piece of data, start by disabusing them of this notion. In a distributed system, you should assume that computers will crash at any and all times, including in the middle of writing data. If two data sources each claim to be the primary, either one of them is wrong or you’re going to have a bad, bad day when the system goes haywire. When one of those “canonical” servers crashes in the middle of writing data (or if both try to write contradictory data!), you will not be able to tell which one is right. At that point, you have two different pieces of data both claiming to be right, and you are having a bad, bad day filled with code archeology and the distinct possibility that you won’t be able to reconstruct the correct answer.

Knowing the canonical source of each piece of data is also crucial for constructing your data portability system and responding to data subject access requests, but useful automated portability is a whole different topic.

  2. Every other copy of a piece of data is a secondary datastore. Know where and how stale they are.

Every datastore that isn’t the primary is a secondary datastore. There are two main types of secondary: copies and derived data. Copies of the primary datastore are just that: copies. They usually exist for caching (where some data is copied closer to where it will be used for speed), redundancy (where some data is copied as a backup in case of errors), or manual use (usually debugging or analysis).

Derived data is anything computed from some dataset that isn’t a straightforward copy. For example, a machine learning model, as well as both the intermediate and final results in some kind of computational pipeline are derived data. If this derived data is anonymous (and not just deidentified), then it is no longer user data and may have a different allowed retention period. In almost all cases, though, derived data is not anonymous and retains the same characteristics (e.g., user data) as the primary datastore, though it may not be easy to recognize.

For each secondary datastore, calculate its maximum staleness. For data derived or copied from the primary datastore, this is pretty easy: What is the maximum time lapse before that data is replaced with a new copy? For manual copies, the time frame is by default infinite. If a cache is replaced with a new copy of the primary every 24 hours, the maximum staleness is 24 hours.

Here is where things start getting tricky.

Secondary datastores often have their own secondaries. Perhaps someone is caching a machine learning model. Perhaps a data processing pipeline has multiple stages, each with its own results. Perhaps the two have been mixed somewhere. In these cases, the maximum staleness is the largest of the sums of the maximum staleness along all the computational paths to the datastore.
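
A minimal sketch of that calculation, assuming the copy-and-derivation graph can be written down explicitly; the datastore names and the per-hop staleness figures are invented for the example.

```python
from functools import lru_cache

# Maps each datastore to the secondaries copied or derived directly from it,
# labeled with the maximum staleness (in hours) that copy or derivation step adds.
EDGES = {
    "primary": [("cache", 24), ("ml_model", 168)],
    "cache": [("debug_dump", 48)],
    "ml_model": [("model_cache", 24)],
    "debug_dump": [],
    "model_cache": [],
}


@lru_cache(maxsize=None)
def max_staleness(store: str) -> int:
    """Largest sum of per-hop staleness over all paths from the primary."""
    if store == "primary":
        return 0
    return max(
        max_staleness(source) + hop_staleness
        for source, targets in EDGES.items()
        for target, hop_staleness in targets
        if target == store
    )


print(max_staleness("debug_dump"))   # 72: primary -> cache -> debug_dump
print(max_staleness("model_cache"))  # 192: primary -> ml_model -> model_cache
```

This only terminates on a graph without cycles, which is exactly why the recombinant case discussed next needs special handling.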

Recombinant datastores, ones that bend back on themselves, can be even trickier. Perhaps data from one of these secondaries is in your logs somewhere, like a log of which suggested cat pictures you have offered to a user and which of them the user chose to download. That log now contains data from, say, a recommendations datastore. The log’s maximum staleness is naively the length of time the log is retained plus the maximum staleness of the datastores whose data flows into it, like the recommendations datastore.

However, if you’re using this log as part of the inputs to the recommendations datastore, you’ve come around in a circle. Putting that data in a log doesn’t make it new. Coming around in a circle generally means, in practical terms, that you need to add anonymization somewhere to avoid infinite retention.

  3. Data in the primary must be deleted within the allowed retention timeframe. Each secondary must also be completely synced to the primary within the appropriate allowed retention time frame.

If your calculated retention timeframes are larger than allowed for the particular types of data in particular datastores, then you have to fix that. But you also have to ensure that your sync is complete. It’s surprisingly easy to think you’ve synced your datastore to the primary without it having been done completely in all possible cases.

For example, if you use a publish-subscribe system to push deletion notifications from one datastore (the publisher) to another (the subscriber), it will usually work quickly. But when things go really sideways, the system could lose deletion notifications. The subscriber might crash in the middle of handling a deletion notification but after having promised to take care of it (software bugs happen). Notifications are stored in a queue in case the subscriber goes offline for a while, but the queue can be overwhelmed and run out of storage in the case of too many published notifications (data centers go down, systems get swamped). Those failures don’t always give the system operator enough information to debug and fix the error, keeping the system from fulfilling its deletion promises.

Another way that data deletion might be accidentally incomplete is in the handling of data items that are edited rather than being fully deleted. Unless there is an explicit document history feature (like in Google Docs), when I change Alice’s phone number in my address book from “555-1212” to “867-5309”, then that’s effectively the same as deleting the old phone number and adding a new one. Edits, in most cases, need to be handled just like deletions with all the retention timeline calculation and system-checking that implies.

The easiest way to be sure that you’ve synced every item of a secondary datastore is to delete that secondary datastore entirely and recreate it. At that point, you can be quite sure that it’s not any more stale than the dataset it was copied from. In the case of processing pipelines or caches, this is by far the easiest approach. It is not always possible or practical, though: If the data is very large compared to the amount of copying bandwidth available or recreating a computation is too big to be done in one fell swoop, you’ll need to do a per-item sync.

If you need to sync two datastores and your datastore has a real, 100%-effective per-item data sync available, then use that; data syncing protocols can be a serious pain to debug. If you don’t have one, then you’ll need to build your own.

As I describe above, publish-subscribe protocols generally cannot be relied upon to reach 100% effectiveness in all circumstances. They are, however, faster than the 100%-effective methods and thus make a great part of an overall data deletion system.

Algorithms that are 100%-effective rely on one fundamental truth: Something is going to go wrong eventually. So they rely on persistently stored data that can be checked.

For example, one might need to delete all data related to a particular set of users with randomly generated user IDs. If you store those random user IDs, your system can check again and again that all relevant data has been deleted. If there’s a bug, engineers can fix it and the deletion job or sync job can be retried. Or your system can take the opposite approach: Store only the data you wish to keep and delete everything in the secondaries that doesn’t match. For instance, your photo database stores only the photos that have not been deleted; all others should be deleted from the secondaries.

The standard, robust way of performing this sync is to use the mark-and-sweep algorithm. In its simplest form, this means “marking” every data element in the secondary datastore. If an element should not be deleted (i.e., it is not deleted in the primary datastore), then remove the mark. When the un-marking is completed, “sweep” the marked elements away (delete them). This algorithm may over-delete but is very unlikely to under-delete. In the case of error, a secondary datastore can be re-created from the primary. Running this algorithm periodically is a robust way of syncing deletions from a primary datastore.
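
A minimal in-memory sketch of that mark-and-sweep pass, assuming both stores can be treated as dictionaries keyed by record ID; a real implementation would scan in batches and tolerate failures, but the shape of the algorithm is the same.

```python
def mark_and_sweep(primary: dict, secondary: dict) -> dict:
    """Delete from the secondary anything no longer present in the primary.

    May over-delete (a record added to the primary mid-run simply gets
    re-copied on the next refresh) but will not under-delete.
    """
    marked = set(secondary)           # mark every element in the secondary
    for record_id in primary:         # un-mark anything still live in the primary
        marked.discard(record_id)
    for record_id in marked:          # sweep: delete whatever stayed marked
        del secondary[record_id]
    return secondary


primary = {"u1": "kept photo", "u3": "kept photo"}
secondary = {"u1": "cached copy", "u2": "stale copy", "u3": "cached copy"}
print(mark_and_sweep(primary, secondary))  # u2 is swept away; u1 and u3 remain
```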

No matter which technique you are using, whether periodic deletion or a per-item sync, I highly recommend performing data deletion somewhat more frequently than required. Computers are a part of the messy world in which we live. Computers break. Backhoes take out fiber and leave data centers temporarily isolated. Software goes haywire and has to be debugged. Build slack into your schedule to account for this.

  4. … and if you’re not monitoring this, don’t assume it’s working.

No matter how robust your algorithm is, you cannot assume it’s working. Like everything else you care about in a computer system, you should monitor its performance. You might carefully check the deletion code for correctness, but if your code isn’t actually running because job scheduling isn’t working, then your deletion code is still broken.

Monitoring requires that you check the actual state of the system in a way that is independent (or as independent as possible) of the code you are monitoring. In this case, that means not looking at the deletion code directly. You need to check the datastores directly.

If a particular store is being periodically deleted and re-created, you may perform this monitoring (at least primarily) on the filesystem or database logs. Do you see the datastore being deleted at least as frequently as is required?

If a datastore needs a per-item sync, then you need per-item monitoring. Take as an example our lists of user IDs whose data should be deleted. Do they appear as an owner of data in the datastore? If so, then your data deletion job has a problem somewhere. Building a predictable monitoring job of this sort requires building some understanding of the data (e.g., where the user IDs live) into the monitoring job.
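
A minimal sketch of such a per-item check; the field names and record layout are assumptions made for the example.

```python
def find_retention_violations(deleted_user_ids, datastore_records):
    """Return records still owned by users whose data should already be gone.

    This reads the datastore directly rather than trusting the deletion job,
    so it keeps working even if that job has silently stopped running.
    """
    deleted = set(deleted_user_ids)
    return [record for record in datastore_records if record["owner_id"] in deleted]


records = [
    {"owner_id": "u1", "blob": "photo"},
    {"owner_id": "u7", "blob": "photo"},  # u7 requested deletion weeks ago
]
violations = find_retention_violations({"u7"}, records)
if violations:
    print(f"ALERT: {len(violations)} record(s) past their deletion deadline")
```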

Even better than monitoring for noncompliance is monitoring in advance of noncompliance so that bugs can be fixed before the system blows its retention window. If you plan to have data deleted several days before the retention timeline expires, you can let the engineers know, with increasing urgency as the deadline approaches, that there may be a problem. These alerts must be carefully tuned; if they fire often when there isn’t a problem, they aren’t helpful and are likely to be ignored.

Humans are really good at accidentally ignoring data that is generally irrelevant.

Predictable monitoring is conservative; it looks for clear violations of the invariants involved and fires an alert. Accidentally retained data is not always so straightforward to track down. In some cases, you need heuristic monitoring, as well.

For example, if the identifiers have been stripped from data but it is not anonymized, you can employ a scalable data scanner to find possible joins between the deidentified data and identified datastores — or even more deeply transformed data. A less-sophisticated but more common scanner uses regular expressions to identify data that “looks like” whatever you’re looking for, such as government IDs, email addresses, IP addresses and phone numbers.
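
A minimal sketch of that kind of pattern-based scanner; the patterns are deliberately simple assumptions for the example, since production scanners validate matches far more carefully to keep false positives down.

```python
import re

# Deliberately simple patterns; real scanners validate matches (checksums,
# country-specific formats) to reduce false positives.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scan(record: str) -> dict[str, list[str]]:
    """Return every pattern that appears to match somewhere in the record."""
    hits = {name: pattern.findall(record) for name, pattern in PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}


print(scan("contact alice@example.com from 10.0.0.1, tel 555-867-5309"))
# {'email': ['alice@example.com'], 'ipv4': ['10.0.0.1'], 'phone': ['555-867-5309']}
```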

Heuristic monitoring can find issues that more straightforward monitoring may miss, but this characteristic leads to two further problems: false positives and false negatives. False positives are errors made by the monitoring job where it flags a possible violation that is not, in fact, a problem. These noisy alerts face the same problems described above: They waste time and attention and, if fired often, are counterproductive. The best part of false positives is that they are visible. That means that the engineers have some hope of fixing them.

False negatives are trickier. False negatives are when your monitoring does not flag a problem that really exists. Because these errors are silent, you don’t know they’re there and thus cannot feed corrections or fixes into the monitoring algorithm.

The only way to find such errors and correct them is to perform very careful manual checks, preferably by privacy engineers with strong skills in statistics.

Photo by Markus Spiske on Unsplash

2020-07-14 12:20:28
Deidentification 201: A lawyer’s guide to pseudonymization and anonymization https://iapp.org/news/a/de-identification-201-a-lawyers-guide-to-pseudonymization-and-anonymization https://iapp.org/news/a/de-identification-201-a-lawyers-guide-to-pseudonymization-and-anonymization We previously wrote this 101-level guide to deidentification, hoping to make it easier to understand how deidentification works in practice. This article is meant to be a 201-level follow-up, focused on what deidentification is, what it isn’t and how organizations should think about deidentifying their data in practice. 

So let’s dive right in.

What are direct and indirect identifiers?

Identifiers are personal attributes that can be used to help identify an individual. Identifiers that are unique to a single individual, such as Social Security numbers, passport numbers and taxpayer identification numbers, are known as “direct identifiers.” The remaining kinds of identifiers are known as “indirect identifiers” and generally consist of personal attributes that are not unique to a specific individual on their own. Examples of indirect identifiers include height, race, hair color and more. Indirect identifiers can often be used in combination to single out an individual’s records. Carnegie Mellon University's Latanya Sweeney, for example, famously showed that 87% of the U.S. population could be uniquely identified using only three indirect identifiers: gender, five-digit ZIP code and birthdate. 

What’s pseudonymization? 

Pseudonymization can be thought of as the masking of direct identifiers. As we explained in our 101-level guide, there are a variety of different masking techniques that can be used to hide direct identifiers (and some are stronger than others). While pseudonymization can reduce re-identification risk — which is why recent data protection laws, like the EU General Data Protection Regulation and the California Consumer Privacy Act, incentivize pseudonymization — it does not by itself meet the level of protections required for true anonymization. 

So what’s anonymization?

Anonymization is the process through which personal data are transformed into non-personal data. From a technical point of view, a big part of this process involves altering the data to make it difficult to match any records with the individuals they represent.

Legal standards for what counts as anonymization vary. Some laws, like the U.S. Health Insurance Portability and Accountability Act, set this threshold at the point where a statistical expert attests that only a “very small” risk of re-identification exists (the so-called expert determination standard), highlighting how subjective the legal concept of anonymization is in practice.

Regulators in other jurisdictions refer to remote re-identification risks, such as the U.K. Information Commissioner's Office, or to “robust[ness] against identification performed by the most likely and reasonable means the data controller or any third party may employ,” in the case of the former EU Article 29 Working Party. In all cases, however, there’s always some level of risk assessment involved and always some consequent level of uncertainty. This means that knowing whether anonymization has been achieved is rarely a black-and-white proposition.

What’s a risk-based approach to anonymization?

If we’re realistic about anonymization — and realism is the job of all lawyers, in our view! — the best we can hope for is getting the risks of re-identification low enough to be reasonable or functionally anonymized. Here, the concept of “functional anonymization” means that the data is sufficiently anonymized to pose little risk given the broader controls imposed on that data. This risk-based approach finds its roots in statistical disclosure methods and research, considering “the whole of the data situation,” to quote the U.K. anonymization framework. This type of risk-based approach is grounded in statistical methods — with a healthy dose of realism — and tends to be favored by regulators in multiple jurisdictions, from the U.S. Federal Trade Commission to the U.K. ICO and more.

Can mathematical techniques help with anonymization, like k-anonymization and differential privacy?

Yes! But the key caveat is this: not on their own.

There are a host of statistical techniques that can help preserve privacy and lead to functional anonymization when combined with additional controls. For this reason, these techniques are often referred to as privacy-enhancing technologies. But these tools also require oversight, analysis and a host of context controls to meaningfully protect data, meaning that it is not enough to run fancy math against a dataset to anonymize it (as nice as that would be). These types of techniques include:

  • K-anonymization, which is a data generalization technique that ensures each combination of indirect identifiers matches a minimum number of other records, making it difficult to identify individuals within a dataset (that minimum number of matching records is referred to as “k,” hence the name). For example, in data that’s been k-anonymized with k set to 10 and indirect identifiers that include race and age, we would see at least 10 records for every combination of race and age that appears. The higher we set k, the harder it will be to use indirect identifiers to find the record of any specific individual.
  • Differential privacy, which is a family of mathematical techniques that formally limit the amount of private information that can be inferred about each data subject. There are two main flavors of differential privacy, offering slightly different privacy guarantees: “global,” which offers data subjects deniability of participation, and “local,” which offers deniability of record content. Despite being slightly different, both operate by introducing randomization into computations on data to prevent an attacker from reasoning about its subjects with certainty. Ultimately, these techniques afford data subjects deniability while still allowing analysts to learn from the data.

While PETs like k-anonymization and differential privacy can offer mathematical guarantees for individual datasets, it’s important to note these guarantees are based on assumptions about the availability of other data that can change over time. The availability of new data, for example, can create new indirect identifiers, exposing formerly k-anonymized data to attacks.

Differential privacy faces similar issues if an attacker is allowed access to other differentially private outputs over the same input. In practice, this means that context is always a critical factor in applying PETs to data. You can read more about PETs in this UN report on privacy-preserving techniques.
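
For readers who want to see what these techniques look like in code, here is a minimal sketch of a k-anonymity check and a differentially private count. The toy records, the choice of k and epsilon, and the use of the Laplace mechanism in the global model are all illustrative and, as stressed above, running this alone does not anonymize anything.

```python
import math
import random
from collections import Counter

# Toy records; the quasi-identifiers, k and epsilon below are illustrative only.
records = [
    {"race": "A", "age_band": "30-39", "income": 52_000},
    {"race": "A", "age_band": "30-39", "income": 61_000},
    {"race": "B", "age_band": "40-49", "income": 48_000},
]


def satisfies_k_anonymity(rows, quasi_identifiers, k):
    """True if every combination of indirect identifiers appears at least k times."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(n >= k for n in counts.values())


def laplace_noise(scale):
    """Sample Laplace(0, scale) via the inverse CDF, using only the stdlib."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))


def dp_count(rows, predicate, epsilon):
    """Differentially private count: a counting query has sensitivity 1,
    so Laplace noise with scale 1/epsilon is enough in the global model."""
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon)


print(satisfies_k_anonymity(records, ["race", "age_band"], k=2))       # False: (B, 40-49) appears once
print(dp_count(records, lambda r: r["income"] > 50_000, epsilon=0.5))  # a noisy value near 2
```

Even here, the outputs are only as safe as the surrounding context controls; a passing k check today can be undermined tomorrow by newly available auxiliary data.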

So if I want to functionally anonymize data, what should I do?

When performing functional anonymization, we recommend, for starters, combining controls on data and on context. Data controls include the types of operations we’ve discussed in the 101-level guides and in this post, like masking and differential privacy, and which we might simply refer to as “data transformation techniques.”

But equally important are controls on context, which include items like access controls, auditing, query monitoring, data sharing agreements, purpose restrictions and more. Context can be thought of as the broader environment in which the data actually sits — the more controls placed on the data, the lower the re-identification risk will be. Within this framework, it’s also helpful to think about all the different types of attacks and disclosures you’re trying to avoid and assess how likely each scenario is given your controls.

As with our 101-level guide, we’d like for this post to be interactive, so if you think we’re missing an important area or simply have feedback for us, please comment below or reach out to governance@immuta.com.

Photo by Charlie Egan on Unsplash

2020-05-28 11:25:46
What lives between data privacy and data governance? Better compliance https://iapp.org/news/a/what-lives-between-data-privacy-and-data-governance-better-compliance https://iapp.org/news/a/what-lives-between-data-privacy-and-data-governance-better-compliance Data is powerful. It is used by organizations to make better business decisions, streamline operations and reduce overall operating costs. Many of today’s Fortune 1000 companies transformed their business by embarking on a digital journey that aligned data as their most valuable asset.

Of course, things that are valuable need to be protected. Data has the power to be transformative because it often contains sensitive information that could bring harm to the individuals it concerns.

Chief privacy officers face new regulatory requirements for protecting and reporting on that sensitive data, which has created an urgent need for companies to better manage their data assets in the first place. Previously unregulated organizations are enhancing their data governance programs to address this need. As part of that effort, it’s necessary for CPOs and chief data officers to collaborate more efficiently to manage, protect and report on their organizations’ data.

The GDPR and CCPA wake-up call

With the recent adoption of the EU General Data Protection Regulation and the California Consumer Privacy Act, privacy regulation reached beyond the previously regulated sectors of finance, health and children’s data to specify that any organization processing “personal data” or “personal information” must meet new compliance standards in its data practices or face costly fines.

With data privacy under the spotlight and regulations evolving across the globe (as of this writing, 61 countries have privacy regulations in consideration), data-driven organizations are getting more strategic and forward-thinking about their data governance. Companies can no longer afford to treat each new privacy regulation as a standalone project or spend hours manually collecting and aggregating data for custom reporting on individuals. They need the right solutions to operationalize and automate their data assets at scale.

Unveiling the blind spot of data governance

Enter data governance and the role of the CDO. Data governance is the management of the quality and integrity of data across an organization. It ensures there is a consensus and truth in the data and that it can be relied on to be accurate and complete for all functions in an organization.

The CDO is responsible for executing the activities necessary for managing data and shaping the data policies and data sharing agreements. For any organization that collects and processes customer, employee or business-sensitive data — and wants to ensure that data remains as accurate, complete and “true” as possible — the CDO can be the CPO’s best friend. And in just about every organization, there’s a growing need for them to work together to achieve ongoing compliance.

As things stand, companies — especially those outside of previously regulated sectors, like health and finance — may have gaps in their existing data management programs. These organizations either lack historical knowledge and documentation of the full breadth of their data assets, or their data is spread out across a diverse technological landscape. Plus, the sheer amount of metadata that is generated on a daily basis can create issues in efficiently fulfilling requests, including data subject access requests, and that can only be fixed by addressing data governance.

Governance gets its day

Properly managed and governed data can support all the organization’s business functions, including data privacy management.

So, while privacy regulations may be the catalyst, it turns out that one solution for achieving compliance comes down to the responsible handling of data. For many companies that have previously failed to build a sustainable data program, data governance is enjoying a moment in the spotlight. This is thanks to funding devoted to GDPR compliance and the game-changing formalization of data processing the regulation essentially demands.

The legal language surrounding these regulations fails to capture the complete and holistic picture of what governing an entire organization's data assets looks like. For example, data discovery of personal information under the CCPA is only a small portion of data governance activities. Data found near personal information (aka proximity data) expands the type of data that needs to be cataloged and categorized for further documentation on its availability, usage and context.

Proximity data can include an IP address for a person, related health records and even cookie settings, for instance. Since these expanded datasets also need to be included in the governance program specific to the CCPA, a proactive approach is to build a flexible and expansive data program that can proactively prepare for various privacy-related reporting requirements.

Overall, organizations must make the best use of limited resources to support a variety of requirements. This translates into building a mature framework with repeatable and efficient processes that quickly respond to new — and sometimes conflicting — regulatory requirements.

When privacy and governance work together

There are several methods that privacy and data officers can use to create defensible programs for responding to imminent regulatory and privacy threats.

1. Define and classify. The most important focus should be on building a data foundation represented by discrete building blocks of data elements. These attributes include but are not limited to:

  • Definitions.
  • Purpose of use.
  • Business rules.
  • Business and data controls.
  • Business processes.
  • Data quality rules and scores.
  • Risk impacts.
  • An ownership matrix.

In addition, a data catalog is an inventory of available data and associated attributes, including a classification that labels data as confidential, sensitive, internal and so on.

  • For the data governance officer: This identifies how to treat data classified as confidential, for example when setting access levels or prioritizing governance projects.
  • For the privacy officer: This helps spell out the risk associated with processing activities involving that data.

2. Tagging. The second data governance method for privacy regulation is the inclusion of a category in the data catalog.

  • For the data governance officer: This attribute describes the purpose of usage for the data.
  • For the privacy officer: Both the GDPR and CCPA mandate that an entity describe the purposes for which that data is used.

3. Identify data lineage. The third method that aligns governance and privacy together is documenting how data flows from upstream to downstream.

  • For the data governance officer: Data lineage documents and illustrates the full end-to-end journey of a data element, starting from the “authoritative” source that created the data to downstream sources and applications that store it, display it or both.
  • For the privacy officer: This reveals the original data source or provenance of where the data is collected and its lifecycle throughout the business (a sketch of a single catalog entry combining these three methods follows this list).
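
Pulling the three methods together, here is a minimal sketch of what a single data catalog entry might record; every field name is an assumption chosen for illustration rather than a standard schema.

```python
# One illustrative data catalog entry tying classification, purpose tagging and
# lineage together; the field names are assumptions, not a standard.
catalog_entry = {
    "element": "customer_email",
    "definition": "Primary email address supplied by the customer at signup",
    "classification": "confidential",                        # method 1: define and classify
    "purpose_of_use": ["account_notifications", "support"],  # method 2: tagging
    "lineage": {                                              # method 3: data lineage
        "authoritative_source": "signup_service.users",
        "downstream": ["crm.contacts", "marketing.email_list"],
    },
    "owner": "customer-data-team",
    "retention": "life of account plus 90 days",
}

# From one place, a privacy officer can answer what the data is, why it is
# processed, how sensitive it is and where it flows.
print(catalog_entry["classification"], catalog_entry["lineage"]["downstream"])
```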

Full compliance, real insights

Any entity that processes data must do so in a responsible manner that puts the data of its customers and employees first. Data privacy and governance form an important intersection where that can happen and where countless opportunities to address regulatory compliance live. While privacy may be the financial and regulatory impetus for a company’s decision to better evaluate its data assets, a solid data governance program can serve as the bedrock to manage and protect those data assets.

As such, it’s crucial that CDOs and CPOs collaborate effectively and frequently to develop new internal processes and procedures that efficiently manage, protect and report on data. Organizations can implement technology software to map both structured and unstructured data, operationalize and automate all data holdings, eliminate duplication of data, manage breach investigations, and assist with required reporting activities.  

By taking a bottom-up approach to data, the CPO and CDO together can create a defensible privacy framework that not only puts its business into full compliance, but also provides value by creating real insights derived from data.

Photo by Stephen Dawson on Unsplash

2020-04-28 12:10:22
Aggregated data provides a false sense of security https://iapp.org/news/a/aggregated-data-provides-a-false-sense-of-security https://iapp.org/news/a/aggregated-data-provides-a-false-sense-of-security We are in the midst of a global pandemic, and the need to access COVID-19-related data has become increasingly important to make evidence-based policy decisions, develop effective treatments, and drive operational efficiencies to keep our health care systems afloat. Accessing personal data comes at a risk to privacy, and there are many unfortunate examples of harm coming to individuals diagnosed with COVID-19. These challenging times are likely putting pressure on many privacy professionals the world over who are on the front lines of supporting data sharing and data releases.

One common refrain we’re hearing is to “aggregate” data to make it safe for sharing or release. Although aggregation may seem like a simple approach to creating safe outputs from data, it is fraught with hazards and pitfalls. Like so many things in our profession, the answer to the question of whether aggregation is safe is “it depends.”

The final count down

In fact, the field of statistical disclosure control, used by national statistical organizations, was largely born out of the need to protect information in aggregated or tabular form. Various methods have been developed. Probably the most well known is to require a minimum number of people to be represented in any categorization of aggregated data, called the threshold rule. For example, applying the threshold rule of 10 would mean that at least 10 people are represented with the same combination of geographic region, sex and age. This seems simple enough, but things get complicated in a variety of ways.

Let’s assume that aggregated data is represented in a table, with columns representing the identifying attributes of region, sex and age. Another column is added representing the count of people for a combination of these identifying attributes. Applying the threshold rule of 10 would mean removing any counts less than 10, as shown in the table below. Oftentimes, however, summaries are included that count down the column to provide a total (known as a marginal in statistics).

Region  Sex  Age          Count
A       M    10-19        *
A       M    20-29        17
A       M    30-39        15
A       M    All (10-39)  39

If the summary row is included, the last row in the table above, we can easily determine that the count for region A, sex M and age 10–19 is 39 – 15 – 17 = 7. OK, yes, this was an obvious example. But it’s a common problem, and there are known methods to easily reverse aggregation across many dimensions when marginals are included (such as the shuttle algorithm). Those marginals may also come from published releases, for example, the overall number of positive COVID-19 cases in a region.
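
A minimal sketch of the threshold-rule suppression and the differencing attack just described; the counts mirror the table above and the threshold of 10 matches the example.

```python
THRESHOLD = 10

# Counts per (region, sex, age band), mirroring the table above.
cells = {
    ("A", "M", "10-19"): 7,
    ("A", "M", "20-29"): 17,
    ("A", "M", "30-39"): 15,
}
marginal_total = sum(cells.values())  # the "All (10-39)" summary row: 39


def apply_threshold(counts, threshold):
    """Suppress (replace with None) any cell count below the threshold."""
    return {key: (n if n >= threshold else None) for key, n in counts.items()}


published = apply_threshold(cells, THRESHOLD)
print(published)  # the 10-19 cell is suppressed

# The differencing attack: if the marginal is published alongside the table,
# the suppressed cell is trivially recovered from the surviving cells.
recovered = marginal_total - sum(n for n in published.values() if n is not None)
print(recovered)  # 7, exactly the count the threshold rule tried to hide
```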

Is the lesson here not to include summaries or marginals?

Well, summaries of this type can also come up when aggregated data is produced over time. Say the above table was produced for the month of February, in which {region A, sex M, age 30–39} = 15, and for the month of March, an updated table is produced with {region A, sex M, age 30–39} = 16. It’s easy to see that one person was added from one month to the next, which would violate the spirit of the threshold rule since fewer than 10 people would be represented in the difference between counts.

Fables of the reconstruction

The challenges just described are due to overlapping counts. Things can get more complicated when we consider aggregating numerical data. This is why other methods were developed, such as combining the threshold rule with a proportionality measure to ensure that the aggregation doesn’t end up representing only one or two people, known as the dominance rule. Say, for example, the data included household income. Adding up the income for more than 10 people may not provide sufficient protection if only one person’s income ends up representing 80% or more of the total income.
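
A minimal sketch of that dominance check, assuming the common form in which the top n contributors may not account for k percent or more of the cell total; the one-contributor, 80% parameters match the example.

```python
def violates_dominance_rule(values, n=1, k=0.80):
    """True if the top n contributors account for k (e.g., 80%) or more of the total."""
    total = sum(values)
    if total == 0:
        return False
    top_n = sum(sorted(values, reverse=True)[:n])
    return top_n / total >= k


incomes = [250_000, 8_000, 7_500, 6_000, 5_500, 5_000, 4_800, 4_500, 4_200, 4_000]
print(len(incomes) >= 10)                # the cell passes a threshold rule of 10 people
print(violates_dominance_rule(incomes))  # True: one person dominates the aggregate
```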

This leads us to a problem with the word “aggregation.”

Broadly interpreted, it could mean collating data together (the least fitting to our current discussion), counting people in a group, adding up numerical data of people, or, while we’re at it, calculating any statistic of data about people. Statistics can introduce a whole new realm of complexity (and not just because people dread the study of statistics).

The challenge is that all these numerical forms of aggregation can be used to reconstruct the original data, something that has come to be known as the database reconstruction theorem. Generally speaking, the more statistics produced from the same underlying data, the more likely it is that the underlying data can be reconstructed from those statistics. This is because there are only so many combinations of data that could have produced those statistics. This is the reason the U.S. Census Bureau has invested so heavily in developing new and innovative tools using differential privacy.

One or the other

Deidentification or anonymization (terms that are often used interchangeably) is broadly defined as removing the association between data and people. The data can be record-level or aggregated; the same principles apply, as was recognized in the standardization efforts of ISO 20889 on terminology and techniques. It could be argued that applying these techniques to aggregated data is somewhat after the fact, whereas applying them to the record-level data is more akin to the spirit of privacy by design. However, there are methods that build these techniques into the aggregation itself.

Regardless, as privacy professionals, we should not be using the terms deidentified or anonymized when only directly identifying attributes are removed from data. Perhaps it’s time we agree to call this pseudonymized data, given the influence the EU General Data Protection Regulation has had around the world.

While it may seem simple to suggest using aggregated data, things are never as simple as they seem in the world of privacy, and “it depends” is a common refrain. In truth, it may be safer to suggest using aggregated data that has been deidentified or anonymized, or to remove the association between aggregated data and people. But let’s not assume that aggregated data is safe, or we’ll provide a false sense of security in how data outputs are shared or released.

Stuck in the middle

This isn’t to say that producing safe data outputs needs to be overly complex.

In the context of the COVID-19 pandemic, if at no other time, we need to think of ways to produce useful data that is sufficiently protected and in an efficient and scalable manner. For non-complex data outputs, such as counts, keep it simple: a standard set of attributes (e.g., region, sex, age), aggregated by those attributes, with no accompanying summary statistics produced from the underlying data (so that there’s less risk of overlap that may reveal the underlying counts), and for a specific reporting period with no overlap from previous reporting periods.

For scenarios in which more detailed data is needed, a little more sophistication may be required, especially for data that is potentially identifying. However, a great deal can also be done besides just changing data outputs through transformations.

The threat landscape can be reduced by only sharing or releasing data with those people who need it for approved purposes and in suitably protected data environments with appropriate technical and organizational controls (see, for example, the Five Safes). A limited context should also limit the data transformations needed to render data outputs safe.

While this is seemingly more complex than suggesting aggregated data, it does speak to our common answer of  “it depends.”

Photo by Lukas Blazek on Unsplash

2020-04-27 11:50:15
Microsoft launches open-source privacy mapping tool https://iapp.org/news/a/microsoft-launches-open-source-privacy-mapping-tool https://iapp.org/news/a/microsoft-launches-open-source-privacy-mapping-tool Microsoft has launched a new open-source tool mapping ISO's global privacy standard, ISO/IEC 27701, to nine different privacy laws from around the world.

The “Data Protection/Privacy Mapping Project,” as it is named, maps ISO/IEC 27701 to the EU General Data Protection Regulation, California Consumer Privacy Act, Brazil’s General Data Protection Law, Australia’s Privacy Act, Canada’s Personal Information Protection and Electronic Documents Act, Singapore’s Personal Data Protection Act, Hong Kong’s Personal Data Ordinance, South Korea’s Personal Information Protection Act, and Turkey’s Data Protection Law.

Privacy teams have been clamoring for such global charts, pinpointing where and how privacy laws line up.

With the number of data protection laws reaching into the hundreds and new legislation introduced weekly, privacy professionals are struggling to keep up. This new initiative could prove useful in addressing this challenge but is certainly not the first such attempt to create a global key to this complex map.

Privacy teams across industry often create their own internal rubrics as compliance reference points and privacy tech vendors incorporate such maps into their own software. As privacy professionals consider this new tool, one important consideration will be whether ISO/IEC 27701, designed intentionally to align with the GDPR, is the right reference point against which to map all these laws. 

That question aside, one element that differentiates this initiative is the plan to crowd-source future contributions. While the current version maps the alignment of only nine laws, it is designed as an open-source tool, hosted on GitHub, to which anyone with relevant expertise can contribute.

The project was launched by Alex Li, Microsoft’s director of certification policy within its privacy and regulatory affairs team. Li explained the goal of the project is “to provide the global privacy engineering community a shared understanding of how the ISO/IEC 27701 controls relate to global regulatory requirements.”

When ISO/IEC 27701 was released, it already included a mapping to the GDPR. But, as Li notes, “however prominent, (the) GDPR is one of many data protection regulations. This project aims to engage the global privacy community to expand the scope of mapping to additional regulations to build a shared understanding of regulatory requirements and consistency in regulatory accountability around the world.”

Microsoft has also started hosting workshops with EU data protection authorities to discuss the potential of using ISO/IEC 27701 as the basis of a GDPR certification to achieve a similar aim.

Microsoft has enlisted two initial “data curators” to review and vet submissions of mappings to data protection laws from around the world.

Eric Lachaud, a senior IT consultant and guest researcher at the Department of Law, Technology, Markets, and Society at Tilburg University in the Netherlands, is one of these curators. Lachaud's research, which focuses on the contribution of certification to data protection regulation, aligns closely with the project’s goals. He explained his interest in the initiative and the value he thinks it offers to the privacy community. “There is a recurring need to bridge regulations worldwide in order to ensure interoperability between the frameworks,” he said, referencing the initiative launched by the French regulatory authority, the CNIL, several years ago to explore alignment between EU binding corporate rules and APEC Cross Border Privacy Rules.

“This tool could help the authorities and practitioners to build, test and possibly recognize 'official' correspondences,” he said, noting the tool “could help to identify certain patterns that could be used by the authorities to draft auditable and certifiable provisions.” 

While the project aims to bridge the divide between technical and legal professionals, it could help the broader privacy community better understand how to implement a global privacy program. In the 2019 IAPP-TrustArc Measuring Privacy Operations Survey, 56% of respondents indicated they have a global privacy strategy.

Anecdotally, though, practitioners have said that a hybrid approach is often necessary to reflect local nuances or, in some cases, strategic deviations, such as in markets with data localization requirements.

The release of this open-source tool could help privacy professionals better understand those areas of convergence and divergence in today’s legal landscape. IAPP’s Westin Research Center has also published a mapping of ISO/IEC 27701 to IAPP’s CIPM and CIPP/E certifications to shed light on the professional skill set needed to implement a global privacy standard and will submit the mapping to this new initiative.

To learn more about the new tool or contribute to its development, visit here or, if attending RSA 2020, join our panel discussion on ISO/IEC 27701.

Photo by NASA on Unsplash

2020-02-21 12:25:41
A closer look at Carnegie Mellon's privacy engineering program https://iapp.org/news/a/a-closer-look-at-carnegie-mellons-privacy-engineering-program https://iapp.org/news/a/a-closer-look-at-carnegie-mellons-privacy-engineering-program Though the privacy engineering field is relatively new, there’s been at least one school that’s been at the forefront of this nascent profession. Carnegie Mellon University launched its MSIT- Privacy Engineering Program in 2013. It remains the only privacy engineering-focused master’s program in the country. As industry demand for privacy professionals with technical skills accelerates, it is worth taking a deeper look at the program and how it prepares new professionals for this burgeoning field.

CMU’s MSIT-PE program website describes the degree as one “designed specifically for computer scientists and engineers who want to make [a] meaningful impact as privacy engineers or technical privacy managers.” CMU is the number one-ranked computer science school in the country, according to U.S. News and World Report, and employs numerous privacy scholars, so it is well placed to craft such a curriculum.

Lorrie Cranor, a co-director of the MSIT-PE program and member of IAPP’s Privacy Engineering Section Advisory Board, shared her thoughts about the program.

“The goal of the program is to graduate students who are well prepared for privacy engineering jobs,” Cranor said. “That was the premise when we started the program. That led to the next question - what do privacy engineers need to know how to do? We did spend a lot of time talking with companies that were hiring privacy engineers to find out what they were looking for. I think that gave us some basic ideas for our curriculum. But, the role of privacy engineers is evolving over time, so our curriculum also evolves with it.”

Until recently, “privacy engineers” were rarely found outside the biggest tech companies. But, according to Cranor, that is changing quickly. “We are seeing all sorts of companies that are hiring privacy engineers,” she said. “Privacy lawyers don’t get them everything that they need. They are interested in going beyond the idea of legal compliance. They want to build products and services with privacy built in, and they need people with technical expertise that also understand privacy in order to do that.”

And many of those hiring are looking to CMU to find a workforce with the skills to accomplish that.

Carnegie Mellon is one of several universities participating in the IAPP’s Privacy Pathways program, which aims to build on privacy curricula in universities to launch students into privacy-related careers. IAPP CEO Trevor Hughes, CIPP, has followed CMU’s work closely since the inception of its MSIT-PE program. "We are seeing a huge need in industry for technical professionals with an understanding of the legal and policy context underpinning privacy,” he said. “Going forward, we expect to see an even greater demand for privacy professionals with an interdisciplinary education, combining privacy law, tech and design elements. CMU is making great strides toward meeting that need.” 

CMU’s 12-month curriculum combines courses in technology with law and policy as well as usability and human factors. To graduate, students must also complete a privacy-by-design practicum project, where they work in teams with an outside company to tackle a real-world privacy challenge. In one past project, students worked with a social media company to design prototypes of user interfaces to help address the privacy concerns of elderly users. In another, students explored different means of communicating about privacy and gaining consent from drivers of vehicles deploying software collecting personal data. 

Cranor believes these projects are useful for both the students and the companies. “It is great to be able to tackle a real-world problem, not just something their professors made up, but a real problem that a company has and learn about the kind of real-world constraints that companies actually face in addressing these privacy challenges.”

In addition to working directly with companies employing privacy engineers during their time at CMU, students can participate in faculty research. Cranor says a number of students take advantage of this every semester. 

Where do CMU’S MSIT-PE graduates end up? Many, though certainly not all, find their way to Silicon Valley.

James Arps, a current student in the program, who has seen strong interest in MSIT-PE graduates from potential employers, commented on his own career search and the leg up that CMU provided. “In addition to sending out applications, I was also fortunate enough to receive interview opportunities from CMU's strong network of privacy-focused alumni,” he said. “The employers I spoke with are interested in candidates who are not only familiar with privacy-by-design principles and the current legislative environment, but who are also passionate about user advocacy and have effective communication skills. CMU’s privacy engineering program helped me develop those qualities through a healthy mix of topical lectures and portfolio-worthy class projects which challenged me to think critically about real issues facing industry professionals today.” 

Cranor says CMU graduates take on many different types of privacy engineering roles. Some “are assisting with legal privacy compliance efforts;” others “are on product development teams,” she said. “We have some, particularly in big companies, who are building privacy tools that can be used by developers who are not experts in privacy in order to build [it] in throughout the products and services offered by their company.”

Lea Kissner, chief privacy officer at Humu, spoke to the need for privacy engineers in industry.

"The demand for privacy engineers massively outstrips supply,” said Kissner. “I get asked where to hire good privacy engineers at least once a week." In her previous position at Google, Kissner hired numerous graduates of CMU's privacy engineering program. "They've been excellent, with a good grounding in both the needs of users and the technical chops to make that happen," Kissner said.

CMU’s program has started small, but Cranor hopes to see it grow in the coming years. The challenge, she says, is getting enough qualified applicants. CMU looks for students with strong technical skills and an interest in privacy. Previous job experience is not needed, though some students have come to the program as mid-career professionals looking for a change. 

She is also working with her colleagues to explore other non-classroom-based means of offering education in privacy engineering. “We have had quite a few inquiries from companies that would like their employees to be educated, but don’t want to send them to Pittsburgh for a year. So, they have asked us if we could put together some sort of certificate or exec ed program that we could provide training either with some of our faculty coming in person for a week or so or recording some lectures that their employees could watch online.”

Cranor said that they are currently exploring options and would love to hear from companies that would be interested in this. 

Top photo: Lorrie Cranor

2019-09-20 12:28:56
Privacy engineering: The what, why and how https://iapp.org/news/a/privacy-engineering-the-what-why-and-how https://iapp.org/news/a/privacy-engineering-the-what-why-and-how Privacy engineering will be central to the privacy profession going forward.

That is an easy assertion to make. Privacy professionals have long discussed the importance of building privacy in rather than bolting it on — aka privacy by design. But as technology has raced ahead, the need for privacy engineering has evolved and intensified.

When I began working at the IAPP, I was tapped to lead our privacy engineering initiative — i.e. better define what it is, why it matters and how we can support the professionals doing it. I want to share some of what I have learned from leading practitioners and renew our call for continued engagement from those working in this exciting field.

What is it?

In short, privacy engineering is the technical side of the privacy profession. Privacy engineers ensure that privacy considerations are integrated into product design. The longer answer is that it depends who you ask. Some practitioners view it as process management and others see it more as technical know-how. Both views seem equally valid and integral. Privacy engineers today work as part of product teams, design teams, IT teams, security teams, and yes, sometimes even legal or compliance teams. The Privacy Engineering program at Carnegie Mellon describes the need for practitioners who “understand technology and [are] able to integrate perspectives that span product design, software development, cyber security, human computer interaction, as well as business and legal considerations.”

Regardless of where they sit, privacy engineers must serve as translators between these teams, turning privacy requirements into technical realities.

Why does it matter?

Privacy engineering matters not only because it leads to better products, but because it can significantly influence a company’s bottom line.

Increasing lawyers’ technical knowledge, helping engineers understand the “why” behind privacy requirements, and ensuring that everyone considers user experience will lead to better products from a consumer perspective. Consumer trust can be a market differentiator, so that is clearly one good reason to invest in privacy engineering. Increasingly, though, it is only one of many.

Today, it matters because laws, regulators, and automation demand it.

Legal requirements

Privacy laws today mandate privacy engineering in practice.

The EU General Data Protection Regulation's Article 25, “data protection by design and by default,” comes close to requiring it by name. It demands that organizations “implement appropriate technical and organisational measures, such as pseudonymisation, which are designed to implement data-protection principles, such as data minimisation, in an effective manner and to integrate the necessary safeguards into the processing in order to meet the requirements of [the GDPR] and protect the rights of data subjects.”

Organizations must implement technical and organizational measures to ensure that “only personal data which are necessary for each specific purpose of the processing are processed. That obligation applies to the amount of personal data collected, the extent of their processing, the period of their storage and their accessibility.” The GDPR is clear that requirements must not only be integrated into policies and processes, but also built into products. When lawyers lack an understanding of new technologies and technical approaches, or when product, design or engineering teams lack an understanding of privacy principles, meeting these requirements is tough.
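To make “technical measures” slightly more concrete, here is a minimal sketch of keyed pseudonymisation combined with data minimisation. The key handling, field names and helper are illustrative assumptions, not a compliance recipe and not drawn from the regulation itself.

```python
# Sketch (illustrative only): replace a direct identifier with a keyed pseudonym
# so records can still be linked internally without storing the raw identifier,
# and keep only the fields needed for the stated purpose (data minimisation).
import hashlib
import hmac
import os

# Assumed to come from a key management system in practice; "dev-only-key" is a placeholder.
PSEUDONYM_KEY = os.environ.get("PSEUDONYM_KEY", "dev-only-key").encode("utf-8")

def pseudonymize(identifier: str) -> str:
    """Return a stable, keyed pseudonym for a direct identifier such as an email address."""
    return hmac.new(PSEUDONYM_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

raw_event = {"user_email": "alice@example.com", "page": "/pricing", "browser_fingerprint": "..."}
stored_event = {
    "user_pseudonym": pseudonymize(raw_event["user_email"]),  # linkable, not directly identifying
    "page": raw_event["page"],                                 # needed for the analytics purpose
    # browser_fingerprint is deliberately dropped: not necessary for the purpose.
}
print(stored_event)
```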

The California Consumer Privacy Act calls for privacy engineering less directly. It requires privacy professionals to develop a deeper understanding of how vendors analyze, use and share the personal data businesses provide to them. This means understanding technologies and business practices, particularly those related to data analytics and targeted advertising. That technical understanding will inform the category into which vendors are placed — service provider or third party — and whether “do not sell” requirements apply to them. It will help organizations determine whether contract updates are needed and the extent to which doing business with certain vendors increases privacy risk. The CCPA also places deidentified and aggregate data outside the scope of the law (as does the GDPR). Understanding how and when data is deidentified will increase the options available to businesses seeking to use data in privacy protective and legally compliant ways.

Many other laws and legislative proposals are approaching these issues similarly.

Regulatory enforcement

Perhaps more importantly, regulators are increasingly backing up demands for privacy engineering with enforcement actions.  

The U.S. Federal Trade Commission recently demonstrated its willingness to hold companies to account through the largest privacy enforcement action in history — its $5 billion settlement with Facebook. That settlement faulted Facebook for failure to translate public statements and privacy policies into technical realities for individuals, lack of due diligence concerning third-party data handling, inappropriate default privacy settings associated with facial-recognition technology, and deceptive reuse of personal information provided for authentication purposes. 

The FTC and other regulators around the world have invested directly in understanding privacy engineering to inform and assist in such regulatory actions. The U.K. Information Commissioner’s Office recently introduced a regulatory sandbox to help organizations and regulators incorporate privacy protections into innovative new products and services. The Irish Data Protection Commission introduced the Technology Research Unit to maximize “the effectiveness of the DPC’s supervision and enforcement teams in assessing risks relating to the dynamics of complex systems and technology.” The French CNIL employs a director of technology and innovation to oversee the IT experts department, the IT operations department, the innovation and foresight unit, and the CNIL labs. Other regulators have likely done the same.

Automation needs

Lastly, if another reason is needed, consider the ability to make wise investments in automation.

Privacy professionals have suggested that 2019 is the year for automation. In 2017 and 2018, countless businesses significantly revamped their privacy programs to comply with the GDPR. Now, many are turning to the CCPA.

Practitioners have found that the manual processes underpinning those programs are wholly insufficient. Automation has become a must. Perhaps that is why the growth of privacy tech vendors has been so dramatic. The IAPP’s just-released 2019 Tech Vendor report surveys more than 200 companies. These vendors offer services to automate or assist with access requests, activity monitoring, data mapping, consent management, data discovery, website scanning, de-identification, and incident response. Any of those services might or might not meet an organization’s needs. Understanding which will help and which could create new complications requires at least basic familiarity with legal requirements for privacy, business processes and the systems and technologies with which they must integrate.

Privacy engineering knowledge is again key.

How can it be done?

Professionals who want to improve their understanding of privacy engineering have a growing number of options. 

Higher education is a great one, but, due to time constraints, is not accessible to many working professionals. Those with a year or more to devote could study privacy engineering with Lorrie Cranor, CIPT, at Carnegie Mellon or law and computer science with Woody Hartzog at Northeastern. Others could chart their own path, building a hybrid curriculum that integrates courses in business, law, design, and computer science — a program that doesn’t exist today, but hopefully will at some point. An increase in graduates who understand how to bridge these disciplines is sorely needed.

For those who need to integrate privacy engineering into their knowledge base and organizations more quickly, three new options are worth exploring.

First, the National Institute of Standards and Technology will release its Privacy Framework by year’s end. The framework, modeled on NIST’s Cybersecurity Framework, lays out a set of privacy controls to help organizations identify, internalize and address privacy risk. Some controls are more technical and others less so. Overall, the framework is designed to provide legal, technical, design and product teams a common rubric to pursue privacy engineering. NIST sought the IAPP’s feedback on whether the current privacy workforce is equipped to implement the forthcoming framework.

In response, the IAPP mapped the Privacy Framework’s Core to our Certified Information Privacy Management body of knowledge and found that they align closely, suggesting that a growing number of privacy professionals are already approaching privacy management as NIST envisions. It seems noteworthy that the first overarching U.S. privacy framework is likely to be one focused on privacy engineering rather than legislative principles.

Second, last week, the IAPP announced a significant update to its Certified Information Privacy Technologist certification. IAPP Certification Director Douglas Forman said that the updates are based on the recognition that privacy tech and engineering are "taking more of a central place with how organizations meet regulatory obligations." Half of all topic content will be new and designed to better reflect evolutions in the field.

Third, just this week, the International Organization for Standardization released what it characterized as the “world’s first International Standard to help organizations manage privacy information and meet regulatory requirements.” ISO/IEC 27701, Security techniques — Extension to ISO/IEC 27001 and ISO/IEC 27002 for privacy information management — Requirements and guidelines, outlines requirements for a privacy-specific information security management system. In announcing the standard, Clare Naden at ISO touted the standard’s benefits in helping organizations meet regulatory requirements across jurisdictions, including those stemming from the GDPR. ISO is also working, through ISO/PC 317, to develop a global consumer privacy standard that can be embedded into the design of products and services. The IAPP recently joined the Technical Advisory Group and will continue to follow this work closely.

What’s next?

Privacy engineering, like the privacy profession writ large, is a constantly evolving discipline. While I hope my take on the what, why and how is useful to those just beginning to explore it today, it will likely be different tomorrow. I have greatly appreciated the insights shared with me by privacy engineering practitioners. Continued engagement by privacy pros will be critical to help us and our members understand the issues and initiatives on which to focus going forward.

Photo by Bill Oxford on Unsplash

2019-08-08 10:05:10
IAPP announces revamped CIPT certification https://iapp.org/news/a/iapp-announces-revamped-cipt-certification https://iapp.org/news/a/iapp-announces-revamped-cipt-certification The IAPP announced Tuesday that it is updating its Certified Information Privacy Technologist certification in the coming months. The facelift to the CIPT program will include changes to both the exam and training products for the certification.

IAPP Certification Director Douglas Forman said that the updates are based on the recognition that privacy tech and engineering are "taking more of a central place with how organizations meet regulatory obligations." 

To bring the program up to speed with the current landscape of privacy technology, the CIPT "Body of Knowledge" and "Exam Blueprint" will have their content and sections modernized. Thirteen of the 26 topics featured in the BoK and blueprint are new, including "Privacy Engineering" and "Privacy by Design Methodology."

Despite the changes, current CIPT holders will not need to re-test in order to maintain their certification.

Forman knows the wholesale content changes will be a challenge for those attempting to obtain the CIPT certification, but he believes those seeking it will benefit in the end. "We want this to have value and really get into the 'how' of privacy tech," Forman said, noting that leaders in the privacy tech field approved of the changes. "We didn't want to back off the difficulty of something that is so important."

Forman added that the updates reflect a broader group of professionals seeking to become CIPT holders. "While many of the current CIPT holders are working in privacy most of the time, further analysis has shown that there are numerous infosecurity and IT professionals who find themselves working on privacy requirements part of the time," Forman said. "We feel these professionals will find tremendous value in the updated CIPT."

The updated materials for CIPT training will start rolling out in November with new live and online training programs. The new CIPT textbook will be published in January 2020.

Changes to the CIPT exam won't come until 2020, and the current version of the CIPT exam will run until Dec. 31, 2019. The exam will be blacked out for the first month of 2020 before beta testing begins in February, and the official exam will be made available in March.

 

2019-07-30 12:10:05
Thinking through ACL-aware data processing https://iapp.org/news/a/acl-aware-data-processing https://iapp.org/news/a/acl-aware-data-processing Large cloud computing services are generally run for multiple users. In a few cases, all the data processed by that service is public. In virtually all cases, users have an expectation that some of the information about them is kept private. Even if the data store itself is public, logs about access to that data are generally not. Keeping each person’s information separate is most simple in the primary data stores, where each object can easily have its own access control list.

Once we step into the computational pipelines, however, it’s a whole new world.

For all but the smallest cloud services, running a separate processing pipeline for every single user or customer is not feasible. It would be an awful lot of pipelines with an awful lot of overhead. Instead, a cloud service will have combined pipelines processing data from multiple sources and producing results for each user/customer.

In order to do this safely, I recommend the "access control list-aware" data processing model.

In ACL-aware data processing, when we compute a result for Alice, we use only data that Alice would be allowed to see, such as:

  • Her own data.
  • Data that has been made available to her (shared from other users, or, in an enterprise setting, provided by her employer).
  • Data that is public (data that is publicly available on the internet).
  • Data that is anonymous.

Training an anonymous machine-learning model and then querying it with Alice’s data is a common case of ACL-aware data processing, provided that the model is truly anonymous.
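As a minimal, hypothetical sketch of the idea (the record fields and helper names below are invented for illustration, not taken from any particular system), an ACL-aware pipeline stage filters its combined input down to what the target user may see before computing anything for that user:

```python
# Hypothetical sketch: an ACL-aware pipeline stage drops every record the target
# user is not allowed to see before any computation happens for that user.
from dataclasses import dataclass, field

@dataclass
class Record:
    owner: str                      # user who created the data
    visibility: str                 # "private", "shared", "public" or "anonymous"
    shared_with: set = field(default_factory=set)
    payload: dict = field(default_factory=dict)

def visible_to(record: Record, user: str) -> bool:
    """Would `user` be allowed to see this record directly?"""
    return (
        record.owner == user
        or user in record.shared_with
        or record.visibility in ("public", "anonymous")
    )

def acl_aware_inputs(records: list, user: str) -> list:
    """Filter the combined pipeline input down to what `user` may see."""
    return [r for r in records if visible_to(r, user)]

# Friend suggestions for Alice would then be computed only from
# acl_aware_inputs(all_records, "alice"), never from other users' private signals
# such as the IP addresses they used to access the service.
```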

Consider friend suggestions on a social network as an example of where ACL-aware data processing is useful. Alice has put some information into the system, perhaps her contacts, perhaps her activity, such as interacting with certain people on the system. Other users have also put in some data. Some of it can be seen by Alice, such as posts shared with her. Other information is public; names and profile pictures, for example, are public on many social networks. Yet other information might be anonymous, like an anonymous machine-learning model that suggests that if one user has interacted with another user four times, they will tend to accept the friend suggestion. If the friend suggestion pipeline is ACL-aware, then only that sort of data can be used to make the friend suggestions.

Other users have also put in information that should not be seen by Alice, such as which IP addresses they have used to access the service or their contact information. A friend suggestion pipeline that uses private data like this is likely to leak data between users, sometimes only in subtle ways that may be difficult to test. Difficult-to-test doesn’t necessarily mean impossible-to-attack, sadly. An attacker may create new accounts with carefully crafted contents in order to exploit uncommon corner cases.

As an example, let us take users Eve and Alice. They have both interacted with the service using particular IP addresses. Let us assume that access logs are private, as they generally are, and neither should be able to see the other’s IP address. Both hand-coded friend-suggestion signals and machine-learning models that utilize IP addresses are likely to believe that Eve is more likely to accept Alice as a suggested friend if they log in from the same network subnet, as real-life friends are reasonably likely to share a physical location from time to time.

Thus, if Eve logs in from the same network subnet as Alice, she is likely to see Alice appear as a suggested friend or rank more highly in a list of suggested friends. Eve can wander around, log in from different IPs, and watch what happens to her friend suggestion list to try to learn more about Alice’s IP address use. The same might be true of Alice starting to write private posts about puppies or politics; Alice having certain people in her contacts; or, basically any form of private information used in these recommendation algorithms.

The failure modes of not using ACL-aware data processing are subtle and usually easier to exploit than avoid. Sticking to ACL-aware data processing makes your processing pipelines more robust, especially if you add hardening technology to them, like labeling the type of data and adding checkers to avoid mixing different users’ private data.

Photo by Joshua Sortino on Unsplash

2019-07-17 12:49:00
NIST Privacy Framework nearing completion https://iapp.org/news/a/nist-privacy-framework-nearing-completion https://iapp.org/news/a/nist-privacy-framework-nearing-completion A new U.S. privacy framework is quickly approaching completion. The National Institute of Standards and Technology, which holds the drafting pen, is encouraging stakeholders to share their feedback soon.

Since last October, NIST has been working to develop its Privacy Framework to help organizations identify, internalize and address privacy risk. The framework presents the building blocks of a comprehensive data management program that can be implemented across an organization. NIST aims to bridge the legal/IT divide.

They seem to be succeeding.

On July 8 and 9, NIST presented its latest draft of the framework during a workshop in Boise, Idaho, and solicited feedback from participants. Throughout the two-day event, both technically and legally oriented privacy professionals engaged in extensive discussion. The framework has evolved considerably during the now-10-month-old consultation process, demonstrating that NIST is serious about taking stakeholder input on board.

This consultation process, though, will soon end. For those who have not yet focused on this important work, it’s time to key in and share your thoughts. Here’s an overview of where things stand and the timeline moving forward to help.

NIST has asked for any final comments on the latest version by July 18. Comments can be emailed to privacyframework@nist.gov. By summer’s end, NIST plans to release a “preliminary draft” and conduct a formal public comment process. At that stage, all comments received will be posted publicly on the agency’s website. NIST’s Privacy Framework 1.0 is slated to be released by the end of the year.

Structure

As discussed in an earlier article, the framework is broken into several pieces, including the Core, Use Case Profiles, Implementation Tiers, Informative References, a Roadmap, and a Glossary, among other explanatory material. In the latest version, NIST fleshed out four of these components: the Core, Use Case Profiles, the Roadmap and Glossary. They are now nearly complete. In Boise, workshop participants characterized their suggestions as “refinements” noting that there was “not a need for wholesale changes.”

The Core

The heart of the framework is the Core. The Core presents the “Functions, Categories, and Subcategories that describe specific privacy activities that can support managing privacy risks when systems, products, and services are processing data.” In short, it offers the elements of a privacy management program. The framework is not prescriptive. Rather than presenting requirements, it offers technical, legal and structural privacy controls for consideration based on the context in which the organization is operating. These controls will be grouped into either four or six “functions,” depending on the feedback NIST receives: identify, govern, control and communicate, potentially joined by protect and respond, both of which focus more on security considerations.

One of the remaining decisions NIST faces is whether the framework offers a “separated Core” or an “integrated Core.”

Stakeholders debated the options during the event in Boise. Participants in earlier workshops expressed concern that the Core’s inclusion of some security controls from NIST’s companion Cybersecurity Framework would cause confusion among organizations that sought to implement the Privacy Framework after deploying the CSF. Others felt that the inclusion of security elements was crucial to any comprehensive privacy program. While the two frameworks are meant to be closely connected and implemented together, both are also designed to stand on their own.

To address the feedback received, NIST presented two options: a separated core excluding all security elements and an integrated core including baseline security controls. In Boise, workshop participants were divided on the best approach, with preferences tipped slightly toward the integrated version. Proponents of the integrated core argued that it was better suited to enhance collaboration across an organization, bringing the security team together with the legal team to discuss shared goals. Those in favor of the separated core felt it better delineated roles within an organization and made clear that the Privacy Framework should be implemented together with the CSF. Some suggested NIST offer both in the final version and let users decide.

The latest version of the framework introduced another structural change to the Core — the addition of “Govern” as a function. The Govern function groups controls related to legal requirements and organizational privacy policies, which had previously been spread among other functions. NIST made this change in response to feedback from the legal community that earlier drafts did not resonate and that governance elements should be elevated. Stakeholders in Boise welcomed the change.

Use Case Profiles

Use Case Profiles, introduced in the latest version, were similarly well received. NIST presented several examples of how organizations of various sizes and types might use the framework. The profiles clarified that an organization need not adopt all controls presented, but rather only those relevant to its business context. The examples also demonstrated how the functions could be divided between different teams within an organization. This highlighted the broad array of organizational stakeholders that could benefit from and use the framework.

The Glossary

The Glossary generated relatively limited debate, with one exception. Participants suggested that defining “data” was too tall an order. They also noted that new terms should generally be avoided.

The Roadmap

The new Roadmap presents a list of future projects NIST could undertake in collaboration with other partners to support continued development and implementation of the framework. These include mechanisms to provide confidence, such as conformity assessment activities, research into emerging technologies, uniform concepts of privacy risk factors, tools for managing re-identification risk, technical standards, and further development of a skilled privacy workforce.

The Privacy Workforce

NIST expressed particular interest in understanding how the privacy workforce might use this new framework and whether organizations have the knowledge and skills necessary to deploy it.

It invited the IAPP along with industry and civil society representatives to present on the topic in Boise. In preparation, an IAPP team assessed how current privacy certifications align with the NIST Framework. We found that the Certified Information Privacy Manager body of knowledge aligns closely with the Integrated Core and that the Certified Information Privacy Technologist covers the more detailed how-to knowledge relevant to the framework’s technical elements.

We presented our findings to stakeholders in Boise and shared a chart comparing the Integrated Core with the CIPM body of knowledge. Our findings suggest that a growing number of privacy professionals are approaching privacy management as NIST envisions and that the forthcoming Privacy Framework could integrate well with organizations’ current privacy programs.

During the workshop, NIST began exploring whether to map the Core to privacy workforce roles, much like the cybersecurity roles outlined in the NICE Framework. This will likely be a key consideration in the months ahead.

A US privacy framework is on the horizon

Once complete, the Privacy Framework will become immediately available for organizations’ use. Like NIST’s Cybersecurity Framework, it will be voluntary, so no Congressional action is needed to make it a reality.

But, the framework’s voluntary nature is no reason to discount its potential impact. Its companion Cybersecurity Framework has become a de facto standard in many circles, as government and private-sector clients increasingly insist that it be implemented by partners and service providers. Participants in NIST’s workshop series expressed hope that the Privacy Framework might enjoy similar uptake, not only in the U.S., but internationally, as well.

Photo by Steve Johnson on Unsplash

2019-07-12 10:39:35
Deidentification versus anonymization https://iapp.org/news/a/de-identification-vs-anonymization https://iapp.org/news/a/de-identification-vs-anonymization Anonymization is hard.

Just as with cryptography, most people are not qualified to build their own. Unlike cryptography, the research is at a far earlier stage, and pre-built code is virtually unavailable. That hasn’t stopped people from claiming certain datasets are anonymized and (sadly) having them re-identified. Those datasets are generally deidentified rather than anonymized — the names and obvious identifiers are stripped out, but the rest of the data is left untouched. Deidentification doesn’t tend to successfully anonymize data because there are so many sources of data in the world that still have identifying information in them; figure out where some identified dataset and the deidentified data align, and you’ve re-identified the dataset. If the dataset had been anonymized, it would have been transformed such that re-identification was impossible, no matter what other information the attacker has to hand.
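That alignment step is essentially a join on quasi-identifiers. Here is a toy illustration with made-up data (the field names and records are invented):

```python
# Toy illustration (made-up data): re-identifying a "deidentified" dataset by
# joining it with a public, identified dataset on shared quasi-identifiers.
deidentified_health = [
    {"zip": "03431", "birthdate": "1962-07-31", "diagnosis": "hypertension"},
]
public_voter_roll = [
    {"name": "Jane Doe", "zip": "03431", "birthdate": "1962-07-31"},
]

quasi_identifiers = ("zip", "birthdate")

reidentified = [
    {**health, "name": voter["name"]}
    for health in deidentified_health
    for voter in public_voter_roll
    if all(health[k] == voter[k] for k in quasi_identifiers)
]
print(reidentified)   # Jane Doe is now linked to the hypertension record
```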

But fear not! Good anonymization does exist!

This is an important point to recognize because anonymization is a deeply important technique in making the world work with greater privacy. Differential privacy holds great promise and is a critical building block behind techniques like federated learning. Note that differential privacy is not a panacea; there are complex discussions lurking there, including appropriate “privacy budget.”

There are simpler techniques that are also useful. For example, if I run a popular web service, I need to have enough server power to run that service. My users would strongly prefer that I predict how much I’ll need and provision that amount in advance rather than periodically running out. In order to do that prediction, I need to know how much traffic I had at points in the past and how much I have in the present. Storing that I had approximately 55 million pages loaded on a particular day with a peak of 5,000 requests per second gives me enough information to project load. The crowds are large, the rounding substantial, and — for some traffic measurement techniques — the measurement may not be precise. We have removed the user data and rendered the numbers anonymous.

Deidentification is not anonymization (in virtually all cases), but it’s still useful as a data minimization technique. Anonymization is not always an option: If I buy software from an app store, I would be exceedingly displeased if the app store anonymized those records so I couldn’t run the software any more! Anonymized pictures of my kids would defeat the point. But deidentification is practical for certain types of processing. When training a machine-learning model to recommend new apps, the training data doesn’t need to include who I am, just what I have.

There’s a counterintuitive pitfall to avoid in deidentification: Overdoing it can cause other privacy problems. When someone asks you to delete their data, you need to delete it completely, not just from the primary data store but from all these secondary data stores, like caches, analytics datasets, and ML training datasets. If you’ve deidentified that data in a meaningful way, then it’s awfully hard to figure out what part of the dataset to delete. Because the data is not anonymized, you are still responsible for doing this deletion. The most effective way to do this is to delete the entire dataset periodically. Some datasets only need to exist for a limited time, like slices of server logs used for analytics. Some datasets can be periodically recreated from the primary data store, like ML training datasets.

In some cases, deidentification is useful in long-term datasets. For example, we live in a world where each user may have multiple mobile devices. In some cases, a server needs to keep track of where (which user, which device) certain pieces of data came from, for example, so that data can be automatically deleted when that particular device is no longer in use. That means the server needs some kind of device identifier. Looking more closely, however, the server only needs a few bits of information to differentiate between the relatively small number of devices owned by one user. Thus, the server doesn’t need the whole device identifier: It can hash the device identifier and throw away most of the bits of the resulting hash. All the server really needs, in this case, is enough bits of the hash to differentiate between devices from the same user; there’s no need for more.
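A minimal sketch of that trick, assuming a hypothetical string device identifier (the choice of 8 retained bits is arbitrary, for illustration only):

```python
# Sketch: keep only a few low-order bits of a hashed device identifier. That is
# enough to tell one user's handful of devices apart, but far too little to
# identify a device across the whole user population.
import hashlib

def short_device_tag(device_id: str, bits: int = 8) -> int:
    """Hash the device identifier and keep only `bits` low-order bits of the digest."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)   # 8 bits: 256 possible tags

# The server stores short_device_tag(device_id) instead of the identifier itself;
# collisions between different users' devices are harmless, since the tag is only
# ever compared within a single user's account.
print(short_device_tag("example-device-id-1234"))
```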

Photo by Llanydd Lloyd on Unsplash

2019-06-18 11:58:07
Encryption, redaction and the CCPA https://iapp.org/news/a/encryption-redaction-and-the-ccpa https://iapp.org/news/a/encryption-redaction-and-the-ccpa There appears to be consistent confusion with regard to the California Consumer Privacy Act and its incentives to encrypt and redact personal information wherever possible.

Specifically, the CCPA encourages security through two means. First, a breach of non-encrypted and non-redacted personal information can result in statutory damages of up to $750 per consumer. Data that is encrypted and redacted may potentially avoid such penalties in the case of a breach.

Second, deidentified or aggregate data is not subject to the CCPA’s rigorous obligations. Broadly, deidentified data refers to information in which individual identities have been removed and that is reasonably unlikely to be linked to any consumer or household.

While encryption and redaction certainly sound similar, the strategy and implementation for each are quite different. In sum, when you confuse encryption for the purposes of redaction, you can create more problems. This short post is meant to clarify what each does and how to think about both encryption and redaction in the context of the CCPA.

Understanding encryption

Encryption is a security strategy. It protects your organization in scenarios like a devastating breach: If an adversary were to gain access to your servers, the stored data would be of no use to them unless they also had the encryption key. It’s an all-or-nothing security posture: You either get to see the data unencrypted, or you don’t.

Decryption is typically handled at what’s called the “storage layer” — where the operating system or similar decrypts information as it reads it from disk. Data is stored encrypted and then decrypted as it is read so that data can be used for computation.
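As a toy illustration of that all-or-nothing property, here is a sketch of symmetric encryption at the application level using the third-party Python cryptography package; in practice, the storage layer (disk or database engine) usually does this transparently, so treat this as illustrative only.

```python
# Toy sketch of the all-or-nothing property of encryption: without the key,
# the stored bytes are useless; with it, everything is visible again.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, held in a key management system
f = Fernet(key)

stored_on_disk = f.encrypt(b'{"name": "Jane Doe", "ssn": "123-45-6789"}')
print(stored_on_disk)                # ciphertext: of no use to an attacker without the key
print(f.decrypt(stored_on_disk))     # the original record, once the key is available
```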

As you can imagine, and as the CCPA appropriately incentivizes, organizations should always encrypt their data on disk to protect against breaches such as these.

Understanding redaction

While encryption on disk is critical to avoid disclosure through auxiliary methods of accessing the data, it does not mitigate many other forms of attack that can undermine privacy. This is where redaction comes into play, which can also be termed “deidentification” or “anonymization.” Redaction is ultimately a privacy strategy.

So how does it work? Redaction hides specific values within data using a variety of techniques — such as by “masking” those values, for example, through hashing (where, instead of the raw value, a user sees a scrambled string of characters) or by rounding (where numbers might be rounded to the nearest tenth, for example). Redaction is less black and white than encryption: Redacted data can still be accessed and used, with different levels of utility depending on how much the data has been manipulated to protect privacy.

With redaction, you can build policies on what data is shown to whom at query time. You might use redaction to build rules such as “mask Social Security number unless the user running the query is in the group HR.” This is a strategy not simply to protect against breaches but also to maintain consumer privacy.
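A minimal, hypothetical sketch of such a query-time rule (the group names, fields and masking choices below are invented for illustration, not a description of any particular product):

```python
# Hypothetical sketch of query-time redaction: the same stored row comes back
# differently depending on who is asking.
import hashlib

def mask_hash(value: str) -> str:
    """Irreversible mask: show a short hash instead of the raw value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

def redact_row(row: dict, user_groups: set) -> dict:
    """Apply the rule 'mask Social Security number unless the user running the query is in HR'."""
    redacted = dict(row)
    if "hr" not in user_groups:
        redacted["ssn"] = mask_hash(row["ssn"])
        redacted["salary"] = round(row["salary"], -4)   # rounding as another masking technique
    return redacted

row = {"name": "J. Example", "ssn": "123-45-6789", "salary": 123456}
print(redact_row(row, user_groups={"analytics"}))   # masked SSN, rounded salary
print(redact_row(row, user_groups={"hr"}))          # raw values
```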

Embracing encryption and redaction

The difference seems pretty clear, right? The confusion arises over when and how far to enforce encryption, because some users will, at some point, need to access the raw data — otherwise, why store it? Some would argue that raw data needs to be encrypted in the database rather than at the storage level, which is misconstrued as a masking technique. If the values are stored encrypted, one needs to access the database through an alternate interface that understands how to deal with the encryption. At that point, you can think of the system boundary for the database as the database plus this interface; this combined system has all the same semantics as the database, so it inherits all the same problems and solves nothing.

The way to balance all these needs is, in my view, to dynamically enforce deidentification at query time, combining the security strengths of encryption with the privacy protections of redaction.

This is a powerful and critical approach to data protection for a variety of reasons. To start with, you can enforce highly complex controls across users, and those controls can be different based on what the user is doing or what their attributes are. If you simply encrypt, the user either has access to the data or they don’t. There’s no way to add the needed granularity, nor can you change policies down the road easily when the rules on your data inevitably change.

Query-time deidentification enables organizations to think about risk in the context of how the data is actually used and accessed. If a certain way of accessing the data is too risky, you can minimize the number of users that have access to that view and ensure that users accessing the data understand the risks in doing so.

Second, with query-time deidentification, you can mask data using techniques that are impossible or computationally infeasible to reverse. This avoids worrying about protecting decryption keys, as the users will get masked values and have no way to reverse that mask (think of techniques like hashing or, even more simply, making the value “null”).

Third, you can maintain visibility into the policies that are being enforced at a highly granular level, knowing what policies were applied to which users and when. A central component of the CCPA is transparency — mandating that organizations provide consumers with a clear understanding of what data is held about them and how it’s being used. In order to provide consumers with this information, organizations must have a uniform logging system to understand their own data environment.

In sum, organizations should encrypt their data on disk as a required security measure. But they must not stop there. In fact, the CCPA is clear that they should go further. Without relying solely on encryption, organizations can improve security, privacy and business insights. Query-time deidentification is a key tool to meet the CCPA’s redaction requirements.

Steve Touw is co-founder and chief technology officer at Immuta.

2019-05-28 11:51:38
Industry groups build awareness of Domain Name System expansion https://iapp.org/news/a/industry-groups-build-awareness-of-domain-name-system-expansion https://iapp.org/news/a/industry-groups-build-awareness-of-domain-name-system-expansion The Universal Acceptance Steering Group, which includes Apple, GoDaddy, Google, the Internet Corporation for Assigned Names and Numbers, Microsoft and Verisign as stakeholders, has launched an awareness campaign to inform organizations about the expansion of the Domain Name System into what it calls Universal Acceptance. The UASG, established in 2015, characterizes UA as “the simple concept that all domain names and all email addresses work across all applications.” It is a prerequisite to a multilingual internet. This means that applications recognize and work with both common domains (.com, .org) and newer ones using non–American Standard Code for Information Interchange scripts, such as Arabic or Cyrillic, as well as email addresses using non-ASCII characters. According to the group, this has data protection implications because ensuring an application is built for UA can help strengthen passwords, ensure accurate validation, and enable individual participation for those using such non-ASCII characters. 
Full Story

2019-05-10 12:03:02
NIST Privacy Framework recognizes critical need for workforce development https://iapp.org/news/a/nist-privacy-framework-recognizes-critical-need-for-workforce-development https://iapp.org/news/a/nist-privacy-framework-recognizes-critical-need-for-workforce-development On Feb. 27, the National Institute of Standards and Technology released an outline of its forthcoming Privacy Framework, providing the first real glimpse of what the framework might include and calling attention to the need for privacy workforce development. NIST Senior Privacy Policy Advisor Naomi Lefkovitz and NIST Senior IT Policy Advisor Adam Sedgewick presented that outline during a public webinar March 14. Nearly 400 stakeholders participated.

In the webinar, Lefkovitz and Sedgewick focused on NIST’s interest in an open, transparent and collaborative framework development process. They emphasized the importance of privacy professionals’ engagement in NIST’s process, helping to answer the questions: “Is this going to be a good communication tool? Is it going to be accessible to non-privacy professionals, as well as to privacy professionals? Will [this] help privacy professionals talk with business lines or senior executives?”

One opportunity for such engagement will be during NIST’s session at the IAPP Summit May 3 and another during NIST’s next Privacy Framework workshop May 14–15 at Georgia Tech.

NIST’s planned Privacy Framework similar to Cybersecurity Framework

The Privacy Framework outline NIST presented is structurally similar to its 2014 Cybersecurity Framework. It is organized around the functions an organization must undertake to manage privacy risk, the profile of the organization using it, and a tiered implementation structure. The outline suggests that the framework will be designed to meet an organization where it is and help it improve its privacy protections, based on its business requirements, risk tolerance, privacy objectives and resources.

The five functions or activities identified in this initial outline are: identify, protect, control, inform and respond. In the forthcoming discussion draft, which NIST plans to release prior to the May workshop, each function will be divided into categories linked to programmatic needs and specific privacy outcomes.

Lefkovitz offered an example to illustrate the framework’s outcomes-based approach. She said that one outcome tied to control might be the ability to delete data. That will not be prescribed but is a capability that an organization may want, particularly if it is subject to certain laws or regulations. She cited the EU General Data Protection Regulation’s right to be forgotten as one such reason organizations might want such a capability. In using this example, Lefkovitz made clear that the framework will not be designed around any specific law, standard or regulation. “Compatibility doesn’t necessarily mean mirroring any one law,” Lefkovitz said. Rather, the framework will serve as a tool to help organizations achieve outcomes that could be necessary to comply with laws, address risks or fill business needs.

Risk-based approach with new emphasis on workforce

The framework’s tiered structure will guide organizations through implementation. Organizations will be able to assess themselves against and progress through tiers in four areas of focus, depending on the nature of the risk they face and their desired privacy outcomes.

The first two areas will be familiar to those who have implemented the Cybersecurity Framework. They are Risk Management Process and Risk Management Program, each with a privacy slant.

The third area, titled Ecosystem Relationships, has some similarity to the Cybersecurity Framework’s External Participation component. Rather than focusing on sharing information on risks with external partners, it refers to the organization’s understanding of its role and contribution to privacy risk management in the broader ecosystem. This difference recognizes the increasingly complex and interconnected environment organizations face today and the exponential growth of third parties engaging with data.

The fourth area is more novel, having received only minimal attention in the Cybersecurity Framework: Workforce. NIST added Workforce as a fourth area of focus due to “RFI responses recognizing that privacy workforce development is a critical need.” NIST highlighted the importance of coordination between training certification organizations, academic institutions and organizations implementing the framework. NIST also called on organizations to communicate their privacy risk management needs and desired skill sets.

Opportunity for input

In presenting their outline, Lefkovitz and Sedgewick emphasized that this is only a proposal, which will continue to evolve as the Cybersecurity Framework itself did. Sedgewick indicated that they are looking for consensus. While that is an elusive concept in the field of privacy, so far commenters have largely agreed that this framework must be compatible with state, national and international laws, regulations and standards, as well as NIST’s Cybersecurity Framework. NIST aims to achieve that level of compatibility but stressed that to do so, they need stakeholder input, including on the range of international standards to which they should look.

Lefkovitz and Sedgewick solicited input via email at privacyframework@nist.gov, during the IAPP Summit and NIST’s May workshop. Sedgewick emphasized the “work” in workshop, noting that they hope people will come ready “to roll up their sleeves.” NIST aims to complete the framework by October.

Top image taken by Jedidiah Bracy at NIST's initial 2018 hearing on the planned Privacy Framework in Austin, Texas.  

2019-03-20 10:27:28