In the final part of a three-part series on the people, process and technology impacts of Europe’s forthcoming General Data Protection Regulation (GDPR), Steve Kenny looks at how the new rules will affect technology.
Part Three: Technology
So here’s the bombshell: any data used to make inferences linked tenuously or otherwise to a living person is personal data under GDPR. Cookie IDs, IP addresses, any device identifier? All personal data. Even metadata with no obvious identifier is caught under the GDPR's definition of personal data. Truth be told, such assertions are not entirely new. The difference under GDPR is that they will be enforced.
And this is a big deal. Today swathes of business practices unlocking data monetization rely upon data not being considered personal. So they apply weak consent, onward transfer and data reuse concepts. These models are going to change; either by choice, or by obligation.
GDPR Risks and "Data Science"
The term data science describes a process that runs from data discovery, to providing access to data through technologies such as Apache Hadoop (open source software for large data sets) in the case of Big Data, to distilling the data through architectures such as Spark, in-memory and parallel processing. That data science creates value is understood. What is less well understood are the risks it exposes investors to under the GDPR, of which there are principally three:
Risk 1: The Unknown Elephant in the Room – Unicity: a general misunderstanding in monetization strategies is that stripping away identifiers of a data model renders the data set anonymous. Such a belief is flawed. So-called anonymous data sets can often, without implausible effort, be re-identified. Unicity is a measure of how easy it is to re-identify data. It quantifies additional data needed to re-identify a user. The higher a data set’s unicity, the easier it is to re-identify. Transactional and geo-temporal data yield not only high monetization potential, they carry statistically unique patterns which give rise to high unicity.
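To make unicity concrete, here is a minimal sketch of one way it might be estimated. The article gives no formula, so the approach (the share of users whose trace is the only one containing p randomly drawn points from that trace), the function name `unicity` and the sample data are all assumptions made for illustration.

```python
import random


def unicity(traces, p, seed=0):
    """Estimate unicity: the fraction of users singled out by p known points.

    `traces` maps a user id to a set of (place, hour) points. This layout is
    a hypothetical one chosen for the sketch, not one taken from the article.
    """
    rng = random.Random(seed)
    eligible = 0
    unique = 0
    for user, points in traces.items():
        if len(points) < p:
            continue  # not enough points to draw p of them
        eligible += 1
        # Pretend an outsider learns p points of this user's trace.
        known = set(rng.sample(sorted(points), p))
        # Every user whose trace contains all of the known points is a candidate.
        candidates = [u for u, other in traces.items() if known <= other]
        if candidates == [user]:
            unique += 1
    return unique / eligible if eligible else 0.0


traces = {
    "u1": {("cafe", 9), ("office", 10), ("gym", 18)},
    "u2": {("cafe", 9), ("office", 10), ("cinema", 20)},
    "u3": {("cafe", 9), ("office", 10), ("gym", 18)},
}
for p in (1, 2, 3):
    print(p, unicity(traces, p))
```

A value close to 1 for small p would suggest that stripping names and account numbers does little to prevent re-identification.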
Risk 2: Relevance & Quality: Income, preferences and family circumstances routinely change, and preference data on children is difficult to ethically justify processing. This is a problem for predictive analytics: the data, and the inferences it engenders, can be inaccurate at any given point in time, which creates a GDPR cause of action. Data quality needs to stay aligned to business objectives.
Risk 3: Expecting the Unexpected: When data science creates unexpected inferences about us, it tends to invalidate the consent that allowed data to be captured in the first place, which, again, is a big deal. Data collected today, particularly from mobile devices, is subject to a constant stream of future inferences that neither the customer nor the collector can reasonably comprehend. Consider a car-sharing app that can model propensity for one-night-stands from usage patterns. While that data may not result in propositions today, the market will consider upside risk/option value to have been created (the market still does not seem to believe in GDPR impact), but this incremental data coming into existence creates downside risk (such data is difficult to find a legal-basis for, given the vagaries of a given consented disclosure).
More generally, the problem of negative correlations is brought to the fore by algorithmic flaws, biased data and ill-considered marketing or risk practices, the enduring example being U.S. retailer Target’s predictive campaigns to pregnant teenagers, spotted by parents. These are examples of a new form of systemic control failure, leading to potentially actionable GDPR claims.
The BIG 5 Impacts for Chief Information Officers
There are five key changes flowing from GDPR that affect IT:
1) Shattering the risk/reward trade off: Privacy enhancing technology (PET) and Privacy by Design (PbD) are mandatory under the GDPR. There remains no generally accepted definition of PET or PbD, but PbD is considered an evidencing step for software development processes to take account of privacy requirements. So the incorporation of what can broadly be defined as PET in such solutions represents PbD.
The potential of PET is to enhance relevance, price efficiently and reduce losses with less linkability and identification in an explicit consent and ubiquitous personal data world. PET moves you in a Northeasterly direction on the Privacy Grid through anonymized data analytics (fuzzy matching of cryptographically altered data), false negative biased methods, and self-correcting false positives (rapid repair of incorrect assertions).
Two particular PET techniques that control downside and enable upside risk are differential privacy & homomorphic encryption.
Differential privacy counters re-identification risk and can be applied to anonymous data mining of frequent patterns. The approach obscures data specific to an individual by algorithmically injecting noise. More formally: for a given computational task T and a given value of ϵ, there will be many differentially private algorithms for achieving T in an ϵ-differentially private manner. This enables computable optima of privacy and data utility to be defined by modifying the data (the inputs to query algorithms), the outputs of the queries, or both.
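As an illustration of output perturbation, the sketch below applies the Laplace mechanism to a counting query. It is one of the many possible ϵ-differentially private algorithms, not one the article prescribes; the function names and the sample records are assumptions. A count changes by at most 1 when one person is added or removed, so Laplace noise with scale 1/ϵ is sufficient for this query.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that match `predicate`.

    Smaller epsilon means more noise: stronger privacy, lower utility.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0  # one individual changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical data: how many customers are 50 or older?
records = [{"age": age} for age in range(100)]
print(private_count(records, lambda r: r["age"] >= 50, epsilon=0.5))
```

Each released answer consumes part of the privacy budget, so repeated queries over the same data would need their ϵ values accounted for together.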
Searchable/homomorphic encryption allows encrypted data to be analyzed through information releasing algorithms. Considered implausible only recently, advances in axiomatizing computable definitions of both privacy and utility have enabled companies such as IBM & Fujitsu to commercially pioneer the approach.
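To show what computing on encrypted data means in practice, here is a toy sketch of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The article does not name a scheme, so Paillier is a stand-in chosen for brevity, and it is only partially homomorphic. The primes are deliberately tiny and the code must not be used to protect real data. It relies on `pow(x, -1, n)`, available from Python 3.8.

```python
import math
import random


def keygen(p, q):
    """Build a toy Paillier key from two primes; returns (n, lam, mu)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, lam, mu


def encrypt(n, message, rng):
    """Encrypt an integer 0 <= message < n under public key n."""
    n_squared = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_squared) * pow(r, n, n_squared)) % n_squared


def decrypt(n, lam, mu, ciphertext):
    """Recover the plaintext with the private values lam and mu."""
    n_squared = n * n
    return ((pow(ciphertext, lam, n_squared) - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Combine two ciphertexts into an encryption of the plaintext sum."""
    return (c1 * c2) % (n * n)


rng = random.Random(7)
n, lam, mu = keygen(293, 433)  # toy primes, for illustration only
spend_a = encrypt(n, 1200, rng)
spend_b = encrypt(n, 345, rng)
total = add_encrypted(n, spend_a, spend_b)  # computed without decrypting
print(decrypt(n, lam, mu, total))
```

The party that adds the ciphertexts never sees 1200 or 345; only the holder of `lam` and `mu` can read the total.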
2) Data portability: Empowers customers to port their profiles and segmentation inferences from one service provider to another. This is a reflection by lawmakers that data is relevant to competition law, whilst not conceding an imbalance between a company's ability to benefit from data and the expense borne by us all as citizens.
Scientists working in the Luxembourg Privacy Cluster consider portability in terms of a common data exchange format, enabling consumers to seamlessly transfer their personal data from one provider to another. This is achieved through structured ontologies and schemas for characterizing personal data, as well as the rules and constraints that need to be met. These ontologies are not meant to replace existing data storage schemas, but rather to define a common intermediary to enable data exchange. It is envisioned that different providers will create mappings to common ontologies and schemas which handle end-to-end data exchange from one provider to another.
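The intermediary idea can be sketched in a few lines: each provider maintains a mapping between its own field names and a shared schema, and a transfer passes through that shared form. The field names, the two provider mappings and the helper functions below are hypothetical; a real ontology would also carry types, rules and constraints.

```python
# Hypothetical common schema: the fields every exported profile must carry.
COMMON_FIELDS = {"full_name", "email", "segments"}

# Each provider publishes a mapping from its own field names to the common ones.
PROVIDER_A = {"name": "full_name", "mail": "email", "tags": "segments"}
PROVIDER_B = {
    "customerName": "full_name",
    "emailAddress": "email",
    "segmentList": "segments",
}


def to_common(record, mapping):
    """Translate a provider record into the common exchange format."""
    common = {mapping[key]: value for key, value in record.items() if key in mapping}
    missing = COMMON_FIELDS - set(common)
    if missing:
        raise ValueError(f"record lacks required fields: {sorted(missing)}")
    return common


def from_common(common, mapping):
    """Translate a common-format record into a provider's own field names."""
    inverse = {common_name: own_name for own_name, common_name in mapping.items()}
    return {inverse[key]: value for key, value in common.items()}


def port(record, source_mapping, target_mapping):
    """Move a record between providers via the common intermediary."""
    return from_common(to_common(record, source_mapping), target_mapping)


profile_at_a = {"name": "Ada", "mail": "ada@example.org", "tags": ["frequent-flyer"]}
print(port(profile_at_a, PROVIDER_A, PROVIDER_B))
```

Neither provider needs to know the other's storage schema; each only maintains its own mapping to the shared one.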
3) The Right To Be Forgotten: Provides a rationale for organizations to respond, in up to real time, by removing identifiers from all instances in which they can reasonably permit re-identification, inclusive of third-party assets. A framework to comply with this obligation would include the following steps:
- Spot identifiers which tie together datasets, e.g. machine identifiers that link together our social media experiences;
- Prescribe how re-identifiable data flows in and outside the organization;
- Document a scalable process to overwrite identifiers in all datasets where re-identification can be established, upon the validated request of a user; and
- Adjust third-party contracts and SLAs to ensure compliance with validated requests.
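The overwrite step in the framework above might look like the following minimal sketch. It assumes the earlier steps have already produced an inventory of which fields in which datasets hold identifiers; the dataset layout, the `id_fields` inventory and the function name are hypothetical.

```python
import secrets


def erase_identifier(datasets, id_fields, identifier):
    """Overwrite every occurrence of `identifier` after a validated request.

    `datasets` maps a dataset name to a list of row dicts, and `id_fields`
    maps a dataset name to the fields known to hold identifiers.
    Returns the number of values overwritten per dataset, as an audit record.
    """
    overwritten = {}
    for name, rows in datasets.items():
        count = 0
        for row in rows:
            for field in id_fields.get(name, ()):
                if row.get(field) == identifier:
                    # A fresh random token per occurrence, so the overwritten
                    # rows can no longer be joined across datasets.
                    row[field] = "erased-" + secrets.token_hex(8)
                    count += 1
        overwritten[name] = count
    return overwritten


datasets = {
    "web_logs": [{"device_id": "d-42", "page": "/home"}, {"device_id": "d-7", "page": "/faq"}],
    "crm": [{"customer_device": "d-42", "tier": "gold"}],
}
id_fields = {"web_logs": ["device_id"], "crm": ["customer_device"]}
print(erase_identifier(datasets, id_fields, "d-42"))
```

Overwriting the identifier is necessary but may not be sufficient: if the remaining fields have high unicity, the rows could still be re-identifiable.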
4) Data Bookkeeping: Field level data, linked to an identifier, flows across geographies and legal entities, processed by machines and people. Organizations will account for these flows with evergreen reporting. It stands to reason that these flows will be threat-modeled for integrity and confidentiality so controls can be readily evidenced upon request.
5) Consent: Customers consent to privacy policies that change. Being able to prove, in court or to a regulator, which contract was agreed to requires registration timestamping, and tamper-resistant logs become de rigueur.
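One simple way to make a consent log tamper-evident is a hash chain, in which each entry commits to the policy text agreed to, a timestamp and the hash of the previous entry. The sketch below is an illustration under that assumption; the entry fields and function names are hypothetical, and a production system would additionally need trusted timestamps and protection of the latest hash.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    """Hash an entry body in a canonical JSON form."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_consent(log, customer_id, policy_text, timestamp=None):
    """Record which policy text a customer agreed to, and when."""
    entry = {
        "customer_id": customer_id,
        "policy_sha256": hashlib.sha256(policy_text.encode("utf-8")).hexdigest(),
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry["hash"] = _digest(entry)  # computed before the "hash" key exists
    log.append(entry)
    return entry


def verify(log):
    """Return True only if no entry was altered, removed or reordered."""
    prev = GENESIS
    for entry in log:
        body = {key: value for key, value in entry.items() if key != "hash"}
        if entry["prev_hash"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


log = []
append_consent(log, "c-1", "Privacy policy v1 text")
append_consent(log, "c-1", "Privacy policy v2 text")
print(verify(log))
```

Because each entry stores a digest of the policy text, the organization can later show which version of the policy a given consent refers to, provided it has retained that text.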
As we move into an opt-in world of explicit consent and ubiquitous personal data, data transmissions beyond a website visit must be explicitly permissioned and controlled. In this world, default browser values de-link machine identifiers from search queries. In other words, in this new world, online advertising to EU citizens is in line for fundamental change.
And given particular regulatory emphasis on profiling, explicit consent will require loyalty programs to differentiate between consent to general marketing and consent to personalized marketing. Those consent flags must cascade through registration, reporting and analysis, targeting and profiling, contact center operations and all other processes that handle such data.
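As a rough sketch of what cascading could mean in code, every downstream process would ask for its audience by purpose rather than reading the customer table directly. The two flag names and the `audience` helper are assumptions for illustration; defaulting both flags to `False` reflects the opt-in world described above.

```python
from dataclasses import dataclass

PURPOSES = ("general_marketing", "personalized_marketing")


@dataclass(frozen=True)
class ConsentFlags:
    """Per-customer consent, opt-in by default (nothing is permitted)."""
    general_marketing: bool = False
    personalized_marketing: bool = False


def audience(consents, purpose):
    """Return the customer ids that have consented to the given purpose."""
    if purpose not in PURPOSES:
        raise ValueError(f"unknown purpose: {purpose}")
    return sorted(cid for cid, flags in consents.items() if getattr(flags, purpose))


consents = {
    "c-1": ConsentFlags(general_marketing=True, personalized_marketing=True),
    "c-2": ConsentFlags(general_marketing=True),
    "c-3": ConsentFlags(),
}
print(audience(consents, "general_marketing"))
print(audience(consents, "personalized_marketing"))
```

Targeting, profiling and contact center tooling would each call the same check, so a withdrawn consent takes effect everywhere at once.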
Planning Change
CIOs need to consider how the GDPR is prioritized amongst their operational excellence, growth, digital, data and cyber security transformation objectives. With GDPR soon to be, but not yet, ratified, a subsequent two-year implementation window requires a credible planning strategy now.
This is accomplished through a blended approach. First, scenario planning to cope with unstable Big 5 requirements within multi-year roadmaps. Second, tactical quick wins: a large body of requirements are not changing, which means under the GDPR, existing weaknesses become intolerable. Here there are five tactical areas typically to consider:
Encrypt more: Today encryption still is largely applied to PCI-DSS mandated data such as credit cards. Given constant data breaches, the advent of crippling penalties will stimulate larger scale efforts to embed cryptosystems.
Delete more: When key performance indicators incentivize the collection and retention of data that has more downside than upside, business rules should be reconsidered. Deleting data on inactive customers at the point where re-activation is unlikely focuses attention on valuable customers and reduces exposure risk. Data that remains post deletion should have very low levels of unicity in order to power monetization strategies, given appropriate safeguards.
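A deletion rule of this kind can be very small. The sketch below assumes that "re-activation is unlikely" is approximated by a fixed inactivity threshold, which is a simplification: the article does not say how that point is determined, and the function name and sample dates are hypothetical.

```python
from datetime import date, timedelta


def select_for_deletion(last_active, today, max_inactive_days):
    """Return ids of customers inactive for longer than the threshold.

    `last_active` maps a customer id to the date of their last activity.
    """
    cutoff = today - timedelta(days=max_inactive_days)
    return sorted(cid for cid, seen in last_active.items() if seen < cutoff)


last_active = {
    "c-1": date(2014, 6, 1),
    "c-2": date(2015, 11, 1),
    "c-3": date(2015, 1, 1),
}
print(select_for_deletion(last_active, today=date(2016, 1, 1), max_inactive_days=365))
```

Running such a rule on a schedule, and logging what it removed, gives the retention policy an evidenced control.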
Shore up IT governance: Health checks on management information from service provider service level agreements, production system change controls, embedded PIAs, access controls and control procedures should yield improvement opportunities. Investment in electronic records management systems will likely be prioritized due to incremental litigation risk, so understanding the implication of these programs now will be useful.
Reflect upon skills: The application of PET skills to business problems will be held in the highest regard by organizations. Where a company's data posture is tepid, PET tends to be more of a cost, and external service providers may be best placed to add expertise. Where there is a burning data ambition, PET is more of a capability, and hiring objectives should be considered in terms of a concerted effort to propel the organization in a Northeasterly direction along the Privacy Grid, transforming software engineering capabilities across delivery managers, programmers, scrum masters and data scientists themselves.
Keep your head in the clouds: a cloud service provider (CSP) who can attest GDPR compliance for a given service presents its client with a remarkably efficient way to hedge GDPR liability and outsource compliance cost. Demand for such services will be price-insensitive. Major concerns such as multi-tenant data exfiltration attacks and limiting data propagation based on data provenance are becoming resolvable, further stimulating demand. There are signs in the CSP sector of understanding the opportunity, with Microsoft Azure becoming the first CSP to adopt ISO 27018--the most advanced set of security/privacy controls in the cloud, though a long way short of GDPR. So consider how the market can provide (turnkey) GDPR solutions. It may just be your best bet.
This series on GDPR impacts started out by defining a decision framework from which different risk drivers could be reconciled. Reconsidered risk management philosophy sets the tone from the top. What followed in parts two and three considered how to implement this in an organizational construct and in technology. Both are best considered in light of an organization’s contribution to a broad constituency of interests. We are still at the start of a data revolution; let’s get it right together.