
Privacy Perspectives | One Way To Bridge the Legal-Engineering Gap? Define Personal Information




We have had a number of very thoughtful articles recently that highlight language issues within the privacy sphere as barriers to progress.

Ian Oliver wrote about the problems software engineers face when trying to code the law into information systems. This struck a chord with me as a major part of my job is to translate legal requirements into software specifications. It also reminded me of Brooks Dobbs’ earlier article about the struggles privacy professionals sometimes have defining what is or is not personal information under regulation.

These issues highlight differences in thinking that originate from the language used in the major legal codes dominating the privacy debate—U.S., EU and Canadian laws. This in turn is a constant reminder that while many people see the importance of a multidisciplinary approach to privacy, it is difficult to think of a true “privacy profession” when it is, albeit with very good historical reasons, dominated by language and thinking that is perhaps legal first, privacy second. In fact, a profession only achieves its own identity when it has a language that is all its own—with defined terminology that is distinct from everyday language or that of other professions.

With all these things in mind, we can take a great leap forward if we can find a way of describing what lies at the heart of privacy—the data itself—in a way that both lawyers and engineers will find useful, and that might be more jurisdiction-independent whilst still being interpretable within different legal traditions.

So here goes ...

Personal Is Contextual

Everyone agrees that context is vital in determining whether data is personal or not. It is context that turns an IP address from a string of numbers into a piece of information that can identify an individual. It is context that turns the geolocation history of a mobile device into a behavioural profile.

This means that in most cases it is fruitless to think of personal information in terms of isolated data items; instead, we should think of “data sets,” collections of individual data points that have some kind of identifiable relationship with each other.

In traditional software terms, a relational database would be a data set. In more modern big data terminology, it might be a set of semi-structured tables in a Hadoop cluster. Either way, it is a relatively discrete store that contains many different types of data, which can be queried in some way to extract usable information.

It is likely that most organisations will have many such data sets, some nothing to do with personal information. It is also likely that many data sets will overlap in terms of the data items they contain. Data sets might be created by splitting one set into two or combining data from different sources. A data set might be used for one or more purposes, but generally speaking, it will have defined boundaries, member and nonmember items and, therefore, can be considered as having a single identity and set of characteristics.

By finding a way to describe these characteristics in privacy-relevant ways, we can not only start to bridge the gap between what is meaningful for the legal and software professionals but also make reasoned decisions about the treatment of those data sets.

As a starter, I propose doing this in three key dimensions, which, until better terms can be suggested, I am going to call Classes, Categories and States.

Below I present descriptions of the dimensions and try to illustrate them by example. I do not believe these descriptions are robust enough to be called definitions, though that would be a logical next step.

I also don’t think any of these ideas are particularly novel; many of them have been described before. However, I have not seen anything that tries to bring all these ideas together in one place—which is what I am hoping to achieve here.

Classes of Personal Data

A class is a high-level categorisation that looks to define the relationship between the data and the individual the data is about. There are four of these, and any data set containing information from, about or connected to an individual in some way could be defined as belonging to one of these classes. By inference, a data set that does not fit into any of these classes would be nonpersonal data.

Directly Identified—a data set that includes information which identifies individuals in terms of making them directly addressable/contactable. This means the data includes traditional identity data like name, postal address, telephone number, email address, etc.

Indirectly Identified—a data set which is about individuals and contains identifiers that enable it to be connected to another data set that renders those individuals directly identified. This holds even if the connection is not actually made in a given use case, or if the other data set is unavailable to a particular organisation. The crucial element is that the connection is possible. This includes data like national identifiers, IP and MAC addresses, licence-plate numbers and membership numbers.

Directly Identifiable—a data set that enables decisions to be made that will have an impact on an individual, even without actually identifying them in the more traditional sense. So this would include OBA profiles linked to a cookie-type identifier that enables an ad network to choose which adverts to display to a web user.

Indirectly Identifiable—a data set that may be about an individual, which on its own can neither single out nor have an impact on an individual but could do so in combination with other appropriate data sets. An example here might be “anonymous” geolocation records that could be correlated with other data to create a “segment of one” or at least a high enough probability that the data is about an individual that decisions can be made as if this was the case. This is likely to describe a large number of big data sets aggregated and used for things like predictive analytics. It might also describe a lot of data sets from Internet of Things sensors.
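For engineers, the four classes above could be expressed as a small decision function. The following is only a sketch: the names `PersonalDataClass` and `classify`, the four yes/no questions it takes as input, and the order in which it checks them are all assumptions made for illustration, not part of the model as described.

```python
from enum import Enum


class PersonalDataClass(Enum):
    DIRECTLY_IDENTIFIED = "directly identified"
    INDIRECTLY_IDENTIFIED = "indirectly identified"
    DIRECTLY_IDENTIFIABLE = "directly identifiable"
    INDIRECTLY_IDENTIFIABLE = "indirectly identifiable"
    NON_PERSONAL = "nonpersonal"


def classify(
    has_contact_identity: bool,            # name, postal address, email, phone...
    has_linkable_identifier: bool,         # national ID, IP/MAC address, licence plate...
    drives_individual_decisions: bool,     # e.g. an OBA profile keyed to a cookie ID
    could_single_out_when_combined: bool,  # e.g. "anonymous" geolocation records
) -> PersonalDataClass:
    """Assign a data set to one class.

    Assumption: the checks run from the most to the least direct
    relationship with the individual, and the first match wins.
    """
    if has_contact_identity:
        return PersonalDataClass.DIRECTLY_IDENTIFIED
    if has_linkable_identifier:
        return PersonalDataClass.INDIRECTLY_IDENTIFIED
    if drives_individual_decisions:
        return PersonalDataClass.DIRECTLY_IDENTIFIABLE
    if could_single_out_when_combined:
        return PersonalDataClass.INDIRECTLY_IDENTIFIABLE
    # By inference, a data set that fits none of the classes is nonpersonal.
    return PersonalDataClass.NON_PERSONAL


# A web-server log holding IP addresses but no names or contact details.
server_log_class = classify(False, True, False, False)
```

In practice the four inputs would themselves need careful definition; the point of the sketch is only that the class becomes a value a system can store and act on.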

Categories of Personal Data

Categories are a way of describing the data itself. I see categories as potentially being hierarchical, providing different levels of granularity as appropriate to the data set. Also categories may be a way of linking data to jurisdictional boundaries and requirements.

So the highest level categories would be Sensitive and Nonsensitive—bearing in mind that there are differences between jurisdictions in what is classed as sensitive.

Lower level categories would then be Health, Financial, Contact, Biographical, Communications, etc. In our own work we have identified 16 such categories. There could be a case for attempting to standardise these to create a common structure—but equally I could see that may be difficult and would not make sense in all situations.

It is of course likely that any data set will contain multiple categories of data within it, perhaps with different levels of legal requirements attached to them.
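A hierarchy like this might be sketched in code as a set of lower-level categories plus a per-jurisdiction rule for which of them roll up to Sensitive. Everything in the example is hypothetical: the names `Category`, `SENSITIVE_BY_JURISDICTION` and `top_level` are invented here, only the five categories named above are listed, and the two placeholder jurisdictions and their rules are made up purely to show that the same category can land in a different top-level bucket.

```python
from enum import Enum


class Category(Enum):
    # Only the five lower-level categories named in the text.
    HEALTH = "health"
    FINANCIAL = "financial"
    CONTACT = "contact"
    BIOGRAPHICAL = "biographical"
    COMMUNICATIONS = "communications"


# Placeholder rules, NOT statements about any real legal code.
SENSITIVE_BY_JURISDICTION = {
    "jurisdiction_a": {Category.HEALTH},
    "jurisdiction_b": {Category.HEALTH, Category.FINANCIAL},
}


def top_level(category: Category, jurisdiction: str) -> str:
    """Return the highest-level category for one lower-level category."""
    sensitive = SENSITIVE_BY_JURISDICTION[jurisdiction]
    return "sensitive" if category in sensitive else "nonsensitive"


def summarise(categories, jurisdiction: str) -> dict:
    """Map every category found in a data set to its top-level category."""
    return {c.value: top_level(c, jurisdiction) for c in categories}


# One data set, two jurisdictions, two different answers for FINANCIAL.
data_set_categories = [Category.CONTACT, Category.FINANCIAL]
view_a = summarise(data_set_categories, "jurisdiction_a")
view_b = summarise(data_set_categories, "jurisdiction_b")
```

Keeping the jurisdictional rules in a separate table means the category labels themselves stay independent of any one legal code.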

States of Personal Data

I see states as being a description of how data is stored or transmitted. The state also relates directly to the security of the data in the event of unauthorised access.

Clear—Data is stored and normally accessed in its raw form, which would also mean human readable where text is concerned.

Masked—Data is stored in unencrypted format but obscured when it is read or viewed, as when the characters of a password are replaced by black dots while it is being entered.

Encrypted—Data is encrypted in storage and can only be read through use of the appropriate decryption key.

De-identified—Data is altered from its raw form so that individuals are no longer identified but remain directly identifiable. This would include things like replacing people’s names with an ID number within the data set but retaining the ability to re-identify by reference to a separate data set that contains the names linked to the ID.

Pseudonymised—Data is altered from its raw form so that individuals are only indirectly identifiable within the data set. However, it remains possible to make individuals directly identifiable or identified through aggregation with one or more additional data sets.

Anonymised—Data is treated so that no individual can be either directly or indirectly identified or is identifiable. As technology advances, true anonymisation is increasingly seen as an unreachable goal. It may be that this state has to be tempered with a qualification, “anonymised according to best available practices and knowledge at the time.”

Using "anonymised" rather than "anonymous" is deliberate in this context. Anonymous data is something that never in its life contained personal information and as such generally falls outside of the scope of most privacy regulation. Anonymised acknowledges that it once had personal information and, therefore, retains a risk that it might be shown to be able to reveal that again in the future.

As with categories above, a data set might contain data in more than one State.
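To illustrate, the states could be recorded as an enumeration, and the De-identified state in particular can be sketched as a transformation: names are replaced with ID numbers, and the link between ID and name is kept in a separate data set. The names `State`, `de_identify` and `subject_id`, and the sample records, are assumptions for this example only.

```python
from enum import Enum
from itertools import count


class State(Enum):
    CLEAR = "clear"
    MASKED = "masked"
    ENCRYPTED = "encrypted"
    DE_IDENTIFIED = "de-identified"
    PSEUDONYMISED = "pseudonymised"
    ANONYMISED = "anonymised"


def de_identify(records, name_field="name"):
    """Replace each name with an ID number.

    Returns the altered records and a separate lookup (ID -> name)
    that would allow re-identification and so must be stored apart.
    """
    ids = {}
    next_id = count(1)
    altered = []
    for record in records:
        name = record[name_field]
        if name not in ids:
            ids[name] = next(next_id)
        # Copy every field except the name, then add the ID in its place.
        row = {k: v for k, v in record.items() if k != name_field}
        row["subject_id"] = ids[name]
        altered.append(row)
    lookup = {subject_id: name for name, subject_id in ids.items()}
    return altered, lookup


# Hypothetical sample data.
records = [
    {"name": "Alice", "city": "Leeds"},
    {"name": "Bob", "city": "York"},
    {"name": "Alice", "city": "Bath"},
]
altered, lookup = de_identify(records)
```

The altered records no longer carry names, but anyone holding the lookup can restore them, which is why the two data sets would need different access controls.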

Differentiating the Risks

The purpose of describing data sets along these dimensions is that it helps to separate out the different risks present both in type and severity level, as well as identify the most appropriate strategies to mitigate those risks.

For example, if you have a large data set of directly identified individuals with multiple categories of data and all stored in a clear state, the potential harm caused by a breach would be more significant than if the data was only indirectly identifiable and had been encrypted.

As such, these data sets have different needs in terms of security, access controls, consent for use and notice. By creating a better structure for how we describe personal information, what we do is break down that un-codeable “reasonable” that exists in the law into much more actionable and context-specific privacy requirements.
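The comparison in the example above can be made mechanical once a data set is described by its class, categories and states. The sketch below is deliberately crude: the `DataSetProfile` type, the `breach_risk` formula and every numeric weight are assumptions invented for illustration, and a real organisation would need to set its own.

```python
from dataclasses import dataclass

# Illustrative weights only (assumptions, not taken from the model):
# a higher number means greater potential harm if the data set is breached.
CLASS_WEIGHT = {
    "directly identified": 4,
    "indirectly identified": 3,
    "directly identifiable": 2,
    "indirectly identifiable": 1,
}
STATE_EXPOSURE = {
    "clear": 1.0,
    "masked": 1.0,  # masked data is still stored unencrypted
    "de-identified": 0.6,
    "pseudonymised": 0.4,
    "encrypted": 0.2,
    "anonymised": 0.1,
}


@dataclass(frozen=True)
class DataSetProfile:
    name: str
    data_class: str
    categories: frozenset
    states: frozenset

    def breach_risk(self) -> float:
        # A data set may hold data in several states; the least
        # protected state present is taken to dominate the exposure.
        exposure = max(STATE_EXPOSURE[s] for s in self.states)
        return CLASS_WEIGHT[self.data_class] * len(self.categories) * exposure


# Two hypothetical data sets mirroring the comparison in the text.
crm = DataSetProfile(
    "crm",
    "directly identified",
    frozenset({"contact", "financial", "health"}),
    frozenset({"clear"}),
)
telemetry = DataSetProfile(
    "telemetry",
    "indirectly identifiable",
    frozenset({"communications"}),
    frozenset({"encrypted"}),
)
needs_stronger_controls = crm.breach_risk() > telemetry.breach_risk()
```

The numbers matter less than the structure: once the three dimensions are recorded per data set, requirements for security, access controls, consent and notice can be attached to combinations of values rather than argued case by case.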

By adopting a standardised structure across the profession, it enables the better sharing of experience, solutions and best practice. This is, of course, another core activity for a professional body.

I offer this up not as a finished or fully researched model but as a contribution to the debate about how we move from the principles of the law to the practical application of information engineering in a way that helps the many audiences within this profession communicate and solve the problems of privacy better.



  • comment Gillian • Aug 8, 2014
    Thank you Richard for a very helpful article.  I would also support the development of consistent terminology between privacy and technology professionals.  This is particularly timely for privacy practitioners in Switzerland as we are considering the new FINMA regulations on the treatment of electronic client data which contain similar classifications.
  • comment Richard • Aug 8, 2014
    I should add that I think there is another key dimension of data that I didn't cover in the paper, mainly because of length. This is 'Source' or 'Origin' - where the data came from.  I think the work of the Information Accountability Foundation is the best place to start with this and its origins: Provided, Observed, Derived, Inferred -
    However to this I would add a fifth - Acquired - being data that has come in from a third party (whether paid for or otherwise).
  • comment Marc • Aug 8, 2014
    This is a great article and is consistent with how I view privacy and responsible data governance.  We could debate the names of the categories, but the analytical approach to both information and risk is, in my view, dead on.  It is also consistent with the framework in the Network Advertising Initiative's self-regulatory code of conduct.  We have four categories of data that are very similar to the categories above, although we label the buckets of data as: (1) PII; (2) Non-PII; (3) De-Identified; and (4) Sensitive.  I confess that these names are less than ideal, but in the framework different obligations attach to the different categories of data because there are different levels of risk. Or, as you so articulately stated, "these data sets have different needs in terms of security, access controls, consent for use and notice."  We should resist the urge to treat all data the same and all uses of data the same.  It will lead to absurd results. 
    Finally, putting aside names, the important point is that "[b]y adopting a standardised structure ... it enables the better sharing of experience, solutions and best practice." And that is one of the key objectives of our program. I'm eager to continue the discussion and work on the vocabulary so that we can indeed solve the problems of privacy better.
  • comment Richard • Aug 11, 2014
    Marc - thank you for the comments. With respect to the 'less than ideal' NAI labels - my guess is these are derived from the legal status of the data in question, within a particular jurisdiction.  This obviously works within that context - but makes it more problematic outside of that jurisdiction - where what goes in what category might be different, and also creates difficulties for software engineers who don't have (and shouldn't need to develop) the legal knowledge.  This is where I was hoping to promote the idea of descriptions of personal information that are independent of the nuances of individual legal codes.  It would also make it easier to put the data into categories specific to a particular context - such as the NAI code of conduct.
    I certainly welcome all input - what I would really like to see is the IAPP or a similar organisation leading efforts to agree such standardisation.