
The Privacy Advisor | Structured versus unstructured data, an engineer’s take on the privacy implications


For engineers who have to write privacy regulations into their code and lawyers who need engineering support to create scalable privacy programs, the difference between structured and unstructured data matters a great deal.

Anyone who works in privacy will likely agree that unstructured data is one of the harder problems we face. Yet, privacy laws don’t care whether data is structured or unstructured – the regulations are still the same.

As an engineer and founder of a data privacy infrastructure company, here’s my first-hand account of the privacy implications for structured versus unstructured data. 

I’ll focus on exploring answers to three central questions that I hear time and time again from privacy teams building automated privacy requests at the code level:

1. What are the basics of structured data versus unstructured data?

2. What are the engineering implications for the two different types of data? 

3. How do engineers solve the challenges that unstructured data poses for privacy tasks, like access and deletion requests?

First, the basics of structured and unstructured data 

Let’s quickly unpack the difference between structured and unstructured data using relevant engineering examples. 

Structured data is data or user information represented inside a system in a strict, field-based format, which makes it easier to operate on. An overly simplified example is the data filed under the First Name, Last Name and Email fields in a database.

Unstructured data is data or user information stored without a predefined format. Examples include free-form text in a software system, such as messages in workplace chat software or text in a Word document. It is unstructured because it can appear anywhere and isn’t arranged in fields or reliably linked to a particular user.

What are the engineering implications of structured data? 

There are a few key engineering implications to consider for structured data. First off, structured data is typically stored in a “relational database,” such as MySQL or PostgreSQL. 

You can think of a relational database as an Excel spreadsheet. Imagine a table that keeps track of registered users, where the columns have headers for “User ID,” “First Name,” “Last Name” and “Email,” and each row contains the information of one registered user. 

Structured data also has relational structures. Continuing with our Excel spreadsheet analogy, at an e-commerce company, there will be a second table describing all purchases, where the columns have headers for “Order Number,” “Total Price” and “User ID,” and each row contains the details of a purchase and a reference to the User ID of the registered user who made the order. This shows the "relations" between the purchases table and the "registered users" table. When engineers talk about structured data for data privacy, they’re discussing data that is easier to operate on because engagement rules are more precise. 

For this reason, structured data is easier for engineers when handling a data privacy request. An engineer can write code to delete everything related to a particular user and watch with joy as it cleanly finds and deletes the appropriate data without harming or messing with anything else in the database. 
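A minimal sketch of that kind of clean deletion, using Python's built-in sqlite3 module as a stand-in for MySQL or PostgreSQL. The table layout follows the spreadsheet analogy above; the table names, sample rows and the delete_user helper are assumptions made up for illustration, not any particular company's schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the two illustrative tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (
            user_id INTEGER PRIMARY KEY,
            first_name TEXT,
            last_name TEXT,
            email TEXT
        );
        CREATE TABLE purchases (
            order_number INTEGER PRIMARY KEY,
            total_price REAL,
            user_id INTEGER REFERENCES users(user_id)
        );
        INSERT INTO users VALUES
            (1, 'Ada', 'Lovelace', 'ada@example.com'),
            (2, 'Alan', 'Turing', 'alan@example.com');
        INSERT INTO purchases VALUES
            (100, 19.99, 1),
            (101, 5.00, 1),
            (102, 42.00, 2);
        """
    )
    return conn


def delete_user(conn: sqlite3.Connection, user_id: int) -> dict:
    """Delete one user's rows from both tables in a single transaction."""
    with conn:  # commits on success, rolls back if an error is raised
        purchases = conn.execute(
            "DELETE FROM purchases WHERE user_id = ?", (user_id,)
        ).rowcount
        users = conn.execute(
            "DELETE FROM users WHERE user_id = ?", (user_id,)
        ).rowcount
    return {"purchases": purchases, "users": users}


if __name__ == "__main__":
    db = build_demo_db()
    print(delete_user(db, 1))
```

Because every row carries a User ID, the code can target one person's data exactly and leave the other user's rows untouched. A real system may also have to retain some records, such as purchases, for legal or accounting reasons; this sketch ignores that.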

That’s not to say that every structured data system is easy. The application programming interfaces to operate on data in a given system can be tricky to work with, but the engineering concept is relatively straightforward: The data is stored in a strict format; an engineer can operate on it with consistency. 

In general, the most common engineering implication for structured data is a question about whether it is worth the effort to build versus buy all of the integrations needed to process a data subject access or deletion request across the many data systems a company uses. API integrations come with maintenance (e.g., APIs can change), and the build cost can require an engineering investment of many months.  

For companies that have hundreds of different data systems, engineers may look for privacy developer tools that can empower them to have structured data system integrations that easily process thousands of access and deletion requests a month with absolutely no humans in the loop. 

What are the engineering implications of unstructured data? 

Unstructured data, in contrast to structured data, is a much harder challenge. That said, all is not lost. The good news is the engineering implications of unstructured data may not be as dire or impossible as you might think. 

First off, if you’re frustrated by unstructured data, you are not alone. In an IAPP-EY governance survey, more than half of respondents (56%) said “locating unstructured personal data” was the most difficult issue in responding to data subject access requests. Unstructured data far outpaced the challenges of “monitoring data protection/privacy practices of third parties” (36%), “data minimization” (28%), or developing a centralized opt-out tool (25%).

It’s no surprise unstructured data can be challenging for engineers and data privacy program managers because unstructured data offers fewer engagement rules to guide the writing of code. To put it another way, it is harder to write code to automate user deletion and access requests if it is harder to identify where a user’s data is, find all of it completely, untangle it from the other data it may live within, and write clean "rules of engagement" for performing the necessary privacy task.  

To bring this problem to life, imagine an unstructured data system as an artistic drawing, as opposed to the neat, structured Excel spreadsheet discussed above. In the drawing, all sorts of different users and user information are looped together and interrelated in swoops and patterns. The pieces of data might depend on each other to tell a bigger story, or they may be presented in random brush strokes with no apparent rhyme or reason for an engineer to code against.

When I think of unstructured data, I often think first of the data system of a messaging application in which millions of messages are being populated about various topics and people across hundreds of channels, from main channels, direct message channels, specific channels and more. Unstructured data is also common in customer support tickets or emails. Finding and deleting all the mentions of one person’s data from that mess can feel impossible.

Locating where the user data is in an unstructured data system is part of an engineer’s challenge. If you don’t have some sort of user identifier to search for, your engineers are going to struggle. Take an employee’s data as an example: someone mentions a particular employee by first name only, without the corresponding email address or phone number. How is an engineer supposed to find that mention without an identifier that locates it? In general, data is easier to find if there is some sort of unique search term, such as a phone number, email address or account name. The key or identifier is a big first step in helping engineers create powerful systems that operate on data in unstructured data systems.
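To make the role of the identifier concrete, here is a small sketch of an identifier-based search over chat messages. The Message type, the find_mentions helper and the sample messages are all hypothetical, and a plain substring match stands in for whatever search a real messaging platform would offer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    channel: str
    author: str
    text: str


def find_mentions(messages: list[Message], identifiers: list[str]) -> list[Message]:
    """Return messages whose text contains any known identifier (case-insensitive)."""
    needles = [identifier.lower() for identifier in identifiers]
    return [m for m in messages if any(n in m.text.lower() for n in needles)]


SAMPLE_MESSAGES = [
    Message("general", "sam", "Please email dana@example.com about the refund"),
    Message("support", "lee", "Dana called again from 555-0100"),
    Message("random", "sam", "Dana is out on Friday"),
]

if __name__ == "__main__":
    for hit in find_mentions(SAMPLE_MESSAGES, ["dana@example.com", "555-0100"]):
        print(hit.channel, "|", hit.text)
```

The first two messages are found because they contain the email address or the phone number. The third, which refers to the person by first name only, is missed, and that is exactly the gap described above.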

To dive deeper into unstructured data, let’s look at processing a deletion request. An engineer working to encode deletion requests into unstructured data systems will need to find a way to erase at scale one person’s data from a big, messy pool of mixed data. For the engineer or, more likely, an engineering team, there will not be a neat and tidy entry point to make a clean cut — i.e., a way to reliably get in and get out without impacting the rest of the data, doing an incomplete job or executing deletion on the wrong data. In unstructured data, an engineer can’t just delete an individual user’s row or the data fields within a larger dataset.
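One way to approximate a "clean cut" in free text is to redact the person's known identifiers in place rather than delete whole messages. The sketch below assumes the identifiers are already known and appear literally in the text; the redact helper and the placeholder string are illustrative choices, not a standard.

```python
import re


def redact(text: str, identifiers: list[str], placeholder: str = "[REDACTED]") -> str:
    """Replace every literal occurrence of the given identifiers, ignoring case."""
    if not identifiers:
        return text
    # Try longer identifiers first so a short one cannot clip part of a longer match.
    ordered = sorted(identifiers, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(i) for i in ordered), re.IGNORECASE)
    return pattern.sub(placeholder, text)


if __name__ == "__main__":
    message = "Email Dana@Example.com or call 555-0100 about order 7"
    print(redact(message, ["dana@example.com", "555-0100"]))
```

The rest of the message survives, which limits the impact on other people's data, but anything that identifies the person without using a listed identifier (a nickname, a job title) is left behind. That is the risk of doing an incomplete job.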

And if you thought the engineering implications of deletion in unstructured data were hard, access requests are even harder. Often in unstructured data, the data doesn’t belong to just one person. Engineers can’t simply hand a user mixed data, like a chat message, because it doesn’t belong only to them. So instead, for access requests on unstructured data, engineers often have to build a system that separates out one specific person’s data and gives access to only that.
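As a rough illustration of what such a system might do, the sketch below builds an access export from mixed chat messages: it keeps only messages the requester wrote or is mentioned in, and masks the email addresses of other known people. The policy, the Message type and build_access_export are assumptions for illustration only; what actually belongs in an access response is a legal decision.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    author_email: str
    text: str


def build_access_export(
    messages: list[Message], requester_email: str, known_emails: list[str]
) -> list[str]:
    """Return the requester's messages with other people's emails masked."""
    requester = requester_email.lower()
    others = [e for e in known_emails if e.lower() != requester]
    mask = (
        re.compile("|".join(re.escape(e) for e in others), re.IGNORECASE)
        if others
        else None
    )
    export = []
    for message in messages:
        involved = (
            message.author_email.lower() == requester
            or requester in message.text.lower()
        )
        if not involved:
            continue  # the message is not about the requester at all
        export.append(mask.sub("[OTHER PERSON]", message.text) if mask else message.text)
    return export


SAMPLE_THREAD = [
    Message("dana@example.com", "Looping in lee@example.com on my refund"),
    Message("lee@example.com", "dana@example.com asked about order 7"),
    Message("lee@example.com", "Lunch at noon?"),
]

if __name__ == "__main__":
    directory = ["lee@example.com", "dana@example.com"]
    for line in build_access_export(SAMPLE_THREAD, "dana@example.com", directory):
        print(line)
```

Even this toy version has to decide which messages count as the requester's and how to treat everyone else who appears in them, which is why access is harder than deletion.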

For most privacy teams, unstructured data stands in the way of a fully automated and compliant privacy program. Many companies find the task so daunting that they push it to the background or work through inefficient manual searches. But that is a false choice. If you are a large company that wants more efficient, resource-light privacy compliance, you need the ability to handle unstructured data with automation, and the tools exist to do so.

Tying it all together

From an engineer’s perspective, the difference in difficulty between structured and unstructured is immense. In fact, it is so stark that on the engineering side, I often see teams prematurely attempt to build their own automated solutions when they have only dipped their toes into structured data systems. A common assumption is they’ll handle a few APIs here and there and perhaps some overhead maintenance and done! But when these same teams start to encounter unstructured data, they quickly realize that doing this on their own may be more costly than they expected in terms of the financial and staff investment. 

Many legal privacy leaders, as well, have been taught that unstructured data is too complicated to be automated, but this is no longer true. There are engineering options that remove ever-growing headcount costs, negate the need for workflow tooling or shoulder-tapping, and create fully automated end states that operate on access and deletion requests in the data layer. The best part is that it’s not just access and deletion requests; operating on unstructured data can also automate litigation holds and other legal operations so they complete as quickly as the software performs the task (often in seconds or minutes).

The takeaway is that despite all of the engineering challenges, unstructured data is not something that should continue to stump privacy teams. For example, our own tools power privacy requests for unstructured data in numerous data systems for other companies in finance, retail and consumer technology. 

Lastly, when looking at engineering implications, it’s always worth considering what’s coming around the corner. In the future, I believe unstructured data systems will only continue to get easier to grapple with, as engineers continue to apply advancements in artificial intelligence and machine learning models that can help offer predictive insights to where user data might be. 

Today’s data-mapping capabilities can give way to future predictions and modeling for where data could end up later on. We can already locate any key instance in an unstructured data system anywhere for access or deletion. In the future, we may want the ability to create forward-looking maps of where data might be. 

In the end, both structured and unstructured data can present their own engineering challenges unless you have sufficient data privacy infrastructure in place and a plan for the future.

Photo by Sai Kiran Anagani on Unsplash
