Reflecting upon the 2015 U.S. Office of Personnel Management security breach, former U.S. Department of Homeland Security chief privacy officer Nuala O’Connor, CIPP/G, noted, “The OPM breach was not only the product of bad security practices but also of poor privacy practices … OPM didn’t have the most basic data map.” 

“Data mapping” is typically considered the foundational step for any privacy program — but what exactly is it? 

To me, data mapping is an umbrella term that depends on the context, stakeholders and needs of the organization. Data governance teams have different requirements from privacy teams. And how each of these teams conducts the exercise can have various layers: from reviewing the data assets themselves, to classification and inventory, to an official record of processing activities, to leveraging lineage for an automated and holistic view of the organization’s data flows. Data mapping is not just a compliance exercise; the more you know your data, the better you can protect it and leverage it for greater business insights. 

Data mapping through a governance lens 

For those in data governance, data mapping links the data sources and/or locations back to data attributes and corresponding definitions. Just as people may need a translator to comprehend different languages, data mapping standardizes the understanding of the various types of data used across different platforms, applications and systems. One common methodology is a top-down approach: mapping logical entities (a conceptual value with attributes attached to it) to the various physical instances (the precise physical characteristics of that value) of those entities. 
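
As a minimal sketch, such a top-down map can be expressed as a simple lookup from each logical attribute to its physical locations. All system, table and column names below are hypothetical examples, not a product’s actual schema:

```python
# A minimal sketch of a top-down logical-to-physical data map.
# All system, table and column names are hypothetical examples.

# Logical entities: conceptual values with their attributes.
logical_entities = {
    "Customer": ["email_address", "national_id", "date_of_birth"],
}

# Physical instances: where each logical attribute actually lives.
physical_map = {
    ("Customer", "email_address"): [
        {"system": "crm_postgres", "table": "contacts", "column": "email"},
        {"system": "billing_mainframe", "dataset": "CUST.MASTER", "field": "EMAIL-ADDR"},
    ],
    ("Customer", "national_id"): [
        {"system": "hr_oracle", "table": "employees", "column": "ssn"},
    ],
}

# Walking the map answers: "Where does this attribute physically exist?"
for (entity, attribute), locations in physical_map.items():
    for location in locations:
        print(f"{entity}.{attribute} -> {location}")
```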

Another methodology is a bottom-up approach using correlation sets, which are unique to my organization, BigID. Here, you would leverage a subset of logical entities — which could include physical instances of data — to train a machine-learning model. That ML model can then be used to find additional physical instances and correlate them to a relevant logical entity. As BigID’s CEO and co-founder Dimitri Sirota explains: “the organization’s training data (or 'seed' data) can be spread across different data sources” and “are used to understand basic identifiers, relationships, and distributions.” Having this extra dimension of whose data is found and where it is located can help an organization more effectively conduct its data mapping exercise.
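
To make the pattern concrete, here is a deliberately simplified sketch of bottom-up correlation. It substitutes exact matching of normalized seed values for a trained ML model, and all records and sources are hypothetical; this illustrates the general idea, not BigID’s actual implementation:

```python
# A simplified sketch of bottom-up correlation: seed records of known
# individuals are used to find and correlate additional physical
# instances of their data in other sources. Real tools train ML models
# for this; here we simply match normalized seed values exactly.

seed_records = [  # hypothetical "seed" / training data
    {"entity_id": "cust-001", "email": "ana@example.com", "phone": "555-0101"},
    {"entity_id": "cust-002", "email": "bo@example.com", "phone": "555-0102"},
]

def normalize(value: str) -> str:
    return value.strip().lower()

# Index every known identifier value back to its logical entity.
seed_index = {
    normalize(value): record["entity_id"]
    for record in seed_records
    for key, value in record.items()
    if key != "entity_id"
}

# Scan another (hypothetical) data source for correlated instances.
support_tickets = [
    {"ticket": 881, "reporter_email": "ANA@example.com"},
    {"ticket": 882, "reporter_email": "unknown@example.com"},
]

for row in support_tickets:
    entity = seed_index.get(normalize(row["reporter_email"]))
    if entity:
        print(f"Ticket {row['ticket']} correlates to {entity}")
```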

Mapping the data assets for privacy pros

For privacy professionals, data mapping is typically done for regulatory purposes and will start with the required record of processing activities (RoPA). However, this top-down approach to data mapping may miss important insights. Instead, I recommend taking a bottom-up approach via the following steps: 

  1. Understand the data assets

Ask the following questions:

  • Does your organization rely upon legacy systems and mainframes? Legacy applications built upon traditional databases cannot be easily replaced. To accelerate new services, microservices-based apps, which are various single-function services that communicate with one another via application programming interfaces, may be built on top of these systems.
  • Is data in a system structured or unstructured? Structured data is organized and formatted to fit predefined models that can be easily mapped into designated fields. Unstructured data has no predefined model and can be difficult to deconstruct. This may impact how the organization processes it. 
  • Is the environment on-premises or cloud-hosted? On-premises software enables more control, but is typically associated with a higher cost. Cloud computing can be less expensive and easier to deploy, but encryption keys and other security dependencies may reside with the cloud provider, creating a higher risk. 
  • Where is data physically stored? Organizations may have legal obligations depending on the jurisdiction, and cloud-based storage adds another layer of complexity. 
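
One lightweight way to capture the answers to these questions is a per-system asset record. The sketch below is illustrative only; the field names and example systems are hypothetical:

```python
# A minimal sketch of a data-asset record capturing the questions above.
# Field names and example values are hypothetical.
from dataclasses import dataclass

@dataclass
class DataAsset:
    name: str
    is_legacy: bool            # legacy system or mainframe?
    structured: bool           # structured vs. unstructured data
    hosting: str               # "on-premises" or "cloud"
    storage_jurisdiction: str  # where the data is physically stored

assets = [
    DataAsset("billing_mainframe", is_legacy=True, structured=True,
              hosting="on-premises", storage_jurisdiction="US"),
    DataAsset("support_docs_bucket", is_legacy=False, structured=False,
              hosting="cloud", storage_jurisdiction="EU"),
]

# Flag combinations that warrant follow-up review.
for asset in assets:
    if asset.hosting == "cloud":
        print(f"{asset.name}: review provider-held encryption keys")
    if not asset.structured:
        print(f"{asset.name}: unstructured data may be harder to deconstruct")
```
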
  2. Get to know the data: Classification and inventory 

Once there is clarity around the organization’s data assets, the next step is to better understand the data itself. Data processed by an organization can range in complexity, which is why it helps to have proper classification and inventory. 

Data classification organizes data attributes based on shared characteristics so it can be tagged, retrieved and handled accordingly. Classification is typically done based on the: 

  • Category of data (e.g., medical data = protected health information).
  • Profile type (e.g., Customer identification = customer data).
  • Sensitivity level (e.g., National ID number = high risk).  

Data classification can make the process of protecting data an easier and more effective exercise. For instance, this can help establish appropriate role-based access controls based on the type of data and the sensitivity level. 
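
As a minimal sketch of how these three axes can drive role-based access controls, consider rule-based classification. All attribute names, labels and roles below are hypothetical:

```python
# A minimal sketch of rule-based classification along the three axes
# above, plus a sensitivity-driven access check. All attribute names,
# labels and roles are hypothetical.

CLASSIFICATION_RULES = {
    # attribute -> (category, profile type, sensitivity level)
    "diagnosis_code": ("protected health information", "patient data", "high"),
    "customer_email": ("contact data", "customer data", "medium"),
    "national_id": ("government identifier", "customer data", "high"),
}

# Role-based access: which sensitivity levels each role may read.
ROLE_CLEARANCE = {
    "analyst": {"low", "medium"},
    "privacy_officer": {"low", "medium", "high"},
}

def classify(attribute: str) -> tuple:
    # Unknown attributes default to high sensitivity as a safe fallback.
    return CLASSIFICATION_RULES.get(attribute, ("unclassified", "unknown", "high"))

def can_access(role: str, attribute: str) -> bool:
    _, _, sensitivity = classify(attribute)
    return sensitivity in ROLE_CLEARANCE.get(role, set())

print(can_access("analyst", "national_id"))          # False: too sensitive
print(can_access("privacy_officer", "national_id"))  # True
```

Defaulting unknown attributes to high sensitivity is one reasonable design choice: it keeps unclassified data on the conservative side until someone reviews it.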

While classification forces teams to come up with labels describing the nature of the data, the data inventory ensures those labels are reflected in the physical data stored. Inventorying is typically done by applying machine-readable tags that are an “incarnation” of the classification. As Uber Privacy Engineering, Architecture and Analytics Director Nishant Bhajaria states in his book “Data Privacy: A runbook for engineers”: “Creating a data inventory is like building the backend of a search engine for your data, much like a team of smart engineers built the backend of tools like Google.” 

The inventory can help manage privacy risks in an organization, as it ties the understanding of the data (a.k.a. classification) to the data itself. “This means that if you were to transfer your data from an on-premises environment to the cloud … you’d ensure that the data carried with it the identities and risk values you have attached to it,” Bhajaria explains.
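
A minimal sketch of that idea, with hypothetical locations and tags: because the machine-readable tags are attached to the dataset record itself, a migration carries the classification and risk values along with the data.

```python
# A minimal sketch of machine-readable inventory tags that travel
# with the data, as described above. Names and locations are
# hypothetical.

dataset = {
    "location": "onprem://warehouse/customers",
    "tags": {  # the "incarnation" of the classification
        "category": "customer data",
        "sensitivity": "high",
        "identities": ["email_address", "national_id"],
    },
}

def migrate(ds: dict, new_location: str) -> dict:
    """Move a dataset, carrying its tags (and thus its risk values) along."""
    moved = dict(ds)  # shallow copy is enough for this sketch
    moved["location"] = new_location
    return moved

cloud_copy = migrate(dataset, "s3://example-bucket/customers")
assert cloud_copy["tags"]["sensitivity"] == "high"  # risk values preserved
```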

  3. Delineate data processes with the RoPA 

The RoPA often comes to mind when privacy professionals think about data mapping. Article 30 of the EU General Data Protection Regulation requires both data controllers and processors to maintain a RoPA. Unlike the previous steps, this is an explicit legal requirement for organizations subject to the GDPR and several other data protection regulations.

The U.K.’s Information Commissioner's Office published guidance on what generally should be included in the RoPA:

  • Types of data subjects whose data is processed, and details on the personal data processed about them. Note that this should include “inference data,” which could reveal sensitive information about the data subject.
  • Recipients of the data: what types of persons or organizations (internal or external) the data is being shared with.
  • Third-party transfer information (this becomes especially important if it’s considered a cross-border data transfer).
  • Retention and destruction policies for the data at hand. 
  • The purposes of processing.
  • Technical and administrative security measures that are in place.

This is often an intense and manual process that is hard to validate and keep up to date. Many organizations are automating the RoPA process by using technological solutions. For instance, in BigID’s RoPA application, if changes are made to the data inventory, they will be automatically reflected in the RoPA records leveraging those specific datasets, which helps to keep the data mapping process evergreen. 
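
As a general sketch of that pattern (not BigID’s actual implementation), a RoPA record can reference live inventory entries by ID, so its personal data categories are derived at read time and inventory changes flow through automatically. All names below are hypothetical:

```python
# A minimal sketch of a RoPA record that references the live data
# inventory, so inventory changes flow through to the record.
from dataclasses import dataclass, field

inventory = {  # dataset_id -> classification tags (hypothetical)
    "ds-customers": {"categories": ["contact data"], "sensitivity": "high"},
}

@dataclass
class RopaRecord:
    processing_activity: str
    purposes: list
    data_subjects: list
    recipients: list
    cross_border_transfers: list
    retention_policy: str
    security_measures: list
    dataset_ids: list = field(default_factory=list)

    def personal_data_categories(self) -> set:
        # Derived from the inventory at read time, so the record
        # stays current as classifications change.
        return {cat for ds in self.dataset_ids
                    for cat in inventory[ds]["categories"]}

record = RopaRecord(
    processing_activity="Customer support",
    purposes=["respond to support requests"],
    data_subjects=["customers"],
    recipients=["internal support team"],
    cross_border_transfers=[],
    retention_policy="delete two years after last contact",
    security_measures=["encryption at rest", "role-based access"],
    dataset_ids=["ds-customers"],
)

# An inventory change is reflected the next time the record is read.
inventory["ds-customers"]["categories"].append("inference data")
print(record.personal_data_categories())
```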

Tying it together with data lineage 

Proper data governance supports how data is processed and protected. To tie together the data governance and data privacy sides of the house, a useful and applicable method is data lineage.

As I previously wrote for The Privacy Advisor about how data flows upstream to downstream, “data lineage is typically built for specific business processes or scoping specific data elements” and can show:

  • “The original source of the data.
  • The most critical data within the inventory. 
  • How data sets are subsequently built and aggregated. 
  • The potential quality of the data sets. 
  • Any transformations along the data’s life cycle journey.”

Lineage adds context to the data, since metadata — data that provides information about other data — is added to the mapping process. Metadata is used to detect changes within data tables and how those changes impact downstream applications. It also supports technical security measures needed for business processes, since data breach mitigation steps can concurrently take place (e.g., locking down sensitive databases). 
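
A minimal sketch of that downstream impact analysis, modeling lineage as a directed graph; the node names and transformations are hypothetical:

```python
# A minimal sketch of lineage as a directed graph, used for the
# downstream impact analysis described above. Node names and
# transformations are hypothetical.
from collections import deque

# edges: upstream node -> list of (downstream node, transformation)
lineage = {
    "raw.customers": [("warehouse.dim_customer", "dedupe + normalize")],
    "warehouse.dim_customer": [
        ("reports.churn_dashboard", "aggregate by month"),
        ("ml.features_customer", "feature extraction"),
    ],
}

def downstream_impact(node: str) -> list:
    """Breadth-first walk of everything affected by a change to `node`."""
    impacted, queue = [], deque([node])
    while queue:
        current = queue.popleft()
        for child, transform in lineage.get(current, []):
            impacted.append((child, transform))
            queue.append(child)
    return impacted

# A schema change in the source table ripples downstream:
for child, transform in downstream_impact("raw.customers"):
    print(f"impacted: {child} (via '{transform}')")
```

The same walk supports breach mitigation: once a sensitive source is identified, every downstream consumer that needs locking down falls out of the traversal.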

Lineage can be likened to an ever-active RoPA that provides real-time visibility. As my colleague and BigID Chief Data Officer Peggy Tsai acknowledges, “knowledge of the data’s provenance and ensuing life cycle within the organization can mitigate what could end up being a burdensome obligation, financially and timewise.” 

Data mapping for the win! 

Data mapping is at the heart of an organization’s data strategy. It provides the privacy team with a greater understanding of privacy risks, both regulatory and non-regulatory. Working with data governance, it showcases how privacy is more than just a compliance function — when data assets are known from A to Z, the business can creatively leverage data while clearly understanding its risks.

The suggested steps may take time and some serious thought, but with a reasonable project plan for completing these steps, executive and stakeholder support and (of course) a positive attitude, this is all achievable. On the other side of the tunnel is real visibility into the who-what-why of data with the proper foundation and guardrails to effectively utilize it across the organization.