Privacy technology is undergoing a radical transformation, and many exciting technologies are available to help companies build better privacy programs and do right by their customers.
Data is at the heart of privacy. Reduced to its simplest form, privacy comes down to three top-level questions: What type of data is being processed? What, or who, is processing that data? And why is it being processed?
Enterprise platforms generate data at record speed, so a privacy team that tries to answer these three questions manually is always at a disadvantage, playing catch-up.
Let’s think about data creation and processing. If your company has an enterprise platform, you know it's constantly processing, generating, storing and deleting customer and employee data (in addition to data for other data subjects). All this data eventually makes its way to structured and unstructured data stores (databases, warehouses, cloud buckets, etc.), but where is most of this data generated? It's generated from code — lines of code process incoming data from your application and move it along a pipeline to data stores.
Most privacy technologies focus on the "data store" side of this equation. They offer integrations to scan data warehouses or unstructured data stores in the cloud, which is valuable and necessary. But scanning these data-at-rest stores, while crucial, is only half the story.
The other half is the code that generates this data in the first place: microservices, code pipelines and the like. Scanning these code bases shows where data comes from, lets you tackle many risks upstream, and helps you build a better privacy program.
Privacy, after all, is in the code.
What does code scanning mean?
In this context, code scanning means applying static code analysis techniques to your application code bases and extracting relevant information that might be needed to run a privacy program.
Static code analysis parses source code files without executing them and builds a structural map of the code and the relationships between its elements. For example, static code analysis output can include function calls and their relationships, parameter and variable names, and import paths, among other things.
These signals capture the underlying code's semantics and help extract unique insights relevant to a privacy program. A tangible example: finding a variable called "customer_credit_card_number" being processed in a file is a strong signal that a credit card number is being processed.
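To make that concrete, here is a minimal sketch of the idea in Python, using the standard library's ast module to parse a file without executing it and flag assignments whose names hint at sensitive data. The keyword list and the sample snippet are illustrative assumptions, not a production classifier.

```python
import ast

# Illustrative hints only; a real scanner would use a richer taxonomy
# and confidence scoring rather than simple substring matches.
SENSITIVE_HINTS = ("credit_card", "ssn", "email", "passport", "dob")

def find_sensitive_assignments(source: str, filename: str = "<unknown>"):
    """Flag assignments whose target variable names hint at personal data."""
    findings = []
    for node in ast.walk(ast.parse(source, filename)):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    if any(h in target.id.lower() for h in SENSITIVE_HINTS):
                        findings.append((filename, node.lineno, target.id))
    return findings

# Hypothetical snippet a scanner might encounter in a payments service.
snippet = "def charge(form):\n    customer_credit_card_number = form['cc']\n"
print(find_sensitive_assignments(snippet, "payments.py"))
# [('payments.py', 2, 'customer_credit_card_number')]
```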
This technique has been around for a long time. It has reportedly been used in medical, nuclear, aviation and security software, so why should privacy software lag behind?
Tangible benefits of code scanning
- Third-party vendor discovery at scale: Engineers who write applications might use third-party vendors for many things, from handling payments to sending notifications. This might be "business as usual" for engineering, but it is table stakes from a data map perspective. Code scans are a highly effective way to automatically create visibility into third-party vendor usage without needing to turn to engineering teams (see the first sketch after this list). When code scans run at a regular cadence, this can be done at scale, producing a real-time data map. Win-win.
- Data discovery and classification: Static code analysis creates a topology of "clues" from the source code, like function names, variable names, API calls and import paths. Systems can triangulate these clues to accurately predict the types of data a given piece of code is processing (second sketch after this list). This means you can not only glean much-needed information about data types, but also classify them and understand where they are being processed, giving a much deeper view of data processing and the associated flows within an application.
- Comprehensive data lineage: Scanning source code helps you understand how data moves from point A to point B. Knowing that personal data exists in a data store, or how it moves from one table to other stores, is insufficient; it's important to know where that data came from in the first place. Together, source code scanning and data-at-rest scans can paint the entire data lineage: how data moves within your application and into data stores (third sketch after this list).
- Privacy by design: Privacy by design is a core tenet of the EU's General Data Protection Regulation, and static code analysis lets you "shift left" in this data chain to catch potential issues before they make their way into production. Data proliferates in an organization from some piece of code: a microservice might write terabytes of data to multiple tables in a database, or a code pipeline might be dumping data into multiple S3 buckets. Understanding how data is processed at the source moves privacy checks earlier in this flow and gets organizations a step further in privacy-by-design initiatives. Knowing that sensitive data made its way into a table in a data store via a specific microservice is much better than just knowing that sensitive data exists in that table, and it almost always leads to much better decision-making.
- Incident response: In the unfortunate event of a privacy incident or data breach, scans of code bases and the underlying microservices play a crucial role in incident response and forensics. They can help you quickly identify the extent of the breach, trace how it occurred, and take swift action to mitigate the damage. If microservice A is sending data to a third party that has been compromised, triangulating and containing the immediate problem becomes far easier.
- Better data hygiene and more secure software: Code scanning makes sensitive data flows visible. When engineers grasp the data types their source code processes and shares, they can see how the services they own contribute to data proliferation and to the company's overall privacy posture. This leads to meaningful conversations about minimizing data types that aren't needed by the parts of the source code they own, rather than having these directives come top-down from the privacy team. This "data minimization" approach aligns with the GDPR's principle of limiting data processing to what is essential for a specific purpose, reducing the overall risk of data breaches.
- Practical and realistic approach: In an ideal scenario, developers would completely change how they write code to be mindful of privacy. While that is an ambitious and noble goal, impending deadlines and a lack of standardization make it hard to achieve in practice, and this is where static code analysis shines. Scanning code bases doesn't require engineers to write code a certain way. Instead, code scanning extracts the relevant bits from already-written code, leading to more productive conversations with your engineering counterparts and less work for all parties involved.
- Cost-effective: Determining sampling percentages and the associated costs can be difficult when dealing with data stores containing petabytes of data. The amount of code is not correlated with the amount of data it generates (the same small piece of code might be responsible for generating petabytes of data), so scanning the source code once is a cost-efficient way to comprehend data flows, compared with scanning every table or bucket within the data stores.
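To illustrate the vendor discovery point, here is a sketch in the same vein: it walks a module's import statements with Python's ast library and matches them against a catalog of vendor SDKs. The package-to-vendor mapping is a made-up example; a real tool would maintain a much larger catalog.

```python
import ast

# Hypothetical mapping of SDK packages to vendors; a real data map would
# draw on a maintained catalog of third-party processors.
VENDOR_PACKAGES = {
    "stripe": "Stripe (payments)",
    "twilio": "Twilio (notifications)",
    "boto3": "AWS (cloud storage)",
}

def discover_vendors(source: str) -> set:
    """Return the third-party vendors referenced by a module's imports."""
    vendors = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module] if node.module else []
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "twilio.rest" becomes "twilio"
            if root in VENDOR_PACKAGES:
                vendors.add(VENDOR_PACKAGES[root])
    return vendors

print(discover_vendors("import stripe\nfrom twilio.rest import Client\n"))
# {'Stripe (payments)', 'Twilio (notifications)'} (set order may vary)
```

Run across every repository on a schedule, a loop like this keeps the vendor side of a data map current without a single survey email to engineering.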
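The data classification bullet can be sketched the same way. The example below triangulates several kinds of identifiers (function names, parameters, variable references, attribute accesses) against an illustrative clue table to score which data categories a file likely processes; the clue table and snippet are assumptions for demonstration.

```python
import ast
from collections import Counter

# Illustrative clue table mapping identifier fragments to data categories.
CLUES = {
    "email": "contact_info",
    "phone": "contact_info",
    "card_number": "payment_data",
    "credit_card": "payment_data",
    "ssn": "government_id",
}

def classify_file(source: str) -> Counter:
    """Score data categories by the identifiers appearing in a file."""
    scores = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            text = node.id          # variable references
        elif isinstance(node, ast.arg):
            text = node.arg         # function parameters
        elif isinstance(node, ast.Attribute):
            text = node.attr        # attribute accesses like user.email
        elif isinstance(node, ast.FunctionDef):
            text = node.name        # function names
        else:
            continue
        for hint, category in CLUES.items():
            if hint in text.lower():
                scores[category] += 1
    return scores

snippet = "def send_receipt(email, card_number):\n    validate(card_number)\n"
print(classify_file(snippet))  # Counter({'payment_data': 2, 'contact_info': 1})
```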
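Finally, a toy version of the lineage idea: propagate a "taint" from a known-sensitive variable through simple assignments and flag when it reaches a persistence sink. The db.write sink and the snippet are hypothetical, and production lineage tools perform full dataflow analysis across files and services, but the principle is the same.

```python
import ast

SINK_ATTR = "write"  # treat any obj.write(...) call as persisting data

def trace_to_sinks(source: str, tainted: set) -> list:
    """Report sink calls that receive a tainted variable or its alias."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Propagate taint through simple aliases: alias = tainted_var
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Name):
            if node.value.id in tainted:
                tainted |= {t.id for t in node.targets
                            if isinstance(t, ast.Name)}
        # Flag sink calls whose arguments carry taint.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr == SINK_ATTR:
                findings += [(node.lineno, a.id) for a in node.args
                             if isinstance(a, ast.Name) and a.id in tainted]
    return findings

snippet = 'email = request_form["email"]\ncontact = email\ndb.write(contact)\n'
print(trace_to_sinks(snippet, tainted={"email"}))  # [(3, 'contact')]
```

Run as part of continuous integration, checks like these are one way to shift privacy review left, catching new flows of sensitive data before they ship.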
A comprehensive understanding of data flows is critical in this age of evolving privacy regulations and high-velocity data creation and processing. Looking at only one side of the equation (integrating with data stores) is insufficient. Examining data flows by directly scanning the source code provides the much-needed visibility to build stronger privacy programs and do right by customers.