There appears to be persistent confusion about the California Consumer Privacy Act and its incentives to encrypt and redact personal information wherever possible.
Specifically, the CCPA encourages security through two means. First, a breach of non-encrypted and non-redacted personal information can expose an organization to statutory damages of up to $750 per consumer per incident. Data that is encrypted and redacted may avoid that liability in the case of a breach.
Second, deidentified or aggregate data is not subject to the CCPA’s rigorous obligations. Broadly, deidentified data refers to information from which individual identities have been removed and that is reasonably unlikely to be linked to any consumer or household.
While encryption and redaction certainly sound similar, the strategy and implementation for each are quite different. In sum, treating encryption as a substitute for redaction can create more problems than it solves. This short post is meant to clarify what each does and how to think about both encryption and redaction in the context of the CCPA.
Understanding encryption
Encryption is a security strategy. It protects your organization in scenarios like a devastating breach: If an adversary were to gain access to your servers, the stored data would be of no use to them unless they also had the encryption key. It’s an all-or-nothing security posture: You either get to see the data unencrypted, or you don’t.
Decryption is typically handled at what’s called the “storage layer” — where the operating system or similar decrypts information as it reads it from disk. Data is stored encrypted and then decrypted as it is read so that data can be used for computation.
As you can imagine, organizations should always encrypt their data on disk to limit the damage from breaches such as these, and the CCPA appropriately incentivizes them to do so.
Understanding redaction
While encryption on disk is critical to avoid disclosure by auxiliary methods of accessing the data, it does not mitigate many other forms of attack that can undermine privacy. This is where redaction comes into play, which can also be termed “deidentification” or “anonymization.” Redaction is ultimately a privacy strategy.
So how does it work? Redaction hides specific values within data using a variety of techniques. Values can be “masked,” for example, through hashing (where instead of the raw value a user will see a seemingly random string of characters) or through rounding (where numbers might be rounded to the nearest tenth, for example). Redaction is less black and white than encryption: Data can still be accessed and used while it is redacted, with different levels of utility depending on how heavily the data is manipulated to protect privacy.
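As a rough illustration of the two masking techniques just described, here is a minimal Python sketch. The function names, the sample record and the choice of SHA-256 are assumptions made for the example, not something the CCPA or any particular product prescribes.

```python
import hashlib


def mask_by_hashing(value: str) -> str:
    """Replace a raw value with its SHA-256 hex digest (64 hex characters)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def mask_by_rounding(value: float, digits: int = 1) -> float:
    """Reduce precision by rounding; digits=1 rounds to the nearest tenth."""
    return round(value, digits)


# Hypothetical record, used only for illustration.
record = {"ssn": "123-45-6789", "salary": 84312.57}

redacted = {
    # The analyst sees a fixed-length digest instead of the raw SSN.
    "ssn": mask_by_hashing(record["ssn"]),
    # A negative digit count rounds to the nearest thousand: less precise,
    # but still useful for aggregate analysis.
    "salary": mask_by_rounding(record["salary"], -3),
}
```

The rounding step shows the utility trade-off: the coarser the rounding, the more privacy and the less analytic precision.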
With redaction, you can build policies on what data is shown to whom at query time. You might use redaction to build rules such as “mask Social Security number unless the user running the query is in the group HR.” This is a strategy not simply to protect against breaches but also to maintain consumer privacy.
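The “mask Social Security number unless the user is in HR” rule can be sketched in a few lines of Python. This is an illustrative toy, not how any specific database or access-control product implements it; the `User` class, the group names and the `apply_policy` helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    name: str
    groups: frozenset = field(default_factory=frozenset)


def mask_ssn(ssn: str) -> str:
    """Hide everything except the last four digits."""
    return "***-**-" + ssn[-4:]


def apply_policy(row: dict, user: User) -> dict:
    """Return a copy of the row as this particular user is allowed to see it."""
    visible = dict(row)  # never modify the stored row itself
    if "HR" not in user.groups:
        visible["ssn"] = mask_ssn(row["ssn"])
    return visible


# Hypothetical stored data and users.
rows = [{"name": "Ada", "ssn": "123-45-6789"}]
hr_user = User("pat", frozenset({"HR"}))
analyst = User("sam", frozenset({"Analytics"}))

# The policy is evaluated per query, per user; the stored data is unchanged.
hr_view = [apply_policy(r, hr_user) for r in rows]
analyst_view = [apply_policy(r, analyst) for r in rows]
```

Because the rule runs at query time, changing who may see raw values means changing the policy, not rewriting or re-encrypting the stored data.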
Embracing encryption and redaction
The difference seems pretty clear, right? The confusion arises over when and how far to enforce encryption, because some users will, at some point, need to access the raw data. Otherwise, why store it? Some would argue that raw data needs to be encrypted in the database itself rather than at the storage level, an approach that is often misconstrued as a masking technique. If the values are stored encrypted, one needs to access the database through an alternate interface that understands how to deal with the encryption. At this point, you can think of the system boundary for the database as the database plus this interface. This combined system has all the same semantics as the original database, so it inherits all the same problems and solves nothing.
The way to balance all these needs is, in my view, to dynamically enforce deidentification at query time, combining the security strengths of encryption with the privacy protections of redaction.
This is a powerful and critical approach to data protection for a variety of reasons. First, you can enforce highly complex controls across users, and those controls can differ based on what the user is doing or what their attributes are. If you simply encrypt, the user either has access to the data or they don’t. There’s no way to add the needed granularity, nor can you easily change policies down the road when the rules on your data inevitably change.
Query-time deidentification enables organizations to think about risk in the context of how the data is actually used and accessed. If a certain way of accessing the data is too risky, you can minimize the number of users that have access to that view and ensure that users accessing the data understand the risks in doing so.
Second, with query-time deidentification, you can mask data using techniques that are impossible or computationally infeasible to reverse. This avoids worrying about protecting decryption keys, as the users will get masked values and have no way to reverse that mask (think of techniques like hashing or, even more simply, making the value “null”).
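A minimal sketch of two such hard-to-reverse masks follows, assuming a keyed hash (HMAC-SHA256) whose key exists only in memory and is never handed to users. The names `_MASK_KEY`, `mask_irreversibly` and `mask_as_null` are hypothetical. The key matters because a plain, unkeyed hash of a low-entropy value such as a Social Security number can be reversed by simply hashing every possible value and comparing.

```python
import hashlib
import hmac
import secrets

# Assumption for this sketch: a random key created at startup, held only in
# memory and never stored or shared, so there is nothing for users to protect.
_MASK_KEY = secrets.token_bytes(32)


def mask_irreversibly(value: str) -> str:
    """Keyed hash: stable within this process, infeasible to reverse without the key."""
    return hmac.new(_MASK_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_as_null(value):
    """The simplest mask of all: drop the value entirely."""
    return None


masked_ssn = mask_irreversibly("123-45-6789")
nulled_ssn = mask_as_null("123-45-6789")
```

Within one run, equal inputs still produce equal masks, so masked columns can be grouped or joined, but the original value cannot be read back.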
Third, you can maintain visibility into the policies that are being enforced at a highly granular level, knowing what policies were applied to which users and when. A central component of the CCPA is transparency — mandating that organizations provide consumers with a clear understanding of what data is held about them and how it’s being used. In order to provide consumers with this information, organizations must have a uniform logging system to understand their own data environment.
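To make the logging point concrete, here is a small Python sketch of a uniform audit record that captures which policy was applied to which user and when. The field names, the in-memory `audit_log` list and the `record_policy_event` helper are assumptions for illustration; a real system would write to durable, tamper-resistant storage.

```python
import json
from datetime import datetime, timezone

# Stand-in for a durable audit store; a plain list keeps the sketch runnable.
audit_log = []


def record_policy_event(user: str, policy: str, columns: list) -> dict:
    """Append one uniform, machine-readable entry per policy enforcement."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "policy": policy,
        "columns": columns,
    }
    audit_log.append(json.dumps(event, sort_keys=True))
    return event


# Hypothetical event: the SSN mask was applied to a query run by "sam".
event = record_policy_event("sam", "mask_ssn_unless_hr", ["ssn"])
```

Keeping every entry in the same shape is what makes it practical to answer questions later about who saw which data under which policy.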
In sum, organizations should encrypt their data on disk as a required security measure. But they must not stop there. In fact, the CCPA is clear that they should go further. By not relying solely on encryption, organizations can improve security, privacy and business insights. Query-time deidentification is a key tool to meet the CCPA’s redaction requirements.
Steve Touw is co-founder and chief technology officer at Immuta.