We’ve encountered significant confusion about the techniques used to preserve privacy and maintain security, such as masking, encryption and others. Because these techniques are so central to the legal and ethical use of data, we thought we’d put together a short guide to shed light on this critical area.

Note that this is a two-part series. This article is meant as a “101”-level introduction to the basics of anonymization. We’ll follow up with a “201”-level guide in part two.

Let’s dive right in.

What is masking?

Masking is any function applied to raw data in order to hide its true value. It is an incredibly broad term that can describe a wide range of functions, including hashing, encryption and a number of other techniques covered in this post. In fact, the term is so broad that it can cause confusion on its own, so it’s generally helpful to specify the type of masking function rather than simply refer to a function as “masking.” Big picture, though, masking functions should be used when an organization wants data to remain useful but also wants to hide the true value of that data.

What is encryption?

Unlike other forms of masking, encryption is a function that can be reversed with what’s called a “decryption key.” An encryption algorithm, also called a "cipher," is what takes a readable chunk of text and turns it into seemingly random values that are not decipherable to others (at least, not without the decryption key).

In other words, organizations rely on encryption when they want the value of their data to be discoverable to specific users, but not to the entire world. For that reason, encryption is heavily relied upon for data security. Once data is encrypted, it is generally not useful until it is decrypted by someone who holds the decryption key.
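
To make the idea of reversibility concrete, here’s a minimal Python sketch. It is a toy, not a recommendation: it uses a one-time-pad-style XOR cipher built only from the standard library, and the function names (`encrypt`, `decrypt`) are ours, chosen for illustration. It also assumes a symmetric scheme, in which the same secret key both encrypts and decrypts. Real systems should rely on a vetted cryptography library rather than anything like this.

```python
import secrets


def encrypt(plaintext: bytes, key: bytes) -> bytes:
    """XOR each plaintext byte with the matching key byte (toy one-time pad)."""
    if len(key) != len(plaintext):
        raise ValueError("a one-time pad key must be as long as the message")
    return bytes(p ^ k for p, k in zip(plaintext, key))


# XOR is its own inverse, so decryption is the same operation with the same key.
decrypt = encrypt

message = "John Doe".encode("utf-8")
key = secrets.token_bytes(len(message))  # the secret key; a pad must never be reused

ciphertext = encrypt(message, key)    # random-looking bytes
recovered = decrypt(ciphertext, key)  # the original bytes again
```

Without `key`, `ciphertext` is just random-looking bytes. With it, the original value comes straight back, which is exactly the property that separates encryption from one-way techniques.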

What is hashing?

While hashing is an expansive topic in computer science, in data privacy and data security contexts the term refers very narrowly to a family of techniques that employ so-called cryptographic hash functions to hide the true values of the original data. These types of hashing functions mix up the original bits of input into an unrecognizable output value, also called a “hashed value” or a “token.” Hashing functions are called “one-way” functions because, unlike a “reversible” function, such as encryption, they are not meant to be reversed.
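
As a rough illustration, here’s what a cryptographic hash looks like in Python. We’re assuming SHA-256 from the standard `hashlib` module as the hash function, and `hash_value` is simply our own helper name.

```python
import hashlib


def hash_value(value: str) -> str:
    """Return the SHA-256 digest of a string as 64 hexadecimal characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


token_a = hash_value("John Doe")
token_b = hash_value("John Doe")   # same input, so the same token
token_c = hash_value("John Doe ")  # one extra space, so a different token
```

Note that there is no key and no matching “unhash” function: the same input always produces the same token, but nothing in the token is designed to lead back to the input.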

It’s important to note that while there are multiple types of hashing, cryptographic hashing is the most reliable method in the context of data protection, and it should always be combined with something called “salting.”

What is salting then?

Salting is a step used to make hashing functions more secure. While hashing is meant to be a one-way function, there are known attacks called “dictionary attacks” that can, in some cases, recover original values from hashing functions.

Here’s how these attacks work: Because the best hash functions are commonly known, an attacker can take a large series of values and input them into the function, building a “dictionary” that shows what these values look like in hashed form.

For example, an attacker could input the text “John Doe” into a specific hash function and then know exactly what the hashed output for “John Doe” looks like. Once an attacker has built a large enough dictionary, they can look up hashed outputs in it to recover the original raw values. In practice, creating a usable dictionary becomes a game of probabilities and computing power.
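
The attack described above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions: the hash function is unsalted SHA-256, and the candidate list is a tiny made-up sample, where a real attacker would use millions of values.

```python
import hashlib


def hash_value(value: str) -> str:
    """Unsalted SHA-256, standing in for any commonly known hash function."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Step 1: the attacker hashes every value they can think of.
candidate_names = ["Jane Roe", "John Doe", "Richard Roe"]  # hypothetical guesses
dictionary = {hash_value(name): name for name in candidate_names}

# Step 2: the attacker looks up a hashed value found in a data set.
leaked_token = hash_value("John Doe")
recovered = dictionary.get(leaked_token)           # found: "John Doe"
unknown = dictionary.get(hash_value("Mary Major")) # not guessed, so None
```

The hash is never actually reversed here. The attacker only succeeds on values they managed to guess in advance, which is why the size of the dictionary matters so much.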

Salting, however, is an approach used to make this type of attack computationally infeasible. Salting adds long random values (called “the salt”) to the data before it’s fed into the hashing function. With salting, the attacker now needs to know both the salt value and the original value to build the dictionary, which makes it infeasible for an attacker to create a dictionary that would be usable in the real world.
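
Here’s a minimal sketch of that idea in Python. We’re assuming a single 32-byte random salt that is kept secret and stored apart from the data, and the helper names (`salted_hash`, `unsalted_hash`) are ours. In production, a keyed construction such as HMAC (available in Python’s `hmac` module) is a common way to combine a secret with a hash.

```python
import hashlib
import secrets

SALT = secrets.token_bytes(32)  # long random value, kept secret from attackers


def unsalted_hash(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def salted_hash(value: str, salt: bytes) -> str:
    """Prepend the salt to the value before hashing."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


# A dictionary built without the salt, as an attacker would build it.
dictionary = {unsalted_hash(name): name for name in ["Jane Roe", "John Doe"]}

token = salted_hash("John Doe", SALT)
lookup = dictionary.get(token)                    # None: the dictionary is useless
repeatable = salted_hash("John Doe", SALT) == token  # True: same salt, same token
```

Because the same salt always yields the same token for the same input, salted tokens can still be used to join or count records, even though the attacker’s precomputed dictionary no longer matches anything.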

What other types of masking functions are there?

A large number. A few common examples include:

Regex, which searches for specific types of strings and replaces them with certain patterns. A regex function might, for example, look for nine-digit Social Security numbers and replace every digit except the last four with a zero.

Rounding, which rounds numbers up or down. Rounding, for example, is frequently used to group ages into specific “buckets,” such as assigning everyone aged 45 to 54 the age “50” and everyone 55 to 64 the age “60.”

Nulling, which involves replacing the original values with null or placeholder values, such as an empty string or a series of dashes.
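
The three functions above might look something like this in Python. Treat it as a sketch under a few assumptions of ours: Social Security numbers appear in the dashed form 123-45-6789, ages are whole numbers, and the function names are purely illustrative.

```python
import re

# Matches a dashed SSN and captures only the last four digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_ssn(text: str) -> str:
    """Regex masking: zero out every SSN digit except the last four."""
    return SSN_PATTERN.sub(r"000-00-\1", text)


def bucket_age(age: int) -> int:
    """Rounding: map 45-54 to 50, 55-64 to 60, and so on.

    Integer arithmetic is used instead of round(), because Python's
    round(45, -1) rounds half to even and returns 40.
    """
    return ((age + 5) // 10) * 10


def null_value(value: object) -> None:
    """Nulling: discard the original value entirely."""
    return None


masked = mask_ssn("SSN on file: 123-45-6789")
ages = [bucket_age(a) for a in (45, 54, 55, 64)]
```

Each function trades away a different amount of detail: the regex keeps four digits, rounding keeps the decade, and nulling keeps nothing at all.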

If hashing is not reversible, why do some regulators consider hashed identifiers personally identifiable?

While some regulators, such as the U.S. Federal Trade Commission, have publicly stated that hashed identifiers on their own may be considered personally identifiable information, it’s equally important to note that this assessment frequently refers to hashing without salts, which we recommend against. (See, for example, footnote 74 in this FTC report making a similar point.)

But more broadly, it’s important to understand there’s a big difference between the masking and encryption techniques covered here and anonymization itself. Successful anonymization approaches will always need to take context into account, which is why mathematical techniques on their own are rarely sufficient. It’s also why, even after salted hashing, organizations should still protect and monitor the resulting tokenized data. Anonymization is, in short, as much an art as it is a science. The end goal should always be to use mathematical techniques to protect data based on acceptable levels of risk.

We’ll cover how to make that risk determination in the next installment.

And please note, we’d like for this guide to be interactive, so if you think we’re missing an important area, or if you simply have feedback for us, please comment below or reach out to governance@immuta.com.
