The world of privacy is underpinned by rules that require enforcement, and today the control of choice is often technology. One category of security technology is meant to specialize in controlling and monitoring data. It is broadly known as DLP, an acronym for "data loss prevention" (it can also be referred to as "data leakage protection" or "data loss protection," depending on who you’re talking to).
This technology is supposed to label data automatically, apply rules and then decide whether to let the data pass or block it. It’s most often found in web and email gateways, two channels that are commonly used to transmit data and are a frequent source of accidental or malicious data breaches.
So, if this is the case, then surely deploying DLP technology is the panacea for most privacy issues and data breaches? Well, it’s not so simple. To understand the issues with this technology, it’s important to first understand how it works.
Issues with DLP today
DLP technology relies heavily on pattern matching to identify certain types of data. For example, a U.S. Social Security number is a nine-digit number separated by hyphens as follows: NNN-NN-NNNN. If you have DLP technology in place and you want to prevent Social Security numbers from leaving, it will seek out and identify any numbers in this format. It will work, but with a major side effect.
Any number in the same format that is not a Social Security number will also get blocked. If you think of invoice numbers, purchase order numbers, tracking numbers, phone numbers and any other string of digits that your company can employ in day-to-day operations, you start to get an idea of the scale of the problem.
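To make the side effect concrete, here is a minimal sketch in Python of what such a rule might look like. The pattern, the function name and the sample messages are all made up for illustration. Real DLP products will differ in the details, but the underlying idea of matching on shape alone is the same.

```python
import re

# Hypothetical DLP-style rule: any nine digits grouped as NNN-NN-NNNN.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(text: str) -> list[str]:
    """Return every substring that has the NNN-NN-NNNN shape."""
    return SSN_PATTERN.findall(text)


# Invented sample messages: only the first one is meant to contain an SSN.
samples = [
    "Employee SSN: 123-45-6789",
    "Please pay invoice 482-19-0034 by Friday",
    "Tracking ref 900-12-3456 has shipped",
]

for message in samples:
    print(f"{message!r} -> {find_ssn_like(message)}")
```

All three messages trigger the rule, because the pattern only knows what the data looks like and not what it means.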
Because of these false positives, DLP technology can quickly become a blocker for the business, as it will start to block legitimate data that isn’t a risk. The usual workaround is to set the tools to "alert," meaning the data is let through but someone is notified that the tool has found what it thinks is a Social Security number. Those alerts need people to review them, and the job often falls to security operations teams, who monitor, filter and tweak the rules until the false-positive rate is acceptably low.
When the type of data being monitored becomes less structured, the problems get even worse. Let’s say you wanted to stop anything labeled "confidential" from leaving the company. You could put a rule in to look for that word, but very quickly you’ll be hit with a deluge of false positives. Any email with the word "confidential" in it will be flagged, even when the sentence is something benign, such as, "We don’t need to worry because this isn’t confidential."
Again, false positives abound, and blocking this automatically will cause chaos within the business, so there needs to be a manual validation of alerts to weed out false positives.
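A keyword rule like the one described above can be sketched in a few lines of Python. The rule and the sample messages are invented for illustration, and the sketch assumes the tool does a simple case-insensitive word search.

```python
import re

# Hypothetical keyword rule: flag any message containing the word "confidential".
KEYWORD_RULE = re.compile(r"\bconfidential\b", re.IGNORECASE)


def is_flagged(message: str) -> bool:
    """Return True if the keyword rule fires on this message."""
    return KEYWORD_RULE.search(message) is not None


# Invented sample messages: only the first one is actually sensitive.
messages = [
    "CONFIDENTIAL: draft merger terms attached",
    "We don't need to worry because this isn't confidential.",
    "Is the lunch menu confidential? Asking for a friend.",
]

for message in messages:
    print(f"flagged={is_flagged(message)}  {message!r}")
```

Every message is flagged, which is why someone has to read each alert before anything can safely be blocked.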
Easy to bypass
Once you understand how pattern matching works, it’s easy to bypass, even accidentally. The aforementioned "confidential" rule will not pick up "C-O-N-F-I-D-E-N-T-I-A-L" because of the hyphens or even "C O N F I D E N T I A L" because of the spaces.
It gets even worse if you start to use encoding, like base64 (a way of encoding binary data as text). The word "confidential" encoded in base64 looks like this: Y29uZmlkZW50aWFs. It will look like random gibberish to most people, but it can be decoded easily with any online base64 decoder and, of course, it will cut through your DLP like butter. Any employee who does a bit of research can base64 encode entire reams of data, exfiltrate it and decode it on the other side, and a rule that only matches the plain text will never pick it up.
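The bypasses described here are easy to reproduce. The sketch below assumes a hypothetical keyword rule that scans only the raw message text. Some tools may normalize or decode content before matching, so treat this as an illustration of the weakness and not a description of any particular product.

```python
import base64
import re

# Hypothetical keyword rule, assumed to scan the raw message text only.
KEYWORD_RULE = re.compile(r"confidential", re.IGNORECASE)


def rule_fires(message: str) -> bool:
    """Return True if the keyword rule matches the raw text."""
    return KEYWORD_RULE.search(message) is not None


word = "confidential"
hyphenated = "-".join(word.upper())  # C-O-N-F-I-D-E-N-T-I-A-L
spaced = " ".join(word.upper())      # C O N F I D E N T I A L
encoded = base64.b64encode(word.encode("utf-8")).decode("ascii")

for label, text in [
    ("plain", word),
    ("hyphenated", hyphenated),
    ("spaced", spaced),
    ("base64", encoded),
]:
    print(f"{label:<11}{text!r:<28}flagged={rule_fires(text)}")

# On the other side, any base64 decoder recovers the original text.
decoded = base64.b64decode(encoded).decode("utf-8")
print("decoded:", decoded)
```

Only the plain version is flagged. The other three carry exactly the same information past the rule.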
So why use it at all?
DLP still has its uses for structured data, that is, data that follows a predefined format. A credit card number is a good example because it follows a very specific format and even has a "checksum" to make sure the number is valid. This means false positives are low (although they never completely disappear). Private encryption keys and things like Amazon Web Services application programming interface keys also follow a very specific format, so rules can be written to detect these kinds of data and apply controls to them. While false positives will still happen, such rules do a good job of detecting key pieces of data that can cause trouble if they go places they shouldn’t.
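Here is a rough sketch of why structured data is easier to detect. The checksum used on payment card numbers is the Luhn algorithm, so a rule can first look for a card-shaped run of digits and then discard anything that fails the check. The key patterns are assumptions based on commonly seen formats (AWS access key IDs starting with "AKIA" followed by 16 uppercase letters or digits, and PEM-style "BEGIN ... PRIVATE KEY" headers). The function names and labels are invented for this example.

```python
import re

# Card-shaped candidates: 13 to 19 digits, optionally split by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
# Assumed formats for illustration (see the note above).
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
PEM_PRIVATE_KEY = re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def scan(text: str) -> list[str]:
    """Return labels for the kinds of structured data found in `text`."""
    findings = []
    if any(luhn_valid(match) for match in CARD_CANDIDATE.findall(text)):
        findings.append("card_number")
    if AWS_KEY_ID.search(text):
        findings.append("aws_access_key_id")
    if PEM_PRIVATE_KEY.search(text):
        findings.append("private_key")
    return findings


print(scan("Card 4111 1111 1111 1111 on file"))   # well-known test card number
print(scan("Order ref 1234 5678 9012 3456"))      # card-shaped, fails the checksum
print(scan("aws_key=AKIAIOSFODNN7EXAMPLE"))       # AWS's documented example key ID
```

The second message has the shape of a card number but fails the checksum, so it is not reported. That extra validation step is what keeps false positives low for this kind of data.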
What about personal data?
A question that often comes up is: How do I detect personal data? Well, with pattern matching, it’s impossible. If you think of how people’s names are structured, the only constant is that they usually don’t contain numbers. Names can be any length, can have any number of words and can even include things like hyphens. In a pattern-matching scenario, you can’t set a rule saying "find me a word followed by another word," because almost any pair of words would match.
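To see why, consider a naive attempt at such a rule in Python. The pattern below is a hypothetical "capitalized word followed by another capitalized word" rule, and the sample phrases are invented.

```python
import re

# Hypothetical "name" rule: two capitalized words separated by a single space.
NAME_RULE = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+")


def looks_like_name(phrase: str) -> bool:
    """Return True if the whole phrase fits the two-capitalized-words shape."""
    return NAME_RULE.fullmatch(phrase) is not None


# Invented sample phrases: a mix of real-looking names and ordinary text.
phrases = [
    "John Smith",        # a name, matched
    "Quarterly Report",  # not a name, matched anyway
    "New York",          # not a person, matched anyway
    "Mary-Jane O'Neil",  # a name, missed (hyphen and apostrophe)
    "José García",       # a name, missed (accented letters)
    "Madonna",           # a name, missed (single word)
]

for phrase in phrases:
    print(f"{phrase!r:<20} matched={looks_like_name(phrase)}")
```

The rule is wrong in both directions: it flags ordinary phrases and misses real names. Loosening the pattern to catch the missed names only makes the false positives worse.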
Just one tool among many
Despite its issues, DLP still has its place in the technology sphere for enforcing privacy, but it shouldn’t be seen as a catch-all solution to data leakage and data breaches in general. If it forms part of a robust set of controls that includes good access control, system updates and efficient monitoring, DLP can become a valuable tool in the constant war to keep our data safe.
Photo by Alexander Schimmeck on Unsplash