This is the eighth in a series of Privacy Tech posts focused on privacy engineering and user experience design from Lea Kissner.
For people who haven’t tackled this, it may be difficult to believe, but deleting data reliably from a large and complicated distributed system is hard. Deleting it within a specified retention time frame is harder. At this point, though, we privacy engineers have spent enough years on this to have worked out the principles.
There can be only one primary data source for every piece of data. Know where it is.
If two different pieces of storage (e.g., files, database tables) claim to be the canonical source for a piece of data, start by disabusing them of this notion. In a distributed system, you should assume that computers will crash at any and all times, including in the middle of writing data. If two data sources each claim to be the primary, either one of them is wrong or you’re going to have a bad, bad day when the system goes haywire. When one of those “canonical” servers crashes in the middle of writing data (or if both try to write contradictory data!), you will not be able to tell which one is right. At that point, you have two different pieces of data both claiming to be right, and you are having a bad, bad day filled with code archeology and the distinct possibility that you won’t be able to reconstruct the correct answer.
Knowing the canonical source of each piece of data is also crucial for constructing your data portability system and responding to data subject access requests, but useful automated portability is a whole different topic.
Every other copy of a piece of data is a secondary datastore. Know where and how stale they are.
Every datastore that isn’t the primary is a secondary datastore. There are two main types of secondary: copies and derived data. Copies of the primary datastore are just that: copies. They usually exist for caching (where some data is copied closer to where it will be used for speed), redundancy (where some data is copied as a backup in case of errors), or manual use (usually debugging or analysis).
Derived data is anything computed from some dataset that isn’t a straightforward copy. For example, machine learning models, as well as the intermediate and final results of a computational pipeline, are all derived data. If this derived data is anonymous (and not just deidentified), then it is no longer user data and may have a different allowed retention period. In almost all cases, though, derived data is not anonymous and retains the same characteristics (e.g., user data) as the primary datastore, though it may not be easy to recognize.
For each secondary datastore, calculate its maximum staleness. For data derived or copied from the primary datastore, this is pretty easy: What is the maximum time lapse before that data is replaced with a new copy? If a cache is replaced with a new copy of the primary every 24 hours, the maximum staleness is 24 hours. For manual copies, the time frame is by default infinite.
Here is where things start getting tricky.
Secondary datastores often have their own secondaries. Perhaps someone is caching a machine learning model. Perhaps a data processing pipeline has multiple stages, each with its own results. Perhaps the two have been mixed somewhere. In these cases, the maximum staleness is the largest of the sums of the maximum staleness along all the computational paths to the datastore.
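To make that path arithmetic concrete, here is a small sketch (the datastore names and refresh intervals are invented) that computes each secondary’s maximum staleness as the largest sum along any derivation path from the primary:

```python
# Hypothetical sketch: each edge is a derivation step with its refresh
# interval in hours. A datastore's maximum staleness is the largest
# sum of intervals along any path from the primary.
derivation_graph = {
    "primary": {"cache": 24, "ml_model": 72},
    "cache": {"dashboard": 24},
    "ml_model": {"dashboard": 48},
    "dashboard": {},
}

def max_staleness(graph, source="primary"):
    """Longest-path staleness from the primary, assuming an acyclic graph."""
    staleness = {source: 0}

    # Simple depth-first walk. On a recombinant (cyclic) graph this
    # would never terminate -- which mirrors the infinite-retention
    # problem described below.
    def visit(node, total):
        for child, interval in graph[node].items():
            candidate = total + interval
            if candidate > staleness.get(child, -1):
                staleness[child] = candidate
                visit(child, candidate)

    visit(source, 0)
    return staleness

print(max_staleness(derivation_graph))
# The dashboard's staleness is max(24 + 24, 72 + 48) = 120 hours.
```

Here the dashboard is reachable by two paths, and the longer one (via the model) is what matters.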
Recombinant datastores, ones that bend back on themselves, can be even trickier. Perhaps data from one of these secondaries is in your logs somewhere, like a log of which suggested cat pictures you offered to a user and which they chose to download. That log now contains data that is from, say, a recommendations datastore. The log’s maximum staleness is, naively, the length of time the log is retained plus the maximum staleness of the datastores whose data flows into it, like the recommendations datastore.
However, if you’re using this log as part of the inputs to the recommendations datastore, you’ve come around in a circle. Putting that data in a log doesn’t make it new. Coming around in a circle generally means, in practical terms, that you need to add anonymization somewhere to avoid infinite retention.
The primary must remain within the appropriate allowed timeframe. Each secondary must also be completely synced to the primary within the appropriate allowed retention time frame.
If your calculated retention timeframes are larger than allowed for the particular types of data in particular datastores, then you have to fix that. But you also have to ensure that your sync is complete. It’s surprisingly easy to think you’ve synced your datastore to the primary without it having been done completely in all possible cases.
For example, if you use a publish-subscribe system to push deletion notifications from one datastore (the publisher) to another (the subscriber), it will usually work quickly. But when things go really sideways, the system could lose deletion notifications. The subscriber might crash in the middle of handling a deletion notification but after having promised to take care of it (software bugs happen). Notifications are stored in a queue in case the subscriber goes offline for a while, but the queue can be overwhelmed and run out of storage in the case of too many published notifications (data centers go down, systems get swamped). Those failures don’t always give the system operator enough information to debug and fix the error, keeping the system from fulfilling its deletion promises.
Another way that data deletion might be accidentally incomplete is in the handling of data items that are edited rather than being fully deleted. Unless there is an explicit document history feature (like in Google Docs), when I change Alice’s phone number in my address book from “555-1212” to “867-5309”, then that’s effectively the same as deleting the old phone number and adding a new one. Edits, in most cases, need to be handled just like deletions with all the retention timeline calculation and system-checking that implies.
The easiest way to be sure that you’ve synced every item of a secondary datastore is to delete that secondary datastore entirely and recreate it. At that point, you can be quite sure that it’s not any more stale than the dataset it was copied from. In the case of processing pipelines or caches, this is by far the easiest approach. It is not always possible or practical, though: If the data is very large compared to the available copying bandwidth, or the computation is too large to redo in one fell swoop, you’ll need to do a per-item sync.
If you need a per-item sync and your datastore has a real, 100%-effective data sync available, use it; data syncing protocols can be a serious pain to debug. If it doesn’t, you’ll need to build your own.
As I describe above, publish-subscribe protocols generally cannot be relied upon to reach 100% effectiveness in all circumstances. They are, however, faster than the 100%-effective methods and thus make a great part of an overall data deletion system.
Algorithms that are 100%-effective rely on one fundamental truth: Something is going to go wrong eventually. So they rely on persistently stored data that can be checked.
For example, one might need to delete all data related to a particular set of users with randomly generated user IDs. If you store those random user IDs, your system can check again and again that all relevant data has been deleted. If there’s a bug, engineers can fix it and the deletion job or sync job can be retried. Or your system can take the opposite approach: Store only the data you wish to keep and delete everything in the secondaries that doesn’t match. For instance, your photo database stores only the photos that have not been deleted; everything else should be deleted from the secondaries.
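Both approaches can be sketched with in-memory sets standing in for real datastores (the IDs here are invented):

```python
# Approach 1: keep a persistent list of user IDs to delete, and
# re-check until the secondary contains none of them.
ids_to_delete = {"user42", "user99"}
secondary = {"user7": "...", "user42": "...", "user99": "..."}

def retry_deletions(secondary, ids_to_delete):
    for user_id in ids_to_delete & secondary.keys():
        del secondary[user_id]  # in real life: a retryable deletion job
    # The IDs stay on the stored list, so after a crash or bug fix
    # this check can simply run again.
    return {uid for uid in ids_to_delete if uid in secondary}

remaining = retry_deletions(secondary, ids_to_delete)
assert not remaining  # an empty set means this pass deleted everything

# Approach 2: store only what should be kept, and delete everything
# else from the secondary.
kept_photos = {"photo1", "photo3"}
secondary_photos = {"photo1", "photo2", "photo3"}
secondary_photos &= kept_photos  # photo2 is swept away
```

The key property in both cases is that the thing being checked is persistently stored, so the check can be repeated after any failure.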
The standard, robust way of performing this sync is to use the mark-and-sweep algorithm. In its simplest form, this means “marking” every data element in the secondary datastore. If an element should not be deleted (i.e., it is not deleted in the primary datastore), then remove the mark. When the un-marking is completed, “sweep” the marked elements away (delete them). This algorithm may over-delete but is very unlikely to under-delete. In the case of error, a secondary datastore can be re-created from the primary. Running this algorithm periodically is a robust way of syncing deletions from a primary datastore.
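A minimal sketch of mark-and-sweep, assuming both datastores are keyed by the same item IDs (a real implementation would batch, checkpoint, and run as a periodic job):

```python
def mark_and_sweep(primary_ids, secondary):
    # Mark every element in the secondary datastore.
    marked = set(secondary.keys())
    # Un-mark elements that still exist in the primary.
    marked -= primary_ids
    # Sweep: anything still marked was deleted from the primary.
    for item_id in marked:
        del secondary[item_id]
    return secondary

primary_ids = {"a", "c"}
secondary = {"a": 1, "b": 2, "c": 3, "d": 4}
mark_and_sweep(primary_ids, secondary)
# secondary now holds only "a" and "c"
```

Note the failure mode is the safe one: a crash or a stale read of the primary can at worst over-delete from the secondary, which can then be rebuilt from the primary.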
No matter which technique you are using, whether periodic deletion or a per-item sync, I highly recommend performing data deletion somewhat more frequently than required. Computers are a part of the messy world in which we live. Computers break. Backhoes take out fiber and leave data centers temporarily isolated. Software goes haywire and has to be debugged. Build slack into your schedule to account for this.
… and if you’re not monitoring this, don’t assume it’s working.
No matter how robust your algorithm is, you cannot assume it’s working. Like everything else you care about in a computer system, you should monitor its performance. You might carefully check the deletion code for correctness, but if your code isn’t actually running because job scheduling isn’t working, then your deletion code is still broken.
Monitoring requires that you check the actual state of the system in a way that is independent (or as independent as possible) of the code you are monitoring. In this case, that means not looking at the deletion code directly. You need to check the datastores directly.
If a particular store is being periodically deleted and re-created, you may perform this monitoring (at least primarily) on the filesystem or database logs. Do you see the datastore being deleted at least as frequently as is required?
If a datastore needs a per-item sync, then you need per-item monitoring. Take as an example our lists of user IDs whose data should be deleted. Do they appear as an owner of data in the datastore? If so, then your data deletion job has a problem somewhere. Building a predictable monitoring job of this sort requires building some understanding of the data (e.g., where the user IDs live) into the monitoring job.
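A sketch of such an independent check follows; the `owner_id` field is an assumption about the schema, and a real monitor would scan the datastore itself rather than an in-memory list:

```python
# Hypothetical monitoring check, independent of the deletion code:
# scan the secondary datastore directly for records owned by users
# whose data should already be gone.
deleted_user_ids = {"user42", "user99"}
datastore_rows = [
    {"owner_id": "user7", "payload": "..."},
    {"owner_id": "user42", "payload": "..."},  # should have been deleted!
]

def find_violations(rows, deleted_user_ids):
    return [row for row in rows if row["owner_id"] in deleted_user_ids]

violations = find_violations(datastore_rows, deleted_user_ids)
if violations:
    # In a real system this would fire an alert, not just print.
    print(f"deletion job has a problem: {len(violations)} stale rows")
```

Because this check reads the datastore directly, it catches the case where the deletion code is correct but simply isn’t running.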
Even better than monitoring for noncompliance is monitoring in advance of noncompliance, so that bugs can be fixed before the system blows its retention window. Plan to have data deleted days before the retention timeline expires; if that hasn’t happened, let the engineers know, with increasing urgency as the deadline approaches, that there may be a problem. These alerts must be carefully tuned; if they fire often when there isn’t a problem, they aren’t helpful and are likely to be ignored.
Humans are really good at accidentally ignoring data that is generally irrelevant.
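One way to sketch that escalation (the thresholds and alert channels here are purely illustrative):

```python
from datetime import timedelta

def urgency(time_remaining):
    """Escalate alerts as the retention deadline approaches."""
    if time_remaining > timedelta(days=7):
        return None        # deletion is still expected to happen; stay quiet
    if time_remaining > timedelta(days=3):
        return "ticket"    # file a bug for the owning team
    if time_remaining > timedelta(days=1):
        return "email"     # nag the on-call engineer
    return "page"          # about to blow the retention window

assert urgency(timedelta(days=10)) is None
assert urgency(timedelta(hours=6)) == "page"
```

The quiet zone at the top is what keeps the alerts from firing when nothing is wrong.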
Predictable monitoring is conservative; it looks for clear violations of the invariants involved and fires an alert. Accidentally retained data is not always so straightforward to track down. In some cases, you need heuristic monitoring, as well.
For example, if the identifiers have been stripped from data but it is not anonymized, you can employ a scalable data scanner to find possible joins between the deidentified data and identified datastores — or even more deeply transformed data. A less-sophisticated but more common scanner uses regular expressions to identify data that “looks like” whatever you’re looking for, such as government IDs, email addresses, IP addresses and phone numbers.
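A toy version of such a regex scanner; these two patterns are illustrative only, and a real scanner needs many more patterns and much more tuning:

```python
import re

# Patterns for data that "looks like" an email address or a
# (short-form) phone number. Both will produce false positives
# and false negatives -- that's inherent to heuristic scanning.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{4}\b"),
}

def scan(text):
    """Return any apparent identifiers found in the text, by type."""
    return {name: pattern.findall(text)
            for name, pattern in PATTERNS.items()
            if pattern.findall(text)}

print(scan("call 867-5309 or write alice@example.com"))
```

Everything such a scanner flags still needs a human (or a smarter job) to decide whether it is actually a problem.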
Heuristic monitoring can find issues that more straightforward monitoring may miss, but this characteristic leads to two further problems: false positives and false negatives. False positives are errors made by the monitoring job where it flags a possible violation that is not, in fact, a problem. These noisy alerts face the same problems described above: They waste time and attention and, if fired often, are counterproductive. The best part of false positives is that they are visible. That means that the engineers have some hope of fixing them.
False negatives are trickier: They are cases where your monitoring does not flag a problem that really exists. Because these errors are silent, you don’t know they’re there and thus cannot feed corrections or fixes into the monitoring algorithm.
The only way to find and correct such errors is to perform very careful manual checks, preferably by privacy engineers with strong skills in statistics.