15 July 2020

Setting data retention timelines

Your user asked you to keep the data. It’s rude to delete someone’s data when they didn’t ask you to and they’re likely to be upset. Let me differentiate here between primary and side-effect data. Primary data is the particular data they asked you to keep, such as their email or baby pictures. Side-effect data is data created as a side effect of interaction with a service, such as logs of those interactions. Side-effect data does not count as something the user asked you to keep unless that request is extremely explicit.
You need the data for security/privacy/anti-abuse. If you’re keeping data only for these purposes, then make sure to segregate it from your other data and restrict access so that it is used only for those purposes. Because abusers will do things like rotate their attack patterns and “age” accounts specifically to get beyond your data retention periods, you may want to be cautious about publishing the exact retention period for this particular data.
Other necessary business purposes. Those break down largely into two categories:
- Data necessary to run a system. There are certain types of data that are necessary to keep a system functional. For example, without debugging logs, your engineers are going to be pretty much out of luck when they try to fix their broken system. Every system breaks. Without data, it is extremely difficult to effectively load-test a complex system, especially when it’s being changed or before it’s in use. Without load-tests, systems tend to break at precisely the worst times: when they’re in heavy use. Engineers also need to analyze their system and its use to keep it working in the future. For example, they have no way to know how many more servers they’ll need next week or next month unless they have a fairly accurate picture over time of the system traffic and load.
- Data used to improve a system. Most users expect a system to improve over time. If you have permission to do so, then you’ll need to keep logs of system interactions over some time frame in order to do this work.

Once data is truly anonymized, then it’s no longer user data. For example, it’s important to know approximately how many requests your servers received at a particular time on a particular day, because that’s part of how you understand whether your product is effective and part of how you plan to buy the appropriate amount of server capacity. Don’t just remove the identifiers and call it good, though.

How long to keep data

The obvious bits: If a user or customer asks you to keep certain data, you should keep it as long as they’ve asked you to and delete it promptly when they ask you to delete it. If you have to keep data for legal reasons, keep it for as long as is required and then delete it. Past there, things get fuzzier.

Here are some guidelines I’ve found helpful.

Five weeks is a good default timeframe for data kept for analysis, system maintenance and debugging. A lot of the analyses listed above need to be able to compare what is happening today to what happened a month ago. Daily statistics have a lot of noise and many systems have swings in usage on a month-over-month basis. Because analysis pipelines take time to run and sometimes fail, five weeks gives time to perform month-over-month analysis even in the face of those failures and while working around holidays: a standard month-over-month analysis may be skewed during a holiday, so you might need to do the analysis for a slightly shifted timeframe.

A more limited set of analyses needs data over a longer timeframe. For those, I would default to 13 months. Thirteen months gives enough data to provide a year-over-year analysis with enough wiggle room to work around holidays and system failures.

There are two major places where these analyses are important. The first is in systems that have strongly year-oriented behavior, where it can be extremely difficult to understand their behavior without comparing it to the behavior of the system from a year ago. In most cases, though, these analyses are not sufficiently valuable to justify keeping data for that long. The second, and the biggest exception, is fighting fraud and abuse. Attackers will specifically take advantage of limited data retention by recycling old attacks periodically: when your abuse-fighting system “forgets” certain kinds of spam, spoofing or other forms of abuse, they’ll crop right up again. Longer retention periods reduce the ability of attackers to take advantage of the system.
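As a rough illustration of the defaults above, here is a minimal Python sketch that maps a retention class to a period and checks whether a record is past it. The class names, the `RETENTION` table and both helper functions are hypothetical, and 13 months is approximated as 396 days (365 + 31); it does not account for the time deletion itself takes.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention classes using the default periods suggested above.
RETENTION = {
    # Month-over-month analysis, maintenance and debugging: five weeks.
    "analysis_and_debugging": timedelta(weeks=5),
    # Year-over-year analysis: 13 months, approximated here as 365 + 31 days.
    "year_over_year": timedelta(days=396),
}


def deletion_due(created_at: datetime, retention_class: str) -> datetime:
    """Return the moment at which a record should be queued for deletion."""
    return created_at + RETENTION[retention_class]


def is_expired(created_at: datetime, retention_class: str, now: datetime) -> bool:
    """True once the record has outlived its retention period."""
    return now >= deletion_due(created_at, retention_class)


if __name__ == "__main__":
    created = datetime(2020, 6, 1, tzinfo=timezone.utc)
    now = datetime(2020, 7, 15, tzinfo=timezone.utc)
    for name in RETENTION:
        print(name, deletion_due(created, name).date(), is_expired(created, name, now))
```

Keeping the periods in one table makes it easier to review them in one place and to give abuse-fighting data its own, separately managed entry.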

A huge factor in setting timelines on data retention is how long your system takes to actually delete data completely, which has to be added to your chosen retention period. Deletion is almost never immediate and, in a complex system, can take a while.

Take into account system factors, such as:

How long do you keep your backups? Data in backups isn’t deleted when you delete it from the live system.
How long do your data centers go down for maintenance? During those times, you won’t be able to delete data.
How long does it take to rebuild your machine learning models? A machine learning model that is built with user data almost certainly isn’t anonymous (making it so generally requires certain bleeding-edge research techniques or pre-anonymizing your data), so it must be rebuilt after that data is deleted.
How long does your deletion pipeline take to run? How often does it fail? You should plan on running that deletion pipeline multiple times, even in a generally stable system, because random failures happen.
How long does it take between asking your storage system to delete and that deletion actually taking place? Depending on what you’re using, this can be far longer than “immediately.” For example:
- If you are using a distributed storage system, then registering a deletion with the system does not mean that every distributed replica is deleted immediately. Consult whoever is running your storage system to figure out how long this takes, even in the face of a flaky network or other errors.
- If you are deleting data off of a hard drive, that doesn’t mean that the data is actually removed from the hard drive, generally. It means that the file system marks the places where that data was stored as available to be overwritten. You have to actually overwrite the data in those areas for it to be reasonably/fully gone. (I’m not going to get into forensic techniques for recovery from different storage media here.) That means you either need to overwrite the data on purpose or figure out how long it takes for normal operation of your system to overwrite that data (with a safety margin built in for things like maintenance periods or holidays).
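To make the arithmetic concrete, the factors above can be folded into a rough upper bound on how long data may actually persist. The sketch below is only one way to model it: the `DeletionFactors` class, its field names and every number in the example are hypothetical, and it simply adds the factors as if they happened one after another, which overestimates the lag whenever steps overlap.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DeletionFactors:
    """Hypothetical per-system delays between 'retention ended' and 'data is gone'."""
    backup_retention: timedelta    # how long backups are kept
    maintenance_outage: timedelta  # longest data center maintenance window
    model_rebuild: timedelta       # time to rebuild ML models without the data
    pipeline_run: timedelta        # one run of the deletion pipeline
    pipeline_attempts: int         # runs budgeted to absorb random failures
    storage_lag: timedelta         # delay until replicas or disk blocks are really gone

    def worst_case_lag(self) -> timedelta:
        # Conservative assumption: every delay happens in sequence.
        return (
            self.backup_retention
            + self.maintenance_outage
            + self.model_rebuild
            + self.pipeline_run * self.pipeline_attempts
            + self.storage_lag
        )


def effective_lifetime(retention: timedelta, factors: DeletionFactors) -> timedelta:
    """Chosen retention period plus the time deletion itself can take."""
    return retention + factors.worst_case_lag()


if __name__ == "__main__":
    # Illustrative numbers only; measure your own system.
    factors = DeletionFactors(
        backup_retention=timedelta(days=30),
        maintenance_outage=timedelta(days=2),
        model_rebuild=timedelta(days=7),
        pipeline_run=timedelta(days=1),
        pipeline_attempts=3,
        storage_lag=timedelta(days=3),
    )
    print(effective_lifetime(timedelta(weeks=5), factors))
```

With these made-up numbers, a five-week retention period turns into as much as 80 days before the data is fully gone, which is the figure a retention commitment would need to reflect.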
Sadly, few storage systems natively support a “time to live” that you can set on data. If you set a TTL of one week, for example, then data you write is deleted after one week. TTL can be extremely helpful both in reducing errors in data deletion, as you don’t need to track down every last piece of data written by every single intermediate stage of your pipeline, and in speeding up parts of your data deletion process, depending on implementation.
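To show what TTL semantics look like, here is a toy in-memory sketch in Python. `TTLStore` and its methods are hypothetical names, not the API of any real storage system, and a production store would also have to deal with replicas, backups and physical overwriting as discussed above.

```python
import time


class TTLStore:
    """Toy key-value store in which every entry expires after a fixed time to live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items = {}     # key -> (expires_at, value)

    def put(self, key, value):
        # The expiry time is fixed at write time; callers never track it themselves.
        self._items[key] = (self._clock() + self._ttl, value)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are never served and are dropped on access.
            del self._items[key]
            return default
        return value

    def purge_expired(self):
        """Remove all expired entries and return how many were removed."""
        now = self._clock()
        expired = [k for k, (expires_at, _) in self._items.items() if now >= expires_at]
        for key in expired:
            del self._items[key]
        return len(expired)


if __name__ == "__main__":
    one_week = 7 * 24 * 60 * 60
    store = TTLStore(ttl_seconds=one_week)
    store.put("request-log-123", {"path": "/search"})
    print(store.get("request-log-123"))
```

Because expiry is attached to each write, intermediate pipeline stages that write through such a store do not need their own cleanup logic, which is the error-reducing property described above.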

Photo by Mika Baumeister on Unsplash
