Large cloud computing services are generally run for multiple users. In a few cases, all the data processed by the service is public. In virtually all cases, users expect that some of the information about them is kept private. Even if the data store itself is public, logs about access to that data generally are not. Keeping each person’s information separate is simplest in the primary data stores, where each object can easily have its own access control list.

Once we step into the computational pipelines, however, it’s a whole new world.

All but the smallest cloud services can’t effectively run a separate processing pipeline for every single one of their users or customers. It would be an awful lot of pipelines with an awful lot of overhead. Instead, a cloud service will have combined pipelines processing data from multiple sources and producing results for each user/customer.

In order to do this safely, I recommend the "access control list-aware" (ACL-aware) data processing model.

In ACL-aware data processing, when we compute a result for Alice, we use only data that Alice would be allowed to see, such as:

  • Her own data.
  • Data that has been made available to her (shared from other users, or, in an enterprise setting, provided by her employer).
  • Data that is public (publicly available on the internet).
  • Data that is anonymous.
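
At its core, this rule is a filter applied before any per-user computation. Here is a minimal sketch of that idea in Python; the record fields (`owner`, `shared_with`, `visibility`) and helper names are hypothetical, standing in for whatever ACL information a real pipeline would consult.

```python
# Minimal sketch of the ACL-aware rule: before computing anything for a given
# user, drop every record that user would not be allowed to see.
# The record fields used here are illustrative, not from any particular system.

def visible_to(record, user):
    """Return True if `user` is allowed to see `record`."""
    if record.get("visibility") in ("public", "anonymous"):
        return True
    if record.get("owner") == user:
        return True
    return user in record.get("shared_with", ())

def compute_result_for(user, records, compute):
    """Run `compute` for `user`, but only over records the user may see."""
    allowed = [r for r in records if visible_to(r, user)]
    return compute(user, allowed)
```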

Training an anonymous machine-learning model and then querying it with Alice’s data is a common case of ACL-aware data processing, provided that the model is truly anonymous.

Consider friend suggestions on a social network as an example of where ACL-aware data processing is useful. Alice has put some information into the system, perhaps her contacts, perhaps her activity, such as interacting with certain people on the system. Other users have also put in some data. Some of it can be seen by Alice, such as posts shared with her. Other information is public; names and profile pictures, for example, are public on many social networks. Yet other information might be anonymous, like an anonymous machine-learning model that suggests that if one user has interacted with another user four times, they will tend to accept the friend suggestion. If the friend suggestion pipeline is ACL-aware, then only that sort of data can be used to make friend suggestions for Alice.
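
To make that concrete, here is a hedged sketch of how such a pipeline might score candidates for Alice using only data she can see plus the anonymous "four interactions" signal from above. The function names, inputs, and threshold are illustrative assumptions, not a real system's API.

```python
# Hypothetical friend-suggestion scoring that is ACL-aware by construction:
# it uses Alice's own interaction counts plus an anonymous threshold learned
# offline (hard-coded here as 4), and never other users' private data.

ANON_MODEL_THRESHOLD = 4  # anonymous signal, assumed learned from aggregate data

def suggest_friends(interactions, public_profiles):
    """interactions: Alice's own interaction counts, {other_user: count}.
    public_profiles: users whose name/picture is publicly visible."""
    candidates = []
    for other, count in interactions.items():
        if other not in public_profiles:
            continue  # only surface people Alice could see anyway
        if count >= ANON_MODEL_THRESHOLD:
            candidates.append((count, other))
    # highest interaction counts first
    return [user for _, user in sorted(candidates, reverse=True)]
```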

Other users have also put in information that should not be seen by Alice, such as which IP addresses they have used to access the service or their contact information. A friend suggestion pipeline that uses private data like this is likely to leak data between users, sometimes only in subtle ways that may be difficult to test. Difficult-to-test doesn’t necessarily mean impossible-to-attack, sadly. An attacker may create new accounts with carefully crafted contents in order to exploit uncommon corner cases.

As an example, let us take users Eve and Alice. They have both interacted with the service using particular IP addresses. Let us assume that access logs are private, as they generally are, and neither should be able to see the other’s IP address. Both hand-coded friend-suggestion signals and machine-learning models that use IP addresses are likely to conclude that Eve is more likely to accept Alice as a suggested friend if the two log in from the same network subnet, since real-life friends are reasonably likely to share a physical location from time to time.
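
For illustration, the leaky signal might look something like the sketch below. This is the anti-pattern, not a recommendation: the inputs are private access logs, so the feature makes Eve’s suggestions depend on Alice’s private IP history. The function and its inputs are invented for this example.

```python
# Anti-pattern sketch (do NOT do this): a friend-suggestion feature computed
# from private access logs. Because the output depends on Alice's private IP
# history, Eve can probe it by changing her own login location and watching
# her suggestion list.

import ipaddress

def same_subnet_feature(eve_ips, alice_ips, prefix=24):
    """1.0 if Eve and Alice have ever logged in from the same /24, else 0.0."""
    eve_nets = {ipaddress.ip_network(f"{ip}/{prefix}", strict=False) for ip in eve_ips}
    alice_nets = {ipaddress.ip_network(f"{ip}/{prefix}", strict=False) for ip in alice_ips}
    return 1.0 if eve_nets & alice_nets else 0.0
```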

Thus, if Eve logs in from the same network subnet as Alice, she is likely to see Alice appear as a suggested friend or rank more highly in a list of suggested friends. Eve can wander around, log in from different IPs, and watch what happens to her friend suggestion list to try to learn more about Alice’s IP address use. The same might be true of Alice starting to write private posts about puppies or politics; Alice having certain people in her contacts; or basically any form of private information used in these recommendation algorithms.

The failure modes of not using ACL-aware data processing are subtle and usually easier to exploit than to avoid. Sticking to ACL-aware data processing makes your processing pipelines more robust, especially if you add hardening technology to them, such as labeling data with its type and owner and adding checkers that prevent different users’ private data from being mixed.
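
One possible shape for that kind of hardening, sketched here with made-up labels: tag every record with its owner and sensitivity, and have a checker refuse any batch that mixes in private data belonging to anyone other than the user the result is being computed for.

```python
# Sketch of a hardening check: every record carries (owner, sensitivity)
# labels, and a checker rejects batches that mix in other users' private data.
# The label names and record shape are hypothetical.

from dataclasses import dataclass

@dataclass
class LabeledRecord:
    owner: str        # whose data this is ("" for anonymous/aggregate data)
    sensitivity: str  # e.g. "private", "shared", "public", "anonymous"
    payload: dict

def check_batch(target_user, batch):
    """Raise if the batch contains private data of anyone but target_user."""
    for record in batch:
        if record.sensitivity == "private" and record.owner != target_user:
            raise ValueError(
                f"refusing to use {record.owner}'s private data "
                f"in a result for {target_user}"
            )
    return batch
```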
