Artificial intelligence applications, such as language translation, voice recognition and text prediction apps, typically require large-scale data sets to train high-performance machine learning models such as deep neural networks. There can be challenges when the data needed to train the model is personal or proprietary. How can ML algorithms be trained on multiple data sets when, potentially, those data sets cannot be shared? With its capability to train algorithms on various data sets without exchanging raw data, federated learning is one of several emerging privacy-enhancing technologies steadily gaining traction. In fact, the global federated learning market is expected to grow at a compound annual growth rate of 10.5% through 2032.
What makes federated learning a trending technology? Introduced by researchers at Google in 2016, and accompanied by a comic explainer, federated learning — also called on-device ML — refers to a form of ML that processes data at its source, allowing users of the technology to gain insights from combined information in decentralized data sets.
Commonly, ML models are trained on a single data set via centralized learning. With centralized learning, all training data is sent to a server that trains and hosts the model. The trained and updated model is then sent back to all participants in the centralized learning network. Since this approach is based on duplicating local data sets to execute training in a central location, the use of centralized ML can increase security and privacy risk.
On the other hand, federated learning enables ML models to operate without moving the personal data needed for training to a central server. The raw data never leaves the device and is never collected in a central location.
Instead, the ML model itself is sent to edge devices — mobile phones, laptops, Internet of Things devices or private servers — to be trained. The model is updated locally on the devices and then sent back to a central server. There, the updates are aggregated and incorporated into a shared global model. In many instances these model parameters (sent back and forth between the central server and the edge device) are encrypted before being exchanged. This collaborative training process continues over multiple iterations until the model is fully trained. It can then be redistributed to the devices, to share the results of the analysis.
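The training loop described above can be sketched in a few lines of Python. This is a minimal illustration of weighted federated averaging, not a production implementation: the one-dimensional linear model, the synthetic data, the learning rate and the helper names (`local_update`, `federated_average`, `make_device_data`) are all assumptions chosen for brevity, and the encryption of exchanged parameters is omitted.

```python
import random

def local_update(weights, data, lr=0.1, epochs=5):
    """Train a tiny linear model y = w*x + b on one device's private data."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return (w, b)

def federated_average(updates, sizes):
    """Server side: average the device models, weighted by local data set size."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)

def make_device_data(rng, n):
    """Hypothetical private data on one device, generated from y = 2x + 1."""
    return [(x, 2 * x + 1) for x in (rng.uniform(-1, 1) for _ in range(n))]

rng = random.Random(0)
devices = [make_device_data(rng, n) for n in (20, 30, 50)]

global_model = (0.0, 0.0)
for _ in range(30):  # communication rounds
    # Each device trains locally; only model parameters leave the device.
    updates = [local_update(global_model, d) for d in devices]
    global_model = federated_average(updates, [len(d) for d in devices])

print(global_model)  # approaches the underlying (2, 1) without pooling raw data
```

Note that the server in this sketch only ever sees `(w, b)` pairs and data set sizes, never the `(x, y)` records themselves.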
There are different flavors of federated learning architectures, and each has a different application.
Horizontal federated learning is used for homogeneous data sets, or those that contain different cases but share a consistent set of features, such as the same types of data points for different customers. Vertical, or heterogeneous, federated learning is used when data sets contain at least some of the same cases but have different feature sets, such as when organizations share some customers with divergent data points on record. Federated transfer learning combines the two approaches, allowing an already-trained model to solve new problems with different feature sets and different cases, but with similar goals.
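To make the distinction concrete, the following sketch classifies pairs of data sets by whether they overlap in cases, in features, or in neither. The organizations, customer names, field names and the `partition_type` helper are hypothetical, and real deployments would need privacy-preserving record matching rather than a plain comparison of identifiers.

```python
# Hypothetical customer records held by three organizations.
bank_a = {"alice": {"income": 52000, "balance": 1200},
          "bob": {"income": 61000, "balance": 300}}
bank_b = {"carol": {"income": 48000, "balance": 950},
          "dave": {"income": 75000, "balance": 4100}}
retailer = {"alice": {"basket_size": 3.2, "visits": 14},
            "carol": {"basket_size": 1.7, "visits": 5},
            "erin": {"basket_size": 2.4, "visits": 9}}

def features(dataset):
    """Return the set of feature names used across all records."""
    return set().union(*(set(record) for record in dataset.values()))

def partition_type(left, right):
    """Roughly classify how two data sets relate to each other."""
    shared_cases = set(left) & set(right)
    shared_features = features(left) & features(right)
    if features(left) == features(right) and not shared_cases:
        return "horizontal"  # same features, different cases
    if shared_cases and not shared_features:
        return "vertical"    # overlapping cases, different features
    if not shared_cases and not shared_features:
        return "transfer"    # different cases and different features
    return "mixed"

print(partition_type(bank_a, bank_b))    # horizontal
print(partition_type(bank_a, retailer))  # vertical
```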
Closely related to federated learning is federated analytics. Like federated learning, this works by running local computations and sharing only aggregated results. Unlike federated learning, however, the computation goal of federated analytics is not training an ML model, but the calculation of metrics or other aggregated statistics on user data.
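As a rough illustration of federated analytics, the sketch below computes a population-wide average from per-device summaries. The usage figures and function names are invented for the example, and the sketch assumes the server may see each device's sum and count; practical systems often add secure aggregation or noise on top.

```python
def local_summary(values):
    """Runs on the device: reduce raw values to a (sum, count) pair."""
    return (sum(values), len(values))

def aggregate_mean(summaries):
    """Runs on the server: combine summaries without seeing raw values."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count

# Hypothetical daily usage minutes recorded on three devices.
device_data = [[30, 45, 60], [20], [50, 70]]

summaries = [local_summary(values) for values in device_data]
mean = aggregate_mean(summaries)
print(summaries, mean)
```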
How does federated learning support realization of privacy principles?
Federated learning was developed to tap into the vast array of data available on personal devices while keeping privacy and security in mind.
In 2016, one of the earliest papers introducing the concept of federated learning, Federated Optimization: Distributed Machine Learning for On-Device Intelligence, referred to a 2012 White House report on the privacy of consumer data, explaining how federated learning architectures structurally support the principle of data minimization. Data minimization is a basic privacy principle found in privacy laws around the world. It ensures only the personal data necessary for a specific legitimate purpose is collected and retained. In other words, only relevant, adequate and necessary personal data should be collected, and kept for as little time as possible. This is also referred to as storage limitation.
In practice, data minimization includes restricting access to data using security mechanisms such as encryption and access-control lists, as well as ensuring policies and procedures are in place to identify and remove superfluous collected data. Federated learning supports this by inherently restricting direct access to the personal data being processed and only using the data necessary for the efficacy of the model. While model updates from the device may still contain private information, they contain less than the raw training data does. Also, the communicated models are not regularly stored on cloud servers, but instantly discarded after being integrated into the global model.
Federated learning also limits the attack surface to just the device, rather than both the device and cloud server. Therefore, federated learning has been referred to as a security control to minimize the risk of data breaches. In a report from December 2021, the European Union Agency for Cybersecurity stressed the capability of federated learning architectures to avoid transferring data and/or entrusting it to an untrusted third party. ENISA concluded federated learning helps to preserve the privacy of data and protects against the disclosure of sensitive data for ML algorithm training.
Examples of real-world applications
With the rise of device manufacturers turning to on-device ML, federated learning is increasingly used in the real world. Among other examples, Google has used federated learning for next-word prediction and emoji suggestions in its keyboard, to power the Now Playing music feature on Pixel phones and to personalize the Smart Reply feature in its Messages app.
Similarly, Apple has been actively developing federated learning solutions to personalize its devices and platforms including Siri, QuickType and “Found In Apps” features.
Hopes are particularly high for the health care, medical technology and pharmaceutical sectors to benefit massively from federated learning, provide new insights and serve society at large through multicentral health care ecosystems. Federated learning is discussed as a solution for digital health opportunities, especially for classification tasks without access to sufficient data, and for smart health care. Some concrete examples include the development of brain tumor and breast cancer classification systems, general oncology and COVID-19 research, and collaboration on drug discovery.
Other applications of federated learning include solutions for money laundering detection, autonomous vehicles, and IoT devices.
In practice, established ML frameworks now also include federated learning capabilities. Google’s TensorFlow and Meta’s PyTorch, the two most popular open source ML frameworks today, both offer federated learning tools. TensorFlow Federated is an extension of the popular ML framework and PyTorch incorporates several add-ons to use in federated learning setups. Other popular products include Flower, which can be used to build a federated version of an existing ML workload; Syft, an interface that makes federated learning projects easier to develop when using frameworks like PyTorch; and OpenFL, a Python-based federated learning framework originally developed by Intel. Many more companies are starting to offer federated learning as a service.
Privacy and security concerns
Despite the many benefits, federated learning is not free of privacy and security challenges. An increasingly popular area of research is attempting to determine these challenges and learn how to mitigate them.
Federated learning models trained with individuals’ personal data, including phone numbers, addresses and credit card numbers, can be the target of inference attacks. Through these attacks, a malicious server can learn whether a particular data point was used during the training process, or can reconstruct sample data points from the distribution of the training data. These are called membership inference attacks and reconstruction attacks, respectively. Some of these privacy attacks can even be orchestrated by other clients participating in the same network.
The training of models can also be “poisoned,” either via data poisoning attacks or model poisoning attacks. Data poisoning alters the training data set, leading to an overall performance reduction of the model due to “bad” input. Model poisoning constitutes a more active and aggressive approach because attackers focus on the model updates, rather than the training data. The goal of these attacks is to get the model to misclassify or misinterpret input data due to poisoned model updates. Influencing the updates can lead to fundamental shifts in the efficacy and function of the model.
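The effect of a poisoned update on a simple average can be seen in a toy example. The numbers below are made up, and the sketch assumes the server uses a plain unweighted mean with no validation of incoming updates, which is a simplification of how real systems aggregate.

```python
def average(updates):
    """Unweighted, coordinate-wise mean of the submitted model updates."""
    dim = len(updates[0])
    return [sum(u[i] for u in updates) / len(updates) for i in range(dim)]

# Hypothetical two-parameter updates from four honest devices.
honest = [[0.9, -0.2], [1.1, -0.1], [1.0, -0.3], [1.0, -0.2]]
# A single attacker submits an extreme, deliberately crafted update.
poisoned = [-50.0, 40.0]

clean_avg = average(honest)
attacked_avg = average(honest + [poisoned])

print(clean_avg)     # close to [1.0, -0.2]
print(attacked_avg)  # dragged far from the honest consensus
```

One malicious participant out of five is enough here to flip the sign of both aggregated parameters, which is why unvalidated averaging is considered fragile.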
Additionally, attackers can target the method used to communicate between the central server and its users. Federated learning requires a multitude of devices to frequently exchange their learned model updates, which introduces communication overhead. This poses a major challenge for federated learning over realistic networks that are limited in computational and communication resources. Attackers can exploit these bottlenecks and force certain devices to disconnect from the learning system, introducing bias into the model. Insecure communication between the central server and its users can also be exploited through man-in-the-middle attacks, in which model updates are intercepted so that sensitive data can be stolen or altered before it is passed along to the central server.
With increased interest in using ML and federated learning, focus on the development of enhanced privacy and security audits and controls is needed. This was stressed by ENISA in its report on securing machine learning algorithms from December 2021, and is supported by leading researchers in the field.
Outlook
Adopting emerging PETs, like federated learning, raises various challenges and questions.
The most straightforward reasons for using federated learning as a PET are enhanced data minimization and data security. Yet federated learning does not directly tackle the problem of anonymizing or deidentifying user data. To address this, federated learning can be combined with differential privacy, an approach prominently presented by Google, Meta and Apple. This way both data minimization and anonymization are achieved, supporting privacy-respecting federated learning systems that can safely learn about consumer behavior and provide services like advertisement personalization.
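One common way to combine the two techniques is to bound each device's update and add random noise before it is shared. The sketch below shows only that clip-and-noise step, with noise added on the device. The function names and parameter values are assumptions for illustration, and the sketch does not calculate the resulting privacy budget, which a real differentially private system must track.

```python
import math
import random

def clip(update, max_norm):
    """Scale the update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]

def privatize(update, max_norm, noise_multiplier, rng):
    """Clip the update, then add Gaussian noise scaled to the clipping bound."""
    clipped = clip(update, max_norm)
    sigma = noise_multiplier * max_norm
    return [x + rng.gauss(0.0, sigma) for x in clipped]

rng = random.Random(42)
raw_update = [3.0, 4.0]  # hypothetical model update with L2 norm 5
noisy_update = privatize(raw_update, max_norm=1.0, noise_multiplier=0.5, rng=rng)
print(noisy_update)
```

Clipping limits how much any single participant can influence the model, and the noise masks what remains; a larger `noise_multiplier` gives stronger protection at the cost of accuracy.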
Similarly, federated learning is increasingly combined with other emerging PETs such as homomorphic encryption, multiparty computation or synthetic data. The growing number of federated learning use cases and the prospect of better privacy and security protection through these technologies have recently been noted by regulators worldwide. At the end of January 2023, the White House announced the first U.S.-EU AI Collaboration. In an interview with Reuters, an unnamed senior U.S. administration official explained that previous agreements were limited to enhancing privacy and said, “The magic here is in building joint models (while) leaving data where it is. The U.S. data stays in the U.S. and European data stays there, but we can build a model that talks to the European and the U.S. data because the more data and the more diverse data, the better the model.”
Against this background and with new, more comprehensive AI regulatory initiatives expected globally, the space is one to watch closely. While PETs are not a silver bullet for privacy protection, they could be crucial puzzle pieces for privacy-compliant data sharing and research, and legal assessment will be fundamental to tapping into their promising potential.
