Moving more and more services to the cloud has been fashionable for a while now, and in recent years the trend has expanded into the information-security market as well, with Security as a Service (SECaaS) offerings becoming ever more prevalent. One example of a SECaaS offering is the group of cloud-based antivirus products: vendors like Immunet and Panda Security have been at the forefront of the trend, and vendors of more traditional antivirus (AV) products are now releasing cloud-based versions as well.
While these products hold the potential to greatly improve the security of information systems, they may do so at the price of privacy, since, if abused, the information gathered by these cloud-based AV products could be used to provide snapshots of the digital content that is both stored and consumed on the user’s computing device. As detailed below, these AV products send a digital “fingerprint” of each file they encounter back to the AV vendor for processing, and while these digital fingerprints are used to recognize malware, they could in theory be used to recognize any set of files. This creates the potential for a digital Panopticon to be put in place, in which cloud AV data could be used to trigger near real-time alerts that a particular user is accessing a particular file without the user even knowing.
From an information security perspective, the case for cloud-based AV products is often very compelling. Cloud-based AV products tend to place a very lightweight agent on PCs and offload most of the heavy processing normally associated with virus scanning to cloud-based resources. This lightweight agent has a smaller memory and processor footprint and, as a result, a much smaller impact on system performance than traditional antivirus solutions. It is argued that this makes antivirus more palatable for end users and can lead to more widespread adoption of AV software. This type of setup also enables AV to run more effectively on mobile devices, where traditional AV software would take too heavy a toll on battery life and performance.
Another advantage of this type of setup is that the greater computational resources of the cloud could allow files to be compared against a larger set of AV signatures than could easily be downloaded and processed by a typical PC running traditional AV software. Moreover, a larger number of virus signatures can be developed, since the cloud setup gives the AV vendor data on a larger number of files and hence exposes it to more suspicious files that could be judged malicious after further analysis.
What is even more compelling for cloud-based antivirus, however, is the centralization of reporting and processing of all of this file and virus data. This allows cloud-based AV vendors to perform real-time threat analytics and to become aware of, and respond to, changes in the malware threat landscape much more rapidly. Under typical circumstances, a few pieces of malware will go on to infect a large number of machines, while a large number will never gain traction beyond a small number of machines. An example of the utility of real-time threat analytics is the ability to rapidly identify these trends in order to provide protection against the largest threats first.
Cloud-based AV products do have some limitations, however, one of the primary ones being that network connectivity is required for these products to operate optimally. While most cloud-based AV solutions do have some signatures cached locally, the efficacy of these products is greatly reduced without access to the cloud-based scanning engines. Vendors of cloud-based AV products argue that devices are much less likely to become infected without a network connection, and some vendors even say their solution can be run without issue alongside traditional AV software if this is a concern. It is also arguable that as mobile computing continues to expand, and ubiquitous computing initiatives like the Internet of Things take off, being unable to connect to a network will become increasingly rare.
A potentially more serious but less discussed downside of cloud-based AV solutions, however, is the privacy implications these systems pose. In order to have a clear understanding of the depth of the potential privacy issues, let’s first establish a basic understanding of how these cloud security solutions operate. Cloud-based security solutions do not transfer a user’s files for analysis unless there is reason to regard a file as suspicious; instead, they base their scanning activities on metadata that is collected from files and passed to the cloud-based scanning engine. Amongst the metadata collected about each file is a cryptographic hash of the file, computed using an algorithm such as SHA-1 or MD5.
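As a rough illustration of the kind of metadata described above, the following Python sketch computes SHA-1 and MD5 digests for a single file. The function name `file_fingerprint` and the fields of the returned record are assumptions made for this example; the metadata that any particular vendor's agent actually collects is not specified here.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path, chunk_size=65536):
    """Return an illustrative metadata record for one file.

    The record layout is hypothetical; it is not any vendor's real format.
    """
    sha1 = hashlib.sha1()
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha1.update(chunk)
            md5.update(chunk)
    return {
        "size_bytes": os.path.getsize(path),
        "sha1": sha1.hexdigest(),
        "md5": md5.hexdigest(),
    }


if __name__ == "__main__":
    # Demonstrate on a throwaway file so the example is self-contained.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello world\n")
    try:
        print(file_fingerprint(tmp.name))
    finally:
        os.remove(tmp.name)
```

Note that only the short digests leave the function, not the file contents, which is what keeps the agent lightweight and is also what makes the privacy question subtle.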
The outputs of these cryptographic hash functions can be thought of as a “fingerprint” that, for practical purposes, uniquely identifies a particular file, since distinct inputs to a well-behaved hash function are overwhelmingly unlikely to produce the same output. As long as the contents of a file remain unchanged, the hash value of the file will also remain unchanged. Even a minor, single-character change to the contents of a file will result in a hash value that is very different from the original. This is one of the reasons that computing file and drive hash values is a requirement of proper digital forensics methods: it helps rule out the possibility that collected data was tampered with or modified in any way. Hash values are a key component of antivirus signatures and are a common means used by AV software packages to detect malicious files, since if the hash of a malicious file is known, it can be concluded that any file with the same hash value is a copy of that same malicious file. Thus, from the technical standpoint of performing signature-based detection in the cloud, sending the hash value as a part of the file metadata makes a lot of sense.
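The two properties described above, sensitivity to small changes and exact-match detection, can be sketched in a few lines of Python. The input strings and the `known_malicious` set are stand-ins invented for illustration; they are not real malware samples or signatures.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the given bytes as a hex string."""
    return hashlib.sha1(data).hexdigest()


def differing_positions(a: str, b: str) -> int:
    """Count hex-digit positions at which two equal-length digests differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Stand-in file contents that differ by a single character.
original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy cog"

digest_original = sha1_hex(original)
digest_modified = sha1_hex(modified)

# Hypothetical signature set: digests of files already judged malicious.
known_malicious = {digest_original}


def is_known_malicious(data: bytes) -> bool:
    """Signature-style check: is this exact content's digest in the set?"""
    return sha1_hex(data) in known_malicious


print(digest_original)
print(digest_modified)
print(differing_positions(digest_original, digest_modified), "of 40 hex digits differ")
print(is_known_malicious(original))  # True
print(is_known_malicious(modified))  # False
```

The last line also hints at a limitation of pure hash matching: a one-character change to a file is enough to evade an exact-match signature.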
The privacy issue arises because these hash values could conceivably also be used to identify things other than malware.
Identification of files by hash values is not unprecedented and has been a commonly used tactic in law enforcement circles for locating and obtaining evidence against individuals suspected of trafficking in child pornography. For example, the U.S. government maintains a database of the hash values of over 700,000 child exploitation images as a part of its digital forensics resources. Similar hash-based tactics have been used to try to identify digital content that might be pirated, and hash values have been used in the anti-piracy efforts of organizations like the RIAA (
The concern for cloud AV in particular is that an individual’s cloud AV provider has insight into the hash values of every file on the user’s system if a full system scan is done. This should prompt anyone who uses a cloud-based AV service to consider what their AV vendor does with this data beyond the lifetime of the malware-detection scan. Is it retained? How long is it retained? Is it retained in conjunction with any identifiable information? How securely is it kept? Are National Security Letters or warrants ever issued for this data?
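To make the concern concrete, the sketch below shows that the lookup used to flag malware works unchanged against any other list of hashes once per-user hash inventories are retained. Everything in it is hypothetical: the user identifiers, the stand-in file contents, and the `users_holding` helper are invented for illustration and do not describe any vendor's actual systems.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of the given bytes as a hex string."""
    return hashlib.sha1(data).hexdigest()


# Hypothetical per-user inventories, as a provider might hold after full scans.
# Each maps a user identifier to the set of file digests reported for that user.
reported_hashes = {
    "user-a": {sha1_hex(b"holiday photo"), sha1_hex(b"leaked report")},
    "user-b": {sha1_hex(b"holiday photo")},
    "user-c": {sha1_hex(b"tax return"), sha1_hex(b"leaked report")},
}

# Any list of digests can serve as a signature set; nothing ties it to malware.
documents_of_interest = {sha1_hex(b"leaked report")}


def users_holding(signature_set, inventories):
    """Return the sorted identifiers of users whose inventory intersects the set."""
    return sorted(
        user
        for user, digests in inventories.items()
        if digests & signature_set
    )


print(users_holding(documents_of_interest, reported_hashes))  # ['user-a', 'user-c']
```

Whether such a query is possible in practice depends on the retention questions raised above: if hashes are discarded after the scan, or stored without identifying information, there is no inventory to search.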
It is important to keep in mind that just as signature sets can be created for malware detection or forensics purposes, they can just as easily be created for any set of files. Access to the hashes of files that a user possesses could thus be used to identify anyone from people who collect images of lolcats to people who are retaining copies of the Snowden documents, depending on the signature set used. As privacy professionals, the key is for us to ensure that the proper privacy measures are in place to allow individuals and organizations to reap the security benefits of SECaaS services, like cloud AV, in a manner that preserves the privacy of those same individuals.