Open-source software refers to software with publicly available source code, such that anyone can use, modify and distribute it. Today's internet is powered by open-source technology; developers around the world contribute to writing and testing software, giving businesses high-quality alternatives to costly proprietary solutions.
Open source has always been a unique community within the online sphere, but the latest issue garnering debate revolves around open-source artificial intelligence. The contention centers on how to define open-source AI, and whether open-source AI models can maintain adequate safety and security.
Open-source AI has played a central role in how the AI race has unfolded thus far. One of the most widely discussed examples is Meta's move to open source its model, Llama 3, to grow the open-source AI community and help it avoid lock-in to proprietary models from vendors like Google, OpenAI and Microsoft.
Additionally, there is global regulatory support to incentivize the development and adoption of open-source models. For example, the EU AI Act contains certain exemptions for open-source models. In the U.S., the Federal Trade Commission and the National Telecommunications and Information Administration have both been vocal in their support of open-source AI models due to potential benefits for competition, innovation and security.
The Open Source Initiative's draft definition
The Open Source Initiative is a non-profit organization that serves as the leading authority on defining open source. The OSI espouses 10 principles to characterize a proper open-source license. These establish underlying values like accessibility of the source code, free redistribution and nonrestrictive licensing regimes. There are several types of open-source software licenses, some more permissive than others, but they all comply with these guidelines.
The OSI has been working to expand its approach to include AI, working to define open-source AI since 2022 and most recently releasing the ninth iteration of its draft definition on 22 Aug. This states an open-source AI system must grant end users the freedom to: use the system for any purpose, without asking permission; study how the system works and inspect its components; modify the system for any purpose, including changing its output; and share the system for others to use, with or without modifications, for any purpose.
The definition references three components of an AI system: the code, the data and the weights. Similar to OSS rules, this definition requires the source code used to train and run an AI system to be made available by OSI-approved licenses. With respect to data, the definition does not require AI systems to provide access to raw training data itself, but rather "sufficiently detailed information about the data used to train," which may include information about "training methodologies and techniques, training data sets, information about the provenance of those data sets, their scope and characteristics; how the data was obtained and selected, and the labeling procedures and data cleaning methodologies."
Finally, with respect to weights, OSI's definition calls for model weights and parameters to be made available under OSI-approved terms. However, it is worth noting OSI does not take a stance on whether model parameters require a license and whether they can be legally controlled once disclosed.
Perspectives on the current state of open-source AI
Labeling AI systems as open source is challenging since AI systems differ from traditional OSS; they are driven not just by underlying software code, but also by the nature and content of the data used to train and test them.
Consequently, some maintain that for an AI model to be truly open source, it should allow full access to the model's training data. Open-source data is a different category from open-source models, which typically only allow people to use the system's tunings, settings and specifications. The argument is that, without knowing what data the model was trained on, the context required to understand and use the model effectively is missing.
This sentiment has driven criticism of companies for "open washing": advertising their models as open source without actually satisfying established open-source standards. For example, Meta advertised Llama 2 as open source, but the model did not provide access to information about its training data and imposed a licensing regime inconsistent with OSI principles.
Another example is X's model Grok. While Grok abided by a traditional open-source licensing scheme — using a common open-source license, Apache 2.0 — it also did not release information about the model's training data. OSI previously stated it plans to flag models that falsely claim to be open source and also noted it expected the list of models that satisfied its definition to be from smaller companies or providers.
Many argue that open-source AI is the wrong term to use altogether because the principles of traditional open source simply do not translate to AI systems, where neural net weights are unreadable and frequently cannot be debugged by humans.
Alternatively, they argue "open weights" is a more appropriate label because it only refers to when model weights — the output and representation of what an AI has learned — are made freely available. The FTC seems keen to explore open-weights AI models, and the agency has previously highlighted how these models can promote competition and provide the benefits of an advanced model without the burden of training it from scratch.
Considerations in adopting open-source AI
The open-source movement has historically been recognized for promoting innovation. Unlike a traditional business-to-business context that is fully dependent on a single vendor to resolve any issues, open source facilitates collaboration by bringing more developers to the table.
These benefits also exist in the AI context. The FTC has expressed support for encouraging open-source AI. At a technology startup incubator event in July, Chair Lina Khan noted the FTC's specific interest in exploring open weights for AI models, where smaller players can establish themselves in a market so far dominated primarily by Big Tech.
In addition to definitional and access questions, another potential challenge with adopting open-source AI systems is determining whether they create unreasonable security risks for consumers. In fact, some initial criticism of the EU AI Act was driven by security concerns stemming from the law's exemptions for open-source models.
So far, much of the security concern around open-source models revolves around generative AI and how bad actors may use open-source models for their own destructive purposes. For example, it has been reported that Stable Diffusion, an open-source text-to-image AI system, has been used to build models that generate abusive content.
However, closed-source models are not blameless here either — they are also subject to misuse risks and security threats like jailbreaking and prompt-injection attacks. Moreover, there is research to support that open-source models can be run safely and even promote privacy and security because they run on-device and the models' creators do not have access to user data.
A final challenge is the legal risk associated with managing intellectual property when adopting open-source AI. There are three traditional types of open-source licenses: public domain, permissive and copyleft. Public domain licenses are the least restrictive and typically waive all IP-related rights. Permissive licenses often allow for proprietary use and do not impose an obligation for the derivative work to remain open source. Copyleft licenses are the most restrictive because they require derivative works to also be distributed as open source.
IP risks may arise from blending proprietary code and open-source code. If a business's proprietary code contains open-source code under a copyleft license, then the business may be subjecting itself to legal exposure if it does not make the relevant portions of code available. Therefore, businesses adopting open-source models may need to dedicate resources toward auditing their code to ensure proprietary information stays private, while the open-source portions are made publicly available.
IP questions become even more complex in the AI context because of a lack of consensus around which parts of an AI model can be copyrighted and, therefore, require a license. Traditionally, software and code are copyrightable, while the mathematical concepts that underlie code are not.
Consequently, whether model weights are copyrightable like software code or whether they are more akin to mathematical formulas is a key debate. Model weights are often defined as numerical values determining the strength of connections between units in a neural network. Since these weights are automatically created by the AI system in response to the training process and are not human-authored, some argue they should not receive copyright protection. Others maintain that weights are the product of many creative human decisions, like model design and selection of training data, and therefore, should receive copyright protection.
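The "not human-authored" point can be illustrated with a minimal, hypothetical sketch, unrelated to any real model's training code: a person writes the training loop, but the weight itself is a number the process derives automatically from the data.

```python
# Hypothetical illustration: a "weight" emerges from training,
# rather than being authored line by line like source code.
import numpy as np

# Toy training data following the relationship y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = 0.0            # a single model weight, arbitrary starting value
learning_rate = 0.01

# Gradient descent: each step nudges w to reduce prediction error.
for _ in range(1000):
    prediction = w * x
    gradient = np.mean(2 * (prediction - y) * x)
    w -= learning_rate * gradient

# The trained value of w (close to 2.0 here) is what a model
# release would publish as its "weights" -- a numerical artifact
# of the training process, not human-written code.
print(round(w, 2))
```

The human creativity in this sketch lies in choices like the model form, the learning rate and the training data; the final value of `w` is computed, which is precisely the distinction the copyright debate turns on.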
If weights are not copyrightable, this will likely have a chilling effect on the open-source AI ecosystem. Without adequate IP protection and licensing to support enforcement of how value is attributed, companies may feel disincentivized from offering open-source solutions.
Cost is also a key consideration for businesses choosing to offer or take advantage of open-source models. Generally, open source is considered more cost effective than closed source because licensing fees are not involved. However, it is important to note implementing open source effectively requires expertise, which is an ongoing investment that is not typically required when using proprietary AI models that offer more support. Therefore, a small business may find closed-source models a better choice when it lacks the in-house expertise to effectively administer an open-source model.
Conclusion
Ultimately, a business's decision to offer or adopt open-source vs. closed-source AI models may hinge on its particular constraints and goals. There are a range of factors to consider, including competitive considerations, security risk, IP management and cost.
Achieving consensus on a definition for open-source AI will be a crucial milestone as the OSI continues to revise its draft. It recently began inviting organizations to endorse the definition ahead of a formal release. Initially planned for the end of September, the release has yet to be published.
Blair Robinson is an associate, Vaishali Nambiar, CIPP/US, is a research associate and Brenda Leong, AIGP, CIPP/US, is managing partner at Luminos.Law.