Launched in November 2022, OpenAI's chatbot, ChatGPT, took the world by storm almost overnight. It brought a new technology term into the mainstream: generative artificial intelligence. Generative AI describes algorithms that can create new content, such as essays, images and videos, from text prompts, as well as autocomplete computer code or analyze sentiment.
Many may not be familiar with the concept of generative AI; however, it is not a new technology. Generative adversarial networks, one type of generative model, were introduced in 2014. Large language models such as GPT-3 and BERT are used to generate text with human-like qualities. Since their introduction, the technology has advanced rapidly, and a variety of generative AI apps have been released over the past few years, including GitHub Copilot, DALL-E 2, Midjourney, Copy.ai, Notion AI and Lensa.
GPT-4, released 14 March 2023, is the most recent and most capable version of the model behind the renowned chatbot, and it is producing even more impressive outputs. According to OpenAI, GPT-4 achieved a substantially better result than its predecessor, GPT-3.5, when tasked with a simulated bar exam, a test used to assess a person's qualification to practice law in the U.S. GPT-4 passed with a score in the top 10% of test takers, while GPT-3.5's score was in the bottom 10%.
Against this backdrop, generative AI is seen by some as the most promising breakthrough in the field of AI. Its global market size is expected to grow to over $200 billion by 2032. At the same time, its rapid development underscores the increasing importance of addressing ethical and privacy concerns regarding the utilization of generative AI technologies in various domains.
With the new possibilities created by generative AI applications comes a heightened awareness of risks. Commonly discussed issues include copyright infringement, inherent bias in generative algorithms, overestimation of AI capabilities that leads to the reception or dissemination of incorrect output (also referred to as AI hallucinations), and the creation of deepfakes and other synthetic content that could manipulate public opinion or pose risks to public safety.
In addition to these critical issues, generative AI poses complex privacy risks to individuals, organizations and society. Recent reports of leaks of sensitive information and chat histories underline the urgent need for robust privacy and security measures in the development and deployment of generative AI technologies.
Globally accepted privacy principles such as data quality, data collection limitation, purpose specification, use limitation, security, transparency, accountability and individual participation apply to all systems processing personal data, including training algorithms and generative AI.
First, generative AI chatbots currently use large language models trained on a mix of data sets, including data scraped from the internet. How this is regulated varies significantly across privacy jurisdictions globally and is subject to a wide range of legal considerations. On one side of the spectrum, U.S. state privacy regulations, such as the California Consumer Privacy Act and California Privacy Rights Act, exclude from their scope information that a business has a reasonable basis to believe is lawfully made available to the general public. Nevertheless, scraping can still create liability risks when a website's terms of use explicitly prohibit data extraction, and privacy problems may arise when the scraped data includes information meant to be password protected. On the other side of the spectrum, under the EU General Data Protection Regulation, such scraping can conflict with website providers' obligation to protect user data and puts individuals at risk.
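For teams building their own data collection pipelines, one practical (though far from sufficient) step is to check a site's machine-readable crawling rules before extracting anything. The following is a minimal sketch, not legal advice and not a description of how any particular model was trained; the bot name and URLs are placeholders.

```python
# Minimal sketch: consult robots.txt before fetching a page.
# robots.txt is only one signal; terms of use and applicable
# privacy law must still be reviewed separately.
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # placeholder site
parser.read()

target = "https://example.com/profiles/jane-doe"  # hypothetical page
if parser.can_fetch("my-research-bot", target):
    print("robots.txt permits fetching", target)
else:
    print("robots.txt disallows fetching", target)
```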
A second issue is transparency, both in the general privacy context and in the particular context of AI systems. There is broad regulatory consensus that information provided to individuals about how their data is collected and processed should be both accessible and sufficiently detailed to empower them in the exercise of their rights. In short, organizations using AI should be able to explain how the system makes decisions to end users who may not be well-versed in the state of the art. The Federal Trade Commission has issued guidelines for data brokers that advise them to be transparent about data collection and usage practices and to implement reasonable security measures to protect consumer data. Neither requirement is easy to live up to when trying to translate and anticipate algorithmic predictions.
A third issue, and perhaps one of the most challenging privacy questions in machine learning, is how to exercise individual data privacy rights. One such important and globally proliferating right is the "right to be forgotten," which allows individuals to request that a company delete their personal information. While removing data from databases is comparatively easy, deleting data from a trained machine learning model is far more difficult, and doing so may undermine the utility of the model itself.
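The asymmetry is easy to see in a toy example. The sketch below, which assumes scikit-learn and pandas and uses entirely invented data, contrasts deleting a record from a data store with the naive remedy for a trained model, which is full retraining on the reduced data set; "machine unlearning" research aims to achieve the same effect without that cost.

```python
# Illustrative sketch only: why deletion is trivial for a database
# but costly for a model trained on the same records.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy "database" of user records used to train a model.
records = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "feature": [0.2, 0.9, 0.4, 0.7],
    "label":   [0, 1, 0, 1],
})

model = LogisticRegression().fit(records[["feature"]], records["label"])

# Honoring a deletion request in the database is a one-liner ...
records = records[records["user_id"] != 3]

# ... but the fitted model still reflects the deleted user's data.
# The naive fix is retraining from scratch on the reduced data set,
# which is exactly what unlearning techniques try to avoid at scale.
retrained = LogisticRegression().fit(records[["feature"]], records["label"])
```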
These issues and examples are only the tip of the iceberg of potential privacy issues around generative AI. As research has shown, private information used to train generative AI algorithms and large language models can surface as chatbot outputs or get recovered in cyberattacks. This should underline the need for organizations and privacy professionals alike to take a hard look at privacy principles and what questions need answering in order to assess applications’ compliance with privacy regulations. Recent FTC decisions ordering the destruction of algorithms trained on unlawfully collected personal information send a very clear message.
At the same time, there are calls to examine privacy-related questions in the broader context of the ethical use of AI. The rapid pace at which AI has advanced has left the public without a clear legal framework to properly address the technology, leaving the task of determining proper AI governance strategies up to individual organizations. While, in time, guidance may become clearer and more consistent through legislation, for example the EU AI Act, there are certain steps organizations can take today to align their responsible use of AI with privacy principles.
Over the past few years, many organizations have begun to define principles for the responsible use of AI, as a report by the IAPP and FTI Consulting on AI governance and privacy showed. These efforts are often driven and coordinated by the internal privacy office. Organizations can extend existing, mature privacy programs and processes designed to assess privacy risks to include broader ethics assessments that prevent biased output based on discriminatory representations.
Given the opportunities and risks that generative AI and the whole sphere of foundation models bring, companies will need to make continued investments in privacy, upskill for algorithmic auditing and integrate "ethics, privacy and security by design" methodologies.
Anti-bot systems and web application firewalls can protect online data against scraping. Nascent research areas such as "machine unlearning" aim to solve the problem of deleting data from a machine learning model to address the privacy rights of individuals. Machine learning methods like reinforcement learning with human feedback can support more accurate model training and improve performance. Moving forward, emerging privacy-enhancing technologies such as differential privacy, scalable methods for cleaning data sets, for example by deduplicating training data, and training data disclosure requirements may also contribute to solutions to many of these open problems.
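To make one of these measures concrete, the sketch below shows exact-match deduplication of a text corpus via hashing. It is a simplified illustration under stated assumptions, not a production pipeline: real cleaning pipelines typically add near-duplicate detection (for example MinHash-based methods), and the function and corpus here are hypothetical.

```python
# Illustrative sketch: remove exact duplicates from a training corpus.
# Deduplication reduces the chance that a repeated record, possibly
# containing personal data, is memorized and later regurgitated.
import hashlib

def dedupe(documents):
    """Return documents with exact duplicates removed, preserving order."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = [
    "Alice lives at 12 Main St.",
    "alice lives at 12 Main St.",   # near-identical repeat of a personal record
    "Weather report for Tuesday.",
]
print(dedupe(corpus))  # the repeated record appears only once
```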
With time, practice, and trial and error, the privacy and engineering communities will continue to work out how to harness the innovative developments of our time and contribute to industry standards surrounding these topics. For the present, it is up to individual organizations and the community of privacy pros to identify and tackle the most important issues associated with the trustworthy implementation of generative AI.