17 January 2024

Proposed data provenance standards aim to enhance trustworthiness of AI training data

To set a baseline for the quality of the data that will power AI systems as they are adopted across industries, and "to drive adoption of responsible data and AI practices," 19 Fortune 500 companies belonging to the Data and Trust Alliance recently released proposed data provenance standards. The first-of-their-kind standards apply across industries and are meant to help organizations discern the origins of the data in a given dataset, not just for AI systems but for traditional applications as well.

The alliance comprises leading enterprises, such as IBM, Pfizer and Nielsen. The eight proposed data provenance standards are set to govern data lineage by identifying a data entry's source, legal rights and privacy protections, setting a standardized timestamp, discerning between structured and unstructured data, identifying how the data was generated, and listing its intended uses and restrictions.
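As a rough illustration of what such a label could look like in practice, the sketch below models the attributes listed above as a single metadata record. It is not the alliance's published schema: the class name, field names, allowed values and the choice of ISO 8601 in UTC for the "standardized timestamp" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class ProvenanceLabel:
    """Hypothetical provenance record; field names are illustrative only."""
    source: str                      # where the data originated
    legal_rights: str                # e.g. license or contract terms
    privacy_protections: List[str]   # e.g. applicable privacy regimes
    generated_at: datetime           # must be timezone-aware
    data_type: str                   # "structured" or "unstructured"
    generation_method: str           # e.g. "web scraper", "AI model"
    intended_uses: List[str]
    restrictions: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject values outside the two data types named in the article.
        if self.data_type not in ("structured", "unstructured"):
            raise ValueError("data_type must be 'structured' or 'unstructured'")
        # A naive datetime cannot be normalized reliably, so refuse it.
        if self.generated_at.tzinfo is None:
            raise ValueError("generated_at must be timezone-aware")

    def to_dict(self) -> dict:
        """Serialize with the timestamp normalized to UTC, ISO 8601 (assumed format)."""
        return {
            "source": self.source,
            "legal_rights": self.legal_rights,
            "privacy_protections": list(self.privacy_protections),
            "generated_at": self.generated_at.astimezone(timezone.utc).isoformat(),
            "data_type": self.data_type,
            "generation_method": self.generation_method,
            "intended_uses": list(self.intended_uses),
            "restrictions": list(self.restrictions),
        }


example_label = ProvenanceLabel(
    source="public web pages",
    legal_rights="terms of service reviewed",
    privacy_protections=["no personal data expected"],
    generated_at=datetime(2024, 1, 17, tzinfo=timezone.utc),
    data_type="unstructured",
    generation_method="web scraper",
    intended_uses=["model training"],
)
```

Keeping the label as a plain record that serializes to a dictionary means it could travel alongside a dataset as ordinary metadata, which is the role Meehan describes for it below.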

The alliance identified provenance "pain-points" from 25 use cases across its member industries and consolidated 53 standards down to eight, "focusing on business value and feasibility." The standards were co-created by chief data officers, chief information officers, "and leads on data strategy, enterprise data and AI governance, compliance and legal from organizations across healthcare, automotive, IT, media, banking and finance, retail, education and other industries."

The alliance is also looking to grow its community of practice as it shapes "robust, transparent, and adoptable data provenance standards."

Howso General Counsel and Chief Legal Officer Michael Meehan, whose company is a DTA member organization, said the way an organization adheres to the data provenance standards depends on whether it generates data used downstream by other companies' AI systems or processes and stores data for varying purposes.

"If you're a data creator, you're generating datasets, and you're generating these data provenance labels along with the datasets and saying this was generated by my AI model," Meehan said in an interview. "(For) organizations that are taking in data, they will kind of see it more as, 'Here's a blob of data: data science team, go use it.' These clients are mostly going to pay attention to the metadata associated with the data provenance and that's where they're going to start their analysis using the provenance standards."

The concept of data provenance, according to Mastercard Senior Vice President of Data Quality and Sources Travis Carpenter, is ensuring data used to train AI models is "transparent, trustworthy and effective." Mastercard is also a DTA member.

"Data provenance is describing where the data comes from and the key set attributes of that data," said Carpenter. "Provenance is how data enters the overall ecosystem. It's critical that we get these data standards defined upfront, as soon as data enters the ecosystem."

Meehan said the goal of the DTA's provenance standards is identifying the initial source of a given data element, such as a web scraper, and then labeling it using the standards. Then, as the data is repurposed and reused down the AI supply chain, various users will be able to discern the data's origins and how it was used or altered at prior stages of its lifecycle.

"So data brokers are combining stuff from all over the place, (under the standards) they should have multiple labels representing all the way upstream," Meehan said. "When a big tech player needs to buy some data, they can actually look through the data and say, 'Hey, you're not transparent here.' Privacy professionals spend a lot of their time understanding which legal regime covers certain tranches of data and it should become more apparent to them if their brokers are using the standards and labeling the data at each stage of the lifecycle."
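The upstream check Meehan describes can be pictured as a walk over a dataset's lineage. The sketch below is a minimal illustration under assumed names: the `Dataset` class, its `label` and `parents` fields, and the `unlabeled_upstream` helper are hypothetical and are not part of the proposed standards.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    """Hypothetical dataset node; `label` is None when no provenance was supplied."""
    name: str
    label: Optional[dict] = None
    parents: List["Dataset"] = field(default_factory=list)


def unlabeled_upstream(dataset: Dataset) -> List[str]:
    """Return sorted names of the dataset and any upstream sources lacking a label."""
    missing = []
    seen = set()
    stack = [dataset]
    while stack:
        node = stack.pop()
        if id(node) in seen:      # a source can feed several downstream sets
            continue
        seen.add(id(node))
        if node.label is None:
            missing.append(node.name)
        stack.extend(node.parents)
    return sorted(missing)


# A broker bundle built from one labeled and one unlabeled source.
scraped = Dataset("web_scrape", label={"generation_method": "web scraper"})
purchased = Dataset("purchased_list")  # arrived without a label
bundle = Dataset("broker_bundle", label={"source": "broker"}, parents=[scraped, purchased])

gaps = unlabeled_upstream(bundle)  # ["purchased_list"]
```

A buyer running a check like this would see that the bundle itself is labeled but one of its inputs is not, which is the "you're not transparent here" conversation in the quote above.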

One of the eight data provenance standards covers the privacy requirements imposed on a dataset. That responsibility falls directly on privacy pros, who are ultimately accountable for how their company's AI tool processes personal data. Meehan said the data provenance standards will help multinational companies separate and label the data points in their datasets according to the jurisdictional consumer privacy laws that govern how a data subject's personal information can be stored and processed.

"If your company is using the data provenance standard, and you have data coming from multiple products, you often don't know what's actually happening with all that data underneath," Meehan said. "Wouldn't it be nice if all that data was labeled using the data provenance standards for that underlying metadata? If you employ the standards right, this is going to save you a lot of time, a lot of headaches and get you to the answers much more quickly than you can today. Right now, a lot of companies don't even have their data mapped."

Applying data provenance to AI governance best practices

In terms of AI governance, Meehan said implementing the data provenance standards will be an AI developer's best friend prior to the commercial release of a model. Developers who tune their models to the requirements outlined in the standards could keep regulators from taking a sledgehammer approach and mandating the complete deletion of a model in the event its operation violates global data protection laws.

Meehan cited the U.S. Federal Trade Commission's 2021 enforcement action against facial recognition company Everalbum as an example of the legal consequences of failing to meet the responsible data-use conditions the DTA's data provenance standards are designed to support. In its decision, the FTC found Everalbum deceived its users when it trained a facial recognition tool using photos they uploaded onto the company's mobile app, Ever, and ordered the deletion of its algorithms, in part, because the company could not demonstrate the states where it obtained valid user consent.

"Model deletion is devastating to companies, because a lot of times they don't have the training data around, so they can't just necessarily delete (the problematic) portion of the data," Meehan said. "Now, if you start with your training data and have the data provenance standard wrapped around it, then you can do the analysis, and ... can say, 'Well, in this state, we have collected affirmative consent,' and prevent a regulator from ordering wholesale deletion."
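The state-by-state analysis Meehan describes might look something like the sketch below, assuming each training record carries a provenance label that includes where it was collected and what consent was obtained. The record layout and the `state` and `consent` keys are hypothetical; the proposed standards do not prescribe them.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def partition_by_consent(records: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split records into those labeled with affirmative consent and all others."""
    usable, flagged = [], []
    for record in records:
        label = record.get("provenance", {})
        if label.get("consent") == "affirmative":
            usable.append(record)
        else:
            flagged.append(record)  # missing or non-affirmative consent
    return usable, flagged


def consent_counts_by_state(records: List[dict]) -> Dict[str, Dict[str, int]]:
    """Count affirmative vs. other records per state label ('unknown' if absent)."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: {"affirmative": 0, "other": 0})
    for record in records:
        label = record.get("provenance", {})
        state = label.get("state", "unknown")
        key = "affirmative" if label.get("consent") == "affirmative" else "other"
        counts[state][key] += 1
    return dict(counts)


training_records = [
    {"id": 1, "provenance": {"state": "IL", "consent": "affirmative"}},
    {"id": 2, "provenance": {"state": "IL", "consent": "none"}},
    {"id": 3, "provenance": {"state": "TX", "consent": "affirmative"}},
    {"id": 4},  # no provenance label at all
]

usable, flagged = partition_by_consent(training_records)
summary = consent_counts_by_state(training_records)
```

With labels in place, the records lacking documented consent can be isolated and counted per state instead of being indistinguishable from the rest of the training set.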
