Uncertain data provenance risk for AI

Last updated: Feb 07, 2025

Transparency

Training data risks

Amplified by generative AI

Description

Data provenance refers to tracing the history of data, which includes its ownership, origin, and transformations. Without standardized and established methods for verifying where the data came from, there are no guarantees that the data is the same as the original source or that it carries the correct usage terms.
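As a rough illustration of the description above, the following Python sketch keeps a provenance record alongside a piece of data and checks whether the data still matches what was obtained from the source. The `ProvenanceRecord` class, its field names, the helper functions, and the example URL and license are all assumptions made for illustration. They are not a standard or an established verification method, and a content digest only shows that bytes are unchanged, not that the source itself is trustworthy.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance record; field names are illustrative only."""
    source: str         # where the data originated
    owner: str          # who owns the data
    usage_terms: str    # usage terms attached to the data
    sha256: str         # digest of the data as obtained from the source
    transformations: list[str] = field(default_factory=list)


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of the data."""
    return hashlib.sha256(data).hexdigest()


def matches_source(data: bytes, record: ProvenanceRecord) -> bool:
    """Check whether the data is byte-for-byte what the record describes."""
    return digest(data) == record.sha256


def record_transformation(data: bytes, record: ProvenanceRecord, step: str) -> None:
    """Log a transformation step and update the digest to the transformed data."""
    record.transformations.append(step)
    record.sha256 = digest(data)


# Placeholder data and metadata (assumed values, not real sources).
original = b"Example Training Text"
record = ProvenanceRecord(
    source="https://example.com/dataset",
    owner="Example Org",
    usage_terms="CC-BY-4.0",
    sha256=digest(original),
)

unchanged = matches_source(original, record)              # data is as obtained
tampered = matches_source(b"Altered Training Text", record)  # data was modified

# A documented transformation keeps the history traceable.
lowercased = original.lower()
record_transformation(lowercased, record, "lowercase")
```

In this sketch, an undocumented change to the data makes `matches_source` return `False`, while a change made through `record_transformation` stays traceable in the `transformations` list.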

Why is uncertain data provenance a concern for foundation models?

Not all data sources are trustworthy. Data might be unethically collected, manipulated, or falsified. Verifying data provenance is challenging due to factors such as data volume, data complexity, the variety of data sources, and poor data management. Using data of uncertain provenance can result in undesirable behaviors in the model.

Parent topic: AI risk atlas

We provide examples covered by the press to help explain many of the risks of foundation models. Many of these events are either still evolving or have been resolved, and referencing them can help the reader understand the potential risks and work toward mitigations. These examples are highlighted for illustrative purposes only.
