Lack of training data transparency risk for AI

Last updated: Dec 12, 2024

Risks associated with input

Training and tuning phase

Transparency

Amplified by generative AI

Description

Without accurate documentation on how a model's data was collected, curated, and used to train a model, it might be harder to satisfactorily explain the behavior of the model with respect to the data.

Why is lack of training data transparency a concern for foundation models?

A lack of data documentation limits the ability to evaluate risks associated with the data. Having access to the training data is not enough. Without recording how the data was cleaned, modified, or generated, the model behavior is more difficult to understand and to fix. Lack of data transparency also impacts model reuse as it is difficult to determine data representativeness for the new use without such documentation.

Background image for risks associated with input

Example

Data and Model Metadata Disclosure

OpenAI‘s technical report is an example of the dichotomy around disclosing data and model metadata. While many model developers see value in enabling transparency for consumers, disclosure poses real safety issues and might increase the ability to misuse the models. In the GPT-4 technical report, the authors state, “Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, data set construction, training method, or similar.”

Sources:

OpenAI, March 2023

Parent topic: AI risk atlas

We provide examples covered by the press to help explain many of the foundation models' risks. Many of these events covered by the press are either still evolving or have been resolved, and referencing them can help the reader understand the potential risks and work towards mitigations. Highlighting these examples are for illustrative purposes only.