Removing harmful language from model input and output

The AI guardrails feature removes potentially harmful content, such as hate speech, abuse, and profanity, from foundation model input and output.

The AI guardrails feature in the Prompt Lab is powered by AI that applies a classification task to foundation model input and output text. The sentence classifier, which is also referred to as a hate, abuse, and profanity (HAP) detector or HAP filter, was created by fine-tuning a large language model from the Slate family of encoder-only NLP models built by IBM Research.

The classifier breaks the model input and output text into sentences, and then reviews each sentence to find and flag harmful content. The classifier assesses each word, relationships among the words, and the context of the sentence to determine whether a sentence contains harmful language. The classifier then assigns a score that represents the likelihood that inappropriate content is present.
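
The following Python sketch illustrates this flow at a high level. It is not the actual Slate-based classifier: the score_sentence callable is a hypothetical stand-in for the HAP model, and the sentence splitting is simplified.

    # Illustrative sketch of sentence-level HAP filtering, not the actual Slate classifier.
    import re
    from typing import Callable

    def flag_harmful_sentences(
        text: str,
        score_sentence: Callable[[str], float],  # hypothetical HAP scoring function
        threshold: float = 0.5,                  # scores at or above this are flagged
    ) -> list[str]:
        """Split text into sentences and return those whose HAP score reaches the threshold."""
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        return [s for s in sentences if score_sentence(s) >= threshold]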

The AI guardrails feature in the Prompt Lab detects and flags the following types of language:

  • Hate speech: Expressions of hatred toward an individual or group based on attributes such as race, religion, ethnic origin, sexual orientation, disability, or gender. Hate speech shows an intent to hurt, humiliate, or insult the members of a group or to promote violence or social disorder.

  • Abusive language: Rude or hurtful language that is meant to bully, debase, or demean someone or something.

  • Profanity: Toxic words such as expletives, insults, or sexually explicit language.

The AI guardrails feature is supported when you inference natural-language foundation models and can detect harmful content in English text only. AI guardrails are not applicable to programmatic-language foundation models.

Removing harmful language from input and output in Prompt Lab

To remove harmful content when you're working with foundation models in the Prompt Lab, set the AI guardrails switcher to On.

The AI guardrails feature is enabled automatically for all natural-language foundation models in English.

After the feature is enabled, when you click Generate, the filter checks all model input and output text. Inappropriate text is handled in the following ways:

  • Input text that is flagged as inappropriate is not submitted to the foundation model. The following message is displayed instead of the model output:

    [The input was rejected as inappropriate]

  • Model output text that is flagged as inappropriate is replaced with the following message:

    [Potentially harmful text removed]

Configuring AI guardrails

You can control whether the hate, abuse, and profanity (HAP) filter is applied, and you can adjust the sensitivity of the filter independently for the user input and the foundation model output.

To configure AI guardrails, complete the following steps:

  1. With AI guardrails enabled, click the AI guardrails settings icon.

  2. To disable AI guardrails for only the user input or only the foundation model output, set the corresponding HAP slider to 1.

  3. To change the sensitivity of the guardrails, move the HAP sliders.

    The slider value represents the threshold that scores from the HAP classifier must reach for the content to be considered harmful. The score threshold ranges from 0.0 to 1.0.

    A lower value, such as 0.1 or 0.2, is safer because a lower score is enough to trigger the filter, so harmful content is more likely to be caught. However, the filter might also flag content that is safe.

    A value closer to 1, such as 0.8 or 0.9, is riskier because a higher score is required to trigger the filter, so some occurrences of harmful content might be missed. However, content that is flagged is more likely to be genuinely harmful.

    Experiment with adjusting the sliders to find the best settings for your needs. For a worked illustration of how the threshold affects what is flagged, see the example after these steps.

  4. Click Save.
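
The following worked example, with made-up scores, shows how the threshold value changes which sentences are flagged. It assumes, consistent with the description above, that a sentence is flagged when its score reaches or exceeds the threshold.

    # Made-up sentence scores, used only to illustrate the threshold trade-off.
    scores = {"sentence A": 0.15, "sentence B": 0.45, "sentence C": 0.92}

    def flagged(scores: dict[str, float], threshold: float) -> list[str]:
        return [s for s, score in scores.items() if score >= threshold]

    print(flagged(scores, threshold=0.2))  # ['sentence B', 'sentence C'] -> safer, but may over-flag
    print(flagged(scores, threshold=0.8))  # ['sentence C'] -> riskier, harmful content can slip through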

Programmatic alternative

When you prompt a foundation model by using the API, you can use the moderations field to apply filters to foundation model input and output. For more information, see the watsonx.ai API reference. For more information about how to adjust filters with the Python library, see Inferencing a foundation model programmatically.
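
The following sketch shows roughly how the moderations field fits into a text generation request that is sent with the Python requests library. The endpoint path, version date, model ID, and the exact shape of the moderations object are assumptions for illustration; confirm them against the watsonx.ai API reference, and replace the placeholder token and project ID with your own values.

    # Sketch of a text generation request with HAP moderation applied to input and output.
    # Field names and endpoint details are illustrative; verify them in the API reference.
    import requests

    API_URL = "https://us-south.ml.cloud.ibm.com/ml/v1/text/generation?version=2024-05-31"  # assumed endpoint
    IAM_TOKEN = "<your bearer token>"   # placeholder
    PROJECT_ID = "<your project id>"    # placeholder

    payload = {
        "model_id": "ibm/granite-13b-instruct-v2",  # example natural-language model
        "project_id": PROJECT_ID,
        "input": "Summarize the customer feedback below ...",
        "parameters": {"max_new_tokens": 200},
        "moderations": {
            "hap": {
                "input":  {"enabled": True, "threshold": 0.5},   # filter the prompt text
                "output": {"enabled": True, "threshold": 0.5},   # filter the generated text
            },
        },
    }

    response = requests.post(
        API_URL,
        json=payload,
        headers={"Authorization": f"Bearer {IAM_TOKEN}", "Accept": "application/json"},
    )
    print(response.json())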

When you submit inference requests by using the API, you can also apply a PII filter to flag content that might contain personally identifiable information. The PII filter is disabled for inference requests that are submitted from the Prompt Lab.

The PII filter uses a natural language processing AI model to identify and flag mentions of personally identifiable information (PII), such as phone numbers and email addresses. For the full list of entity types that are flagged, see Rule-based extraction for general entities. The filter threshold value is 0.8 and cannot be changed.
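
As a sketch, and assuming the same request shape as the previous example, the PII filter can be requested alongside HAP in the same moderations object. The pii field name and structure are assumptions based on the API reference; because the PII threshold is fixed at 0.8, only the enabled flags are shown.

    # Sketch: enabling the PII filter alongside HAP (field names are assumptions).
    moderations = {
        "hap": {
            "input":  {"enabled": True, "threshold": 0.5},
            "output": {"enabled": True, "threshold": 0.5},
        },
        "pii": {
            "input":  {"enabled": True},   # flag PII in the prompt text
            "output": {"enabled": True},   # flag PII in the generated text
        },
    }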

Parent topic: Prompt Lab
