BLEU evaluation metric
Last updated: Feb 26, 2025

The BLEU (Bilingual Evaluation Understudy) metric measures the similarity between predictions and reference texts by comparing machine-translated sentences with one or more human reference translations.
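To make the comparison concrete, the following is a minimal, self-contained Python sketch of the core BLEU computation: clipped n-gram precisions combined by a geometric mean and scaled by a brevity penalty. It is illustrative only, not the implementation behind this metric.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(prediction, references, max_order=4):
    """Sketch of BLEU for a single prediction; inputs are token lists."""
    precisions = []
    for n in range(1, max_order + 1):
        pred_counts = ngram_counts(prediction, n)
        # Clip each n-gram count at its maximum count across the references.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in pred_counts.items())
        precisions.append(clipped / max(sum(pred_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any order with no matches zeroes the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    # Brevity penalty: penalize predictions shorter than the closest reference.
    c = len(prediction)
    r = min((len(ref) for ref in references), key=lambda rl: (abs(rl - c), rl))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean

pred = "the cat sat on the mat".split()
refs = ["the cat is on the mat".split()]
print(bleu(pred, refs, max_order=2))  # ~0.707: sqrt(5/6 * 3/5), BP = 1
```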

Metric details

BLEU is a generative AI quality evaluation metric that measures how closely the text that a generative AI asset produces matches reference texts.

Scope

The BLEU metric evaluates generative AI assets only.

  • Types of AI assets: Prompt templates
  • Generative AI tasks:
    • Text summarization
    • Content generation
    • Question answering
    • Retrieval augmented generation (RAG)
  • Supported languages: English

Scores and values

The BLEU score indicates how similar the prediction is to the reference translations. Higher scores indicate greater similarity.

  • Range of values: 0.0-1.0
  • Best possible score: 1.0
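For reference, the standard formulation from Papineni et al. (2002) combines modified n-gram precisions with a brevity penalty:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

Here p_n is the modified n-gram precision at order n, N is the maximum n-gram order (the Max order parameter below), w_n are weights that are typically uniform (w_n = 1/N), c is the prediction length, and r is the effective reference length.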

Settings

  • Thresholds:
    • Lower limit: 0.8
    • Upper limit: 1.0
  • Parameters:
    • Max order: Maximum n-gram order to use when computing the BLEU score
    • Smooth: Whether to apply a smoothing function that compensates for n-gram orders with no matches (see the example after this list)
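As a sketch of how parameters with these names typically behave, the open source Hugging Face evaluate implementation of BLEU exposes max_order and smooth arguments that appear to mirror the parameters above; whether this product uses that library internally is an assumption.

```python
import evaluate  # pip install evaluate

bleu = evaluate.load("bleu")

predictions = ["the cat sat on the mat"]
references = [["the cat is on the mat"]]  # one or more references per prediction

# max_order: highest n-gram length scored; smooth: add-one smoothing that
# avoids a zero score when some n-gram order has no matches.
result = bleu.compute(
    predictions=predictions,
    references=references,
    max_order=4,
    smooth=True,
)
print(result["bleu"])        # overall score in the range 0.0-1.0
print(result["precisions"])  # per-order n-gram precisions
```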

Parent topic: Evaluation metrics