Foundation model benchmarks

Last updated: Feb 21, 2025

To find the right foundation model for your needs, compare how different foundation models do on relevant performance benchmarks.

Foundation model benchmarks are metrics that test a foundation model's ability to generate accurate or expected output against specific test datasets. Benchmarks cover various capabilities, including whether the model can answer questions about topics that range from elementary mathematics to legal matters and finance, or whether the model can summarize text, generate text in other languages, and more.

Foundation model benchmarks test a foundation model’s ability to generate accurate or expected output for specific tasks. Benchmarks cover various capabilities, including whether the model can answer questions about topics that range from elementary mathematics to legal matters and finance, or whether the model can summarize text, generate text in other languages, and more. Benchmarks are composed of datasets with inputs and expected outputs and metrics that quantify the quality of a model’s responses by measuring factors such as accuracy, harmfulness, and bias.

Look for benchmarks that test the model against the specific tasks that you care about. Reviewing metrics can help you to gauge the capabilities of a foundation model before you try it out.

The following foundation model benchmarks are available in watsonx.ai:

IBM English language understanding benchmarks
Open source English language understanding benchmarks
Open source multilingual language understanding benchmarks

Finding the model benchmark scores

To access the foundation model benchmarks, complete the following steps:

From the watsonx.ai Prompt Lab in chat mode, click the Model field, and then choose View all foundation models.
Click the Model benchmarks tab to see the available benchmarks.

Click the Filter icon to change factors such as the models or benchmark types to show in the comparison view.

The scores range from 0 to 100. Higher scores are better.

Running your own foundation model benchmark evaluations

The Model benchmarks view in watsonx.ai shows benchmark scores from tests that were conducted by IBM. IBM uses a foundation model evaluation framework that is based mostly on the Unitxt library. Unitxt is an open source project that was developed by IBM Research to address the unique foundation model evaluation needs in enterprise use cases. IBM's model evaluation framework also uses another leading open source evaluation framework, called LM evaluation harness. Both of these open source tools can be used to do independent evaluations of foundation models.

Step through this sample notebook, which evaluates the granite-13b-instruct-v2 model against standard benchmarks by using the LM evaluation harness to learn more. See Use lm-evaluation-harness and own benchmarking data with watsonx.ai foundation models.

For more information, see the following resources:

IBM English language understanding benchmarks

The IBM English language understanding benchmarks are benchmarks that are published by IBM based on testing that was done by IBM Research to assess each model's ability to do common tasks.

The following table describes the datasets, goals, and metrics for the IBM benchmarks.

Table 1. IBM English language understanding benchmarks
Benchmark name	Goal	Dataset description	Metric
Summarization	Condenses large amounts of text into a few sentences that capture the main points. Useful for capturing key ideas, decisions, or action items from a long meeting transcript, for example.	Asks the models to summarize text and compares the AI-generated summaries to human-generated summaries from three datasets: • IT dialogs • Technical support dialogs • Social media blogs	Average ROUGE-L score
Retrieval-augmented generation (RAG)	A technique in which a foundation model prompt is augmented with knowledge from external sources. In the retrieval step, relevant documents from an external source are identified from the user’s query. In the generation step, portions of those documents are included in the prompt to generate a response that is grounded in relevant information.	Submits questions based on information from documents in 3 separate datasets	Average ROUGE-L score
Classification	Identifies data as belonging to distinct classes of information. Useful for categorizing information, such as customer feedback, so that you can manage or act on the information more efficiently.	Five datasets with varying content, including contractual content to be classified and content to be assessed for sentiment, emotion, and tone.	Average F1 score
Generation	Generates language in response to instructions and cues that are provided in foundation model prompts.	One dataset with marketing emails	SacreBLEU score
Extraction	Finds key terms or mentions in data based on the semantic meaning of words rather than simple text matches.	Compares entity mentions found by the model to entity mentions found by a human. The datasets include one dataset with 12 named entities and one dataset with three sentiment types.	Average F1 score

Open source English language understanding benchmarks for foundation models

The Open source English language understanding benchmarks show results from testing that is done by IBM Research using mostly English datasets that are published by third-parties, such as academic institutions or industry research teams.

The following table describes the datasets, goals, and metrics for the English language understanding benchmarks.

Table 2. Open source English language understanding benchmarks in watsonx.ai
Benchmark name	Goal	Dataset description	Metric	Related information
20 Newsgroups	Evaluates a model's ability to classify text.	A version of the 20 newsgroups dataset from scikit-learn with almost 20,000 newsgroup documents grouped into 20 categories, including computers, automobiles, sports, medicine, space, and politics.	F1 score	• Dataset card on Hugging Face
Arena-Hard-Auto	Evaluates a model's ability to answer questions.	500 user prompts from live data that is submitted to the crowd-sourcing platform Chatbot Arena.	The metric shows the win rate for model answers.	• Dataset card on Hugging Face • Research paper
AttaQ 500	Evaluates whether a model is susceptible to safety vulnerabilities.	Questions designed to provoke harmful responses in the categories of deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence.	Metric shows the model safety.	• Dataset card on Hugging Face • Research paper
BBQ (Bias benchmark for question answering)	Evaluates a model's ability to recognize statements that contain biased views about people from what are considered protected classes by US English speakers.	Question sets that highlight biases.	The metric measures the accuracy of answers.	• Dataset card on Hugging Face • Research paper
BillSum	Evaluates a model's ability to summarize text.	Dataset that summarizes US Congressional and California state bills.	ROUGE-L score for the generated summary.	• Dataset card on Hugging Face • Research paper
CFPB Complaint Database	Evaluate a model's ability to classify text.	Consumer Financial Protection Bureau (CFPB) complaints from real customers about credit reports, student loans, money transfers, and other financial services.	F1 score	• Dataset card on Unitxt.ai
CLAPnq	Evaluate a model's ability to use information from passages to answer questions.	Long-form question-and-answer pairs.	F1 score	• Dataset card on Hugging Face • Research paper
FinQA	Evaluates a model's ability to answer finance questions and do numerical reasoning.	Over 8,000 QA pairs about finance that are written by financial experts.	The metric measures the accuracy of answers.	• Dataset card on Hugging Face • Research paper
FLORES-101	Evaluates a model's ability to translate text.	English Wikipedia articles that were translated by professional human translators into 101 languages	SacreBLEU score	• Dataset card on Hugging Face • Research paper
HellaSwag	Evaluates a model's ability to do common-sense scenario completion.	Multiple-choice questions that are sourced from ActivityNet and WikiHow.	The metric measures the accuracy of answers.	• Dataset card on Hugging Face • Research paper
LegalBench	Evaluates a model's ability to reason about legal scenarios.	162 tasks that cover various legal texts, structures, and domains.	F1 score	• Dataset card on Hugging Face • Research paper
MMLU-Pro	Evaluate a model's ability to understand challenging tasks.	A more challenging version of the Massive Multitask Language Understanding (MMLU) dataset that has more reasoning-focused questions and increases the answer choices from 4 to 10 options.	The metric measures the accuracy of answers.	• Dataset card on Hugging Face • Research paper
OpenBookQA	Evaluate a model's ability to use multistep reasoning and rich text comprehension to answer multiple-choice questions.	Simulates an open-book exam format to provide supporting passages and multiple choice Q&A pairs.	The metric measures the accuracy of answers.	• Dataset card on Hugging Face • Research paper
TLDR	Evaluates a model's ability to summarize text.	Over 3 M preprocessed posts from Reddit with an average length of 270 words for content and 28 words in the summary.	ROUGE-L score for the generated summary.	• Dataset card on Hugging Face • Research paper
Universal NER	Evaluates a model's ability to recognize named entities.	Includes 19 datasets from various domains, including news and social media. The datasets include named entity annotations and cover 13 diverse languages.	F1 score	• Dataset card on Hugging Face

Open source multilingual language understanding benchmarks for foundation models

The Open source multilingual language understanding benchmarks show results from testing that is done by IBM Research using multilingual datasets that are published by third-parties, such as academic institutions or industry research teams.

The following table describes the datasets, goals, metrics, and target languages for the multilingual benchmarks.

Table 3. Open source multilingual language understanding benchmarks in watsonx.ai
Benchmark name	Goal	Dataset description	Metric	Languages	Related information
Basic English	Evaluates whether a model can translate English sentences into these languages: French, German, Spanish, Portuguese, Japanese, and Korean.	850 key English words and their translations.	The metric shows the string containment score, which measures the word or character distance between the target sentence and the reference translation.	Dataset supports English, French, German, Spanish, Portuguese, Japanese, and Korean. Available in watsonx.ai for models that support Korean.	Ogden's Basic English word list
Belebele	Evaluates a model's multilingual reading-comprehension and question-answering ability.	Questions, related passages, and multiple-choice answers in 122 languages.	The metric measures the accuracy of answers.	Available in watsonx.ai for models that support Arabic, French, German, Japanese, Korean, Portuguese, and Spanish.	Dataset card on Hugging Face
MASSIVE	Evaluates a model's ability to classify multilingual text.	Over 1 M utterances from interactions with Amazon's voice assistant that is localized into 52 languages and annotated with intent and slot type information.	F1 score	Available in watsonx.ai for models that support Arabic, French, German, Japanese, Korean, Portuguese, and Spanish.	Dataset card on Hugging Face
MASSIVE with English prompts	Evaluates a model's ability to classify multilingual text with English labels.	Over 1 M utterances from interactions with Amazon's voice assistant that is localized into 52 languages and annotated with intent and slot type information.	F1 score	Available in watsonx.ai for models that support Arabic and Korean.	Dataset card on Hugging Face
MKQA	Evaluates a model's multilingual question-answering ability.	Includes 10 K question-and-answer pairs for each of 26 languages (totaling 260 K pairs).	F1 score	Available in watsonx.ai for models that support Arabic, French, German, Japanese, Korean, Portuguese, and Spanish.	Dataset card on Hugging Face
MLSUM	Evaluates a model's ability to summarize multilingual text.	Over 1.5 M article-and-summary pairs from online newspapers in 5 languages (French, German, Spanish, Russian, Turkish) and English newspapers from CNN and Daily Mail	ROUGE-L score for the generated summary.	Available in watsonx.ai for models that support French and German.	Dataset card on Hugging Face
XGLUE.qg	Evaluates a model's ability to understand multilingual text and generate insightful questions about the text.	11 tasks that span 19 languages	ROUGE-L score for the generated question.	Available in watsonx.ai for models that support French, German, Portuguese, and Spanish.	Dataset card on Hugging Face
XGLUE.wpr	Evaluates a model's ability to retrieve and rank multilingual text.	11 tasks that span 19 languages.	Normalized Discounted Cumulative Gain (NDCG) score for the information retrieval and ranking.	Available in watsonx.ai for models that support French, German, Portuguese, and Spanish.	Dataset card on Hugging Face
XLSum	Evaluates a model's ability to summarize multilingual text.	1.35 M professionally annotated summaries of BBC news articles in 44 languages.	ROUGE-L score for the generated summary.	Available in watsonx.ai for models that support Arabic, French, Japanese, Korean, Portuguese, and Spanish.	Dataset card on Hugging Face
XMMLU	Evaluates the model's ability to answer multilingual questions about elementary mathematics, US history, computer science, law, and more.	Translations of the Massive Multitask Language Understanding (MMLU) English dataset, which consists of general knowledge multiple-choice questions.	The metric measures the accuracy of answers.	Available in watsonx.ai for models that support Arabic, French, and Korean.
XNLI	Evaluates how well a model can classify multilingual sentences.	Subset of data from the MNLI (Multi-Genre Natural Language Inference) dataset, which includes crowd-sourced sentence pairs that are annotated with textual entailment information and translated into 14 languages.	The metric measures the accuracy of answers.	Available in watsonx.ai for models that support Arabic, French, German, and Spanish.	Dataset card on GitHub
XNLI with English instructions	Evaluates how well a model can classify multilingual sentences when the prompts are in English.	Subset of data from the MNLI (Multi-Genre Natural Language Inference) dataset, which has crowd-sourced sentence pairs that are annotated with textual entailment information, translated into 14 languages	The metric measures the accuracy of answers.	Available in watsonx.ai for models that support Arabic.	Dataset card on GitHub
XWinograd	Evaluates a model's ability to understand context and resolve ambiguity in multilingual text.	Multilingual collection of Winograd schemas, which are pairs of sentences with drastically different meanings due to slight word changes.	The metric measures the accuracy of answers.	Available in watsonx.ai for models that support Portuguese.	Dataset card on Hugging Face

Understanding benchmark metrics

Some metrics are self-explanatory, such as the accuracy score for a model that is tested against multiple-choice datasets. Others are less commonly known. The following list describes the metrics that are used to quantify model performance in watsonx.ai:

F1: Measures whether the optimal balance between precision and recall is reached. Often used to score classification tasks where precision measures how many of the overall sentences are classified as the correct sentence class and recall measures how often sentences that should be classified are classified.
Normalized Discounted Cumulative Gain (NDCG): A ranking quality metric that compares generated rankings to a reference order where the most relevant items are at the top of the ranked list.
Rouge-L: Used to score the quality of summarizations by measuring the similarity between the generated summary and the reference summary. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. The L stands for scoring based on the longest matching sequence of words. This metric looks for in-sequence matches that reflect sentence-level word order.
SacreBLEU: Bilingual Evaluation Understudy (BLEU) is a metric for comparing a generated translation to a reference translation. SacreBLEU is a version that makes the metric easier to use by providing sample test datasets and managing tokenization in a standardized way. Most often used to assess the quality of translation tasks, but can be used to score summarization tasks also.
Safety: A metric used with the AttaQ 500 benchmark that combines the Adjusted Rand Index (ARI) metric, which considers the labels associated with attacks, and the Silhouette Score, which assesses cluster-based characteristics such as cohesion, separation, distortion, and likelihood. For more information, see the research paper Unveiling safety vulnerabilities of large language models.
Win rate: A metric used with the Arena-Hard-Auto benchmark to show the percentage of conversations in which model responses lead to the successful completion of an action. For more information, see the research paper From crowsourced data to high-quality benchmarks: Arena-Hard and Benchbuilder pipelie.

Learn more

Choosing a model

Parent topic: Supported foundation models