To find the right foundation model for your needs, compare how different foundation models do on relevant performance benchmarks.
Foundation model benchmarks are metrics that test a foundation model's ability to generate accurate or expected output against specific test datasets. Benchmarks cover various capabilities, such as whether the model can answer questions about topics that range from elementary mathematics to legal matters and finance, summarize text, generate text in other languages, and more.
Look for benchmarks that test the model against the specific tasks that you care about. Reviewing metrics can help you to gauge the capabilities of a foundation model before you try it out.
The following foundation model benchmarks are available in watsonx.ai:
Finding the model benchmark scores
To access the foundation model benchmarks, complete the following steps:
1. From the watsonx.ai Prompt Lab in chat mode, click the Model field, and then choose View all foundation models.
2. Click the Model benchmarks tab to see the available benchmarks.
Click the Filter icon to change factors such as the models or benchmark types to show in the comparison view.
The scores range from 0 to 100. Higher scores are better.
IBM English language understanding benchmarks
The IBM English language understanding benchmarks are published by IBM and based on testing that was done by IBM Research to assess each model's ability to do common tasks.
The following table describes the datasets, goals, and metrics for the IBM benchmarks.
Benchmark name | Goal | Dataset description | Metric |
---|---|---|---|
Summarization | Condenses large amounts of text into a few sentences that capture the main points. Useful for capturing key ideas, decisions, or action items from a long meeting transcript, for example. | Asks the models to summarize text and compares the AI-generated summaries to human-generated summaries from three datasets: • IT dialogs • Technical support dialogs • Social media blogs | Average ROUGE-L score |
Retrieval-augmented generation (RAG) | A technique in which a foundation model prompt is augmented with knowledge from external sources. In the retrieval step, relevant documents from an external source are identified from the user’s query. In the generation step, portions of those documents are included in the prompt to generate a response that is grounded in relevant information. A minimal sketch of this flow follows the table. | Submits questions based on information from documents in three separate datasets. | Average ROUGE-L score |
Classification | Identifies data as belonging to distinct classes of information. Useful for categorizing information, such as customer feedback, so that you can manage or act on the information more efficiently. | Five datasets with varying content, including contractual content to be classified and content to be assessed for sentiment, emotion, and tone. | Average F1 score |
Generation | Generates language in response to instructions and cues that are provided in foundation model prompts. | One dataset with marketing emails | SacreBLEU score |
Extraction | Finds key terms or mentions in data based on the semantic meaning of words rather than simple text matches. | Compares entity mentions found by the model to entity mentions found by a human. The datasets include one dataset with 12 named entities and one dataset with three sentiment types. | Average F1 score |
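The RAG row describes a two-step flow: retrieve relevant passages, then generate an answer that is grounded in them. The following minimal sketch illustrates that flow with a toy word-overlap retriever and an assumed prompt template; it is not the watsonx.ai API or the benchmark harness.

```python
# Minimal sketch of the retrieval-then-generation flow that the RAG benchmark exercises.
# The retriever here is a toy word-overlap ranker, not a watsonx.ai API.

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Retrieval step: rank documents by word overlap with the query (toy scoring)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Generation step: ground the prompt in the retrieved passages."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer the question by using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

documents = [
    "Benchmark scores range from 0 to 100, and higher scores are better.",
    "ROUGE-L compares a generated summary to a reference summary.",
    "Prompt Lab supports a chat mode.",
]
query = "What is the range of benchmark scores?"
prompt = build_prompt(query, retrieve(query, documents))
print(prompt)  # This prompt would then be sent to a foundation model to generate the answer.
```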
Open source English language understanding benchmarks for foundation models
The open source English language understanding benchmarks show results from testing that is done by IBM Research using mostly English datasets that are published by third parties, such as academic institutions or industry research teams.
The following table describes the datasets, goals, and metrics for the English language understanding benchmarks.
Benchmark name | Goal | Dataset description | Metric | Related information |
---|---|---|---|---|
20 Newsgroups | Evaluates a model's ability to classify text. | A version of the 20 newsgroups dataset from scikit-learn with almost 20,000 newsgroup documents grouped into 20 categories, including computers, automobiles, sports, medicine, space, and politics. | F1 score | • Dataset card on Hugging Face |
Arena-Hard-Auto | Evaluates a model's ability to answer questions. | 500 user prompts from live data that is submitted to the crowd-sourcing platform Chatbot Arena. | The metric shows the win rate for model answers. | • Dataset card on Hugging Face • Research paper |
AttaQ 500 | Evaluates whether a model is susceptible to safety vulnerabilities. | Questions designed to provoke harmful responses in the categories of deception, discrimination, harmful information, substance abuse, sexual content, personally identifiable information (PII), and violence. | The metric shows the model's safety score. | • Dataset card on Hugging Face • Research paper |
BBQ (Bias benchmark for question answering) | Evaluates a model's ability to recognize statements that contain biased views about people from what are considered protected classes by US English speakers. | Question sets that highlight biases. | The metric measures the accuracy of answers. | • Dataset card on Hugging Face • Research paper |
BillSum | Evaluates a model's ability to summarize text. | Dataset that summarizes US Congressional and California state bills. | ROUGE-L score for the generated summary. | • Dataset card on Hugging Face • Research paper |
CFPB Complaint Database | Evaluates a model's ability to classify text. | Consumer Financial Protection Bureau (CFPB) complaints from real customers about credit reports, student loans, money transfers, and other financial services. | F1 score | • Dataset card on Unitxt.ai |
CLAPnq | Evaluates a model's ability to use information from passages to answer questions. | Long-form question-and-answer pairs. | F1 score | • Dataset card on Hugging Face • Research paper |
FinQA | Evaluates a model's ability to answer finance questions and do numerical reasoning. | Over 8,000 QA pairs about finance that are written by financial experts. | The metric measures the accuracy of answers. | • Dataset card on Hugging Face • Research paper |
HellaSwag | Evaluates a model's ability to do common-sense scenario completion. | Multiple-choice questions that are sourced from ActivityNet and WikiHow. | The metric measures the accuracy of answers. | • Dataset card on Hugging Face • Research paper |
LegalBench | Evaluates a model's ability to reason about legal scenarios. | 162 tasks that cover various legal texts, structures, and domains. | F1 score | • Dataset card on Hugging Face • Research paper |
MMLU-Pro | Evaluates a model's ability to understand challenging tasks. | A more challenging version of the Massive Multitask Language Understanding (MMLU) dataset that has more reasoning-focused questions and increases the answer choices from 4 to 10 options. | The metric measures the accuracy of answers. | • Dataset card on Hugging Face • Research paper |
OpenBookQA | Evaluates a model's ability to use multistep reasoning and rich text comprehension to answer multiple-choice questions. | Simulates an open-book exam format to provide supporting passages and multiple-choice Q&A pairs. | The metric measures the accuracy of answers. | • Dataset card on Hugging Face • Research paper |
TLDR | Evaluates a model's ability to summarize text. | Over 3 million preprocessed posts from Reddit with an average length of 270 words for the content and 28 words for the summary. | ROUGE-L score for the generated summary. | • Dataset card on Hugging Face • Research paper |
Universal NER | Evaluates a model's ability to recognize named entities. | Includes 19 datasets from various domains, including news and social media. The datasets include named entity annotations and cover 13 diverse languages. | F1 score | • Dataset card on Hugging Face |
Understanding benchmark metrics
Some metrics are self-explanatory, such as the accuracy score for a model that is tested against multiple-choice datasets. Others are less commonly known. The following list describes the metrics that are used to quantify model performance in watsonx.ai; minimal computation sketches for several of these metrics follow the list:
- F1
- Measures whether the optimal balance between precision and recall is reached. Often used to score classification tasks, where precision measures how many of the items that the model assigns to a class actually belong to that class, and recall measures how many of the items that belong to a class the model correctly assigns to it.
- ROUGE-L
- Used to score the quality of summarizations by measuring the similarity between the generated summary and the reference summary. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. The L stands for scoring based on the longest matching sequence of words. This metric looks for in-sequence matches that reflect sentence-level word order.
- SacreBLEU
- Bilingual Evaluation Understudy (BLEU) is a metric for comparing a generated translation to a reference translation. SacreBLEU is a version that makes the metric easier to use by providing sample test datasets and managing tokenization in a standardized way. Most often used to assess the quality of translation tasks, but can be used to score summarization tasks also.
- Safety
- A metric used with the AttaQ 500 benchmark that combines the Adjusted Rand Index (ARI) metric, which considers the labels associated with attacks, and the Silhouette Score, which assesses cluster-based characteristics such as cohesion, separation, distortion, and likelihood. For more information, see the research paper Unveiling safety vulnerabilities of large language models.
- Win rate
- A metric used with the Arena-Hard-Auto benchmark to show the percentage of head-to-head comparisons in which the model's responses are judged to be better than the responses of a baseline model. For more information, see the research paper From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline.
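F1 is the harmonic mean of precision and recall. The following sketch computes it for a hypothetical two-class labeling task; in practice, a library function such as scikit-learn's f1_score is typically used.

```python
# F1 score: harmonic mean of precision and recall (sketch with made-up labels).
true_labels = ["spam", "spam", "ham", "ham", "spam", "ham"]
pred_labels = ["spam", "ham", "ham", "ham", "spam", "spam"]

tp = sum(1 for t, p in zip(true_labels, pred_labels) if t == p == "spam")
fp = sum(1 for t, p in zip(true_labels, pred_labels) if t == "ham" and p == "spam")
fn = sum(1 for t, p in zip(true_labels, pred_labels) if t == "spam" and p == "ham")

precision = tp / (tp + fp)   # how many items predicted as "spam" really are spam
recall = tp / (tp + fn)      # how many actual spam items were found
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))          # 0.667 for this toy example
```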
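ROUGE-L scores the longest common subsequence (LCS) of words that a generated summary shares with a reference summary. The sketch below computes the F-measure form on assumed toy sentences; real evaluations typically use a ROUGE library.

```python
# ROUGE-L sketch: F-measure over the longest common subsequence (LCS) of words.
def lcs_length(a: list[str], b: list[str]) -> int:
    """Dynamic-programming length of the longest common subsequence."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, word_a in enumerate(a, 1):
        for j, word_b in enumerate(b, 1):
            if word_a == word_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

reference = "the meeting covered budget and hiring decisions".split()
generated = "the meeting covered hiring decisions".split()

lcs = lcs_length(reference, generated)
recall = lcs / len(reference)
precision = lcs / len(generated)
rouge_l = 2 * precision * recall / (precision + recall)
print(round(rouge_l, 3))
```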
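SacreBLEU can be computed with the open source sacrebleu Python package. The strings below are assumed toy examples, not the benchmark dataset.

```python
# SacreBLEU sketch using the open source sacrebleu package (pip install sacrebleu).
import sacrebleu

# Generated outputs and their reference texts (toy examples only).
hypotheses = ["The model generated this marketing email."]
references = [["The model wrote this marketing email."]]  # one reference stream per hypothesis list

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 1))  # score on a 0 to 100 scale
```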
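The safety metric combines clustering-quality measures of model responses to attack prompts. The sketch below only shows the two underlying components, the Adjusted Rand Index and the Silhouette Score, computed with scikit-learn on made-up data; how the components are combined into the published safety value follows the research paper, so the simple average shown here is a hypothetical placeholder.

```python
# Components of the AttaQ 500 safety metric (sketch with made-up data).
# The simple average at the end is only a hypothetical placeholder; the actual
# combination of the two components follows the research paper.
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(12, 8))                   # hypothetical response embeddings
attack_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]    # labeled attack categories
cluster_labels = [0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 0]   # clusters found in the responses

ari = adjusted_rand_score(attack_labels, cluster_labels)   # agreement with attack labels
silhouette = silhouette_score(embeddings, cluster_labels)  # cohesion and separation
print(round((ari + silhouette) / 2, 3))                    # placeholder combination
```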
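Win rate itself is simple arithmetic over pairwise judgments. The judgment labels in the sketch below are made up for illustration; in Arena-Hard-Auto the judgments come from an automatic LLM judge that compares each response to a baseline model's response.

```python
# Win rate sketch: percentage of pairwise comparisons that the evaluated model wins.
# The judgment labels below are made up for illustration.
judgments = ["win", "loss", "win", "tie", "win"]  # one label per benchmark prompt
win_rate = 100 * judgments.count("win") / len(judgments)
print(f"{win_rate:.1f}%")  # 60.0%
```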
Parent topic: Supported foundation models