Third-party foundation models

Last updated: Apr 23, 2025

You can choose from a collection of third-party foundation models in IBM watsonx.ai.

The following models are available in watsonx.ai:

allam-1-13b-instruct
codellama-34b-instruct-hf
deepseek-r1-distill-llama-8b
deepseek-r1-distill-llama-70b
elyza-japanese-llama-2-7b-instruct
eurollm-1-7b-instruct
eurollm-9b-instruct
flan-t5-xl-3b
flan-t5-xxl-11b
flan-ul2-20b
jais-13b-chat
llama-4-maverick-17b-128e-instruct-fp8
llama-4-scout-17b-16e-instruct
llama-3-3-70b-instruct
llama-3-2-1b-instruct
llama-3-2-3b-instruct
llama-3-2-11b-vision-instruct
llama-3-2-90b-vision-instruct
llama-guard-3-11b-vision
llama-3-1-8b
llama-3-1-8b-instruct
llama-3-1-70b
llama-3-1-70b-instruct
llama-3-405b-instruct
llama-3-8b-instruct
llama-3-70b-instruct
llama-2-13b-chat
llama-2-70b-chat
mistral-large
mistral-large-instruct-2407
mistral-large-instruct-2411
mistral-nemo-instruct-2407
mistral-small-24b-instruct-2501
mixtral-8x7b-base
mixtral-8x7b-instruct-v01
mt0-xxl-13b
pixtral-12b

To learn more about the various ways that these models can be deployed, and to see a summary of pricing and context window length information for the models, see Supported foundation models.

For details about IBM foundation models, see IBM foundation models.

How to choose a model

To review factors that can help you to choose a model, such as supported tasks and languages, see Choosing a model and Foundation model benchmarks.

To view model lifecycle updates, see Foundation model lifecycle.

Foundation model details

The foundation models in watsonx.ai support a range of use cases for both natural languages and programming languages. To see the types of tasks that these models can do, review and try the sample prompts.

allam-1-13b-instruct

The allam-1-13b-instruct foundation model is a bilingual large language model for Arabic and English provided by the National Center for Artificial Intelligence and supported by the Saudi Authority for Data and Artificial Intelligence that is fine-tuned to support conversational tasks. The ALLaM series is a collection of powerful language models designed to advance Arabic language technology. These models are initialized with Llama-2 weights and undergo training on both Arabic and English languages.

Note:

When you inference this model from the Prompt Lab, disable AI guardrails.

Usage

Supports Q&A, summarization, classification, generation, extraction, and translation in Arabic.

Size

13 billion parameters

API pricing tier

Class 2. For pricing details, see Table 3.

Availability

Provided by IBM deployed on multitenant hardware in the Frankfurt data center.

Deploy on demand for dedicated use except in the Frankfurt data center.

Try it out

Experiment with samples:

Token limits

Context window length (input + output): 4,096

Supported natural languages

Arabic (Modern Standard Arabic) and English

Instruction tuning information

allam-1-13b-instruct is based on the Allam-13b-base model, which is a foundation model that is pre-trained on a total of 3 trillion tokens in English and Arabic, including the tokens seen from its initialization. The Arabic dataset contains 500 billion tokens after cleaning and deduplication. The additional data is collected from open source collections and web crawls. The allam-1-13b-instruct foundation model is fine-tuned with a curated set of 4 million Arabic and 6 million English prompt-and-response pairs.

Model architecture

Decoder-only

License

Llama 2 community license and ALLaM license

Learn more

Read the following resource:

Model card

codellama-34b-instruct-hf

A programmatic code generation model that is based on Llama 2 from Meta. Code Llama is fine-tuned for generating and discussing code.

Usage: Use Code Llama to create prompts that generate code based on natural language inputs, explain code, or that complete and debug code.

Note:

When you inference this model from the Prompt Lab, disable AI guardrails.

Size

34 billion parameters

API pricing tier

Class 2. For pricing details, see Table 3.

Availability

Deploy on demand for dedicated use.

Try it out

Experiment with samples:

Token limits

Context window length (input + output): 16,384

Note: The maximum new tokens, which means the tokens that are generated by the foundation model per request, is limited to 8,192.

Supported natural languages

English

Supported programming languages

The codellama-34b-instruct-hf foundation model supports many programming languages, including Python, C++, Java, PHP, Typescript (Javascript), C#, Bash, and more.

Instruction tuning information

The instruction fine-tuned version was fed natural language instruction input and the expected output to guide the model to generate helpful and safe answers in natural language.

Model architecture

Decoder

License

License

Learn more

Read the following resources:

DeepSeek-R1 distilled models

The distilled variants of the DeepSeek-R1 models based on the Llama 3.1 models are provided by DeepSeek AI. The DeepSeek-R1 models are open-sourced models with powerful reasoning capabilities. The data samples generated by the DeepSeek R1 model are used to fine tune a base Llama model.

The deepseek-r1-distill-llama-8b and deepseek-r1-distill-llama-70b models are distilled versions of the DeepSeek-R1 model based on the Llama 3.1 8B and the Llama 3.3 70B models respectively.

Usage

General use with zero- or few-shot prompts and are designed to excel in instruction-following tasks such as summarization, classification, reasoning, code tasks, as well as math.

Available sizes

8 billion parameters
70 billion parameters

API pricing tier

8b: Small

70: Large

For pricing details, see Table 5.

Availability

Deploy on demand for dedicated use.

Try it out

Experiment with samples:

Token limits

8b and 70b: Context window length (input + output): 131,072

Note: The maximum new tokens, which means the tokens generated by the foundation model per request, is limited to 32,768.

Supported natural languages

English

Instruction tuning information

The DeepSeek-R1 models are trained by using large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step. The subsequent RL and SFT stages aim to improve reasoning patterns and align the model with human preferences. DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1.

Model architecture

Decoder

License

8b: License

70b: License

Learn more

Read the following resources:

elyza-japanese-llama-2-7b-instruct

The elyza-japanese-llama-2-7b-instruct model is provided by ELYZA, Inc on Hugging Face. The elyza-japanese-llama-2-7b-instruct foundation model is a version of the Llama 2 model from Meta that is trained to understand and generate Japanese text. The model is fine-tuned for solving various tasks that follow user instructions and for participating in a dialog.

Usage

General use with zero- or few-shot prompts. Works well for classification and extraction in Japanese and for translation between English and Japanese. Performs best when prompted in Japanese.

Size

7 billion parameters

API pricing tier

Class 2. For pricing details, see Table 3.

Availability

Provided by IBM deployed on multitenant hardware in the Tokyo data center.

Try it out

Experiment with samples:

Token limits

Context window length (input + output): 4,096

Supported natural languages

Japanese, English

Instruction tuning information

For Japanese language training, Japanese text from many sources were used, including Wikipedia and the Open Super-large Crawled ALMAnaCH coRpus (a multilingual corpus that is generated by classifying and filtering language in the Common Crawl corpus). The model was fine-tuned on a dataset that was created by ELYZA. The ELYZA Tasks 100 dataset contains 100 diverse and complex tasks that were created manually and evaluated by humans. The ELYZA Tasks 100 dataset is publicly available from HuggingFace.

Model architecture

Decoder

License

License

Learn more

Read the following resources:

EuroLLM Instruct

The EuroLLM series of models is developed by the Unified Transcription and Translation for Extended Reality (UTTER) Project and the European Union. The EuroLLM Instruct models are open-source models specialized in understanding and generating text across all the 24 official European Union (EU) languages, as well as 11 commercially and strategically important international languages.

Usage

Suited for multingual language tasks like general instructon-following and language translation.

Sizes

1.7 billion parameters
9 billion parameters

API pricing tier

1.7b: Small

9b: Small

For pricing details, see Table 5.

Availability

Deploy on demand for dedicated use.

Token limits

1.7b and 9b: Context window length (input + output): 4,096

Supported natural languages

Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian.

Instruction tuning information

The models are trained on 4 trillion tokens across the supported natural languages from web data, parallel data, Wikipedia, Arxiv, multiple books, and Apollo datasets.

Model architecture

Decoder

License

Apache 2.0 license

Learn more

Read the following resources:

flan-t5-xl-3b

The flan-t5-xl-3b model is provided by Google on Hugging Face. The model is based on the pretrained text-to-text transfer transformer (T5) model and uses instruction fine-tuning methods to achieve better zero- and few-shot performance. The model is also fine-tuned with chain-of-thought data to improve its ability to perform reasoning tasks.

Note:

This foundation model can be tuned by using the Tuning Studio.

Usage

General use with zero- or few-shot prompts.

Size

3 billion parameters

API pricing tier

Class 1. For pricing details, see Table 3 and Table 5.

Availability

Provided by IBM deployed on multitenant hardware.
Deploy on demand for dedicated use.

Try it out

Sample prompts

Token limits

Context window length (input + output): 4,096

Supported natural languages

Multilingual

Instruction tuning information

The model was fine-tuned on tasks that involve multiple-step reasoning from chain-of-thought data in addition to traditional natural language processing tasks. Details about the training datasets used are published.

Model architecture

Encoder-decoder

License

Apache 2.0 license

Learn more

Read the following resources:

flan-t5-xxl-11b

The flan-t5-xxl-11b model is provided by Google on Hugging Face. This model is based on the pretrained text-to-text transfer transformer (T5) model and uses instruction fine-tuning methods to achieve better zero- and few-shot performance. The model is also fine-tuned with chain-of-thought data to improve its ability to perform reasoning tasks.

Usage

General use with zero- or few-shot prompts.

Size

11 billion parameters

API pricing tier

Class 2. For pricing details, see Table 3 and Table 5.

Availability

Provided by IBM deployed on multitenant hardware.
Deploy on demand for dedicated use.

Try it out

Experiment with samples:

Token limits

Context window length (input + output): 4,096

Supported natural languages

English, German, French

Instruction tuning information

Model architecture

Encoder-decoder

License

Apache 2.0 license

Learn more

Read the following resources:

flan-ul2-20b

The flan-ul2-20b model is provided by Google on Hugging Face. This model was trained by using the Unifying Language Learning Paradigms (UL2). The model is optimized for language generation, language understanding, text classification, question answering, common sense reasoning, long text reasoning, structured-knowledge grounding, and information retrieval, in-context learning, zero-shot prompting, and one-shot prompting.

Usage

General use with zero- or few-shot prompts.

Size

20 billion parameters

API pricing tier

Class 3. For pricing details, see Table 3 and Table 5.

Availability

Provided by IBM deployed on multitenant hardware.
Deploy on demand for dedicated use.

Try it out

Experiment with samples:

Token limits

Context window length (input + output): 4,096

Supported natural languages

English

Instruction tuning information

The flan-ul2-20b model is pretrained on the colossal, cleaned version of Common Crawl's web crawl corpus. The model is fine-tuned with multiple pretraining objectives to optimize it for various natural language processing tasks. Details about the training datasets used are published.

Model architecture

Encoder-decoder

License

Apache 2.0 license

Learn more

Read the following resources:

jais-13b-chat

The jais-13b-chat foundation model is a bilingual large language model for Arabic and English that is fine-tuned to support conversational tasks.

Usage

Supports Q&A, summarization, classification, generation, extraction, and translation in Arabic.

Size

13 billion parameters

API pricing tier

Class 2. For pricing details, see Table 3.

Availability

Provided by IBM deployed on multitenant hardware in the Frankfurt data center.

Try it out

Sample prompt: Arabic chat

Token limits

Context window length (input + output): 2,048

Supported natural languages

Arabic (Modern Standard Arabic) and English

Instruction tuning information

Jais-13b-chat is based on the Jais-13b model, which is a foundation model that is trained on 116 billion Arabic tokens and 279 billion English tokens. Jais-13b-chat is fine tuned with a curated set of 4 million Arabic and 6 million English prompt-and-response pairs.

Model architecture

Decoder

License

Apache 2.0 license

Learn more

Read the following resources:

Llama 4 Instruct models

Tech preview of the Llama 4 collection of foundation models that are provided by Meta. The llama-4-maverick-17b-128e-instruct-fp8 and llama-4-scout-17b-16e-instruct models are multimodal models that use a mixture-of-experts (MoE) architecture for optimized, best-in-class performance in text and image understanding.

The Llama 4 Maverick model is a 17 billion active parameter multimodal model with 128 experts. The Llama 4 Scout model is a 17 billion active parameter multimodal model with 16 experts.

Usage

Generates multilingual dialog output like a chatbot, uses a model-specific prompt format, optimized for visual recognition, image reasoning, captioning, and answering general questions about an image.

Size

17 billion parameters

API pricing tier

These models are available as preview models with no charge.

For pricing details, see Table 3.

Availability

Provided by IBM deployed on multitenant hardware.

Try it out

Token limits

Context window length (input + output): 131,072

The maximum new tokens, which means the tokens generated by the foundation models per request, is limited to 8,192.

Supported natural languages

Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese.

Instruction tuning information

Llama 4 was pre-trained on a broader collection of 200 languages. The Llama 4 Scout model was pre-trained on approximately 40 trillion tokens and the Llama 4 Maverick model was pre-trained on approximately 22 trillion tokens of multimodal data from publicly available and licensed information from Meta.

Model architecture

Decoder-only

License

Meta Llama 4 Community License

Learn more

Read the following resources:

Llama 3.3 70B Instruct

The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned generative model (text in/text out) with 70 billion parameters.

The llama-3-3-70b-instruct is a revision of the popular Llama 3.1 70B Instruct foundation model. The Llama 3.3 foundation model is better at coding, step-by-step reasoning, and tool-calling. Despite its smaller size, the Llama 3.3 model's performance is similar to that of the Llama 3.1 405b model, making it a great choice for developers.

Usage

Generates multilingual dialog output like a chatbot. Uses a model-specific prompt format.

Size

70 billion parameters

API pricing tier

Class 13

For pricing details, see Table 3.

Availability

A quantized version of the model is provided by IBM deployed on multitenant hardware.
Two versions of the model are available to deploy on demand for dedicated use:
- llama-3-3-70b-instruct-hf: Original version published on Hugging Face by Meta.
- llama-3-3-70b-instruct: A quantized version of the model that can be deployed with 2 GPUs instead of 4.

Try it out

Experiment with samples:

Token limits

Context window length (input + output): 131,072

Supported natural languages

English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai

Instruction tuning information

Llama 3.3 was pretrained on 15 trillion tokens of data from publicly available sources. The fine tuning data includes publicly available instruction datasets, as well as over 25 million synthetically generated examples.

Model architecture

Decoder-only

License

Meta Llama 3.3 Community License

Learn more

Read the following resources:

Llama 3.2 Instruct

The Llama 3.2 collection of foundation models are provided by Meta. The llama-3-2-1b-instruct and llama-3-2-3b-instruct models are the smallest Llama 3.2 models that fit onto a mobile device. The models are lightweight, text-only models that can be used to build highly personalized, on-device agents.

For example, you can ask the models to summarize the last ten messages you received, or to summarize your schedule for the next month.

Usage

Generate dialog output like a chatbot. Use a model-specific prompt format. Their small size and modest compute resource and memory requirements enable the Llama 3.2 Instruct models to be run locally on most hardware, including on mobile and other edge devices.

Sizes

1 billion parameters
3 billion parameters

API pricing tier

1b: Class C1
3b: Class 8

For pricing details, see Table 3.

For pricing details, see Billing details for generative AI assets.

Availability

Provided by IBM deployed on multitenant hardware.

Try it out

Token limits

Context window length (input + output)

1b: 131,072
3b: 131,072

The maximum new tokens, which means the tokens generated by the foundation models per request, is limited to 8,192.

Supported natural languages

English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai

Instruction tuning information

Pretrained on up to 9 trillion tokens of data from publicly available sources. Logits from the Llama 3.1 8B and 70B models were incorporated into the pretraining stage of the model development, where outputs (logits) from these larger models were used as token-level targets. In post-training, aligned the pre-trained model by using Supervised Fine-Tuning (SFT), Rejection Sampling (RS), and Direct Preference Optimization (DPO).

Model architecture

Decoder-only

License

Meta Llama 3.2 Community License

Learn more

Read the following resources:

Llama 3.2 Vision Instruct

The Meta Llama 3.2 collection of foundation models are provided by Meta. The llama-3-2-11b-vision-instruct and llama-3-2-90b-vision-instruct models are built for image-in, text-out use cases such as document-level understanding, interpretation of charts and graphs, and captioning of images.

Usage

Generates dialog output like a chatbot and can perform computer vision tasks including classification, object detection and identification, image-to-text transcription (including handwriting), contextual Q&A, data extraction and processing, image comparison and personal visual assistance. Uses a model-specific prompt format.

Sizes

11 billion parameters
90 billion parameters

API pricing tier

11b: Class 9
90b: Class 10

For pricing details, see Table 3.

For pricing details for deploying the 11b model on demand, see Table 5.

Availability

11b and 90b: Provided by IBM deployed on multitenant hardware.
11b model only: Deploy on demand for dedicated use.

Try it out

Token limits

Context window length (input + output)

11b: 131,072
90b: 131,072

The maximum new tokens, which means the tokens generated by the foundation models per request, is limited to 8,192. The tokens that are counted for an image that you submit to the model are not included in the context window length.

Supported natural languages

English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai with text-only inputs. English only when an image is included with the input.

Instruction tuning information

Llama 3.2 Vision models use image-reasoning adaptor weights that are trained separately from the core large language model weights. This separation preserves the general knowledge of the model and makes the model more efficient both at pretraining time and run time. The Llama 3.2 Vision models were pretrained on 6 billion image-and-text pairs, which required far fewer compute resources than were needed to pretrain the Llama 3.1 70B foundation model alone. Llama 3.2 models also run efficiently because they can tap more compute resources for image reasoning only when the input requires it.

Model architecture

Decoder-only

License

Meta Llama 3.2 Community License

Learn more

Read the following resources:

Research paper
Meta AI blog
11b Model card (Multitenant)
90b Model card
[11b Model card (Dedicated)]https://dataplatform.cloud.ibm.com/wx/samples/models/meta-llama/llama-3-2-11b-vision-instruct-curated?context=wx

llama-guard-3-11b-vision

The Meta Llama 3.2 collection of foundation models are provided by Meta. The llama-guard-3-11b-vision is a multimodal evolution of the text-only Llama-Guard-3 model. The model can be used to classify image and text content in user inputs (prompt classification) as safe or unsafe.

Usage

Use the model to check the safety of the image and text in an image-to-text prompt.

Size

11 billion parameters

API pricing tier

Class 9. For pricing details, see Table 3.

Availability

Provided by IBM deployed on multitenant hardware.

Try it out

Token limits

Context window length (input + output): 131,072

Supported natural languages

English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai with text-only inputs. English only when an image is included with the input.

Instruction tuning information

Pretrained model that is fine-tuned for content safety classification. For more information about the types of content that are classified as unsafe, see the model card.

Model architecture

Decoder-only

License

Meta Llama 3.2 Community License

Learn more

Read the following resources:

Llama 3.1

The Meta Llama 3.1 collection of foundation models are provided by Meta. The Llama 3.1 base foundation models, llama-3-1-8b and llama-3-1-70b are multilingual models that supports tool use, and have overall stronger reasoning capabilities.

Usage

Use for long-form text summarization and with multilingual conversational agents or coding assistants.

Size

8 billion parameters 70 billion parameters

API pricing tier

For pricing details, see Table 5.

Availability

Deploy on demand for dedicated use.

Try it out

Sample prompt: Converse with Llama 3

Token limits

Context window length (input + output): 131,072

Supported natural languages

English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai

Model architecture

Decoder-only

License

Meta Llama 3.1 Community License

Learn more

Read the following resources:

Llama 3.1 Instruct

The Meta Llama 3.1 collection of foundation models are provided by Meta. The Llama 3.1 foundation models are pretrained and instruction tuned text-only generative models that are optimized for multilingual dialogue use cases. The models use supervised fine-tuning and reinforcement learning with human feedback to align with human preferences for helpfulness and safety.

The llama-3-405b-instruct model is Meta's largest open-sourced foundation model to date. This foundation model can also be used as a synthetic data generator, post-training data ranking judge, or model teacher/supervisor that can improve specialized capabilities in more inference-friendly, derivative models.

Usage

Generates dialog output like a chatbot. Uses a model-specific prompt format.

Sizes

8 billion parameters
70 billion parameters
405 billion parameters

API pricing tier

8b: Class 1
70b: Class 2
405b: Class 3 (input), Class 7 (output)

For pricing details, see Table 3.

For pricing details for deploying the 8b and 70b models on demand, see Table 5.

Availability

405b: Provided by IBM deployed on multitenant hardware.
8b and 70b only: Deploy on demand for dedicated use.

Warning icon The IBM deployments of the 8b and 70b foundation models are deprecated. For details, see Foundation model lifecycle.

Try it out

Sample prompt: Converse with Llama 3

Chat API sample

Token limits

Context window length (input + output)

8b and 70b: 131,072
405b: 16,384
- Although the model supports a context window length of 131,072, the window is limited to 16,384 to reduce the time it takes for the model to generate a response.
The maximum new tokens, which means the tokens generated by the foundation models per request, is limited to 4,096.

Supported natural languages

English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai

Instruction tuning information

Llama 3.1 was pretrained on 15 trillion tokens of data from publicly available sources. The fine tuning data includes publicly available instruction datasets, as well as over 25 million synthetically generated examples.

Model architecture

Decoder-only

License

Meta Llama 3.1 Community License

Learn more

Read the following resources:

Llama 3 Instruct

The Meta Llama 3 family of foundation models are accessible, open large language models that are built with Meta Llama 3 and provided by Meta on Hugging Face. The Llama 3 foundation models are instruction fine-tuned language models that can support various use cases.

Usage

Generates dialog output like a chatbot.

Sizes

8 billion parameters
70 billion parameters

API pricing tier

8b: Class 1
70b: Class 2

For pricing details, see Table 3 and Table 5.

Availability

Provided by IBM deployed on multitenant hardware (70b in Sydney region only).
Deploy on demand for dedicated use.

Try it out

Sample prompt: Converse with Llama 3

Token limits

Context window length (input + output)

8b: 8,192
70b: 8,192

Note: The maximum new tokens, which means the tokens generated by the foundation models per request, is limited to 4,096.

Supported natural languages

English

Instruction tuning information

Llama 3 features improvements in post-training procedures that reduce false refusal rates, improve alignment, and increase diversity in the foundation model output. The result is better reasoning, code generation, and instruction-following capabilities. Llama 3 has more training tokens (15T) that result in better language comprehension.

Model architecture

Decoder-only

License

META LLAMA 3 Community License

Learn more

Read the following resources:

Llama 2 Chat

The Llama 2 Chat models are provided by Meta on Hugging Face. The fine-tuned models are useful for chat generation. The models are pretrained with publicly available online data and fine-tuned using reinforcement learning from human feedback.

You can choose to use the 13 billion parameter or 70 billion parameter version of the model.

Usage

Generates dialog output like a chatbot. Uses a model-specific prompt format.

Size

13 billion parameters
70 billion parameters

API pricing tier

Class 1. For pricing details, see Table 3 and Table 5.

Availability

13b
- Provided by IBM deployed on multitenant hardware
- Deploy on demand for dedicated use
70b
- Deploy on demand for dedicated use

Warning icon The IBM-provided deployment of this foundation model is deprecated. See Foundation model lifecycle.

Try it out

Experiment with samples:

Token limits

Context window length (input + output)

13b: 4,096
70b: 4,096

Supported natural languages

English

Instruction tuning information

Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets and more than one million new examples that were annotated by humans.

Model architecture

Decoder-only

License

License

Learn more

Read the following resources:

mistral-large

Mistral Large 2 is a family of large language models developed by Mistral AI. The mistral-large foundation model is fluent in and understands the grammar and cultural context of English, French, Spanish, German, and Italian. The foundation model can also understand dozens of other languages. The model has a large context window, which means you can add large documents as contextual information in prompts that you submit for retrieval-augmented generation (RAG) use cases. The mistral-large foundation model is effective at programmatic tasks, such as generating, reviewing, and commenting on code, function calling, and can generate results in JSON format.

For more getting started information, see the watsonx.ai page on the Mistral AI website.

Usage

Suitable for complex multilingual reasoning tasks, including text understanding, transformation, and code generation. Due to the model's large context window, use the max tokens parameter to specify a token limit when prompting the model.

API pricing tier

Pricing for inferencing the provided Mistral Large model is not assigned by a multiplier. The following special pricing tiers are used:

Input tier: Mistral Large Input
Output tier: Mistral Large

For pricing details, see Table 3. For pricing details for deploying this model on demand, see Table 5.

Attention: This foundation model has an additional access fee that is applied per hour of use.

Availability

Provided by IBM deployed on multitenant hardware
Deploy on demand for dedicated use

Try it out

Token limits

Context window length (input + output): 131,072

Note: The maximum new tokens, which means the tokens generated by the foundation model per request, is limited to 16,384.

Supported natural languages

English, French, German, Italian, Spanish, Chinese, Japanese, Korean, Portuguese, Dutch, Polish, and dozens of other languages.

Supported programming languages

The mistral-large model has been trained on over 80 programming languages including Python, Java, C, C++, JavaScript, Bash, Swift, and Fortran.

Instruction tuning information

The mistral-large foundation model is pre-trained on diverse datasets like text, codebases, and mathematical data from various domains.

Model architecture

Decoder-only

License

For terms of use, including information about contractual protections related to capped indemnification, see Terms of use.

Learn more

Read the following resources:

mistral-large-instruct-2411

The mistral-large-instruct-2411 foundation model from Mistral AI and belongs to the Mistral Large 2 family of models. The model specializes in reasoning, knowledge, and coding. The model extends the capabilities of the Mistral-Large-Instruct-2407 foundation model to include better handling of long prompt contexts, system prompt instructions, and function calling requests.

Usage

The mistral-large-instruct-2411 foundation model is multilingual, proficient in coding, agent-centric, and adheres to system prompts to aid in retrieval-augmented generation tasks and other use cases where prompts with large context need to be handled.

Size

123 billion parameters

API pricing tier

For pricing details, see Table 5.

Attention: This foundation model has an additional access fee that is applied per hour of use.

Availability

Deploy on demand for dedicated use.

Try it out

Sample prompts

Token limits

Context window length (input + output): 131,072

Supported natural languages

Multiple languages and is particularly strong in English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.

Supported programming languages

The mistral-large-instruct-2411 foundation model has been trained on over 80 programming languages including Python, Java, C, C++, JavaScript, Bash, Swift, and Fortran.

Instruction tuning information

The mistral-large-instruct-2411 foundation model extends the Mistral-Large-Instruct-2407 foundation model from Mistral AI. Training enhanced the reasoning capabilities of the model. Training also focused on reducing hallucinations by fine tuning the model to be more cautious and discerning in its responses and to acknowledge when it cannot find solutions or does not have sufficient information to provide a confident answer.

License

For terms of use, including information about contractual protections related to capped indemnification, see Terms of use.

Learn more

Read the following resources:

mistral-nemo-instruct-2407

The mistral-nemo-instruct-2407 foundation model from Mistral AI was built in collaboration with NVIDIA. Mistral NeMo performs exceptionally well in reasoning, world knowledge, and coding accuracy, especially for a model of its size.

Usage

The Mistral NeMo model is multilingual and is trained on function calling.

Size

12 billion parameters

API pricing tier

For pricing details, see Table 5.

Availability

Deploy on demand for dedicated use.

Token limits

Context window length (input + output): 131,072

Supported natural languages

Multiple languages and is particularly strong in English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.

Supported programming languages

The Mistral NeMo model has been trained on several programming languages.

Instruction tuning information

Mistral NeMo had an advanced fine-tuning and alignment phase.

License

Apache 2.0 license

Learn more

Read the following resources:

mistral-small-24b-instruct-2501

Mistral Small 3 is a cost-efficient, fast, and reliable foundation model developed by Mistral AI. The mistral-small-24b-instruct-2501 model is instruction fine-tuned and performs well in tasks that require some reasoning ability, such as data extraction, summarizing a document, or writing descriptions. Built to support agentic application, with adherence to system prompts and function calling with JSON output generation.

For more getting started information, see the watsonx.ai page on the Mistral AI website.

Usage

Suitable for conversational agents and function calling.

API pricing tier

Class 9

For pricing details, see Table 3.

Availability

Provided by IBM deployed on multitenant hardware in Frankfurt region only.

Try it out

Sample prompts

Token limits

Context window length (input + output): 32,768

Note:

The maximum new tokens, which means the tokens generated by the foundation model per request, is limited to 16,384.

Supported natural languages

English, French, German, Italian, Spanish, Chinese, Japanese, Korean, Portuguese, Dutch, Polish, and dozens of other languages.

Supported programming languages

The mistral-small-24b-instruct-2501 model has been trained on over 80 programming languages including Python, Java, C, C++, JavaScript, Bash, Swift, and Fortran.

Instruction tuning information

The mistral-small-24b-instruct-2501 foundation model is pre-trained on diverse datasets like text, codebases, and mathematical data from various domains.

Model architecture

Decoder-only

License

Apache 2.0 license

Learn more

Read the following resources:

mixtral-8x7b-base

The mixtral-8x7b-base foundation model is provided by Mistral AI. The mixtral-8x7b-base foundation model is a generative sparse mixture-of-experts network that groups the model parameters, and then for each token chooses a subset of groups (referred to as experts) to process the token. As a result, each token has access to 47 billion parameters, but only uses 13 billion active parameters for inferencing, which reduces costs and latency.

Usage

Suitable for many tasks, including classification, summarization, generation, code creation and conversion, and language translation.

Size

46.7 billion parameters

API pricing tier

For pricing details, see Table 5.

Availability

Deploy on demand for dedicated use.

Token limits

Context window length (input + output): 32,768

Note: The maximum new tokens, which means the tokens generated by the foundation model per request, is limited to 16,384.

Supported natural languages

English, French, German, Italian, Spanish

Model architecture

Decoder-only

License

Apache 2.0 license

Learn more

Read the following resources:

mixtral-8x7b-instruct-v01

The mixtral-8x7b-instruct-v01 foundation model is provided by Mistral AI. The mixtral-8x7b-instruct-v01 foundation model is a pretrained generative sparse mixture-of-experts network that groups the model parameters, and then for each token chooses a subset of groups (referred to as experts) to process the token. As a result, each token has access to 47 billion parameters, but only uses 13 billion active parameters for inferencing, which reduces costs and latency.

Usage

Suitable for many tasks, including classification, summarization, generation, code creation and conversion, and language translation. Due to the model's unusually large context window, use the max tokens parameter to specify a token limit when prompting the model.

Size

46.7 billion parameters

API pricing tier

Class 1. For pricing details, see Table 3.

Try it out

Sample prompts

Token limits

Context window length (input + output): 32,768

Note: The maximum new tokens, which means the tokens generated by the foundation model per request, is limited to 16,384.

Supported natural languages

English, French, German, Italian, Spanish

Instruction tuning information

The Mixtral foundation model is pretrained on internet data. The Mixtral 8x7B Instruct foundation model is fine-tuned to follow instructions.

Model architecture

Decoder-only

License

Apache 2.0 license

Learn more

Read the following resources:

mt0-xxl-13b

The mt0-xxl-13b model is provided by BigScience on Hugging Face. The model is optimized to support language generation and translation tasks with English, languages other than English, and multilingual prompts.

Usage: General use with zero- or few-shot prompts. For translation tasks, include a period to indicate the end of the text you want translated or the model might continue the sentence rather than translate it.

Size

13 billion parameters

API pricing tier

For pricing details, see Table 5.

Availability

Deployed on demand for dedicated use.

Try it out

Experiment with the following samples:

Supported natural languages

Multilingual

Token limits

Context window length (input + output): 4,096

Supported natural languages

The model is pretrained on multilingual data in 108 languages and fine-tuned with multilingual data in 46 languages to perform multilingual tasks.

Instruction tuning information

BigScience publishes details about its code and datasets.

Model architecture

Encoder-decoder

License

Apache 2.0 license

Learn more

Read the following resources:

pixtral-12b

Pixtral 12B is a multimodal model developed by Mistral AI. The pixtral-12b foundation model is trained to understand both natural images and documents and is able to ingest images at their natural resolution and aspect ratio, providing flexibility on the number of tokens used to process an image. The foundation model supports multiple images in its long context window. The model is effective in image-in, text-out multimodal tasks and excels at instruction following.

Usage: Chart and figure understanding, document question answering, multimodal reasoning, and instruction following.
Size: 12 billion parameters
API pricing tier: Class 9. For pricing details, see Table 3.

Availability

Try it out

Chatting with documents and images

Token limits

Context window length (input + output): 128,000

The maximum new tokens, which means the tokens generated by the foundation models per request, is limited to 8,192.

Supported natural languages

English

Instruction tuning information

The pixtral-12b model is trained with interleaved image and text data and is based on the Mistral Nemo model with a 400 million parameter vision encoder trained from scratch.

Model architecture

Decoder-only

License

Apache 2.0 license

Learn more

Read the following resources:

poro-34b-chat

Poro 34b chat is a chat-tuned version of Poro 34B trained to follow instructions in both Finnish and English.

Usage: Use the model to generate dialog output like a chatbot.
Size: 34 billion parameters
API pricing tier: For pricing details, see Table 5.
Availability: Deploy on demand for dedicated use.
Try it out: Sample prompts
Token limits: Context window length (input + output): 2,048
Supported natural languages: English, Finnish
Instruction tuning information: Poro-34b-Chat is an SFT finetune of Poro-34b on a collection of Finnish and English instruction datasets.
Model architecture: Decoder
License: For terms of use, including information about contractual protections related to capped indemnification, see Terms of use.

Learn more

Read the following resources:

Model card

Any deprecated foundation models are highlighted with a deprecated warning icon . For more information about deprecation, including foundation model withdrawal details, see Foundation model lifecycle.

Learn more

Parent topic: Supported foundation models

Was the topic helpful?

0/1000

How to choose a modelCopy link to section

Foundation model detailsCopy link to section

allam-1-13b-instructCopy link to section

codellama-34b-instruct-hfCopy link to section

DeepSeek-R1 distilled modelsCopy link to section

elyza-japanese-llama-2-7b-instructCopy link to section

EuroLLM InstructCopy link to section

flan-t5-xl-3bCopy link to section

flan-t5-xxl-11bCopy link to section

flan-ul2-20bCopy link to section

jais-13b-chatCopy link to section

Llama 4 Instruct models Copy link to section

Llama 3.3 70B InstructCopy link to section

Llama 3.2 InstructCopy link to section

Llama 3.2 Vision InstructCopy link to section

llama-guard-3-11b-visionCopy link to section

Llama 3.1 Copy link to section

Llama 3.1 InstructCopy link to section

Llama 3 InstructCopy link to section

Llama 2 ChatCopy link to section

mistral-largeCopy link to section

mistral-large-instruct-2411Copy link to section

mistral-nemo-instruct-2407Copy link to section

mistral-small-24b-instruct-2501Copy link to section

mixtral-8x7b-baseCopy link to section

mixtral-8x7b-instruct-v01Copy link to section

mt0-xxl-13bCopy link to section

pixtral-12bCopy link to section

poro-34b-chatCopy link to section

Learn moreCopy link to section

How to choose a model

Foundation model details

allam-1-13b-instruct

codellama-34b-instruct-hf

DeepSeek-R1 distilled models

elyza-japanese-llama-2-7b-instruct

EuroLLM Instruct

flan-t5-xl-3b

flan-t5-xxl-11b

flan-ul2-20b

jais-13b-chat

Llama 4 Instruct models

Llama 3.3 70B Instruct

Llama 3.2 Instruct

Llama 3.2 Vision Instruct

llama-guard-3-11b-vision

Llama 3.1

Llama 3.1 Instruct

Llama 3 Instruct

Llama 2 Chat

mistral-large

mistral-large-instruct-2411

mistral-nemo-instruct-2407

mistral-small-24b-instruct-2501

mixtral-8x7b-base

mixtral-8x7b-instruct-v01

mt0-xxl-13b

pixtral-12b

poro-34b-chat

Learn more