Billing details for generative AI assets
Last updated: Feb 11, 2025

Learn about how usage for generative AI assets is measured using resource unit (RU), hourly rates, or a flat rate.

Working with generative AI assets in watsonx.ai Runtime requires that you are using watsonx.ai. For an introduction, see Overview of IBM watsonx.ai.

Review the details for how resources are measured using:

  • Resource units to measure inferencing activities for foundation models provided by watsonx.ai.
  • Hourly rates for custom foundation models you import and deploy with watsonx.ai.
  • Hourly rates for curated foundation models deployed on demand on dedicated hardware.
  • Flat rates by page for document text extraction.

A resource unit is used to measure the following resources:

  • Tokens used for inferencing a foundation model to generate text or text embeddings.
  • Data points used by a time series foundation model for forecasting future values.

Resource unit metering for inferencing foundation models

For the list of supported foundation models and their prices, see Supported foundation models. For the list of supported encoder models and their prices, see Supported encoder models.

When measuring foundation model inferencing, a Resource Unit (RU) is equal to 1,000 tokens from the input and output of the foundation model. A token is a basic unit of text (typically 4 characters or 0.75 words) used in the input or output for a foundation model prompt or for input to an embeddings model.

Each foundation model provided by IBM watsonx.ai is assigned an inference price for input and output. The price is derived as a multiple of the base price for an RU ($0.0001). For example, a model with a price of $0.0006 has a multiplier of 6 times the base rate.
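
As a back-of-the-envelope aid, the following Python sketch estimates a token count from raw text and converts a pricing-tier multiplier into a per-RU price. It uses only the typical 4-characters-per-token ratio and the $0.0001 base RU price cited above; actual token counts vary by model and tokenizer, and the sample text is hypothetical.

    BASE_RU_PRICE = 0.0001  # base price for one RU (1,000 tokens) in USD

    def estimate_tokens(text: str) -> int:
        """Rough estimate only: a token is typically about 4 characters."""
        return max(1, round(len(text) / 4))

    def price_per_ru(multiplier: float) -> float:
        """A model's per-RU price is a multiple of the base RU price."""
        return multiplier * BASE_RU_PRICE

    print(estimate_tokens("Summarize the quarterly sales report."))  # about 9 tokens
    print(round(price_per_ru(6), 4))  # 0.0006, the Class 1 rate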

Important: There are limits by plan on the number of inferencing requests per second that are submitted to a model. If a user exceeds an inferencing request limit, a system notification provides guidance.

A prompt tuned foundation model is assigned the same price as the underlying foundation model. For information about tuned foundation models, see Tuning Studio. Tuning a model in the Tuning Studio consumes capacity unit hours (CUH). For more information, see Billing details for machine learning assets.

Resource unit metering for inferencing time series foundation models

When measuring foundation model forecasting, a Resource Unit (RU) is equal to 1,000 data points in the foundation model input and output. A data point is a unit of input and output content that is expressed as one or more numbers.

Billing classes by multiplier

If you are monitoring model usage with the watsonx.ai API, model prices are listed by pricing tier, as follows:

Table 1. API pricing tiers
Model pricing tier Resource type Price per RU in USD Multiplier (x base rate)
Class 1 Tokens $0.0006 6
Class 2 Tokens $0.0018 18
Class 3 Tokens $0.0050 50
Class C1 Tokens $0.0001 1
Class 5 Tokens $0.00025 2.5
Class 7 Tokens $0.016 160
Class 8 Tokens $0.00015 1.5
Class 9 Tokens $0.00035 3.5
Class 10 Tokens $0.0020 20
Class 11 Tokens $0.000005 0.05
Class 12 Tokens $0.0002 2
Class 13 Tokens $0.00071 7.1
Class 14 Data points $0.00013 1.3
Class 15 Data points $0.00038 3.8
Note:

Certain models, such as Mistral Large, have special pricing that is not assigned by a multiplier. The pricing is listed in Supported models.

Calculating the resource unit rate of tokens per model

To calculate charges for foundation model inference, divide the total number of tokens consumed during the month, rounded up to the nearest 1,000 tokens, by 1,000 to obtain the total number of RUs. Multiply the total number of RUs by the model price to obtain the total usage charge. The model price varies by model and can also differ between input and output tokens for a given model.

The basic formula is as follows:

Total tokens used / 1,000 = Resource Units (RU) consumed
RU consumed x model price = Total usage charge

The base price for an RU is $0.0001. The price for each foundation model is a multiple of the base price.
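
A minimal Python sketch of this calculation follows. The $0.0006 rate is the Class 1 price from the pricing-tier table above; the token count is a hypothetical value used only for illustration.

    import math

    def token_inference_charge(tokens_used: int, price_per_ru: float) -> float:
        """Round token usage up to the nearest 1,000 tokens, convert to
        Resource Units (1 RU = 1,000 tokens), and apply the per-RU price."""
        resource_units = math.ceil(tokens_used / 1000)
        return resource_units * price_per_ru

    # Hypothetical usage: 2,350,500 tokens against a Class 1 model ($0.0006 per RU).
    print(round(token_inference_charge(2_350_500, 0.0006), 4))  # 2,351 RU x $0.0006 = $1.4106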

Calculating the resource unit rate of data points per model

To calculate charges for forecasting with a time series foundation model, use the following equations:

  • Input calculation: context length x number of series x number of channels
  • Output calculation: prediction length x number of series x number of channels

These equations use the following parameters:

  • Context length refers to the number of historical data points that a time series foundation model uses as input to make predictions.
  • A series is a collection of observations made sequentially over time. For example, when comparing stock prices for many companies, the observed stock price history for each company is a separate series.
  • Channels are the specific features or variables that are measured within a time series dataset.
  • Prediction length is the number of future data points for the model to predict.
Data point pricing
Resource type Model pricing tier Price in USD per RU
Input data points Class 14 $0.00013
Output data points Class 15 $0.00038

The following example shows how to calculate the cost for a time series forecasting request with the following parameters:

Parameters used to calculate data point usage
Parameter Example quantity
Context length (granite-ttm-1536-96-r2 model) 1,536
Channels 10
Series 1,000
Prediction length 96
  • Total input data points: 15,360,000 (Context length of 1,536, 10 channels, for 1,000 series)

    15,360,000 / 1,000 = 15,360 RU; 15,360 RU x $0.00013 = $1.9968
    
  • Total output data points: 960,000 (Forecast 96 time points, 10 channels, for 1,000 series)

    960,000 / 1,000 = 960 RU; 960 RU x $0.00038 = $0.3648
    
  • Total price for the time series forecast request: $2.36 (Input cost $1.9968 + Output cost $0.3648)

    $1.9968 + $0.3648 = $2.3616
    
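The same arithmetic can be written as a short Python sketch. The two rates are the Class 14 and Class 15 prices from the data point pricing table, and the parameter values repeat the worked example above; this is an illustration, not an official billing tool.

    INPUT_PRICE_PER_RU = 0.00013   # Class 14: input data points
    OUTPUT_PRICE_PER_RU = 0.00038  # Class 15: output data points

    def forecast_request_cost(context_length: int, prediction_length: int,
                              num_series: int, num_channels: int) -> float:
        """Apply the input and output data point formulas, convert to RUs
        (1 RU = 1,000 data points), and price each side separately."""
        input_points = context_length * num_series * num_channels
        output_points = prediction_length * num_series * num_channels
        input_cost = (input_points / 1000) * INPUT_PRICE_PER_RU
        output_cost = (output_points / 1000) * OUTPUT_PRICE_PER_RU
        return input_cost + output_cost

    # Worked example: granite-ttm-1536-96-r2 with a context length of 1,536,
    # a prediction length of 96, 1,000 series, and 10 channels.
    print(round(forecast_request_cost(1536, 96, 1000, 10), 4))  # 2.3616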

Hourly billing rates for custom foundation models

Deploying custom foundation models requires the Standard plan.

Billing rates are based on the model hardware configuration and apply to hosting and inferencing the model. Charges begin when the model is successfully deployed and continue until the model is deleted.

Custom foundation model billing rates
Configuration size Billing rate per hour in USD
Small $5.22
Medium $10.40
Large $20.85
Important: You can deploy a maximum of four small custom foundation models, two medium models, or one large model per account.

For details on choosing a configuration for a custom foundation model, see Planning to deploy a custom foundation model.
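
As a rough illustration of how these hourly rates accrue, the following Python sketch estimates the hosting charge for a deployment. The rates come from the table above; the configuration name and the number of hours deployed are hypothetical inputs.

    # Hourly billing rates in USD for custom foundation model configurations.
    HOURLY_RATES = {"small": 5.22, "medium": 10.40, "large": 20.85}

    def custom_model_hosting_cost(configuration: str, hours_deployed: float) -> float:
        """Charges accrue from successful deployment until the model is deleted."""
        return HOURLY_RATES[configuration] * hours_deployed

    # Hypothetical: a small configuration deployed for a 30-day month (720 hours).
    print(round(custom_model_hosting_cost("small", 720), 2))  # 3758.4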

Hourly billing rates for deploy on demand models

Deploy foundation models on demand when you want a hosted solution reserved for the exclusive use of your organization. Only colleagues to whom you grant access to the deployment can inference the foundation model. A dedicated deployment means faster and more responsive interactions and allows prompts with larger context window lengths. Billing rates are set per model and apply to hosting and inferencing the model. Charges begin when the model is deployed and continue until the model is deleted.

Note: Deploying foundation models on demand requires the Standard plan. This feature is currently available only for the Dallas data center.

For details on deploying a foundation model on demand, including pricing, see Supported foundation models available with watsonx.ai.

Rates per page for document text extraction

Use the document text extraction method of the watsonx.ai REST API to convert highly structured PDF files that rely on diagrams and tables to convey information into an AI model-friendly JSON file format.

Billing is charged at a flat rate per page processed. A page can be a page of text (up to 1,800 characters), an image, or a .tiff frame. The billing rate depends on your plan type.

Text extraction pricing
Plan type Price per page in USD
Essential $0.038
Standard $0.030
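
For illustration only, this short Python sketch multiplies the number of pages processed by the per-page rate for a plan. The rates come from the table above; the page count and plan choice are hypothetical inputs.

    # Flat per-page rates in USD for document text extraction, by plan type.
    PER_PAGE_RATES = {"essential": 0.038, "standard": 0.030}

    def text_extraction_cost(plan: str, pages_processed: int) -> float:
        """A page is a page of text (up to 1,800 characters), an image, or a .tiff frame."""
        return PER_PAGE_RATES[plan] * pages_processed

    # Hypothetical: 500 pages processed on the Standard plan.
    print(round(text_extraction_cost("standard", 500), 2))  # 15.0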


Parent topic: watsonx.ai Runtime plans