Choosing a tool
The core services for Cloud Pak for Data as a Service provide a range of tools for users with all levels of experience in preparing, analyzing, and modeling data, from beginner to expert. The right tool for you depends on the type of data you have, the tasks you plan to do, and the amount of automation you want.
To see which tools you use in a project and which services those tools require, open the tools and services map.
To pick the right tool, consider these factors.
The type of data you have
- Tabular data in delimited files or relational data in remote data sources
- Image files
- Textual (unstructured) data in documents
The type of tasks you need to do
- Prepare data: cleanse, shape, visualize, organize, and validate data.
- Analyze data: identify patterns and relationships in data, and display insights.
- Build models: build, train, test, and deploy models to make predictions or optimize decisions.
How much automation you want
- Code editor tools: Use to write code in Python or R, either of which can also be used with Spark.
- Graphical builder tools: Use menus and drag-and-drop functionality on a builder to visually program.
- Automated builder tools: Use to configure automated tasks that require limited user input.
Find the right tool:
Tools for tabular or relational data
Tools for tabular or relational data by task:
Tool | Tool type | Prepare data | Analyze data | Build models |
---|---|---|---|---|
Jupyter notebook editor | Code editor | ✓ | ✓ | ✓ |
Federated Learning | Code editor | | | ✓ |
RStudio | Code editor | ✓ | ✓ | ✓ |
Data Refinery | Graphical builder | ✓ | ✓ | |
Masking flow | Automated builder | ✓ | | |
Watson Query | Graphical builder | ✓ | | |
DataStage | Graphical builder | ✓ | | |
Data Replication | Graphical builder | ✓ | | |
SPSS Modeler | Graphical builder | ✓ | ✓ | ✓ |
Decision Optimization model builder | Graphical builder and code editor | ✓ | | ✓ |
AutoAI | Automated builder | ✓ | | ✓ |
Metadata import | Automated builder | ✓ | | |
Metadata enrichment | Automated builder | ✓ | ✓ | |
Data quality rule | Automated builder and code editor | ✓ | | |
IBM Match 360 with Watson (Beta) | Automated builder | ✓ | | |
Orchestration Pipelines | Graphical builder | ✓ | ✓ | ✓ |
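The table can be read as a filter on two of the factors: the tasks you need and the tool type you prefer. The following is a minimal sketch of that lookup in Python. It copies only a few rows from the table, and the names `Tool`, `TOOLS`, and `find_tools` are hypothetical helpers for illustration, not part of any product API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    tool_type: str    # "Code editor", "Graphical builder", or "Automated builder"
    tasks: frozenset  # subset of {"prepare", "analyze", "build"}


# A few rows copied from the table above; this is not the full list.
TOOLS = [
    Tool("Jupyter notebook editor", "Code editor", frozenset({"prepare", "analyze", "build"})),
    Tool("RStudio", "Code editor", frozenset({"prepare", "analyze", "build"})),
    Tool("Data Refinery", "Graphical builder", frozenset({"prepare", "analyze"})),
    Tool("DataStage", "Graphical builder", frozenset({"prepare"})),
    Tool("SPSS Modeler", "Graphical builder", frozenset({"prepare", "analyze", "build"})),
    Tool("Masking flow", "Automated builder", frozenset({"prepare"})),
]


def find_tools(tasks, tool_type=None):
    """Return the names of tools that cover every requested task.

    If tool_type is given, only tools of that type are considered.
    """
    wanted = set(tasks)
    return [
        tool.name
        for tool in TOOLS
        if wanted <= tool.tasks and (tool_type is None or tool.tool_type == tool_type)
    ]


# Graphical builders that can both prepare and analyze data.
graphical_prepare_analyze = find_tools(["prepare", "analyze"], "Graphical builder")
```

For the rows included here, `graphical_prepare_analyze` contains Data Refinery and SPSS Modeler.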
Tools for textual data
Tools for building a model that works with textual data:
Tool | Code editor | Graphical builder | Automated builder |
---|---|---|---|
Jupyter notebook editor | ✓ | | |
RStudio | ✓ | | |
SPSS Modeler | | ✓ | |
Orchestration Pipelines | | ✓ | |
Tools for image data
Tools for building a model that classifies images:
Tool | Code editor | Graphical builder | Automated builder |
---|---|---|---|
Jupyter notebook editor | ✓ | | |
RStudio | ✓ | | |
Orchestration Pipelines | | ✓ | |
Accessing tools
To use a tool, you must create an asset specific to that tool, or open an existing asset for that tool. To create an asset, click New asset or Import assets and then choose the asset type you want. This table shows the asset type to choose for each tool.
To use this tool | Choose this asset type |
---|---|
Jupyter notebook editor | Jupyter notebook editor |
Data Refinery | Data Refinery flow |
Masking flows | Masking flows |
DataStage | DataStage flow |
SPSS Modeler | Modeler flow |
Decision Optimization model builder | Decision Optimization |
AutoAI | AutoAI experiment |
Federated Learning | Federated Learning experiment |
Metadata import | Metadata import |
Metadata enrichment | Metadata enrichment |
Data quality rules | Data quality rule |
IBM Match 360 with Watson (Beta) | Master data configuration |
To edit notebooks with RStudio, click Launch IDE > RStudio.
Jupyter notebook editor
Use the Jupyter notebook editor to create a notebook in which you run code to prepare, visualize, and analyze data, or build and train a model.
- Required services
- Watson Studio
- Data format
- Any
- Data size
- Any
- How you can prepare data, analyze data, or build models
- Write code in Python or R, either of which can also be used with Spark.
- Include rich text and media with your code.
- Work with any kind of data in any way you want.
- Use preinstalled or install other open source and IBM libraries and packages.
- Schedule runs of your code.
- Import a notebook from a file, a URL, or the Resource hub.
- Share read-only copies of your notebook externally.
- Get started
- To create a notebook, click New asset > Work with data and models in Python or R notebooks.
- Learn more
- Documentation about notebooks
- Videos about notebooks
- Sample notebooks
Watch a video to learn Jupyter notebook basics
This video provides a visual method to learn the concepts and tasks in this documentation.
Data Refinery
Use Data Refinery to prepare and visualize tabular data with a graphical flow editor. You create and then run a Data Refinery flow as a set of ordered operations on data.
- Required services
- Watson Studio or IBM Knowledge Catalog
- Data format
- Tabular: Avro, CSV, JSON, Microsoft Excel (xls and xlsx formats; first sheet only, except for connections and connected data assets), Parquet, SAS with the "sas7bdat" extension (read only), TSV (read only), or delimited text data asset
- Relational: Tables in relational data sources
- Data size
- Any
- How you can prepare data
- Cleanse, shape, and organize data with over 60 operations.
- Save refined data as a new data set or update the original data.
- Profile data to validate it.
- Use interactive templates to manipulate data with code operations, functions, and logical operators.
- Schedule recurring operations on data.
- How you can analyze data
- Identify patterns, connections, and relationships within the data in multiple visualization charts.
- Get started
- To create a Data Refinery flow, click New asset > Prepare and visualize data.
- Learn more
- Documentation about Data Refinery
- Videos about Data Refinery
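The idea of a flow as "a set of ordered operations on data" can be sketched in a few lines of Python. This is only an illustration of the concept, not how Data Refinery is implemented; the operation names, the `FLOW` list, and the sample rows are all hypothetical.

```python
def drop_rows_with_missing(rows, column):
    """Cleanse: remove rows where the column is empty or missing."""
    return [row for row in rows if row.get(column) not in (None, "")]


def convert_to_float(rows, column):
    """Shape: convert a text column to numbers (returns new rows)."""
    return [{**row, column: float(row[column])} for row in rows]


def sort_descending(rows, column):
    """Organize: sort rows by the column, largest value first."""
    return sorted(rows, key=lambda row: row[column], reverse=True)


# A hypothetical flow: each step is an operation and the column it applies to.
FLOW = [
    (drop_rows_with_missing, "amount"),
    (convert_to_float, "amount"),
    (sort_descending, "amount"),
]


def run_flow(rows, flow):
    """Apply each operation in order; the output of one step feeds the next."""
    for operation, column in flow:
        rows = operation(rows, column)
    return rows


raw = [
    {"customer": "A", "amount": "12.50"},
    {"customer": "B", "amount": ""},
    {"customer": "C", "amount": "40"},
]
refined = run_flow(raw, FLOW)
```

The order of the steps matters: in this sketch, converting to numbers before dropping the empty value would fail on the row for customer B. The original `raw` list is left unchanged, which mirrors saving refined data as a new data set.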
Watch a video to see how to refine data
Data Replication
Use Data Replication to integrate and synchronize data. Data Replication provides near-real-time data delivery with low impact to sources.
- Required service
- Data Replication
- Related service
- IBM Knowledge Catalog
- Data formats
- Data Replication works with connections to and from select types of data sources and formats. For more information, see Supported Data Replication connections.
- Credentials
- Data Replication uses your IBM Cloud credentials to connect to the service.
- Get started
- To start data replication in a project, click New asset > Replicate data.
- Learn more
Watch a video to see how to replicate data
Watson Query
Use Watson Query to connect multiple data sources into a single self-balancing collection of data sources or databases.
- Data format
- Relational: Tables in relational data sources
- Data size
- Any
- How you can prepare data
- Connect to multiple data sources.
- Create virtual tables.
- Get started
- To create virtual tables, click Data > Data virtualization. From the service menu, click Virtualization > Virtualize > Tables.
- Learn more
- Documentation about Watson Query
- Videos about Watson Query
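As a rough analogy for what a virtual table does, the following Python sketch uses the standard `sqlite3` module: two separate in-memory databases stand in for remote data sources, and a view joins them so that a query sees one table without any data being copied. This is an assumption-laden illustration of the concept only. Watson Query does not work through SQLite, and the database names, table names, and sample rows are hypothetical.

```python
import sqlite3

# One connection with two attached databases standing in for separate sources.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)", [(1, "Ada"), (2, "Lin")])
conn.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)",
    [(1, 30.0), (1, 12.5), (2, 8.0)],
)

# The "virtual table": a temporary view that spans both sources.
# No rows are copied; the join runs when the view is queried.
conn.execute(
    """
    CREATE TEMP VIEW customer_totals AS
    SELECT c.name AS name, SUM(i.total) AS total
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.name
    """
)


def customer_totals():
    """Query the view as if it were a single table."""
    return conn.execute(
        "SELECT name, total FROM customer_totals ORDER BY name"
    ).fetchall()
```

Calling `customer_totals()` returns one row per customer with the summed invoice total, even though the customers and invoices live in different databases.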
Watch a video to see how to virtualize data
DataStage
Use DataStage to prepare and visualize tabular data with a graphical flow editor. You create and then run a DataStage flow as a set of ordered operations on data.
- Required service
- DataStage
- Data format
- Tabular: Avro, CSV, JSON, Parquet, TSV (read only), or delimited text files
- Relational: Tables in relational data sources
- Data size
- Any
- How you can prepare data
- Design a graphical data integration flow that generates Orchestrate code to run on the high-performing DataStage parallel engine.
- Perform operations such as: Join, Funnel, Checksum, Merge, Modify, Remove Duplicates, and Sort.
- Get started
- To create a DataStage flow, click New asset > Transform and integrate data. The DataStage tile is in the Graphical builders section.
- Learn more
- Documentation about DataStage
- Videos about DataStage
Watch a video to see how to transform data
SPSS Modeler
Use SPSS Modeler to create a flow to prepare data and build and train a model with a flow editor on a graphical builder.
- Required services
- Watson Studio
- Data formats
- Relational: Tables in relational data sources
- Tabular: Excel files (.xls or .xlsx), CSV files, or SPSS Statistics files (.sav)
- Textual: In the supported relational tables or files
- Data size
- Any
- How you can prepare data
- Use automatic data preparation functions.
- Write SQL statements to manipulate data.
- Cleanse, shape, sample, sort, and derive data.
- How you can analyze data
- Visualize data with over 40 graphs.
- Identify the natural language of a text field.
- How you can build models
- Build predictive models.
- Choose from over 40 modeling algorithms.
- Use automatic modeling functions.
- Model time series or geospatial data.
- Classify textual data.
- Identify relationships between the concepts in textual data.
- Get started
- To create an SPSS Modeler flow, click New asset > Build models as a visual flow.
- Learn more
- Documentation about SPSS Modeler
- Videos about SPSS Modeler
Watch a video to see how to build a model with SPSS Modeler
Decision Optimization model builder
Use Decision Optimization to build and run optimization models in the Decision Optimization modeler or in a Jupyter notebook.
- Required services
- Watson Studio
- Data formats
- Tabular: CSV files
- Data size
- Any
- How you can prepare data
- Import relevant data into a scenario and edit it.
- How you can build models
- Build prescriptive decision optimization models.
- Create, import, and edit models in Python DOcplex, in OPL, or with natural language expressions.
- Create, import, and edit models in notebooks.
- How you can solve models
- Run and solve decision optimization models using CPLEX engines.
- Investigate and compare solutions for multiple scenarios.
- Create tables, charts, and notes to visualize data and solutions for one or more scenarios.
- Get started
- To create a Decision Optimization model, click New asset > Solve optimization problems, or for notebooks click New asset > Work with data and models in Python or R notebooks.
- Learn more
- Documentation about Decision Optimization
- Videos about Decision Optimization
Watch a video to see how to build a Decision Optimization experiment
AutoAI tool
Use the AutoAI tool to automatically analyze your tabular data and generate candidate model pipelines customized for your predictive modeling problem.
- Required services
- Watson Machine Learning
- Watson Studio
- Data format
- Tabular: CSV files
- Data size
- Depends on model type. See AutoAI Overview for details.
- How you can prepare data
- Automatically transform data, for example by imputing missing values and transforming text to scalar values.
- How you can build models
- Train a binary classification, multiclass classification, or regression model.
- View a tree infographic that shows the sequences of AutoAI training stages.
- Generate a leaderboard of model pipelines ranked by cross-validation scores.
- Save a pipeline as a model.
- Get started
- To create an AutoAI experiment, click New asset > Build machine learning models automatically.
- Learn more
- Documentation about AutoAI
- Videos about AutoAI
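To make "a leaderboard of model pipelines ranked by cross-validation scores" concrete, here is a minimal sketch in plain Python. It is not AutoAI's algorithm: the two candidate "pipelines" (a mean baseline and a least-squares line), the fold layout, the scoring metric, and the sample data are all assumptions chosen to keep the example small.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    average = mean(ys)
    return lambda x: average


def fit_line(xs, ys):
    """Candidate 2: ordinary least-squares line through the training data."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return lambda x: mean_y + slope * (x - mean_x)


CANDIDATES = {"mean baseline": fit_mean, "linear fit": fit_line}


def cross_val_mse(fit, xs, ys, folds=3):
    """Mean squared error over held-out points, using interleaved folds."""
    errors = []
    for k in range(folds):
        test_idx = set(range(k, len(xs), folds))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        model = fit(train_x, train_y)
        errors.extend((model(xs[i]) - ys[i]) ** 2 for i in sorted(test_idx))
    return mean(errors)


def leaderboard(xs, ys):
    """Rank candidates by cross-validated error, best (lowest) first."""
    scores = {name: cross_val_mse(fit, xs, ys) for name, fit in CANDIDATES.items()}
    return sorted(scores.items(), key=lambda item: item[1])


xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
board = leaderboard(xs, ys)
```

On this nearly linear sample data, the linear fit ranks first because its held-out error is much lower than the baseline's.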
Watch a video to see how to build an AutoAI experiment
Federated Learning
Use the Federated Learning tool to train a common model using distributed data. The data is never combined or shared, preserving data integrity while providing all participating parties with a model based on the aggregated data.
- Required services
- Watson Studio
- Watson Machine Learning
- Data format
- Any
- Data size
- Any size
- How you can build models
- Choose a training framework.
- Configure the common model.
- Configure a file for training the common model.
- Have remote parties train the model on their data.
- Deploy the common model.
- Get started
- To create an experiment, click New asset > Train models on distributed data.
- Learn more
- Documentation about Federated Learning
- Videos about Federated Learning
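The principle that each party trains locally and only model updates are combined can be sketched with a generic weighted-averaging scheme. The Python below is an illustration under stated assumptions, not the Federated Learning tool's implementation: the one-parameter model `y ≈ weight * x`, the learning rate, and the two parties' data are hypothetical.

```python
def local_update(weight, data, learning_rate=0.01, epochs=20):
    """One party: run gradient descent for y ≈ weight * x on its own data.

    Only the resulting weight leaves the party; the data itself is never sent.
    """
    for _ in range(epochs):
        gradient = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= learning_rate * gradient
    return weight


def federated_round(global_weight, parties):
    """Aggregator: average the parties' weights, weighted by record count."""
    updates = [(local_update(global_weight, data), len(data)) for data in parties]
    total_records = sum(count for _, count in updates)
    return sum(weight * count for weight, count in updates) / total_records


# Two parties, each holding private (x, y) records that roughly follow y = 3x.
parties = [
    [(1.0, 3.1), (2.0, 5.9)],
    [(3.0, 9.2), (4.0, 11.8), (5.0, 15.1)],
]

weight = 0.0
for _ in range(10):
    weight = federated_round(weight, parties)
```

After a few rounds the shared `weight` settles close to 3, even though the aggregator only ever sees each party's weight and record count.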
Watch a video to see how to build a Federated Learning experiment
Metadata import
Use the metadata import tool to automatically discover and import technical and process metadata for data assets into a project or a catalog.
- Required service
- IBM Knowledge Catalog
- Data format
- Any
- Data size
- Any size
- How you can prepare data
- Import data assets from a connection to a data source.
- Get started
- To import metadata, click New asset > Import metadata for data assets.
- Learn more
- Documentation about metadata import
- Videos about IBM Knowledge Catalog
Watch a video to see how to import asset metadata
Metadata enrichment
Use the metadata enrichment tool to automatically profile data assets and analyze data quality in a project.
- Required service
- IBM Knowledge Catalog
- Data format
- Relational and structured: Tables and files in relational and nonrelational data sources
- Tabular: Avro, CSV, or Parquet files
- Data size
- Any size
- How you can prepare and analyze data
- Profile and analyze a select set of data assets in a project.
- Get started
- To enrich data, click New asset > Enrich data assets with metadata.
- Learn more
- Documentation about metadata enrichment
- Videos about IBM Knowledge Catalog
Watch a video to see how to enrich data assets
Data quality rule
Use the data quality tool to create rules that analyze data quality in a project.
- Required service
- IBM Knowledge Catalog
- Data format
- Relational and structured: Tables and files in relational and nonrelational data sources
- Tabular: Avro, CSV, or Parquet files
- Data size
- Any size
- How you can prepare and analyze data
- Analyze the quality of a select set of data assets in a project.
- Get started
- To create a data quality rule, click New asset > Measure and monitor data quality.
- Learn more
- Documentation about data quality rules
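As a rough picture of what a rule that analyzes data quality does, the sketch below checks each row against named conditions and reports the share of rows that pass. The rule names, the `evaluate` helper, and the sample rows are hypothetical, and the rule syntax in the product differs from this Python form.

```python
import re

# Each rule maps a description to a check that returns True when a row passes.
RULES = {
    "email is well formed": lambda row: re.fullmatch(
        r"[^@\s]+@[^@\s]+\.[^@\s]+", row["email"] or ""
    ) is not None,
    "age is between 0 and 120": lambda row: row["age"] is not None
    and 0 <= row["age"] <= 120,
}


def evaluate(rows, rules):
    """Return, for each rule, the fraction of rows that pass it."""
    return {
        name: sum(1 for row in rows if check(row)) / len(rows)
        for name, check in rules.items()
    }


rows = [
    {"email": "ada@example.com", "age": 36},
    {"email": "not-an-email", "age": 41},
    {"email": "lin@example.com", "age": 250},
    {"email": None, "age": None},
]
scores = evaluate(rows, RULES)
```

In this sample, half of the rows pass each rule: two emails are malformed or missing, and two ages are out of range or missing.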
IBM Match 360 with Watson
Use IBM Match 360 with Watson to create master data entities that represent digital twins of your customers. Model and map your data, then run the matching algorithm to create master data entities. Customize and tune your matching algorithm to meet your organization's requirements.
- Required services
- IBM Match 360 with Watson
- IBM Knowledge Catalog
- Data size
- Up to 1,000,000 records (for the Beta Lite plan)
- How you can prepare data
- Model and map data from sources across your organization.
- Run the customizable matching algorithm to create master data entities.
- View and edit master data entities and their associated records.
- Get started
- To create an IBM Match 360 configuration asset, click New asset > Consolidate data into 360-degree views.
- Learn more
- Documentation about IBM Match 360 with Watson
- Videos about IBM Match 360
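The general idea of matching records into master data entities can be illustrated with a simple similarity threshold. The sketch below is not the IBM Match 360 matching algorithm. It uses Python's standard `difflib.SequenceMatcher`, and the compared fields, the greedy grouping, the threshold of 0.8, and the sample records are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b, fields=("name", "email")):
    """Average string similarity across the compared fields, from 0 to 1."""
    ratios = [
        SequenceMatcher(None, a[field].lower(), b[field].lower()).ratio()
        for field in fields
    ]
    return sum(ratios) / len(ratios)


def match_entities(records, threshold=0.8):
    """Greedy grouping: add a record to the first entity whose first member
    is similar enough; otherwise start a new entity."""
    entities = []
    for record in records:
        for entity in entities:
            if similarity(record, entity[0]) >= threshold:
                entity.append(record)
                break
        else:
            entities.append([record])
    return entities


records = [
    {"name": "Jon Smith", "email": "jon.smith@example.com"},
    {"name": "John Smith", "email": "jon.smith@example.com"},
    {"name": "Maria Garcia", "email": "mgarcia@example.org"},
]
entities = match_entities(records)
```

Here the two Smith records fall into one entity and the Garcia record into another. Raising or lowering `threshold` is the simplest form of tuning: a higher value merges fewer records, a lower value merges more.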
Watch a video to see how to use IBM Match 360
RStudio IDE
Use RStudio IDE to analyze data or create Shiny applications by writing R code.
- Required service
- Watson Studio
- Data format
- Any
- Data size
- Any size
- How you can prepare data, analyze data, and build models
- Write code in R.
- Create Shiny apps.
- Use open source libraries and packages.
- Include rich text and media with your code.
- Prepare data.
- Visualize data.
- Discover insights from data.
- Build and train a model using open source libraries.
- Share your Shiny app in a Git repository.
- Get started
- To use RStudio, click Launch IDE > RStudio.
- Learn more
- Documentation about RStudio
- Videos about RStudio
Watch a video to see an overview of the RStudio IDE
Masking flows
Use the Masking flow tool to prepare masked copies or masked subsets of data from the catalog. Data is de-identified using advanced masking options with data protection rules.
- Required service
- IBM Knowledge Catalog
- Data format
- Relational: Tables in relational data sources
- Data size
- Any size
- How you can prepare data, analyze data, or build models
- Import data assets from a governed catalog to a project.
- Create masking flow job definitions to specify what data to mask with data protection rules.
- Optionally subset data to reduce the size of the copied data.
- Run masking flow jobs to load masked copies to target database connections.
- Get started
- Ensure that the prerequisite steps in IBM Knowledge Catalog are completed. To privatize data, do one of the following tasks:
- Click New asset > Copy and mask data.
- Click the menu options for individual data assets to mask that asset directly.
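To show what a masked copy and a masked subset look like in practice, here is a small Python sketch that replaces selected columns with a salted hash and optionally keeps only some rows. It illustrates one simple de-identification technique under stated assumptions and is not how masking flows or data protection rules are implemented; `mask_value`, `mask_rows`, the salt, and the sample rows are hypothetical.

```python
import hashlib


def mask_value(value, salt):
    """Replace a value with a short salted SHA-256 digest.

    The same input always yields the same token, so joins on the masked
    column still line up, but the original value is not shown.
    """
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:12]


def mask_rows(rows, columns_to_mask, salt, keep=lambda row: True):
    """Return a masked copy of the rows; `keep` optionally subsets them."""
    masked = []
    for row in rows:
        if not keep(row):
            continue
        masked.append(
            {
                column: mask_value(value, salt) if column in columns_to_mask else value
                for column, value in row.items()
            }
        )
    return masked


rows = [
    {"id": 1, "email": "ada@example.com", "country": "UK"},
    {"id": 2, "email": "lin@example.com", "country": "US"},
    {"id": 3, "email": "ada@example.com", "country": "UK"},
]

masked_copy = mask_rows(rows, {"email"}, salt="demo-salt")
masked_subset = mask_rows(
    rows, {"email"}, salt="demo-salt", keep=lambda row: row["country"] == "UK"
)
```

The source `rows` are untouched, `masked_copy` has every email replaced by a token, and `masked_subset` contains only the UK rows.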
Watch a video to see how to create a masking flow
Orchestration Pipelines
Use the Pipelines canvas editor to create a flow to prepare, visualize, and analyze data, or build and train a model.
- Data format
- Any
- Data size
- Any
- How you can prepare data, analyze data, or build models
- Use a variety of nodes that each contain their own logs.
- Incorporate notebooks into the flow to run any Python or R code.
- Work with any kind of data in any way you want.
- Schedule runs of your flow.
- Import data from your mounted PVC or project, or ingest data from GitHub.
- Create custom components with Python code.
- Add conditions to your pipelines to monitor data quality however you want.
- Use webhooks to send emails or messages that keep you up to date on the status of your flow.
- Get started
- To create a new pipeline, click New asset > Automate model lifecycle.
- Learn more
- Documentation about Orchestration Pipelines
- Videos about Orchestration Pipelines
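A pipeline with a condition on data quality can be pictured as a list of nodes where each node runs only if its condition holds and records its own log entry. The following Python sketch shows that idea only. It is not the Pipelines runtime, and the node names, the `max_missing` threshold, and the sample data are hypothetical.

```python
def load(ctx):
    """Node action: load some rows (hard-coded here for illustration)."""
    ctx["rows"] = [{"x": 1.0}, {"x": None}, {"x": 3.0}, {"x": 4.0}]
    return ctx


def measure_quality(ctx):
    """Node action: record the share of rows with a missing value."""
    rows = ctx["rows"]
    ctx["missing_ratio"] = sum(row["x"] is None for row in rows) / len(rows)
    return ctx


def train(ctx):
    """Node action: a stand-in 'model' that is just the mean of the values."""
    values = [row["x"] for row in ctx["rows"] if row["x"] is not None]
    ctx["model"] = sum(values) / len(values)
    return ctx


def always(ctx):
    return True


def build_pipeline(max_missing):
    """Each node is (name, condition, action); training is conditional."""
    return [
        ("load data", always, load),
        ("measure quality", always, measure_quality),
        ("train model", lambda ctx: ctx["missing_ratio"] <= max_missing, train),
    ]


def run_pipeline(nodes):
    """Run nodes in order, skipping any node whose condition is false."""
    ctx, logs = {}, {}
    for name, condition, action in nodes:
        if condition(ctx):
            ctx = action(ctx)
            logs[name] = "ran"
        else:
            logs[name] = "skipped"
    return ctx, logs


lenient_ctx, lenient_logs = run_pipeline(build_pipeline(max_missing=0.5))
strict_ctx, strict_logs = run_pipeline(build_pipeline(max_missing=0.1))
```

With the lenient threshold the training node runs; with the strict threshold it is skipped because a quarter of the sample rows have a missing value.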
Watch a video to see how to create a pipeline
Data visualizations
Use data visualizations to discover insights from your data. By exploring data from different perspectives with visualizations, you can identify patterns, connections, and relationships within that data and quickly understand large amounts of information.
- Data format
- Tabular: Avro, CSV, JSON, Parquet, TSV, SAV, Microsoft Excel .xls and .xlsx files, SAS, delimited text files, and connected data. For more information about supported data sources, see Connectors.
- Data size
- No limit
- Get started
- To create a visualization, click Data asset in the list of asset types in your project, and select a data asset. Click the Visualization tab, and choose a chart type.
- Learn more
- Visualizing your data