Data integration use case

Last updated: Nov 26, 2024

To cope with growing data volumes and disparate data sources, enterprises need to build automation and intelligence into their data integration processes. Cloud Pak for Data as a Service provides the platform and tools to dynamically and intelligently orchestrate data across a distributed landscape, creating a high-performance network of instantly available information for data consumers.

Watch this video to see the data fabric use case for implementing a Data integration solution in Cloud Pak for Data.

Challenges

As their data types and volumes grow, enterprises face the following data integration challenges:

Ingesting data from across the enterprise: Processes need to be able to ingest data from any application or system regardless of whether the data resides on premises, in the cloud, or in a hybrid environment.
Integrating data from multiple sources: Data engineers must be able to combine data from multiple data sources into a single data set as a file or a virtual table.
Making the data available for users: Data engineers need to be able to publish each integrated data set to a single catalog, and all users who need to consume the data need to have self-service access to it.

You can solve these challenges and integrate your data by using Cloud Pak for Data as a Service.

Example: Golden Bank's challenges

Follow the story of Golden Bank as the data engineering team implements Data integration. Golden Bank has a large amount of customer and mortgage data that is stored in three external data sources. Lenders use this information to help them decide whether they should approve or deny mortgage applications. The bank wants to integrate the data from the different sources, and then deliver that transformed data to a single output file that can be shared.

Process

To implement a Data integration solution for your enterprise, your organization can follow this process:

Integrate the data
Share the data
Automate the data lifecycle

The DataStage, Data Virtualization, Data Replication, and IBM Knowledge Catalog services in Cloud Pak for Data as a Service provide all of the tools and processes that your organization needs to implement a Data integration solution.

Figure: Flow of the Data integration use case

1. Integrate the data

With a data fabric architecture that uses Cloud Pak for Data as a Service, data engineers can optimize data integration by using workloads and data policies to access and work with data efficiently. They can combine virtualized data from different sources, types, and clouds as if the data were from a single data source. In this step of the process, the raw data is extracted, ingested, virtualized, and transformed into consumable, high-quality data that is ready to be explored and then orchestrated in your AI lifecycle.

What you can use	What you can do	Best to use when
Data Virtualization	Query many data sources as one. Data engineers can create virtual data tables that can combine, join, or filter data from various relational data sources. Data engineers can then make the resulting combined data available as data assets in catalogs. For example, you can use the combined data to feed dashboards, notebooks, and flows so that the data can be explored.	You need to combine data from multiple sources to generate views. You need to make combined data available as data assets in a catalog.
DataStage	Data engineers can design and run complex ETL data pipelines that move and transform data.	You need to design and run complex data flows. The flows must handle large volumes of data and connect to a wide range of data sources, integrate and transform data, and deliver it to your target system in batch or real time.
Data Refinery	Access and refine data from diverse data source connections. Materialize the resulting data sets as snapshots in time that might combine, join, filter, or mask data to make it usable for data scientists to analyze and explore. Make the resulting data sets available in catalogs.	You need to visualize the data when you want to shape or cleanse it. You want to simplify the process of preparing large amounts of raw data for analysis.
Data Replication	Distribute a data integration workload across multiple sites. Provide continuous availability of data.	Your data is distributed across multiple sites. You need your data to be continuously available.
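
The idea of querying many data sources as one can be sketched in a few lines. This is a minimal illustration only, not the Data Virtualization interface: it assumes Python's built-in sqlite3 module, two attached in-memory databases that stand in for separate sources, and hypothetical table, column, and view names.

```python
import sqlite3


def build_virtual_view() -> sqlite3.Connection:
    """Attach two stand-in sources and expose them through one joined view."""
    conn = sqlite3.connect(":memory:")
    # Each attached in-memory database stands in for a separate data source.
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("ATTACH DATABASE ':memory:' AS loans")
    conn.execute("CREATE TABLE crm.applicants (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE loans.applications (applicant_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO crm.applicants VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
    )
    conn.executemany(
        "INSERT INTO loans.applications VALUES (?, ?)",
        [(1, 250000.0), (2, 410000.0)],
    )
    # In SQLite, only a TEMP view may reference tables in other attached databases.
    # No rows are copied: the join runs against the sources at query time.
    conn.execute(
        """
        CREATE TEMP VIEW applicant_loans AS
        SELECT a.id, a.name, l.amount
        FROM crm.applicants AS a
        JOIN loans.applications AS l ON l.applicant_id = a.id
        """
    )
    return conn


def loans_above(conn: sqlite3.Connection, minimum: float) -> list:
    """Query the view as if it were a single table."""
    cursor = conn.execute(
        "SELECT name, amount FROM applicant_loans WHERE amount > ? ORDER BY id",
        (minimum,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    print(loans_above(build_virtual_view(), 300000))
```

Consumers of the view never need to know which source holds which column, which is the property that a virtual table provides.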

Example: Golden Bank's data integration

Risk analysts at Golden Bank calculate the daily interest rate that they recommend offering to borrowers for each credit score range. Data engineers use DataStage to aggregate anonymized mortgage application data with the personally identifiable information from mortgage applicants. DataStage integrates this information, including credit score information for each applicant, the applicant’s total debt, and an interest-rate lookup table. The data engineers then load the data into a target output .csv file that can be published to a catalog and shared for use by lenders and analysts.
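
The integration that this example describes (join the records, look up a rate, write a .csv file) can be sketched as follows. This is an illustration under stated assumptions, not a DataStage flow: the field names, the rate bands in RATE_LOOKUP, and the helper functions are all hypothetical, and the code uses only the Python standard library.

```python
import csv
import io

# Hypothetical interest-rate lookup table: (min_score, max_score, rate_percent).
RATE_LOOKUP = [(300, 579, 7.5), (580, 739, 6.0), (740, 850, 5.0)]

FIELDS = ["id", "name", "loan_amount", "credit_score", "total_debt", "interest_rate"]


def lookup_rate(score: int) -> float:
    """Return the rate for the band that contains the credit score."""
    for low, high, rate in RATE_LOOKUP:
        if low <= score <= high:
            return rate
    raise ValueError(f"no rate band for credit score {score}")


def integrate(applications: list, applicants: list) -> list:
    """Inner-join application rows to applicant rows on "id" and add a rate."""
    applicants_by_id = {row["id"]: row for row in applicants}
    merged = []
    for application in applications:
        person = applicants_by_id.get(application["id"])
        if person is None:
            continue  # Inner join: skip applications with no matching applicant.
        score = int(person["credit_score"])
        merged.append(
            {
                "id": application["id"],
                "name": person["name"],
                "loan_amount": application["loan_amount"],
                "credit_score": score,
                "total_debt": person["total_debt"],
                "interest_rate": lookup_rate(score),
            }
        )
    return merged


def to_csv(rows: list) -> str:
    """Serialize the integrated rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keying the applicant rows by "id" before the loop keeps the join to a single pass over each input, which matters once the inputs grow beyond a few rows.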

2. Share the data

The catalog helps your teams understand your customer data and makes the right data available for the right use. Data scientists and other types of users can help themselves to the integrated data that they need while they remain compliant with corporate access and data protection policies. They can add data assets from a catalog into a project, where they collaborate to prepare, analyze, and model the data.

What you can use	What you can do	Best to use when
Catalogs	Use catalogs in IBM Knowledge Catalog to organize your assets to share among the collaborators in your organization. Take advantage of AI-powered semantic search and recommendations to help users find what they need.	Your users need to easily understand, collaborate on, enrich, and access high-quality data. You want to increase the visibility of data and collaboration between business users. You need users to view, access, manipulate, and analyze data without understanding its physical format or location, and without having to move or copy it. You want users to enhance assets by rating and reviewing them.

Example: Golden Bank's catalog

The governance team leader at Golden Bank creates a catalog, "Mortgage Approval Catalog," and adds the data stewards and data scientists as catalog collaborators. The data stewards publish the data assets that they created into the catalog. The data scientists find the data assets, curated by the data stewards, in the catalog and copy those assets to a project. In their project, the data scientists can refine the data to prepare it for training a model.
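
The publish, find, and copy workflow in this example can be modeled with a small data structure. The sketch below is a toy model, not the IBM Knowledge Catalog API: the Catalog class, the "editor" and "viewer" role names, and the asset name are hypothetical and exist only to show how collaborator roles gate each action.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy catalog: collaborators have roles, assets have descriptions."""

    name: str
    collaborators: dict = field(default_factory=dict)  # user -> role
    assets: dict = field(default_factory=dict)  # asset name -> description

    def add_collaborator(self, user: str, role: str) -> None:
        self.collaborators[user] = role

    def publish(self, user: str, asset_name: str, description: str) -> None:
        # Only collaborators with the hypothetical "editor" role may publish.
        if self.collaborators.get(user) != "editor":
            raise PermissionError(f"{user} cannot publish to {self.name}")
        self.assets[asset_name] = description

    def find(self, user: str, keyword: str) -> list:
        # Any collaborator may search asset names and descriptions.
        if user not in self.collaborators:
            raise PermissionError(f"{user} is not a collaborator of {self.name}")
        needle = keyword.lower()
        return sorted(
            asset_name
            for asset_name, description in self.assets.items()
            if needle in asset_name.lower() or needle in description.lower()
        )

    def copy_to_project(self, user: str, asset_name: str, project: dict) -> None:
        # Copy an asset into a project, modeled here as a plain dictionary.
        if user not in self.collaborators:
            raise PermissionError(f"{user} is not a collaborator of {self.name}")
        project[asset_name] = self.assets[asset_name]
```

In this model, a data steward with the editor role publishes an asset, and a data scientist with the viewer role finds it by keyword and copies it into a project, without being able to publish.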

3. Automate the data lifecycle

Your team can automate and simplify the data lifecycle with Orchestration Pipelines.

What you can use	What you can do	Best to use when
Orchestration Pipelines	Use pipelines to create repeatable and scheduled flows that automate your data ingestion and integration.	You want to automate some or all of the steps in a data integration flow.

Example: Golden Bank's automated data lifecycle

The data scientists at Golden Bank can use pipelines to automate their data integration lifecycle to keep the data current.
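
A repeatable flow of this kind amounts to an ordered list of steps that runs the same way each time. The sketch below is a minimal illustration, not the Orchestration Pipelines interface: run_pipeline and the three step functions are hypothetical, and scheduling is left out.

```python
from typing import Callable, List, Optional, Tuple

Step = Callable[[dict], dict]


def run_pipeline(steps: List[Tuple[str, Step]], context: Optional[dict] = None) -> dict:
    """Run named steps in order, passing each step's output to the next step."""
    state = dict(context or {})
    completed = []
    for name, step in steps:
        state = step(state)
        completed.append(name)
    state["completed_steps"] = completed
    return state


def extract(state: dict) -> dict:
    # Stand-in for reading from a source system.
    return {**state, "rows": [{"id": "A1", "amount": "250000"}]}


def transform(state: dict) -> dict:
    # Convert the amount from text to a number.
    rows = [{**row, "amount": float(row["amount"])} for row in state["rows"]]
    return {**state, "rows": rows}


def publish(state: dict) -> dict:
    # Stand-in for writing the result to a catalog.
    return {**state, "published": len(state["rows"])}


if __name__ == "__main__":
    result = run_pipeline(
        [("extract", extract), ("transform", transform), ("publish", publish)]
    )
    print(result["completed_steps"], result["published"])
```

Because each step takes and returns a plain dictionary, a step can be added, removed, or reordered without changing the runner, and a scheduler can rerun the same list to keep the data current.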

Tutorials for Data integration

Tutorial	Description	Expertise for tutorial
Integrate data	Extract, filter, join, and transform your data.	Use the DataStage drag and drop interface to transform data.
Virtualize external data	Virtualize and join data tables from external sources.	Use the Data Virtualization interface to virtualize data.
Replicate data	Set up near real time and continuous replication between source and target databases.	Use the Data Replication tool to replicate data.
Orchestrate an AI pipeline with data integration	Create an end-to-end pipeline that prepares data and trains a model.	Use the Orchestration Pipelines drag and drop interface to create a pipeline.