Multicloud data integration use case
To cope with growing data volumes and increasingly disparate data sources, enterprises need to build automation and intelligence into their data integration processes. Cloud Pak for Data as a Service provides the platform and tools to dynamically and intelligently orchestrate data across a distributed landscape to create a high-performance network of instantly available information for data consumers.
Watch this video to see the data fabric use case for implementing a multicloud data integration solution in Cloud Pak for Data. The video provides a visual alternative to following the written steps in this documentation.
As their data types and volumes grow, enterprises face the following data integration challenges:
Ingesting data from across the enterprise
Processes need to be able to ingest data from any application or system regardless of whether the data resides on premises, in the cloud, or in a hybrid environment.
Integrating data from multiple sources
Organizations need to be able to automate the bulk ingestion, cleansing, and complex transformations of data.
Making the data available for users
Data engineers need to be able to publish each integrated data set to a single catalog, and all users who need to consume the data need to have self-service access to it.
You can solve these challenges by implementing a multicloud data integration solution with data fabric on Cloud Pak for Data as a Service.
Example: Golden Bank's challenges
Follow the story of Golden Bank as its data engineering team implements multicloud data integration. Golden Bank has a large amount of customer and mortgage data that is stored in three external data sources. Lenders use this information to help them decide whether to approve or deny mortgage applications. The bank wants to integrate the data from the different sources, and then deliver the transformed data to a single output file that can be shared.
The DataStage, Watson Query, and Watson Knowledge Catalog services in Cloud Pak for Data as a Service provide the tools and processes that your organization needs to implement a multicloud data integration solution. To implement the solution, your organization can follow this process:
1. Integrate the data
With a data fabric architecture on Cloud Pak for Data as a Service, data engineers can optimize data integration. They use workloads and data policies to efficiently access and work with data, and they combine virtualized data from different sources, types, and clouds as if it came from a single data source. In this step of the process, the raw data is extracted, ingested, virtualized, and transformed into consumable, high-quality data that is ready to be explored and then orchestrated in your AI lifecycle.
| What you can use | What you can do | Best to use when |
| --- | --- | --- |
| Watson Query | Query many data sources as one. Data engineers can create virtual data tables that combine, join, or filter data from various relational data sources. They can then make the resulting combined data available as data assets in catalogs. For example, you can use the combined data to feed dashboards, notebooks, and flows so that the data can be explored. | You need to combine data from multiple sources to generate views. You need to make combined data available as data assets in a catalog. |
| DataStage | Design and run complex data flows that move and transform data. | You need to design and run complex data flows that handle large volumes of data, connect to a wide range of data sources, integrate and transform data, and deliver it to your target system in batch or real time. |
| Data Refinery | Access and refine data from diverse data source connections. Materialize the resulting data sets as snapshots in time that might combine, join, filter, or mask data to make it usable for data scientists to analyze and explore. Make the resulting data sets available in catalogs. | You need to visualize the data when you want to make changes to it. You want to simplify the process of preparing large amounts of raw data for analysis. |
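Conceptually, a virtual data table of the kind Watson Query creates is a SQL view that joins source tables without copying the data. The following minimal sketch uses Python's built-in sqlite3 module to stand in for the federated engine (which Watson Query provides across remote sources); the table and column names are hypothetical:

```python
import sqlite3

# An in-memory database stands in for the federated engine; in practice
# the two tables would live in separate remote data sources.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE applicants (id INTEGER, name TEXT, credit_score INTEGER);
CREATE TABLE applications (applicant_id INTEGER, loan_amount REAL);
INSERT INTO applicants VALUES (1, 'Ana', 720), (2, 'Ben', 650);
INSERT INTO applications VALUES (1, 250000.0), (2, 180000.0);

-- The view plays the role of a virtual data table: it combines, joins,
-- and filters data from the sources without moving or copying it.
CREATE VIEW mortgage_view AS
SELECT a.name, a.credit_score, p.loan_amount
FROM applicants a JOIN applications p ON a.id = p.applicant_id
WHERE a.credit_score >= 700;
""")

rows = con.execute("SELECT * FROM mortgage_view").fetchall()
print(rows)  # [('Ana', 720, 250000.0)]
```

Consumers query `mortgage_view` like any other table; the join and filter run against the underlying sources each time, so the view always reflects current data.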
Example: Golden Bank's data integration
Risk analysts at Golden Bank calculate the daily interest rate that they recommend offering to borrowers for each credit score range. Data engineers use DataStage to aggregate anonymized mortgage application data with the personally identifiable information from mortgage applicants. DataStage integrates this information, including credit score information for each applicant, the applicant’s total debt, and an interest-rate lookup table. The data engineers then load the data into a target output .csv file that can be published to a catalog and shared for use by lenders and analysts.
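The kind of transformation the data engineers build in DataStage (a join of application data with applicant details plus an interest-rate lookup, delivered to a target .csv file) can be sketched in plain Python. The field names and rate values below are hypothetical; in DataStage the same logic is expressed as a visual flow over real connections:

```python
import csv
import io

# Hypothetical in-memory inputs standing in for the external data sources.
applications = [{"id": "1", "amount": "250000"}, {"id": "2", "amount": "180000"}]
applicants = {"1": {"name": "Ana", "credit_score": 720},
              "2": {"name": "Ben", "credit_score": 650}}

def rate_for(score):
    # Stand-in for the interest-rate lookup table keyed by credit-score range.
    return 4.5 if score >= 700 else 6.0

# Join each application with the applicant's details and the rate lookup,
# then write the integrated rows to a target .csv output.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "credit_score", "amount", "rate"])
writer.writeheader()
for app in applications:
    person = applicants[app["id"]]
    writer.writerow({"name": person["name"],
                     "credit_score": person["credit_score"],
                     "amount": app["amount"],
                     "rate": rate_for(person["credit_score"])})

csv_text = out.getvalue()
print(csv_text)
```

In the real flow the output would be written to a file and the resulting asset published to a catalog, as described in the next step.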
2. Share the data
The catalog helps your teams understand your customer data and makes the right data available for the right use. Data scientists and other types of users can help themselves to the integrated data that they need while they remain compliant with corporate access and data protection policies. They can add data assets from a catalog into a project, where they collaborate to prepare, analyze, and model the data.
| What you can use | What you can do | Best to use when |
| --- | --- | --- |
| Catalogs | Use catalogs in Watson Knowledge Catalog to organize your assets and share them among the collaborators in your organization. Take advantage of AI-powered semantic search and recommendations to help users find what they need. | Your users need to easily understand, collaborate on, enrich, and access high-quality data. You want to increase visibility of data and collaboration between business users. You need users to view, access, manipulate, and analyze data without understanding its physical format or location, and without having to move or copy it. You want users to enhance assets by rating and reviewing them. |
Example: Golden Bank's catalog
The governance team leader at Golden Bank creates a catalog, "Mortgage Approval Catalog," and adds the data stewards and data scientists as catalog collaborators. The data stewards publish the data assets that they created into the catalog. The data scientists find the data assets, curated by the data stewards, in the catalog and copy those assets to a project. In their project, the data scientists can refine the data to prepare it for training a model.
Tutorials for multicloud data integration
| Tutorial | Description | Expertise for tutorial |
| --- | --- | --- |
| Integrate data | Extract, filter, join, and transform your data. | Use the DataStage drag-and-drop interface to transform data. |
| Virtualize external data | Virtualize and join data tables from external sources. | Use the Watson Query interface to virtualize data. |
Parent topic: Data fabric solution overview