Creating a virtualized table from files in Cloud Object Storage in Watson Query
You can use data that is stored as files on IBM® Cloud Object Storage, Amazon S3, or Ceph® data sources to create virtual tables. Segment or combine data from one or more files to create a virtual table.
Before you begin
To access data in cloud object storage, you must create a connection to the data source where the files are located, such as Amazon S3, Ceph, or IBM Cloud Object Storage. For more information about object storage data sources, see Data sources in object storage in Watson Query.
About this task
Watch the following video for an overview of virtualized files in Cloud Object Storage in Watson Query.
This video provides a visual method as an alternative to following the written steps in this documentation.
- On the navigation menu, click . The service menu opens to the Data sources page by default.
- On the service menu, click
and then click the The list of available data sources appears. You can narrow down the displayed assets by using the available filters.
If you specified a bucket name when you set up the data source connection, click to expand the object storage connection details to see the Service type and Bucket information. If you didn't specify a bucket name when you set up the data source connection, you can use the Bucket input field to find a specific bucket in the endpoint.
If the list of data sources does not appear, click Refresh.
- Select the object storage endpoint where you want to browse files and file
paths. For Cloud Object Storage, the endpoint is the URL for the object storage.A list of file paths or buckets in the endpoint appears. You can navigate through the file path structure or you can click to see details of the contents of the first file in the file path.
You cannot add the bucket to your cart. To add a file path to your cart, you must select the file path to preview the files in the path, and then click Add to cart. You cannot select a file at the bucket level, you must add the file to a file path in the bucket.
- Select the file or file path that you want to virtualize and click Add to
- You cannot virtualize a single file in a file path that contains multiple files. The URL resolves to the parent path where the file is located and the entire file path is virtualized. If you want to virtualize a single file, you can move it to a separate directory. The separate directory must not be a subdirectory of any other directory that is, or will be, virtualized.
- Files that you want to virtualize must be within a file path and not at the same level as the bucket. For example, you cannot virtualize a file s3a://mynewbigsqlbucket/mydata.csv; you must put mydata.csv into a file path and virtualize s3a://mynewbigsqlbucket/fi1epath1/mydata.csv because the virtualization process cannot create an external table by using only a bucket name without a path.
Click View cart to preview your file data selections as a virtual file.
From this window, you can edit schema names, preview files that participate in a merged table, or remove a selection from your cart.
If you have Watson™ Knowledge Catalog installed, you can publish your virtual table to a catalog. For more information, see Publishing virtual data to the catalog in Watson Query.
- Recommended: Update the type of partitioned columns from STRING to something more appropriate. Manually inspect and specify proper types for partitioning columns for best performance.
- Optional: Click and
select Modify columns. You can edit any column name that is not tagged as a Partitioning column and change column types by using the drop-down menu. When you are happy with your edits, click Apply. Updated column names are shown after you virtualize the table.
- Select the appropriate option to assign the virtual table to be created from file data:
Assign to When to use this option Project Select Project if you created the virtual table to use in a specific project. Then, choose the appropriate project. The table also appears in Virtualized data. Virtualized data Select Virtualized data if the table was not created to use in a specific project. This setting is the default if no projects exist.
- Select Publish to catalog if you also
want to publish to a selected catalog. A list of available catalogs is shown in the drop-down menu. Each catalog is tagged as Governed or Not governed.Note: You must have at least one catalog in Watson Knowledge Catalog.
You must have permission to publish to a catalog. An administrator can enable whether all virtual objects are published to a selected governed catalog, which prevents a user from publishing to a specified catalog.
- Specify a schema in the Schema
field. You can also create a schema by following these steps.
For more information, see Creating schemas for virtual objects.
- If you have the Watson Query Engineer or User role, leave the Schema field as default to create a schema with your user ID.
- If you have the Watson Query Manager role, leave the Schema field as default to create a schema with your user ID or enter the new schema name in the Schema field.
Click Virtualize to complete the process.
When the status window appears, you can select to view your virtualized data or virtualize more data. However, you must wait until virtualization is complete before you navigate away from the page.
What to do next
- View the table structure and metadata.
- Edit column names and the types of your object storage assets so that you can prepare accurate data for virtualization.
- Manage access to the table.
- Collect statistics for your virtualized table to optimize query performance. For more information, see Collecting statistics in Watson Query.
Optionally, on the Virtualized data page, publish your virtual object to the catalog. For more information, see Publishing virtual data to the catalog in Watson Query.