Using IBM Cloud SQL Query

IBM Cloud SQL Query is IBM's serverless SQL service for data on Cloud Object Storage (COS). It lets you run ANSI SQL on Parquet, CSV, JSON, ORC and AVRO data sets. You can use it to run your analytic queries, and you can use it to conduct complex transformations and write the result in any desired data format, partitioning and layout. SQL Query uses Apache Spark SQL as the query engine in the background, which means you do not have to provision any Apache Spark instance or service. For more background information, check out the SQL Query documentation. As mentioned in that documentation, you can run SQL queries directly in the SQL Query web UI, or a simple Python client (such as an IBM Watson Studio notebook) is sufficient.

This notebook is meant to be a generic starter for using the SQL Query API to run SQL statements in a programmatic way. It uses the ibmcloudsql Python library for this purpose. The first section demonstrates how to set up the required libraries, how to edit or generate SQL statements using the so-called sql_magic module, how to execute the SQL statements and, finally, how to work with the results. The later sections deal with topics for the more advanced user. For information about using the ibmcloudsql Python library, check out the documentation. The notebook has been verified to work with Python 3.6. As mentioned above, it does not require a Spark service bound to the notebook.

Run the following sequence of cells at least once in your notebook environment in order to install required packages:

- pyarrow - provides rich, powerful features for working with columnar data
- sqlparse - a non-validating SQL parser 
- ibmcloudsql - SQL Query client library 
In [ ]:
!pip install pyarrow 
!pip install sqlparse
In [ ]:
!pip install --upgrade --user  ibmcloudsql

Import (load) the following packages:

- ibmcloudsql - IBM Cloud SQL Query python client
- sqlparse - a non-validating SQL parser 
- pandas - an open source data manipulation and analysis tool 
- pygments - a syntax highlighter library used to pretty print the SQL statements
In [3]:
import ibmcloudsql
import sqlparse 
import pandas as pd
import getpass
import pprint
from pygments import highlight
from pygments.lexers import get_lexer_by_name
from pygments.formatters import HtmlFormatter, Terminal256Formatter
lexer = get_lexer_by_name("sql", stripall=True)
formatter = Terminal256Formatter(style='vim')
apikey=''
instancecrn=''
targeturl=''

2. Configure the SQL Query client


  1. You need an API key for an IBM Cloud identity. This single key provides you access to both your Cloud Object Storage (COS) bucket for writing SQL results and to your SQL Query instance. To create an API key, log on to the IBM Cloud console and go to Manage->Access (IAM), then select API Keys, click the Create an IBM Cloud API key button, give the key a custom name and click Create. In the next dialog click Show, copy the key to your clipboard and paste it below in this notebook.

  2. You need the instance CRN of the SQL Query instance. If you don't have an SQL Query instance created yet, create one first. If you already have one, you can find it in the IBM Cloud console dashboard. Make sure you select the right resource group under Group. In the section Services you can see the different types of services (created for the selected group); SQL Query service instances have an icon like the one at the top of this notebook. Select the instance of SQL Query that you want to use. In the SQL Query dashboard page that opens you find a section titled Overview with the deployment details, including the CRN. Click the button to copy the CRN into your clipboard and paste it here into the notebook.

  3. You need to specify the location on COS where your query results should be written. This is because the SQL Query instance needs to store the queried data on COS when it processes an ETL SQL statement. This COS location comprises three pieces of information that you can find in the Cloud Object Storage UI for your instance in the IBM Cloud console. You need to provide it as a target_cos_url using the format cos://<endpoint>/<bucket>/[<prefix>] (see the short sketch after this list).
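
For illustration, here is a minimal sketch of how such a target_cos_url can be assembled from its three parts. The endpoint, bucket name and prefix below are placeholders; substitute the values of your own Cloud Object Storage instance.

In [ ]:
# Hypothetical values - replace them with the endpoint (or region alias),
# bucket and optional prefix of your own COS instance.
endpoint = "us-south"                  # COS endpoint or region alias (placeholder)
bucket = "my-sql-results-bucket"       # bucket name (placeholder)
prefix = "results/"                    # optional prefix within the bucket (placeholder)
example_target_cos_url = "cos://{}/{}/{}".format(endpoint, bucket, prefix)
print(example_target_cos_url)          # prints: cos://us-south/my-sql-results-bucket/results/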

In case you want to use the cloud object storage bucket that is associated with your Watson Studio project as the target location for your query results, you can make use of the project token for access and construct the target_cos_url as follows:

2.1 Using the project bucket

Only follow the instructions in this section when you want to write your SQL query results to the bucket that has been created for the project in which you created this notebook. In any other case, proceed directly to section 2.2.

Inserting the project token:
Click the More option in the toolbar above (the three stacked dots) and select Insert project token.

  • If you haven't created an access token for this project before, you will see a dialog that asks you to create one first. Follow the link to open your project settings, scroll down to Access tokens and click New token. Give the token a custom name and make sure you select Editor as the Access role for project. After you have created your access token, come back to this notebook, select the empty cell below and again select Insert project token from the toolbar at the top.

This will add a new cell at the top of your notebook with content that looks like this:

# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='<some id>', project_access_token='<some access token>')
pc = project.project_context

Leave that cell content as inserted and run the cell. Then proceed with the following cell below:

In [ ]:
cos_bucket = project.get_metadata()['entity']['storage']['properties']
targeturl="cos://" + cos_bucket['bucket_region'] + "/" + cos_bucket['bucket_name'] + "/"
targeturl

2.2 Setting the SQL Query parameters

Now let's instantiate and configure an SQLQuery client. During instantiation the client accesses the SQL Query instance you selected and returns the link to the SQL Query service instance web console. You can use this link later to author an SQL statement interactively.

In addition to the parameters entered interactively here, there are two more optional parameters: max_concurrent_jobs (available since version 0.4) and max_tries. For details, see the ibmcloudsql documentation.

In [4]:
if apikey == '':
    apikey=getpass.getpass('Enter IBM Cloud API Key: ')
else:
    apikey=getpass.getpass('Enter a new IBM Cloud API Key or leave empty to use the previous one: ') or apikey
if instancecrn == '':
    instancecrn=input('Enter SQL Query instance CRN to use: ')
else:
    instancecrn=input('Enter new SQL Query instance CRN to use (leave empty to use ' + instancecrn + '): ') or instancecrn
if targeturl == '':
    targeturl=input('Enter target URL for SQL results: ')
else:
    targeturl=input('Enter new target URL for SQL results (leave empty to use ' + targeturl + '): ') or targeturl
 
 
sqlClient = ibmcloudsql.SQLQuery(apikey, instancecrn, client_info='SQL Query Starter Notebook', target_cos_url=targeturl, max_concurrent_jobs=4, max_tries=3 )
#sqlClient.configure()  # use this if you want to change the API key or SQL Query CRN later

sqlClient.logon()
Enter IBM Cloud API Key: ········
Enter SQL Query instance CRN to use: crn:v1:bluemix:public:sql-query:us-south:a/d86af7367f70fba4f306d3c19c938f2f:d1b2c005-e3d8-48c0-9247-e9726a7ed510::
Enter target URL for SQL results: cos://us-south/sqltempregional/
In [9]:
print('\nYour SQL Query web console link:\n')
sqlClient.sql_ui_link()
Your SQL Query web console link:

https://sql-query.cloud.ibm.com/sqlquery/?instance_crn=crn:v1:bluemix:public:sql-query:us-south:a/d86af7367f70fba4f306d3c19c938f2f:d1b2c005-e3d8-48c0-9247-e9726a7ed510::
Out[9]:
'https://sql-query.cloud.ibm.com/sqlquery/?instance_crn=crn:v1:bluemix:public:sql-query:us-south:a/d86af7367f70fba4f306d3c19c938f2f:d1b2c005-e3d8-48c0-9247-e9726a7ed510::'

3. Get the schema of your data


In order to work with your data using SQL Query you need to know the schema of your data. Since ibmcloudsql version 0.4 you can directly query the schema of the data. Let's have a look at the employees.parquet dataset provided with SQL Query as a sample dataset. Note that the first invocation uses the parameter dry_run=True, which is done here to demonstrate that you can get a preview of the statement being executed.

NOTE: If you want to work with a hive table, there is another API, describe_table(), that returns the schema of your defined table.

In [7]:
sqlClient.get_schema_data("cos://us-geo/sql/employees.parquet", type="parquet", dry_run=True)

sqlClient.get_schema_data("cos://us-geo/sql/employees.parquet", type="parquet")
        SELECT * FROM DESCRIBE(cos://us-geo/sql/employees.parquet STORED AS PARQUET)
        INTO cos://us-south/sqltempregional/ STORED AS JSON
        
Out[7]:
name nullable type
0 employeeID True integer
1 lastName True string
2 firstName True string
3 title True string
4 titleOfCourtesy True string
5 birthDate True timestamp
6 hireDate True timestamp
7 address True string
8 city True string
9 region True string
10 postalCode True string
11 country True string
12 homePhone True string
13 extension True integer
14 photo True string
15 notes True string
16 reportsTo True string
17 photoPath True string

4. Create your SQL statement


Now that we know the schema of our data sets we can create an SQL statement to be executed on these data. You provide the SQL statement in the form of a string. In the subsequent cell a sample SQL statement on the above listed example datasets provided with SQL Query is given. You can either use that sample statement for the subsequent steps or copy and paste your own statement, which you may have authored using the SQL Query Web Console (link above) of your SQL Query service instance.

In [8]:
sql=input('Enter your SQL statement (leave empty to use a simple sample SQL)')

if sql == '':
    sql='SELECT o.OrderID, c.CompanyName, e.FirstName, e.LastName FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o, \
         cos://us-geo/sql/employees.parquet STORED AS PARQUET e, cos://us-geo/sql/customers.parquet STORED AS PARQUET c \
         WHERE e.EmployeeID = o.EmployeeID AND c.CustomerID = o.CustomerID AND o.ShippedDate > o.RequiredDate AND o.OrderDate > "1998-01-01" \
         ORDER BY c.CompanyName'
if " INTO " not in sql:
    sql += ' INTO {}myQueryResult STORED AS CSV'.format(targeturl)
formatted_sql = sqlparse.format(sql, reindent=True, indent_tabs=True, keyword_case='upper')
lexer = get_lexer_by_name("sql", stripall=True)
formatter = Terminal256Formatter(style='tango')
result = highlight(formatted_sql, lexer, formatter)
from IPython.core.display import display, HTML
print('\nYour SQL statement is:\n')
print(result)
Enter your SQL statement (leave empty to use a simple sample SQL)

Your SQL statement is:

SELECT o.OrderID,
	c.CompanyName,
	e.FirstName,
	e.LastName
FROM cos://us-geo/SQL/orders.parquet STORED AS PARQUET o,
	cos://us-geo/SQL/employees.parquet STORED AS PARQUET e,
	cos://us-geo/SQL/customers.parquet STORED AS PARQUET c
WHERE e.EmployeeID = o.EmployeeID
		AND c.CustomerID = o.CustomerID
		AND o.ShippedDate > o.RequiredDate
		AND o.OrderDate > "1998-01-01"
ORDER BY c.CompanyName INTO cos://us-south/sqltempregional/myQueryResult STORED AS CSV

Since version 0.4, ibmcloudsql provides an alternative way to construct your SQL statement, using the functionality called *sql_magic*. The documentation is available here. The same example SQL statement as above can be generated with the sql_magic API as demonstrated in the following code.

In [9]:
sqlClient.reset_()
(sqlClient.select_("o.OrderID, c.CompanyName, e.FirstName, e.LastName")
        .from_cos_("cos://us-geo/sql/orders.parquet", format_type="parquet", alias="o")
        .from_cos_("cos://us-geo/sql/employees.parquet", format_type="parquet", alias="e")
        .from_cos_("cos://us-geo/sql/customers.parquet", alias="c")
        .where_('e.EmployeeID = o.EmployeeID AND c.CustomerID = o.CustomerID AND o.ShippedDate > o.RequiredDate AND o.OrderDate > "1998-01-01"')
        .order_by_("c.CompanyName")
        .store_at_(targeturl + "myResult", format_type="csv")
)

sqlClient.print_sql()
SELECT o.OrderID,
       c.CompanyName,
       e.FirstName,
       e.LastName
FROM cos://us-geo/sql/orders.parquet STORED AS parquet o, cos://us-geo/sql/employees.parquet STORED AS parquet e, cos://us-geo/sql/customers.parquet STORED AS parquet c
WHERE e.EmployeeID = o.EmployeeID
  AND c.CustomerID = o.CustomerID
  AND o.ShippedDate > o.RequiredDate
  AND o.OrderDate > "1998-01-01"
ORDER BY c.CompanyName INTO cos://us-south/sqltempregional/myResult STORED AS CSV

Note: sql_magic stores the constructed SQL string in the SQL client object so that you can use it (e.g. to submit() it for execution) in subsequent method calls on the client object.

5. Run your SQL statement


There are three options to execute an SQL statement, as described in the documentation.

  1. Synchronously with result dataframe - submit, wait (until the query is completed), and return the queried data as dataframe: run_sql()

  2. Asynchronously - submit, and return the control immediately (along with job_id) : submit_sql() (submit() when using sql_magic)

  3. Synchronously with optional result dataframe - submit, wait (until the query is completed), and return a tuple (data, job_id), where data is the optional result dataframe: execute_sql() (run() when using sql_magic)

The last option is useful to help avoid overloading the Python runtime memory. Another alternative to deal with big result sets is to paginate the result, i.e. use the pagesize option; for details see the section on paginated results below.

An SQL Query instance with the free Lite plan can process one query at a time. A Standard plan instance can by default process 5 concurrent queries, which can be increased via a support ticket. When the maximum has been reached, the Python SDK will retry (with progressive backoff timing) as many times as specified in max_tries when initializing the client.

5.1 Synchronous execution

This method provides a synchronous SQL execution mechanism. In the subsequent cell the above SQL statement is submitted. The method waits for the execution to finish. The queried data is returned synchronously as a dataframe.

In [10]:
result_df = sqlClient.run_sql(sql)
if isinstance(result_df, str):
    print(result_df)
In [11]:
result_df.head(10)
Out[11]:
OrderID CompanyName FirstName LastName
0 10924 Berglunds snabbköp Janet Leverling
1 11058 Blauer See Delikatessen Anne Dodsworth
2 10827 Bon app' Nancy Davolio
3 11076 Bon app' Margaret Peacock
4 11045 Bottom-Dollar Markets Michael Suyama
5 10970 Bólido Comidas preparadas Anne Dodsworth
6 11054 Cactus Comidas para llevar Laura Callahan
7 11008 Ernst Handel Robert King
8 11072 Ernst Handel Margaret Peacock
9 10816 Great Lakes Food Market Margaret Peacock

In the subsequent cell you see how easy it can be to visualize your result.

In [12]:
ax = result_df.FirstName.value_counts().plot(kind='bar', title="Orders per Employee")
ax.set_xlabel("First Name")
ax.set_ylabel("Order Count");

5.2 Asynchronous Execution

In the subsequent cell the SQL statement created earlier is submitted. The method returns right away without waiting for the completion of the job execution.

submit_sql() runs the given SQL string, while submit() runs the SQL string that was generated internally via the sql_magic module.

In [13]:
jobId = sqlClient.submit_sql(sql)
print("SQL query submitted and running in the background. jobId = " + jobId)
SQL query submitted and running in the background. jobId = 603d9403-0ad8-484c-8191-8de5116ae861
In [14]:
print("Job status for " + jobId + ": " + sqlClient.get_job(jobId)['status'])
Job status for 603d9403-0ad8-484c-8191-8de5116ae861: running

Use the wait_for_job() method as a blocking call until your job has finished:

In [15]:
job_status = sqlClient.wait_for_job(jobId)
print("Job " + jobId + " terminated with status: " + job_status)
if job_status == 'failed':
    details = sqlClient.get_job(jobId)
    print("Error: {}\nError Message: {}".format(details['error'], details['error_message']))
Job 603d9403-0ad8-484c-8191-8de5116ae861 terminated with status: completed

Use the get_result() method to retrieve a dataframe for the SQL result set:

In [16]:
result_df = sqlClient.get_result(jobId)
print("OK, we have a dataframe for the SQL result that has been stored by SQL Query in " + sqlClient.get_job(jobId)['resultset_location'])
OK, we have a dataframe for the SQL result that has been stored by SQL Query in cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/myQueryResult/jobid=603d9403-0ad8-484c-8191-8de5116ae861

An alternative method for asynchronous query submission uses the internal statement created earlier with sql_magic:

In [17]:
print("Internal SQL query string created earlier using sql_magic:\n ")
sqlClient.print_sql()
Internal SQL query string created earlier using sql_magic:
 
SELECT o.OrderID,
       c.CompanyName,
       e.FirstName,
       e.LastName
FROM cos://us-geo/sql/orders.parquet STORED AS parquet o, cos://us-geo/sql/employees.parquet STORED AS parquet e, cos://us-geo/sql/customers.parquet STORED AS parquet c
WHERE e.EmployeeID = o.EmployeeID
  AND c.CustomerID = o.CustomerID
  AND o.ShippedDate > o.RequiredDate
  AND o.OrderDate > "1998-01-01"
ORDER BY c.CompanyName INTO cos://us-south/sqltempregional/myResult STORED AS CSV
In [18]:
jobId= sqlClient.submit()
print("\nSQL query submitted and running in the background. jobId = " + jobId)
SQL query submitted and running in the background. jobId = 777911ff-9bee-4ae4-aad4-361f789f20e6
In [19]:
job_status = sqlClient.wait_for_job(jobId)
print("Job " + jobId + " terminated with status: " + job_status)
if job_status == 'failed':
    details = sqlClient.get_job(jobId)
    print("Error: {}\nError Message: {}".format(details['error'], details['error_message']))
Job 777911ff-9bee-4ae4-aad4-361f789f20e6 terminated with status: completed

5.3 Synchronous execution with optional result dataframe

The synchronous execution of an SQL statement with run_sql() can cause a Python runtime memory overload when the result set is big. Therefore a synchronous execution method is provided that lets you control whether the result set is returned as a dataframe or only stored on cloud object storage.

execute_sql() submits the SQL statement, waits (until the query is completed), and returns a named tuple (data, job_id), with data being a dataframe optionally filled with the queried data. The option get_result controls whether the data is returned in the dataframe data or only stored on cloud object storage. Setting get_result=False is recommended if the result set is expected to be big. An alternative to deal with big result sets is to use pagination, which is possible using the pagesize option as described later.

In [20]:
result = sqlClient.execute_sql(sql, get_result=True) 
display(result.data)
OrderID CompanyName FirstName LastName
0 10924 Berglunds snabbköp Janet Leverling
1 11058 Blauer See Delikatessen Anne Dodsworth
2 10827 Bon app' Nancy Davolio
3 11076 Bon app' Margaret Peacock
4 11045 Bottom-Dollar Markets Michael Suyama
5 10970 Bólido Comidas preparadas Anne Dodsworth
6 11054 Cactus Comidas para llevar Laura Callahan
7 11008 Ernst Handel Robert King
8 11072 Ernst Handel Margaret Peacock
9 10816 Great Lakes Food Market Margaret Peacock
10 11040 Great Lakes Food Market Margaret Peacock
11 11061 Great Lakes Food Market Margaret Peacock
12 10960 HILARION-Abastos Janet Leverling
13 11065 LILA-Supermercado Laura Callahan
14 11071 LILA-Supermercado Nancy Davolio
15 11039 LINO-Delicateses Nancy Davolio
16 10927 La corne d'abondance Margaret Peacock
17 11051 La maison d'Asie Robert King
18 11070 Lehmanns Marktstand Andrew Fuller
19 11073 Pericles Comidas clásicas Andrew Fuller
20 11068 Queen Cozinha Laura Callahan
21 10828 Rancho grande Anne Dodsworth
22 11019 Rancho grande Michael Suyama
23 11077 Rattlesnake Canyon Grocery Nancy Davolio
24 11062 Reggiani Caseifici Margaret Peacock
25 11059 Ricardo Adocicados Andrew Fuller
26 11075 Richter Supermarkt Laura Callahan
27 10847 Save-a-lot Markets Margaret Peacock
28 11074 Simons bistro Robert King

The equivalent method to run a SQL statement that was generated with sql_magic is run().
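
As a minimal sketch, assuming that run() mirrors execute_sql() (i.e. it submits the SQL string stored internally by sql_magic, waits for completion and returns a named tuple with the fields data and job_id; the exact keyword arguments may differ), the statement built in section 4 could be executed like this:

In [ ]:
# Assumption: run() behaves like execute_sql() for the statement stored by sql_magic,
# returning a named tuple (data, job_id) after the job has completed.
result = sqlClient.run()
print("Finished job: " + result.job_id)
result.data.head(10)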

5.4 Use paginated SQL

There is another alternative to deal with large result sets and avoid Python runtime memory overload when reading the result as a dataframe: you can return the result in small enough "chunks" or "pages". All of the SQL execution methods previously introduced provide the optional parameter pagesize. When it is set, the result set is written as multiple objects, each having as many rows as specified in pagesize. Since this is implemented using the SQL Query syntax clause PARTITIONED EVERY <num> ROWS, your query must not already contain another PARTITIONED clause when you set the pagesize parameter.

The following cells demonstrate the usage of the optional pagesize parameter.

In [21]:
pagination_sql='SELECT OrderID, c.CustomerID CustomerID, CompanyName, City, Region, PostalCode \
                FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o, \
                     cos://us-geo/sql/customers.parquet STORED AS PARQUET c \
                WHERE c.CustomerID = o.CustomerID \
                INTO {}paginated_orders STORED AS PARQUET'.format(targeturl)

formatted_etl_sql = sqlparse.format(pagination_sql, reindent=True, indent_tabs=True, keyword_case='upper')
result = highlight(formatted_etl_sql, lexer, formatter)
print('\nExample Statement is:\n')
print(result)

jobId = sqlClient.submit_sql(pagination_sql, pagesize=10)
job_status = sqlClient.wait_for_job(jobId)
print("Job " + jobId + " terminated with status: " + job_status)
job_details = sqlClient.get_job(jobId)
if job_status == 'failed':
    print("Error: {}\nError Message: {}".format(job_details['error'], job_details['error_message']))
Example Statement is:

SELECT OrderID,
	c.CustomerID CustomerID,
	CompanyName,
	City,
	Region,
	PostalCode
FROM cos://us-geo/SQL/orders.parquet STORED AS PARQUET o,
	cos://us-geo/SQL/customers.parquet STORED AS PARQUET c
WHERE c.CustomerID = o.CustomerID INTO cos://us-south/sqltempregional/paginated_orders STORED AS PARQUET

Job 37bdcd23-9d91-4d4e-8d03-1e0e73dd12e6 terminated with status: completed

Let's check how many pages with 10 rows each have been written:

In [22]:
print("Number of pages written by job {}: {}".format(jobId, len(sqlClient.list_results(jobId))))
Number of pages written by job 37bdcd23-9d91-4d4e-8d03-1e0e73dd12e6: 85

The following cell retrieves the first page of the result as a data frame. The desired page is specified as the optional parameter pagenumber to the get_result() method.

In [23]:
pagenumber=1
sqlClient.get_result(jobId, pagenumber=pagenumber).head(100)
Out[23]:
OrderID CustomerID CompanyName City Region PostalCode
0 11011 ALFKI Alfreds Futterkiste Berlin NULL 12209
1 10952 ALFKI Alfreds Futterkiste Berlin NULL 12209
2 10835 ALFKI Alfreds Futterkiste Berlin NULL 12209
3 10702 ALFKI Alfreds Futterkiste Berlin NULL 12209
4 10692 ALFKI Alfreds Futterkiste Berlin NULL 12209
5 10643 ALFKI Alfreds Futterkiste Berlin NULL 12209
6 10926 ANATR Ana Trujillo Emparedados y helados México D.F. NULL 05021
7 10759 ANATR Ana Trujillo Emparedados y helados México D.F. NULL 05021
8 10625 ANATR Ana Trujillo Emparedados y helados México D.F. NULL 05021
9 10308 ANATR Ana Trujillo Emparedados y helados México D.F. NULL 05021

The following cell gets the next page. Run it multiple times in order to retrieve the subsequent pages, one page after another.

In [24]:
pagenumber+=1
sqlClient.get_result(jobId, pagenumber).head(100)
Out[24]:
OrderID CustomerID CompanyName City Region PostalCode
0 10856 ANTON Antonio Moreno Taquería México D.F. NULL 05023
1 10682 ANTON Antonio Moreno Taquería México D.F. NULL 05023
2 10677 ANTON Antonio Moreno Taquería México D.F. NULL 05023
3 10573 ANTON Antonio Moreno Taquería México D.F. NULL 05023
4 10535 ANTON Antonio Moreno Taquería México D.F. NULL 05023
5 10507 ANTON Antonio Moreno Taquería México D.F. NULL 05023
6 10365 ANTON Antonio Moreno Taquería México D.F. NULL 05023
7 11016 AROUT Around the Horn London NULL WA1 1DP
8 10953 AROUT Around the Horn London NULL WA1 1DP
9 10920 AROUT Around the Horn London NULL WA1 1DP

5.5 get_job()

The method get_job() provides you with some execution-related details. This way you can figure out the status of the job, the resultset_location on COS, the submit_time and end_time of the query, and key performance metrics such as bytes_read.

In [25]:
job_details = sqlClient.get_job(jobId)
pprint.pprint(job_details)
{'bytes_read': 17111,
 'end_time': '2021-03-11T07:56:57.135Z',
 'job_id': '37bdcd23-9d91-4d4e-8d03-1e0e73dd12e6',
 'plan_id': '72e394a3-9a56-432f-bc9d-8bc891f6adea',
 'resultset_format': 'parquet',
 'resultset_location': 'cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/paginated_orders/jobid=37bdcd23-9d91-4d4e-8d03-1e0e73dd12e6',
 'rows_read': 921,
 'rows_returned': 830,
 'statement': 'SELECT OrderID, c.CustomerID CustomerID, CompanyName, City, '
              'Region, PostalCode                 FROM '
              'cos://us-geo/sql/orders.parquet STORED AS PARQUET '
              'o,                      cos://us-geo/sql/customers.parquet '
              'STORED AS PARQUET c                 WHERE c.CustomerID = '
              'o.CustomerID                 INTO '
              'cos://us-south/sqltempregional/paginated_orders STORED AS '
              'PARQUET PARTITIONED EVERY 10 ROWS',
 'status': 'completed',
 'submit_time': '2021-03-11T07:56:30.924Z',
 'user_id': '[email protected]'}

6. Running ETL SQLs


The following ETL SQL statement joins two data sets from COS and writes the result to COS using hive style partitioning with two columns.

In [26]:
etl_sql='SELECT OrderID, c.CustomerID CustomerID, CompanyName, ContactName, ContactTitle, Address, City, Region, PostalCode, Country, Phone, Fax \
         EmployeeID, OrderDate, RequiredDate, ShippedDate, ShipVia, Freight, ShipName, ShipAddress, \
         ShipCity, ShipRegion, ShipPostalCode, ShipCountry FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o, \
         cos://us-geo/sql/customers.parquet STORED AS PARQUET c \
         WHERE c.CustomerID = o.CustomerID \
         INTO {}customer_orders STORED AS PARQUET PARTITIONED BY (ShipCountry, ShipCity)'.format(targeturl)
formatted_etl_sql = sqlparse.format(etl_sql, reindent=True, indent_tabs=True, keyword_case='upper')
result = highlight(formatted_etl_sql, lexer, formatter)
print('\nExample ETL Statement is:\n')
print(result)
Example ETL Statement is:

SELECT OrderID,
	c.CustomerID CustomerID,
	CompanyName,
	ContactName,
	ContactTitle,
	Address,
	City,
	Region,
	PostalCode,
	Country,
	Phone,
	Fax EmployeeID,
	OrderDate,
	RequiredDate,
	ShippedDate,
	ShipVia,
	Freight,
	ShipName,
	ShipAddress,
	ShipCity,
	ShipRegion,
	ShipPostalCode,
	ShipCountry
FROM cos://us-geo/SQL/orders.parquet STORED AS PARQUET o,
	cos://us-geo/SQL/customers.parquet STORED AS PARQUET c
WHERE c.CustomerID = o.CustomerID INTO cos://us-south/sqltempregional/customer_orders STORED AS PARQUET PARTITIONED BY (ShipCountry,
																																																																																																																									ShipCity)

In [27]:
jobId = sqlClient.submit_sql(etl_sql)
print("SQL query submitted and running in the background. jobId = " + jobId)
job_status = sqlClient.wait_for_job(jobId)
print("Job " + jobId + " terminated with status: " + job_status)
job_details = sqlClient.get_job(jobId)
if job_status == 'failed':
    print("Error: {}\nError Message: {}".format(job_details['error'], job_details['error_message']))
SQL query submitted and running in the background. jobId = 39afb149-0a39-4fb9-93de-a029d5dbe959
Job 39afb149-0a39-4fb9-93de-a029d5dbe959 terminated with status: completed

Now let's have a look at the cloud object storage location where the result objects are stored. The following cell uses the get_cos_summary() method to print a summary of the objects that have been written by the previous ETL SQL statement.

In [28]:
resultset_location = job_details['resultset_location']
sqlClient.get_cos_summary(resultset_location)
Out[28]:
{'url': 'cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/customer_orders/jobid=39afb149-0a39-4fb9-93de-a029d5dbe959/',
 'total_objects': 71,
 'total_volume': '401.7 KB',
 'oldest_object_timestamp': 'March 11, 2021, 07H:57M:20S',
 'newest_object_timestamp': 'March 11, 2021, 07H:57M:27S',
 'smallest_object_size': '0.0 B',
 'smallest_object': 'customer_orders/jobid=39afb149-0a39-4fb9-93de-a029d5dbe959/_SUCCESS',
 'largest_object_size': '7.1 KB',
 'largest_object': 'customer_orders/jobid=39afb149-0a39-4fb9-93de-a029d5dbe959/ShipCountry=UK/ShipCity=London/part-00015-7d19bbc6-259d-4942-a90a-9105f406cbe4-attempt_20210311075721_0082_m_000015_187.c000.snappy.parquet'}

Note the total_volume value. We will reference it for comparison in the next steps. Also note the number in total_objects, which indicates that the ETL produced 69 partitions (71 objects minus two bookkeeping objects).

In the following cell we use the list_results() method to print a list of these 71 objects that have been written by the above ETL SQL statement. Note that the partition columns and their values are now part of the object names. When you have a closer look at the cloud object storage URL of the objects you will notice that it contains the values of the partitioning columns, e.g. ShipCountry=Argentina/ShipCity=Buenos Aires. All records in this result partition have the value Argentina for column ShipCountry and Buenos Aires for ShipCity.

The fact that the result data is partitioned this way, and that this is made externally visible by making it part of the object name, can be leveraged, for example, by a query engine to optimize execution performance. This type of partitioning, called hive-style partitioning, is the basis for optimizing the execution of an SQL statement using predicates that match the partitioning columns.

In [53]:
pd.set_option('display.max_colwidth', None)
result_objects_df = sqlClient.list_results(jobId)
print("List of objects written by ETL SQL:")
result_objects_df.head(200)
List of objects written by ETL SQL:
Out[53]:
Object LastModified Size StorageClass Bucket ObjectURL
0 jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854 2021-03-11 08:30:48.665000+00:00 0.0 STANDARD sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6 cos://s3.us.cloud-object-storage.appdomain.cloud/sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6/jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854
1 jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/_SUCCESS 2021-03-11 08:30:51.219000+00:00 0.0 STANDARD sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6 cos://s3.us.cloud-object-storage.appdomain.cloud/sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6/jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/_SUCCESS
2 jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/part-00000-e28588d7-c156-4160-9e6b-1493f67a8402-c000-attempt_20210311083049_0119_m_000000_668.snappy.parquet 2021-03-11 08:30:49.924000+00:00 17499.0 STANDARD sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6 cos://s3.us.cloud-object-storage.appdomain.cloud/sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6/jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/part-00000-e28588d7-c156-4160-9e6b-1493f67a8402-c000-attempt_20210311083049_0119_m_000000_668.snappy.parquet
3 jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/part-00001-e28588d7-c156-4160-9e6b-1493f67a8402-c000-attempt_20210311083049_0119_m_000001_669.snappy.parquet 2021-03-11 08:30:49.813000+00:00 17574.0 STANDARD sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6 cos://s3.us.cloud-object-storage.appdomain.cloud/sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6/jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/part-00001-e28588d7-c156-4160-9e6b-1493f67a8402-c000-attempt_20210311083049_0119_m_000001_669.snappy.parquet
4 jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/part-00002-e28588d7-c156-4160-9e6b-1493f67a8402-c000-attempt_20210311083050_0119_m_000002_670.snappy.parquet 2021-03-11 08:30:50.509000+00:00 17854.0 STANDARD sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6 cos://s3.us.cloud-object-storage.appdomain.cloud/sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6/jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/part-00002-e28588d7-c156-4160-9e6b-1493f67a8402-c000-attempt_20210311083050_0119_m_000002_670.snappy.parquet

Now let's take a look at the result data with the get_result() method. Note that the result dataframe contains the two partitioning columns. The values for these have been put together by get_result() from the object names above because in hive style partitioning the partition column values are not stored in the objects but rather in the object names.

In [30]:
sqlClient.get_result(jobId).head(100)
Out[30]:
OrderID CustomerID CompanyName ContactName ContactTitle Address City Region PostalCode Country ... RequiredDate ShippedDate ShipVia Freight ShipName ShipAddress ShipRegion ShipPostalCode ShipCountry ShipCity
0 10409 OCEAN Océano Atlántico Ltda. Yvonne Moncada Sales Agent Ing. Gustavo Moncada 8585 Piso 20-A Buenos Aires NULL 1010 Argentina ... 1997-02-06 06:00:00 1997-01-14 00:00:00.000 1 29.83 Océano Atlántico Ltda. Ing. Gustavo Moncada 8585 Piso 20-A NULL 1010 Argentina Buenos Aires
1 10448 RANCH Rancho grande Sergio Gutiérrez Sales Representative Av. del Libertador 900 Buenos Aires NULL 1010 Argentina ... 1997-03-17 06:00:00 1997-02-24 00:00:00.000 2 38.82 Rancho grande Av. del Libertador 900 NULL 1010 Argentina Buenos Aires
2 10521 CACTU Cactus Comidas para llevar Patricio Simpson Sales Agent Cerrito 333 Buenos Aires NULL 1010 Argentina ... 1997-05-27 05:00:00 1997-05-02 00:00:00.000 2 17.22 Cactus Comidas para llevar Cerrito 333 NULL 1010 Argentina Buenos Aires
3 10531 OCEAN Océano Atlántico Ltda. Yvonne Moncada Sales Agent Ing. Gustavo Moncada 8585 Piso 20-A Buenos Aires NULL 1010 Argentina ... 1997-06-05 05:00:00 1997-05-19 00:00:00.000 1 8.12 Océano Atlántico Ltda. Ing. Gustavo Moncada 8585 Piso 20-A NULL 1010 Argentina Buenos Aires
4 10716 RANCH Rancho grande Sergio Gutiérrez Sales Representative Av. del Libertador 900 Buenos Aires NULL 1010 Argentina ... 1997-11-21 06:00:00 1997-10-27 00:00:00.000 2 22.57 Rancho grande Av. del Libertador 900 NULL 1010 Argentina Buenos Aires
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2 10261 QUEDE Que Delícia Bernardo Batista Accounting Manager Rua da Panificadora 12Rio de Janeiro RJ 02389-673 Brazil ... 1996-08-16 05:00:00 1996-07-30 00:00:00.000 2 3.05 Que Delícia Rua da Panificadora 12 RJ 02389-673 Brazil Rio de Janeiro
3 10287 RICAR Ricardo Adocicados Janete Limeira Assistant Sales Agent Av. Copacabana 267 Rio de Janeiro RJ 02389-890 Brazil ... 1996-09-19 05:00:00 1996-08-28 00:00:00.000 3 12.76 Ricardo Adocicados Av. Copacabana 267 RJ 02389-890 Brazil Rio de Janeiro
4 10291 QUEDE Que Delícia Bernardo Batista Accounting Manager Rua da Panificadora 12Rio de Janeiro RJ 02389-673 Brazil ... 1996-09-24 05:00:00 1996-09-04 00:00:00.000 2 6.40 Que Delícia Rua da Panificadora 12 RJ 02389-673 Brazil Rio de Janeiro
5 10299 RICAR Ricardo Adocicados Janete Limeira Assistant Sales Agent Av. Copacabana 267 Rio de Janeiro RJ 02389-890 Brazil ... 1996-10-04 05:00:00 1996-09-13 00:00:00.000 2 29.76 Ricardo Adocicados Av. Copacabana 267 RJ 02389-890 Brazil Rio de Janeiro
6 10379 QUEDE Que Delícia Bernardo Batista Accounting Manager Rua da Panificadora 12Rio de Janeiro RJ 02389-673 Brazil ... 1997-01-08 06:00:00 1996-12-13 00:00:00.000 1 45.03 Que Delícia Rua da Panificadora 12 RJ 02389-673 Brazil Rio de Janeiro

100 rows × 23 columns

The following cell runs an SQL query against the partitioned data that has been produced by the previous ETL SQL statement. The query uses WHERE predicates on the columns the dataset is partitioned by. This allows for performance optimization during query execution. The query engine will physically only read the objects that match these predicate values exploiting the fact that the predicate columns match the partitioning columns.

In [31]:
optimized_sql='SELECT * FROM {} STORED AS PARQUET WHERE ShipCountry = "Austria" AND ShipCity="Graz" \
               INTO {} STORED AS PARQUET'.format(resultset_location, targeturl)
formatted_optimized_sql = sqlparse.format(optimized_sql, reindent=True, indent_tabs=True, keyword_case='upper')
result = highlight(formatted_optimized_sql, lexer, formatter)
print('\nRunning SQL against the previously produced hive style partitioned objects as input:\n')
print(result)

jobId = sqlClient.submit_sql(optimized_sql)
job_status = sqlClient.wait_for_job(jobId)
print("Job " + jobId + " terminated with status: " + job_status)
job_details = sqlClient.get_job(jobId)
if job_status == 'failed':
    print("Error: {}\nError Message: {}".format(job_details['error'], job_details['error_message']))
Running SQL against the previously produced hive style partitioned objects as input:

SELECT *
FROM cos://s3.us-south.cloud-OBJECT-storage.appdomain.cloud/sqltempregional/customer_orders/jobid=39afb149-0a39-4fb9-93de-a029d5dbe959 STORED AS PARQUET
WHERE ShipCountry = "Austria"
		AND ShipCity="Graz" INTO cos://us-south/sqltempregional/ STORED AS PARQUET

Job 5fa54210-eea9-47dd-88dc-4063ee2d6a33 terminated with status: completed

In the following cell we use get_job() to verify in the job details of the optimized SQL we just ran that hive style partitioning has been leveraged. Note the bytes_read value, which is significantly lower than the total_volume value of the queried data set: only 6090 bytes were read, while the total volume of the input data is about 402 KB. The bytes_read value corresponds to the size of the partition customer_orders/jobid=39afb149-0a39-4fb9-93de-a029d5dbe959/ShipCountry=Austria/ShipCity=Graz/. This reveals that the execution engine leverages the hive style partitioning to optimize performance by reading only the partitions that match the filter predicate. This I/O avoidance is the reason for the higher query performance and the lower query cost.

In [32]:
sqlClient.get_job(jobId)
Out[32]:
{'job_id': '5fa54210-eea9-47dd-88dc-4063ee2d6a33',
 'status': 'completed',
 'statement': 'SELECT * FROM cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/customer_orders/jobid=39afb149-0a39-4fb9-93de-a029d5dbe959 STORED AS PARQUET WHERE ShipCountry = "Austria" AND ShipCity="Graz"                INTO cos://us-south/sqltempregional/ STORED AS PARQUET',
 'plan_id': '72e394a3-9a56-432f-bc9d-8bc891f6adea',
 'submit_time': '2021-03-11T07:57:40.150Z',
 'resultset_location': 'cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=5fa54210-eea9-47dd-88dc-4063ee2d6a33',
 'rows_returned': 30,
 'rows_read': 30,
 'bytes_read': 6090,
 'resultset_format': 'parquet',
 'end_time': '2021-03-11T07:57:47.170Z',
 'user_id': '[email protected]'}

Partitioning data is also worth considering when writing big result sets, as this enables parallel writing and avoids memory problems.

6.1 Using analyze() for partitioning recommendation


The ibmcloudsql SDK provides the analyze() method, which takes the job_id of an existing job and returns hints on how to partition the data. Below is an example in which we first query and store data in CSV format before using the result in another query. Because the data (from the first query) is not partitioned, the second query takes a few minutes. To understand how it can be improved, we call analyze(job_id), which suggests that we can partition the data using either PARTITIONED INTO or PARTITIONED EVERY with a recommended value.

In [10]:
etl_sql='SELECT * FROM cos://us-geo/sql/oklabdata/parquet/sds011/2017/07/ STORED AS PARQUET INTO {cos_out_url} STORED AS CSV'.format(cos_out_url=targeturl)
result = sqlClient.execute_sql(etl_sql, get_result=False)
job_details = sqlClient.get_job(result.job_id)
 
import time
tic = time.perf_counter()
sql="""
SELECT COUNT(sid_sds011) as count_sids_sds011 
FROM (SELECT DISTINCT sensor_id as sid_sds011 FROM {cos_in_url} STORED AS CSV)""".format(cos_in_url=job_details['resultset_location'])
sqlClient.run_sql(sql)
toc = time.perf_counter()
print(f"Total time  in {toc - tic:0.4f} seconds")
print("==========")
sqlClient.analyze(result.job_id)
Total time  in 234.3550 seconds
==========
Job 1d3c28d1-ab36-4ca7-9c9e-a9c93eb377ee has 1 object, with 1740.8 MB in total.
Best practices: object sizes ~ 128 MB
Current SQL:
 SELECT * FROM cos://us-geo/sql/oklabdata/parquet/sds011/2017/07/ STORED AS PARQUET INTO cos://us-south/sqltempregional/ STORED AS CSV

Consider using: PARTITIONED INTO 13 OBJECTS/BUCKETS
Suggested SQL:
 SELECT *
FROM cos://us-geo/sql/oklabdata/parquet/sds011/2017/07/ STORED AS PARQUET INTO cos://us-south/sqltempregional/ STORED AS PARQUET PARTITIONED INTO 13 OBJECTS

Consider using: PARTITIONED EVERY 2006029 ROWS
Suggested SQL:
 SELECT *
FROM cos://us-geo/sql/oklabdata/parquet/sds011/2017/07/ STORED AS PARQUET INTO cos://us-south/sqltempregional/ STORED AS PARQUET PARTITIONED EVERY 2006029 ROWS

Out[10]:
'Job 1d3c28d1-ab36-4ca7-9c9e-a9c93eb377ee has 1 object, with 1740.8 MB in total.\nBest practices: object sizes ~ 128 MB\nCurrent SQL:\n SELECT * FROM cos://us-geo/sql/oklabdata/parquet/sds011/2017/07/ STORED AS PARQUET INTO cos://us-south/sqltempregional/ STORED AS CSV\n\nConsider using: PARTITIONED INTO 13 OBJECTS/BUCKETS\nSuggested SQL:\n SELECT *\nFROM cos://us-geo/sql/oklabdata/parquet/sds011/2017/07/ STORED AS PARQUET INTO cos://us-south/sqltempregional/ STORED AS PARQUET PARTITIONED INTO 13 OBJECTS\n\nConsider using: PARTITIONED EVERY 2006029 ROWS\nSuggested SQL:\n SELECT *\nFROM cos://us-geo/sql/oklabdata/parquet/sds011/2017/07/ STORED AS PARQUET INTO cos://us-south/sqltempregional/ STORED AS PARQUET PARTITIONED EVERY 2006029 ROWS\n'

In the next cell we apply the suggested partition size of 2006029 rows and repartition the data accordingly (pagesize=2006029) before we run a query against the repartitioned data, which now finishes in much less time (typically 1/4 to 1/2 of the time).

In [12]:
etl_sql='SELECT * FROM cos://us-geo/sql/oklabdata/parquet/sds011/2017/07/ STORED AS PARQUET INTO {cos_out_url} STORED AS CSV'.format(cos_out_url=targeturl)
result = sqlClient.execute_sql(etl_sql, get_result=False, pagesize=2006029)
job_details = sqlClient.get_job(result.job_id)

import time
tic = time.perf_counter()
sql="""
SELECT COUNT(sid_sds011) as count_sids_sds011 
FROM (SELECT DISTINCT sensor_id as sid_sds011 FROM {cos_in_url} STORED AS CSV)""".format(cos_in_url=job_details['resultset_location'])
sqlClient.run_sql(sql)
toc = time.perf_counter()
print(f"Total time  in {toc - tic:0.4f} seconds")
Total time  in 59.6802 seconds

7. Work with result objects


The result set, apart from being returned to the client as a pandas.DataFrame, is also stored permanently on cloud object storage in the form of objects. There is a set of methods that allow you to work with the results on an object level.

7.1 list_results()

The result objects are written to cloud object storage by default with a virtual subfolder structure that follows this naming convention:

    cos://endpoint/bucketname/resultSetName/jobid=<JOB_ID>/_SUCCESS 

    cos://endpoint/bucketname/resultSetName/jobid=<JOB_ID>

    cos://endpoint/bucketname/resultSetName/jobid=<JOB_ID>/part-<xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>.<format>

where cos://endpoint/bucketname is the target URL and resultSetName is the result set name picked by the user, as specified in the SQL INTO clause. The first two objects are of zero byte size and are merely used for bookkeeping purposes. The first _SUCCESS object indicates that all objects of the given data set have been successfully written. The second object is empty as well and denotes the name prefix of all objects belonging to a specific data set. You may think of it as the "root" in a naming hierarchy for all result set related objects. The third object can in fact exist more than once if a PARTITIONED INTO or PARTITIONED EVERY clause has been used.

The following cell runs a simple SQL statement and lists the result objects for it. Since it uses a PARTITIONED INTO 3 OBJECTS clause, you can see three data objects being written.

In [52]:
pd.set_option('display.max_colwidth', None)
sql="SELECT * FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET INTO {} STORED AS PARQUET PARTITIONED INTO 3 OBJECTS".format(targeturl)
jobId = sqlClient.submit_sql(sql)
sqlClient.wait_for_job(jobId)
sqlClient.list_results(jobId).head(100)
Out[52]:
Object LastModified Size StorageClass Bucket ObjectURL
0 jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854 2021-03-11 08:30:48.665000+00:00 0.0 STANDARD sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6 cos://s3.us.cloud-object-storage.appdomain.cloud/sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6/jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854
1 jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/_SUCCESS 2021-03-11 08:30:51.219000+00:00 0.0 STANDARD sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6 cos://s3.us.cloud-object-storage.appdomain.cloud/sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6/jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/_SUCCESS
2 jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/part-00000-e28588d7-c156-4160-9e6b-1493f67a8402-c000-attempt_20210311083049_0119_m_000000_668.snappy.parquet 2021-03-11 08:30:49.924000+00:00 17499.0 STANDARD sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6 cos://s3.us.cloud-object-storage.appdomain.cloud/sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6/jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/part-00000-e28588d7-c156-4160-9e6b-1493f67a8402-c000-attempt_20210311083049_0119_m_000000_668.snappy.parquet
3 jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/part-00001-e28588d7-c156-4160-9e6b-1493f67a8402-c000-attempt_20210311083049_0119_m_000001_669.snappy.parquet 2021-03-11 08:30:49.813000+00:00 17574.0 STANDARD sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6 cos://s3.us.cloud-object-storage.appdomain.cloud/sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6/jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/part-00001-e28588d7-c156-4160-9e6b-1493f67a8402-c000-attempt_20210311083049_0119_m_000001_669.snappy.parquet
4 jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/part-00002-e28588d7-c156-4160-9e6b-1493f67a8402-c000-attempt_20210311083050_0119_m_000002_670.snappy.parquet 2021-03-11 08:30:50.509000+00:00 17854.0 STANDARD sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6 cos://s3.us.cloud-object-storage.appdomain.cloud/sqlquerynotebooks-donotdelete-pr-jhus7nit02fbf6/jobid=92b6dcc8-2026-4b11-8d0b-1e6dbc56e854/part-00002-e28588d7-c156-4160-9e6b-1493f67a8402-c000-attempt_20210311083050_0119_m_000002_670.snappy.parquet

7.2 delete_result()

You can delete the result set from Cloud Object Storage using the delete_result() method which deletes all cloud object storage objects related to the specified result set.

In [36]:
sqlClient.delete_result(jobId)
Out[36]:
Deleted Object
0 jobid=133c1dbf-e854-490a-ac41-75c692d1c75c/part-00002-b4cb5bde-0b01-4841-a9a5-357f41943f37-c000-attempt_20210311082030_0116_m_000002_1112.snappy.parquet
1 jobid=133c1dbf-e854-490a-ac41-75c692d1c75c/part-00001-b4cb5bde-0b01-4841-a9a5-357f41943f37-c000-attempt_20210311082030_0116_m_000001_1111.snappy.parquet
2 jobid=133c1dbf-e854-490a-ac41-75c692d1c75c/part-00000-b4cb5bde-0b01-4841-a9a5-357f41943f37-c000-attempt_20210311082030_0116_m_000000_1110.snappy.parquet
3 jobid=133c1dbf-e854-490a-ac41-75c692d1c75c
4 jobid=133c1dbf-e854-490a-ac41-75c692d1c75c/_SUCCESS

7.3 Enforce exact target object name

As described above, an SQL job always writes a folder structure with three (or more, when partitioned) objects to the cloud object store. The fact that the jobId is generated into the names of the objects ensures that, in case the same SQL statement is executed multiple times, the result set is not overwritten but stored in execution-specific objects. If this is not desired and the result is supposed to be overwritten each time the SQL statement is executed, the JOBPREFIX NONE clause can be used. In the subsequent sample the optional clause JOBPREFIX NONE implies that the jobId is no longer part of the object name.

Note that this in turn implies that the same target path - as specified in the INTO clause - is used each time, and therefore the result data set objects are "overwritten". The following sample demonstrates the effect of the JOBPREFIX NONE clause in an SQL statement on the naming of the result objects. Note that the jobId is no longer part of the object names.

In [37]:
sql="SELECT * FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET LIMIT 100 INTO {}first100orders.parquet JOBPREFIX NONE STORED AS PARQUET".format(targeturl)
jobId = sqlClient.submit_sql(sql)
sqlClient.wait_for_job(jobId)
sqlClient.list_results(jobId).head(100)
Out[37]:
Object LastModified Size StorageClass Bucket ObjectURL
0 first100orders.parquet 2021-03-11 08:20:37.846000+00:00 0.0 STANDARD sqltempregional cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/first100orders.parquet
1 first100orders.parquet/_SUCCESS 2021-03-11 08:20:39.997000+00:00 0.0 STANDARD sqltempregional cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/first100orders.parquet/_SUCCESS
2 first100orders.parquet/part-00000-9162cb53-b4a2-4fc4-96cb-6feb164dd207-c000-attempt_20210311082039_0082_m_000000_343.snappy.parquet 2021-03-11 08:20:39.720000+00:00 10271.0 STANDARD sqltempregional cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/first100orders.parquet/part-00000-9162cb53-b4a2-4fc4-96cb-6feb164dd207-c000-attempt_20210311082039_0082_m_000000_343.snappy.parquet

With this, your result object name is closer to the path you actually specified in the INTO clause, but it is still not EXACTLY the same. When you write a result with a single partition there is only one object with the data, and you may want to have a single result object without any virtual folder structure. For this case you can use the method rename_exact_result() on the jobId of a job that has a single-partition result.

In [38]:
sqlClient.rename_exact_result(jobId)
sqlClient.list_results(jobId).head(100)
Out[38]:
Object LastModified Size StorageClass Bucket ObjectURL
0 first100orders.parquet 2021-03-11 08:20:45.943000+00:00 10271.0 STANDARD sqltempregional cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/first100orders.parquet

After applying this method to the job's result, the list_results() method shows exactly one object - the one containing the result data - with the result set name as specified in the INTO clause.

8. Work with hive tables


So far all SQL statements used cloud object storage URLs to specify the input data of an SQL statement. This can be cumbersome. With ibmcloudsql version 0.4, SQL Query introduced hive metastore support. This enables the use of table names in SQL statements rather than cloud object storage URLs. The hive metastore is a catalog that holds, for each table, all related metadata including the table name, the schema, the backing cloud object storage objects, partitioning information etc.

NOTE: Using this requires the Standard plan of SQL Query.

You can create tables, list tables, drop tables etc. The APIs for catalog management can be found here.

Let's first create a table for a simple sample dataset, customers.csv, provided by SQL Query. The create_table() method creates an object in the hive metastore containing the table name and the schema of the table. When you do not specify the optional parameter schema, the schema is automatically discovered from the data on COS.

In [39]:
sqlClient.create_table("customers", cos_url="cos://us-geo/sql/customers.csv", format_type="csv", force_recreate=True)

With describe_table() you can retrieve the schema information in the catalog.

In [40]:
customers_schema = sqlClient.describe_table("customers")
customers_schema.head(100)
Out[40]:
col_name data_type comment
0 customerID string NaN
1 companyName string NaN
2 contactName string NaN
3 contactTitle string NaN
4 address string NaN
5 city string NaN
6 region string NaN
7 postalCode string NaN
8 country string NaN
9 phone string NaN
10 fax string NaN

In the following cell we run the create_table() again but this time we specify the schema explicitly:

In [41]:
sqlClient.create_table("customers", cos_url="cos://us-geo/sql/customers.csv", format_type="csv", force_recreate=True,
                       schema="(customerID string, companyName string, contactName string, contact_Title string, address string, city string)")

When you have a hive-partitioned dataset on COS (i.e. the subfolder hierarchy adheres to the hive naming convention), you can create a partitioned table using the create_partitioned_table() method.

In [5]:
sqlClient.create_partitioned_table("customers_partitioned", cos_url="cos://us-geo/sql/customers_partitioned.csv", format_type="csv", force_recreate=True)
customers_partitioned_schema = sqlClient.describe_table("customers_partitioned")
customers_partitioned_schema.head(100)
Out[5]:
col_name data_type comment
0 customerID string NaN
1 companyName string NaN
2 contactName string NaN
3 contactTitle string NaN
4 address string NaN
5 city string NaN
6 region string NaN
7 postalCode string NaN
8 phone string NaN
9 fax string NaN
10 COUNTRY string NaN
11 # Partition Information NaN NaN
12 # col_name data_type comment
13 COUNTRY string NaN

Note: After creating a partitioned table it is initially empty until you have added the partitions to it. A convenient way to do this is via the recover_table_partitions() method.

In [6]:
sqlClient.recover_table_partitions("customers_partitioned")

You can get a list of created tables with show_tables().

In [7]:
 sqlClient.show_tables()
Out[7]:
tableName
0 counties
1 customer_statistics
2 customers
3 customers_partitioned
4 eu_demographic
5 geographic
6 geographic_full
7 jdbc_test_table
8 plumpf
9 position_data
10 position_data2
11 position_data3
12 provinces
13 tdec
14 tdt
15 tsint
16 us_demographic
17 world_demographic

As an alternative to the above convenience methods for managing hive tables, you can use DDL statements and run them via submit_sql(). This gives you the full set of hive table management commands and options, which goes beyond what the above convenience methods cover.
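
As an illustration, the following sketch issues the same kind of DDL that the convenience methods generate (compare the statements visible in the job history in section 9.1). The table name customers_ddl is hypothetical; the sample dataset URL and targeturl are the ones used throughout this notebook.

In [ ]:
# Sketch: manage a hive table with plain DDL statements submitted via submit_sql().
# The table name 'customers_ddl' is hypothetical.
ddl_statements = [
    "CREATE TABLE customers_ddl USING csv LOCATION cos://us-geo/sql/customers.csv",
    "DESCRIBE TABLE customers_ddl INTO {} STORED AS CSV".format(targeturl),
    "DROP TABLE customers_ddl",
]
for ddl in ddl_statements:
    jobId = sqlClient.submit_sql(ddl)
    status = sqlClient.wait_for_job(jobId)
    print("{}: {}".format(status, ddl))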

9. Manage SQL jobs


9.1 Work with the job history

The set of APIs that allow you to interact with jobs is provided via this link.

Many of them are useful when you are launching many SQL statements, for example (see the sketch after this list):

  1. get the list of running jobs: get_jobs_with_status()
  2. get the list of jobs that have been launched in the given session: myjobs()
  3. get the up to 30 most recent jobs: get_jobs()
  4. export the list of jobs: export_job_history()
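
As a small illustration that relies only on the get_jobs() call shown below, the following sketch filters the recent job history by status using pandas (the columns job_id, status, submit_time and statement appear in the get_jobs() output further below):

In [ ]:
# Sketch: list the most recent jobs and keep only those that are still running.
recent_jobs_df = sqlClient.get_jobs()
running_jobs_df = recent_jobs_df[recent_jobs_df['status'] == 'running']
print("Currently running jobs: {}".format(len(running_jobs_df)))
if not running_jobs_df.empty:
    print(running_jobs_df[['job_id', 'submit_time', 'statement']].to_string(index=False))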

The next cell lists the most recent 30 jobs in your instance of SQL Query:

In [43]:
pd.set_option('display.max_colwidth', None) 
# You can change the value None for display.max_colwidth to a positive integer if you want to truncate the cell content to shrink the overall table display size.

job_history_df = sqlClient.get_jobs()
job_history_df.head(100)
Out[43]:
job_id status user_id statement resultset_location submit_time end_time rows_read rows_returned bytes_read objects_skipped objects_qualified error error_message
0 f6a16184-247b-4b76-b62e-b3e9e382ad79 completed [email protected] \n DROP TABLE customers_partitioned cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=f6a16184-247b-4b76-b62e-b3e9e382ad79 2021-03-11T08:22:08.433Z 2021-03-11T08:22:22.533Z None None None None None
1 838aed1b-437d-483f-8f6d-d4db7364b0f4 completed [email protected] \n SHOW TABLES \n INTO cos://us-south/sqltempregional/ STORED AS CSV\n cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=838aed1b-437d-483f-8f6d-d4db7364b0f4 2021-03-11T08:22:03.336Z 2021-03-11T08:22:07.100Z None 18 None None None
2 94ce0699-568a-4c60-bdc0-5a8c57135e81 completed [email protected] \n CREATE TABLE customers (customerID string, companyName string, contactName string, contact_Title string, address string, city string)\n USING csv\n LOCATION cos://us-geo/sql/customers.csv\n cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=94ce0699-568a-4c60-bdc0-5a8c57135e81 2021-03-11T08:21:56.024Z 2021-03-11T08:22:00.918Z None None None None None
3 549edc2d-1326-4fbc-926c-8b2bde20bac0 completed [email protected] \n DROP TABLE customers cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=549edc2d-1326-4fbc-926c-8b2bde20bac0 2021-03-11T08:21:46.881Z 2021-03-11T08:21:53.749Z None None None None None
4 3f167e7c-fee4-4a58-91a3-0b66e35dfdfa completed [email protected] \n SHOW TABLES \n INTO cos://us-south/sqltempregional/ STORED AS CSV\n cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=3f167e7c-fee4-4a58-91a3-0b66e35dfdfa 2021-03-11T08:21:41.781Z 2021-03-11T08:21:45.859Z None 18 None None None
5 c72cb124-477d-412f-aea3-94d16f0614f4 completed [email protected] \n DESCRIBE TABLE customers INTO cos://us-south/sqltempregional/ STORED AS CSV cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=c72cb124-477d-412f-aea3-94d16f0614f4 2021-03-11T08:21:36.634Z 2021-03-11T08:21:40.728Z None 11 None None None
6 3d23d836-7a32-4305-a501-836b23090fdc completed [email protected] \n CREATE TABLE customers\n USING csv\n LOCATION cos://us-geo/sql/customers.csv\n cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=3d23d836-7a32-4305-a501-836b23090fdc 2021-03-11T08:21:27.155Z 2021-03-11T08:21:34.152Z None None None None None
7 f80170a2-1d96-4d7a-bc31-b722055e6ef6 completed [email protected] \n DROP TABLE customers cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=f80170a2-1d96-4d7a-bc31-b722055e6ef6 2021-03-11T08:21:06.817Z 2021-03-11T08:21:25.822Z None None None None None
8 6efba320-892c-4b1e-b55e-0a148713d5b7 completed [email protected] \n SHOW TABLES \n INTO cos://us-south/sqltempregional/ STORED AS CSV\n cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=6efba320-892c-4b1e-b55e-0a148713d5b7 2021-03-11T08:20:46.488Z 2021-03-11T08:21:05.566Z None 18 None None None
9 666ad0f6-2858-4757-a488-53df4de43733 completed [email protected] SELECT * FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET LIMIT 100 INTO cos://us-south/sqltempregional/first100orders.parquet JOBPREFIX NONE STORED AS PARQUET cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/first100orders.parquet 2021-03-11T08:20:34.165Z 2021-03-11T08:20:44.020Z 830 100 30606 None None
10 133c1dbf-e854-490a-ac41-75c692d1c75c completed [email protected] SELECT * FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET INTO cos://us-south/sqltempregional/ STORED AS PARQUET PARTITIONED INTO 3 OBJECTS cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=133c1dbf-e854-490a-ac41-75c692d1c75c 2021-03-11T08:20:26.117Z 2021-03-11T08:20:31.991Z 830 830 30606 None None
11 88a4d2f9-420c-47bb-8f65-751bb4fbaa3e completed [email protected] \nSELECT COUNT(sid_sds011) as count_sids_sds011 \nFROM (SELECT DISTINCT sensor_id as sid_sds011 FROM cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=9cff2fd1-f49e-4c18-b8ff-59d838942e95 STORED AS CSV) cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=88a4d2f9-420c-47bb-8f65-751bb4fbaa3e 2021-03-11T08:11:39.377Z 2021-03-11T08:13:51.845Z 27282007 1 1853447492 None None
12 9cff2fd1-f49e-4c18-b8ff-59d838942e95 completed [email protected] SELECT * FROM cos://us-geo/sql/oklabdata/parquet/sds011/2017/07/ STORED AS PARQUET INTO cos://us-south/sqltempregional/ STORED AS CSV PARTITIONED EVERY 2006029 ROWS cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=9cff2fd1-f49e-4c18-b8ff-59d838942e95 2021-03-11T08:08:00.773Z 2021-03-11T08:11:38.642Z 27282007 27282007 293240764 None None
13 6e331f1e-983e-4791-8502-401bd14bb493 completed [email protected] \nSELECT COUNT(sid_sds011) as count_sids_sds011 \nFROM (SELECT DISTINCT sensor_id as sid_sds011 FROM cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=c07dbf97-f4ea-4151-97fc-6004391568d0 STORED AS CSV) cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=6e331f1e-983e-4791-8502-401bd14bb493 2021-03-11T08:01:48.436Z 2021-03-11T08:02:46.870Z 27282007 1 1853461821 None None
14 c07dbf97-f4ea-4151-97fc-6004391568d0 completed [email protected] SELECT * FROM cos://us-geo/sql/oklabdata/parquet/sds011/2017/07/ STORED AS PARQUET INTO cos://us-south/sqltempregional/ STORED AS CSV cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=c07dbf97-f4ea-4151-97fc-6004391568d0 2021-03-11T07:57:49.426Z 2021-03-11T08:01:46.994Z 27282007 27282007 293233004 None None
15 5fa54210-eea9-47dd-88dc-4063ee2d6a33 completed [email protected] SELECT * FROM cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/customer_orders/jobid=39afb149-0a39-4fb9-93de-a029d5dbe959 STORED AS PARQUET WHERE ShipCountry = "Austria" AND ShipCity="Graz" INTO cos://us-south/sqltempregional/ STORED AS PARQUET cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=5fa54210-eea9-47dd-88dc-4063ee2d6a33 2021-03-11T07:57:40.150Z 2021-03-11T07:57:47.170Z 30 30 6090 None None
16 39afb149-0a39-4fb9-93de-a029d5dbe959 completed [email protected] SELECT OrderID, c.CustomerID CustomerID, CompanyName, ContactName, ContactTitle, Address, City, Region, PostalCode, Country, Phone, Fax EmployeeID, OrderDate, RequiredDate, ShippedDate, ShipVia, Freight, ShipName, ShipAddress, ShipCity, ShipRegion, ShipPostalCode, ShipCountry FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o, cos://us-geo/sql/customers.parquet STORED AS PARQUET c WHERE c.CustomerID = o.CustomerID INTO cos://us-south/sqltempregional/customer_orders STORED AS PARQUET PARTITIONED BY (ShipCountry, ShipCity) cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/customer_orders/jobid=39afb149-0a39-4fb9-93de-a029d5dbe959 2021-03-11T07:57:14.149Z 2021-03-11T07:57:28.907Z 921 830 43058 None None
17 37bdcd23-9d91-4d4e-8d03-1e0e73dd12e6 completed [email protected] SELECT OrderID, c.CustomerID CustomerID, CompanyName, City, Region, PostalCode FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o, cos://us-geo/sql/customers.parquet STORED AS PARQUET c WHERE c.CustomerID = o.CustomerID INTO cos://us-south/sqltempregional/paginated_orders STORED AS PARQUET PARTITIONED EVERY 10 ROWS cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/paginated_orders/jobid=37bdcd23-9d91-4d4e-8d03-1e0e73dd12e6 2021-03-11T07:56:30.924Z 2021-03-11T07:56:57.135Z 921 830 17111 None None
18 27923b78-ed54-4254-a0d1-6dbfa939a4bc completed [email protected] SELECT o.OrderID, c.CompanyName, e.FirstName, e.LastName FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o, cos://us-geo/sql/employees.parquet STORED AS PARQUET e, cos://us-geo/sql/customers.parquet STORED AS PARQUET c WHERE e.EmployeeID = o.EmployeeID AND c.CustomerID = o.CustomerID AND o.ShippedDate > o.RequiredDate AND o.OrderDate > "1998-01-01" ORDER BY c.CompanyName INTO cos://us-south/sqltempregional/myQueryResult STORED AS CSV cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/myQueryResult/jobid=27923b78-ed54-4254-a0d1-6dbfa939a4bc 2021-03-11T07:56:21.268Z 2021-03-11T07:56:29.467Z 1760 29 41499 None None
19 777911ff-9bee-4ae4-aad4-361f789f20e6 completed [email protected] SELECT o.OrderID, c.CompanyName, e.FirstName, e.LastName FROM cos://us-geo/sql/orders.parquet STORED AS parquet o, cos://us-geo/sql/employees.parquet STORED AS parquet e, cos://us-geo/sql/customers.parquet STORED AS parquet c WHERE e.EmployeeID = o.EmployeeID AND c.CustomerID = o.CustomerID AND o.ShippedDate > o.RequiredDate AND o.OrderDate > "1998-01-01" ORDER BY c.CompanyName INTO cos://us-south/sqltempregional/myResult STORED AS CSV cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/myResult/jobid=777911ff-9bee-4ae4-aad4-361f789f20e6 2021-03-11T07:56:12.204Z 2021-03-11T07:56:20.874Z 1760 29 41499 None None
20 603d9403-0ad8-484c-8191-8de5116ae861 completed [email protected] SELECT o.OrderID, c.CompanyName, e.FirstName, e.LastName FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o, cos://us-geo/sql/employees.parquet STORED AS PARQUET e, cos://us-geo/sql/customers.parquet STORED AS PARQUET c WHERE e.EmployeeID = o.EmployeeID AND c.CustomerID = o.CustomerID AND o.ShippedDate > o.RequiredDate AND o.OrderDate > "1998-01-01" ORDER BY c.CompanyName INTO cos://us-south/sqltempregional/myQueryResult STORED AS CSV cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/myQueryResult/jobid=603d9403-0ad8-484c-8191-8de5116ae861 2021-03-11T07:56:02.269Z 2021-03-11T07:56:09.619Z 1760 29 41499 None None
21 f76c3890-5426-4fba-89b5-9b01d28f8e20 completed [email protected] SELECT o.OrderID, c.CompanyName, e.FirstName, e.LastName FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o, cos://us-geo/sql/employees.parquet STORED AS PARQUET e, cos://us-geo/sql/customers.parquet STORED AS PARQUET c WHERE e.EmployeeID = o.EmployeeID AND c.CustomerID = o.CustomerID AND o.ShippedDate > o.RequiredDate AND o.OrderDate > "1998-01-01" ORDER BY c.CompanyName INTO cos://us-south/sqltempregional/myQueryResult STORED AS CSV cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/myQueryResult/jobid=f76c3890-5426-4fba-89b5-9b01d28f8e20 2021-03-11T07:53:37.952Z 2021-03-11T07:53:56.991Z 1760 29 41499 None None
22 b6270a52-372b-4739-9ef1-59a1da1062db completed [email protected] \n SELECT * FROM DESCRIBE(cos://us-geo/sql/employees.parquet STORED AS PARQUET)\n INTO cos://us-south/sqltempregional/ STORED AS JSON\n cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=b6270a52-372b-4739-9ef1-59a1da1062db 2021-03-11T07:52:47.173Z 2021-03-11T07:52:53.635Z None 18 None None None
23 bac19cd0-7278-4a96-bb22-4934b9c43227 completed [email protected] SELECT * FROM cos://us-south/sqltempregional/my_job_history/ STORED AS PARQUET LIMIT 10 INTO cos://us-south/sqltempregional/ STORED AS CSV cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=bac19cd0-7278-4a96-bb22-4934b9c43227 2021-03-08T14:24:45.216Z 2021-03-08T14:24:52.113Z 90 10 51598 None None
24 6d39ec96-11e0-48f3-bf88-159a85a20a18 completed [email protected] SELECT * FROM cos://us-geo/sql/employees.parquet STORED AS PARQUET LIMIT 10 cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=6d39ec96-11e0-48f3-bf88-159a85a20a18 2021-03-08T14:24:21.273Z 2021-03-08T14:24:25.499Z 9 9 8593 None None
25 f9372e35-7805-40ef-bfc8-c1082108abb2 failed [email protected] SELECT xyz FROM cos://us-geo/sql/employees.parquet STORED AS PARQUET LIMIT 10 INTO cos://us-south/sqltempregional/ STORED AS CSV cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=f9372e35-7805-40ef-bfc8-c1082108abb2 2021-03-08T14:23:57.985Z 2021-03-08T14:24:00.017Z None None None None None SQL execution failed A non-existing column is used: xyz. Explore the table schema using DESCRIBE. For COS URI tables use "SELECT * FROM DESCRIBE(<table locator>)" to explore the infered schema (use the exact same options as in your query). For Hive table use "DESCRIBE TABLE <table name>".
26 0058781a-eb24-4265-877e-fed727b1a31e completed [email protected] WITH orders_shipped AS (SELECT OrderID, EmployeeID, (CASE WHEN shippedDate < requiredDate THEN 'On Time' ELSE 'Late' END) AS Shipped FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET) SELECT e.FirstName, e.LastName, COUNT(o.OrderID) As NumOrders, Shipped FROM orders_shipped o, cos://us-geo/sql/employees.parquet STORED AS PARQUET e WHERE e.EmployeeID = o.EmployeeID GROUP BY e.FirstName, e.LastName, Shipped ORDER BY e.LastName, e.FirstName, NumOrders DESC INTO cos://us-south/sqltempregional/ STORED AS CSV cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=0058781a-eb24-4265-877e-fed727b1a31e 2021-03-08T14:23:47.098Z 2021-03-08T14:23:53.924Z 839 18 20626 None None
27 540cec48-e41c-4ff6-8a45-72f15efc5083 completed [email protected] WITH orders as (SELECT customerid, named_struct('count', count(orderid), 'orderids', collect_list(orderid)) orders FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET GROUP BY customerid) SELECT c.customerid, named_struct('name', companyname, 'contact', contactname, 'tile', contacttitle, 'phone', PHONE) company, \t named_struct('street', address, 'city', city, 'zip', postalcode, 'country', country) address, \t orders FROM cos://us-geo/sql/customers.parquet STORED AS PARQUET c, orders o WHERE o.customerid=c.customerid INTO cos://us-south/sqltempregional/ STORED AS JSON PARTITIONED EVERY 10 ROWS cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=540cec48-e41c-4ff6-8a45-72f15efc5083 2021-03-08T14:23:35.344Z 2021-03-08T14:23:42.207Z 921 89 18561 None None
28 315d4c95-5df1-49ad-9b41-4db7b2149370 completed [email protected] SELECT * FROM cos://us-geo/sql/employees.parquet STORED AS PARQUET LIMIT 10 INTO cos://us-south/sqltempregional/ STORED AS PARQUET PARTITIONED EVERY 2 ROWS cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/jobid=315d4c95-5df1-49ad-9b41-4db7b2149370 2021-03-08T14:23:15.661Z 2021-03-08T14:23:26.091Z 9 9 8593 None None
29 6f0c63f4-5a6c-4046-be87-3a9cfe06e9aa completed [email protected] SELECT * FROM cos://us-geo/sql/employees.parquet STORED AS PARQUET LIMIT 10 INTO cos://us-south/sqltempregional/myresult.parquet JOBPREFIX NONE STORED AS PARQUET cos://s3.us-south.cloud-object-storage.appdomain.cloud/sqltempregional/myresult.parquet 2021-03-08T14:23:01.256Z 2021-03-08T14:23:06.025Z 9 9 8593 None None
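
Because get_jobs() returns a regular pandas DataFrame, you can slice it like any other DataFrame. As a small illustrative sketch, using the column names visible in the output above, the following cell pulls out only the failed jobs together with their error messages:

In [ ]:
# Filter the job history for failed jobs and inspect why they failed.
failed_jobs = job_history_df[job_history_df['status'] == 'failed']
failed_jobs[['job_id', 'submit_time', 'error', 'error_message']]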

If you need a longer history of your jobs, you have to periodically export the current job history to a location on COS. This can be done conveniently with the export_job_history() method:

In [44]:
sqlClient.export_job_history(targeturl + "my_job_history/",   "job_export_" , ".parquet")  
Exported 30 new jobs

Each time you run this function, it performs a delta check of the currently available recent history of 30 jobs against the jobs already stored in the specified location, and writes out an additional Parquet partition containing only the new jobs that had not been exported previously:

In [45]:
# Run another SQL:
sql="SELECT * FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET LIMIT 100 INTO {} STORED AS PARQUET".format(targeturl)
jobId = sqlClient.submit_sql(sql)
sqlClient.wait_for_job(jobId)

#Export job history again:
sqlClient.export_job_history(targeturl + "my_job_history/",   "job_export_" , ".parquet")

# Query exported job history:
pd.set_option('display.max_colwidth', 20)
sql = "SELECT * FROM {}my_job_history/ STORED AS PARQUET INTO {} STORED AS PARQUET".format(targeturl, targeturl)
sqlClient.run_sql(sql)
Exported 1 new jobs
Out[45]:
job_id status user_id statement resultset_location submit_time end_time rows_read rows_returned bytes_read objects_skipped objects_qualified error error_message job_export
0 f821cb4d-df68-4a... completed [email protected] SELECT * FROM co... cos://s3.us-sout... 2021-03-07T18:21... 2021-03-07T18:21... 9.0 9.0 8593.0 NaN NaN 2021-03-07
1 2eaae838-16f5-4d... failed [email protected] SELECT xyz FROM ... cos://s3.us-sout... 2021-03-07T18:20... 2021-03-07T18:20... NaN NaN NaN NaN NaN SQL execution fa... A non-existing c... 2021-03-07
2 74466e3e-7c68-4d... completed [email protected] WITH orders_ship... cos://s3.us-sout... 2021-03-07T18:20... 2021-03-07T18:20... 839.0 18.0 20626.0 NaN NaN 2021-03-07
3 fae57226-e692-4b... completed [email protected] WITH orders as (... cos://s3.us-sout... 2021-03-07T18:20... 2021-03-07T18:20... 921.0 89.0 18561.0 NaN NaN 2021-03-07
4 f65e33c9-6148-48... completed [email protected] SELECT * FROM co... cos://s3.us-sout... 2021-03-07T18:19... 2021-03-07T18:19... 9.0 9.0 8593.0 NaN NaN 2021-03-07
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
85 2b7519b2-90ee-48... completed [email protected] WITH prefiltered... cos://s3.us-sout... 2021-03-08T14:21... 2021-03-08T14:21... 9642.0 65.0 2443783.0 NaN NaN 2021-03-08
86 ccc8e011-542d-4b... completed [email protected] WITH prefiltered... cos://s3.us-sout... 2021-03-08T14:21... 2021-03-08T14:21... 9642.0 65.0 2443783.0 NaN NaN 2021-03-08
87 60d5f9c5-31bf-4d... completed [email protected] WITH prefiltered... cos://s3.us-sout... 2021-03-08T14:21... 2021-03-08T14:21... 9642.0 65.0 2443783.0 NaN NaN 2021-03-08
88 f6fa1050-7813-49... completed [email protected] WITH orders as (... cos://s3.us-sout... 2021-03-08T14:21... 2021-03-08T14:21... 921.0 89.0 18561.0 NaN NaN 2021-03-08
89 5690ed84-2c2b-4e... completed [email protected] SELECT * FROM co... cos://s3.us-sout... 2021-03-08T14:21... 2021-03-08T14:21... 9.0 9.0 8593.0 NaN NaN 2021-03-08

90 rows × 15 columns

Assume you want to automate your ETL process and have to submit many SQL statements at once. In that case, be aware that there is a limit on the number of queries that a single SQL Query instance can serve at a time: 1 for the Lite plan and 5 for the Standard plan. If some of your jobs are long-running, they may not have completed yet when the session ends.
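
One simple way to stay within that limit is to serialize submission: submit a statement, wait for its job to finish, then submit the next one. The following cell is a minimal sketch of that pattern, reusing only submit_sql() and wait_for_job() as shown earlier; the two SELECT statements are just placeholders.

In [ ]:
# A minimal sketch: submit statements one at a time so that at most one query
# runs concurrently (sufficient even for the Lite plan limit of 1).
statements = [
    "SELECT OrderID FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET LIMIT 5 INTO {} STORED AS CSV".format(targeturl),
    "SELECT OrderID FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET LIMIT 10 INTO {} STORED AS CSV".format(targeturl),
]

finished_job_ids = []
for sql in statements:
    job_id = sqlClient.submit_sql(sql)   # submit the next statement
    sqlClient.wait_for_job(job_id)       # block until this job has finished
    finished_job_ids.append(job_id)

finished_job_ids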

Since ibmcloudsql 0.4, if you are using Watson Studio, there is also the possibility to retry jobs that did not complete within the session in which they were started. At a later point in time, when you start a new session, you may want to restart those jobs that did not complete. You do so by connecting to Project-Lib and saving the list of submitted jobs in a file in the project bucket. When you re-run the session, the execution status of all jobs is updated and the jobs that did not complete in the previous session are re-executed.

The code in the following cell demonstrates this. Re-run the cell a couple of times and check the tracked-jobs file ("tracked_jobs1.json" in this example) in the project bucket to observe the behavior.

This works in Watson Studio with ProjectLib activated, using a file stored as an asset in the project, or alternatively with a file on your local machine.

REF: The API documentation is provided via this link.

In [49]:
 
file_name="tracked_jobs1.json"
sqlClient.connect_project_lib(project, file_name)

sql_stmt1='SELECT o.OrderID  FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o LIMIT 5 ' 
sql_stmt2='SELECT o.OrderID  FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o LIMIT 10 ' 
sql_stmt3='SELECT o.OrderID  FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o LIMIT 15 '
sql_stmt4='SELECT o.OrderID  FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o LIMIT 20'
 

jobs = [ ]

jobs.append(sqlClient.submit_and_track_sql(sql_stmt1))
jobs.append(sqlClient.submit_and_track_sql(sql_stmt2))
jobs.append(sqlClient.submit_and_track_sql(sql_stmt3))  
jobs.append(sqlClient.submit_and_track_sql(sql_stmt4))

Here, given the list of job IDs, you can process them all; each job is checked against the jobs saved in the file_name stored in ProjectLib, and those that did not complete are re-executed until all of them have completed.

In [ ]:
sqlClient.process_failed_jobs_until_all_completed(jobs)

Home

In this notebook you learned how to use the ibmcloudsql library in a Python notebook to submit SQL queries on data in IBM Cloud Object Storage and how to interact with the query results. If you want to automate such SQL query execution as part of your cloud solution, then apart from the ProjectLib-based approach, you can use the IBM Cloud Functions framework. There is a dedicated SQL function available that lets you set up a cloud function to run SQL statements with IBM Cloud SQL Query. You can find the documentation for doing this here.

Authors

Torsten Steinbach. Torsten is the lead architect for IBM Cloud Data Lake. Previously, he worked as an IBM architect on a series of data management products and services, including Db2, PureData for Analytics, and Db2 on Cloud.

Tuan M. HoangTrong. Tuan is a research staff member in the Distributed AI, TimeSeries Group.

Copyright © IBM Corp. 2020-2021. This notebook and its source code are released under the terms of the MIT License.