Google BigQuery connector (DataStage)

Use the Google BigQuery connector in DataStage® to write data to a BigQuery data warehouse

Prerequisite

Create the connection. For instructions, see Connecting to a data source in DataStage and the Google BigQuery connection.

Configure the Google BigQuery connector for write mode

Specify how the Google BigQuery connector writes data to a BigQuery table. Select the Write mode on the target connector under General target properties. The following table shows the ways that you can configure the connector for the Write mode.

Table 1. Google BigQuery connector write mode
Method	Description
Insert	The Insert mode inserts data into the table. The data will be instantly available in the table. After that, you can immediately perform any other operations.
Streaming insert	The Streaming insert mode inserts data into the specified table by using the BigQuery streaming API. Select this method if you want to query data in the table for real-time analysis. It can take up to 90 minutes for the data in the table to be available for subsequent operations like Update, Replace, or Truncate. Meanwhile, if you do any of these operations, you might receive error messages because these operations can cause inconsistencies in table data. For more information, see Streaming data availability in the Google Cloud documentation.
Merge	The Merge mode first updates the table based on the key columns that you specify, and then inserts the data into the table.
Update	The Update mode updates only the table data based on the key columns.
Delete	The Delete mode deletes only the table data based on the key columns.
Delete then Insert	The Delete then Insert mode first deletes the table data based on the key columns and the inserts the data into the table.
Update statement	The Update statement mode writes rows to the table when you provide an SQL statement that contains the temporary staging table `TEMP_EXTERNAL_TABLE`. The data from the input link is written to the temporary table. If `TEMP_EXTERNAL_TABLE` is missing, then the job fails to operate. Example: `insert into bqtest.testGCS select * from TEMP_EXTERNAL_TABLE`
Call procedure statement	The Call procedure statement mode runs your own procedure statement syntax on the database. You can provide the procedure statement syntax to the text box.

Configure the Google BigQuery connector table action

Specify the action to take on the target table to handle the new data set. The following table shows the ways that you can configure the Table action.

Table 2. Google BigQuery connector table actions
Method	Description
Append	Append creates the table if the table doesn’t exist. If the table already exists, no action is performed.
Replace	Replace drops the existing table and creates a new table with the existing job design schema. An existing table is dropped only when there is no existing streaming buffer attached to the table.
Truncate	Truncate deletes all the records from a table, but it does not remove the table itself.

Partitioned reads

You can configure the Google BigQuery to run on multiple processing nodes and read data from the data source. Each of the processing nodes for the stage retrieves a set (partition) of records from the data source. The partitions are then combined to produce the entire result set for the output link of the stage.

To enable partitioned reads, in the Source stage under General source properties, for Read method, select Select statement. Then select Enable partitioned reads.

When the connector is configured for partitioned reads, it runs the statement that is specified in the Select statement property on each processing node. You can use special placeholders in the statements to ensure that each of the processing nodes retrieves a distinct partition of records from the data source.

You can use the following placeholders in the statements to specify distinct partitions of records to retrieve in individual processing nodes:

[[node-number]]: The node number is replaced at run time with the index of the current processing node. The indexes are zero-based. The placeholder is replaced with the value 0 on the first processing node, value 1 on the second processing node, value 2 on the third processing node, and so forth.
[[node-count]]: The total number of processing nodes for the stage. By default, this number is the number of nodes in the parallel configuration file. The location of the parallel configuration file is specified in the APT_CONFIG_FILE environment variable.

Example SQL statements:

select * from testds.testPartitionRead  where MOD(C1, [[node-count]]) = [[node-number]]

select * from testds.testPartitionRead  where MOD(ABS(FARM_FINGERPRINT(C2)), [[node-count]]) = [[node-number]]

TIME data types

The Google BigQuery connector supports these TIME data types:

TIMESTAMP

Extended options:

Microseconds
Timezone
Microseconds + Timezone

For information about how Google uses the TIMESTAMP data type, see the Timestamp type.

TIME

Extended option:

Microseconds

For information about how Google uses the TIME data type, see Time type.

Lookup stage support (table)

In Stage tab under General you can find Lookup type, where you can choose if you want to use Sparse or Normal lookup method.

Table 3.
Lookup method	Description
Sparse Lookup	In a sparse lookup, the connector runs the specified SQL statement one time for each parameter set that arrives in the form of a record on the input link to the Lookup stage. The specified input parameters in the statement must have corresponding columns that are defined on the reference link. Each input record contains a set of parameter values that are represented by key columns. The BigQuery connector sets the parameter values on the bind variables in the SQL statement. After that the BigQuery connector runs the statement or block. The result of the lookup is routed as one or more records through the reference link from the BigQuery connector back to the Lookup stage and from the Lookup stage to the output link of the Lookup stage. A sparse lookup is also known as a direct lookup because the lookup is performed directly on the database. Typically, you use a sparse lookup when the target table is too large to fit in memory. You can also use the sparse lookup method for real-time jobs. You can use the sparse lookup method only in parallel jobs.
Normal Lookup	In a normal lookup, the connector runs the specified SQL statement only one time. The SQL statement cannot contain any input parameters. In normal lookup you can set limits like Row limit (the maximum number of rows to return) and Byte limit (the maximum number of bytes to return). The Lookup stage searches the result set data that is retrieved and looks for matches for the parameter sets that arrive in the form of records on the input link to the Lookup stage. A normal lookup is also known as an in-memory lookup because the lookup is performed on the cached data in memory. Typically, you use a normal lookup when the target table is small enough that all of the rows in the table can fit in memory.

For more information about Lookup Stage, see: https://www.ibm.com/docs/en/iis/11.7?topic=data-lookup-stage.