0 / 0

Extracting the data from Microsoft Excel (DataStage)

Last updated: Mar 12, 2025
Extracting the data from Microsoft Excel (Excel stage) in DataStage

You can use the Excel stage to extract several types of data from a selected data range in a Microsoft Excel file.

Data ranges

When you use the Excel stage, you can extract data from a specified data range in a Microsoft Excel spreadsheet.

Data range represents a cell, a row, a column, or a selection of cells that contain one or more continuous blocks of cells. Data range is specified by the range expression. In the Excel stage, you can use a range expression to specify the data range to extract.

For example, Employee_Salary!A1:G8 describes a data range in which the first cell is A1 and the last cell is G8 in the Employee_Salary spreadsheet.

Table 1. Example of a Microsoft Excel file; Employee_Salary spreadsheet
1 A
EMPNO
B
FIRSTNAME
C
LASTNAME
D
DEPT
E
JOB
F
SALARY
G
BONUS
2 20 MICHAEL THOMPSON B01 MANAGER 94250 800
3 30 SALLY KWAN C01 MANAGER 98250 800
4 60 IRVING STERN D11 MANAGER 72250 500
5 70 EVA PULASKI D21 MANAGER 96170 700
6 50 JOHN GEYER E01 MANAGER 80175 800
7 90 ELEEN HENDERSON E11 MANAGER 89750 600
8 100 THEODORE SPENSER E21 MANAGER 86150 500

The Excel stage maps the Microsoft Excel row and column in the specified data range to IBM® DataStage® row and column, and extracts the records.

The following table describes the records that are extracted by the Excel stage when the range expression is Employee_Salary!A2:G8.

Table 2. Example of DataStage row and column
EMPNO FIRSTNAME LASTNAME DEPT JOB SALARY BONUS
20 MICHAEL THOMPSON B01 MANAGER 94250 800
30 SALLY KWAN C01 MANAGER 98250 800
60 IRVING STERN D11 MANAGER 72250 500
70 EVA PULASKI D21 MANAGER 96170 700
50 JOHN GEYER E01 MANAGER 80175 800
90 ELEEN HENDERSON E11 MANAGER 89750 600
100 THEODORE SPENSER E21 MANAGER 86150 500

If you want to use the value of cells in the first row as the IBM InfoSphere® DataStage column name, then you can use the Column header property. If the Column header property is set to the First row of data ranges, and if you specify the range expression as Employee_Salary!A1:G8, the first row is treated as header, and the value of the cells in the first row is used as default DataStage column name in the job. You can generate range expression at design time by using Excel stage.

Types of data that can be extracted from Microsoft Excel

You can use the Excel stage to extract several types of data from a Microsoft Excel file.

File properties
The following table lists the information that can be extracted as file properties:
Table 3. Data that can be extracted as file properties
Data Description
File name Name of the file. For example: Workbook1.xls
File path Path of the file. For example: C:\excel\Workbook1.xls
File size Size of the file in bytes.
Last modified date The date and time that the file was last modified.
Document properties
The following table lists the information that can be extracted as document properties:
Table 4. Data that can be extracted as document properties
Data Description
Authors Authors of the document.
Document comments Comments of the document.
Content creation date The date and time that the document was created.
Key words Key words of the document.
Revision number Revision number of the document.
Subject Subject of the document.
Title Title of the document.
Company Company property value of the document.
Category Category of the document.
Manager Manager of the document.
Custom properties Custom properties of the document. You must specify the name of the custom property to extract.
Sheet information
The following table lists the information that can be extracted as sheet information:
Table 5. Data that can be extracted as sheet information
Data Description
Sheet name Name of the Microsoft Excel sheet.
Header (left, center, right) Header of the specified position.
Footer (left, center, right) Footer of the specified position.
Row information
The following table lists the information that can be extracted as row information:
Table 6. Data that can be extracted as row information
Data Description
Row number Microsoft Excel row number within the sheet. The first row number is 1.
Is hidden Whether the row is hidden or not. Writes true if the row or the sheet to which this row belongs is hidden.
Cell information
You can extract the cell information based on the Microsoft Excel column or the cell position. You can specify the source Microsoft Excel column based on the relative position within the data range when you are extracting the cell information based on the Microsoft Excel column.
The following table lists information that can be extracted as cell information:
Table 7. Data that can be extracted as cell information
Data Description
Value Value of a cell. If the cell has a formula, the stage extracts the value from the cache.
Comment Comment of a cell.
Author of Comment Author of the comment of a cell.
Formula Formula of a cell in text.
Hyperlink Type Type of hyperlink of a cell.
Hyperlink Address The address this hyperlink points to. The format depends on the type of this hyperlink.
Hyperlink label Text label for this hyperlink.

Runtime column propagation

In IBM DataStage, you can configure a job to propagate extra columns that are not defined in the metadata through the rest of the job. This process is known as runtime column propagation (RCP).

When runtime column propagation is enabled, the Excel stage propagates Microsoft Excel columns based on the first data range. If wildcard characters are used in the file name, the first file that matches the expression is used. The setting of the hidden columns property determines whether a hidden column is propagated. For each propagated Microsoft Excel column, only cell values are extracted. To extract information such as the file name, sheet name, or the row number, you can define the additional columns in the configuration window.

Column naming rules

IBM DataStage columns are named based on the Microsoft Excel column letter of the first data range. The column name is prefixed by "Column_" followed by the Microsoft Excel column letter. For example, Column_A, Column_B, Column_C, and so on.

If the job already has a column with the name, the job stops.

Data types

All columns that are added by the Excel stage are in Unicode Varchar type with undefined length.