Extracting the data from Microsoft Excel (DataStage)
You can use the Excel stage to extract several types of data from a selected data range in a Microsoft Excel file.
Data ranges
When you use the Excel stage, you can extract data from a specified data range in a Microsoft Excel spreadsheet.
Data range represents a cell, a row, a column, or a selection of cells that contain one or more continuous blocks of cells. Data range is specified by the range expression. In the Excel stage, you can use a range expression to specify the data range to extract.
For example, Employee_Salary!A1:G8 describes a data range in which the first cell is A1 and the last cell is G8 in the Employee_Salary spreadsheet.
1 | A EMPNO
|
B FIRSTNAME
|
C LASTNAME
|
D DEPT
|
E JOB
|
F SALARY
|
G BONUS
|
---|---|---|---|---|---|---|---|
2 | 20 | MICHAEL | THOMPSON | B01 | MANAGER | 94250 | 800 |
3 | 30 | SALLY | KWAN | C01 | MANAGER | 98250 | 800 |
4 | 60 | IRVING | STERN | D11 | MANAGER | 72250 | 500 |
5 | 70 | EVA | PULASKI | D21 | MANAGER | 96170 | 700 |
6 | 50 | JOHN | GEYER | E01 | MANAGER | 80175 | 800 |
7 | 90 | ELEEN | HENDERSON | E11 | MANAGER | 89750 | 600 |
8 | 100 | THEODORE | SPENSER | E21 | MANAGER | 86150 | 500 |
The Excel stage maps the Microsoft Excel row and column in the specified data range to IBM® DataStage® row and column, and extracts the records.
The following table describes the records that are extracted by the Excel stage when the range expression is Employee_Salary!A2:G8.
EMPNO | FIRSTNAME | LASTNAME | DEPT | JOB | SALARY | BONUS |
---|---|---|---|---|---|---|
20 | MICHAEL | THOMPSON | B01 | MANAGER | 94250 | 800 |
30 | SALLY | KWAN | C01 | MANAGER | 98250 | 800 |
60 | IRVING | STERN | D11 | MANAGER | 72250 | 500 |
70 | EVA | PULASKI | D21 | MANAGER | 96170 | 700 |
50 | JOHN | GEYER | E01 | MANAGER | 80175 | 800 |
90 | ELEEN | HENDERSON | E11 | MANAGER | 89750 | 600 |
100 | THEODORE | SPENSER | E21 | MANAGER | 86150 | 500 |
If you want to use the value of cells in the first row as the IBM InfoSphere® DataStage column name, then you can use the Column header property. If the Column header property is set to the First row of data ranges, and if you specify the range expression as Employee_Salary!A1:G8, the first row is treated as header, and the value of the cells in the first row is used as default DataStage column name in the job. You can generate range expression at design time by using Excel stage.
Types of data that can be extracted from Microsoft Excel
You can use the Excel stage to extract several types of data from a Microsoft Excel file.
- File properties
- The following table lists the information that can be extracted as file properties:
Table 3. Data that can be extracted as file properties Data Description File name Name of the file. For example: Workbook1.xls File path Path of the file. For example: C:\excel\Workbook1.xls File size Size of the file in bytes. Last modified date The date and time that the file was last modified.
- Document properties
- The following table lists the information that can be extracted as document properties:
Table 4. Data that can be extracted as document properties Data Description Authors Authors of the document. Document comments Comments of the document. Content creation date The date and time that the document was created. Key words Key words of the document. Revision number Revision number of the document. Subject Subject of the document. Title Title of the document. Company Company property value of the document. Category Category of the document. Manager Manager of the document. Custom properties Custom properties of the document. You must specify the name of the custom property to extract.
- Sheet information
- The following table lists the information that can be extracted as sheet information:
Table 5. Data that can be extracted as sheet information Data Description Sheet name Name of the Microsoft Excel sheet. Header (left, center, right) Header of the specified position. Footer (left, center, right) Footer of the specified position.
- Row information
- The following table lists the information that can be extracted as row information:
Table 6. Data that can be extracted as row information Data Description Row number Microsoft Excel row number within the sheet. The first row number is 1. Is hidden Whether the row is hidden or not. Writes true if the row or the sheet to which this row belongs is hidden.
- Cell information
- You can extract the cell information based on the Microsoft Excel column or the cell position. You can specify the source Microsoft Excel column based on the relative position within the data range when you are extracting the cell information based on the Microsoft Excel column.
Runtime column propagation
In IBM DataStage, you can configure a job to propagate extra columns that are not defined in the metadata through the rest of the job. This process is known as runtime column propagation (RCP).
When runtime column propagation is enabled, the Excel stage propagates Microsoft Excel columns based on the first data range. If wildcard characters are used in the file name, the first file that matches the expression is used. The setting of the hidden columns property determines whether a hidden column is propagated. For each propagated Microsoft Excel column, only cell values are extracted. To extract information such as the file name, sheet name, or the row number, you can define the additional columns in the configuration window.
Column naming rules
IBM DataStage columns are named based on the Microsoft Excel column letter of the first data range. The column name is prefixed by "Column_" followed by the Microsoft Excel column letter. For example, Column_A, Column_B, Column_C, and so on.
If the job already has a column with the name, the job stops.
Data types
All columns that are added by the Excel stage are in Unicode Varchar type with undefined length.