Duplicate records in a data set must be removed before data
mining can begin. For example, in a marketing database, individuals may appear multiple times with
different address or company information. You can use the Distinct node to find or remove duplicate
records in your data, or to create a single, composite record from a group of duplicate
records.
To use the Distinct node, you must first define a set of key fields that
determine when two records are considered to be duplicates.
If you do not pick all your fields as key fields, then two "duplicate" records
may not be truly identical because they can still differ in the values of the remaining fields. In
this case, you can also define a sort order that is applied within each group of duplicate records.
This sort order gives you fine control over which record is treated as the first within a group.
Otherwise, all duplicates are considered to be interchangeable and any record might be selected. The
incoming order of the records is not taken into account, so it doesn't help to use an upstream Sort
node (see "Sorting records within the Distinct node" on this page).
Mode. Specify whether to create a composite record, or
to either include or exclude (discard) the first record.
Create a composite record for each group. Provides a
way for you to aggregate non-numeric fields. Selecting this option makes the Composite tab available
where you specify how to create the composite records.
Include only the first record in each group. Selects
the first record from each group of duplicate records and discards the rest. The first
record is determined by the sort order defined under the setting Within groups, sort
records by, and not by the incoming order of the records.
Discard only the first record in each group. Discards
the first record from each group of duplicate records and selects the remainder instead. The
first record is determined by the sort order defined under the setting
Within groups, sort records by, and not by the incoming order of the records.
This option is useful for finding duplicates in your data so that you can examine them
later in the flow.
Key fields for grouping. Lists the field or fields used
to determine whether records are identical. You can:
Add fields to this list using the field picker button.
Delete fields from the list by using the red X (remove) button.
Within groups, sort records by. Lists the fields used
to determine how records are sorted within each group of duplicates, and whether they are sorted in
ascending or descending order. You can:
Add fields to this list using the field picker button.
Delete fields from the list by using the red X (remove) button.
Move fields using the up or down buttons, if you are sorting by more than
one field.
You must specify a sort order if you have chosen to include or exclude the
first record in each group, and it matters to you which record is treated as the first.
You may also want to specify a sort order if you have chosen to create a
composite record, for certain options on the Composite tab.
Specify whether, by default, records are sorted in
Ascending or Descending order of the sort key
values.
Sorting records within the Distinct node
Copy link to section
If the order of records within a group of duplicates is important to you, then
you must specify the order using the Within groups, sort records by option in
the Distinct node. Do not rely on an upstream Sort node. Remember that the incoming order of the
records is not taken into account -- only the order specified within the node.
If you do not specify any sort fields (or you specify insufficient sort
fields), then the records within each group of duplicates will be unordered (or incompletely
ordered) and the results may be unpredictable.
For example, assume we have a very large set of log records pertaining to a
number of machines. The log contains data such as the following:
Table 1. Machine log data
Timestamp
Machine
Temperature
17:00:22
Machine A
31
13:11:30
Machine B
26
16:49:59
Machine A
30
18:06:30
Machine X
32
16:17:33
Machine A
29
19:59:04
Machine C
35
19:20:55
Machine Y
34
15:36:14
Machine X
28
12:30:41
Machine Y
25
14:45:49
Machine C
27
19:42:00
Machine B
34
20:51:09
Machine Y
36
19:07:23
Machine X
33
To reduce the number of records down to the latest record for each machine,
use Machine as the key field and use Timestamp as the sort field
(in descending order). The input order does not affect the result because the sort selection
specifies which of the many rows for a given Machine is to be returned, and the final data output
would be as follows.
About cookies on this siteOur websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising.For more information, please review your cookie preferences options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.