You can use Balance nodes to correct imbalances in datasets so
they conform to specified test criteria.
For example, suppose that a field in a dataset has only two values, low
or high, and that 90% of the cases are low while only 10% of the
cases are high. Many modeling techniques have trouble with such skewed data because
they tend to learn only the low outcome and ignore the high one, which is much
rarer. If the data is well balanced, with approximately equal numbers of low and
high outcomes, models have a better chance of finding patterns that
distinguish the two groups. In this case, a Balance node is useful for creating a balancing
directive that reduces the number of cases with a low outcome.
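
As a rough illustration of the 90/10 example, the following Python sketch works out how strongly the low cases would need to be reduced to match the number of high cases. It assumes that a balancing directive can be summarized by a numeric factor applied to matching records. The function name, the factor convention, and the counts of 900 and 100 are hypothetical and only illustrate the arithmetic.

```python
def reduction_factor(majority_count, minority_count):
    """Return the fraction of majority cases to keep so that the
    majority roughly matches the minority (hypothetical helper)."""
    if majority_count <= 0:
        raise ValueError("majority_count must be positive")
    return minority_count / majority_count


# Assumed counts for a 90% low / 10% high dataset of 1000 cases.
low_count, high_count = 900, 100

factor = reduction_factor(low_count, high_count)  # roughly 0.11
expected_low_after = low_count * factor           # roughly equal to high_count
```

With a factor of about 0.11 applied to the low cases, roughly one low case in nine is kept, which leaves approximately as many low cases as high cases.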
Balancing is carried out by duplicating or discarding records based on
the conditions you specify. Records for which no condition holds are always passed through. Because
this process works by duplicating and/or discarding records, the original sequence of your data is
lost in downstream operations. Be sure to derive any sequence-related values before adding a Balance
node to the data stream.
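
The behavior described above can be sketched in plain Python. This is a minimal illustration and not the Balance node's actual implementation. It assumes that each directive pairs a numeric factor with a condition, that only the first matching directive applies to a record, and that fractional factors are handled by random sampling. The names balance, directives, seq, and outcome are hypothetical. The seq field is derived before balancing so that the original position of each record survives the duplication and discarding.

```python
import random


def balance(records, directives, seed=0):
    """Duplicate or discard records according to (factor, condition) pairs.

    Assumption: the first directive whose condition holds is applied.
    Records that match no condition are passed through unchanged.
    """
    rng = random.Random(seed)
    out = []
    for rec in records:
        for factor, condition in directives:
            if condition(rec):
                whole = int(factor)
                # The fractional part of the factor is the chance of one extra copy.
                extra = 1 if rng.random() < factor - whole else 0
                out.extend([rec] * (whole + extra))
                break
        else:
            # No condition held for this record, so it is always kept.
            out.append(rec)
    return out


# Derive a sequence field first; every tenth record has a high outcome.
records = [
    {"seq": i, "outcome": "high" if i % 10 == 0 else "low"}
    for i in range(1000)
]

# Keep roughly one in nine low records; high records match no condition.
directives = [(1 / 9, lambda r: r["outcome"] == "low")]
balanced = balance(records, directives)
```

In this sketch, a factor below 1 discards a random share of the matching records and a factor of 2 or more duplicates them. Because which low records survive is random, only the seq values derived beforehand still identify where each record sat in the original data.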