You can use Sample nodes to select a subset of records for
analysis, or to specify a proportion of records to discard. Various sample types are supported,
including stratified, clustered, and nonrandom (structured) samples.
Sampling can be used for several reasons:
To improve performance by estimating models on a subset of the data. Models
that are estimated from a sample are often as accurate as models derived from the full data set. And
they can be even more accurate if you can use the improved performance to experiment with more
methods than you might otherwise attempt.
To select groups of related records or transactions for analysis, such as
selecting all the items in an online shopping cart (or market basket), or all the properties in a
specific neighborhood.
To identify units or cases for random inspection in the interest of quality
assurance, fraud prevention, or security.
Note: If you simply want to partition your data into training and test samples for purposes of
validation, a Partition node can be used instead. For more information, see Partition node.
Types of samples
Clustered samples. Sample groups or clusters rather
than individual units. For example, suppose you have a data file with one record per student. If you
cluster by school and the sample size is 50%, then 50% of schools are chosen and all students from
each of the selected schools are picked. Students in the other schools are ignored. On average, you
would expect about 50% of students to be picked, but because schools vary in size, the percentage
might not be exact. Similarly, you could cluster shopping cart items by transaction ID to make sure
that all items from selected transactions are maintained.
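The school example can be sketched in a few lines of Python. This is only an illustration of the idea, not how the Sample node is implemented; the function `cluster_sample`, the field names, and the student data are all hypothetical.

```python
import random

def cluster_sample(records, cluster_key, fraction, seed=None):
    """Pick a fraction of the clusters, then keep every record in them."""
    rng = random.Random(seed)
    # Distinct cluster identifiers (for example, school names).
    clusters = sorted({rec[cluster_key] for rec in records})
    n_pick = round(len(clusters) * fraction)
    chosen = set(rng.sample(clusters, n_pick))
    # All records from the chosen clusters are kept; the rest are ignored.
    return [rec for rec in records if rec[cluster_key] in chosen]

# Hypothetical data: one record per student, schools of unequal size.
students = (
    [{"student": f"a{i}", "school": "A"} for i in range(10)]
    + [{"student": f"b{i}", "school": "B"} for i in range(30)]
    + [{"student": f"c{i}", "school": "C"} for i in range(20)]
    + [{"student": f"d{i}", "school": "D"} for i in range(40)]
)

sample = cluster_sample(students, "school", 0.5, seed=1)
share = len(sample) / len(students)  # near 50% on average, rarely exact
```

Exactly half of the schools are selected, but `share` depends on which schools were drawn, because the schools differ in size.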
Stratified samples. Select samples independently within
non-overlapping subgroups of the population, or strata. For example, you can ensure that men and
women are sampled in equal proportions, or that every region or socioeconomic group within an urban
population is represented. You can also specify a different sample size for each stratum (for
example, if you think that one group is under-represented in the original data).
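As a rough sketch of stratified selection with a different sample size per stratum, the following Python draws independently within each subgroup. The function `stratified_sample`, the `region` field, and the fractions are assumptions made for illustration and do not reflect the node's internals.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_key, fractions, seed=None):
    """Draw a simple random sample independently within each stratum."""
    rng = random.Random(seed)
    # Group records into non-overlapping strata.
    strata = defaultdict(list)
    for rec in records:
        strata[rec[stratum_key]].append(rec)
    sample = []
    for name, members in strata.items():
        # Each stratum has its own sampling fraction.
        n_pick = round(len(members) * fractions[name])
        sample.extend(rng.sample(members, n_pick))
    return sample

# Hypothetical data: the "south" region is under-represented (20 of 100).
people = (
    [{"id": i, "region": "north"} for i in range(80)]
    + [{"id": i, "region": "south"} for i in range(80, 100)]
)

# Sample 25% of the north but keep all of the south, so that both
# regions contribute the same number of records.
picked = stratified_sample(people, "region", {"north": 0.25, "south": 1.0}, seed=7)
```

With these fractions, each region contributes 20 records to the sample even though the original data is split 80/20.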
Systematic or 1-in-n sampling. When random selection
is difficult to achieve, units can be sampled systematically (at a fixed interval) or sequentially.
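A minimal sketch of 1-in-n selection follows. The helper `one_in_n` is hypothetical, and starting from a random offset is a common convention assumed here rather than something this topic specifies.

```python
import random

def one_in_n(records, n, seed=None):
    """Keep every n-th record, starting from a random offset in [0, n)."""
    start = random.Random(seed).randrange(n)
    # Slice with a fixed interval of n from the chosen starting position.
    return records[start::n]

rows = list(range(100))          # stand-in for 100 records in file order
sample = one_in_n(rows, 10, seed=3)
```

Because the interval is fixed, the selected records are evenly spaced through the data; any periodic pattern in the record order carries over into the sample.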
Sampling weights. Sampling weights are automatically
computed while drawing a complex sample and roughly correspond to the "frequency" that each sampled
unit represents in the original data. Therefore, the sum of the weights over the sample should
estimate the size of the original data.
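To illustrate why the weights sum to roughly the size of the original data, the sketch below assigns each sampled unit a weight equal to the number of original records it stands for (stratum size divided by the number drawn from that stratum). This is a standard inverse-probability convention assumed for illustration; the exact computation used by the node is not described here, and `sample_with_weights` is a hypothetical name.

```python
import random

def sample_with_weights(strata, fractions, seed=None):
    """Sample within each stratum and attach a weight to every sampled unit."""
    rng = random.Random(seed)
    weighted = []
    for name, members in strata.items():
        n_pick = round(len(members) * fractions[name])
        if n_pick == 0:
            continue  # nothing drawn from this stratum, so no weight to assign
        # Each sampled unit represents this many units in the original data.
        weight = len(members) / n_pick
        for unit in rng.sample(members, n_pick):
            weighted.append((unit, weight))
    return weighted

# Hypothetical strata of 80 and 20 records, sampled at different rates.
strata = {"north": list(range(80)), "south": list(range(80, 100))}
weighted = sample_with_weights(strata, {"north": 0.25, "south": 0.5}, seed=11)

total_weight = sum(w for _, w in weighted)  # estimates the original size
```

Here 20 northern units carry a weight of 4 and 10 southern units carry a weight of 2, so `total_weight` recovers the 100 original records.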
Sampling frame
A sampling frame defines the potential source of cases to be included in a
sample or study. Sometimes, it is feasible to identify every member of a population and include any
one of them in a sample--for example, when sampling items that come off a production line. More
often, you are not able to access every possible case. For example, you cannot be sure who will vote
in an election until after the election happens. In this case, you could use the electoral register
as your sampling frame even if some registered people won’t vote. And some people might vote despite
not having been listed at the time you checked the register. Anybody not in the sampling frame has
no prospect of being sampled. Whether your sampling frame is close enough in nature to the
population you are trying to evaluate is a question that must be addressed for each real-life
case.