Join feature engineering details
Joining data relies on a unique set of criteria and implementation details.
Attention: The AutoAI experiment feature for joining multiple data sources to create a single training data set is deprecated. Support for joining data in an AutoAI experiment will be removed on Dec 7, 2022. After Dec 7, 2022, AutoAI experiments with joined data and deployments of resulting models will no longer run. To join multiple data sources, use a data preparation tool such as Data Refinery or DataStage to join and prepare data, then use the resulting data set for training an AutoAI experiment. Redeploy the resulting model.
How data is joined
When AutoAI joins one table with another table, it evaluates the provided information to determine whether:
- there is a time column in the table on the left
- there is a time column in the table on the right
- the connection between the two tables is a one-to-one relation (one row on the left matches at most one row on the right) or one-to-many (one row on the left can match multiple rows on the right)
- the column in the right table that it extracts features from contains numerical, categorical, or timestamp values
Based on the information from the two tables, AutoAI then classifies the join data into one of the types of data as shown in this table.
| Column type | Cutoff timestamp in main table | Timestamp | Join path type | Extractors |
|---|---|---|---|---|
| Numerical | no | yes | one-to-many | TimeSeries |
| Numerical | yes | yes | one-to-many | TimeSeriesCutoff |
| Numerical | yes/no | no | one-to-many | NumberSet |
| Categorical | no | yes | one-to-many | SymbolicSequence, SymbolicSequencePattern |
| Categorical | yes | yes | one-to-many | SymbolicSequenceCutoff, SymbolicSequencePattern |
| Categorical | yes/no | no | one-to-many | ItemSet, ItemSetPattern |
| Timestamp | yes | yes | one-to-many | TimeStampSeriesCutoff |
| Timestamp | no | yes | one-to-many | TimeStampSeries |
| Timestamp | yes/no | yes/no | one-to-one | Timestamp |
| Any | yes/no | yes/no | one-to-one | Identity |
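The selection logic in the table can be sketched as a simple decision function. This is an illustrative reconstruction of the documented rules, not AutoAI's actual code; the function and argument names are hypothetical.

```python
# Hypothetical sketch of the extractor-selection rules in the table above.
# Names are illustrative only; AutoAI's internal implementation may differ.

def choose_extractors(column_type, has_cutoff, has_timestamp, join_type):
    """Return the extractor names for a joined column, per the table."""
    if join_type == "one-to-one":
        return ["Timestamp"] if column_type == "timestamp" else ["Identity"]
    if column_type == "numerical":
        if not has_timestamp:
            return ["NumberSet"]
        return ["TimeSeriesCutoff"] if has_cutoff else ["TimeSeries"]
    if column_type == "categorical":
        if not has_timestamp:
            return ["ItemSet", "ItemSetPattern"]
        base = "SymbolicSequenceCutoff" if has_cutoff else "SymbolicSequence"
        return [base, "SymbolicSequencePattern"]
    if column_type == "timestamp":
        return ["TimeStampSeriesCutoff"] if has_cutoff else ["TimeStampSeries"]
    raise ValueError(f"unknown column type: {column_type}")
```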
Note: Each extractor in the table handles one type of unstructured data that results from joining and grouping the data.
Performing feature engineering on the joined data
Each extractor, in turn, provides a rich set of aggregation functions that turn the given type of data into multiple features. These are functions that data scientists and data miners commonly apply to that type of data. For example, the TimeSeries extractor computes statistics that summarize a time series, such as mean, max, variance, min, sum, lag values, trends, and sliding-window statistics. The SymbolicSequence extractor deals with sequences of categorical values. It extracts symbols that have a high correlation with the target, along with other statistics of the sequence, such as its length and the number of distinct symbols.
Because the generated set of features might contain uninformative features, AutoAI uses basic feature selection methods to remove features that are redundant, have inconsistent distributions, or have low correlation with the prediction target.
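The low-correlation filter can be illustrated with a minimal sketch. The function name and the threshold value are hypothetical; this is a simplified stand-in for AutoAI's internal selection step, not its actual implementation.

```python
import numpy as np

def drop_weak_features(X, y, min_abs_corr=0.05):
    """Keep only columns of X whose absolute Pearson correlation with the
    target y reaches the threshold. Simplified illustration; the threshold
    and the exact criteria AutoAI uses are assumptions."""
    kept = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if np.std(col) == 0:  # constant column carries no information
            continue
        corr = np.corrcoef(col, y)[0, 1]
        if abs(corr) >= min_abs_corr:
            kept.append(j)
    return X[:, kept], kept

# Example: column 0 tracks the target closely, column 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = np.column_stack([y + 0.1 * rng.normal(size=200), rng.normal(size=200)])
X_sel, kept = drop_weak_features(X, y, min_abs_corr=0.5)
```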
Extractor details
Each extractor in AutoAI has a list of popular aggregation functions to transform the joined data into features. These aggregation functions are either simple summarization statistics of the joined data or discriminative patterns extracted by using information from the target columns.
TimeSeries
The TimeSeries extractor deals with temporally ordered numbers. It generates features that summarize the timeseries data, taking into account the temporal order of the input. The list of features is summarized as follows:
- min: minimum value of the timeseries
- max: maximum value of the timeseries
- mean: mean value of the timeseries
- variance: variance of the timeseries. This feature is useful for anomaly or event prediction problems.
- sum: sum of the timeseries
- trend: the ratio between the most recent value and the sum of the values of the timeseries. This feature is usually useful for demand and event prediction.
- recent: k recent values (or timeseries lag values), where k is configurable. This feature is useful for demand prediction problems because timeseries are usually auto-correlated.
- sliding-window statistics: max, min, mean, and variance of the decaying sliding window with length w = W/2, where W is the length of the timeseries. For example, if the timeseries is [1, 2, 3, 4, 5, 6, 7, 8], then the sliding window has length w = 8/2 = 4, and the statistics are calculated from the sub-series [5, 6, 7, 8]. These features are similar to the general statistics, but they capture more recent information in the data.
- sliding-window trends: the ratio between the max, min, mean, and variance of the exponentially decaying sliding window with length w = W/2 and the max, min, mean, and variance of the original timeseries, respectively, where W is the length of the timeseries. For example, if the timeseries is [1, 2, 3, 4, 5, 6, 7, 8], the statistics are calculated from the sub-series [5, 6, 7, 8] and then normalized by the corresponding statistics of the full timeseries. These features hint at changes or trends in the recent data, which is useful for anomaly detection.
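The feature list above can be sketched in a few lines. This is a minimal illustration of the documented summaries, not AutoAI's code; in particular, the decay weighting of the sliding window is omitted here and the dictionary keys are hypothetical names.

```python
import numpy as np

def timeseries_features(ts, k=2):
    """Sketch of the TimeSeries extractor's summary features. Names follow
    the list above; the exact AutoAI implementation may differ, and the
    exponential decay of the sliding window is left out for brevity."""
    ts = np.asarray(ts, dtype=float)
    w = ts[len(ts) // 2:]  # plain sliding window of length W/2
    return {
        "min": ts.min(), "max": ts.max(),
        "mean": ts.mean(), "variance": ts.var(),
        "sum": ts.sum(),
        "trend": ts[-1] / ts.sum(),            # most recent value / sum
        "recent": list(ts[-k:]),               # k lag values
        "window_mean": w.mean(),               # sliding-window statistic
        "window_trend": w.mean() / ts.mean(),  # window stat / global stat
    }

f = timeseries_features([1, 2, 3, 4, 5, 6, 7, 8])
# the window is [5, 6, 7, 8], matching the example in the text
```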
TimeSeriesCutoff
The TimeSeriesCutoff extractor generates features that summarize the timeseries data, taking into account both the temporal order of the input data and the availability of cutoff information. Recall that the cutoff timestamp is the moment when the prediction is expected to be made. Most of the features are similar to the TimeSeries features, except for the normalised_sum feature, which uses the cutoff information. The list of features is summarized as:
- min: minimum value of the timeseries
- max: maximum value of the timeseries
- mean: mean value of the timeseries
- variance: variance of the timeseries
- sum: sum of the timeseries
- trend: the ratio between the most recent value and the sum of the values of the timeseries
- normalised_sum: sum of the timeseries normalized by the time span from the earliest timestamp to the cutoff timestamp. The normalization is useful in cases where the timeseries lengths are varied and time intervals are not uniformly distributed.
- recent: k recent values of the timeseries or lag values, where k is configurable
- sliding-window statistics: see TimeSeries features
- sliding-window trends: see TimeSeries features
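The cutoff-specific normalised_sum feature can be illustrated with a small sketch. Timestamps are plain numbers here for simplicity; real data would use datetimes. The function name mirrors the feature name but is otherwise hypothetical.

```python
import numpy as np

def normalised_sum(values, timestamps, cutoff):
    """Sketch of the TimeSeriesCutoff normalised_sum feature: the series
    sum divided by the span from the earliest timestamp to the cutoff."""
    span = cutoff - min(timestamps)
    return np.sum(values) / span

# Two series with the same sum but different time spans get different
# features, which is the point of the normalization.
a = normalised_sum([10, 10, 10], timestamps=[0, 1, 2], cutoff=10)
b = normalised_sum([10, 10, 10], timestamps=[0, 50, 99], cutoff=100)
```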
Timestamp
The Timestamp extractor generates calendar features. These features are useful for problems like demand prediction because they capture seasonality information in the data. The list of features is summarized as:
- month: month number in a year, for example January is 1 and May is 5
- day_of_month: day number in a month, for example 29 May is 29
- day_of_week: day number in a week, for example Sunday is 1 and Monday is 2
- day_of_year: day number in a year, for example 01/01/2000 is 0
- hour: hour in day
- minute: minute in the hour
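These calendar features map directly onto Python's standard `datetime` accessors. The sketch below follows the conventions listed above (Sunday = 1, zero-based day_of_year); the function name and dictionary keys are illustrative.

```python
from datetime import datetime

def calendar_features(ts):
    """Sketch of the Timestamp extractor's calendar features, following
    the conventions in the list above (assumed, not AutoAI's code)."""
    return {
        "month": ts.month,
        "day_of_month": ts.day,
        # datetime.weekday() has Monday=0..Sunday=6; remap to Sunday=1..Saturday=7
        "day_of_week": (ts.weekday() + 1) % 7 + 1,
        "day_of_year": ts.timetuple().tm_yday - 1,  # 1 January = 0
        "hour": ts.hour,
        "minute": ts.minute,
    }

f = calendar_features(datetime(2000, 1, 1, 9, 30))  # 1 Jan 2000, a Saturday
```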
TimeStampSeries
The TimeStampSeries extractor extracts features from a timestamp series. The list of features is summarized as:
- count: the number of elements in the series
TimeStampSeriesCutoff
The TimeStampSeriesCutoff extractor extracts features from a timestamp series in the presence of cutoff information. These features are useful for problems where the time intervals between events are irregular; with a cutoff timestamp, features can capture that irregularity. The list of features is summarized as:
- count: the number of elements in the series
- normalised_count: the number of elements in the series normalized by the time interval between the earliest timestamp to the cutoff timestamp. The normalization is useful in cases where the series lengths are varied and time intervals between events are not uniformly distributed.
- max_gap_to_cutoff: the time interval between the earliest timestamp and the cutoff timestamp.
- recent_gap_to_cutoff: k recent gaps between the timestamps and the cutoff, where k is configurable.
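The features above can be sketched as follows. Gaps are expressed in seconds, and the names mirror the list but are otherwise hypothetical; this is an illustration, not AutoAI's implementation.

```python
from datetime import datetime

def cutoff_features(stamps, cutoff, k=2):
    """Sketch of the TimeStampSeriesCutoff features listed above."""
    span = (cutoff - min(stamps)).total_seconds()
    return {
        "count": len(stamps),
        "normalised_count": len(stamps) / span,
        # the earliest timestamp has the largest gap to the cutoff
        "max_gap_to_cutoff": span,
        "recent_gap_to_cutoff": [(cutoff - s).total_seconds()
                                 for s in sorted(stamps)[-k:]],
    }

stamps = [datetime(2022, 1, 1), datetime(2022, 1, 3), datetime(2022, 1, 4)]
f = cutoff_features(stamps, cutoff=datetime(2022, 1, 5))
```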
SymbolicSequence
The SymbolicSequence extractor extracts features from symbolic sequences. The list of features is summarized as:
- count: the number of elements in the sequence
- distinct_count: the number of distinct elements in the sequence
- recent: k recent symbols, where k is a configurable hyperparameter
SymbolicSequencePattern
The SymbolicSequencePattern extractor extracts features from a list of symbolic sequences. It calculates the top-k symbols most correlated with the prediction target. In particular, for each symbol, it evaluates the correlation with (or the information gain about) the target when it is known whether the symbol is present or absent in a sequence. Such symbols are called discriminative symbols and are widely used in the sequential pattern mining community. For example:
- Highly correlated symbols: assume the following symbolic sequences and corresponding labels: {(aba, 0), (ab, 0), (ab, 0), (b, 1), (b, 1), (b, 1)}. When the symbol a is present in a sequence, the label is always 0, while when a is absent, the label is always 1. Therefore, a is a discriminative pattern: knowing whether a is present in a sequence, we can predict the target variable perfectly. AutoAI checks each symbol, calculates its correlation (for regression problems) or information gain (for classification problems) with the prediction target, and uses the frequency of the top symbols with the highest correlation (or information gain) as features. These features are named SeqMI_a when the problem is classification and SeqCOR_a when the problem is regression, where a is the discriminative symbol. In this example, b is not a discriminative pattern because its occurrence in a sequence does not hint at anything about the target variable.
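The information-gain calculation for the classification case can be reproduced on the example from the text. This is a standard entropy-based computation, sketched here with hypothetical function names; it is not AutoAI's code.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(sequences, labels, symbol):
    """Information gain about the target from knowing whether `symbol`
    appears in a sequence (the classification case described above)."""
    present = [lab for seq, lab in zip(sequences, labels) if symbol in seq]
    absent = [lab for seq, lab in zip(sequences, labels) if symbol not in seq]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional

# The example from the text: 'a' separates the labels perfectly, 'b' does not.
seqs, labels = ["aba", "ab", "ab", "b", "b", "b"], [0, 0, 0, 1, 1, 1]
gain_a = information_gain(seqs, labels, "a")  # perfect split -> gain 1.0
gain_b = information_gain(seqs, labels, "b")  # 'b' is in every sequence -> 0.0
```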
SymbolicSequenceCutoff
The SymbolicSequenceCutoff extractor extracts features from symbolic sequences in the presence of cutoff timestamps. The list of features is summarized as:
- count: the number of elements in the sequence
- distinct_count: the number of distinct elements in the sequence
- recent: k recent symbols, where k is a configurable hyperparameter
ItemSet
The ItemSet extractor extracts features from an itemset or multi-itemset. The list of features is summarized as:
- count: the number of elements in the itemset
- distinct_count: the number of distinct elements in the itemset
ItemSetPattern
The ItemSetPattern extractor extracts features from a list of itemsets. It calculates the top-k items most correlated with the prediction target. In particular, for each item, it evaluates the correlation with (or the information gain about) the target when the item is present or absent in an itemset. Such items are known as discriminative items and are widely used in the pattern mining community. For example:
- Highly correlated items: assume the following itemsets (itemset is used to refer to multi-sets as well in this document) and corresponding labels: {(aba, 0), (ab, 0), (ab, 0), (b, 1), (b, 1), (b, 1)}. When a is present in an itemset, the label is always 0, while when a is absent, the label is always 1. Therefore, a is a discriminative pattern: knowing whether a is present in an itemset, we can predict the target variable perfectly. AutoAI checks each item, calculates its correlation (for regression problems) or information gain (for classification problems) with the prediction target, and uses the frequency of the top items with the highest correlation (or information gain) as features. These features are named ItemSetMI_a when the problem is classification and ItemSetCOR_a when the problem is regression, where a is the discriminative item. In this example, b is not a discriminative pattern because its occurrence in an itemset does not hint at anything about the target variable.
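For the regression case, the score is a correlation between the item's presence indicator and the numeric target. The sketch below uses Pearson correlation as a plausible reading of the ItemSetCOR feature; the function name and data are hypothetical.

```python
import numpy as np

def item_correlation(itemsets, targets, item):
    """Pearson correlation between an item's presence indicator (1 if the
    item is in the set, else 0) and a numeric target: a sketch of the
    regression-side score (ItemSetCOR_<item> in the text)."""
    presence = np.array([1.0 if item in s else 0.0 for s in itemsets])
    return np.corrcoef(presence, np.asarray(targets, dtype=float))[0, 1]

# Hypothetical data: the presence of 'a' tracks a low target value.
sets = [{"a", "b"}, {"a"}, {"b"}, {"b", "c"}]
y = [1.0, 1.2, 5.0, 5.1]
corr_a = item_correlation(sets, y, "a")  # strongly negative correlation
```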
NumberSet
The NumberSet extractor generates features that summarize a set or multiset of numbers. The list of features is summarized as:
- min: minimum value of the set
- max: maximum value of the set
- mean: mean value of the set
- variance: variance of the set
- sum: sum of the set
- count: the size of the set
Identity
When the connection is one-to-one, the Identity extractor keeps the value as is.
Category
Categorical columns are transformed with frequency transformations where the most frequent category is transformed to 0, the second most frequent category is transformed to 1 and so on.
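The frequency transformation can be sketched as follows. The function name is illustrative; ties between equally frequent categories are broken by first appearance here, which is an assumption rather than documented AutoAI behavior.

```python
from collections import Counter

def frequency_encode(values):
    """Sketch of the frequency transformation described above: the most
    frequent category maps to 0, the next most frequent to 1, and so on."""
    ranked = [cat for cat, _ in Counter(values).most_common()]
    mapping = {cat: rank for rank, cat in enumerate(ranked)}
    return [mapping[v] for v in values], mapping

codes, mapping = frequency_encode(["red", "blue", "red", "red", "green", "blue"])
# "red" (3 occurrences) -> 0, "blue" (2) -> 1, "green" (1) -> 2
```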
Parent topic: Building an experiment with joined data