Using autoai-lib for Python (Beta)

The autoai-lib library for Python contains a set of functions that help you to interact with IBM Watson Machine Learning AutoAI experiments. Using the autoai-lib library, you can review and edit the data transformations that take place in the creation of the pipeline.

Installing autoai-lib for Python

Follow the instructions in Installing custom libraries to install autoai-lib.

The autoai-lib functions

The instantiated project object that is created after you have imported the autoai-lib library exposes these functions:

autoai_libs.transformers.exportable.NumpyColumnSelector()

Selects a subset of columns of a numpy array.

Usage:

autoai_libs.transformers.exportable.NumpyColumnSelector(columns=None)
Option Description
columns list of column indices to select
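The selection is equivalent to plain numpy column indexing. A minimal sketch (illustrative only, not the library's implementation):

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])

# Selecting columns 0 and 2, as NumpyColumnSelector(columns=[0, 2]) would
columns = [0, 2]
selected = X[:, columns]
```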

 

autoai_libs.transformers.exportable.CompressStrings()

Removes spaces and special characters from string columns of an input numpy array X.

Usage:

autoai_libs.transformers.exportable.CompressStrings(compress_type='string', dtypes_list=None, misslist_list=None, missing_values_reference_list=None, activate_flag=True)
Option Description
compress_type type of string compression: 'string' removes spaces from a string; 'hash' creates an int hash. Default is 'string'. Use 'hash' when there are string columns and cat_imp_strategy='most_frequent'.
dtypes_list list containing strings that denote the type of each column of the input numpy array X (each string is one of 'char_str', 'int_str', 'float_str', 'float_num', 'float_int_num', 'int_num', 'boolean', 'Unknown'). If None, the column types are discovered. Default is None.
misslist_list list containing lists of missing values of each column of the input numpy array X. If None, the missing values of each column are discovered. Default is None.
missing_values_reference_list reference list of missing values in the input numpy array X
activate_flag flag that indicates that this transformer will be active. If False, transform(X) outputs the input numpy array X unmodified.
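A minimal numpy sketch of the two compression modes (illustrative only, not the library's implementation; the library also strips special characters):

```python
import numpy as np

X = np.array([["New York"], ["San  Francisco"]], dtype=object)

# compress_type='string': remove spaces from each string
compressed = np.array([[s.replace(" ", "") for s in row] for row in X],
                      dtype=object)

# compress_type='hash': map each string to an int hash instead
hashed = np.array([[hash(s) for s in row] for row in X], dtype=object)
```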

 

autoai_libs.transformers.exportable.NumpyReplaceMissingValues()

Given a numpy array and a reference list of missing values for it, replaces missing values with a special value (typically a special missing value such as np.nan).

Usage:

autoai_libs.transformers.exportable.NumpyReplaceMissingValues(missing_values, filling_values=np.nan)
Option Description
missing_values reference list of missing values
filling_values special value assigned to unknown values
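The effect can be sketched with numpy alone (illustrative only, not the library's implementation):

```python
import numpy as np

X = np.array([[1.0, -999.0],
              [3.0, 4.0]])
missing_values = [-999.0]   # reference list of missing values
filling_value = np.nan      # special value to substitute

# Replace every occurrence of a value in missing_values with filling_value
X_out = np.where(np.isin(X, missing_values), filling_value, X)
```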

 

autoai_libs.transformers.exportable.NumpyReplaceUnknownValues()

Given a numpy array and a reference list of known values for each column, replaces values that are not part of a reference list with a special value (typically np.nan). This is typically used to remove labels for columns in a test dataset that have not been seen in the corresponding columns of the training dataset.

Usage:

autoai_libs.transformers.exportable.NumpyReplaceUnknownValues(known_values_list=None, filling_values=None, missing_values_reference_list=None)
Option Description
known_values_list reference list of lists of known values for each column
filling_values special value assigned to unknown values
missing_values_reference_list reference list of missing values
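A minimal numpy sketch of the idea (illustrative only, not the library's implementation): a test-set label that never appeared in training is replaced with np.nan.

```python
import numpy as np

# Test-set column containing a label ("green") never seen during training
X = np.array([["red"], ["blue"], ["green"]], dtype=object)
known_values_list = [["red", "blue"]]  # one list of known values per column

X_out = X.copy()
for col, known in enumerate(known_values_list):
    mask = ~np.isin(X[:, col], known)
    X_out[mask, col] = np.nan  # the filling value
```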

 

autoai_libs.transformers.exportable.boolean2float()

Converts a 1-D numpy array of strings that represent booleans to floats and replaces missing values with np.nan. Also changes the type of the array from 'object' to 'float'.

Usage:

autoai_libs.transformers.exportable.boolean2float(activate_flag=True)
Option Description
activate_flag flag that indicates that this transformer will be active. If False, transform(X) outputs the input numpy array X unmodified.

 

autoai_libs.transformers.exportable.CatImputer()

This is a wrapper for a categorical imputer. Internally, it currently uses the sklearn SimpleImputer.

Usage:

autoai_libs.transformers.exportable.CatImputer(strategy, missing_values, sklearn_version_family=global_sklearn_version_family, activate_flag=True)
Option Description
strategy string, optional, default='mean'. The imputation strategy for missing values:
- mean: replace using the mean along each column. Can only be used with numeric data.
- median: replace using the median along each column. Can only be used with numeric data.
- most_frequent: replace using the most frequent value along each column. Can be used with strings or numeric data.
- constant: replace with fill_value. Can be used with strings or numeric data.
missing_values number, string, np.nan (default), or None. The placeholder for the missing values. All occurrences of missing_values will be imputed.
sklearn_version_family str indicating the sklearn version for backward compatibility with versions 019 and 020dev. Currently unused. Default is None.
activate_flag flag that indicates that this transformer will be active. If False, transform(X) outputs the input numpy array X unmodified.
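Because the wrapper uses the sklearn SimpleImputer internally, the core behavior can be sketched with sklearn directly (the wrapper itself adds AutoAI-specific plumbing around this):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["a"], ["b"], ["b"], [np.nan]], dtype=object)

# CatImputer(strategy='most_frequent', missing_values=np.nan) behaves like:
imputer = SimpleImputer(strategy="most_frequent", missing_values=np.nan)
X_out = imputer.fit_transform(X)  # the nan is replaced by the mode, "b"
```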

 

autoai_libs.transformers.exportable.CatEncoder()

This is a wrapper for a categorical encoder. If the encoding parameter is 'ordinal', it currently uses the sklearn OrdinalEncoder internally. If the encoding parameter is 'onehot' or 'onehot-dense', it currently uses the sklearn OneHotEncoder internally.

Usage:

autoai_libs.transformers.exportable.CatEncoder(encoding, categories, dtype, handle_unknown, sklearn_version_family=global_sklearn_version_family, activate_flag=True)
Option Description
encoding str, 'onehot', 'onehot-dense', or 'ordinal'. The type of encoding to use (default is 'ordinal').
'onehot': encode the features using a one-hot (also called one-of-K or 'dummy') encoding scheme. This creates a binary column for each category and returns a sparse matrix.
'onehot-dense': the same as 'onehot' but returns a dense array instead of a sparse matrix.
'ordinal': encode the features as ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.
categories 'auto' or a list of lists/arrays of values. Categories (unique values) per feature:
'auto': determine categories automatically from the training data.
list: categories[i] holds the categories expected in the ith column. The passed categories must be sorted and must not mix strings and numeric values. The categories used can be found in the encoder.categories_ attribute.
dtype number type, default np.float64. Desired dtype of the output.
handle_unknown 'error' (default) or 'ignore'. Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to 'ignore' and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category is denoted as None. Ignoring unknown categories is not supported for encoding='ordinal'.
sklearn_version_family str indicating the sklearn version for backward compatibility with versions 019 and 020dev. Currently unused. Default is None.
activate_flag flag that indicates that this transformer will be active. If False, transform(X) outputs the input numpy array X unmodified.
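Because the wrapper delegates to the sklearn encoders, the two encoding modes can be sketched with sklearn directly:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

X = np.array([["red"], ["blue"], ["red"]], dtype=object)

# encoding='ordinal' corresponds to OrdinalEncoder; categories are
# sorted, so blue -> 0.0 and red -> 1.0
ordinal = OrdinalEncoder().fit_transform(X)

# encoding='onehot' corresponds to OneHotEncoder (sparse output);
# 'onehot-dense' returns the dense equivalent
onehot_dense = OneHotEncoder().fit_transform(X).toarray()
```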

 

autoai_libs.transformers.exportable.float32_transform()

Transforms a float64 numpy array to float32.

Usage:

autoai_libs.transformers.exportable.float32_transform(activate_flag=True)
Option Description
activate_flag flag that indicates that this transformer will be active. If False, transform(X) outputs the input numpy array X unmodified.
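When active, the transform is a plain dtype cast. A one-line numpy sketch:

```python
import numpy as np

X = np.array([[1.5, 2.5]], dtype=np.float64)

# What float32_transform does when activate_flag=True
X_out = X.astype(np.float32)
```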

 

autoai_libs.transformers.exportable.FloatStr2Float()

Given a numpy array X and a dtypes_list that denotes the types of its columns, converts columns of strings that represent floats (type 'float_str' in dtypes_list) to columns of floats and replaces their missing values with np.nan.

Usage:

autoai_libs.transformers.exportable.FloatStr2Float(dtypes_list, missing_values_reference_list=None, activate_flag=True)
Option Description
dtypes_list list containing strings that denote the type of each column of the input numpy array X (each string is one of 'char_str', 'int_str', 'float_str', 'float_num', 'float_int_num', 'int_num', 'boolean', 'Unknown').
missing_values_reference_list reference list of missing values
activate_flag flag that indicates that this transformer will be active. If False, transform(X) outputs the input numpy array X unmodified.

 

autoai_libs.transformers.exportable.NumImputer()

This is a wrapper for numerical imputer.

Usage:

autoai_libs.transformers.exportable.NumImputer(strategy, missing_values, activate_flag=True)
Option Description
strategy string, optional (default='mean'). The imputation strategy:
- If 'mean', replace missing values using the mean along the axis.
- If 'median', replace missing values using the median along the axis.
- If 'most_frequent', replace missing values using the most frequent value along the axis.
missing_values integer or 'NaN', optional (default='NaN'). The placeholder for the missing values. All occurrences of missing_values will be imputed. For missing values encoded as np.nan, use the string value 'NaN'.
activate_flag flag that indicates that this transformer will be active. If False, transform(X) outputs the input numpy array X unmodified.
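The core behavior can be sketched with the sklearn SimpleImputer (illustrative; the wrapper adds AutoAI-specific plumbing around this):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [3.0, 4.0]])

# NumImputer(strategy='mean', missing_values='NaN') behaves like:
imputer = SimpleImputer(strategy="mean", missing_values=np.nan)
X_out = imputer.fit_transform(X)  # nan -> column mean
```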

 

autoai_libs.transformers.exportable.OptStandardScaler()

This is a wrapper for scaling of numerical variables. It currently uses sklearn StandardScaler internally.

Usage:

autoai_libs.transformers.exportable.OptStandardScaler(use_scaler_flag=True, num_scaler_copy=True, num_scaler_with_mean=True, num_scaler_with_std=True)
Option Description
num_scaler_copy boolean, optional, default True. If False, try to avoid a copy and scale in place instead. This is not guaranteed to always work: for example, if the data is not a NumPy array or a scipy.sparse CSR matrix, a copy might still be returned.
num_scaler_with_mean boolean, True by default. If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
num_scaler_with_std boolean, True by default. If True, scale the data to unit variance (or equivalently, unit standard deviation).
use_scaler_flag boolean, flag that indicates that this transformer will be active. If False, transform(X) outputs the input numpy array X unmodified. Default is True.
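Because the wrapper uses the sklearn StandardScaler internally, the parameter mapping can be sketched directly:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

# OptStandardScaler(use_scaler_flag=True, num_scaler_copy=True,
# num_scaler_with_mean=True, num_scaler_with_std=True) corresponds to:
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
X_out = scaler.fit_transform(X)  # zero mean, unit variance
```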

 

autoai_libs.transformers.exportable.NumpyPermuteArray()

Rearranges columns or rows of a numpy array based on a list of indices.

Usage:

autoai_libs.transformers.exportable.NumpyPermuteArray(permutation_indices=None, axis=None)
Option Description
permutation_indices list of indexes based on which columns will be rearranged
axis 0 to permute along columns, 1 to permute along rows
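The column case (axis=0) is equivalent to fancy indexing with the permutation. A minimal sketch (illustrative, not the library's implementation):

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])
permutation_indices = [2, 0, 1]

# axis=0: rearrange the columns according to permutation_indices
X_cols = X[:, permutation_indices]
```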

 

Feature transformation

These methods apply to the feature transformations described in AutoAI implementation details.

 

autoai_libs.cognito.transforms.transform_utils.TA1()

For unary stateless functions, such as square or log, use TA1.

Usage:

autoai_libs.cognito.transforms.transform_utils.TA1(fun, name=None, datatypes=None, feat_constraints=None, tgraph=None, apply_all=True, col_names=None, col_dtypes=None)
Option Description
fun the function pointer
name a string name that uniquely identifies this transformer from others
datatypes a list of datatypes either of which are valid input to the transformer function (numeric, float, int, etc.)
feat_constraints all constraints which must be satisfied by a column to be considered a valid input to this transform
tgraph the invoking TGraph() object. This is optional and you can pass None, but some inefficiencies might go undetected due to the lack of caching
apply_all use only apply_all=True. The transformer enumerates all features (or feature sets) that match the specified criteria and applies the provided function to each
col_names names of the feature columns in a list
col_dtypes list of the datatypes of the feature columns
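The effect of a unary stateless transform can be sketched with plain numpy (illustrative only; the hypothetical fun=np.log1p and the append-to-X behavior are assumptions, not the library's implementation):

```python
import numpy as np

# A unary stateless function, e.g. TA1(fun=np.log1p, name='log1p', ...),
# applied to every matching column, with the results appended as new features
X = np.array([[1.0, 10.0],
              [3.0, 100.0]])
new_features = np.log1p(X)          # fun applied column-wise
X_out = np.hstack([X, new_features])
```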

 

autoai_libs.cognito.transforms.transform_utils.TA2()

For binary stateless functions, such as sum or product, use TA2.

Usage:

autoai_libs.cognito.transforms.transform_utils.TA2(fun, name, datatypes1, feat_constraints1, datatypes2, feat_constraints2, tgraph=None, apply_all=True, col_names=None, col_dtypes=None)
Option Description
fun the function pointer
name a string name that uniquely identifies this transformer from others
datatypes1 a list of datatypes either of which are valid inputs (first parameter) to the transformer function (numeric, float, int, etc.)
feat_constraints1 all constraints which must be satisfied by a column to be considered a valid input (first parameter) to this transform
datatypes2 a list of datatypes either of which are valid inputs (second parameter) to the transformer function (numeric, float, int, etc.)
feat_constraints2 all constraints which must be satisfied by a column to be considered a valid input (second parameter) to this transform
tgraph the invoking TGraph() object. This is optional and you can pass None, but some inefficiencies might go undetected due to the lack of caching
apply_all use only apply_all=True. The transformer enumerates all features (or feature sets) that match the specified criteria and applies the provided function to each
col_names names of the feature columns in a list
col_dtypes list of the datatypes of the feature columns
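A binary stateless transform enumerates pairs of matching columns. A numpy sketch (illustrative only; the hypothetical fun=np.add and the append behavior are assumptions, not the library's implementation):

```python
import numpy as np
from itertools import combinations

# A binary stateless function, e.g. TA2(fun=np.add, name='sum', ...),
# applied to every pair of numeric columns
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
pair_sums = np.column_stack([X[:, i] + X[:, j]
                             for i, j in combinations(range(X.shape[1]), 2)])
X_out = np.hstack([X, pair_sums])
```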

 

autoai_libs.cognito.transforms.transform_utils.TB1()

For unary state-based transformations (with fit and transform), such as frequent count, use TB1.

Usage:

autoai_libs.cognito.transforms.transform_utils.TB1(tans_class, name, datatypes, feat_constraints, tgraph=None, apply_all=True, col_names=None, col_dtypes=None)
Option Description
tans_class a class that implements fit() and transform() in accordance with the transformation function definition
name a string name that uniquely identifies this transformer from others
datatypes list of datatypes either of which are valid input to the transformer function (numeric, float, int, etc.)
feat_constraints all constraints which must be satisfied by a column to be considered a valid input to this transform
tgraph the invoking TGraph() object. This is optional and you can pass None, but some inefficiencies might go undetected due to the lack of caching
apply_all use only apply_all=True. The transformer enumerates all features (or feature sets) that match the specified criteria and applies the provided function to each
col_names names of the feature columns in a list.
col_dtypes list of the datatypes of the feature columns.

 

autoai_libs.cognito.transforms.transform_utils.TB2()

For binary state-based transformations (with fit and transform), such as group-by, use TB2.

Usage:

autoai_libs.cognito.transforms.transform_utils.TB2(tans_class, name, datatypes1, feat_constraints1, datatypes2, feat_constraints2, tgraph=None, apply_all=True)
Option Description
tans_class a class that implements fit() and transform() in accordance with the transformation function definition
name a string name that uniquely identifies this transformer from others
datatypes1 a list of datatypes either of which are valid inputs (first parameter) to the transformer function (numeric, float, int, etc.)
feat_constraints1 all constraints which must be satisfied by a column to be considered a valid input (first parameter) to this transform
datatypes2 a list of datatypes either of which are valid inputs (second parameter) to the transformer function (numeric, float, int, etc.)
feat_constraints2 all constraints which must be satisfied by a column to be considered a valid input (second parameter) to this transform
tgraph the invoking TGraph() object. This is optional and you can pass None, but some inefficiencies might go undetected due to the lack of caching
apply_all use only apply_all=True. The transformer enumerates all features (or feature sets) that match the specified criteria and applies the provided function to each

 

autoai_libs.cognito.transforms.transform_utils.TAM()

For a transform that applies at the data level, such as PCA, use TAM.

Usage:

autoai_libs.cognito.transforms.transform_utils.TAM(tans_class, name, tgraph=None, apply_all=True, col_names=None, col_dtypes=None)
Option Description
tans_class a class that implements fit() and transform() in accordance with the transformation function definition
name a string name that uniquely identifies this transformer from others
tgraph the invoking TGraph() object. This is optional and you can pass None, but some inefficiencies might go undetected due to the lack of caching
apply_all use only apply_all=True. The transformer enumerates all features (or feature sets) that match the specified criteria and applies the provided function to each
col_names names of the feature columns in a list
col_dtypes list of the datatypes of the feature columns
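Any class with fit() and transform() qualifies as a tans_class, so the PCA case can be sketched with sklearn directly (illustrative; the wrapper adds AutoAI-specific plumbing around this):

```python
import numpy as np
from sklearn.decomposition import PCA

# A data-level transform as wrapped by TAM(tans_class=PCA(n_components=2),
# name='pca', ...): the class only needs fit() and transform()
X = np.random.RandomState(0).rand(10, 4)
pca = PCA(n_components=2)
X_out = pca.fit_transform(X)  # 4 input features reduced to 2 components
```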

 

autoai_libs.cognito.transforms.transform_utils.TGen()

TGen is a general wrapper that can be used for most functions (though it might not be the most efficient choice).

Usage:

autoai_libs.cognito.transforms.transform_utils.TGen(fun, name, arg_count, datatypes_list, feat_constraints_list, tgraph=None, apply_all=True, col_names=None, col_dtypes=None)
Option Description
fun the function pointer
name a string name that uniquely identifies this transformer from others
arg_count number of inputs to the function: 1 for unary, 2 for binary, and so on
datatypes_list a list of arg_count lists that correspond to the acceptable input data types for each parameter. For example, with arg_count=1 there is one list within the outer list; it might contain a single generic type such as 'numeric', a more specific type such as 'int', an even more specific one such as 'int64', or several of these
feat_constraints_list a list of arg_count lists that correspond to some constraints that should be imposed on selection of the input features
tgraph the invoking TGraph() object. This is optional and you can pass None, but some inefficiencies might go undetected due to the lack of caching
apply_all use only apply_all=True. The transformer enumerates all features (or feature sets) that match the specified criteria and applies the provided function to each
col_names names of the feature columns in a list
col_dtypes list of the datatypes of the feature columns

 

autoai_libs.cognito.transforms.transform_utils.FS1()

Feature selection, type 1 (using pairwise correlation between each feature and the target).

Usage:

autoai_libs.cognito.transforms.transform_utils.FS1(cols_ids_must_keep, additional_col_count_to_keep, ptype)
Option Description
cols_ids_must_keep serial numbers of the columns that must be kept irrespective of their feature importance
additional_col_count_to_keep how many columns need to be retained
ptype classification or regression
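The correlation-based selection idea can be sketched with numpy (illustrative only, not the library's implementation): rank columns by absolute Pearson correlation with the target, then keep the must-keep columns plus the top-ranked extras.

```python
import numpy as np

X = np.array([[1.0, 5.0, 0.2],
              [2.0, 3.0, 0.9],
              [3.0, 6.0, 0.1],
              [4.0, 2.0, 0.7]])
y = np.array([1.0, 2.0, 3.0, 4.0])

cols_ids_must_keep = [2]            # always retained
additional_col_count_to_keep = 1    # extras chosen by correlation with y

corrs = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                  for j in range(X.shape[1])])
candidates = [j for j in np.argsort(-corrs) if j not in cols_ids_must_keep]
keep = sorted(set(cols_ids_must_keep)
              | set(candidates[:additional_col_count_to_keep]))
X_out = X[:, keep]
```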

 

autoai_libs.cognito.transforms.transform_utils.FS2()

Feature selection, type 2.

Usage:

autoai_libs.cognito.transforms.transform_utils.FS2(cols_ids_must_keep, additional_col_count_to_keep, ptype, eval_algo)
Option Description
cols_ids_must_keep serial numbers of the columns that must be kept irrespective of their feature importance
additional_col_count_to_keep how many columns need to be retained
ptype classification or regression