Managing job performance

Last updated: Jan 12, 2024

Masking flows, which copy and mask data, can be implemented by using Spark. However, Spark has various characteristics and limitations that affect masking flow job performance and can result in job failures.

Managing job failures

During a masking flow job, Spark might attempt to read an entire data source into memory, so the job can fail with out-of-memory errors. The largest volume of data that can fit into the largest deployed Spark processing node is approximately 12 GB. For a masking flow to copy and mask more than this volume of data, the Spark engine must be able to partition the data into smaller chunks. Spark can then process each chunk separately and in parallel with other chunks.

For some data sources, such as Parquet files, Spark can partition the data automatically. However, masking flows do not currently support any of these data sources.

An alternative is to specify an index column when you create a masking flow so that Spark can partition the data. For the relational data sources that are currently supported, the first step is to find or create an index column that divides the data into roughly equal portions. The next step is to provide the name of that index column so that Spark can create queries that use the index. The masking flow automatically discovers and provides the other values that Spark requires: the minimum and maximum values of the column, and the number of partitions to use.
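To make the role of the minimum value, maximum value, and partition count more concrete, the following is a minimal sketch of how a numeric index column could be split into one range query per partition. It is an illustration only, not the query generation that Spark or the masking flow actually performs; the function name partition_predicates and the sample bounds are hypothetical.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Return one SQL WHERE clause per partition for an integer index column.

    Hypothetical helper for illustration; it only mimics the general idea of
    range partitioning on a column with known minimum and maximum values.
    """
    if upper < lower:
        raise ValueError("upper must be >= lower")
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")

    span = upper - lower + 1
    # Never create more partitions than there are distinct possible values.
    n = min(num_partitions, span)
    if n == 1:
        return ["1 = 1"]  # a single partition reads every row

    # Interior cut points that divide [lower, upper] into n near-equal ranges.
    cuts = [lower + span * i // n for i in range(1, n)]

    # The first range is open-ended below and also collects NULL values.
    predicates = [f"{column} < {cuts[0]} OR {column} IS NULL"]
    for lo, hi in zip(cuts, cuts[1:]):
        predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    # The last range is open-ended above.
    predicates.append(f"{column} >= {cuts[-1]}")
    return predicates


if __name__ == "__main__":
    # Assumed example: Customer_ID values from 1 to 1,500,000, four partitions.
    for clause in partition_predicates("Customer_ID", 1, 1_500_000, 4):
        print(clause)
```

Each clause selects a disjoint slice of the table, so the slices can be read and masked independently. The slices contain similar numbers of rows only if the column values are spread evenly between the minimum and the maximum.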

Best practices

For best results, the data in the partitioning column should be evenly distributed. The purpose of a partition column is to allow Spark to divide the data into pieces that are small enough to process. For example, 10 TB of data divided into 50 buckets still yields chunks of 0.2 TB (200 GB) each, which is far more than the approximately 12 GB that fits on a single processing node. In general, the greater the number of buckets, the smaller each chunk and the faster the processing.
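The arithmetic behind the bucket example can be sketched as follows. The sketch assumes decimal units (1 TB = 1,000 GB), perfectly even distribution, and that the size of the stored data approximates the memory that Spark needs, which is a simplification. The function names are hypothetical.

```python
import math

# Approximate volume of data that fits on the largest processing node,
# as stated earlier in this topic.
MAX_CHUNK_GB = 12


def chunk_size_gb(total_gb, num_partitions):
    """Size of each chunk when the data is split evenly."""
    return total_gb / num_partitions


def min_partitions(total_gb, max_chunk_gb=MAX_CHUNK_GB):
    """Smallest partition count that keeps every chunk within the limit."""
    return math.ceil(total_gb / max_chunk_gb)


if __name__ == "__main__":
    total_gb = 10_000  # 10 TB, assuming 1 TB = 1,000 GB
    print(chunk_size_gb(total_gb, 50))   # chunk size with 50 buckets
    print(min_partitions(total_gb))      # buckets needed to stay within 12 GB
```

Under these assumptions, 50 buckets leave each chunk at 200 GB, and roughly 834 buckets are needed before a chunk fits within 12 GB. Skewed data needs more buckets, because the largest chunk determines whether the job fits in memory.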

Choose a partition column whose values are evenly distributed over a large range. Identifier columns are often good choices. For example, a Customer_ID column in a Customers table that contains 1.5 million unique IDs is a good choice. A state column is a poor choice because it has only 50 unique values.
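One way to compare candidate columns before you create the masking flow is to estimate how evenly their values would fall into equal-width ranges. The following sketch uses synthetic data and hypothetical helper names; it assumes integer column values and equal-width bucketing, which might differ from how the data is actually partitioned.

```python
from collections import Counter


def bucket_counts(values, num_partitions):
    """Count how many values land in each equal-width range.

    `values` must be a reusable sequence of integers (a list or range).
    """
    lo, hi = min(values), max(values)
    span = hi - lo + 1
    counts = Counter((v - lo) * num_partitions // span for v in values)
    return [counts.get(i, 0) for i in range(num_partitions)]


def skew(values, num_partitions):
    """Ratio of the largest bucket to the ideal bucket size (1.0 is perfect)."""
    counts = bucket_counts(values, num_partitions)
    ideal = len(values) / num_partitions
    return max(counts) / ideal


if __name__ == "__main__":
    # Synthetic stand-in for a Customer_ID column with 1.5 million unique IDs.
    customer_ids = range(1, 1_500_001)
    # Synthetic stand-in for a state column: 50 codes, 1,000 rows per code.
    state_codes = [code for code in range(1, 51) for _ in range(1_000)]

    print(skew(customer_ids, 100))
    print(skew(state_codes, 100))
    print(bucket_counts(state_codes, 100).count(0))  # number of empty buckets
```

With 100 requested partitions, the synthetic identifier column fills every bucket equally, while the 50-value column leaves half of the buckets empty and doubles the size of the rest. A column with few distinct values caps the number of useful partitions no matter how many are requested.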