Data skipping configuration options
The following configuration options are available:
Key | Default setting | Description |
---|---|---|
spark.ibm.metaindex.evaluation.enabled | false | When set to true, queries run in evaluation mode: the indexed data sets are processed only to collect skipping statistics and no data is read. Evaluation mode is useful for seeing the skipping statistics you would get for a given query. |
spark.ibm.metaindex.index.bloom.fpp | 0.01 | The false positive rate of the bloom filter. |
spark.ibm.metaindex.index.minmax.readoptimized.parquet | true | If enabled, min/max statistics for numerical columns (excluding decimal type) in Parquet objects are read from the object footers. When only min/max indexes are collected, index collection is further optimized by running with higher parallelism. |
spark.ibm.metaindex.index.minmax.readoptimized.parquet.parallelism | 10000 | The parallelism to use when collecting only min/max indexes on Parquet files. |
spark.ibm.metaindex.parquet.mdlocation.type | EXPLICIT_BASE_PATH_LOCATION | The type of URL stored in the metadata location. |
spark.ibm.metaindex.parquet.mdlocation | /tmp/dataskipping_metadata | The metadata location (interpreted according to the URL type). |
spark.ibm.metaindex.parquet.index.chunksize | 25000 | The number of objects to index in each chunk. |
spark.ibm.metaindex.parquet.maxRecordsPerFile | 100000 | The maximum number of records per metadata file. |
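For example, evaluation mode and the bloom filter false positive rate can be set JVM wide with the MetaIndexManager.setConf call shown later in this section. The following Scala snippet is a minimal sketch; the import path is an assumption based on the package names used elsewhere on this page:

```scala
// Minimal sketch: set data skipping options JVM wide via MetaIndexManager.setConf
// (the same call used in the configuration examples below).
// The import path is an assumption based on the package names on this page.
import com.ibm.metaindex.MetaIndexManager

val conf = new java.util.HashMap[String, String]()
conf.put("spark.ibm.metaindex.evaluation.enabled", "true") // collect skipping statistics only; read no data
conf.put("spark.ibm.metaindex.index.bloom.fpp", "0.001")   // tighten the bloom filter false positive rate (default 0.01)
MetaIndexManager.setConf(conf)
```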
Types of metadata location
The parameter spark.ibm.metaindex.parquet.mdlocation is interpreted according to the URL type defined in the parameter spark.ibm.metaindex.parquet.mdlocation.type. The following options are available:

- EXPLICIT_BASE_PATH_LOCATION: This is the default. An explicit definition of the base path to the metadata, which is combined with a data set identifier. Use this option to configure the MetaIndexManager settings JVM wide and have the metadata of all data sets saved under the base path.
- EXPLICIT_LOCATION: An explicit full path to the metadata.

You can set spark.ibm.metaindex.parquet.mdlocation in two ways:
- By setting a JVM wide configuration

  This configuration is useful for setting the base location once for all data sets and should be used with the EXPLICIT_BASE_PATH_LOCATION type. The location of the metadata for each data set is inferred automatically by combining the base path with a data set identifier.

  For Scala:

  ```scala
  val jmap = new java.util.HashMap[String, String]()
  jmap.put("spark.ibm.metaindex.parquet.mdlocation", "/path/to/base/metadata/location")
  jmap.put("spark.ibm.metaindex.parquet.mdlocation.type", "EXPLICIT_BASE_PATH_LOCATION")
  MetaIndexManager.setConf(jmap)
  ```

  For Python:

  ```python
  md_backend_config = dict([
      ('spark.ibm.metaindex.parquet.mdlocation', "/path/to/base/metadata/location"),
      ("spark.ibm.metaindex.parquet.mdlocation.type", "EXPLICIT_BASE_PATH_LOCATION")])
  MetaIndexManager.setConf(spark, md_backend_config)
  ```
- By setting the metadata location for a specific MetaIndexManager instance

  This configuration is useful for setting a specific metadata location for a certain data set and should be used with the EXPLICIT_LOCATION type.

  For Scala:

  ```scala
  val im = new MetaIndexManager(spark, uri, ParquetMetadataBackend)
  val jmap = new java.util.HashMap[String, String]()
  jmap.put("spark.ibm.metaindex.parquet.mdlocation", "/exact/path/to/metadata/location")
  jmap.put("spark.ibm.metaindex.parquet.mdlocation.type", "EXPLICIT_LOCATION")
  im.setMetadataStoreParams(jmap)
  ```

  For Python:

  ```python
  im = MetaIndexManager(spark, uri, 'com.ibm.metaindex.metadata.metadatastore.parquet.ParquetMetadataBackend')
  md_backend_config = dict([
      ('spark.ibm.metaindex.parquet.mdlocation', "/exact/path/to/metadata/location"),
      ("spark.ibm.metaindex.parquet.mdlocation.type", "EXPLICIT_LOCATION")])
  im.setMetadataStoreParameters(md_backend_config)
  ```
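After the metadata location is configured, the MetaIndexManager instance is used to build the indexes themselves. The index-building API is not shown in this section, so the following Scala sketch is an assumption modeled on the instance usage above; the builder method names and the column names are illustrative:

```scala
// Hypothetical sketch: building indexes with a configured MetaIndexManager
// instance. The indexBuilder/addMinMaxIndex/addBloomFilterIndex calls are
// assumptions not shown in this section; consult the MetaIndexManager API
// for the exact names.
import com.ibm.metaindex.MetaIndexManager
import com.ibm.metaindex.metadata.metadatastore.parquet.ParquetMetadataBackend

val im = new MetaIndexManager(spark, uri, ParquetMetadataBackend) // as configured above
im.indexBuilder()
  .addMinMaxIndex("temp")      // min/max index on a numerical column
  .addBloomFilterIndex("city") // bloom filter index on a high-cardinality column
  .build(spark.read.format("parquet"))
```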
Limitations
Data skipping might not work if type casting is used in the WHERE clause. For example, given a min/max index on a column of short data type, the following query will not benefit from data skipping:

```sql
select * from table where shortType > 1
```

Spark evaluates this expression as (cast(shortType#3 as int) > 1) because the constant 1 is of type Integer.
Note that in some cases Spark can automatically cast the literal to the right type. For example, the previous query works for all other numerical types except for the byte type, which would require casting as well.

To benefit from data skipping in such cases, make sure that the literal has the same type as the column, for example:

```sql
select * from table where shortType > cast(1 as short)
```
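To verify whether Spark injects a cast into a given filter (and hence whether skipping can apply), you can inspect the query plan with Spark's standard explain facility. A minimal sketch; the table and column names are taken from the example above:

```scala
// Minimal sketch: inspect the plan to spot an injected cast.
// "table" and "shortType" are the illustrative names from the example above.
val df = spark.sql("select * from table where shortType > 1")
df.explain(true)
// The plan shows a filter like (cast(shortType as int) > 1); the cast
// prevents min/max skipping on the short column.

// Rewriting the literal with an explicit cast keeps the comparison in short:
val skippable = spark.sql("select * from table where shortType > cast(1 as short)")
skippable.explain(true)
```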