Data skipping configuration options

The following configuration options are available:

| Key | Default setting | Description |
| --- | --- | --- |
| `spark.ibm.metaindex.evaluation.enabled` | `false` | When set to `true`, queries run in evaluation mode: the indexed data sets are processed only to collect skipping statistics, and no data is read. Evaluation mode is useful for seeing the skipping statistics you might get for a given query. |
| `spark.ibm.metaindex.index.bloom.fpp` | `0.01` | The false-positive rate of the bloom filter. |
| `spark.ibm.metaindex.index.minmax.readoptimized.parquet` | `true` | If enabled, min/max statistics for Parquet objects on numerical columns (excluding the decimal type) are read from the footers. When only min/max indexes are collected, index collection is further optimized by running with higher parallelism. |
| `spark.ibm.metaindex.index.minmax.readoptimized.parquet.parallelism` | `10000` | The parallelism to use when collecting only min/max indexes on Parquet files. |
| `spark.ibm.metaindex.parquet.mdlocation.type` | `EXPLICIT_BASE_PATH_LOCATION` | The type of URL stored in the metadata location parameter. |
| `spark.ibm.metaindex.parquet.mdlocation` | `/tmp/dataskipping_metadata` | The metadata location, interpreted according to the URL type. |
| `spark.ibm.metaindex.parquet.index.chunksize` | `25000` | The number of objects to index in each chunk. |
| `spark.ibm.metaindex.parquet.maxRecordsPerFile` | `100000` | The maximum number of records per metadata file. |
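
To illustrate what a min/max index buys you, here is a generic sketch of the skipping decision (an illustration of the general technique, not MetaIndexManager's actual implementation): for a predicate such as `col > v`, an object can be skipped whenever its recorded maximum for `col` does not exceed `v`.

```python
# Generic illustration of min/max data skipping (not the library's actual
# logic): each object carries per-column min/max statistics, and an object
# is read only if the predicate could possibly match a row in it.

def may_contain_gt(stats, column, literal):
    """True if an object with these stats might hold rows where column > literal."""
    col_min, col_max = stats[column]
    return col_max > literal

# Hypothetical per-object metadata: {column: (min, max)}
objects = {
    "part-0001.parquet": {"temp": (1, 5)},
    "part-0002.parquet": {"temp": (7, 12)},
    "part-0003.parquet": {"temp": (0, 2)},
}

# For the predicate `temp > 5`, only objects whose max exceeds 5 are read.
to_read = [name for name, stats in objects.items()
           if may_contain_gt(stats, "temp", 5)]
print(to_read)  # only part-0002.parquet survives the skipping decision
```

In evaluation mode, only this metadata pass runs, which is why the skipping statistics can be reported without reading any data.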

Types of metadata location

The value of spark.ibm.metaindex.parquet.mdlocation is interpreted according to the URL type defined in spark.ibm.metaindex.parquet.mdlocation.type.

The following options are available:

  • EXPLICIT_BASE_PATH_LOCATION: This is the default. An explicit base path to the metadata, which is combined with a data set identifier. Use this type to configure MetaIndexManager JVM-wide and have the metadata for all data sets saved under the base path.
  • EXPLICIT_LOCATION: An explicit full path to the metadata.

You can set spark.ibm.metaindex.parquet.mdlocation in two ways:

  • By setting a JVM-wide configuration

    This configuration is useful for setting the base location once for all data sets and should be used with the EXPLICIT_BASE_PATH_LOCATION type.

    The location of the metadata for each data set will be inferred automatically by combining the base path with a data set identifier.

    • For Scala:
        val jmap = new java.util.HashMap[String,String]()
        jmap.put("spark.ibm.metaindex.parquet.mdlocation", "/path/to/base/metadata/location")
        jmap.put("spark.ibm.metaindex.parquet.mdlocation.type", "EXPLICIT_BASE_PATH_LOCATION")
        MetaIndexManager.setConf(jmap)
      
    • For Python:
        md_backend_config = dict([
            ('spark.ibm.metaindex.parquet.mdlocation', '/path/to/base/metadata/location'),
            ('spark.ibm.metaindex.parquet.mdlocation.type', 'EXPLICIT_BASE_PATH_LOCATION')])
        MetaIndexManager.setConf(spark, md_backend_config)
      
  • By setting the metadata location for a specific MetaIndexManager instance

    This configuration is useful for setting a specific metadata location for a certain data set and should be used with the EXPLICIT_LOCATION type.

    • For Scala:
      val im = new MetaIndexManager(spark, uri, ParquetMetadataBackend)
      val jmap = new java.util.HashMap[String, String]()
      jmap.put("spark.ibm.metaindex.parquet.mdlocation", "/exact/path/to/metadata/location")
      jmap.put("spark.ibm.metaindex.parquet.mdlocation.type", "EXPLICIT_LOCATION")
      im.setMetadataStoreParams(jmap)
      
    • For Python:
      im = MetaIndexManager(spark, uri, 'com.ibm.metaindex.metadata.metadatastore.parquet.ParquetMetadataBackend')
      md_backend_config = dict([
          ('spark.ibm.metaindex.parquet.mdlocation', '/exact/path/to/metadata/location'),
          ('spark.ibm.metaindex.parquet.mdlocation.type', 'EXPLICIT_LOCATION')])
      im.setMetadataStoreParameters(md_backend_config)
      

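The automatic inference used with EXPLICIT_BASE_PATH_LOCATION can be pictured roughly as follows. This is a hedged sketch: the combination rule and the hash-based data set identifier below are assumptions for illustration, not the library's actual scheme.

```python
# Rough sketch of EXPLICIT_BASE_PATH_LOCATION-style inference: the metadata
# location for a data set is derived from the JVM-wide base path plus a
# per-data-set identifier. The identifier scheme here (a hash of the data
# set URI) is hypothetical -- the real library defines its own.
import hashlib
import posixpath

def infer_metadata_location(base_path, dataset_uri):
    dataset_id = hashlib.md5(dataset_uri.encode("utf-8")).hexdigest()
    return posixpath.join(base_path, dataset_id)

loc = infer_metadata_location("/path/to/base/metadata/location",
                              "cos://bucket/path/to/dataset")
print(loc)
```

The point of the scheme is that the base path is set once, while each data set still gets its own stable, distinct metadata location; EXPLICIT_LOCATION bypasses this inference entirely.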
Limitations

Data skipping might not work if type casting is used in the WHERE clause. For example, given a min/max index on a column with a short data type, the following query will not benefit from data skipping:

select * from table where shortType > 1

Spark will evaluate this expression as (cast(shortType#3 as int) > 1) because the constant 1 is of type Integer.

Note that in some cases Spark can automatically cast the literal to the right type. For example, the previous query works for all other numerical types except for byte type, which would require casting as well.

To benefit from data skipping in such cases, make sure that the literal has the same type as the column type, for example:

select * from table where shortType > cast(1 as short)
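
The pitfall can be sketched with a toy predicate matcher (an illustration of the general mechanism, not Spark's or MetaIndexManager's internals): skipping applies only when the predicate references the bare indexed column, so a literal of the wrong type, which forces Spark to wrap the column in a cast, prevents the match.

```python
# Toy illustration of why implicit casting defeats min/max skipping (not
# Spark's actual logic): the index is keyed by the bare column name, so a
# predicate whose left side is a cast expression rather than the column
# itself no longer matches the index.

def skippable(predicate, indexed_columns):
    """predicate is a tuple (op, left, literal); left is either a column
    name or a ("cast", column, type) expression."""
    op, left, literal = predicate
    return isinstance(left, str) and left in indexed_columns

indexed = {"shortType"}

# shortType > cast(1 as short): the column is referenced directly.
print(skippable((">", "shortType", 1), indexed))                   # True

# cast(shortType as int) > 1: Spark's implicit cast hides the column.
print(skippable((">", ("cast", "shortType", "int"), 1), indexed))  # False
```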