Data skipping library for notebooks

Data skipping can significantly boost the performance of SQL queries by skipping over irrelevant data based on summary metadata associated with each data object. Data skipping is a performance optimization feature which means that using data skipping does not affect the content of the query results.

The data skipping library is pre-installed on all Spark environments in Watson Studio.

All Spark native data formats are supported, including Parquet, ORC, CSV, JSON and Avro. When an SQL query is triggered, the summary metadata is scanned and all objects whose metadata does not overlap with the given predicate in the query are skipped.

To use data skipping, you need to create indexes on one or more columns of the data set. After this is done, Spark SQL queries can benefit from data skipping. In general, you should index the columns which are queried most often in the WHERE clause.

Three index types are supported:

Index type Description Applicable to predicates in WHERE clauses Column types
MinMax Stores minimum and maximum values for a column <,<=,=,>=,> All types except for complex types. See Supported Spark SQL data types.
ValueList Stores the list of unique values for the column =,IN,LIKE All types except for complex types. See Supported Spark SQL data types.
BloomFilter Uses a bloom filter to test for set membership =,IN Byte, String, Long, Integer, Short

You should use bloom filters for columns with very high cardinality. Index creation invokes a Spark job which writes metadata (indexes) to a user specified location, in Parquet format.

Note that the metadata is typically much smaller in volume than the data. If changes are made to some of the objects in the data set after the index was created, data skipping will still work correctly but will not skip data in the changed objects. To avoid this, you should refresh the indexes.

Geospatial data skipping

Data skipping is also supported in spatial queries on geospatial data sets with latitude and longitude columns. To benefit from data skipping, you can collect min/max indexes on the latitude and longitude columns. See the Geospatio-temporal library for notebooks for how to use spatial indexing functions.

The list of supported geospatial functions includes the following:

  • ST_Distance
  • ST_Intersects
  • ST_Contains
  • ST_Equals
  • ST_Crosses
  • ST_Touches
  • ST_Within
  • ST_Overlaps
  • ST_EnvelopesIntersect
  • ST_IntersectsInterior

Learn more

Check out the following sample notebooks to learn how to use the data skipping library: