Use spatial indexing to query spatial data

Use spatial indexing to query spatial data¶

This notebook shows you how to work with the spatio-temporal library that is pre-installed on all Spark environments in Watson Studio. The spatio-temporal library supports spatial indexing functions which you will learn to use in this notebook.

The spatial indexing functions enable efficient access to simple geometric objects such as points, lines and polygons in spatial databases. These functions greatly enhance time-critical search of spatial data.

Much of the spatial indexing functionality in spatial databases like Db2 Spatial, Oracle Spatial, or Microsoft SQL Server Spatial is not accessible. However, with the spatial indexing support in the spatio-temporal library you can now index and query arbitrary geometries. The spatial indices support various queries, which you will learn to use in this notebook, including:

containing
within_distance
intersects

The time taken to search for geometries in a database can decrease significantly depending on the size and complexity of the geometries. For example, you can query which ZIP code a polygon belongs to by entering a single point. A sequential search across the polygons would be very slow and expensive; however searching a spatial index of all ZIP code polygons for a country answers this query significantly faster.

This notebook runs on Python with Spark.

Create a spatial context¶

Before you can start using the spatial indexing functions in the spatio-temporal library in your notebook, you must register STContext to access the st functions.

To register STContext:

In [ ]:

from pyst import STContext
# Register STContext, which is the main entry point
stc = STContext(spark.sparkContext._gateway)

Read sample data¶

In this notebook, you will use sample data listing US county boundaries.

You will use the geojson_reader function to read a GeoJSON file directly from a publically accessible URL into a Pandas dataframe.

In [ ]:

import requests

url = 'https://api.dataplatform.cloud.ibm.com/v2/gallery-assets/entries/1ec43d48a694c6c1d052ddca4d68bdc4/data?accessKey=1ec43d48a694c6c1d052ddca4d69054c'
r = requests.get(url, allow_redirects=True)
open('us_county.geojson', 'wb').write(r.content)


county_df = stc.geojson_reader().read('us_county.geojson')
county_df.head(3)

Create a spatial index¶

There are several options available to choose from for spatial indexing, which include grid_index, r_star_tree_index and tessellation_index. Click the following links for a quick introduction about each of these indexes:

Note: this tessellation index is not a standard tessellation index. Instead, it is similar to a grid index, only in this case the grid is uniform with respect to the size (in meters) of the grid, as opposed to a typical grid index that is uniform with respect to the number of latitude/longitude divisions.

In this notebook, you will use the tessellation index. To create a tessellation spatial index, you need to set the following two parameters:

Bounding box: Defines the boundary of the spatial index. If you know exactly where your geometries are and can define a boundary that contains all these geometries, you should provide his bounding box information to the method because it will reduce the amount of "tiles" that need to be created and thus increase performance. However, if you don't know much about the geometries or you want to play safe and not exclude any geometry, that might potentially fall out of the given bounding box (both are very common situations), you can use the whole earth as the bounding box by simply leave the bbox parameter parameter value set to None, which is the default value.
Tile size: Tile size is the size of the tile in the tessellation index. The value is given by the length of the tile in the unit of meters. You should provide a tile size that is close to the size of your geometries for better performance. For example, if your geometries are $ 100km^2 (i.e. 10^8 m^2)$ then $ 10^4m$ could be a good value for tile size.

After deciding the bounding box and tile size, you can create the spatial index and import your geometries into the spatial index. For this you will use the from_df method which puts the geometries in a pandas DataFrame into the spatial index. You only need to specify the name of the geometry ID column and the name of the geometry column. The extra parameter called verbosity, which controls how many processing logs to print out, can be set to error which allows only summary and failure entries to display.

In [ ]:

tile_size = 100000 #in meters
si = stc.tessellation_index(tile_size=tile_size) # we leave bbox as None to use full earth as boundingbox
si.from_df(county_df, 'NAME', 'geometry', verbosity='error') #Populate the spatial index

Query the spatial index¶

You can use the following APIs to query the spatial index:

contained_in
contained_in_with_info
containing
containing_with_info
intersects
intersects_with_info
within_distance
within_distance_with_info
nearest_neighbors
nearest_neighbors_with_info

Here a a few questions that you can get answers for.

Which county does White Plains Hospital belong to? In other words, which county polygon contains the point location for White Plains Hospital?

In [ ]:

white_plains_hospital = stc.point(41.026132, -73.769585)
si.containing(white_plains_hospital)

Which county does the city White Plains belong to? (i.e. which county polygon contains the polygon of White Plains?)

In [ ]:

white_plains_WKT = 'POLYGON((-73.792 41.024,-73.794 41.031,-73.779 41.046,-73.78 41.049,-73.779 41.052,-73.776 41.054,-73.775 41.057,-73.767 41.058,-73.769 41.062,-73.768 41.067,-73.762 41.073,-73.759 41.074,-73.748 41.069,-73.746 41.056,-73.742 41.056,-73.74 41.053,-73.74 41.049,-73.749 41.04,-73.748 41.035,-73.739 41.034,-73.729 41.029,-73.725 41.025,-73.72 41.016,-73.717 41.015,-73.716 41.006,-73.718 41.002,-73.732 40.988,-73.732 40.985,-73.739 40.979,-73.745 40.978,-73.749 40.981,-73.749 40.986,-73.751 40.986,-73.756 40.991,-73.759 40.991,-73.76 40.993,-73.765 40.994,-73.769 40.997,-73.774 41.002,-73.775 41.006,-73.788 41.018,-73.792 41.024))'
wkt_reader = stc.wkt_reader()
white_plains = wkt_reader.read(white_plains_WKT)
si.containing(white_plains)

Which are the 3 nearest counties to White Plains Hospital?

In [ ]:

counties = si.nearest_neighbors(white_plains_hospital, 3)
counties

Which are the counties within 20 km of White Plains Hospital?

In [ ]:

counties = si.within_distance(white_plains_hospital, 20000)
counties

Summary¶

This notebook showed you a simple spatial indexing example. You learnt how to create a spatial context, read some sample geometric data, create an index, and then query the index.

Author¶

Linsong Chu, Research Engineer at IBM Research

Love this notebook? Don't have an account yet?
Share it with your colleagues and help them discover the power of Watson Studio! Sign Up