This notebook shows you how to work with the spatio-temporal library that is pre-installed on all Spark environments in Watson Studio. The spatio-temporal library supports spatial indexing functions which you will learn to use in this notebook.
The spatial indexing functions enable efficient access to simple geometric objects such as points, lines and polygons in spatial databases. These functions greatly enhance time-critical search of spatial data.
Much of the spatial indexing functionality in spatial databases like Db2 Spatial, Oracle Spatial, or Microsoft SQL Server Spatial is not accessible. However, with the spatial indexing support in the spatio-temporal library you can now index and query arbitrary geometries. The spatial indices support various queries, which you will learn to use in this notebook, including:
containing
within_distance
intersects
The time taken to search for geometries in a database can decrease significantly depending on the size and complexity of the geometries. For example, you can query which ZIP code a polygon belongs to by entering a single point. A sequential search across the polygons would be very slow and expensive; however searching a spatial index of all ZIP code polygons for a country answers this query significantly faster.
This notebook runs on Python with Spark.
Before you can start using the spatial indexing functions in the spatio-temporal library in your notebook, you must register STContext
to access the st
functions.
To register STContext
:
from pyst import STContext
# Register STContext, which is the main entry point
stc = STContext(spark.sparkContext._gateway)
In this notebook, you will use sample data listing US county boundaries.
You will use the geojson_reader
function to read a GeoJSON file directly from a publically accessible URL into a Pandas
dataframe.
import requests
url = 'https://api.dataplatform.cloud.ibm.com/v2/gallery-assets/entries/1ec43d48a694c6c1d052ddca4d68bdc4/data?accessKey=1ec43d48a694c6c1d052ddca4d69054c'
r = requests.get(url, allow_redirects=True)
open('us_county.geojson', 'wb').write(r.content)
county_df = stc.geojson_reader().read('us_county.geojson')
county_df.head(3)
There are several options available to choose from for spatial indexing, which include grid_index, r_star_tree_index and tessellation_index. Click the following links for a quick introduction about each of these indexes:
Note: this tessellation index is not a standard tessellation index. Instead, it is similar to a grid index, only in this case the grid is uniform with respect to the size (in meters) of the grid, as opposed to a typical grid index that is uniform with respect to the number of latitude/longitude divisions.
In this notebook, you will use the tessellation index. To create a tessellation spatial index, you need to set the following two parameters:
bbox parameter
parameter value set to None
, which is the default value.meters
. You should provide a tile size that is close to the size of your geometries for better performance. For example, if your geometries are $ 100km^2 (i.e. 10^8 m^2)$ then $ 10^4m$ could be a good value for tile size.After deciding the bounding box and tile size, you can create the spatial index and import your geometries into the spatial index. For this you will use the from_df
method which puts the geometries in a pandas DataFrame into the spatial index. You only need to specify the name of the geometry ID column and the name of the geometry column. The extra parameter called verbosity, which controls how many processing logs to print out, can be set to error
which allows only summary and failure entries to display.
tile_size = 100000 #in meters
si = stc.tessellation_index(tile_size=tile_size) # we leave bbox as None to use full earth as boundingbox
si.from_df(county_df, 'NAME', 'geometry', verbosity='error') #Populate the spatial index
You can use the following APIs to query the spatial index:
Here a a few questions that you can get answers for.
white_plains_hospital = stc.point(41.026132, -73.769585)
si.containing(white_plains_hospital)
white_plains_WKT = 'POLYGON((-73.792 41.024,-73.794 41.031,-73.779 41.046,-73.78 41.049,-73.779 41.052,-73.776 41.054,-73.775 41.057,-73.767 41.058,-73.769 41.062,-73.768 41.067,-73.762 41.073,-73.759 41.074,-73.748 41.069,-73.746 41.056,-73.742 41.056,-73.74 41.053,-73.74 41.049,-73.749 41.04,-73.748 41.035,-73.739 41.034,-73.729 41.029,-73.725 41.025,-73.72 41.016,-73.717 41.015,-73.716 41.006,-73.718 41.002,-73.732 40.988,-73.732 40.985,-73.739 40.979,-73.745 40.978,-73.749 40.981,-73.749 40.986,-73.751 40.986,-73.756 40.991,-73.759 40.991,-73.76 40.993,-73.765 40.994,-73.769 40.997,-73.774 41.002,-73.775 41.006,-73.788 41.018,-73.792 41.024))'
wkt_reader = stc.wkt_reader()
white_plains = wkt_reader.read(white_plains_WKT)
si.containing(white_plains)
counties = si.nearest_neighbors(white_plains_hospital, 3)
counties
counties = si.within_distance(white_plains_hospital, 20000)
counties
This notebook showed you a simple spatial indexing example. You learnt how to create a spatial context, read some sample geometric data, create an index, and then query the index.
Linsong Chu, Research Engineer at IBM Research
Copyright © 2019-2024 IBM. This notebook and its source code are released under the terms of the MIT License.