Geospatial Data¶
What geospatial data is supported?¶
In general, the Python Geo Interface is supported; see https://gist.github.com/sgillies/2217756.
There are three layers of information, modeled after GeoJSON (see https://tools.ietf.org/html/rfc7946):
The outermost layer is a FeatureCollection
Within a FeatureCollection are Features, each of which represents one “thing”, like a building or a river. Each feature has:
type (string; required)
id (object; required)
geometry (Geometry, see below; optional)
bbox (“bounding box”, 4 doubles; optional)
properties (map[string, object]; optional)
Within a Feature is a Geometry, which represents a shape, like a point or a polygon. Each geometry has:
type (one of “Point”, “MultiPoint”, “LineString”, “MultiLineString”, “Polygon”, “MultiPolygon”; required)
coordinates (doubles; exactly how these are structured depends on the type of the geometry)
Although the Python Geo Interface is non-prescriptive when it comes to the data types of the id and properties, Tamr has a more restricted set of supported types. See https://docs.tamr.com/reference#attribute-types.
The Dataset
class supports the __geo_interface__
property. This will produce one FeatureCollection
for the entire dataset.
There is a companion iterator itergeofeatures()
that returns a generator that allows you to
stream the records in the dataset as Geospatial features.
To produce a GeoJSON representation of a dataset:
dataset = client.datasets.by_name("my_dataset")
with open("my_dataset.json", "w") as f:
json.dump(dataset.__geo_interface__, f)
By default, itergeofeatures()
will use the first dataset attribute with geometry type to fill in the feature geometry. You can override this by specifying the geometry attribute to use in the geo_attr
parameter to itergeofeatures
.
Dataset
can also be updated from a feature collection that supports the Python Geo Interface:
import geopandas
geodataframe = geopandas.GeoDataFrame(...)
dataset = client.dataset.by_name("my_dataset")
dataset.from_geo_features(geodataframe)
By default the features’ geometries will be placed into the first dataset attribute with geometry
type. You can override this by specifying the geometry attribute to use in the geo_attr
parameter to from_geo_features
.
Rules for converting from Tamr records to Geospatial Features¶
The record’s primary key will be used as the feature’s id
. If the primary key is a single attribute, then the value of that attribute will be the value of id
. If the primary key is composed of multiple attributes, then the value of the id
will be an array with the values of the key attributes in order.
Tamr allows any number of geometry attributes per record; the Python Geo Interface is limited to one. When converting Tamr records to Python Geo Features, the first geometry attribute in the schema will be used as the geometry; all other geometry attributes will appear as properties with no type conversion. In the future, additional control over the handling of multiple geometries may be provided; the current set of capabilities is intended primarily to support the use case of working with FeatureCollections within Tamr, and FeatureCollection has only one geometry per feature.
An attribute is considered to have geometry type if it has type RECORD
and contains an attribute named point
, multiPoint
, lineString
, multiLineString
, polygon
, or multiPolygon
.
If an attribute named bbox
is available, it will be used as bbox
. No conversion is done on the value of bbox
. In the future, additional control over the handling of bbox
attributes may be provided.
All other attributes will be placed in properties
, with no type conversion. This includes all geometry attributes other than the first.
Rules for converting from Geospatial Features to Tamr records¶
The Feature’s id
will be converted into the primary key for the record. If the record uses a simple key, no value translation will be done. If the record uses a composite key, then the value of the Feature’s id
must be an array of values, one per attribute in the key.
If the Feature contains keys in properties
that conflict with the record keys, bbox
, or geometry, those keys are ignored (omitted).
If the Feature contains a bbox
, it is copied to the record’s bbox
.
All other keys in the Feature’s properties
are propagated to the same-name attribute on the record, with no type conversion.
Streaming data access¶
The Dataset
method itergeofeatures()
returns a generator that allows you to stream the records in the dataset as Geospatial features:
my_dataset = client.datasets.by_name("my_dataset")
for feature in my_dataset.itergeofeatures():
do_something(feature)
Note that many packages that consume the Python Geo Interface will be able to consume this iterator directly. For example::
from geopandas import GeoDataFrame
df = GeoDataFrame.from_features(my_dataset.itergeofeatures())
This allows construction of a GeoDataFrame directly from the stream of records, without materializing the intermediate dataset.