Geospatial Data
What geospatial data is supported?
In general, the Python Geo Interface is supported; see https://gist.github.com/sgillies/2217756
There are three layers of information, modeled after GeoJSON (see https://tools.ietf.org/html/rfc7946):
- The outermost layer is a FeatureCollection
- Within a FeatureCollection are Features, each of which represents one “thing”, like a building or a river. Each feature has:
  - type (string; required)
  - id (object; required)
  - geometry (Geometry, see below; optional)
  - bbox (“bounding box”, 4 doubles; optional)
  - properties (map[string, object]; optional)
- Within a Feature is a Geometry, which represents a shape, like a point or a polygon. Each geometry has:
  - type (one of “Point”, “MultiPoint”, “LineString”, “MultiLineString”, “Polygon”, “MultiPolygon”; required)
  - coordinates (doubles; exactly how these are structured depends on the type of the geometry)
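As a rough illustration of the three layers, the sketch below builds a FeatureCollection from plain Python dictionaries. The id, coordinates, and property values are invented for the example; the type strings "Feature" and "FeatureCollection" follow RFC 7946.

```python
import json

# Innermost layer: a Geometry (type + coordinates).
geometry = {"type": "Point", "coordinates": [-71.0589, 42.3601]}

# Middle layer: a Feature wrapping the geometry. All values are illustrative.
feature = {
    "type": "Feature",
    "id": "building-1",
    "geometry": geometry,
    "bbox": [-71.0589, 42.3601, -71.0589, 42.3601],
    "properties": {"name": "City Hall", "floors": 9},
}

# Outermost layer: a FeatureCollection holding any number of features.
feature_collection = {"type": "FeatureCollection", "features": [feature]}

# Plain dicts like these serialize directly to GeoJSON text.
geojson_text = json.dumps(feature_collection)
```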
Although the Python Geo Interface is non-prescriptive when it comes to the data types of the id and properties, Unify has a more restricted set of supported types. See https://docs.tamr.com/reference#attribute-types
The Dataset class supports the __geo_interface__ property. This will produce one FeatureCollection for the entire dataset.
There is a companion iterator itergeofeatures() that returns a generator that allows you to stream the records in the dataset as Geospatial features.
To produce a GeoJSON representation of a dataset:
import json

# `client` is an already-configured Unify client
dataset = client.datasets.by_name("my_dataset")
with open("my_dataset.json", "w") as f:
    json.dump(dataset.__geo_interface__, f)
A Dataset can also be updated from a feature collection that supports the Python Geo Interface:
import geopandas
geodataframe = geopandas.GeoDataFrame(...)
dataset = client.datasets.by_name("my_dataset")
dataset.from_geo_features(geodataframe)
Rules for converting from Unify records to Geospatial Features
The record’s primary key will be used as the feature’s id. If the primary key is a single attribute, then the value of that attribute will be the value of id. If the primary key is composed of multiple attributes, then the value of the id will be an array with the values of the key attributes in order.
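The id rule above can be sketched as follows. This is an illustration only: the function name `feature_id` and the representation of a record as a dict are assumptions made for the example, not part of the client library.

```python
def feature_id(record, key_attributes):
    """Derive a feature id from a record (hypothetical helper).

    record: dict mapping attribute name to value.
    key_attributes: primary-key attribute names, in key order.
    """
    if len(key_attributes) == 1:
        # Single-attribute key: the id is that attribute's value, unchanged.
        return record[key_attributes[0]]
    # Composite key: the id is an array of the key values, in key order.
    return [record[name] for name in key_attributes]


simple_id = feature_id({"id": "42", "name": "river"}, ["id"])
composite_id = feature_id({"region": "ne", "seq": 7, "name": "river"}, ["region", "seq"])
```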
Unify allows any number of geometry attributes per record; the Python Geo Interface is limited to one. When converting Unify records to Python Geo Features, the first geometry attribute in the schema will be used as the geometry; all other geometry attributes will appear as properties with no type conversion. In the future, additional control over the handling of multiple geometries may be provided; the current set of capabilities is intended primarily to support the use case of working with FeatureCollections within Unify, and FeatureCollection has only one geometry per feature.
An attribute is considered to have geometry type if it has type RECORD and contains an attribute named point, multiPoint, lineString, multiLineString, polygon, or multiPolygon.
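A minimal sketch of this check, and of picking the first geometry attribute in a schema, is shown below. The shape of the attribute definitions (a dict with "name" and a "type" holding "baseType" and nested "attributes") is an assumption made for the example, and the helper names are hypothetical.

```python
GEOMETRY_NAMES = {
    "point", "multiPoint", "lineString",
    "multiLineString", "polygon", "multiPolygon",
}


def is_geometry(attribute):
    """True if the attribute is a RECORD containing a geometry-named sub-attribute."""
    attr_type = attribute.get("type", {})
    if attr_type.get("baseType") != "RECORD":
        return False
    sub_attributes = attr_type.get("attributes", [])
    return any(sub["name"] in GEOMETRY_NAMES for sub in sub_attributes)


def first_geometry_attribute(schema):
    """Return the name of the first geometry attribute in schema order, or None."""
    for attribute in schema:
        if is_geometry(attribute):
            return attribute["name"]
    return None


# Illustrative schema: one string attribute and two geometry attributes.
schema = [
    {"name": "id", "type": {"baseType": "STRING"}},
    {"name": "footprint", "type": {"baseType": "RECORD", "attributes": [{"name": "polygon"}]}},
    {"name": "centroid", "type": {"baseType": "RECORD", "attributes": [{"name": "point"}]}},
]
chosen = first_geometry_attribute(schema)
```

Under these assumptions, `footprint` is chosen as the feature's geometry and `centroid` would be carried along in properties.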
If an attribute named bbox is available, it will be used as the feature’s bbox. No conversion is done on the value of bbox. In the future, additional control over the handling of bbox attributes may be provided.
All other attributes will be placed in properties, with no type conversion. This includes all geometry attributes other than the first.
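The split between bbox and properties can be sketched as below. The helper name and the dict-based record are assumptions for illustration; the key attributes and the chosen geometry attribute are taken as already known.

```python
def bbox_and_properties(record, key_attributes, geometry_attribute):
    """Split a record into (bbox, properties) for a feature (hypothetical helper)."""
    # The key, bbox, and the chosen geometry are handled elsewhere in the feature.
    reserved = set(key_attributes) | {"bbox"}
    if geometry_attribute is not None:
        reserved.add(geometry_attribute)
    # bbox is passed through as-is, with no conversion.
    bbox = record.get("bbox")
    # Everything else, including any additional geometry attributes, is copied unchanged.
    properties = {name: value for name, value in record.items() if name not in reserved}
    return bbox, properties


record = {
    "id": "42",
    "footprint": {"polygon": [[[0, 0], [1, 0], [1, 1], [0, 0]]]},
    "centroid": {"point": [0.5, 0.5]},
    "bbox": [0, 0, 1, 1],
    "name": "warehouse",
}
bbox, properties = bbox_and_properties(record, ["id"], "footprint")
```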
Rules for converting from Geospatial Features to Unify records
The Feature’s id will be converted into the primary key for the record. If the record uses a simple key, no value translation will be done. If the record uses a composite key, then the value of the Feature’s id must be an array of values, one per attribute in the key.
If the Feature contains keys in properties that conflict with the record keys, bbox, or geometry, those keys are ignored (omitted).
If the Feature contains a bbox, it is copied to the record’s bbox.
All other keys in the Feature’s properties are propagated to the same-name attribute on the record, with no type conversion.
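The rules above can be sketched roughly as follows. The function name and signature are hypothetical, records are modeled as plain dicts, and the translation of the Feature's geometry into the record's geometry attribute is left out because this section does not describe it.

```python
def feature_to_record(feature, key_attributes, geometry_attribute):
    """Convert a feature dict into a record dict (illustrative sketch, geometry omitted)."""
    record = {}
    # Property keys that collide with the key, bbox, or geometry are dropped.
    reserved = set(key_attributes) | {"bbox", geometry_attribute}
    for name, value in (feature.get("properties") or {}).items():
        if name not in reserved:
            record[name] = value  # copied with no type conversion
    if len(key_attributes) == 1:
        # Simple key: the id is used as-is.
        record[key_attributes[0]] = feature["id"]
    else:
        # Composite key: the id must supply one value per key attribute.
        values = feature["id"]
        if len(values) != len(key_attributes):
            raise ValueError("id must contain one value per key attribute")
        record.update(zip(key_attributes, values))
    if "bbox" in feature:
        record["bbox"] = feature["bbox"]
    return record


feature = {
    "type": "Feature",
    "id": ["ne", 7],
    "bbox": [0, 0, 1, 1],
    "properties": {"name": "river", "region": "ignored", "bbox": "ignored"},
}
record = feature_to_record(feature, ["region", "seq"], "geom")
```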
Streaming data access
The Dataset method itergeofeatures() returns a generator that allows you to stream the records in the dataset as Geospatial features:
my_dataset = client.datasets.by_name("my_dataset")
for feature in my_dataset.itergeofeatures():
    do_something(feature)
Note that many packages that consume the Python Geo Interface will be able to consume this iterator directly. For example:
from geopandas import GeoDataFrame
df = GeoDataFrame.from_features(my_dataset.itergeofeatures())
This allows construction of a GeoDataFrame directly from the stream of records, without materializing the intermediate dataset.
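The same streaming pattern works with any consumer that accepts an iterable of features. As a minimal sketch, the helper below writes features one per line without collecting them in a list first. `write_feature_lines` and `fake_features` are hypothetical names; `fake_features` stands in for `my_dataset.itergeofeatures()` so the example runs without a server.

```python
import io
import json


def write_feature_lines(features, stream):
    """Write each feature as one JSON line; return the number written."""
    count = 0
    for feature in features:
        stream.write(json.dumps(feature))
        stream.write("\n")
        count += 1
    return count


def fake_features():
    # Stand-in for my_dataset.itergeofeatures(); yields illustrative features.
    for i in range(3):
        yield {
            "type": "Feature",
            "id": str(i),
            "geometry": {"type": "Point", "coordinates": [i, i]},
            "properties": {},
        }


buffer = io.StringIO()
written = write_feature_lines(fake_features(), buffer)
```

Only one feature is held in memory at a time, so the approach scales with the size of a single record rather than the size of the dataset.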