Datasets

Dataset

class tamr_unify_client.dataset.resource.Dataset(client, data, alias=None)[source]

A Tamr dataset.

property name
Type

str

property external_id
Type

str

property description
Type

str

property version
Type

str

property tags
Type

list[str]

property key_attribute_names
Type

list[str]

property attributes

Attributes of this dataset.

Returns

Attributes of this dataset.

Return type

AttributeCollection

upsert_from_dataframe(df, *, primary_key_name, ignore_nan=True)[source]

Upserts a record for each row of df with attributes for each column in df.

Parameters
  • df (DataFrame) – The data to upsert records from.

  • primary_key_name (str) – The name of the primary key of the dataset. Must be a column of df.

  • ignore_nan (bool) – Whether to convert NaN values to null before upserting records to Tamr. If False and NaN is in df, this function will fail. Optional, default is True.

Return type

dict

Returns

JSON response body from the server.

Raises

KeyError – If primary_key_name is not a column in df.
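
The reason ignore_nan matters can be seen with the standard library alone: json.dumps writes NaN as the bare token NaN, which is not valid JSON, while None becomes null. The following is a minimal sketch of that conversion for a single record. nan_to_none is a hypothetical helper, not part of tamr_unify_client, and the library's internal handling may differ.

```python
import json
import math


def nan_to_none(record):
    """Return a copy of record with float NaN values replaced by None.

    Hypothetical helper for illustration; not part of tamr_unify_client.
    """
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }


row = {"id": "1", "score": float("nan")}

# By default json.dumps emits the non-standard token NaN.
raw = json.dumps(row)

# After conversion the value serializes as a JSON null.
clean = json.dumps(nan_to_none(row))
```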

upsert_records(records, primary_key_name, **json_args)[source]

Creates or updates the specified records.

Parameters
  • records (iterable[dict]) – The records to update, as dictionaries.

  • primary_key_name (str) – The name of the primary key for these records, which must be a key in each record dictionary.

  • **json_args – Arguments to pass to the JSON dumps function, as documented for json.dumps in the Python standard library. Some of these, such as indent, may not work with Tamr.

Returns

JSON response body from the server.

Return type

dict
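
As a rough usage sketch, the function below upserts two records and forwards default=str through **json_args so that date values can be serialized. It assumes dataset is a Dataset; the attribute names and values are made up, and whether a given json.dumps option is accepted by Tamr should be checked against your server.

```python
import json
from datetime import date


def upsert_people(dataset):
    """Upsert two records into dataset, keyed on "person_id".

    dataset is assumed to be a tamr_unify_client Dataset; the attribute
    names and values below are invented for illustration.
    """
    records = [
        {"person_id": "1", "name": "Ada", "joined": date(2020, 1, 31)},
        {"person_id": "2", "name": "Grace", "joined": date(2021, 6, 1)},
    ]
    # default=str is forwarded to json.dumps so the date values serialize.
    return dataset.upsert_records(records, "person_id", default=str)


# What default=str does to a date value during serialization:
encoded = json.dumps({"joined": date(2020, 1, 31)}, default=str)
```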

delete_records(records, primary_key_name)[source]

Deletes the specified records.

Parameters
  • records (iterable[dict]) – The records to delete, as dictionaries.

  • primary_key_name (str) – The name of the primary key for these records, which must be a key in each record dictionary.

Returns

JSON response body from the server.

Return type

dict

delete_records_by_id(record_ids)[source]

Deletes the specified records.

Parameters

record_ids (iterable) – The IDs of the records to delete.

Returns

JSON response body from the server.

Return type

dict
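
delete_records takes whole record dictionaries, while delete_records_by_id needs only the IDs. The sketch below derives the IDs from record dictionaries before deleting. It assumes dataset is a Dataset and, as a further assumption not stated on this page, that a record's ID is its primary-key value.

```python
def delete_by_key(dataset, records, primary_key_name):
    """Delete records from dataset by ID instead of by full dictionary.

    dataset is assumed to be a tamr_unify_client Dataset. Treating each
    record's primary-key value as its record ID is an assumption.
    """
    # A record that lacks the primary key raises KeyError here.
    record_ids = [record[primary_key_name] for record in records]
    return dataset.delete_records_by_id(record_ids)
```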

delete_all_records()[source]

Removes all records from the dataset.

Returns

HTTP response from the server

Return type

requests.Response

refresh(**options)[source]

Brings dataset up-to-date if needed, taking whatever actions are required.

Parameters

**options – Options passed to underlying Operation. See apply_options().

Returns

The refresh operation.

Return type

Operation

profile()[source]

Returns profile information for a dataset.

If profile information has not been generated, call create_profile() first. If the returned profile information is out-of-date, you can call refresh() on the returned object to bring it up-to-date.

Returns

Dataset Profile information.

Return type

DatasetProfile

create_profile(**options)[source]

Create a profile for this dataset.

If a profile already exists, the existing profile will be brought up to date.

Parameters

**options – Options passed to underlying Operation. See apply_options().

Returns

The operation to create the profile.

Return type

Operation
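
The interplay between profile(), is_up_to_date and refresh() can be sketched as below. The function assumes dataset is a Dataset whose profile has already been generated, and that the refresh operation has finished by the time refresh() returns. It is one possible workflow, not a prescribed one.

```python
def fresh_profile(dataset):
    """Return a profile for dataset, refreshing it first if it is stale.

    dataset is assumed to be a tamr_unify_client Dataset whose profile has
    already been generated (otherwise call dataset.create_profile() first).
    """
    profile = dataset.profile()
    if not profile.is_up_to_date:
        # refresh() updates the profile on the server only, so fetch it
        # again afterwards to see the new values.
        profile.refresh()
        profile = dataset.profile()
    return profile
```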

records()[source]

Stream this dataset’s records as Python dictionaries.

Returns

Stream of records.

Return type

Python generator yielding dict

status()[source]

Retrieve this dataset’s streamability status.

Returns

Dataset streamability status.

Return type

DatasetStatus

usage()[source]

Retrieve this dataset’s usage by recipes and downstream datasets.

Returns

The dataset’s usage.

Return type

DatasetUsage

from_geo_features(features, geo_attr=None)[source]

Upsert this dataset from a geospatial FeatureCollection or iterable of Features.

features can be:

  • An object that implements __geo_interface__ as a FeatureCollection (see https://gist.github.com/sgillies/2217756)

  • An iterable of features, where each element is a feature dictionary or an object that implements the __geo_interface__ as a Feature

  • A map where the “features” key contains an iterable of features

See: geopandas.GeoDataFrame.from_features()

If geo_attr is provided, then the named Tamr attribute will be used for the geometry. If geo_attr is not provided, then the first attribute on the dataset with geometry type will be used for the geometry.

Parameters
  • features – geospatial features

  • geo_attr (str) – (optional) name of the Tamr attribute to use for the feature’s geometry
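
The three accepted shapes of features can be illustrated with a small normalizer that turns any of them into plain feature dictionaries. This is an illustrative sketch only, not the library's implementation; iter_features and the Point class are hypothetical names.

```python
def iter_features(features):
    """Yield feature dictionaries from any of the three accepted shapes.

    Illustrative only; this is not the library's implementation.
    """
    # Shape 1: an object exposing __geo_interface__ as a FeatureCollection.
    if hasattr(features, "__geo_interface__"):
        features = features.__geo_interface__
    # Shape 3 (and shape 1 after unwrapping): a map with a "features" key.
    if isinstance(features, dict) and "features" in features:
        features = features["features"]
    # Shape 2: an iterable of feature dicts or Feature-like objects.
    for feature in features:
        yield getattr(feature, "__geo_interface__", feature)


class Point:
    """Tiny Feature-like object used only for this example."""

    def __init__(self, fid, lon, lat):
        self.fid, self.lon, self.lat = fid, lon, lat

    @property
    def __geo_interface__(self):
        return {
            "type": "Feature",
            "id": self.fid,
            "geometry": {"type": "Point", "coordinates": [self.lon, self.lat]},
            "properties": {},
        }


as_list = list(iter_features([Point("a", 0.0, 1.0)]))
as_map = list(iter_features({"features": [Point("b", 2.0, 3.0)]}))
```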

upstream_datasets()[source]

The Dataset’s upstream datasets.

The API returns the URIs of the upstream datasets, so the result is a list of DatasetURIs, not actual Datasets.

Returns

A list of the Dataset’s upstream datasets.

Return type

list[DatasetURI]
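
Since the result holds DatasetURI objects, each one has to be resolved with its dataset() method before Dataset properties are available. A minimal sketch, assuming dataset is a Dataset (upstream_names is a hypothetical helper):

```python
def upstream_names(dataset):
    """Return the names of dataset's upstream datasets.

    dataset is assumed to be a tamr_unify_client Dataset. Each DatasetURI
    is resolved with dataset(), which fetches the actual Dataset.
    """
    return [uri.dataset().name for uri in dataset.upstream_datasets()]
```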

spec()[source]

Returns this dataset’s spec.

Returns

The spec of this dataset.

Return type

DatasetSpec

delete(cascade=False)[source]

Deletes this dataset, optionally deleting all derived datasets as well.

Parameters

cascade (bool) – Whether to delete all datasets derived from this one. Optional, default is False. Do not use this option unless you are certain you need it, as it can have unintended consequences.

Returns

HTTP response from the server

Return type

requests.Response

itergeofeatures(geo_attr=None)[source]

Returns an iterator that yields feature dictionaries that comply with __geo_interface__

See https://gist.github.com/sgillies/2217756

Parameters

geo_attr (str) – (optional) name of the Tamr attribute to use for the feature’s geometry

Returns

stream of features

Return type

Python generator yielding dict[str, object]
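
Because the generator yields plain feature dictionaries, wrapping them in a FeatureCollection gives a GeoJSON document. The sketch below assumes dataset is a Dataset with at least one geometry attribute and that every yielded value is JSON-serializable; both helper names are hypothetical.

```python
import json


def to_feature_collection(features):
    """Wrap an iterable of feature dicts in a GeoJSON FeatureCollection."""
    return {"type": "FeatureCollection", "features": list(features)}


def export_geojson(dataset, geo_attr=None):
    """Serialize a dataset's features; dataset is assumed to be a Dataset."""
    features = dataset.itergeofeatures(geo_attr=geo_attr)
    return json.dumps(to_feature_collection(features))
```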

property relative_id
Type

str

property resource_id
Type

str

Dataset Spec

class tamr_unify_client.dataset.resource.DatasetSpec(client, data, api_path)[source]

A representation of the server view of a dataset.

static of(resource)[source]

Creates a dataset spec from a dataset.

Parameters

resource (Dataset) – The existing dataset.

Returns

The corresponding dataset spec.

Return type

DatasetSpec

static new()[source]

Creates a blank spec that could be used to construct a new dataset.

Returns

The empty spec.

Return type

DatasetSpec

from_data(data)[source]

Creates a spec with the same client and API path as this one, but new data.

Parameters

data (dict) – The data for the new spec.

Returns

The new spec.

Return type

DatasetSpec

to_dict()[source]

Returns a version of this spec that conforms to the API representation.

Returns

The spec’s dict.

Return type

dict

with_name(new_name)[source]

Creates a new spec with the same properties, updating name.

Parameters

new_name (str) – The new name.

Returns

A new spec.

Return type

DatasetSpec

with_external_id(new_external_id)[source]

Creates a new spec with the same properties, updating external ID.

Parameters

new_external_id (str) – The new external ID.

Returns

A new spec.

Return type

DatasetSpec

with_description(new_description)[source]

Creates a new spec with the same properties, updating description.

Parameters

new_description (str) – The new description.

Returns

A new spec.

Return type

DatasetSpec

with_key_attribute_names(new_key_attribute_names)[source]

Creates a new spec with the same properties, updating key attribute names.

Parameters

new_key_attribute_names (list[str]) – The new key attribute names.

Returns

A new spec.

Return type

DatasetSpec

with_tags(new_tags)[source]

Creates a new spec with the same properties, updating tags.

Parameters

new_tags (list[str]) – The new tags.

Returns

A new spec.

Return type

DatasetSpec

put()[source]

Updates the dataset on the server.

Returns

The modified dataset.

Return type

Dataset
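
Each with_* method returns a new spec, so changes can be chained and then sent to the server with put(). A minimal sketch, assuming dataset is a Dataset (retag is a hypothetical helper name):

```python
def retag(dataset, description, tags):
    """Update description and tags of dataset and return the modified Dataset.

    dataset is assumed to be a tamr_unify_client Dataset.
    """
    return (
        dataset.spec()
        .with_description(description)
        .with_tags(tags)
        .put()  # nothing is sent to the server until this call
    )
```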

Dataset Collection

class tamr_unify_client.dataset.collection.DatasetCollection(client, api_path='datasets')[source]

Collection of Datasets.

Parameters
  • client (Client) – Client for API call delegation.

  • api_path (str) – API path used to access this collection. E.g. "projects/1/inputDatasets". Default: "datasets".

by_resource_id(resource_id)[source]

Retrieve a dataset by resource ID.

Parameters

resource_id (str) – The resource ID. E.g. "1"

Returns

The specified dataset.

Return type

Dataset

by_relative_id(relative_id)[source]

Retrieve a dataset by relative ID.

Parameters

relative_id (str) – The relative ID. E.g. "datasets/1"

Returns

The specified dataset.

Return type

Dataset

by_external_id(external_id)[source]

Retrieve a dataset by external ID.

Parameters

external_id (str) – The external ID.

Returns

The specified dataset, if found.

Return type

Dataset

Raises
  • KeyError – If no dataset with the specified external_id is found

  • LookupError – If multiple datasets with the specified external_id are found
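
One way to treat "not found" as a normal outcome while still surfacing ambiguous matches is sketched below. It assumes collection is a DatasetCollection; find_by_external_id is a hypothetical helper.

```python
def find_by_external_id(collection, external_id):
    """Return the matching dataset, or None if there is no match.

    collection is assumed to be a DatasetCollection. A LookupError
    (multiple matches) is deliberately left to propagate.
    """
    try:
        return collection.by_external_id(external_id)
    except KeyError:
        return None
```

In Python, KeyError is a subclass of LookupError, so code that handles both should catch KeyError first; a bare except LookupError would also swallow the "not found" case.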

stream()[source]

Stream datasets in this collection. Implicitly called when iterating over this collection.

Returns

Stream of datasets.

Return type

Python generator yielding Dataset

Usage:
>>> for dataset in collection.stream():  # explicit
...     do_stuff(dataset)
>>> for dataset in collection:  # implicit
...     do_stuff(dataset)

by_name(dataset_name)[source]

Lookup a specific dataset in this collection by exact-match on name.

Parameters

dataset_name (str) – Name of the desired dataset.

Returns

Dataset with matching name in this collection.

Return type

Dataset

Raises

KeyError – If no dataset with specified name was found.

delete_by_resource_id(resource_id, cascade=False)[source]

Deletes a dataset from this collection by resource_id. Optionally deletes all derived datasets as well.

Parameters
  • resource_id (str) – The resource id of the dataset in this collection to delete.

  • cascade (bool) – Whether to delete all datasets derived from the deleted one. Optional, default is False. Do not use this option unless you are certain you need it, as it can have unintended consequences.

Returns

HTTP response from the server.

Return type

requests.Response

create(creation_spec)[source]

Create a Dataset in Tamr

Parameters

creation_spec (dict[str, str]) – The dataset creation specification, formatted as specified in the Public Docs for Creating a Dataset.

Returns

The created Dataset

Return type

Dataset

create_from_dataframe(df, primary_key_name, dataset_name, ignore_nan=True)[source]

Creates a dataset in this collection with the given name, creates an attribute for each column in the df (with primary_key_name as the key attribute), and upserts a record for each row of df.

Each attribute has the default type ARRAY[STRING], besides the key attribute, which will have type STRING.

This function attempts to ensure atomicity, but it is not guaranteed. If an error occurs while creating attributes or records, an attempt will be made to delete the dataset that was created. If that deletion request itself fails, it is not retried.

Parameters
  • df (pandas.DataFrame) – The data to create the dataset with.

  • primary_key_name (str) – The name of the primary key of the dataset. Must be a column of df.

  • dataset_name (str) – What to name the dataset in Tamr. There cannot already be a dataset with this name.

  • ignore_nan (bool) – Whether to convert NaN values to null before upserting records to Tamr. If False and NaN is in df, this function will fail. Optional, default is True.

Returns

The newly created dataset.

Return type

Dataset

Raises
  • KeyError – If primary_key_name is not a column in df.

  • CreationError – If a step in creating the dataset fails.
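
The best-effort atomicity described above follows a create, populate, clean-up-on-failure pattern. The sketch below shows that pattern in isolation. It is not the library's source: the CreationError class here is a local stand-in, and create, populate and delete are hypothetical callables supplied by the caller.

```python
class CreationError(Exception):
    """Stand-in for tamr_unify_client.dataset.collection.CreationError."""


def create_with_cleanup(create, populate, delete):
    """Create a resource, populate it, and delete it if population fails.

    create, populate and delete are hypothetical callables supplied by the
    caller; this sketches the documented behaviour, not the library source.
    """
    resource = create()
    try:
        populate(resource)
    except Exception as error:
        try:
            # Single best-effort cleanup attempt; it is not retried.
            delete(resource)
        except Exception:
            pass
        raise CreationError(f"Population failed: {error}") from error
    return resource
```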

class tamr_unify_client.dataset.collection.CreationError(error_message)[source]

An error from create_from_dataframe()

with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

Dataset Profile

class tamr_unify_client.dataset.profile.DatasetProfile(client, data, alias=None)[source]

Profile info of a Tamr dataset.

property dataset_name

The name of the associated dataset.

Type

str

Return type

str

property relative_dataset_id

The relative dataset ID of the associated dataset.

Type

str

Return type

str

property is_up_to_date

Whether the associated dataset is up to date.

Type

bool

Return type

bool

property profiled_data_version

The profiled data version.

Type

str

Return type

str

property profiled_at

Info about when profile info was generated.

Type

dict

Return type

dict

property simple_metrics

Simple metrics for profiled dataset.

Type

list

Return type

list

property attribute_profiles

Per-attribute profiles for profiled dataset.

Type

list

Return type

list

refresh(**options)[source]

Updates the dataset profile if needed.

The dataset profile is updated on the server; you will need to call profile() to retrieve the updated profile.

Parameters

**options – Options passed to underlying Operation. See apply_options().

Returns

The refresh operation.

Return type

Operation

delete()

Deletes this resource. Some resources do not support deletion, and will raise a 405 error if this is called.

Returns

HTTP response from the server

Return type

requests.Response

property relative_id
Type

str

property resource_id
Type

str

Dataset Status

class tamr_unify_client.dataset.status.DatasetStatus(client, data, alias=None)[source]

Streamability status of a Tamr dataset.

property dataset_name

The name of the associated dataset.

Type

str

Return type

str

property relative_dataset_id

The relative dataset ID of the associated dataset.

Type

str

Return type

str

property is_streamable

Whether the associated dataset is available to be streamed.

Type

bool

Return type

bool

delete()

Deletes this resource. Some resources do not support deletion, and will raise a 405 error if this is called.

Returns

HTTP response from the server

Return type

requests.Response

property relative_id
Type

str

property resource_id
Type

str

Dataset URI

class tamr_unify_client.dataset.uri.DatasetURI(client, uri)[source]

Identifier of a dataset.

Parameters
  • client (Client) – Queried dataset’s client.

  • uri (str) – Queried dataset’s dataset ID.

property resource_id
Type

str

property relative_id
Type

str

property uri
Type

str

dataset()[source]

Fetch the dataset that this identifier points to.

Returns

A Tamr dataset.

Return type

Dataset

Dataset Usage

class tamr_unify_client.dataset.usage.DatasetUsage(client, data, alias=None)[source]

The usage of a dataset and its downstream dependencies.

See https://docs.tamr.com/reference#retrieve-downstream-dataset-usage

property relative_id
Type

str

property usage
Type

DatasetUse

property dependencies
Type

list[DatasetUse]

delete()

Deletes this resource. Some resources do not support deletion, and will raise a 405 error if this is called.

Returns

HTTP response from the server

Return type

requests.Response

property resource_id
Type

str

Dataset Use

class tamr_unify_client.dataset.use.DatasetUse(client, data)[source]

The use of a dataset in project steps. This is not a BaseResource because it has no API path and cannot be directly retrieved or modified.

See https://docs.tamr.com/reference#retrieve-downstream-dataset-usage

Parameters
  • client (Client) – Delegate underlying API calls to this client.

  • data (dict) – The JSON body containing usage information.

property dataset_id
Type

str

property dataset_name
Type

str

property input_to_project_steps
Type

list[ProjectStep]

property output_from_project_steps
Type

list[ProjectStep]

dataset()[source]

Retrieves the Dataset this use represents.

Returns

The dataset being used.

Return type

Dataset
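
Putting DatasetUsage and DatasetUse together, the names of a dataset's downstream dependencies can be listed as in the sketch below. It assumes dataset is a Dataset; downstream_names is a hypothetical helper.

```python
def downstream_names(dataset):
    """Return the names of the datasets that depend on dataset.

    dataset is assumed to be a tamr_unify_client Dataset; usage() returns
    a DatasetUsage whose dependencies are DatasetUse objects.
    """
    usage = dataset.usage()
    return [use.dataset_name for use in usage.dependencies]
```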