DynamoDB
Important Capabilities
Capability | Status | Notes |
---|---|---|
Detect Deleted Entities | ✅ | Optionally enabled via stateful_ingestion.remove_stale_metadata |
Platform Instance | ✅ | By default, platform_instance will use the AWS account id |
This plugin extracts the following:
AWS DynamoDB table names with their region, and the schema (attribute names and types) inferred by scanning each table
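To make the idea of inferring a schema from scanned items concrete, here is a minimal sketch. It is not the connector's actual implementation: the function name `infer_schema` and the sample items (including the `Votes` attribute) are hypothetical, and the only assumption carried over from this page is that items arrive in DynamoDB format, where each attribute value is wrapped in a type descriptor such as `S` or `N`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Set


def infer_schema(
    items: Iterable[Mapping[str, Mapping[str, object]]]
) -> Dict[str, Set[str]]:
    """Map each attribute name to the set of DynamoDB type descriptors seen for it."""
    schema: Dict[str, Set[str]] = defaultdict(set)
    for item in items:
        for attribute_name, typed_value in item.items():
            # Each value is a one-entry mapping such as {"S": "text"} or {"N": "42"}.
            for type_descriptor in typed_value:
                schema[attribute_name].add(type_descriptor)
    return dict(schema)


# Hypothetical sample data, made up for illustration only.
sample_items = [
    {"Id": {"S": "Amazon DynamoDB#DynamoDB Thread 1"}, "Votes": {"N": "3"}},
    {"Id": {"S": "Amazon DynamoDB#DynamoDB Thread 2"}, "Votes": {"S": "n/a"}},
]
inferred = infer_schema(sample_items)
```

Because DynamoDB tables have no fixed schema, an attribute can carry different types in different items, which is why the sketch collects a set of type descriptors per attribute rather than a single type.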
Prerequisites
To run this source, you need an access key and secret key that have DynamoDB read access. You can create the policies below and attach them to your account, or ask your account admin to attach them for you.
For access key permissions, create a policy with the permissions below and attach it to your account. For more details, see Managing access keys for IAM users.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"iam:ListAccessKeys",
"iam:CreateAccessKey",
"iam:UpdateAccessKey",
"iam:DeleteAccessKey"
],
"Resource": "arn:aws:iam::${aws_account_id}:user/${aws:username}"
}
]
}
For DynamoDB read access, attach the AWS managed policy AmazonDynamoDBReadOnlyAccess to your account. For more details, see Attaching a policy to an IAM user group.
CLI based Ingestion
Install the Plugin
pip install 'acryl-datahub[dynamodb]'
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
  type: dynamodb
  config:
    platform_instance: "AWS_ACCOUNT_ID"
    aws_access_key_id: "${AWS_ACCESS_KEY_ID}"
    aws_secret_access_key: "${AWS_SECRET_ACCESS_KEY}"
    # Optionally provide a list of primary keys for a table, in DynamoDB format.
    # The items with those primary keys are included when the table is scanned.
    # For each table we can retrieve up to 16 MB of data, which can contain as many as 100 items.
    # The primary key list for a table must not exceed 100 entries.
    # The total number of items retrieved in the two scenarios:
    # 1. If include_table_item is not specified: up to 100 items.
    # 2. If include_table_item is specified: up to 100 items plus the user-specified items,
    #    for a total of no more than 200 items.
    # include_table_item:
    #   table_name:
    #     [
    #       {
    #         "partition_key_name": { "attribute_type": "attribute_value" },
    #         "sort_key_name": { "attribute_type": "attribute_value" },
    #       },
    #     ]
sink:
  # sink configs
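The item limits described in the recipe comments can be expressed as a small check. This is only a sketch under the limits stated above (up to 100 scanned items and at most 100 user-specified primary keys per table); the function name `max_items_for_table` and the constants are hypothetical and are not part of the connector.

```python
from typing import Dict, List

MAX_SCANNED_ITEMS = 100  # items retrieved by the table scan, per the recipe comments
MAX_INCLUDED_KEYS = 100  # upper bound on user-specified primary keys per table


def max_items_for_table(
    include_table_item: Dict[str, List[dict]], table_name: str
) -> int:
    """Return the upper bound on items retrieved for one table."""
    keys = include_table_item.get(table_name, [])
    if len(keys) > MAX_INCLUDED_KEYS:
        raise ValueError(
            f"{table_name}: {len(keys)} primary keys given, "
            f"at most {MAX_INCLUDED_KEYS} are allowed"
        )
    # Scanned items plus explicitly requested items, so never more than 200.
    return MAX_SCANNED_ITEMS + len(keys)
```

With no entry for a table the bound is 100 items; with two primary keys listed for it the bound is 102.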
Config Details
- Options
- Schema
Note that a "." is used to denote nested fields in the YAML recipe.
Field | Description |
---|---|
aws_access_key_id ✅ string | AWS Access Key ID. |
aws_secret_access_key ✅ string(password) | AWS Secret Key. |
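include_table_item map(str,array) | [Advanced] The primary keys, in DynamoDB format, of the items of a table that the user would like to include in the schema. Refer to the "Advanced Configurations" section for more details. |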
platform_instance string | The instance of the platform that all assets produced by this recipe belong to |
env string | The environment that all assets produced by this connector belong to Default: PROD |
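table_pattern AllowDenyPattern | Regex patterns for tables to filter in ingestion. Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
table_pattern.allow array(string) | List of regex patterns to include in ingestion. Default: ['.*'] |
table_pattern.deny array(string) | List of regex patterns to exclude from ingestion. Default: [] |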
table_pattern.ignoreCase boolean | Whether to ignore case sensitivity during pattern matching. Default: True |
stateful_ingestion StatefulStaleMetadataRemovalConfig | Base specialized config for Stateful Ingestion with stale metadata removal capability. |
stateful_ingestion.enabled boolean | Whether stateful ingestion is enabled. Default: False |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
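As an illustration of how the table_pattern options interact, here is a rough sketch of allow/deny filtering. It is an approximation, not DataHub's AllowDenyPattern class: the function name `is_table_allowed` is hypothetical, and two behaviours are assumptions made for this example, namely that deny patterns take precedence over allow patterns and that patterns are matched from the start of the table name.

```python
import re
from typing import List


def is_table_allowed(
    table_name: str,
    allow: List[str],
    deny: List[str],
    ignore_case: bool = True,
) -> bool:
    """Return True if the table name passes the allow/deny regex filters."""
    flags = re.IGNORECASE if ignore_case else 0
    # Assumption: a matching deny pattern always excludes the table.
    if any(re.match(pattern, table_name, flags) for pattern in deny):
        return False
    # Otherwise the table must match at least one allow pattern.
    return any(re.match(pattern, table_name, flags) for pattern in allow)
```

With the defaults shown in the table (allow `.*`, empty deny list, case ignored), every table passes the filter.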
The JSONSchema for this configuration is inlined below.
{
"title": "DynamoDBConfig",
"description": "Any source that is a primary producer of Dataset metadata should inherit this class",
"type": "object",
"properties": {
"stateful_ingestion": {
"$ref": "#/definitions/StatefulStaleMetadataRemovalConfig"
},
"env": {
"title": "Env",
"description": "The environment that all assets produced by this connector belong to",
"default": "PROD",
"type": "string"
},
"platform_instance": {
"title": "Platform Instance",
"description": "The instance of the platform that all assets produced by this recipe belong to",
"type": "string"
},
"aws_access_key_id": {
"title": "Aws Access Key Id",
"description": "AWS Access Key ID.",
"type": "string"
},
"aws_secret_access_key": {
"title": "Aws Secret Access Key",
"description": "AWS Secret Key.",
"type": "string",
"writeOnly": true,
"format": "password"
},
"include_table_item": {
"title": "Include Table Item",
"description": "[Advanced] The primary keys of items of a table in dynamodb format the user would like to include in schema. Refer \"Advanced Configurations\" section for more details",
"type": "object",
"additionalProperties": {
"type": "array",
"items": {
"type": "object"
}
}
},
"table_pattern": {
"title": "Table Pattern",
"description": "regex patterns for tables to filter in ingestion.",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"allOf": [
{
"$ref": "#/definitions/AllowDenyPattern"
}
]
}
},
"required": [
"aws_access_key_id",
"aws_secret_access_key"
],
"additionalProperties": false,
"definitions": {
"DynamicTypedStateProviderConfig": {
"title": "DynamicTypedStateProviderConfig",
"type": "object",
"properties": {
"type": {
"title": "Type",
"description": "The type of the state provider to use. For DataHub use `datahub`",
"type": "string"
},
"config": {
"title": "Config",
"description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19)."
}
},
"required": [
"type"
],
"additionalProperties": false
},
"StatefulStaleMetadataRemovalConfig": {
"title": "StatefulStaleMetadataRemovalConfig",
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"type": "object",
"properties": {
"enabled": {
"title": "Enabled",
"description": "The type of the ingestion state provider registered with datahub.",
"default": false,
"type": "boolean"
},
"remove_stale_metadata": {
"title": "Remove Stale Metadata",
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"default": true,
"type": "boolean"
}
},
"additionalProperties": false
},
"AllowDenyPattern": {
"title": "AllowDenyPattern",
"description": "A class to store allow deny regexes",
"type": "object",
"properties": {
"allow": {
"title": "Allow",
"description": "List of regex patterns to include in ingestion",
"default": [
".*"
],
"type": "array",
"items": {
"type": "string"
}
},
"deny": {
"title": "Deny",
"description": "List of regex patterns to exclude from ingestion.",
"default": [],
"type": "array",
"items": {
"type": "string"
}
},
"ignoreCase": {
"title": "Ignorecase",
"description": "Whether to ignore case sensitivity during pattern matching.",
"default": true,
"type": "boolean"
}
},
"additionalProperties": false
}
}
}
Limitations
For each region, the list tables operation returns a maximum of 100 tables. Pagination for listing tables has not been implemented yet and remains a future improvement.
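For reference, paginating the list tables operation could look roughly like the sketch below. It is not the connector's code. It assumes a boto3-style DynamoDB client whose `list_tables` call accepts `Limit` and `ExclusiveStartTableName` and returns `TableNames` plus, when more results remain, `LastEvaluatedTableName`; the helper name `iter_table_names` is hypothetical.

```python
from typing import Any, Dict, Iterator


def iter_table_names(client: Any, page_size: int = 100) -> Iterator[str]:
    """Yield every table name in the client's region, following pagination."""
    kwargs: Dict[str, Any] = {"Limit": page_size}
    while True:
        response = client.list_tables(**kwargs)
        yield from response.get("TableNames", [])
        last_name = response.get("LastEvaluatedTableName")
        if not last_name:
            # No continuation marker means this was the final page.
            break
        # Resume the listing right after the last table of the previous page.
        kwargs["ExclusiveStartTableName"] = last_name
```

The client is passed in as a parameter so the loop can be exercised with a stub object instead of real AWS credentials.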
Advanced Configurations
Using include_table_item config
If some items contain the most representative fields of a table, you can use the include_table_item option to provide a list of their primary keys in DynamoDB format. The items with those primary keys are included when the table is scanned.
Take the AWS DynamoDB Developer Guide's "Example tables and data" as an example: if you have a table Reply with the composite primary key Id and ReplyDateTime, you can use include_table_item to include 2 items as follows:
Example:
# put the table name and composite key in DynamoDB format
include_table_item:
  Reply:
    [
      {
        "ReplyDateTime": { "S": "2015-09-22T19:58:22.947Z" },
        "Id": { "S": "Amazon DynamoDB#DynamoDB Thread 1" },
      },
      {
        "ReplyDateTime": { "S": "2015-10-05T19:58:22.947Z" },
        "Id": { "S": "Amazon DynamoDB#DynamoDB Thread 2" },
      },
    ]
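If the primary keys are available as plain values, the DynamoDB-format entries shown in the example can be generated rather than written by hand. The helpers below are a hypothetical sketch, not part of DataHub or boto3, and handle only string (`S`) and number (`N`) key attributes; note that DynamoDB represents numbers as strings inside the `N` descriptor.

```python
from typing import Dict, Union


def to_attribute_value(value: Union[str, int, float]) -> Dict[str, str]:
    """Wrap a plain key value in its DynamoDB type descriptor."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python, but is not a valid key type.
        raise TypeError("boolean values cannot be used as key attributes")
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}
    raise TypeError(f"unsupported key type: {type(value).__name__}")


def build_primary_key(**key_attributes: Union[str, int, float]) -> Dict[str, Dict[str, str]]:
    """Build one include_table_item entry from keyword arguments."""
    return {name: to_attribute_value(value) for name, value in key_attributes.items()}


# Reproduces the two Reply entries from the example above.
reply_keys = [
    build_primary_key(
        ReplyDateTime="2015-09-22T19:58:22.947Z",
        Id="Amazon DynamoDB#DynamoDB Thread 1",
    ),
    build_primary_key(
        ReplyDateTime="2015-10-05T19:58:22.947Z",
        Id="Amazon DynamoDB#DynamoDB Thread 2",
    ),
]
```

The resulting list can then be serialized under the table name in the recipe's include_table_item option.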
Code Coordinates
- Class Name: datahub.ingestion.source.dynamodb.dynamodb.DynamoDBSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for DynamoDB, feel free to ping us on our Slack.