AWS Glue has soft limits on the number of table versions per table and the number of table versions per account. For details on these limits, refer to AWS Glue endpoints and quotas. The AWS Glue table versions cleanup utility helps you delete old versions of Glue tables. It is developed using the AWS Glue SDK for Java and is deployed as two AWS Lambda functions. For each table, it retains the X most recent versions and deletes the rest, which helps you keep both the per-table and the account-level limits under control. It can be scheduled using Amazon CloudWatch Events, e.g. once a month.
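
The retention rule described above (keep the X most recent versions, delete the rest) can be sketched in plain Java. This is a minimal illustration, not the utility's actual code: the class and method names are hypothetical, and it assumes that Glue version IDs are numeric strings in which a larger number means a newer version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; names are illustrative and not taken from the utility.
final class RetentionSketch {

    /**
     * Returns the version IDs that fall outside the retention window.
     * Assumption: IDs are numeric strings and a larger number is a newer version.
     */
    static List<String> versionsToDelete(List<String> versionIds, int versionsToRetain) {
        if (versionsToRetain < 0) {
            throw new IllegalArgumentException("versionsToRetain must not be negative");
        }
        List<String> newestFirst = new ArrayList<>(versionIds);
        // Compare numerically: a plain string sort would place "10" before "9".
        newestFirst.sort(Comparator.comparingLong((String id) -> Long.parseLong(id)).reversed());
        if (newestFirst.size() <= versionsToRetain) {
            return Collections.emptyList();
        }
        // Everything after the first versionsToRetain entries is old enough to delete.
        return new ArrayList<>(newestFirst.subList(versionsToRetain, newestFirst.size()));
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("7", "10", "8", "9", "6");
        System.out.println(versionsToDelete(ids, 2)); // prints [8, 7, 6]
    }
}
```

With `versionsToRetain` set to 2, versions 10 and 9 are kept and the remaining three are returned for deletion.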
This utility comes in two forms:
- Java - use main branch
- Python - use main-python branch
Note: This utility safely ignores databases and tables that are resource-linked from another AWS account into the account this utility is deployed in. In other words, it cleans up old versions of a table ONLY when the table belongs to the account in which the utility is deployed. Refer to How Resource Links Work in Lake Formation for more details.
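
As an illustration of the ownership rule above, the sketch below filters a list of tables down to those owned by the deploying account. The `TableRef` type and its `ownerAccountId` field are hypothetical stand-ins for whatever catalog metadata is available; how the utility itself detects resource links is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; TableRef is NOT the Glue SDK's Table class.
final class ResourceLinkFilterSketch {

    /** Simplified view of a catalog table, for illustration only. */
    static final class TableRef {
        final String databaseName;
        final String tableName;
        final String ownerAccountId; // account whose catalog actually holds the table

        TableRef(String databaseName, String tableName, String ownerAccountId) {
            this.databaseName = databaseName;
            this.tableName = tableName;
            this.ownerAccountId = ownerAccountId;
        }
    }

    /** Keeps only the tables owned by the given account; resource-linked tables are skipped. */
    static List<TableRef> ownedBy(String accountId, List<TableRef> tables) {
        List<TableRef> owned = new ArrayList<>();
        for (TableRef table : tables) {
            if (accountId.equals(table.ownerAccountId)) {
                owned.add(table);
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        List<TableRef> tables = Arrays.asList(
                new TableRef("sales", "orders", "111111111111"),
                new TableRef("shared", "customers_link", "222222222222"));
        for (TableRef table : ownedBy("111111111111", tables)) {
            System.out.println(table.databaseName + "." + table.tableName); // prints sales.orders
        }
    }
}
```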
The architecture of this utility is shown in the diagram below.

The following are required to build and deploy this utility:
- JDK 8
- An IDE, e.g. Eclipse, Spring Tool Suite (STS), or IntelliJ IDEA
- Apache Maven
- Access to AWS account
- AWS CLI
The following AWS services are required to deploy this utility:
- 2 AWS Lambda functions
- 2 IAM roles
- 1 Amazon SQS Queue
- 2 Amazon DynamoDB tables
- 1 Amazon CloudWatch Events rule
- 1 Amazon S3 bucket to upload AWS Lambda Function binary
Class | Overview |
---|---|
TableVersionsCleanupPlannerLambda | Lambda Function gets a list of tables for all databases and initiates the cleanup process. |
TableVersionsCleanupLambda | Lambda Function deletes old versions of a table. |
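
When a table has many old versions, it is natural to send the deletions in chunks. The sketch below splits a list of version IDs into fixed-size batches. The batch size of 100 is an assumption based on the documented per-request maximum for Glue's BatchDeleteTableVersion API and should be verified against the current API reference. The class and method names are hypothetical, and whether the utility batches in exactly this way is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not taken from the utility's source.
final class DeleteBatchSketch {

    // Assumption: at most 100 version IDs are accepted per delete request.
    static final int MAX_IDS_PER_REQUEST = 100;

    /** Splits version IDs into consecutive batches of at most batchSize entries. */
    static List<List<String>> partition(List<String> versionIds, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < versionIds.size(); start += batchSize) {
            int end = Math.min(start + batchSize, versionIds.size());
            // Copy the sublist so each batch is independent of the input list.
            batches.add(new ArrayList<>(versionIds.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            ids.add(String.valueOf(i));
        }
        // Each batch would become one delete request against the Glue API.
        for (List<String> batch : partition(ids, MAX_IDS_PER_REQUEST)) {
            System.out.println("batch of " + batch.size());
        }
    }
}
```

For 250 version IDs this yields three batches of 100, 100, and 50 entries.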
- Clone this code repo to your laptop / MacBook
- This project is a Maven project, so you can import it into your IDE.
- Build a jar file using one of the options below:
  - Using standalone Maven, go to the project home directory and run the command `mvn -X clean install`
  - From Eclipse or STS, run the command `-X clean install`. Navigation: right-click the project --> Run As --> Maven Build (Option 4)
- This will generate a jar file named `glue-tableversions-cleanup-0.1.jar`
  - Note: The size of the jar file is around 16 MB
- Log on to the AWS console, select S3, and select the bucket you want to use. If you do not have a bucket already, create one.
- Create a folder named `table_version_cleanup_lambda_jar`
- Open a command prompt on your laptop / MacBook
- Upload the Lambda function jar file to the S3 bucket: `aws s3 cp glue-tableversions-cleanup-0.1.jar s3://<bucket_name>/table_version_cleanup_lambda_jar/`
- Create an Amazon SQS queue with the details below:
  - name = `table_versions_cleanup_planner_queue.fifo`
  - Type = FIFO
  - Configuration:
    - Visibility timeout = 15 minutes
    - Message retention period = 4 days
    - Delivery delay = 0 seconds
    - Content-based deduplication = enabled
- Create DynamoDB tables

  Table | Schema | Capacity |
  ---|---|---|
  glue_table_version_cleanup_planner | Primary partition key - execution_batch_id (Number), Primary sort key - database_name_table_name (String) | Provisioned read capacity units = 5, Provisioned write capacity units = 10 |
  glue_table_version_cleanup_statistics | Primary partition key - execution_id (Number), Primary sort key - execution_batch_id (Number) | Provisioned read capacity units = 5, Provisioned write capacity units = 10 |

- Create IAM policies that are common to both Lambda functions
  - Amazon DynamoDB policy
    - name = `table_versions_cleanup_lambda_dynamodb_policy`
    - sample policy = table_versions_cleanup_lambda_dynamodb_policy
  - Amazon CloudWatch policy
    - name = `table_versions_cleanup_lambda_cloudwatch_policy`
    - sample policy = table_versions_cleanup_cloudwatch_logs_policy
- Create IAM policies for TableVersionsCleanupPlannerLambdaExecRole
  - AWS Glue policy
    - name = `table_versions_cleanup_planner_lambda_glue_policy`
    - sample policy = table_versions_cleanup_planner_lambda_glue_policy
  - Amazon SQS policy
    - name = `table_versions_cleanup_planner_lambda_sqs_policy`
    - sample policy = table_versions_cleanup_planner_lambda_sqs_policy
- Create IAM policies for TableVersionsCleanupLambdaExecRole
  - AWS Glue policy
    - name = `table_versions_cleanup_lambda_glue_policy`
    - sample policy = table_versions_cleanup_lambda_glue_policy
  - Amazon SQS policy
    - name = `table_versions_cleanup_lambda_sqs_policy`
    - sample policy = table_versions_cleanup_lambda_sqs_policy
- Create an IAM role named `TableVersionsCleanupPlannerLambdaExecRole` and attach the policies below:
  - table_versions_cleanup_lambda_dynamodb_policy
  - table_versions_cleanup_lambda_cloudwatch_policy
  - table_versions_cleanup_planner_lambda_sqs_policy
  - table_versions_cleanup_planner_lambda_glue_policy
- Create an IAM role named `TableVersionsCleanupLambdaExecRole` and attach the policies below:
  - table_versions_cleanup_lambda_sqs_policy
  - table_versions_cleanup_lambda_glue_policy
  - table_versions_cleanup_lambda_dynamodb_policy
  - table_versions_cleanup_lambda_cloudwatch_policy
- Deploy the TableVersionsCleanupPlannerLambda function
  - Runtime = Java 8
  - IAM Execution role = `TableVersionsCleanupPlannerLambdaExecRole`
  - Function package = `s3://<bucket_name>/table_version_cleanup_lambda_jar/glue-tableversions-cleanup-0.1.jar`
  - Lambda Handler = `software.aws.glue.tableversions.lambda.TableVersionsCleanupPlannerLambda`
  - Timeout = e.g. 15 minutes
  - Memory = e.g. 128 MB
  - Environment variables = as defined in the following table

    Variable Name | E.g. Value | Description |
    ---|---|---|
    database_names_string_literal | database_1$database_2$database_3 | Database names joined into a single string literal by a separator token |
    separator | $ | The separator used in database_names_string_literal |
    region | us-east-1 | AWS region used |
    sqs_queue_url | https://sqs.us-east-1.amazonaws.com/<AccountId>/table_versions_cleanup_planner_queue.fifo | URL of the SQS queue used |
    ddb_table_name | glue_table_version_cleanup_planner | DynamoDB table used |
    hash_key | execution_batch_id | Primary partition key used |
    range_key | database_name_table_name | Primary sort key used |
- Deploy the TableVersionsCleanupLambda function
  - Runtime = Java 8
  - IAM Execution role = `TableVersionsCleanupLambdaExecRole`
  - Function package = `s3://<bucket_name>/table_version_cleanup_lambda_jar/glue-tableversions-cleanup-0.1.jar`
  - Lambda Handler = `software.aws.glue.tableversions.lambda.TableVersionsCleanupLambda`
  - Timeout = e.g. 15 minutes
  - Memory = e.g. 192 MB
  - Environment variables = as defined in the following table

    Variable Name | E.g. Value | Description |
    ---|---|---|
    region | us-east-1 | AWS region used |
    number_of_versions_to_retain | 100 | Number of old versions to retain per table |
    ddb_table_name | glue_table_version_cleanup_statistics | DynamoDB table used |
    hash_key | execution_id | Primary partition key used |
    range_key | execution_batch_id | Primary sort key used |

  - Add an SQS trigger and select `table_versions_cleanup_planner_queue.fifo`
- Create a CloudWatch Events rule and add TableVersionsCleanupPlannerLambda as its target. Refer to the Amazon CloudWatch Events documentation for more details.
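
The environment variables listed in the deployment steps have to be read and validated when a function starts. The sketch below shows one way to do that in plain Java, assuming the values arrive as a `Map<String, String>` such as the one returned by `System.getenv()`. The class and method names are hypothetical and not taken from the utility. Note that `String.split` takes a regular expression, so a separator such as `$` has to be quoted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical configuration helper; illustrative only.
final class LambdaConfigSketch {

    /** Splits database_names_string_literal on the configured separator. */
    static List<String> databaseNames(Map<String, String> env) {
        String literal = env.get("database_names_string_literal");
        String separator = env.get("separator");
        if (literal == null || separator == null || separator.isEmpty()) {
            throw new IllegalArgumentException("database_names_string_literal and separator are required");
        }
        List<String> names = new ArrayList<>();
        // Pattern.quote treats the separator literally; an unquoted "$" would only match end of input.
        for (String part : literal.split(Pattern.quote(separator))) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /** Reads number_of_versions_to_retain and rejects missing or negative values. */
    static int versionsToRetain(Map<String, String> env) {
        String raw = env.get("number_of_versions_to_retain");
        if (raw == null) {
            throw new IllegalArgumentException("number_of_versions_to_retain is required");
        }
        int value = Integer.parseInt(raw.trim());
        if (value < 0) {
            throw new IllegalArgumentException("number_of_versions_to_retain must not be negative");
        }
        return value;
    }

    public static void main(String[] args) {
        // Inside a Lambda function this map would come from System.getenv().
        Map<String, String> env = new HashMap<>();
        env.put("database_names_string_literal", "database_1$database_2$database_3");
        env.put("separator", "$");
        env.put("number_of_versions_to_retain", "100");
        System.out.println(databaseNames(env)); // prints [database_1, database_2, database_3]
        System.out.println(versionsToRetain(env)); // prints 100
    }
}
```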
Contributors:
- Ravi Itha, Senior Big Data Consultant, Amazon Web Services, Inc.
- Phanee Gottumukkala, Associate Cloud Developer, Amazon Web Services, Inc.
- Julia Kroll, Data & ML Engineer, Amazon Web Services, Inc.
This sample code is made available under the MIT-0 license. See the LICENSE file.