Trino fault-tolerant execution is available starting with Amazon EMR 6.9.0. You can use local HDFS or Amazon S3 buckets to persist the intermediate exchange data generated by Trino queries. The following provides sample configurations and best practices based on the storage option you select.
To use Amazon S3 buckets, set the property exchange.base-directories in the trino-exchange-manager classification, as shown in the sample configuration below.
When running many concurrent queries, it is a good idea to define multiple buckets to avoid an excessive number of requests against a single bucket, which can lead to S3 throttling. Trino already retries throttled requests, but using multiple buckets can still improve performance in highly concurrent setups.
You can define multiple buckets as a comma-separated list, for example: s3://BUCKET_1,s3://BUCKET_2
Although there are no restrictions on the bucket names, it can be convenient to define an internal naming convention for these buckets. For example, you can create the buckets using a convention such as ACCOUNT.REGION.trino-exchange-spooling-1, ACCOUNT.REGION.trino-exchange-spooling-2, and so on.
Finally, data belonging to completed Trino queries is automatically deleted, but data of queries that failed for any reason is not. In this case, it is useful to configure the bucket with lifecycle rules that automatically expire and delete the data left behind by such queries. For additional details, see Managing your storage lifecycle in the Amazon S3 documentation.
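As a minimal sketch, a lifecycle configuration along the following lines, applied to each spooling bucket with the S3 put-bucket-lifecycle-configuration API, expires all objects after a retention window; the rule ID and the seven-day window here are illustrative assumptions, not prescribed values, and should be tuned to the longest time a failed query's data may need to be retained for troubleshooting.

{
  "Rules": [
    {
      "ID": "expire-trino-exchange-data",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 7 },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}

The following sample classifications enable task-retry fault-tolerant execution with Amazon S3 spooling: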
[
  {
    "Classification": "trino-config",
    "Properties": {
      "fault-tolerant-execution-target-task-input-size": "4GB",
      "fault-tolerant-execution-target-task-split-count": "64",
      "fault-tolerant-execution-task-memory": "5GB",
      "graceful-shutdown-timeout": "300s",
      "query.low-memory-killer.delay": "0s",
      "query.remote-task.max-error-duration": "1m",
      "retry-policy": "TASK"
    }
  },
  {
    "Classification": "trino-exchange-manager",
    "Properties": {
      "exchange.base-directories": "s3://BUCKET_NAME.trino-exchange-spooling-1,s3://BUCKET_NAME.trino-exchange-spooling-2"
    }
  }
]
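You can supply these classifications when you launch the cluster, for example by saving them to a JSON file and passing it to the --configurations option of the aws emr create-cluster command.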
To use local HDFS instead, enable the property exchange.use-local-hdfs in the trino-exchange-manager classification. You can additionally override the default HDFS path used to store exchange data with the property exchange.base-directories.
As with Amazon S3, you can define multiple comma-separated paths, but note that this does not improve performance.
In this case, it is recommended to set a minimum HDFS replication factor of two for clusters with fewer than four core nodes, so that exchange data remains available if a core node is lost.
Also, when using HDFS it can be useful to increase the dfs.datanode.handler.count and dfs.datanode.max.transfer.threads parameters, depending on the instance type used.
The following is an example configuration:
[
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.datanode.handler.count": "64",
      "dfs.datanode.max.transfer.threads": "8192",
      "dfs.namenode.handler.count": "64",
      "dfs.replication": "2"
    }
  },
  {
    "Classification": "trino-config",
    "Properties": {
      "fault-tolerant-execution-target-task-input-size": "4GB",
      "fault-tolerant-execution-target-task-split-count": "64",
      "fault-tolerant-execution-task-memory": "5GB",
      "graceful-shutdown-timeout": "300s",
      "query.low-memory-killer.delay": "0s",
      "query.remote-task.max-error-duration": "1m",
      "retry-policy": "TASK"
    }
  },
  {
    "Classification": "trino-exchange-manager",
    "Properties": {
      "exchange.base-directories": "/trino_exchange",
      "exchange.use-local-hdfs": "true"
    }
  }
]