-
This thread provides the exact details: #520

Based on that discussion, I did some verification with a CSV file of equivalent size (~1 GB); it took approximately 5 minutes. A .gz file of ~750 MB took around 21 minutes.

Why would smart_open treat a .gz file differently when the request is to read/write it as binary? Is smart_open trying to gunzip the file and re-stream it as gzip to the destination?

Any response to help move forward is greatly appreciated. Thank you!
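If smart_open is indeed transparently (de)compressing based on the .gz extension, that would account for the gap, and it can be turned off so the bytes pass through untouched. A minimal sketch, assuming smart_open >= 5.x (where the `compression` keyword exists; older releases used `ignore_ext=True`) and a placeholder URI:

```python
from smart_open import open as sopen

# By default smart_open infers compression from the extension, so a ".gz"
# URI is gunzipped on read and re-gzipped on write. Passing
# compression="disable" keeps the stream as raw bytes.
# (Keyword per smart_open >= 5.x; the URI below is a placeholder.)
with sopen("s3://my-bucket/data/file.gz", "rb", compression="disable") as fin:
    header = fin.read(2)  # b"\x1f\x8b" -- still compressed gzip bytes
```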
-
Prior to smart_open, copying a large .gz file (~850 MB) from a us-east bucket to a us-west bucket meant downloading to a local EC2 instance from us-east, then uploading the disk file to the us-west bucket: approximately 3 minutes.

Implementing smart_open with the following chunk size:
```python
chunk_size = 64 * 1024 * 1024  # 64 MiB
```
the stream copy from us-east to us-west takes upwards of 16 minutes. Is there any way to improve the stream-copy performance? (A sketch of the copy in question follows below.)
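For concreteness, a minimal sketch of the stream copy being described, assuming smart_open >= 5.x; bucket names and keys are placeholders. `compression="disable"` avoids a needless gunzip/re-gzip of the .gz payload, and `buffer_size` / `min_part_size` are smart_open's S3 transport parameters for read buffering and multipart-upload part size, the usual knobs for throughput:

```python
from smart_open import open as sopen

chunk_size = 64 * 1024 * 1024  # 64 MiB per read, as above

# Placeholder URIs -- substitute real buckets/keys.
src = "s3://my-bucket-us-east/path/file.gz"
dst = "s3://my-bucket-us-west/path/file.gz"

# Larger buffers mean fewer round trips per GB transferred.
params = {"buffer_size": chunk_size, "min_part_size": chunk_size}

with sopen(src, "rb", compression="disable", transport_params=params) as fin, \
     sopen(dst, "wb", compression="disable", transport_params=params) as fout:
    while True:
        chunk = fin.read(chunk_size)
        if not chunk:
            break
        fout.write(chunk)
```

Note that a single-threaded byte loop like this routes every byte through the client, so it is unlikely to match the download-then-upload baseline, let alone a server-side copy; boto3's managed `copy()` (which uses multipart UploadPartCopy) keeps the data inside AWS and may be the faster option for a plain bucket-to-bucket copy.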
Thank you!