If you have large files to upload to S3, the new MultiPart feature of S3 is great. In the past, the only choice was to upload the entire file in one operation. If that operation failed on the very last byte of the file, you would have to start the whole process again. Now you can split the file up into as many chunks as you want and upload each chunk individually. If you have a failure with one chunk, it won’t affect the others. When all of the chunks have been uploaded, you complete the operation and S3 then combines all of the chunks back into a single file. Because each chunk upload is a completely separate operation, it’s also possible (and desirable) to use threads or processes to get several upload operations going concurrently to increase the overall throughput.
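To make that flow concrete, here is a minimal sketch of the three steps (initiate the upload, send each part, complete the upload) using boto's S3 multipart API. The bucket name, key name, and 50 MB part size are placeholders rather than values from this post, and the code is a sequential illustration written against the Python 2 / boto API, not a drop-in tool:

import os
from cStringIO import StringIO

import boto

CHUNK_SIZE = 50 * 1024 * 1024   # assumed part size; S3 parts must be at least 5 MB

conn = boto.connect_s3()
bucket = conn.get_bucket('my_bucket')        # placeholder bucket name

# Step 1: start the MultiPart upload.
mp = bucket.initiate_multipart_upload('mybigfile')

# Step 2: upload the file one chunk at a time. Each part is a
# separate request, so a failure only costs you that one part.
with open('mybigfile', 'rb') as fp:
    part_num = 0
    while True:
        data = fp.read(CHUNK_SIZE)
        if not data:
            break
        part_num += 1
        mp.upload_part_from_file(StringIO(data), part_num)

# Step 3: complete the upload; S3 reassembles the parts into one object.
mp.complete_upload()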
The easiest way to take advantage of this is to use the s3multiput utility that is installed as part of boto. This command-line utility handles all of the details for you. You just tell it where the input file is located and tell it what S3 bucket you want to store the file in, and it creates a set of subprocesses (based on the number of CPU cores you have) and gives each subprocess a list of chunks to upload. In the example below, the -c option tells the command to also print out some information that shows us the progress of each of the subprocesses:
$ ls -l
...
-rw-r--r--  1 mitch  staff  265060652 Jun 21 08:45 mybigfile
...
$ s3multiput -b my_bucket -c 20 mybigfile
0 bytes transferred / 265060652 bytes total
0 bytes transferred / 265060652 bytes total
0 bytes transferred / 265060652 bytes total
0 bytes transferred / 265060652 bytes total
...
265060652 bytes transferred / 265060652 bytes total
265060652 bytes transferred / 265060652 bytes total
265060652 bytes transferred / 265060652 bytes total
265060652 bytes transferred / 265060652 bytes total
$
If you want to write your own custom code to handle MultiPart uploads, the source code for s3multiput is a good starting point.
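As a rough sketch of the same parallel idea, the snippet below uses threads rather than the subprocesses that s3multiput uses, with each worker looking up the in-progress upload by its id and sending its assigned part. The bucket name, key name, and part size are again placeholders, and a real implementation would add error handling and retries:

import os
import threading
from cStringIO import StringIO

import boto

FILE_NAME = 'mybigfile'            # placeholder file and key name
BUCKET_NAME = 'my_bucket'          # placeholder bucket name
CHUNK_SIZE = 50 * 1024 * 1024      # assumed part size; must be at least 5 MB

def upload_part(upload_id, part_num, offset, size):
    # Each thread opens its own connection; boto connections should
    # not be shared across threads.
    bucket = boto.connect_s3().get_bucket(BUCKET_NAME)
    for mp in bucket.get_all_multipart_uploads():
        if mp.id == upload_id:
            fp = open(FILE_NAME, 'rb')
            fp.seek(offset)
            mp.upload_part_from_file(StringIO(fp.read(size)), part_num)
            fp.close()
            break

bucket = boto.connect_s3().get_bucket(BUCKET_NAME)
mp = bucket.initiate_multipart_upload(FILE_NAME)
file_size = os.stat(FILE_NAME).st_size

# Hand each part to its own thread; every part upload is an
# independent request, so they can all run concurrently.
threads = []
part_num, offset = 0, 0
while offset < file_size:
    part_num += 1
    size = min(CHUNK_SIZE, file_size - offset)
    t = threading.Thread(target=upload_part, args=(mp.id, part_num, offset, size))
    t.start()
    threads.append(t)
    offset += size

for t in threads:
    t.join()

# Once every part has been uploaded, completing the upload tells S3
# to stitch the parts back into a single object.
mp.complete_upload()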