If you have large files to upload to S3, the new MultiPart feature of S3 is great. In the past, the only choice was to upload the entire file in one operation. If that operation failed on the very last byte of the file, you would have to start the whole process again. Now you can split the file up into as many chunks as you want and upload each chunk individually. If you have a failure with one chunk, it won’t affect the others. When all of the chunks have been uploaded, you complete the operation and S3 then combines all of the chunks back into a single file. Because each chunk upload is a completely separate operation, it’s also possible (and desirable) to use threads or processes to get several upload operations going concurrently to increase the overall throughput.
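To make that flow concrete, here is a minimal sketch of the three steps (initiate the upload, send each part, complete the upload) using boto's S3 multipart API. The bucket name, key name, and 50 MB part size are placeholders rather than values from this post, and the code is a sequential illustration written against the Python 2 / boto API, not a drop-in tool:

import os
from cStringIO import StringIO

import boto

CHUNK_SIZE = 50 * 1024 * 1024   # assumed part size; S3 parts must be at least 5 MB

conn = boto.connect_s3()
bucket = conn.get_bucket('my_bucket')        # placeholder bucket name

# Step 1: start the MultiPart upload.
mp = bucket.initiate_multipart_upload('mybigfile')

# Step 2: upload the file one chunk at a time. Each part is a
# separate request, so a failure only costs you that one part.
with open('mybigfile', 'rb') as fp:
    part_num = 0
    while True:
        data = fp.read(CHUNK_SIZE)
        if not data:
            break
        part_num += 1
        mp.upload_part_from_file(StringIO(data), part_num)

# Step 3: complete the upload; S3 reassembles the parts into one object.
mp.complete_upload()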
The easiest way to take advantage of this is to use the s3multiput utility that is installed as part of boto. This command-line utility handles all of the details for you. You just tell it where the input file is located and tell it what S3 bucket you want to store the file in, and it creates a set of subprocesses (based on the number of CPU cores you have) and gives each subprocess a list of chunks to upload. In the example below, the -c option tells the command to also print out some information that shows us the progress of each of the subprocesses:
$ ls -l
...
-rw-r--r--  1 mitch  staff  265060652 Jun 21 08:45 mybigfile
...
$ s3multiput -b my_bucket -c 20 mybigfile
0 bytes transferred / 265060652 bytes total
0 bytes transferred / 265060652 bytes total
0 bytes transferred / 265060652 bytes total
0 bytes transferred / 265060652 bytes total
...
265060652 bytes transferred / 265060652 bytes total
265060652 bytes transferred / 265060652 bytes total
265060652 bytes transferred / 265060652 bytes total
265060652 bytes transferred / 265060652 bytes total
$
If you want to write your own custom code to handle MultiPart uploads, the source code for s3multiput is a good starting point.
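As a rough sketch of the same parallel idea, the snippet below uses threads rather than the subprocesses that s3multiput uses, with each worker looking up the in-progress upload by its id and sending its assigned part. The bucket name, key name, and part size are again placeholders, and a real implementation would add error handling and retries:

import os
import threading
from cStringIO import StringIO

import boto

FILE_NAME = 'mybigfile'            # placeholder file and key name
BUCKET_NAME = 'my_bucket'          # placeholder bucket name
CHUNK_SIZE = 50 * 1024 * 1024      # assumed part size; must be at least 5 MB

def upload_part(upload_id, part_num, offset, size):
    # Each thread opens its own connection; boto connections should
    # not be shared across threads.
    bucket = boto.connect_s3().get_bucket(BUCKET_NAME)
    for mp in bucket.get_all_multipart_uploads():
        if mp.id == upload_id:
            fp = open(FILE_NAME, 'rb')
            fp.seek(offset)
            mp.upload_part_from_file(StringIO(fp.read(size)), part_num)
            fp.close()
            break

bucket = boto.connect_s3().get_bucket(BUCKET_NAME)
mp = bucket.initiate_multipart_upload(FILE_NAME)
file_size = os.stat(FILE_NAME).st_size

# Hand each part to its own thread; every part upload is an
# independent request, so they can all run concurrently.
threads = []
part_num, offset = 0, 0
while offset < file_size:
    part_num += 1
    size = min(CHUNK_SIZE, file_size - offset)
    t = threading.Thread(target=upload_part, args=(mp.id, part_num, offset, size))
    t.start()
    threads.append(t)
    offset += size

for t in threads:
    t.join()

# Once every part has been uploaded, completing the upload tells S3
# to stitch the parts back into a single object.
mp.complete_upload()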