S3 objects are resources that store data. They are somewhat similar to the files in a standard computer system, but there are a number of important differences, which were summarized in “S3 Architecture.”
An object can contain up to 5 GB of data, or it can be entirely empty. An object can store two types of information: data and metadata. The data stored by an object is its main content, such as a photo or text document. In addition to the data content, an object can store metadata that provides further information about the object, such as when it was created and the type of data it contains. You can store your own metadata information when you create or replace an object.
Each object resource in S3 can have access control permissions applied to it, allowing you to keep the object private, or to make it available to other S3 users or the general public.
Each object in S3 is identified by a name, known as its key, which must be unique within the bucket that contains it. Object keys must not be longer than 1,024 bytes when encoded as UTF-8, and they can contain almost any characters, including spaces and punctuation. Objects are similar to files, so it makes sense to use obvious names for your objects, as you would for a file, such as My Birthday Cake.jpg.
One major difference between the S3 storage model and the average computer file system is that S3 has no notion of a hierarchical folder or directory structure. S3 buckets contain objects—that is the beginning and end of the hierarchy imposed by the storage model. If you wish to impose a hierarchical structure for your objects in S3 to help organize and search them, you must construct this hierarchy yourself using the flexible naming capabilities of object keys. You can do this by choosing a special character or string to mark the boundaries between components of a hierarchical path and by storing your objects with key names that describe their full path in the hierarchy.
Because objects can be accessed using URIs, as if S3 was a standard web server, the most obvious character to use for delimiting the components of a hierarchical path is a forward slash (/). If you use slash characters in your hierarchical object keys, the resulting paths will look like the URIs everyone is familiar with. For example, suppose you want to store your photo collection in a bucket called “pictures,” and you will use object keys to simulate a directory hierarchy. You could store your pictures with keys like 2007/March/My Birthday Cake.jpg. Not only would this make it possible to search for specific objects in the hierarchy using the S3 functionality we will discuss in “Listing Objects” later in this chapter, but if you made your pictures publicly accessible, they would be available at a sensible URL, such as http://s3.amazonaws.com/pictures/2007/March/My Birthday Cake.jpg.
S3 allows a broad range of characters to be used in object key names, including all Unicode characters between U+0001 and U+10FFFF. However, this range of legal key name characters includes some unprintable characters that cannot be properly represented in the XML documents S3 returns when it lists the object keys in a bucket. It is therefore possible to create objects with key names that cannot be parsed from object listings using standard XML parsing tools. You should avoid using such problematic object names by ensuring that object keys only include characters that can be represented in XML documents.
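If you generate key names programmatically, a guard along the following lines can reject the problematic names before an upload is attempted. This is a minimal sketch: xml_safe_key? is our own helper, not part of this chapter's S3.rb library, and the check is a simplified version of the XML 1.0 character rules.

# Sketch: accept only key names whose characters can appear in the
# XML listing documents S3 returns. XML 1.0 allows tab, newline,
# carriage return, and (with a few rare exceptions) U+0020 upward.
def xml_safe_key?(key)
  key.unpack('U*').all? do |codepoint|
    codepoint == 0x9 || codepoint == 0xA || codepoint == 0xD ||
      codepoint >= 0x20
  end
end

xml_safe_key?("2007/March/My Birthday Cake.jpg")  # => true
xml_safe_key?("report\x01.txt")                   # => false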
An object may have metadata associated with it to further describe properties of the object. This metadata is made up of short text strings comprising a name and one or more values. Both the name and values used in metadata items must conform to UTF-8 encoding.
The S3 service can provide two kinds of object metadata:
System metadata is information generated or used by the S3 service itself. This metadata is made available to you as read-only information. Metadata items that may be available include the request ID headers x-amz-request-id and x-amz-id-2, which assign a unique identifier to each operation performed by S3. This information can be useful to Amazon staff should they need to troubleshoot problems you are experiencing with the service.
User metadata is information you provide yourself: you can store up to 2 KB of your own metadata with each of your objects. This custom information is not interpreted by the S3 service; it is merely stored by it and returned when the object is retrieved.
The REST interface of the S3 service supports metadata as an extension to the standard HTTP header mechanism. Metadata is uploaded to the service as HTTP request headers and is retrieved from the service as response headers. This overlap between metadata as an S3 construct and as standard HTTP headers has interesting consequences. On one hand, you must be careful to avoid accidentally using metadata names that clash with HTTP headers. The REST interface makes this possible by recognizing a special metadata name prefix, x-amz-meta-, which indicates that a header contains metadata.
On the other hand, you can deliberately store a range of HTTP headers as metadata with your object so these headers can be returned with the HTTP response when the object is retrieved. By uploading metadata items without the x-amz-meta- prefix, you can store certain HTTP response header values as metadata with your object and control how clients, such as web browsers, behave when they download your objects. The most common example of using standard HTTP headers as metadata is the Content-Type header. You can specify the content type of your objects using this metadata item, so web browsers can recognize the type of the object you have stored.
S3 does not allow you to set arbitrary metadata items to be returned as HTTP headers; only some header names are recognized as legal HTTP headers. Any header with a name the service does not recognize is discarded. HTTP header names the service does recognize and store include: Content-Type, Content-Language, Expires, Cache-Control, Content-Disposition, and Content-Encoding.
Objects stored in S3 are immutable. An object’s key, data, and metadata information cannot be altered after the object is created. For example, you cannot change an object’s key to reflect a filename change, nor can you include new or changed information in its metadata. Most importantly, you cannot change the data content of an existing object, or add new data to it, without overwriting that object.
When you need to change an existing object, you must re-create the object from scratch. If you are storing your novel in S3, and you find and correct a single spelling mistake, you will have to upload the entire file again to save the corrected version. The same holds true if you are creating an object that contains a lot of data and the upload fails midway through; there is no way to resume the upload, and you will have to start again from scratch.
This feature of the S3 service has a good and a bad side. It is good because it allows S3 to manage objects efficiently behind the scenes, greatly simplifying Amazon’s service architecture and ensuring they can make the service available at a low cost. It can be bad because S3 developers sometimes have to make their applications more complicated to add more intelligence to the simple data-storage model. In the trade-off between making the S3 service work well and cheaply, and making developers’ lives easier, Amazon has elected to do the former. At least it will help keep us all in work.
The immutability of S3 objects has far-reaching implications for how you should design your S3-based applications. Unless your application uses data that very rarely changes, or data that changes so drastically that replacing whole objects at a time is a reasonable option, you will have to carefully consider how your application will handle data updates. The following few paragraphs summarize some of the approaches you could take in an application that must update data objects stored in S3.
If you wish to keep your application simple, and you are not dealing with large data items, you may choose to live with the overhead of re-creating objects whenever you need to reflect local data changes. This approach will require you to upload a new object whenever an object’s key name, data, or metadata content changes. The chief advantage of this approach, besides being simple, is that you can retain a direct relationship between an object key in S3 and the actual object data. This relationship is important if you intend to use S3 as a standard web server, where your key names act as user-friendly URLs.
If your object data does not change much, but you frequently need to update the names of objects—such as to reflect hierarchical changes or merely to present different object names to different users—you may wish to implement a remapping tool that converts object names to the real S3 object key names and vice versa.
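As a rough illustration of this remapping idea, here is a minimal sketch; the KeyMap class is our own hypothetical helper, not part of this chapter's S3.rb library, and a real tool would persist the map in a local database rather than in memory.

# Sketch: a two-way dictionary between user-visible object names and
# the permanent key names actually used in S3. Renaming an object
# only updates the local map; no S3 upload is required.
class KeyMap
  def initialize
    @name_to_key = {}
    @key_to_name = {}
  end

  # Register a new object under its user-visible name
  def add(name, s3_key)
    @name_to_key[name] = s3_key
    @key_to_name[s3_key] = name
  end

  # Renaming only touches the local map; the S3 object is unchanged
  def rename(old_name, new_name)
    add(new_name, @name_to_key.delete(old_name))
  end

  def key_for(name)
    @name_to_key[name]
  end
end

map = KeyMap.new
map.add('Holiday Photos/beach.jpg', 'obj-000001')
map.rename('Holiday Photos/beach.jpg', 'Travel/beach.jpg')
map.key_for('Travel/beach.jpg')   # => "obj-000001"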
If your application uses large amounts of data that change often, you may have no choice but to completely restructure the way your data are stored in S3. This approach could involve splitting large data files into smaller pieces, such that each piece can be stored as a separate object in S3. Then when your local data files change, you need only replace those pieces that are affected or add new pieces, rather than replacing the whole data set.
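To make the idea concrete, here is a rough sketch that splits a local file into fixed-size pieces and stores each piece as its own object, using the create_object method defined later in Example 3-8. The chunk size and key-naming scheme are our own illustrative choices.

CHUNK_SIZE = 1024 * 1024  # 1 MB pieces; choose a size to suit your data

# Sketch: store a local file as numbered chunk objects in S3. Each
# chunk becomes an object named like 'novel.txt.chunk-0003', so a
# later edit to the file requires re-uploading only the changed chunks.
def store_in_chunks(s3, bucket_name, file_path)
  chunk_keys = []
  File.open(file_path, 'rb') do |io|
    index = 0
    while chunk = io.read(CHUNK_SIZE)
      key = "#{File.basename(file_path)}.chunk-#{'%04d' % index}"
      s3.create_object(bucket_name, key, :data => chunk)
      chunk_keys << key
      index += 1
    end
  end
  chunk_keys  # keep these (e.g., in a local database) for reassembly
end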
An application capable of restructuring data files in this way will clearly be quite complicated. It will require a lot of logic to manage the decomposition and recomposition of data files, and to maintain a database of mappings defining the relationship between objects in S3 and your local data.
In the section “ElasticDrive: S3 As a Virtual Block Device” in Chapter 4, we discuss a tool that allows you to create a standard filesystem on top of your S3 storage space. This tool stores the raw data blocks that represent a file system in S3 rather than the individual files. This is an extreme example of restructuring data to store it in S3.
To create an object in S3, you send a PUT request containing the object’s data and metadata to the service, with a URI specifying the key name of the object and the bucket it will be stored in. The object’s data content is provided as the body of the request, while the metadata is provided as request headers. When the object is successfully stored, S3 will return an HTTP 200 response message.
Most web browsers cannot perform PUT requests. To create objects using a web browser, you must use the alternative POST request method discussed in “Create Objects from a Web Browser Using POST.”
Remember that objects are immutable in S3 and can only be created, not updated. If you create an object using a key name that is already present in S3, the existing object will be replaced.
When an object with some data content is created or updated, an HTTP Content-Length header should be included with the request to inform S3 of the number of bytes it is expected to receive and store. In addition to this header, a range of extra information may be provided when you create an object.
If you include the Content-MD5 header with a Base64-encoded MD5 hash of your data, S3 will perform a data verification check to ensure the data it received exactly matches the data you sent. Any discrepancy will cause the request to fail and prevent you from accidentally storing incorrect data in the service. We highly recommend taking advantage of this feature, and we have included it in our example implementation code. Alternatively, you can perform your own verification check using the ETag response header returned by S3, because this header contains a hex-encoded MD5 hash value of the data that the service received.
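The ETag check might look something like the following sketch. Here, response stands for the object returned by the do_rest method for the PUT request (the same response object our other methods receive), and the data is the Hello World string used in the examples below.

require 'digest/md5'

# Sketch of ETag-based upload verification. For a simple PUT, the
# ETag returned by S3 is the hex-encoded MD5 hash of the stored data.
data = 'Hello World'
local_md5_hex = Digest::MD5.hexdigest(data)

etag = response['etag'].delete('"')   # strip the surrounding quotes
raise 'Upload verification failed!' unless etag == local_md5_hex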
The access control settings of an object can be set by the same request that creates the object, by including a special header named x-amz-acl with a value matching one of the canned ACL policy names available in the service (see “Canned Access Policies” later in this chapter for more information).
Example 3-8 defines a method that creates an object in S3, uploads its data and metadata content, and sets the object’s access control permission settings.
Example 3-8. Create an object: S3.rb
def create_object(bucket_name, object_key, opts={})
  # Initialize local variables for the provided option items
  data = (opts[:data] ? opts[:data] : '')
  headers = (opts[:headers] ? opts[:headers].clone : {})
  metadata = (opts[:metadata] ? opts[:metadata].clone : {})

  # The Content-Length header must always be set when data is uploaded.
  headers['Content-Length'] =
    (data.respond_to?(:stat) ? data.stat.size : data.size).to_s

  # Calculate an md5 hash of the data for upload verification
  if data.respond_to?(:stat)
    # Generate MD5 digest from file data one chunk at a time
    md5_digest = Digest::MD5.new
    File.open(data.path, 'rb') do |io|
      buffer = ''
      md5_digest.update(buffer) while io.read(4096, buffer)
    end
    md5_hash = md5_digest.digest
  else
    md5_hash = Digest::MD5.digest(data)
  end
  headers['Content-MD5'] = encode_base64(md5_hash)

  # Set the canned policy, may be: 'private', 'public-read',
  # 'public-read-write', 'authenticated-read'
  headers['x-amz-acl'] = opts[:policy] if opts[:policy]

  # Set an explicit content type if none is provided, otherwise the
  # ruby HTTP library will use its own default type
  # 'application/x-www-form-urlencoded'
  if not headers['Content-Type']
    headers['Content-Type'] =
      data.respond_to?(:to_str) ? 'text/plain' : 'application/octet-stream'
  end

  # Convert metadata items to headers using the
  # S3 metadata header name prefix.
  metadata.each do |n,v|
    headers["x-amz-meta-#{n}"] = v
  end

  uri = generate_s3_uri(bucket_name, object_key)
  do_rest('PUT', uri, data, headers)
  return true
end
The best way to become familiar with the workings of the object creation API is to see some examples. We will start by creating a simple text document inside our my-bucket bucket.
irb> s3.create_object('my-bucket', 'Hello.txt', :data => 'Hello World')

REQUEST DESCRIPTION
=======
PUT\n
sQqNsWTgdUEFt6mb5y4/5Q==\n
text/plain\n
Wed, 07 Nov 2007 11:13:57 GMT\n
/my-bucket/Hello.txt

REQUEST
=======
Method : PUT
URI : https://my-bucket.s3.amazonaws.com/Hello.txt
Headers:
  Expect=100-continue
  Authorization=AWS ABCDEFGHIJ1234567890:vf6Slm3v09YPjKyyHhVn0BshUuA=
  Date=Wed, 07 Nov 2007 11:13:57 GMT
  Content-Type=text/plain
  Host=my-bucket.s3.amazonaws.com
  Content-Length=11
  Content-MD5=sQqNsWTgdUEFt6mb5y4/5Q==
Request Body Data:
Hello World

RESPONSE
========
Status : 200 OK
Headers:
  x-amz-id-2=MdRHrORD1Kw+Ps9zUhrzKczu9Jd0/1V2/0ITwK5vp2KKYshUGnGai7/htu1t/KLX
  etag="b10a8db164e0754105b7a99be72e3fe5"
  date=Wed, 07 Nov 2007 11:14:00 GMT
  x-amz-request-id=19E1F76D7C1864E9
  server=AmazonS3
  content-length=0
The object created by this command is quite simple. It contains only text content without any user-defined metadata, and because it does not specify an access control policy, it will be made private by default. The object is stored in a resource that can be accessed using two different URIs with slightly different formats: http://s3.amazonaws.com/my-bucket/Hello.txt or http://my-bucket.s3.amazonaws.com/Hello.txt; but because it is a private object, you will receive an AccessDenied error message from S3 if you try to load this location in your web browser.
We have not yet discussed how to list the objects stored in your bucket. For the time being, we will make all your objects publicly readable, so you can access them through a web browser and confirm that they have been stored in S3. Replace the original Hello.txt object with a new version that uses the canned access control policy public-read.
irb> s3.create_object('my-bucket', 'Hello.txt', :data => 'Hello World',
       :policy => 'public-read')

REQUEST DESCRIPTION
=======
PUT\n
sQqNsWTgdUEFt6mb5y4/5Q==\n
text/plain\n
Wed, 07 Nov 2007 11:16:58 GMT\n
x-amz-acl:public-read\n
/my-bucket/Hello.txt

REQUEST
=======
Method : PUT
URI : https://my-bucket.s3.amazonaws.com/Hello.txt
Headers:
  Expect=100-continue
  Authorization=AWS ABCDEFGHIJ1234567890:1edltaBG0ImEqCafMgUeHp6APlE=
  x-amz-acl=public-read
  Date=Wed, 07 Nov 2007 11:16:58 GMT
  Content-Type=text/plain
  Host=my-bucket.s3.amazonaws.com
  Content-Length=11
  Content-MD5=sQqNsWTgdUEFt6mb5y4/5Q==
Request Body Data:
Hello World

RESPONSE
========
Status : 200 OK
Headers:
  x-amz-id-2=8gStUFaxZf3V+rMZ/hiimIBnsKr4QZHuxiUuWgSWMzsP8QEHVl6Z1aUMMpzQgA6N
  etag="b10a8db164e0754105b7a99be72e3fe5"
  date=Wed, 07 Nov 2007 11:17:01 GMT
  x-amz-request-id=77FACC965DBFBDD0
  server=AmazonS3
  content-length=0
Now when you visit the URI http://my-bucket.s3.amazonaws.com/Hello.txt in your browser, you will see the contents of the object displayed as text.
The create_object method automatically sets the Content-Type HTTP request header to the value text/plain when an object is created with textual data. In some circumstances you may wish to override this behavior and create text objects with a different content type, such as HTML pages. To set your own HTTP headers, you can provide them as a hash to the method’s headers parameter.
Here is a command to upload a simple HTML document to S3 with the content type set to text/html, so a web browser will interpret it correctly.
irb> headers = {'Content-Type'=>'text/html'}
irb> html = '<b>Webpage</b> <i>content</i>'
irb> s3.create_object('my-bucket', 'WebPage.html', :data => html,
       :policy => 'public-read', :headers => headers)

REQUEST DESCRIPTION
=======
PUT\n
rD3P+CBXWNyYwtcqKuijpQ==\n
text/html\n
Wed, 07 Nov 2007 11:20:43 GMT\n
x-amz-acl:public-read\n
/my-bucket/WebPage.html

REQUEST
=======
Method : PUT
URI : https://my-bucket.s3.amazonaws.com/WebPage.html
Headers:
  Expect=100-continue
  Authorization=AWS ABCDEFGHIJ1234567890:2dreEnJ/XP2MIg52AO2OO/oJlvU=
  x-amz-acl=public-read
  Date=Wed, 07 Nov 2007 11:20:43 GMT
  Content-Type=text/html
  Host=my-bucket.s3.amazonaws.com
  Content-Length=29
  Content-MD5=rD3P+CBXWNyYwtcqKuijpQ==
Request Body Data:
<b>Webpage</b> <i>content</i>

RESPONSE
========
Status : 200 OK
Headers:
  x-amz-id-2=DXdHJYXrmJuKHWJwJY2Wt0pD4KYWOrNcQoTe3rMyOXBLKYQrtxgnLPxr7k3K7qVm
  etag="ac3dcff8205758dc98c2d72a2ae8a3a5"
  date=Wed, 07 Nov 2007 11:20:45 GMT
  x-amz-request-id=2D54144B83D6E9F2
  server=AmazonS3
  content-length=0
If you visit this object’s URL in your web browser, you will see the page displayed with the correct HTML formatting. You can use this technique to associate a range of HTTP response headers with your objects, such as Content-Language for specifying the language a page is written in, and Expires to set how long pages retrieved from S3 should be cached.
To store metadata information other than HTTP headers with an object, you can provide a set of metadata name and value items to the create_object method in the metadata parameter. The method will send these items to the service as HTTP request headers, but only after renaming the items to include the prefix x-amz-meta-, which indicates the header is metadata.
irb> s3.create_object('my-bucket', 'Metadata.txt', :data => 'I have metadata!',
       :policy => 'public-read', :metadata => {'Description'=>'A welcome message'})

REQUEST DESCRIPTION
=======
PUT\n
LWtbsg8yIkBEjFolkkq09Q==\n
text/plain\n
Wed, 07 Nov 2007 11:22:20 GMT\n
x-amz-acl:public-read\n
x-amz-meta-description:A welcome message\n
/my-bucket/Metadata.txt

REQUEST
=======
Method : PUT
URI : https://my-bucket.s3.amazonaws.com/Metadata.txt
Headers:
  x-amz-meta-Description=A welcome message
  Expect=100-continue
  Authorization=AWS ABCDEFGHIJ1234567890:ICUUaodoew22fxB0bQFthsy7zUE=
  x-amz-acl=public-read
  Date=Wed, 07 Nov 2007 11:22:20 GMT
  Content-Type=text/plain
  Host=my-bucket.s3.amazonaws.com
  Content-Length=16
  Content-MD5=LWtbsg8yIkBEjFolkkq09Q==
Request Body Data:
I have metadata!

RESPONSE
========
Status : 200 OK
Headers:
  x-amz-id-2=jsMfJ0OYqhY+sH5zhybkZhwuEISITgNcNcQ5nm0jkoULXdnSWh7nV+revG1XReTn
  etag="2d6b5bb20f322240448c5a25924ab4f5"
  date=Wed, 07 Nov 2007 11:22:22 GMT
  x-amz-request-id=A70A4451681D9C6B
  server=AmazonS3
  content-length=0
Web browsers do not generally display HTTP header information, so you will not be able to see this object’s metadata information if you visit the URL in a browser. We will demonstrate how to retrieve an object’s data and metadata directly in the next section, “Retrieving Objects.”
Finally, we should note that the implementation we provided can upload files to S3, as well as text strings. To upload the contents of a file to S3, you supply a Ruby file object to the method as its data parameter. Here is a command that uploads an image file called image.png from the local directory, makes it publicly accessible, and sets the content type to image/png, so web browsers will display it correctly.
irb> s3.create_object('my-bucket', 'image.png', :data => File.new('image.png', 'rb'), :policy => 'public-read', :headers => {'Content-Type'=>'image/png'})
There are two ways to retrieve information about an object from S3: using a GET request or a HEAD request.
To retrieve all of an object’s data, including both its content and metadata, you send a GET request to the service with a URI specifying the bucket the object is stored in and its key name. The response to a GET request will contain the object’s data in the response body and its metadata in the response headers.
To retrieve only an object’s metadata and not its contents, you send a HEAD request to the service instead. The response to a HEAD request includes the metadata headers but contains no body. Both GET and HEAD requests will return an HTTP 200 status code when they are successful.
You may wonder why you would ever use the HEAD method, when you can simply use the GET method to retrieve all of an object’s data and ignore the response body, if you are not interested in the object’s contents. It is worthwhile to use HEAD requests when you are only interested in an object’s metadata, because your client will not have to maintain an open network connection any longer than necessary or manually close the connection to discard the unwanted response-body data.
Example 3-9 and Example 3-10 define methods that retrieve data from an object in S3. The former uses a GET request and retrieves the object’s content data and its metadata, and the latter uses a HEAD request and only retrieves the metadata.
Example 3-9. Retrieve an object: S3.rb
def get_object(bucket_name, object_key, headers={})
  uri = generate_s3_uri(bucket_name, object_key)

  if block_given?
    response = do_rest('GET', uri, nil, headers) {|segment| yield(segment)}
  else
    response = do_rest('GET', uri, nil, headers)
  end

  response_headers = {}
  metadata = {}
  response.each_header do |name,value|
    if name.index('x-amz-meta-') == 0
      metadata[name['x-amz-meta-'.length..-1]] = value
    else
      response_headers[name] = value
    end
  end

  result = {
    :metadata => metadata,
    :headers => response_headers
  }
  result[:body] = response.body if not block_given?
  return result
end
The get_object method performs differently when a code block is provided to the method, in which case the data is downloaded from S3 a piece at a time and is passed on to the code block for processing as it arrives. This behavior allows us to stream object downloads, processing the data as it arrives rather than leaving Ruby to store all the object’s data in memory. This has obvious advantages when you are downloading large objects. If the download is streamed, the object’s data is not returned in the method’s dictionary result object.
Example 3-10. Retrieve an object’s metadata: S3.rb
def get_object_metadata(bucket_name, object_key, headers={})
  uri = generate_s3_uri(bucket_name, object_key)
  response = do_rest('HEAD', uri, nil, headers)

  response_headers = {}
  metadata = {}
  response.each_header do |name,value|
    if name.index('x-amz-meta-') == 0
      metadata[name[11..-1]] = value
    else
      response_headers[name] = value
    end
  end

  return {
    :metadata => metadata,
    :headers => response_headers
  }
end
Let us use the objects we created earlier to demonstrate how to use the get_object method. We will retrieve the Metadata.txt object from the my-bucket bucket, because this object contains both metadata and content data. We will store the results in a Ruby variable, so we can examine them more closely.
irb> obj = s3.get_object('my-bucket', 'Metadata.txt')
REQUEST DESCRIPTION
=======
GET\n
\n
\n
Wed, 07 Nov 2007 11:25:04 GMT\n
/my-bucket/Metadata.txt

REQUEST
=======
Method : GET
URI : https://my-bucket.s3.amazonaws.com/Metadata.txt
Headers:
  Authorization=AWS ABCDEFGHIJ1234567890:m2Df9z1SeO7bmM4G1NLRSE6YkxU=
  Date=Wed, 07 Nov 2007 11:25:04 GMT
  Host=my-bucket.s3.amazonaws.com

RESPONSE
========
Status : 200 OK
Headers:
  last-modified=Wed, 07 Nov 2007 11:22:22 GMT
  x-amz-meta-description=A welcome message
  x-amz-id-2=+VstU+ssan+c+cDuFWL7fQtvm4KqGWxUPwhQOc2JF7mjhzfNbiGI1jJ6NvsTpX+I
  content-type=text/plain
  etag="2d6b5bb164e0754105b7a99be72e3fe5"
  date=Wed, 07 Nov 2007 11:25:07 GMT
  x-amz-request-id=269211548D652727
  server=AmazonS3
  content-length=16
{:headers=>
{"last-modified"=>"Wed, 07 Nov 2007 11:22:22 GMT",
"x-amz-id-2"=>
"Q9Y/kBSntTXpWI2Vmk56aqf07P5MzFxW06xx25hDQIoDZhhtFtA3Ue2ojF5c7lkF",
"date"=>"Wed, 07 Nov 2007 11:25:07 GMT",
"etag"=>"\"2d6b5bb20f322240448c5a25924ab4f5\"",
"content-type"=>"text/plain",
"x-amz-request-id"=>"EBECEC8CCC3AFEB3",
"server"=>"AmazonS3",
"content-length"=>"16"},
:body=>"I have metadata!",
:metadata=>{"description"=>"A welcome message"}}
The debug log above shows that the response message includes the object’s Description metadata item in a response header called x-amz-meta-description, and its content data in the response body. We can obtain these details through the Ruby data structures returned by the method.
irb> obj[:body]
=> "I have metadata!"
irb> obj[:metadata]
=> {"description"=>"A welcome message"}
irb> obj[:headers]['content-type']
=> "text/plain"
To save the data content of an S3 object to a file, you can write the contents of the :body hash item from the method’s result to a File object.
irb> File.open('Metadata.txt','w') do |file|
irb>   file.write(obj[:body])
irb> end
irb> puts File.new('Metadata.txt').read
I have metadata!
If the object in S3 is large, you should stream the download straight to a file, rather than having Ruby store the data in memory in the :body variable. To stream the download, you must define a code block to process the download a piece at a time, and provide this code block to the get_object method.
# Stream an object download from S3
irb> File.open('Metadata.txt','w') do |file|
irb>   obj = s3.get_object('my-bucket','Metadata.txt') do |data|
irb>     file.write(data)
irb>   end
irb> end

# The object's data is stored to a file
irb> puts File.new('Metadata.txt').read
I have metadata!

# However, the object's data is not available in the result object
irb> obj[:body]
=> nil
If you are only interested in an object’s metadata, use the get_object_metadata method instead of the get_object method.
irb> obj_meta = s3.get_object_metadata('my-bucket', 'Metadata.txt')

REQUEST DESCRIPTION
=======
HEAD\n
\n
\n
Wed, 07 Nov 2007 11:26:58 GMT\n
/my-bucket/Metadata.txt

REQUEST
=======
Method : HEAD
URI : https://my-bucket.s3.amazonaws.com/Metadata.txt
Headers:
  Authorization=AWS ABCDEFGHIJ1234567890:0BpIagDvtNXIBMAiquTFkq3qmC8=
  Date=Wed, 07 Nov 2007 11:26:58 GMT
  Host=my-bucket.s3.amazonaws.com

RESPONSE
========
Status : 200 OK
Headers:
  last-modified=Wed, 07 Nov 2007 11:22:22 GMT
  x-amz-meta-description=A welcome message
  x-amz-id-2=NhMrsU/5vunkvEMo51x4aXKYRM/QSp5JJSptv51N0i7Oyb8LMw+NcPXVtC+GBNRk
  content-type=text/plain
  etag="2d6b5bb20f322240448c5a25924ab4f5"
  date=Wed, 07 Nov 2007 11:27:01 GMT
  x-amz-request-id=F9793916FB0DAF60
  server=AmazonS3
  content-length=16

{:headers=>
  {"last-modified"=>"Wed, 07 Nov 2007 11:22:22 GMT",
   "x-amz-id-2"=>
    "NhMrsU/5vunkvEMo51x4aXKYRM/QSp5JJSptv51N0i7Oyb8LMw+NcPXVtC+GBNRk",
   "date"=>"Wed, 07 Nov 2007 11:27:01 GMT",
   "etag"=>"\"2d6b5bb20f322240448c5a25924ab4f5\"",
   "content-type"=>"text/plain",
   "x-amz-request-id"=>"F9793916FB0DAF60",
   "server"=>"AmazonS3",
   "content-length"=>"16"},
 :metadata=>{"description"=>"A welcome message"}}

irb> obj_meta[:body]
=> nil
irb> obj_meta[:metadata]
=> {"description"=>"A welcome message"}
irb> obj_meta[:headers]['content-type']
=> "text/plain"
S3 supports a specialized set of HTTP request headers that can be used to exert more control over how and when objects are retrieved with GET requests. These headers can be used to cause GET requests to only retrieve objects when certain conditions are met, or to retrieve only a specific portion of an object’s content data.
Table 3-3 describes the conditions that can be applied to GET requests by providing specific request headers.
Table 3-3. Headers for conditional GET requests

Header Name | Description |
---|---|
If-Modified-Since | Return the object only if it has been modified since the specified date; otherwise S3 returns the status code 304 Not Modified. |
If-Unmodified-Since | Return the object only if it has not been modified since the specified date; otherwise S3 returns the status code 412 Precondition Failed. |
If-Match | Return the object only if its ETag (MD5 hash) value matches the value specified; otherwise S3 returns the status code 412 Precondition Failed. |
If-None-Match | Return the object only if its ETag value differs from the value specified; otherwise S3 returns the status code 304 Not Modified. |
Range | Return only the specified byte range of the object's data content, such as bytes=0-99 for the first hundred bytes, or bytes=-9 for the last nine. |
To demonstrate how to apply these conditional request headers, let us try performing some conditional requests. The first request applies the Range request header to return only the last nine bytes of the Metadata.txt object’s data content—the word “metadata!” The second request applies the If-Unmodified-Since header to return the object only if it has been unchanged since 1994. This request will fail, because the object is obviously newer than that. The third request applies the If-Modified-Since header and this time the request will succeed, because the object has been modified since 1994.
irb> obj = s3.get_object('my-bucket', 'Metadata.txt', {'Range'=>'bytes=-9'})
irb> obj[:body]
=> "metadata!"

irb> obj = s3.get_object('my-bucket', 'Metadata.txt',
       {'If-Unmodified-Since'=>'Sat, 29 Oct 1994 19:43:31 GMT'})
AWS::ServiceError: HTTP Error: 412 - Precondition Failed, AWS Error:
PreconditionFailed - At least one of the pre-conditions you specified did not hold

irb> obj = s3.get_object('my-bucket', 'Metadata.txt',
       {'If-Modified-Since'=>'Sat, 29 Oct 1994 19:43:31 GMT'})
irb> obj[:body]
=> "I have metadata!"
To obtain a listing of the objects you have stored in a bucket, you send a GET request to a URI that specifies the bucket resource. S3 will reply with an HTTP 200 response message that contains an XML document in the response body. This XML object-listing document contains an inventory of the objects in a bucket, including important information about each object, such as its key name, size, last modification date, and the MD5 hash value of its data.
Here is an XML document returned by the operation:
<ListBucketResult xmlns='http://s3.amazonaws.com/doc/2006-03-01/'>
  <Name>my-bucket</Name>
  <Prefix/>
  <Marker/>
  <MaxKeys>1000</MaxKeys>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>Metadata.txt</Key>
    <LastModified>2007-11-06T02:52:41.000Z</LastModified>
    <ETag>"2d6b5bb20f322240448c5a25924ab4f5"</ETag>
    <Size>16</Size>
    <Owner>
      <ID>1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b</ID>
      <DisplayName>jamesmurty</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
  <Contents>
    <Key>WebPage.html</Key>
    <LastModified>2007-11-06T02:51:35.000Z</LastModified>
    <ETag>"ac3dcff8205758dc98c2d72a2ae8a3a5"</ETag>
    <Size>29</Size>
    <Owner>
      <ID>1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b</ID>
      <DisplayName>jamesmurty</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>
This document has a complex structure and contains much more information than just an inventory of our objects. Let us work through the elements in this document structure by dividing them into three categories: object details, truncated listings, and searching.
In an object-listing document, the information that is most apparent, and most immediately useful, is the actual list of objects. In the ListBucketResult document structure, objects are listed as a set of Contents elements, which contain objects sorted alphabetically by their key name. Each Contents element includes a number of useful details about the object it represents.
Inside each Contents element we can see important details about the object, including its key name (Key), a timestamp describing when it was created or last modified (LastModified), a hex-encoded MD5 hash value of the object’s data content (ETag), and the size of the object in bytes (Size). In addition to these object details, the listing includes an Owner element that identifies the owner or creator of the object. Unless you are listing the contents of a bucket owned by someone else, or you have permitted someone else to create objects in your bucket, you will be the owner of all the objects in your bucket. Finally, the Contents element includes a StorageClass designation for each object. This is not a meaningful piece of information, because the storage class of objects is always Standard.
S3 buckets can hold an unlimited number of objects. This could clearly pose a problem if you were to list all the objects in a large bucket, because the listing document would continue forever, or at least for as long as your patience held out. To avoid returning huge listing documents for buckets with many objects, S3 truncates object listings to include 1,000 or fewer objects. When a client receives a truncated object-listing document, it is responsible for recognizing that the listing is incomplete and performing follow-up queries, until it is able to build up a complete list of the bucket’s contents.
When you send an object-listing request to the service, you can specify the maximum number of objects you want to be included in the listing. The service will include up to this many objects and no more, though it may also return fewer objects than requested. The listing document returned by the service describes the limit it applied in the MaxKeys element. In our example listing document, we can see that this limit is set to the default maximum value of 1000. To limit the number of keys that will be included in a key-listing document, you include the parameter max-keys in the GET request.
When an S3 client receives an object listing that has been truncated, it must recognize that there are more objects in the bucket than are recorded in the listing, so it can perform additional requests to list the missing objects. The IsTruncated element of the listing document indicates when a listing has been truncated. If this element has the value true, the listing document gives an incomplete picture of the bucket’s contents.
In addition to recognizing when a listing is incomplete, S3 clients need a mechanism to perform follow-up requests, so they can build up a complete list of objects one listing document at a time. To paginate a large number of objects over multiple listings, S3 clients provide a marker parameter with their listing requests. This parameter specifies a string that will serve as the starting point for the listing. When a marker is provided, only those objects in the bucket with key names alphabetically after the marker will be included in the listing. If the marker parameter is specified in a request, S3 will echo this value back to the client as an element called Marker in the response XML document.
By providing the marker parameter when performing a listing request, S3 clients can ensure that the service provides the next listing they need to find out about the objects that were not included in the prior listing. The client can determine the appropriate value to use for the marker parameter in one of two ways:
If the listing document includes a NextMarker element, the value of this element can be used in the marker parameter of a follow-up request to retrieve the next set of objects. Unfortunately, the NextMarker element is only provided by the service in some circumstances (see the discussion on “Searching” below). If this element is not available, the client must resort to the second option.
When the NextMarker element is not available, the client can deduce the appropriate value to use in the marker parameter of a follow-up request by identifying the last object key name in the current listing document. When the last key name from the current listing is used as the marker for the next request, only objects with key names that occur after this name will be listed.
Retrieving an object listing for buckets with many thousands of objects can be a slow process, because clients must perform multiple listing requests to learn about all the objects. If your buckets will store many objects, it may be a good idea to keep a local record of your object key names so you can refer to this instead of performing time-consuming listing requests.
If it is not feasible to keep your own records of object details, you can still speed up object listings by performing multiple listing requests at the same time with different marker values and merging the results.
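As a sketch of this parallel technique, the fragment below partitions the key space on the first character of the key name using the prefix parameter (a close cousin of splitting on precomputed marker values), runs one listing thread per partition, and merges the results. It assumes the s3 client object used throughout this chapter and the list_objects method defined later in Example 3-11.

# Sketch: list a large bucket with several concurrent requests by
# partitioning the key space, then merging the per-partition results.
# Assumes every key name starts with a letter or a digit.
partitions = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a

threads = partitions.map do |first_char|
  Thread.new do
    s3.list_objects('my-bucket', :prefix => first_char)[:objects]
  end
end

all_objects = threads.map {|t| t.value }.flatten
puts "Found #{all_objects.size} objects in total"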
S3’s object-listing API provides basic searching functionality to find objects based on their key names. In “Object Keys and Hierarchical Naming,” we discussed how S3 object key names can be constructed to represent a hierarchy resembling a file directory structure. When you perform object listings, you can take advantage of this structure and list only those objects that occupy a specific place in the hierarchy.
The listing API accepts two parameters that allow you to search for objects with specific names: prefix and delimiter. If you include the prefix parameter value in your object-listing request, only those objects with key names that start with the prefix string will be listed. If you include the delimiter parameter, any objects with key names that contain the delimiter string will be listed in a separate part of the XML document, as if they were directory names. If you use both of these parameters at the same time, you can navigate through the hierarchy represented in your object key names like you would through the subdirectories in a standard file system.
These parameters are not easy to describe, so let us look at some examples to make things clearer. Imagine that you have stored a number of images in a bucket with key names that represent a hierarchy, like this:
MyPictures/2005/image1.jpg
MyPictures/2006/image2.jpg
MyPictures/2007/image3.jpg
MyPictures/2007/image4.jpg
To list only the images that date from 2005, you could perform a listing with the prefix parameter set to MyPictures/2005/, and only the object named MyPictures/2005/image1.jpg would be listed.
To find out how many year-based subdirectories are inside this hierarchy, you would perform a listing with the prefix value MyPictures/ and the delimiter value / (a forward slash). With these two parameters, the listing will not contain any objects at all, because the delimiter character is present after the prefix string in all the object keys. Instead, the XML listing document will include a set of CommonPrefixes elements containing a Prefix value corresponding to each unique object key name that includes both the prefix and the delimiter.
<CommonPrefixes>
  <Prefix>MyPictures/2005/</Prefix>
</CommonPrefixes>
<CommonPrefixes>
  <Prefix>MyPictures/2006/</Prefix>
</CommonPrefixes>
<CommonPrefixes>
  <Prefix>MyPictures/2007/</Prefix>
</CommonPrefixes>
As you can see, the CommonPrefixes resemble a subdirectory listing of the MyPictures directory. You will also notice that the CommonPrefixes values are all unique. Although there are two objects in the MyPictures/2007/ location, this prefix is only mentioned once.
These CommonPrefixes strings are very useful for navigating hierarchies, because you can apply them as prefix parameter values to follow-up requests to drill down into the contents of each of the simulated subdirectories.
Do not worry if you are still confused about these object-searching parameters; we will demonstrate them shortly.
We have mentioned a number of parameters that can be applied to object listing requests. Table 3-4 summarizes the parameters recognized by the S3 API and what they do.
Table 3-4. Request parameters for listing objects
Parameter Name | Description |
---|---|
max-keys | The maximum number of objects that will be listed, up to 1,000. |
marker | A string that serves as a starting point for the listing. Only object keys that occur alphabetically after this marker value will be included in the listing. |
prefix | Only objects with key names that start with the prefix string will be included in the listing. |
delimiter | If object key names contain the delimiter string, they are listed as subdirectories in the CommonPrefixes document element. If the request also includes the prefix parameter, the delimiter must occur after the prefix portion of the key name to be recognized. |
Example 3-11 defines a method that sends a GET request to a bucket’s URI and retrieves and interprets the resulting object-listing XML document. The method automatically handles truncated listings by performing follow-up requests until all the objects have been listed. You can add optional parameter settings to control the listing by providing a params argument containing an array of hash objects that map each parameter’s name to its value.
Example 3-11. List objects: S3.rb
def list_objects(bucket_name, *params)
  is_truncated = true
  objects = []
  prefixes = []

  while is_truncated
    uri = generate_s3_uri(bucket_name, '', params)
    response = do_rest('GET', uri)
    xml_doc = REXML::Document.new(response.body)

    xml_doc.elements.each('//Contents') do |contents|
      objects << {
        :key => contents.elements['Key'].text,
        :size => contents.elements['Size'].text,
        :last_modified => contents.elements['LastModified'].text,
        :etag => contents.elements['ETag'].text,
        :owner_id => contents.elements['Owner/ID'].text,
        :owner_name => contents.elements['Owner/DisplayName'].text
      }
    end

    cps = xml_doc.elements.to_a('//CommonPrefixes')
    if cps.length > 0
      cps.each do |cp|
        prefixes << cp.elements['Prefix'].text
      end
    end

    # Determine whether listing is truncated
    is_truncated = 'true' == xml_doc.elements['//IsTruncated'].text

    # Remove any existing marker value
    params.delete_if {|p| p[:marker]}

    # Set the marker parameter to the NextMarker if possible,
    # otherwise set it to the last key name in the listing
    next_marker_elem = xml_doc.elements['//NextMarker']
    last_key_elem = xml_doc.elements['//Contents/Key[last()]']
    if next_marker_elem
      params << {:marker => next_marker_elem.text}
    elsif last_key_elem
      params << {:marker => last_key_elem.text}
    else
      params << {:marker => ''}
    end
  end

  return {
    :bucket_name => bucket_name,
    :objects => objects,
    :prefixes => prefixes
  }
end
The best way to come to grips with the many capabilities of the object-listing API operation is to perform some real listings and see what results we get. We will start with our test bucket my-bucket, which contains a few objects we have already created.
Let us list the objects in our test bucket and store the results in a listing variable.
irb> listing = s3.list_objects('my-bucket')

REQUEST DESCRIPTION
=======
GET\n
\n
\n
Wed, 07 Nov 2007 11:28:27 GMT\n
/my-bucket/

REQUEST
=======
Method : GET
URI : https://my-bucket.s3.amazonaws.com
Headers:
  Authorization=AWS ABCDEFGHIJ1234567890:isS0zbXbBWCHLj1awGUHKrEFhRI=
  Date=Wed, 07 Nov 2007 11:28:27 GMT
  Host=my-bucket.s3.amazonaws.com

RESPONSE
========
Status : 200 OK
Headers:
  x-amz-id-2=x4WBeY62B2w7krKmRfo2hSgpcERZq38PP90knTW1XxWwYy2+rp/oWBRCDnt1k2aI
  content-type=application/xml
  date=Wed, 07 Nov 2007 11:28:30 GMT
  x-amz-request-id=E54E58955A479C6B
  server=AmazonS3
  transfer-encoding=chunked
We can examine the contents of the listing to confirm that it contains only objects and no common prefixes.
irb> listing[:bucket_name]
=> "my-bucket"
irb> listing[:prefixes]
=> []
irb> listing[:objects].size
=> 3
irb> listing[:objects].each {|o| puts "#{o[:key]} (#{o[:size]} bytes)"}
Hello.txt (11 bytes)
Metadata.txt (16 bytes)
WebPage.html (29 bytes)
To see what happens when the list_objects method encounters a truncated listing, you can send a listing request with the max-keys parameter set to a value smaller than the number of objects in your bucket. We will set the maximum keys limit to 1, which means the method will have to perform three separate requests.
Parameter arguments can be provided to the list_objects method as hashes with key names that are either Ruby symbols (:marker=>'x') or plain strings ('marker'=>'x'). Strings will always work but are less efficient, and symbols cannot include nonalphabetical characters, like the hyphen in the max-keys parameter name. We will use symbol names whenever possible.
irb> listing = s3.list_objects('my-bucket', 'max-keys'=>1)
. . .
REQUEST
=======
Method : GET
URI : https://my-bucket.s3.amazonaws.com/?max-keys=1
Headers:
  Authorization=AWS ABCDEFGHIJ1234567890:ax8AZKHXMIhlt3QWwuLWRSzMceY=
  Date=Wed, 03 Oct 2007 04:20:08 GMT
  Host=s3.amazonaws.com
. . .
REQUEST
=======
Method : GET
URI : https://my-bucket.s3.amazonaws.com/?max-keys=1&marker=Hello.txt
. . .
irb> listing[:objects].size
=> 3
If you do not have debugging turned on, you will not see any difference in the results, although the listing will take longer because it will require multiple HTTP requests instead of just one. If you enable debugging, you will see the implementation performing multiple requests until all the objects are listed. All of the request messages except the first will include a marker parameter value.
We can set our own marker value to list only those objects with names that occur alphabetically after the marker in the bucket listing. If we list all the object names that occur after the Metadata.txt object, we will receive one result.
irb> listing = s3.list_objects('my-bucket', :marker=>'Metadata.txt')
irb> listing[:objects].each {|o| puts o[:key]}
WebPage.html
If we list all the object names that occur after the letter W, we will receive the same result, because only the object key name WebPage.html is sorted alphabetically after this string.
irb> listing = s3.list_objects('my-bucket', :marker=>'W')
irb> listing[:objects].each {|o| puts o[:key]}
WebPage.html
To properly demonstrate how to use the prefix and delimiter parameters to perform searches on your objects and navigate through key name hierarchies, we must first create some test objects with hierarchical names. We will create some dummy objects now using the same image names we discussed in “Searching.” These objects can be empty, because we are only interested in the object names.
irb> s3.create_object('my-bucket', 'MyPictures/2005/image1.jpg')
irb> s3.create_object('my-bucket', 'MyPictures/2006/image2.jpg')
irb> s3.create_object('my-bucket', 'MyPictures/2007/image3.jpg')
irb> s3.create_object('my-bucket', 'MyPictures/2007/image4.jpg')
Now, let us list the objects in our bucket that match the prefix ‘M’.
irb> listing = s3.list_objects('my-bucket', :prefix=>'M')
irb> listing[:objects].each {|o| puts o[:key]}
Metadata.txt
MyPictures/2005/image1.jpg
MyPictures/2006/image2.jpg
MyPictures/2007/image3.jpg
MyPictures/2007/image4.jpg
If we add to this mix a delimiter parameter set to the slash character, we can see how the object key names that contain the delimiter are summarized into the CommonPrefixes element.
irb> listing = s3.list_objects('my-bucket', :prefix=>'M', :delimiter=>'/')
. . .
REQUEST
=======
Method : GET
URI : https://my-bucket.s3.amazonaws.com/?delimiter=%2F&prefix=M
. . .
RESPONSE
========
. . .
Body:
<?xml version='1.0' encoding='UTF-8'?>
<ListBucketResult xmlns='http://s3.amazonaws.com/doc/2006-03-01/'>
  <Name>my-bucket</Name>
  <Prefix>M</Prefix>
  <Marker/>
  <MaxKeys>1000</MaxKeys>
  <Delimiter>/</Delimiter>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>Metadata.txt</Key>
    <LastModified>2007-11-07T11:22:22.000Z</LastModified>
    <ETag>"2d6b5bb20f322240448c5a25924ab4f5"</ETag>
    <Size>16</Size>
    <Owner>
      <ID>1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b</ID>
      <DisplayName>jamesmurty</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
  <CommonPrefixes>
    <Prefix>MyPictures/</Prefix>
  </CommonPrefixes>
</ListBucketResult>
With both the prefix and delimiter applied, this listing includes only one object key, for the Metadata.txt object, and a single CommonPrefixes item.
irb> listing[:objects].each {|o| puts o[:key]}
Metadata.txt
irb> listing[:prefixes]
=> ["MyPictures/"]
To drill down further into the pseudo-subdirectory MyPictures, we can perform a follow-up request that takes the value from the CommonPrefixes XML element and includes it in the request message as the prefix parameter.
irb> prefix = listing[:prefixes].first
=> "MyPictures/"
irb> listing = s3.list_objects('my-bucket', :prefix=>prefix, :delimiter=>'/')
irb> listing[:prefixes]
=> ["MyPictures/2005/", "MyPictures/2006/", "MyPictures/2007/"]
This latest listing will include the set of objects and prefixes underneath MyPictures/ in the hierarchy. In our example there are only prefixes, because we have not stored any objects at this level of the hierarchy. This demonstrates very nicely how the prefix and delimiter parameters can be set according to the values returned in the CommonPrefixes XML element of a prior response document. Doing so allows you to navigate through a hierarchical naming structure as though it were a directory structure in a standard filesystem.
We will end by going one level deeper to list objects at the bottom of the hierarchy. You reach the leaves of the hierarchy when the object key names do not contain the delimiter string in the portion of the key name after the prefix.
irb> listing = s3.list_objects('my-bucket', :prefix=>'MyPictures/2005/',
       :delimiter=>'/')
irb> listing[:prefixes]
=> []
irb> listing[:objects].each {|o| puts "#{o[:key]} (#{o[:size]} bytes)"}
MyPictures/2005/image1.jpg (0 bytes)
To delete an object from S3, you send a DELETE request to the service with a URI specifying the object’s key name and the bucket it is stored in. The object will be deleted, and S3 will return an empty HTTP 204 response message. Delete requests will succeed even if you delete the same object multiple times, or if you delete an object that never existed.
Be careful when deleting objects; there is no way to retrieve an object’s data after it has been deleted.
Example 3-12 defines a method that sends a DELETE request message to a URI specifying the bucket the object is stored in and the object’s key name.
Example 3-12. Delete object: S3.rb
def delete_object(bucket_name, object_key)
  uri = generate_s3_uri(bucket_name, object_key)
  do_rest('DELETE', uri)
  return true
end
Here is an example command and debugging log showing what happens when you delete an object.
irb> s3.delete_object('my-bucket', 'MyPictures/2005/image1.jpg')

REQUEST DESCRIPTION
=======
DELETE\n
\n
\n
Wed, 07 Nov 2007 11:35:06 GMT\n
/my-bucket/MyPictures/2005/image1.jpg

REQUEST
=======
Method : DELETE
URI : https://my-bucket.s3.amazonaws.com/MyPictures/2005/image1.jpg
Headers:
  Authorization=AWS ABCDEFGHIJ1234567890:4Kext6O5ezLFqR0SCPR9flRs2eI=
  Date=Wed, 07 Nov 2007 11:35:06 GMT
  Host=my-bucket.s3.amazonaws.com

RESPONSE
========
Status : 204 No Content
Headers:
  x-amz-id-2=YdPaperJTlH8CPlaRmNj1JbmElROjNxwVBfh1rboy1st7kRDoDjHhc2rsu5e6WeP
  date=Wed, 07 Nov 2007 11:35:08 GMT
  x-amz-request-id=872BE829A52CAAC7
  server=AmazonS3

irb> s3.get_object('my-bucket', 'MyPictures/2005/image1.jpg')
AWS::ServiceError: HTTP Error: 404 - Not Found, AWS Error:
NoSuchKey - The specified key does not exist.
We have already seen in “Create or Replace an Object” how you can create objects in S3 with PUT requests. The problem with PUT requests is that common web browsers do not support them; they rely instead on HTML forms and POST requests to upload files and data to web servers. In early 2008 Amazon added support for POST requests to the S3 service to make it possible for S3 developers and their customers to upload content into S3 using a standard web browser.
The new POST support is intended to augment rather than replace the PUT method; it allows for browser-based uploads to S3 but nothing more. You cannot create buckets or update access control settings with POST requests. If your application interacts with S3 directly, the PUT request method remains the preferred mechanism for uploading data into the service. However, if your application provides a web site that accepts user-submitted content, the POST support could make your life much easier.
POST requests are constructed very differently from the other kinds of requests we have seen so far. Rather than building the request message directly yourself, you must provide the browser with an HTML form containing all the information it will need to build a valid request on your behalf.
Here are the steps involved in allowing a web site visitor to upload content to your S3 account:
1. Your application generates a web page containing a specially constructed HTML form. This form contains input fields that define and authenticate the POST request that will be sent by the browser when the user submits the form. The form will generally include a set of policy conditions specifying what kind of data the user can upload.

2. The user populates the HTML form with some data, which may be text or a file he wishes to upload. The user then submits the form, and the web browser sends a corresponding POST request to S3.

3. The S3 service receives the POST request and checks that the data provided by the user complies with any policy conditions you specified in the form. If any of the policy conditions fail, or if the form submitted is different from the version you authorized and signed, the service will reject the request and return an XML error message.

4. If the service accepts the POST request and the user successfully uploads his data, S3 will respond with a success status code, or it will redirect the user’s browser to another URL if you specify one. If the upload fails, S3 will respond with an XML error message that will be displayed in the user’s browser.
The fact that S3 sends XML error messages directly to users greatly limits the control you can maintain over the user experience of your web site. If an upload fails due to an error in the HTML form, the user will be shown a fairly incomprehensible error message (assuming their browser displays XML text at all) and will need to hit the Back button to return to your site.
To take advantage of the POST support in S3 you must be able to construct two new kinds of document: an HTML Form that defines the POST request, and a Policy Document that imposes conditions on the data a user may upload into your S3 account. We will discuss each of these documents in turn, before presenting example Ruby code that will allow you to easily create these documents.
In this book we assume that you will use server scripts to generate form and policy documents; however, you may wish to use browser scripting to generate or modify these documents on the user's computer. Although it is possible to build very flexible upload forms with browser scripting, if you pursue this course you should be very mindful of security, and always avoid making your AWS secret key accessible on the client side.
To enable web browsers to construct a POST request that will be understood and accepted by the S3 service, you must create HTML form documents that are structured correctly and that contain all the information required by the service.
Here is a web page containing an HTML form that allows a user to upload a file to the my-bucket bucket.
<html>
<head>
  <title>File Upload to S3</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
<form action="https://my-bucket.s3.amazonaws.com/" method="post"
      enctype="multipart/form-data">
  <input type="hidden" name="AWSAccessKeyId" value="ABCDEFGHIJ1234567890">
  <input type="hidden" name="signature" value="LPQ+lb6L0ykrDdUqc2usbEPmsjA=">
  <input type="hidden" name="key" value="${filename}">
  <input type="hidden" name="policy"
         value="eyJleHBpcmF0aW9uIjogIjIwMDgtMDEtMDlUMTE6Mjk6MzRaI...Mt=">

  Select a file to upload: <input name="file" type="file">
  <br>
  <input type="submit" value="Upload to Amazon S3">
</form>
</body>
</html>
The first thing to note is that the form and all its contents must be UTF-8 encoded. To ensure that the web page containing the form uses the correct encoding, the page’s content type and character set should be explicitly defined in a meta tag inside the page’s header:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The HTML form itself is configured to use the POST method and to encode data in a multipart form enclosure. The form’s action parameter must contain a URL that specifies the S3 bucket the object will be created in, and whether the HTTP or HTTPS protocol should be used to upload the data. This URL can be constructed according to any of the formats recognized by S3; see “Constructing S3 URIs” for more information. We will use the recommended subdomain URL format in this book.
Now that we have covered the general structure of the HTML form, we can turn our attention to the input fields it must contain to produce a valid POST request. Table 3-5 lists the field names that the S3 service recognizes and describes how each field is used. The service recognizes these field names whether they are specified in upper, lower or mixed case.
Depending on which input fields you include in a form, it will be either authenticated or anonymous.
Authenticated forms include input fields that specify a policy document, your AWS access key, and a signature value. With an authenticated form you can permit uploads into any bucket you own, impose conditions on the data a user can submit, and specify an expiration date for the form. The S3 service will not accept uploads from an authenticated form if the form has been modified by a user.
Anonymous forms do not include any of the policy document, AWS access key or signature input fields. Anonymous forms do not allow you to exercise any control over the data a user submits, and they may only be used with buckets that permit public write access by anonymous users.
Table 3-5. HTML form input fields for S3 POST
| Field Name | Value | Required? |
|---|---|---|
| key | The name of the object that will be created by the POST request. The name of the file uploaded by the user is stored in the special variable ${filename}. To allow a user to upload a file to a predefined path in S3, you might specify a key value like uploads/images/${filename}. | Yes |
| file | An input form field that will provide the data to upload to S3. This field should be the very last one within a form, because any subsequent input fields will be ignored by S3. This input field may provide file or text data: a file input (<input type="file">), a text input (<input type="text">), or a textarea element are all acceptable. | Yes |
| AWSAccessKeyId | The AWS access key of the bucket's owner. | If authenticated |
| policy | A Base64-encoded policy document that imposes conditions on the request. See "Policy Document for S3 POST" below for more information on policy documents. | If authenticated |
| signature | An HMAC signature value generated by signing a Base64-encoded policy document with your AWS secret key. | If authenticated |
| acl | The access control policy to apply to the newly created object. This value may be one of: private, public-read, public-read-write or authenticated-read. If this field is not included, the object will be private by default. | No |
| success_action_status | The HTTP status code S3 will include in its response after a successful upload. This value may be 200, 201 or 204 (the default). If the success_action_redirect field is specified, this field is ignored. | No |
| success_action_redirect | The URL to which the client's web browser is redirected after a successful upload. The redirect URL returned by S3 will include parameters that specify the bucket, key and etag of the new object. If this field is not included or if the specified URL is invalid, S3 will respond with the status code specified by the success_action_status field. | No |
| x-amz-meta-* | If fields with a name beginning with x-amz-meta- are included, the values of these fields are stored as metadata with the new object. | No |
| Others... | Other fields can be included in the form, however S3 will only act upon the fields it recognizes, such as the standard REST headers: Cache-Control, Content-Type, Content-Disposition, Content-Encoding, Expires. | No |
The maximum amount of form data that is permitted in a POST request is 20 KB, excluding the form’s data content.
Most input fields contain simple static text values, but the S3 service also understands one variable name: ${filename}. When the service processes a POST request, this variable is replaced with the name of the file the user has uploaded. The variable can be used in any form field except for policy, though it is most useful in the key field, where you specify the name of the object that will be created. Note that this variable will only work in forms that upload files; if a form uploads text data, the ${filename} variable will not be substituted.
A policy document specifies conditions that a POST request must meet to be considered valid by S3, and is the basis of the request signing technique used to authenticate POST requests. You must include a policy document in your HTML forms to maintain any degree of control over the data that users can upload to S3. A policy document is provided to S3 as a Base64-encoded document in the policy input field.
You specify your policy as a UTF-8 encoded JavaScript Object Notation (JSON) document. In other words, the policy document contains a hierarchical collection of name and value pairs that follow a predefined structure we will describe below.
Special characters in the policy document must be escaped with a preceding backslash (\) character. The set of special characters includes the backslash (\) and dollar sign ($), control characters such as newline (\n) and tab (\t), as well as any Unicode characters (\uxxxx).
Every policy document includes two top-level name and value
object pairs: expiration
and conditions
. The
expiration item specifies an ISO-8601 GMT timestamp value which
indicates when the policy will expire, while the conditions item
contains an array of zero or more rules to define what data a user
can upload. If a user submits a POST request with data that breaks
any of the policy conditions, S3 will reject the request with an
error message.
Here is a policy document that imposes three conditions, and that expires at midnight of February 1st, 2008.
{ "expiration": "2008-02-01T00:00:00Z", "conditions": [ {"bucket": "my-bucket"}, ["starts-with", "$key", ""], ["content-length-range", 1, 51200] ] }
Policy conditions are specified as either an object or an array in JSON format. You need not worry too much about what this means exactly, because there is only a limited set of condition statements available. You can simply adapt the example statements below to define your own conditions.
Each condition statement in a policy document describes a test operation that S3 will perform on a specific field in the HTML form. There are three kinds of conditions that can be specified in a policy document:
The first kind of condition is an equality check, which tests whether a field's value or values exactly match a given string. The equality condition can be specified in one of two formats.
The first format defines the condition as an array of strings that describe the equality operator (eq), the field to test ($fieldName), and a literal comparison value. Here is a condition statement that tests whether the field named "acl" has a value of "private":
["eq", "$acl", "private"]
There is a shorthand format for the equality test in which the condition is specified as a simple name and value pair within brace ({}) characters:
{"acl": "private"}
If your HTML form includes multiple values for a single field, the equality check must include each of the values in the correct order separated by commas:
{"acl": "private,authenticated-read"}
The second kind of condition uses the starts-with operator to check whether a field's value begins with a specific string.
The condition is specified with the same long format as we saw in the equality test. To test whether the input field named “key” has a value starting with “/documents/public/” we would use the following condition statement:
["starts-with", "$key", "/documents/public/"]
The starts-with condition has an important use beyond just testing for values that start with a given string: it can be used as a test that matches any possible value of a field. If you wish to permit a field to take on any value at all, you can define a starts-with condition that checks the field's value against an empty string. A field will pass this test whatever its value.
["starts-with", "$key", ""]
The third kind of condition, content-length-range, is a special case because it does not apply to a field in the HTML form; instead, it tests the size of the data a user has uploaded to S3. With this condition, you can define upper and lower limits on how many bytes of data a user can submit to S3. If the user uploads too much or too little data, the POST request will fail.
A content length range condition is specified as an array of three items describing the condition's name ("content-length-range") followed by the lower and upper bounds expressed as integers. Here is a condition statement that requires the user to upload at least 1 byte but no more than 50 kilobytes (51,200 bytes):
["content-length-range", 1, 51200]
To construct a policy document that fully describes and authenticates your HTML form, you must include at least one condition for each named input field in the form. If your form contains input fields that are not mentioned in the policy document, the S3 service will reject the POST request. This is a safety mechanism imposed by the service to ensure that the policy documents you create describe all the fields in the form and leave no room for an attacker to modify the form after you have signed it.
For example, if you wish to add a field specifying the Content-Type header that will be assigned to a new object, you must also include a condition for this field in the policy document. The condition you include may be an equality check or it may use the starts-with operator to allow any value for the field; it does not matter what kind of condition you define, provided there is at least one policy condition that refers to each field.
There are two field names that must be included in every policy document: key and bucket. The key input field is required in every form and must therefore be permitted with a corresponding condition statement. The situation is the same for the bucket field. Although the bucket field is not explicitly included in the S3 POST form, it is included implicitly with every POST request.
There are some exceptions to the rule that every input field must have a corresponding condition statement in the policy document. The exceptions are any form fields that occur after the file field, any fields with a name starting with x-ignore-, and the fields AWSAccessKeyId, signature, file and policy.
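For instance, an extra form field whose name begins with x-ignore- can be included in a form without a matching condition statement. A small sketch (the field name here is hypothetical):

# Extra fields for a form's :fields option (see Example 3-14).
# 'x-ignore-upload-note' is a hypothetical name; because it begins with
# "x-ignore-", no condition statement is required for it, and S3 will
# not act upon the field's value.
fields = { 'x-ignore-upload-note' => 'uploaded via the web form' }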
Example 3-13 defines a method that generates a policy document based on information you provide: an expiration Time object and a hash dictionary representing policy conditions. The method seems rather complex; however, in essence all it does is concatenate a series of condition statement strings into a boilerplate policy document template. The complexity comes from deciding which condition statements to define.
Example 3-13. Build POST policy document: S3.rb
require 'time'  # provides the Time#iso8601 method used below

def build_post_policy(expiration_time, conditions)
  if expiration_time.nil? or not expiration_time.respond_to?(:getutc)
    raise 'Policy document must include a valid expiration Time object'
  end
  if conditions.nil? or not conditions.class == Hash
    raise 'Policy document must include a valid conditions Hash object'
  end

  # Convert conditions object mappings to condition statements
  conds = []
  conditions.each_pair do |name, test|
    if test.nil?
      # A nil condition value means allow anything.
      conds << %{["starts-with", "$#{name}", ""]}
    elsif test.is_a? String
      conds << %{{"#{name}": "#{test}"}}
    elsif test.is_a? Array
      conds << %{{"#{name}": "#{test.join(',')}"}}
    elsif test.is_a? Hash
      operation = test[:op]
      value = test[:value]
      conds << %{["#{operation}", "$#{name}", "#{value}"]}
    elsif test.is_a? Range
      conds << %{["#{name}", #{test.begin}, #{test.end}]}
    else
      raise "Unexpected value type for condition '#{name}': #{test.class}"
    end
  end

  return %{{"expiration": "#{expiration_time.getutc.iso8601}", "conditions": [#{conds.join(",")}]}}
end
To use this method, you describe each policy condition as a mapping from a field name to a value object. You indicate what kind of condition you wish to apply by using different data types for the value object.
| Value Data Type | Condition Applied |
|---|---|
| nil | A starts-with test that will accept any value. |
| String | An equality test using the given string. |
| Array | An equality test, against a value composed of all the array's items combined into a comma-delimited string. |
| Hash | An operation named by the :op mapping, with a value as given by the :value mapping. |
| Range | A range test, where the tested value must lie between the beginning and end values of the Range object provided. |
Here is a command that demonstrates how to use the build_post_policy method to generate a policy document containing each of the five condition types understood by the method.
irb> conditions = {
irb>   'key' => nil,                              # Empty starts-with condition
irb>   'bucket' => 'my-bucket',                   # Equality condition
irb>   'x-amz-meta-mytag' => ['Work','TODO'],     # Equality with multi-values
irb>   'Content-Type' => {:op=>'starts-with',     # Starts-with condition
irb>                      :value=>'text/'},
irb>   'content-length-range' => Range.new(1,50)  # Range condition
irb> }
irb> expiration = Time.now + 60 * 5               # Policy expires in 5 minutes
irb> s3.build_post_policy(expiration, conditions) # The resultant policy document
=> {"expiration": "2008-01-08T10:31:36Z",
    "conditions": [
      ["starts-with", "$key", ""],
      {"bucket": "my-bucket"},
      {"x-amz-meta-mytag": "Work,TODO"},
      ["starts-with", "$Content-Type", "text/"],
      ["content-length-range", 1, 50]
    ]}
Example 3-14 defines a method that builds an HTML form document for performing S3 POST requests. The form document produced by this method follows the structure described in "HTML Form for S3 POST."
If you provide policy conditions as optional arguments, this method will call the build_post_policy method to create a policy document. The policy document will be Base64-encoded, included in the form in the policy input field, and used in combination with your AWS secret key to generate an HMAC signature value that authenticates the form. If you do not provide any policy conditions, the method will produce an anonymous form.
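Example 3-14 relies on two helper methods, encode_base64 and generate_signature, that are defined elsewhere in S3.rb. As a rough sketch of what these helpers must do, assuming @aws_secret_key holds your AWS secret key, they can be built with Ruby's standard base64 and openssl libraries:

require 'base64'
require 'openssl'

# Base64-encode data, removing the newlines the encoder inserts.
def encode_base64(data)
  Base64.encode64(data).gsub("\n", '')
end

# Sign the Base64-encoded policy document with the AWS secret key using
# HMAC-SHA1, then Base64-encode the binary digest to produce the value
# for the form's 'signature' field.
def generate_signature(policy_b64)
  digest = OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha1'),
                                @aws_secret_key, policy_b64)
  Base64.encode64(digest).strip
end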
Example 3-14. Build POST HTML Form: S3.rb
def build_post_form(bucket_name, key, options={})
  fields = []

  # Form is only authenticated if a policy is specified.
  if options[:expiration] or options[:conditions]
    # Generate policy document
    policy = build_post_policy(options[:expiration], options[:conditions])
    puts "POST Policy\n===========\n#{policy}\n\n" if @debug

    # Add the Base64-encoded policy document as the 'policy' field
    policy_b64 = encode_base64(policy)
    fields << %{<input type="hidden" name="policy" value="#{policy_b64}">}

    # Add the AWS access key as the 'AWSAccessKeyId' field
    fields << %{<input type="hidden" name="AWSAccessKeyId" value="#{@aws_access_key}">}

    # Add the signature for the encoded policy document as the 'signature' field
    signature = generate_signature(policy_b64)
    fields << %{<input type="hidden" name="signature" value="#{signature}">}
  end

  # Include any additional fields
  if options[:fields]
    options[:fields].each_pair do |n, v|
      if v.nil?
        # Allow users to provide their own <input> fields as text.
        fields << n
      else
        fields << %{<input type="hidden" name="#{n}" value="#{v}">}
      end
    end
  end

  # Add the vital 'file' input item, which may be a textarea or file.
  if options[:text_input]
    # Use the text_input option, which should specify a textarea or text
    # input field. For example:
    # '<textarea name="file" cols="80" rows="5">Default Text</textarea>'
    fields << options[:text_input]
  else
    fields << %{<input name="file" type="file">}
  end

  # Construct a subdomain URL to refer to the target bucket. The
  # HTTPS protocol will be used if the secure HTTPS option is enabled.
  url = "http#{@secure_http ? 's' : ''}://#{bucket_name}.s3.amazonaws.com/"

  # Construct the entire form.
  form = %{
    <form action="#{url}" method="post" enctype="multipart/form-data">
      <input type="hidden" name="key" value="#{key}">
      #{fields.join("\n")}
      <br>
      <input type="submit" value="Upload to Amazon S3">
    </form>}

  puts "POST Form\n=========\n#{form}\n" if @debug
  return form
end
Let us step through some examples to see how this method works and the policy and form documents it generates. We will start with a very simple anonymous form that is not authenticated, and will therefore be limited to uploading files to a bucket with public-write access. We will use the ${filename} variable as the value for the key field, which means that the object will be given the same name as the uploaded file.
irb> s3.build_post_form('my-bucket', '${filename}')

POST Form
=========
<form action="https://my-bucket.s3.amazonaws.com/" method="post"
      enctype="multipart/form-data">
  <input type="hidden" name="key" value="${filename}">
  <input name="file" type="file">
  <br>
  <input type="submit" value="Upload to Amazon S3">
</form>
The following example is more realistic and useful. In it we will generate an authenticated form, which means we must include policy conditions for each of the input fields included in the form. When a file is uploaded using this form, the resultant S3 object will be named uploads/images/pic.jpg. The object will be made publicly accessible by assigning it the public-read ACL setting, and it will be identified as a JPEG image by its content type value. To ensure that the user uploads a file of a reasonable size, we will apply a content length range restriction of between 10 KB and 200 KB. Finally, after the user uploads a file they will be redirected to the URL http://localhost/post_upload.
# Fields to set the object's access permissions and content type
irb> fields = {
irb>   'acl' => 'public-read',
irb>   'Content-Type' => 'image/jpeg',
irb>   'success_action_redirect' => 'http://localhost/post_upload'
irb> }
# Conditions for the mandatory 'bucket' and 'key' fields, as well as the
# additional fields specified above. Also includes a byte range condition.
irb> conditions = {
irb>   'bucket' => 'my-bucket',
irb>   'key' => 'uploads/images/pic.jpg',
irb>   'acl' => 'public-read',
irb>   'Content-Type' => 'image/jpeg',
irb>   'success_action_redirect' => 'http://localhost/post_upload',
irb>   'content-length-range' => Range.new(10240, 204800)
irb> }
# Form expires in 24 hours
irb> expiration = Time.now + 3600 * 24
# Combine all the optional form components into a single hash dictionary
irb> options = {
irb>   :expiration => expiration,
irb>   :conditions => conditions,
irb>   :fields => fields
irb> }
# Generate the form. We have turned on debugging so both the policy and
# form documents will be printed out in full.
irb> s3.build_post_form('my-bucket', 'uploads/images/pic.jpg', options)

POST Policy
===========
{"expiration": "2008-01-09T11:29:34Z",
 "conditions": [
   {"success_action_redirect": "http://localhost/post_upload"},
   {"bucket": "my-bucket"},
   {"Content-Type": "image/jpeg"},
   {"key": "uploads/images/pic.jpg"},
   ["content-length-range", 10240, 204800],
   {"acl": "public-read"}
 ]}

POST Form
=========
<form action="https://my-bucket.s3.amazonaws.com/" method="post"
      enctype="multipart/form-data">
  <input type="hidden" name="key" value="uploads/images/pic.jpg">
  <input type="hidden" name="policy" value="eyJleH...KICAg=">
  <input type="hidden" name="AWSAccessKeyId" value="ABCDEFGHIJ1234567890">
  <input type="hidden" name="signature" value="LPQ+lb6L0ykrDdUqc2usbEPmsjA=">
  <input type="hidden" name="success_action_redirect" value="http://localhost/post_upload">
  <input type="hidden" name="Content-Type" value="image/jpeg">
  <input type="hidden" name="acl" value="public-read">
  <input name="file" type="file">
  <br>
  <input type="submit" value="Upload to Amazon S3">
</form>
For our final example, we will allow our users to type HTML code into a text box and submit this data to S3 instead of a file. To do this, we provide our own text area input field to override the default file input field.
irb> fields = {'acl' => 'public-read', 'Content-Type' => 'text/html'}
irb> key = 'users/posts/comment-01234.html'
irb> conditions = {
irb>   'bucket' => 'my-bucket',
irb>   'key' => {:op => 'starts-with', :value => 'users/posts/comment-'},
irb>   'acl' => 'public-read',
irb>   'Content-Type' => 'text/html'
irb> }
# Define our own input field item named 'file' to accept textual data
irb> input_field = '<textarea name="file" cols="60" rows="10"></textarea>'
irb> s3.build_post_form('my-bucket', key, :fields => fields,
irb>      :expiration => Time.now + 60, :conditions => conditions,
irb>      :text_input => input_field)

POST Policy
===========
{"expiration": "2008-01-11T05:10:33Z",
 "conditions": [
   {"bucket": "my-bucket"},
   {"Content-Type": "text/html"},
   ["starts-with", "$key", "users/posts/comment-"],
   {"acl": "public-read"}
 ]}

POST Form
=========
<form action="https://my-bucket.s3.amazonaws.com/" method="post"
      enctype="multipart/form-data">
  <input type="hidden" name="key" value="users/posts/comment-01234.html">
  <input type="hidden" name="policy" value="eyJleHBp...aIiwKI">
  <input type="hidden" name="AWSAccessKeyId" value="ABCDEFGHIJ1234567890">
  <input type="hidden" name="signature" value="cVnaqAdLzRd3g8sEZx9xSs63hw0=">
  <input type="hidden" name="Content-Type" value="text/html">
  <input type="hidden" name="acl" value="public-read">
  <textarea name="file" cols="60" rows="10"></textarea>
  <br>
  <input type="submit" value="Upload to Amazon S3">
</form>
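To put a generated form in front of users, you can embed it in a complete, UTF-8 encoded web page. Here is a minimal sketch that wraps the anonymous form from the first example in such a page and writes it to disk; the upload.html file name is our own choice, and s3 is the client object used in the examples above.

# Embed a generated form in a complete UTF-8 web page and save it.
form = s3.build_post_form('my-bucket', '${filename}')
page = <<HTML
<html>
<head>
  <title>File Upload to S3</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
#{form}
</body>
</html>
HTML
File.open('upload.html', 'w') { |f| f.write(page) }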