S3 objects are resources that store data. They are somewhat similar to the files in a standard computer system, but there are a number of important differences, which were summarized in “S3 Architecture.”
An object can contain up to 5 GB of data, or it can be entirely empty. An object can store two types of information: data and metadata. The data stored by an object is its main content, such as a photo or text document. In addition to the data content, an object can store metadata that provides further information about the object, such as when it was created and the type of data it contains. You can store your own metadata information when you create or replace an object.
Each object resource in S3 can have access control permissions applied to it, allowing you to keep the object private, or to make it available to other S3 users or the general public.
Each object in S3 is identified by a name, known as its key, which must be unique within the bucket that contains it. Object keys must not be longer than 1,024 bytes when encoded as UTF-8, and they can contain almost any characters, including spaces and punctuation. Objects are similar to files, so it makes sense to use obvious names for your objects, as you would for a file, such as My Birthday Cake.jpg.
One major difference between the S3 storage model and the average computer file system is that S3 has no notion of a hierarchical folder or directory structure. S3 buckets contain objects—that is the beginning and end of the hierarchy imposed by the storage model. If you wish to impose a hierarchical structure for your objects in S3 to help organize and search them, you must construct this hierarchy yourself using the flexible naming capabilities of object keys. You can do this by choosing a special character or string to mark the boundaries between components of a hierarchical path and by storing your objects with key names that describe their full path in the hierarchy.
Because objects can be accessed using URIs, as if S3 was a standard web server, the most obvious character to use for delimiting the components of a hierarchical path is a forward slash (/). If you use slash characters in your hierarchical object keys, the resulting paths will look like the URIs everyone is familiar with. For example, suppose you want to store your photo collection in a bucket called “pictures,” and you will use object keys to simulate a directory hierarchy. You could store your pictures with keys like 2007/March/My Birthday Cake.jpg. Not only would this make it possible to search for specific objects in the hierarchy using the S3 functionality we will discuss in “Listing Objects” later in this chapter, but if you made your pictures publicly accessible, they would be available at a sensible URL, such as http://s3.amazonaws.com/pictures/2007/March/My Birthday Cake.jpg.
S3 allows a broad range of characters to be used in object key names, including all Unicode characters between U+0001 and U+10FFFF. However, this range of legal key name characters includes some unprintable characters that cannot be properly represented in the XML documents S3 returns when it lists the object keys in a bucket. It is therefore possible to create objects with key names that cannot be parsed from object listings using standard XML parsing tools. You should avoid using such problematic object names by ensuring that object keys only include characters that can be represented in XML documents.
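If you generate key names programmatically, a guard along the following lines can reject the problematic names before an upload is attempted. This is a minimal sketch: xml_safe_key? is our own helper, not part of this chapter's S3.rb library, and the check is a simplified version of the XML 1.0 character rules.

# Sketch: accept only key names whose characters can appear in the
# XML listing documents S3 returns. XML 1.0 allows tab, newline,
# carriage return, and (with a few rare exceptions) U+0020 upward.
def xml_safe_key?(key)
  key.unpack('U*').all? do |codepoint|
    codepoint == 0x9 || codepoint == 0xA || codepoint == 0xD ||
      codepoint >= 0x20
  end
end

xml_safe_key?("2007/March/My Birthday Cake.jpg")  # => true
xml_safe_key?("report\x01.txt")                   # => false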
An object may have metadata associated with it to further describe properties of the object. This metadata is made up of short text strings comprising a name and one or more values. Both the name and values used in metadata items must conform to UTF-8 encoding.
The S3 service can provide two kinds of object metadata:
System metadata is information generated or used by the S3 service itself. This metadata is made available to you as read-only information. Metadata items that may be available include the request ID headers x-amz-request-id and x-amz-id-2, which assign a unique identifier to each operation performed by S3. This information can be useful to Amazon staff should they need to troubleshoot problems you are experiencing with the service.
User metadata is information you provide yourself: you can store up to 2 KB of your own metadata with each of your objects. This custom information is not interpreted by the S3 service; it is merely stored by it and returned when the object is retrieved.
The REST interface of the S3 service supports metadata as an extension to the standard HTTP header mechanism. Metadata is uploaded to the service as HTTP request headers and is retrieved from the service as response headers. This overlap between metadata as an S3 construct and as standard HTTP headers has interesting consequences. On one hand, you must be careful to avoid accidentally using metadata names that clash with HTTP headers. The REST interface makes this possible by recognizing a special metadata name prefix, x-amz-meta-, which indicates that a header contains metadata.
On the other hand, you can deliberately store a range of HTTP headers as metadata with your object so these headers can be returned with the HTTP response when the object is retrieved. By uploading metadata items without the x-amz-meta- prefix, you can store certain HTTP response header values as metadata with your object and control how clients, such as web browsers, behave when they download your objects. The most common example of using standard HTTP headers as metadata is the Content-Type header. You can specify the content type of your objects using this metadata item, so web browsers can recognize the type of the object you have stored.
S3 does not allow you to set arbitrary metadata items to be returned as HTTP headers; only some header names are recognized as legal HTTP headers. Any header with a name the service does not recognize is discarded. HTTP header names the service does recognize and store include: Content-Type, Content-Language, Expires, Cache-Control, Content-Disposition, and Content-Encoding.
Objects stored in S3 are immutable. An object’s key, data, and metadata information cannot be altered after the object is created. For example, you cannot change an object’s key to reflect a filename change, nor can you include new or changed information in its metadata. Most importantly, you cannot change the data content of an existing object, or add new data to it, without overwriting that object.
When you need to change an existing object, you must re-create the object from scratch. If you are storing your novel in S3, and you find and correct a single spelling mistake, you will have to upload the entire file again to save the corrected version. The same holds true if you are creating an object that contains a lot of data and the upload fails midway through; there is no way to resume the upload, and you will have to start again from scratch.
This feature of the S3 service has a good and a bad side. It is good because it allows S3 to manage objects efficiently behind the scenes, greatly simplifying Amazon’s service architecture and ensuring they can make the service available at a low cost. It can be bad because S3 developers sometimes have to make their applications more complicated to add more intelligence to the simple data-storage model. In the trade-off between making the S3 service work well and cheaply, and making developers’ lives easier, Amazon has elected to do the former. At least it will help keep us all in work.
The immutability of S3 objects has far-reaching implications for how you should design your S3-based applications. Unless your application uses data that very rarely changes, or data that changes so drastically that replacing whole objects at a time is a reasonable option, you will have to carefully consider how your application will handle data updates. The following few paragraphs summarize some of the approaches you could take in an application that must update data objects stored in S3.
If you wish to keep your application simple, and you are not dealing with large data items, you may choose to live with the overhead of re-creating objects whenever you need to reflect local data changes. This approach will require you to upload a new object whenever an object’s key name, data, or metadata content changes. The chief advantage of this approach, besides being simple, is that you can retain a direct relationship between an object key in S3 and the actual object data. This relationship is important if you intend to use S3 as a standard web server, where your key names act as user-friendly URLs.
If your object data does not change much, but you frequently need to update the names of objects—such as to reflect hierarchical changes or merely to present different object names to different users—you may wish to implement a remapping tool that converts object names to the real S3 object key names and vice versa.
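As a rough illustration of this remapping idea, here is a minimal sketch; the KeyMap class is our own hypothetical helper, not part of this chapter's S3.rb library, and a real tool would persist the map in a local database rather than in memory.

# Sketch: a two-way dictionary between user-visible object names and
# the permanent key names actually used in S3. Renaming an object
# only updates the local map; no S3 upload is required.
class KeyMap
  def initialize
    @name_to_key = {}
    @key_to_name = {}
  end

  # Register a new object under its user-visible name
  def add(name, s3_key)
    @name_to_key[name] = s3_key
    @key_to_name[s3_key] = name
  end

  # Renaming only touches the local map; the S3 object is unchanged
  def rename(old_name, new_name)
    add(new_name, @name_to_key.delete(old_name))
  end

  def key_for(name)
    @name_to_key[name]
  end
end

map = KeyMap.new
map.add('Holiday Photos/beach.jpg', 'obj-000001')
map.rename('Holiday Photos/beach.jpg', 'Travel/beach.jpg')
map.key_for('Travel/beach.jpg')   # => "obj-000001"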
If your application uses large amounts of data that change often, you may have no choice but to completely restructure the way your data are stored in S3. This approach could involve splitting large data files into smaller pieces, such that each piece can be stored as a separate object in S3. Then when your local data files change, you need only replace those pieces that are affected or add new pieces, rather than replacing the whole data set.
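To make the idea concrete, here is a rough sketch that splits a local file into fixed-size pieces and stores each piece as its own object, using the create_object method defined later in Example 3-8. The chunk size and key-naming scheme are our own illustrative choices.

CHUNK_SIZE = 1024 * 1024  # 1 MB pieces; choose a size to suit your data

# Sketch: store a local file as numbered chunk objects in S3. Each
# chunk becomes an object named like 'novel.txt.chunk-0003', so a
# later edit to the file requires re-uploading only the changed chunks.
def store_in_chunks(s3, bucket_name, file_path)
  chunk_keys = []
  File.open(file_path, 'rb') do |io|
    index = 0
    while chunk = io.read(CHUNK_SIZE)
      key = "#{File.basename(file_path)}.chunk-#{'%04d' % index}"
      s3.create_object(bucket_name, key, :data => chunk)
      chunk_keys << key
      index += 1
    end
  end
  chunk_keys  # keep these (e.g., in a local database) for reassembly
end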
An application capable of restructuring data files in this way will clearly be quite complicated. It will require a lot of logic to manage the decomposition and recomposition of data files, and to maintain a database of mappings defining the relationship between objects in S3 and your local data.
In the section “ElasticDrive: S3 As a Virtual Block Device” in Chapter 4, we discuss a tool that allows you to create a standard filesystem on top of your S3 storage space. This tool stores the raw data blocks that represent a file system in S3 rather than the individual files. This is an extreme example of restructuring data to store it in S3.
To create an object in S3, you send a PUT request containing the object’s data and metadata to the service, with a URI specifying the key name of the object and the bucket it will be stored in. The object’s data content is provided as the body of the request, while the metadata is provided as request headers. When the object is successfully stored, S3 will return an HTTP 200 response message.
Most web browsers cannot perform PUT requests. To create objects using a web browser, you must use the alternative POST request method discussed in “Create Objects from a Web Browser Using POST.”
Remember that objects are immutable in S3 and can only be created, not updated. If you create an object using a key name that is already present in S3, the existing object will be replaced.
When an object with some data content is created or updated, an HTTP Content-Length header should be included with the request to inform S3 of the number of bytes it is expected to receive and store. In addition to this header, a range of extra information may be provided when you create an object.
If you include the Content-MD5 header with a Base64-encoded MD5 hash of your data, S3 will perform a data verification check to ensure the data it received exactly matches the data you sent. Any discrepancy will cause the request to fail and prevent you from accidentally storing incorrect data in the service. We highly recommend taking advantage of this feature, and we have included it in our example implementation code. Alternatively, you can perform your own verification check using the ETag response header returned by S3, because this header contains a hex-encoded MD5 hash value of the data that the service received.
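The ETag check might look something like the following sketch. Here, response stands for the object returned by the do_rest method for the PUT request (the same response object our other methods receive), and the data is the Hello World string used in the examples below.

require 'digest/md5'

# Sketch of ETag-based upload verification. For a simple PUT, the
# ETag returned by S3 is the hex-encoded MD5 hash of the stored data.
data = 'Hello World'
local_md5_hex = Digest::MD5.hexdigest(data)

etag = response['etag'].delete('"')   # strip the surrounding quotes
raise 'Upload verification failed!' unless etag == local_md5_hex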
The access control settings of an object can be set by the same request that creates the object, by including a special header named x-amz-acl with a value matching one of the canned ACL policy names available in the service (see “Canned Access Policies” later in this chapter for more information).
Example 3-8 defines a method that creates an object in S3, uploads its data and metadata content, and sets the object’s access control permission settings.
Example 3-8. Create an object: S3.rb
def create_object(bucket_name, object_key, opts={})
  # Initialize local variables for the provided option items
  data = (opts[:data] ? opts[:data] : '')
  headers = (opts[:headers] ? opts[:headers].clone : {})
  metadata = (opts[:metadata] ? opts[:metadata].clone : {})

  # The Content-Length header must always be set when data is uploaded.
  headers['Content-Length'] =
    (data.respond_to?(:stat) ? data.stat.size : data.size).to_s

  # Calculate an md5 hash of the data for upload verification
  if data.respond_to?(:stat)
    # Generate MD5 digest from file data one chunk at a time
    md5_digest = Digest::MD5.new
    File.open(data.path, 'rb') do |io|
      buffer = ''
      md5_digest.update(buffer) while io.read(4096, buffer)
    end
    md5_hash = md5_digest.digest
  else
    md5_hash = Digest::MD5.digest(data)
  end
  headers['Content-MD5'] = encode_base64(md5_hash)

  # Set the canned policy, may be: 'private', 'public-read',
  # 'public-read-write', 'authenticated-read'
  headers['x-amz-acl'] = opts[:policy] if opts[:policy]

  # Set an explicit content type if none is provided, otherwise the
  # ruby HTTP library will use its own default type
  # 'application/x-www-form-urlencoded'
  if not headers['Content-Type']
    headers['Content-Type'] =
      data.respond_to?(:to_str) ? 'text/plain' : 'application/octet-stream'
  end

  # Convert metadata items to headers using the
  # S3 metadata header name prefix.
  metadata.each do |n,v|
    headers["x-amz-meta-#{n}"] = v
  end

  uri = generate_s3_uri(bucket_name, object_key)
  do_rest('PUT', uri, data, headers)
  return true
end
The best way to become familiar with the workings of the object creation API is to see some examples. We will start by creating a simple text document inside our my-bucket bucket.
irb> s3.create_object('my-bucket', 'Hello.txt', :data => 'Hello World')

REQUEST DESCRIPTION
=======
PUT\n
sQqNsWTgdUEFt6mb5y4/5Q==\n
text/plain\n
Wed, 07 Nov 2007 11:13:57 GMT\n
/my-bucket/Hello.txt

REQUEST
=======
Method : PUT
URI : https://my-bucket.s3.amazonaws.com/Hello.txt
Headers:
  Expect=100-continue
  Authorization=AWS ABCDEFGHIJ1234567890:vf6Slm3v09YPjKyyHhVn0BshUuA=
  Date=Wed, 07 Nov 2007 11:13:57 GMT
  Content-Type=text/plain
  Host=my-bucket.s3.amazonaws.com
  Content-Length=11
  Content-MD5=sQqNsWTgdUEFt6mb5y4/5Q==
Request Body Data:
Hello World

RESPONSE
========
Status : 200 OK
Headers:
  x-amz-id-2=MdRHrORD1Kw+Ps9zUhrzKczu9Jd0/1V2/0ITwK5vp2KKYshUGnGai7/htu1t/KLX
  etag="b10a8db164e0754105b7a99be72e3fe5"
  date=Wed, 07 Nov 2007 11:14:00 GMT
  x-amz-request-id=19E1F76D7C1864E9
  server=AmazonS3
  content-length=0
The object created by this command is quite simple. It contains only text content without any user-defined metadata, and because it does not specify an access control policy, it will be made private by default. The object is stored in a resource that can be accessed using two different URIs with slightly different formats: http://s3.amazonaws.com/my-bucket/Hello.txt or http://my-bucket.s3.amazonaws.com/Hello.txt; but because it is a private object, you will receive an AccessDenied error message from S3 if you try to load this location in your web browser.
We have not yet discussed how to list the objects stored in your bucket. For the time being, we will make all your objects publicly readable, so you can access them through a web browser and confirm that they have been stored in S3. Replace the original Hello.txt object with a new version that uses the canned access control policy public-read.
irb> s3.create_object('my-bucket', 'Hello.txt', :data => 'Hello World',
       :policy => 'public-read')

REQUEST DESCRIPTION
=======
PUT\n
sQqNsWTgdUEFt6mb5y4/5Q==\n
text/plain\n
Wed, 07 Nov 2007 11:16:58 GMT\n
x-amz-acl:public-read\n
/my-bucket/Hello.txt

REQUEST
=======
Method : PUT
URI : https://my-bucket.s3.amazonaws.com/Hello.txt
Headers:
  Expect=100-continue
  Authorization=AWS ABCDEFGHIJ1234567890:1edltaBG0ImEqCafMgUeHp6APlE=
  x-amz-acl=public-read
  Date=Wed, 07 Nov 2007 11:16:58 GMT
  Content-Type=text/plain
  Host=my-bucket.s3.amazonaws.com
  Content-Length=11
  Content-MD5=sQqNsWTgdUEFt6mb5y4/5Q==
Request Body Data:
Hello World

RESPONSE
========
Status : 200 OK
Headers:
  x-amz-id-2=8gStUFaxZf3V+rMZ/hiimIBnsKr4QZHuxiUuWgSWMzsP8QEHVl6Z1aUMMpzQgA6N
  etag="b10a8db164e0754105b7a99be72e3fe5"
  date=Wed, 07 Nov 2007 11:17:01 GMT
  x-amz-request-id=77FACC965DBFBDD0
  server=AmazonS3
  content-length=0
Now when you visit the URI http://my-bucket.s3.amazonaws.com/Hello.txt in your browser, you will see the contents of the object displayed as text.
The create_object method automatically sets the Content-Type HTTP request header to the value text/plain when an object is created with textual data. In some circumstances you may wish to override this behavior and create text objects with a different content type, such as HTML pages. To set your own HTTP headers, you can provide them as a hash to the method’s headers parameter.
Here is a command to upload a simple HTML document to S3 with the content type set to text/html, so a web browser will interpret it correctly.
irb> headers = {'Content-Type'=>'text/html'}
irb> html = '<b>Webpage</b> <i>content</i>'
irb> s3.create_object('my-bucket', 'WebPage.html', :data => html,
       :policy => 'public-read', :headers => headers)

REQUEST DESCRIPTION
=======
PUT\n
rD3P+CBXWNyYwtcqKuijpQ==\n
text/html\n
Wed, 07 Nov 2007 11:20:43 GMT\n
x-amz-acl:public-read\n
/my-bucket/WebPage.html

REQUEST
=======
Method : PUT
URI : https://my-bucket.s3.amazonaws.com/WebPage.html
Headers:
  Expect=100-continue
  Authorization=AWS ABCDEFGHIJ1234567890:2dreEnJ/XP2MIg52AO2OO/oJlvU=
  x-amz-acl=public-read
  Date=Wed, 07 Nov 2007 11:20:43 GMT
  Content-Type=text/html
  Host=my-bucket.s3.amazonaws.com
  Content-Length=29
  Content-MD5=rD3P+CBXWNyYwtcqKuijpQ==
Request Body Data:
<b>Webpage</b> <i>content</i>

RESPONSE
========
Status : 200 OK
Headers:
  x-amz-id-2=DXdHJYXrmJuKHWJwJY2Wt0pD4KYWOrNcQoTe3rMyOXBLKYQrtxgnLPxr7k3K7qVm
  etag="ac3dcff8205758dc98c2d72a2ae8a3a5"
  date=Wed, 07 Nov 2007 11:20:45 GMT
  x-amz-request-id=2D54144B83D6E9F2
  server=AmazonS3
  content-length=0
If you visit this object’s URL in your web browser, you will see the page displayed with the correct HTML formatting. You can use this technique to associate a range of HTTP response headers with your objects, such as Content-Language for specifying the language a page is written in, and Expires to set how long pages retrieved from S3 should be cached.
To store metadata information other than HTTP headers with an object, you can provide a set of metadata name and value items to the create_object method in the metadata parameter. The method will send these items to the service as HTTP request headers, but only after renaming the items to include the prefix x-amz-meta-, which indicates the header is metadata.
irb> s3.create_object('my-bucket', 'Metadata.txt', :data => 'I have metadata!',
       :policy => 'public-read', :metadata => {'Description'=>'A welcome message'})

REQUEST DESCRIPTION
=======
PUT\n
LWtbsg8yIkBEjFolkkq09Q==\n
text/plain\n
Wed, 07 Nov 2007 11:22:20 GMT\n
x-amz-acl:public-read\n
x-amz-meta-description:A welcome message\n
/my-bucket/Metadata.txt

REQUEST
=======
Method : PUT
URI : https://my-bucket.s3.amazonaws.com/Metadata.txt
Headers:
  x-amz-meta-Description=A welcome message
  Expect=100-continue
  Authorization=AWS ABCDEFGHIJ1234567890:ICUUaodoew22fxB0bQFthsy7zUE=
  x-amz-acl=public-read
  Date=Wed, 07 Nov 2007 11:22:20 GMT
  Content-Type=text/plain
  Host=my-bucket.s3.amazonaws.com
  Content-Length=16
  Content-MD5=LWtbsg8yIkBEjFolkkq09Q==
Request Body Data:
I have metadata!

RESPONSE
========
Status : 200 OK
Headers:
  x-amz-id-2=jsMfJ0OYqhY+sH5zhybkZhwuEISITgNcNcQ5nm0jkoULXdnSWh7nV+revG1XReTn
  etag="2d6b5bb20f322240448c5a25924ab4f5"
  date=Wed, 07 Nov 2007 11:22:22 GMT
  x-amz-request-id=A70A4451681D9C6B
  server=AmazonS3
  content-length=0
Web browsers do not generally display HTTP header information, so you will not be able to see this object’s metadata information if you visit the URL in a browser. We will demonstrate how to retrieve an object’s data and metadata directly in the next section, “Retrieving Objects.”
Finally, we should note that the implementation we provided can upload files to S3, as well as text strings. To upload the contents of a file to S3, you supply a Ruby file object to the method as its data parameter. Here is a command that uploads an image file called image.png from the local directory, makes it publicly accessible, and sets the content type to image/png, so web browsers will display it correctly.
irb> s3.create_object('my-bucket', 'image.png', :data => File.new('image.png', 'rb'), :policy => 'public-read', :headers => {'Content-Type'=>'image/png'})
There are two ways to retrieve information about an object from S3: using a GET request or a HEAD request.
To retrieve all of an object’s data, including both its content and metadata, you send a GET request to the service with a URI specifying the bucket the object is stored in and its key name. The response to a GET request will contain the object’s data in the response body and its metadata in the response headers.
To retrieve only an object’s metadata and not its contents, you send a HEAD request to the service instead. The response to a HEAD request includes the metadata headers but contains no body. Both GET and HEAD requests will return an HTTP 200 status code when they are successful.
You may wonder why you would ever use the HEAD method, when you can simply use the GET method to retrieve all of an object’s data and ignore the response body, if you are not interested in the object’s contents. It is worthwhile to use HEAD requests when you are only interested in an object’s metadata, because your client will not have to maintain an open network connection any longer than necessary or manually close the connection to discard the unwanted response-body data.
Example 3-9 and Example 3-10 define methods that retrieve data from an object in S3. The former uses a GET request and retrieves the object’s content data and its metadata, and the latter uses a HEAD request and only retrieves the metadata.
Example 3-9. Retrieve an object: S3.rb
def get_object(bucket_name, object_key, headers={})
  uri = generate_s3_uri(bucket_name, object_key)

  if block_given?
    response = do_rest('GET', uri, nil, headers) {|segment| yield(segment)}
  else
    response = do_rest('GET', uri, nil, headers)
  end

  response_headers = {}
  metadata = {}
  response.each_header do |name,value|
    if name.index('x-amz-meta-') == 0
      metadata[name['x-amz-meta-'.length..-1]] = value
    else
      response_headers[name] = value
    end
  end

  result = {
    :metadata => metadata,
    :headers => response_headers
  }
  result[:body] = response.body if not block_given?
  return result
end
The get_object method performs differently when a code block is provided to the method, in which case the data is downloaded from S3 a piece at a time and is passed on to the code block for processing as it arrives. This behavior allows us to stream object downloads, processing the data as it arrives rather than leaving Ruby to store all the object’s data in memory. This has obvious advantages when you are downloading large objects. If the download is streamed, the object’s data is not returned in the method’s dictionary result object.
Example 3-10. Retrieve an object’s metadata: S3.rb
def get_object_metadata(bucket_name, object_key, headers={})
  uri = generate_s3_uri(bucket_name, object_key)
  response = do_rest('HEAD', uri, nil, headers)

  response_headers = {}
  metadata = {}
  response.each_header do |name,value|
    if name.index('x-amz-meta-') == 0
      metadata[name[11..-1]] = value
    else
      response_headers[name] = value
    end
  end

  return {
    :metadata => metadata,
    :headers => response_headers
  }
end
Let us use the objects we created earlier to demonstrate how to use the get_object method. We will retrieve the Metadata.txt object from the my-bucket bucket, because this object contains both metadata and content data. We will store the results in a Ruby variable, so we can examine them more closely.
irb> obj = s3.get_object('my-bucket', 'Metadata.txt')
REQUEST DESCRIPTION
=======
GET\n
\n
\n
Wed, 07 Nov 2007 11:25:04 GMT\n
/my-bucket/Metadata.txt

REQUEST
=======
Method : GET
URI : https://my-bucket.s3.amazonaws.com/Metadata.txt
Headers:
  Authorization=AWS ABCDEFGHIJ1234567890:m2Df9z1SeO7bmM4G1NLRSE6YkxU=
  Date=Wed, 07 Nov 2007 11:25:04 GMT
  Host=my-bucket.s3.amazonaws.com

RESPONSE
========
Status : 200 OK
Headers:
  last-modified=Wed, 07 Nov 2007 11:22:22 GMT
  x-amz-meta-description=A welcome message
  x-amz-id-2=+VstU+ssan+c+cDuFWL7fQtvm4KqGWxUPwhQOc2JF7mjhzfNbiGI1jJ6NvsTpX+I
  content-type=text/plain
  etag="2d6b5bb164e0754105b7a99be72e3fe5"
  date=Wed, 07 Nov 2007 11:25:07 GMT
  x-amz-request-id=269211548D652727
  server=AmazonS3
  content-length=16
{:headers=>
{"last-modified"=>"Wed, 07 Nov 2007 11:22:22 GMT",
"x-amz-id-2"=>
"Q9Y/kBSntTXpWI2Vmk56aqf07P5MzFxW06xx25hDQIoDZhhtFtA3Ue2ojF5c7lkF",
"date"=>"Wed, 07 Nov 2007 11:25:07 GMT",
"etag"=>"\"2d6b5bb20f322240448c5a25924ab4f5\"",
"content-type"=>"text/plain",
"x-amz-request-id"=>"EBECEC8CCC3AFEB3",
"server"=>"AmazonS3",
"content-length"=>"16"},
:body=>"I have metadata!",
:metadata=>{"description"=>"A welcome message"}}
The debug log above shows that the response message includes the object’s Description metadata item in a response header called x-amz-meta-description, and its content data in the response body. We can obtain these details through the Ruby data structures returned by the method.
irb> obj[:body]
=> "I have metadata!"
irb> obj[:metadata]
=> {"description"=>"A welcome message"}
irb> obj[:headers]['content-type']
=> "text/plain"
To save the data content of an S3 object to a file, you can write the contents of the :body hash item from the method’s result to a File object.
irb> File.open('Metadata.txt','w') do |file|
irb>   file.write(obj[:body])
irb> end
irb> puts File.new('Metadata.txt').read
I have metadata!
If the object in S3 is large, you should stream the download straight to a file, rather than having Ruby store the data in memory in the :body variable. To stream the download, you must define a code block to process the download a piece at a time, and provide this code block to the get_object method.
# Stream an object download from S3
irb> File.open('Metadata.txt','w') do |file|
irb>   obj = s3.get_object('my-bucket','Metadata.txt') do |data|
irb>     file.write(data)
irb>   end
irb> end

# The object's data is stored to a file
irb> puts File.new('Metadata.txt').read
I have metadata!

# However, the object's data is not available in the result object
irb> obj[:body]
=> nil
If you are only interested in an object’s metadata, use the get_object_metadata method instead of the get_object method.
irb> obj_meta = s3.get_object_metadata('my-bucket', 'Metadata.txt')

REQUEST DESCRIPTION
=======
HEAD\n
\n
\n
Wed, 07 Nov 2007 11:26:58 GMT\n
/my-bucket/Metadata.txt

REQUEST
=======
Method : HEAD
URI : https://my-bucket.s3.amazonaws.com/Metadata.txt
Headers:
  Authorization=AWS ABCDEFGHIJ1234567890:0BpIagDvtNXIBMAiquTFkq3qmC8=
  Date=Wed, 07 Nov 2007 11:26:58 GMT
  Host=my-bucket.s3.amazonaws.com

RESPONSE
========
Status : 200 OK
Headers:
  last-modified=Wed, 07 Nov 2007 11:22:22 GMT
  x-amz-meta-description=A welcome message
  x-amz-id-2=NhMrsU/5vunkvEMo51x4aXKYRM/QSp5JJSptv51N0i7Oyb8LMw+NcPXVtC+GBNRk
  content-type=text/plain
  etag="2d6b5bb20f322240448c5a25924ab4f5"
  date=Wed, 07 Nov 2007 11:27:01 GMT
  x-amz-request-id=F9793916FB0DAF60
  server=AmazonS3
  content-length=16

{:headers=>
  {"last-modified"=>"Wed, 07 Nov 2007 11:22:22 GMT",
   "x-amz-id-2"=>
    "NhMrsU/5vunkvEMo51x4aXKYRM/QSp5JJSptv51N0i7Oyb8LMw+NcPXVtC+GBNRk",
   "date"=>"Wed, 07 Nov 2007 11:27:01 GMT",
   "etag"=>"\"2d6b5bb20f322240448c5a25924ab4f5\"",
   "content-type"=>"text/plain",
   "x-amz-request-id"=>"F9793916FB0DAF60",
   "server"=>"AmazonS3",
   "content-length"=>"16"},
 :metadata=>{"description"=>"A welcome message"}}

irb> obj_meta[:body]
=> nil
irb> obj_meta[:metadata]
=> {"description"=>"A welcome message"}
irb> obj_meta[:headers]['content-type']
=> "text/plain"
S3 supports a specialized set of HTTP request headers that can be used to exert more control over how and when objects are retrieved with GET requests. These headers can be used to cause GET requests to only retrieve objects when certain conditions are met, or to retrieve only a specific portion of an object’s content data.
Table 3-3 describes the conditions that can be applied to GET requests by providing specific request headers.
Table 3-3. Headers for conditional GET requests

Header Name | Description |
---|---|
If-Modified-Since | Return the object only if it has been modified since the specified date; otherwise S3 returns the status code 304 Not Modified. |
If-Unmodified-Since | Return the object only if it has not been modified since the specified date; otherwise S3 returns the status code 412 Precondition Failed. |
If-Match | Return the object only if its ETag (MD5 hash) value matches the value specified; otherwise S3 returns the status code 412 Precondition Failed. |
If-None-Match | Return the object only if its ETag value differs from the value specified; otherwise S3 returns the status code 304 Not Modified. |
Range | Return only the specified byte range of the object's data content, such as bytes=0-99 for the first hundred bytes, or bytes=-9 for the last nine. |
To demonstrate how to apply these conditional request headers, let us try performing some conditional requests. The first request applies the Range request header to return only the last nine bytes of the Metadata.txt object’s data content—the word “metadata!” The second request applies the If-Unmodified-Since header to return the object only if it has been unchanged since 1994. This request will fail, because the object is obviously newer than that. The third request applies the If-Modified-Since header and this time the request will succeed, because the object has been modified since 1994.
irb> obj = s3.get_object('my-bucket', 'Metadata.txt', {'Range'=>'bytes=-9'})
irb> obj[:body]
=> "metadata!"

irb> obj = s3.get_object('my-bucket', 'Metadata.txt',
       {'If-Unmodified-Since'=>'Sat, 29 Oct 1994 19:43:31 GMT'})
AWS::ServiceError: HTTP Error: 412 - Precondition Failed, AWS Error:
PreconditionFailed - At least one of the pre-conditions you specified did not hold

irb> obj = s3.get_object('my-bucket', 'Metadata.txt',
       {'If-Modified-Since'=>'Sat, 29 Oct 1994 19:43:31 GMT'})
irb> obj[:body]
=> "I have metadata!"
To obtain a listing of the objects you have stored in a bucket, you send a GET request to a URI that specifies the bucket resource. S3 will reply with an HTTP 200 response message that contains an XML document in the response body. This XML object-listing document contains an inventory of the objects in a bucket, including important information about each object, such as its key name, size, last modification date, and the MD5 hash value of its data.
Here is an XML document returned by the operation:
<ListBucketResult xmlns='http://s3.amazonaws.com/doc/2006-03-01/'>
  <Name>my-bucket</Name>
  <Prefix/>
  <Marker/>
  <MaxKeys>1000</MaxKeys>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>Metadata.txt</Key>
    <LastModified>2007-11-06T02:52:41.000Z</LastModified>
    <ETag>"2d6b5bb20f322240448c5a25924ab4f5"</ETag>
    <Size>16</Size>
    <Owner>
      <ID>1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b</ID>
      <DisplayName>jamesmurty</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
  <Contents>
    <Key>WebPage.html</Key>
    <LastModified>2007-11-06T02:51:35.000Z</LastModified>
    <ETag>"ac3dcff8205758dc98c2d72a2ae8a3a5"</ETag>
    <Size>29</Size>
    <Owner>
      <ID>1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b</ID>
      <DisplayName>jamesmurty</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>
This document has a complex structure and contains much more information than just an inventory of our objects. Let us work through the elements in this document structure by dividing them into three categories: object details, truncated listings, and searching.
In an object-listing document, the information that is most apparent, and most immediately useful, is the actual list of objects. In the ListBucketResult document structure, objects are listed as a set of Contents elements, which contain objects sorted alphabetically by their key name. Each Contents element includes a number of useful details about the object it represents.
Inside each Contents element we can see important details about the object, including its key name (Key), a timestamp describing when it was created or last modified (LastModified), a hex-encoded MD5 hash value of the object’s data content (ETag), and the size of the object in bytes (Size). In addition to these object details, the listing includes an Owner element that identifies the owner or creator of the object. Unless you are listing the contents of a bucket owned by someone else, or you have permitted someone else to create objects in your bucket, you will be the owner of all the objects in your bucket. Finally, the Contents element includes a StorageClass designation for each object. This is not a meaningful piece of information, because the storage class of objects is always Standard.
S3 buckets can hold an unlimited number of objects. This could clearly pose a problem if you were to list all the objects in a large bucket, because the listing document would continue forever, or at least for as long as your patience held out. To avoid returning huge listing documents for buckets with many objects, S3 truncates object listings to include 1,000 or fewer objects. When a client receives a truncated object-listing document, it is responsible for recognizing that the listing is incomplete and performing follow-up queries, until it is able to build up a complete list of the bucket’s contents.
When you send an object-listing request to the service, you can specify the maximum number of objects you want to be included in the listing. The service will include up to this many objects and no more, though it may also return fewer objects than requested. The listing document returned by the service describes the limit it applied in the MaxKeys element. In our example listing document, we can see that this limit is set to the default maximum value of 1000. To limit the number of keys that will be included in a key-listing document, you include the parameter max-keys in the GET request.
When an S3 client receives an object listing that has been truncated, it must recognize that there are more objects in the bucket than are recorded in the listing, so it can perform additional requests to list the missing objects. The IsTruncated element of the listing document indicates when a listing has been truncated. If this element has the value true, the listing document gives an incomplete picture of the bucket’s contents.
In addition to recognizing when a listing is incomplete, S3 clients need a mechanism to perform follow-up requests, so they can build up a complete list of objects one listing document at a time. To paginate a large number of objects over multiple listings, S3 clients provide a marker parameter with their listing requests. This parameter specifies a string that will serve as the starting point for the listing. When a marker is provided, only those objects in the bucket with key names alphabetically after the marker will be included in the listing. If the marker parameter is specified in a request, S3 will echo this value back to the client as an element called Marker in the response XML document.
By providing the marker parameter when performing a listing request, S3 clients can ensure that the service provides the next listing they need to find out about the objects that were not included in the prior listing. The client can determine the appropriate value to use for the marker parameter in one of two ways:
If the listing document includes a NextMarker element, the value of this element can be used in the marker parameter of a follow-up request to retrieve the next set of objects. Unfortunately, the NextMarker element is only provided by the service in some circumstances (see the discussion on “Searching” below). If this element is not available, the client must resort to the second option.
When the NextMarker element is not available, the client can deduce the appropriate value to use in the marker parameter of a follow-up request by identifying the last object key name in the current listing document. When the last key name from the current listing is used as the marker for the next request, only objects with key names that occur after this name will be listed.
Retrieving an object listing for buckets with many thousands of objects can be a slow process, because clients must perform multiple listing requests to learn about all the objects. If your buckets will store many objects, it may be a good idea to keep a local record of your object key names so you can refer to this instead of performing time-consuming listing requests.
If it is not feasible to keep your own records of object details, you can still speed up object listings by performing multiple listing requests at the same time with different marker values and merging the results.
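As a sketch of this parallel technique, the fragment below partitions the key space on the first character of the key name using the prefix parameter (a close cousin of splitting on precomputed marker values), runs one listing thread per partition, and merges the results. It assumes the s3 client object used throughout this chapter and the list_objects method defined later in Example 3-11.

# Sketch: list a large bucket with several concurrent requests by
# partitioning the key space, then merging the per-partition results.
# Assumes every key name starts with a letter or a digit.
partitions = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a

threads = partitions.map do |first_char|
  Thread.new do
    s3.list_objects('my-bucket', :prefix => first_char)[:objects]
  end
end

all_objects = threads.map {|t| t.value }.flatten
puts "Found #{all_objects.size} objects in total"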
S3’s object-listing API provides basic searching functionality to find objects based on their key names. In “Object Keys and Hierarchical Naming,” we discussed how S3 object key names can be constructed to represent a hierarchy resembling a file directory structure. When you perform object listings, you can take advantage of this structure and list only those objects that occupy a specific place in the hierarchy.
The listing API accepts two parameters that allow you to search for objects with specific names: prefix and delimiter. If you include the prefix parameter value in your object-listing request, only those objects with key names that start with the prefix string will be listed. If you include the delimiter parameter, any objects with key names that contain the delimiter string will be listed in a separate part of the XML document, as if they were directory names. If you use both of these parameters at the same time, you can navigate through the hierarchy represented in your object key names like you would through the subdirectories in a standard file system.
These parameters are not easy to describe, so let us look at some examples to make things clearer. Imagine that you have stored a number of images in a bucket with key names that represent a hierarchy, like this:
MyPictures/2005/image1.jpg
MyPictures/2006/image2.jpg
MyPictures/2007/image3.jpg
MyPictures/2007/image4.jpg
To list only the images that date from 2005, you could perform a listing with the prefix parameter set to MyPictures/2005/, and only the object named MyPictures/2005/image1.jpg would be listed.
To find out how many year-based subdirectories are inside this hierarchy, you would perform a listing with the prefix value MyPictures/ and the delimiter value / (a forward slash). With these two parameters, the listing will not contain any objects at all, because the delimiter character is present after the prefix string in all the object keys. Instead, the XML listing document will include a set of CommonPrefixes elements containing a Prefix value corresponding to each unique object key name that includes both the prefix and the delimiter.
<CommonPrefixes>
  <Prefix>MyPictures/2005/</Prefix>
</CommonPrefixes>
<CommonPrefixes>
  <Prefix>MyPictures/2006/</Prefix>
</CommonPrefixes>
<CommonPrefixes>
  <Prefix>MyPictures/2007/</Prefix>
</CommonPrefixes>
As you can see, the CommonPrefixes resemble a subdirectory listing of the MyPictures directory. You will also notice that the CommonPrefixes values are all unique. Although there are two objects in the MyPictures/2007/ location, this prefix is only mentioned once.
These CommonPrefixes strings are very useful for navigating hierarchies, because you can apply them as prefix parameter values to follow-up requests to drill down into the contents of each of the simulated subdirectories.
Do not worry if you are still confused about these object-searching parameters; we will demonstrate them shortly.
We have mentioned a number of parameters that can be applied to object listing requests. Table 3-4 summarizes the parameters recognized by the S3 API and what they do.
Table 3-4. Request parameters for listing objects
Parameter Name | Description |
---|---|
max-keys | The maximum number of objects that will be listed, up to 1,000. |
marker | A string that serves as a starting point for the listing. Only object keys that occur alphabetically after this marker value will be included in the listing. |
prefix | Only objects with key names that start with the prefix string will be included in the listing. |
delimiter | If object key names contain the delimiter string, they are listed as subdirectories in the CommonPrefixes document element. If the request also includes the prefix parameter, the delimiter must occur after the prefix portion of the key name to be recognized. |
Example 3-11 defines a method that sends a GET request to a bucket’s URI and retrieves and interprets the resulting object-listing XML document. The method automatically handles truncated listings by performing follow-up requests until all the objects have been listed. You can add optional parameter settings to control the listing by providing a params argument containing an array of hash objects that map each parameter’s name to its value.
Example 3-11. List objects: S3.rb
def list_objects(bucket_name, *params)
  is_truncated = true
  objects = []
  prefixes = []

  while is_truncated
    uri = generate_s3_uri(bucket_name, '', params)
    response = do_rest('GET', uri)
    xml_doc = REXML::Document.new(response.body)

    xml_doc.elements.each('//Contents') do |contents|
      objects << {
        :key => contents.elements['Key'].text,
        :size => contents.elements['Size'].text,
        :last_modified => contents.elements['LastModified'].text,
        :etag => contents.elements['ETag'].text,
        :owner_id => contents.elements['Owner/ID'].text,
        :owner_name => contents.elements['Owner/DisplayName'].text
      }
    end

    cps = xml_doc.elements.to_a('//CommonPrefixes')
    if cps.length > 0
      cps.each do |cp|
        prefixes << cp.elements['Prefix'].text
      end
    end

    # Determine whether listing is truncated
    is_truncated = 'true' == xml_doc.elements['//IsTruncated'].text

    # Remove any existing marker value
    params.delete_if {|p| p[:marker]}

    # Set the marker parameter to the NextMarker if possible,
    # otherwise set it to the last key name in the listing
    next_marker_elem = xml_doc.elements['//NextMarker']
    last_key_elem = xml_doc.elements['//Contents/Key[last()]']
    if next_marker_elem
      params << {:marker => next_marker_elem.text}
    elsif last_key_elem
      params << {:marker => last_key_elem.text}
    else
      params << {:marker => ''}
    end
  end

  return {
    :bucket_name => bucket_name,
    :objects => objects,
    :prefixes => prefixes
  }
end
The best way to come to grips with the many capabilities of the object-listing API operation is to perform some real listings and see what results we get. We will start with our test bucket my-bucket, which contains a few objects we have already created.
Let us list the objects in our test bucket and store the results in a listing variable.
irb> listing = s3.list_objects('my-bucket')

REQUEST DESCRIPTION
=======
GET\n
\n
\n
Wed, 07 Nov 2007 11:28:27 GMT\n
/my-bucket/

REQUEST
=======
Method : GET
URI : https://my-bucket.s3.amazonaws.com
Headers:
  Authorization=AWS ABCDEFGHIJ1234567890:isS0zbXbBWCHLj1awGUHKrEFhRI=
  Date=Wed, 07 Nov 2007 11:28:27 GMT
  Host=my-bucket.s3.amazonaws.com

RESPONSE
========
Status : 200 OK
Headers:
  x-amz-id-2=x4WBeY62B2w7krKmRfo2hSgpcERZq38PP90knTW1XxWwYy2+rp/oWBRCDnt1k2aI
  content-type=application/xml
  date=Wed, 07 Nov 2007 11:28:30 GMT
  x-amz-request-id=E54E58955A479C6B
  server=AmazonS3
  transfer-encoding=chunked
We can examine the contents of the listing to confirm that it contains only objects and no common prefixes.
irb> listing[:bucket_name]
=> "my-bucket"
irb> listing[:prefixes]
=> []
irb> listing[:objects].size
=> 3
irb> listing[:objects].each {|o| puts "#{o[:key]} (#{o[:size]} bytes)"}
Hello.txt (11 bytes)
Metadata.txt (16 bytes)
WebPage.html (29 bytes)
To see what happens when the list_objects method encounters a truncated listing, you can send a listing request with the max-keys parameter set to a value smaller than the number of objects in your bucket. We will set the maximum keys limit to 1, which means the method will have to perform three separate requests.
Parameter arguments can be provided to the list_objects method as hashes with key names that are either Ruby symbols (:marker=>'x') or plain strings ('marker'=>'x'). Strings will always work but are less efficient, and symbols cannot include nonalphabetical characters, like the hyphen in the max-keys parameter name. We will use symbol names whenever possible.
irb> listing = s3.list_objects('my-bucket', 'max-keys'=>1)
. . .
REQUEST
=======
Method : GET
URI : https://my-bucket.s3.amazonaws.com/?max-keys=1
Headers:
  Authorization=AWS ABCDEFGHIJ1234567890:ax8AZKHXMIhlt3QWwuLWRSzMceY=
  Date=Wed, 03 Oct 2007 04:20:08 GMT
  Host=s3.amazonaws.com
. . .
REQUEST
=======
Method : GET
URI : https://my-bucket.s3.amazonaws.com/?max-keys=1&marker=Hello.txt
. . .
irb> listing[:objects].size
=> 3
If you do not have debugging turned on, you will not see any difference in the results, although the listing will take longer because it will require multiple HTTP requests instead of just one. If you enable debugging, you will see the implementation performing multiple requests until all the objects are listed. All of the request messages except the first will include a marker parameter value.
We can set our own marker value to list only those objects with names that occur alphabetically after the marker in the bucket listing. If we list all the object names that occur after the Metadata.txt object, we will receive one result.
irb> listing = s3.list_objects('my-bucket', :marker=>'Metadata.txt')
irb> listing[:objects].each {|o| puts o[:key]}
WebPage.html
If we list all the object names that occur after the letter W, we will receive the same result, because only the object key name WebPage.html is sorted alphabetically after this string.
irb> listing = s3.list_objects('my-bucket', :marker=>'W')
irb> listing[:objects].each {|o| puts o[:key]}
WebPage.html
To properly demonstrate how to use the prefix and delimiter parameters to perform searches on your objects and navigate through key name hierarchies, we must first create some test objects with hierarchical names. We will create some dummy objects now using the same image names we discussed in “Searching.” These objects can be empty, because we are only interested in the object names.
irb> s3.create_object('my-bucket', 'MyPictures/2005/image1.jpg')
irb> s3.create_object('my-bucket', 'MyPictures/2006/image2.jpg')
irb> s3.create_object('my-bucket', 'MyPictures/2007/image3.jpg')
irb> s3.create_object('my-bucket', 'MyPictures/2007/image4.jpg')
Now, let us list the objects in our bucket that match the prefix ‘M’.
irb> listing = s3.list_objects('my-bucket', :prefix=>'M')
irb> listing[:objects].each {|o| puts o[:key]}
Metadata.txt
MyPictures/2005/image1.jpg
MyPictures/2006/image2.jpg
MyPictures/2007/image3.jpg
MyPictures/2007/image4.jpg
If we add to this mix a delimiter parameter set to the slash character, we can see how the object key names that contain the delimiter are summarized into the CommonPrefixes element.
irb> listing = s3.list_objects('my-bucket', :prefix=>'M', :delimiter=>'/')
. . .
REQUEST
=======
Method : GET
URI : https://my-bucket.s3.amazonaws.com/?delimiter=%2F&prefix=M
. . .
RESPONSE
========
. . .
Body:
<?xml version='1.0' encoding='UTF-8'?>
<ListBucketResult xmlns='http://s3.amazonaws.com/doc/2006-03-01/'>
  <Name>my-bucket</Name>
  <Prefix>M</Prefix>
  <Marker/>
  <MaxKeys>1000</MaxKeys>
  <Delimiter>/</Delimiter>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>Metadata.txt</Key>
    <LastModified>2007-11-07T11:22:22.000Z</LastModified>
    <ETag>"2d6b5bb20f322240448c5a25924ab4f5"</ETag>
    <Size>16</Size>
    <Owner>
      <ID>1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b3c4d5e6f1a2b</ID>
      <DisplayName>jamesmurty</DisplayName>
    </Owner>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
  <CommonPrefixes>
    <Prefix>MyPictures/</Prefix>
  </CommonPrefixes>
</ListBucketResult>
With both the prefix and delimiter applied, this listing includes only one object key, for the Metadata.txt object, and a single CommonPrefixes item.
irb> listing[:objects].each {|o| puts o[:key]}
Metadata.txt
irb> listing[:prefixes]
=> ["MyPictures/"]
To drill down further into the pseudo-subdirectory MyPictures, we can perform a follow-up request that takes the value from the CommonPrefixes XML element and includes it in the request message as the prefix parameter.
irb> prefix = listing[:prefixes].first
=> "MyPictures/"
irb> listing = s3.list_objects('my-bucket', :prefix=>prefix, :delimiter=>'/')
irb> listing[:prefixes]
=> ["MyPictures/2005/", "MyPictures/2006/", "MyPictures/2007/"]
This latest listing will include the set of objects and prefixes underneath MyPictures/ in the hierarchy. In our example there are only prefixes, because we have not stored any objects at this level of the hierarchy. This demonstrates very nicely how the prefix and delimiter parameters can be set according to the values returned in the CommonPrefixes XML element of a prior response document. Doing so allows you to navigate through a hierarchical naming structure as though it were a directory structure in a standard filesystem.
We will end by going one level deeper to list objects at the bottom of the hierarchy. You reach the leaves of the hierarchy when the object key names do not contain the delimiter string in the portion of the key name after the prefix.
irb> listing = s3.list_objects('my-bucket', :prefix=>'MyPictures/2005/',
       :delimiter=>'/')
irb> listing[:prefixes]
=> []
irb> listing[:objects].each {|o| puts "#{o[:key]} (#{o[:size]} bytes)"}
MyPictures/2005/image1.jpg (0 bytes)
To delete an object from S3, you send a DELETE request to the service with a URI specifying the object’s key name and the bucket it is stored in. The object will be deleted, and S3 will return an empty HTTP 204 response message. Delete requests will succeed even if you delete the same object multiple times, or if you delete an object that never existed.
Be careful when deleting objects; there is no way to retrieve an object’s data after it has been deleted.
Example 3-12 defines a method that sends a DELETE request message to a URI specifying the bucket the object is stored in and the object’s key name.
Example 3-12. Delete object: S3.rb
def delete_object(bucket_name, object_key)
  uri = generate_s3_uri(bucket_name, object_key)
  do_rest('DELETE', uri)
  return true
end
Here is an example command and debugging log showing what happens when you delete an object.
irb> s3.delete_object('my-bucket', 'MyPictures/2005/image1.jpg')

REQUEST DESCRIPTION
=======
DELETE\n
\n
\n
Wed, 07 Nov 2007 11:35:06 GMT\n
/my-bucket/MyPictures/2005/image1.jpg

REQUEST
=======
Method : DELETE
URI : https://my-bucket.s3.amazonaws.com/MyPictures/2005/image1.jpg
Headers:
  Authorization=AWS ABCDEFGHIJ1234567890:4Kext6O5ezLFqR0SCPR9flRs2eI=
  Date=Wed, 07 Nov 2007 11:35:06 GMT
  Host=my-bucket.s3.amazonaws.com

RESPONSE
========
Status : 204 No Content
Headers:
  x-amz-id-2=YdPaperJTlH8CPlaRmNj1JbmElROjNxwVBfh1rboy1st7kRDoDjHhc2rsu5e6WeP
  date=Wed, 07 Nov 2007 11:35:08 GMT
  x-amz-request-id=872BE829A52CAAC7
  server=AmazonS3

irb> s3.get_object('my-bucket', 'MyPictures/2005/image1.jpg')
AWS::ServiceError: HTTP Error: 404 - Not Found, AWS Error:
NoSuchKey - The specified key does not exist.
We have already seen in “Create or Replace an Object” how you can create objects in S3 with PUT requests. The problem with PUT requests is that common web browsers do not support them; they rely instead on HTML forms and POST requests to upload files and data to web servers. In early 2008 Amazon added support for POST requests to the S3 service to make it possible for S3 developers and their customers to upload content into S3 using a standard web browser.
The new POST support is intended to augment rather than replace the PUT method; it allows for browser-based uploads to S3 but nothing more. You cannot create buckets or update access control settings with POST requests. If your application interacts with S3 directly, the PUT request method remains the preferred mechanism for uploading data into the service. However, if your application provides a web site that accepts user-submitted content, the POST support could make your life much easier.
POST requests are constructed very differently from the other kinds of requests we have seen so far. Rather than building the request message directly yourself, you must provide the browser with an HTML form containing all the information it will need to build a valid request on your behalf.
Here are the steps involved in allowing a web site visitor to upload content to your S3 account:
1. Your application generates a web page containing a specially constructed HTML form. This form contains input fields that define and authenticate the POST request that will be sent by the browser when the user submits the form. The form will generally include a set of policy conditions specifying what kind of data the user can upload.

2. The user populates the HTML form with some data, which may be text or a file he wishes to upload. The user then submits the form, and the web browser sends a corresponding POST request to S3.

3. The S3 service receives the POST request and checks that the data provided by the user complies with any policy conditions you specified in the form. If any of the policy conditions fail, or if the form submitted is different from the version you authorized and signed, the service will reject the request and return an XML error message.

4. If the service accepts the POST request and the user successfully uploads his data, S3 will respond with a success status code, or it will redirect the user’s browser to another URL if you specify one. If the upload fails, S3 will respond with an XML error message that will be displayed in the user’s browser.
The fact that S3 sends XML error messages directly to users greatly limits the control you can maintain over the user experience of your web site. If an upload fails due to an error in the HTML form, the user will be shown a fairly incomprehensible error message (assuming their browser displays XML text at all) and will need to hit the Back button to return to your site.
To take advantage of the POST support in S3 you must be able to construct two new kinds of document: an HTML Form that defines the POST request, and a Policy Document that imposes conditions on the data a user may upload into your S3 account. We will discuss each of these documents in turn, before presenting example Ruby code that will allow you to easily create these documents.
In this book we assume that you will use server scripts to generate form and policy documents; however, you may wish to use browser scripting to generate or modify these documents on the user's computer. Although it is possible to build very flexible upload forms with browser scripting, if you pursue this course you should be very mindful of security, and always avoid making your AWS secret key accessible on the client side.
To enable web browsers to construct a POST request that will be understood and accepted by the S3 service, you must create HTML form documents that are structured correctly and that contain all the information required by the service.
Here is a web page containing an HTML form that allows a user to upload a file to the my-bucket bucket.
<html>
<head>
  <title>File Upload to S3</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
<form action="https://my-bucket.s3.amazonaws.com/" method="post"
      enctype="multipart/form-data">
  <input type="hidden" name="AWSAccessKeyId" value="ABCDEFGHIJ1234567890">
  <input type="hidden" name="signature" value="LPQ+lb6L0ykrDdUqc2usbEPmsjA=">
  <input type="hidden" name="key" value="${filename}">
  <input type="hidden" name="policy"
         value="eyJleHBpcmF0aW9uIjogIjIwMDgtMDEtMDlUMTE6Mjk6MzRaI...Mt=">

  Select a file to upload: <input name="file" type="file">
  <br>
  <input type="submit" value="Upload to Amazon S3">
</form>
</body>
</html>
The first thing to note is that the form and all its contents must be UTF-8 encoded. To ensure that the web page containing the form uses the correct encoding, the page’s content type and character set should be explicitly defined in a meta tag inside the page’s header:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The HTML form itself is configured to use the POST method and to encode data in a multipart form enclosure. The form’s action parameter must contain a URL that specifies the S3 bucket the object will be created in, and whether the HTTP or HTTPS protocol should be used to upload the data. This URL can be constructed according to any of the formats recognized by S3; see “Constructing S3 URIs” for more information. We will use the recommended subdomain URL format in this book.
Now that we have covered the general structure of the HTML form, we can turn our attention to the input fields it must contain to produce a valid POST request. Table 3-5 lists the field names that the S3 service recognizes and describes how each field is used. The service recognizes these field names whether they are specified in upper, lower or mixed case.
Depending on which input fields you include in a form, it will be either authenticated or anonymous.
Authenticated forms include input fields that specify a policy document, your AWS access key, and a signature value. With an authenticated form you can permit uploads into any bucket you own, impose conditions on the data a user can submit, and specify an expiration date for the form. The S3 service will not accept uploads from an authenticated form if the form has been modified by a user.
Anonymous forms do not include any of the policy document, AWS access key or signature input fields. Anonymous forms do not allow you to exercise any control over the data a user submits, and they may only be used with buckets that permit public write access by anonymous users.
Table 3-5. HTML form input fields for S3 POST
| Field Name | Value | Required? |
|---|---|---|
| key | The name of the object that will be created by the POST request. The name of the file uploaded by the user is stored in the special variable ${filename}. To allow a user to upload a file to a predefined path in S3, you might specify a key value like uploads/images/${filename}. | Yes |
| file | An input form field that will provide the data to upload to S3. This field should be the very last one within a form, because any subsequent input fields will be ignored by S3. This input field may provide file or text data: a file input (<input type="file">), a text input (<input type="text">), or a textarea element are all acceptable. | Yes |
| AWSAccessKeyId | The AWS access key of the bucket's owner. | If authenticated |
| policy | A Base64-encoded policy document that imposes conditions on the request. See "Policy Document for S3 POST" below for more information on policy documents. | If authenticated |
| signature | An HMAC signature value generated by signing a Base64-encoded policy document with your AWS secret key. | If authenticated |
| acl | The access control policy to apply to the newly created object. This value may be one of: private, public-read, public-read-write or authenticated-read. If this field is not included, the object will be private by default. | No |
| success_action_status | The HTTP status code S3 will include in its response after a successful upload. This value may be 200, 201 or 204 (the default). If the success_action_redirect field is specified, this field is ignored. | No |
| success_action_redirect | The URL to which the client's web browser is redirected after a successful upload. The redirect URL returned by S3 will include parameters that specify the bucket, key and etag of the new object. If this field is not included or if the specified URL is invalid, S3 will respond with the status code specified by the success_action_status field. | No |
| x-amz-meta-* | If fields with a name beginning with x-amz-meta- are included, the values of these fields are stored as metadata with the new object. | No |
| Others... | Other fields can be included in the form, however S3 will only act upon the fields it recognizes, such as the standard REST headers: Cache-Control, Content-Type, Content-Disposition, Content-Encoding, Expires. | No |
The maximum amount of form data that is permitted in a POST request is 20 KB, excluding the form’s data content.
Most input fields contain simple static text values, but the S3 service also understands one variable name: ${filename}. When the service processes a POST request, this variable is replaced with the name of the file the user has uploaded. The variable can be used in any form field except for policy, though it is most useful in the key field, where you specify the name of the object that will be created. Note that this variable will only work in forms that upload files; if a form uploads text data, the ${filename} variable will not be substituted.
A policy document specifies conditions that a POST request must meet to be considered valid by S3, and is the basis of the request signing technique used to authenticate POST requests. You must include a policy document in your HTML forms to maintain any degree of control over the data that users can upload to S3. A policy document is provided to S3 as a Base64-encoded document in the policy input field.
You specify your policy as a UTF-8 encoded JavaScript Object Notation (JSON) document. In other words, the policy document contains a hierarchical collection of name and value pairs that follow a predefined structure we will describe below.
Special characters in the policy document must be escaped with a preceding backslash (\) character. The set of special characters includes the backslash (\) and dollar sign ($), control characters such as newline (\n) and tab (\t), as well as any Unicode characters (\uxxxx).
Every policy document includes two top-level name and value
object pairs: expiration
and conditions
. The
expiration item specifies an ISO-8601 GMT timestamp value which
indicates when the policy will expire, while the conditions item
contains an array of zero or more rules to define what data a user
can upload. If a user submits a POST request with data that breaks
any of the policy conditions, S3 will reject the request with an
error message.
Here is a policy document that imposes three conditions, and that expires at midnight of February 1st, 2008.
{ "expiration": "2008-02-01T00:00:00Z", "conditions": [ {"bucket": "my-bucket"}, ["starts-with", "$key", ""], ["content-length-range", 1, 51200] ] }
Policy conditions are specified as either an object or an array in JSON format. You need not worry too much about what this means exactly, because there is only a limited set of condition statements available. You can simply adapt the example statements below to define your own conditions.
Each condition statement in a policy document describes a test operation that S3 will perform on a specific field in the HTML form. There are three kinds of conditions that can be specified in a policy document:
The first kind of condition is an equality check, which tests whether a field's value or values exactly match a given string. The equality condition can be specified in one of two formats.
The first format defines the condition as an array of strings that describe the equality operator (eq), the field to test ($fieldName), and a literal comparison value. Here is a condition statement that tests whether the field named "acl" has a value of "private":
["eq", "$acl", "private"]
There is a shorthand format for the equality test in which the condition is specified as a simple name and value pair within brace ({}) characters:
{"acl": "private"}
If your HTML form includes multiple values for a single field, the equality check must include each of the values in the correct order separated by commas:
{"acl": "private,authenticated-read"}
The second kind of condition uses the starts-with operator to check whether a field's value begins with a specific string.
The condition is specified with the same long format as we saw in the equality test. To test whether the input field named “key” has a value starting with “/documents/public/” we would use the following condition statement:
["starts-with", "$key", "/documents/public/"]
The starts-with condition has an important use beyond just testing for values that start with a given string: it can be used as a test that matches any possible value of a field. If you wish to permit a field to take on any value at all, you can define a starts-with condition that checks the field's value against an empty string. A field will pass this test whatever its value.
["starts-with", "$key", ""]
The third kind of condition, content-length-range, is a special case because it does not apply to a field in the HTML form; instead, it tests the size of the data a user has uploaded to S3. With this condition, you can define upper and lower limits on how many bytes of data a user can submit to S3. If the user uploads too much or too little data, the POST request will fail.
A content length range condition is specified as an array of three items describing the condition's name ("content-length-range") followed by the lower and upper bounds expressed as integers. Here is a condition statement that requires the user to upload at least 1 byte but no more than 50 kilobytes (51,200 bytes):
["content-length-range", 1, 51200]
To construct a policy document that fully describes and authenticates your HTML form, you must include at least one condition for each named input field in the form. If your form contains input fields that are not mentioned in the policy document, the S3 service will reject the POST request. This is a safety mechanism imposed by the service to ensure that the policy documents you create describe all the fields in the form and leave no room for an attacker to modify the form after you have signed it.
For example, if you wish to add a field specifying the Content-Type header that will be assigned to a new object, you must also include a condition for this field in the policy document. The condition you include may be an equality check or it may use the starts-with operator to allow any value for the field; it does not matter what kind of condition you define, provided there is at least one policy condition that refers to each field.
There are two field names that must be included in every policy document: key and bucket. The key input field is required in every form and must therefore be permitted with a corresponding condition statement. The situation is the same for the bucket field. Although the bucket field is not explicitly included in the S3 POST form, it is included implicitly with every POST request.
There are some exceptions to the rule that every input field must have a corresponding condition statement in the policy document. The exceptions are any form fields that occur after the file field, any fields with a name starting with x-ignore-, and the fields AWSAccessKeyId, signature, file and policy.
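For instance, an extra form field whose name begins with x-ignore- can be included in a form without a matching condition statement. A small sketch (the field name here is hypothetical):

# Extra fields for a form's :fields option (see Example 3-14).
# 'x-ignore-upload-note' is a hypothetical name; because it begins with
# "x-ignore-", no condition statement is required for it, and S3 will
# not act upon the field's value.
fields = { 'x-ignore-upload-note' => 'uploaded via the web form' }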
Example 3-13 defines a method that generates a policy document based on information you provide: an expiration Time object and a hash dictionary representing policy conditions. The method seems rather complex; however, in essence all it does is concatenate a series of condition statement strings into a boilerplate policy document template. The complexity comes from deciding which condition statements to define.
Example 3-13. Build POST policy document: S3.rb
require 'time'  # provides the Time#iso8601 method used below

def build_post_policy(expiration_time, conditions)
  if expiration_time.nil? or not expiration_time.respond_to?(:getutc)
    raise 'Policy document must include a valid expiration Time object'
  end
  if conditions.nil? or not conditions.class == Hash
    raise 'Policy document must include a valid conditions Hash object'
  end

  # Convert conditions object mappings to condition statements
  conds = []
  conditions.each_pair do |name, test|
    if test.nil?
      # A nil condition value means allow anything.
      conds << %{["starts-with", "$#{name}", ""]}
    elsif test.is_a? String
      conds << %{{"#{name}": "#{test}"}}
    elsif test.is_a? Array
      conds << %{{"#{name}": "#{test.join(',')}"}}
    elsif test.is_a? Hash
      operation = test[:op]
      value = test[:value]
      conds << %{["#{operation}", "$#{name}", "#{value}"]}
    elsif test.is_a? Range
      conds << %{["#{name}", #{test.begin}, #{test.end}]}
    else
      raise "Unexpected value type for condition '#{name}': #{test.class}"
    end
  end

  return %{{"expiration": "#{expiration_time.getutc.iso8601}", "conditions": [#{conds.join(",")}]}}
end
To use this method, you describe each policy condition as a mapping from a field name to a value object. You indicate what kind of condition you wish to apply by using different data types for the value object.
| Value Data Type | Condition Applied |
|---|---|
| nil | A starts-with test that will accept any value. |
| String | An equality test using the given string. |
| Array | An equality test, against a value composed of all the array's items combined into a comma-delimited string. |
| Hash | An operation named by the :op mapping, with a value as given by the :value mapping. |
| Range | A range test, where the tested value must lie between the beginning and end values of the Range object provided. |
Here is a command that demonstrates how to use the build_post_policy method to generate a policy document containing each of the five condition types understood by the method.
irb> conditions = {
irb>   'key' => nil,                              # Empty starts-with condition
irb>   'bucket' => 'my-bucket',                   # Equality condition
irb>   'x-amz-meta-mytag' => ['Work','TODO'],     # Equality with multi-values
irb>   'Content-Type' => {:op=>'starts-with',     # Starts-with condition
irb>                      :value=>'text/'},
irb>   'content-length-range' => Range.new(1,50)  # Range condition
irb> }
irb> expiration = Time.now + 60 * 5               # Policy expires in 5 minutes
irb> s3.build_post_policy(expiration, conditions) # The resultant policy document
=> {"expiration": "2008-01-08T10:31:36Z",
    "conditions": [
      ["starts-with", "$key", ""],
      {"bucket": "my-bucket"},
      {"x-amz-meta-mytag": "Work,TODO"},
      ["starts-with", "$Content-Type", "text/"],
      ["content-length-range", 1, 50]
    ]}
Example 3-14 defines a method that builds an HTML form document for performing S3 POST requests. The form document produced by this method follows the structure described in "HTML Form for S3 POST."
If you provide policy conditions as optional arguments, this method will call the build_post_policy method to create a policy document. The policy document will be Base64-encoded, included in the form in the policy input field, and used in combination with your AWS secret key to generate an HMAC signature value that authenticates the form. If you do not provide any policy conditions, the method will produce an anonymous form.
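Example 3-14 relies on two helper methods, encode_base64 and generate_signature, that are defined elsewhere in S3.rb. As a rough sketch of what these helpers must do, assuming @aws_secret_key holds your AWS secret key, they can be built with Ruby's standard base64 and openssl libraries:

require 'base64'
require 'openssl'

# Base64-encode data, removing the newlines the encoder inserts.
def encode_base64(data)
  Base64.encode64(data).gsub("\n", '')
end

# Sign the Base64-encoded policy document with the AWS secret key using
# HMAC-SHA1, then Base64-encode the binary digest to produce the value
# for the form's 'signature' field.
def generate_signature(policy_b64)
  digest = OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha1'),
                                @aws_secret_key, policy_b64)
  Base64.encode64(digest).strip
end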
Example 3-14. Build POST HTML Form: S3.rb
def build_post_form(bucket_name, key, options={})
  fields = []

  # Form is only authenticated if a policy is specified.
  if options[:expiration] or options[:conditions]
    # Generate policy document
    policy = build_post_policy(options[:expiration], options[:conditions])
    puts "POST Policy\n===========\n#{policy}\n\n" if @debug

    # Add the Base64-encoded policy document as the 'policy' field
    policy_b64 = encode_base64(policy)
    fields << %{<input type="hidden" name="policy" value="#{policy_b64}">}

    # Add the AWS access key as the 'AWSAccessKeyId' field
    fields << %{<input type="hidden" name="AWSAccessKeyId" value="#{@aws_access_key}">}

    # Add the signature for the encoded policy document as the 'signature' field
    signature = generate_signature(policy_b64)
    fields << %{<input type="hidden" name="signature" value="#{signature}">}
  end

  # Include any additional fields
  if options[:fields]
    options[:fields].each_pair do |n, v|
      if v.nil?
        # Allow users to provide their own <input> fields as text.
        fields << n
      else
        fields << %{<input type="hidden" name="#{n}" value="#{v}">}
      end
    end
  end

  # Add the vital 'file' input item, which may be a textarea or file.
  if options[:text_input]
    # Use the text_input option, which should specify a textarea or text
    # input field. For example:
    # '<textarea name="file" cols="80" rows="5">Default Text</textarea>'
    fields << options[:text_input]
  else
    fields << %{<input name="file" type="file">}
  end

  # Construct a subdomain URL to refer to the target bucket. The
  # HTTPS protocol will be used if the secure HTTPS option is enabled.
  url = "http#{@secure_http ? 's' : ''}://#{bucket_name}.s3.amazonaws.com/"

  # Construct the entire form.
  form = %{
    <form action="#{url}" method="post" enctype="multipart/form-data">
      <input type="hidden" name="key" value="#{key}">
      #{fields.join("\n")}
      <br>
      <input type="submit" value="Upload to Amazon S3">
    </form>}

  puts "POST Form\n=========\n#{form}\n" if @debug
  return form
end
Let us step through some examples to see how this method works and the policy and form documents it generates. We will start with a very simple anonymous form that is not authenticated, and will therefore be limited to uploading files to a bucket with public-write access. We will use the ${filename} variable as the value for the key field, which means that the object will be given the same name as the uploaded file.
irb> s3.build_post_form('my-bucket', '${filename}')

POST Form
=========
<form action="https://my-bucket.s3.amazonaws.com/" method="post"
      enctype="multipart/form-data">
  <input type="hidden" name="key" value="${filename}">
  <input name="file" type="file">
  <br>
  <input type="submit" value="Upload to Amazon S3">
</form>
The following example is more realistic and useful. In it we will generate an authenticated form, which means we must include policy conditions for each of the input fields included in the form. When a file is uploaded using this form, the resultant S3 object will be named uploads/images/pic.jpg. The object will be made publicly accessible by assigning it the public-read ACL setting, and it will be identified as a JPEG image by its content type value. To ensure that the user uploads a file of a reasonable size, we will apply a content length range restriction of between 10 KB and 200 KB. Finally, after the user uploads a file they will be redirected to the URL http://localhost/post_upload.
# Fields to set the object's access permissions and content type
irb> fields = {
irb>   'acl' => 'public-read',
irb>   'Content-Type' => 'image/jpeg',
irb>   'success_action_redirect' => 'http://localhost/post_upload'
irb> }
# Conditions for the mandatory 'bucket' and 'key' fields, as well as the
# additional fields specified above. Also includes a byte range condition.
irb> conditions = {
irb>   'bucket' => 'my-bucket',
irb>   'key' => 'uploads/images/pic.jpg',
irb>   'acl' => 'public-read',
irb>   'Content-Type' => 'image/jpeg',
irb>   'success_action_redirect' => 'http://localhost/post_upload',
irb>   'content-length-range' => Range.new(10240, 204800)
irb> }
# Form expires in 24 hours
irb> expiration = Time.now + 3600 * 24
# Combine all the optional form components into a single hash dictionary
irb> options = {
irb>   :expiration => expiration,
irb>   :conditions => conditions,
irb>   :fields => fields
irb> }
# Generate the form. We have turned on debugging so both the policy and
# form documents will be printed out in full.
irb> s3.build_post_form('my-bucket', 'uploads/images/pic.jpg', options)

POST Policy
===========
{"expiration": "2008-01-09T11:29:34Z",
 "conditions": [
   {"success_action_redirect": "http://localhost/post_upload"},
   {"bucket": "my-bucket"},
   {"Content-Type": "image/jpeg"},
   {"key": "uploads/images/pic.jpg"},
   ["content-length-range", 10240, 204800],
   {"acl": "public-read"}
 ]}

POST Form
=========
<form action="https://my-bucket.s3.amazonaws.com/" method="post"
      enctype="multipart/form-data">
  <input type="hidden" name="key" value="uploads/images/pic.jpg">
  <input type="hidden" name="policy" value="eyJleH...KICAg=">
  <input type="hidden" name="AWSAccessKeyId" value="ABCDEFGHIJ1234567890">
  <input type="hidden" name="signature" value="LPQ+lb6L0ykrDdUqc2usbEPmsjA=">
  <input type="hidden" name="success_action_redirect" value="http://localhost/post_upload">
  <input type="hidden" name="Content-Type" value="image/jpeg">
  <input type="hidden" name="acl" value="public-read">
  <input name="file" type="file">
  <br>
  <input type="submit" value="Upload to Amazon S3">
</form>
For our final example, we will allow our users to type HTML code into a text box and submit this data to S3 instead of a file. To do this, we provide our own text area input field to override the default file input field.
irb> fields = {'acl' => 'public-read', 'Content-Type' => 'text/html'}
irb> key = 'users/posts/comment-01234.html'
irb> conditions = {
irb>   'bucket' => 'my-bucket',
irb>   'key' => {:op => 'starts-with', :value => 'users/posts/comment-'},
irb>   'acl' => 'public-read',
irb>   'Content-Type' => 'text/html'
irb> }
# Define our own input field item named 'file' to accept textual data
irb> input_field = '<textarea name="file" cols="60" rows="10"></textarea>'
irb> s3.build_post_form('my-bucket', key, :fields => fields,
irb>      :expiration => Time.now + 60, :conditions => conditions,
irb>      :text_input => input_field)

POST Policy
===========
{"expiration": "2008-01-11T05:10:33Z",
 "conditions": [
   {"bucket": "my-bucket"},
   {"Content-Type": "text/html"},
   ["starts-with", "$key", "users/posts/comment-"],
   {"acl": "public-read"}
 ]}

POST Form
=========
<form action="https://my-bucket.s3.amazonaws.com/" method="post"
      enctype="multipart/form-data">
  <input type="hidden" name="key" value="users/posts/comment-01234.html">
  <input type="hidden" name="policy" value="eyJleHBp...aIiwKI">
  <input type="hidden" name="AWSAccessKeyId" value="ABCDEFGHIJ1234567890">
  <input type="hidden" name="signature" value="cVnaqAdLzRd3g8sEZx9xSs63hw0=">
  <input type="hidden" name="Content-Type" value="text/html">
  <input type="hidden" name="acl" value="public-read">
  <textarea name="file" cols="60" rows="10"></textarea>
  <br>
  <input type="submit" value="Upload to Amazon S3">
</form>
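To put a generated form in front of users, you can embed it in a complete, UTF-8 encoded web page. Here is a minimal sketch that wraps the anonymous form from the first example in such a page and writes it to disk; the upload.html file name is our own choice, and s3 is the client object used in the examples above.

# Embed a generated form in a complete UTF-8 web page and save it.
form = s3.build_post_form('my-bucket', '${filename}')
page = <<HTML
<html>
<head>
  <title>File Upload to S3</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
#{form}
</body>
</html>
HTML
File.open('upload.html', 'w') { |f| f.write(page) }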