Amazon S3 (Simple Storage Service) is an online file-storage web service provided by Amazon. It is unique among online storage services in several ways:
It has a no-minimum pricing structure. Storage is billed by the GB-month, bandwidth is billed by the GB, and there is an additional charge per GET, PUT, and LIST request.
There is no web interface to create objects; the only full mode of access is through the API.
It is generally agreed that the S3 API is the first large public API that calls itself RESTful and actually lives up to the principles of REST.
In addition to the rich HTTP web service interface, S3 can serve objects over plain HTTP (without any custom HTTP headers) and BitTorrent. Many organizations use S3 as a storage network for their static content because it can serve images, CSS, and JavaScript just as well as a standard web server.
The full documentation for the S3 API is at http://aws.amazon.com/s3. We will now look into the basic architecture of S3, its concepts, and its set of operations.
S3 is used to store objects, which are streams of data with a key (a name) and attached metadata. They are like files in many ways. Objects are stored in buckets, which also have a key. Buckets are like filesystem directories, with a few differences:
Bucket names must be unique across the entire S3 system. You cannot pick a bucket name that has already been chosen by someone else.
Bucket names must be valid DNS names (alphanumeric plus underscore, period, and dash).
Buckets cannot be nested: there is one level of buckets, which contain objects. However, we can fake such nesting by giving objects keys like blog/2007/01/05/index.html. Slash characters, though they often designate hierarchy in URIs, are treated like any other character in object keys. We can even query keys by prefix, so we can ask to list all keys starting with blog/2007/01/05.
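This fake hierarchy can be sketched locally. S3 performs the equivalent filtering server-side when a request includes a prefix parameter; the key names below are made up for illustration:

```ruby
# Slash-delimited keys simulate a directory tree. Asking for keys
# with a given prefix is like listing a directory's contents.
keys = [
  'blog/2007/01/05/index.html',
  'blog/2007/01/05/photo.jpg',
  'blog/2007/02/01/index.html',
  'about.html'
]

# Everything "under" the fake directory blog/2007/01/05:
january_fifth = keys.select { |k| k.start_with?('blog/2007/01/05') }
```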
Amazon provides three different URI templates by which objects can be accessed:

http://s3.amazonaws.com/{bucket-key}/{object-key}
http://{bucket-key}.s3.amazonaws.com/{object-key}
http://{bucket-key}/{object-key}

These are genuine RESTful URIs; they refer to the resources themselves, and nothing else.
This last URI is an example of a virtual hosted bucket: when a DNS name is used as a bucket key and that name is pointed at s3.amazonaws.com. via a CNAME record, S3 recognizes the bucket key from the Host header and serves the appropriate object. This makes it possible to serve an entire domain from S3, nearly transparently. If we create a bucket called images.example.com, place a JPEG photo in it as an object called hello.jpg, and ensure the proper CNAME is set up pointing images.example.com. to s3.amazonaws.com., then our image is accessible at http://images.example.com/hello.jpg with a standard web browser, just as if we had an HTTP server serving that URI.
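The CNAME half of this setup is ordinary DNS configuration. As a sketch, a hypothetical BIND-style zone file for example.com would need just one extra record:

```
; hypothetical zone file entry: alias the images hostname to S3
images.example.com.  IN  CNAME  s3.amazonaws.com.
```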
Because Amazon was not tied to the limitations of existing HTTP clients, it did not have to bow to the limitations of HTTP Basic or Digest authentication in web browsers when creating S3. The S3 authentication protocol is a thin layer, adding an HMAC signature to each request. After the message is signed, a header is added to the HTTP request as follows:
Authorization: AWS AWSAccessKeyId:Signature
The AWSAccessKeyId value indicates the ID of the access key that the bucket owner generated; it is tantamount to a user ID. The Signature value is the Base64-encoded result of the HMAC calculation.
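The computation can be sketched as follows. The keys and date here are made-up values, and the exact set of fields included in the string to sign (verb, Content-MD5, Content-Type, Date, any x-amz-* headers, and the resource) is specified by the S3 documentation:

```ruby
require 'openssl'
require 'base64'

# Made-up credentials for illustration only.
access_key_id = 'MyAWSAccessKeyId'
secret_key    = 'MyAWSSecretAccessKey'

# The string to sign concatenates the significant parts of the request.
string_to_sign = [
  'GET',                             # HTTP verb
  '',                                # Content-MD5 (empty for this GET)
  '',                                # Content-Type (empty for this GET)
  'Thu, 05 Jan 2007 21:15:45 +0000', # the request's Date header
  '/images.example.com/hello.jpg'    # canonicalized resource
].join("\n")

# HMAC-SHA1 over the string to sign, keyed with the secret key,
# then Base64-encoded.
hmac      = OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha1'),
                                 secret_key, string_to_sign)
signature = Base64.encode64(hmac).strip

authorization = "Authorization: AWS #{access_key_id}:#{signature}"
```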
S3 is a closed system; the owner of a bucket is billed for most operations on it. Therefore, all requests to S3 must be signed or otherwise authorized by the bucket owner, as he is the one ultimately responsible for payment.
However, signing each request can be inconvenient in some situations. A common example is when an organization uses S3 as an asset server; usually the organization would want the corresponding bucket to be world-readable. S3 includes access control lists (ACLs) for this purpose. As long as the owner is comfortable with being charged for operations by anonymous users, he can give READ access to the AllUsers group, which will eliminate the need for a signature.
Another option, which can be incredibly useful, is to delegate access control by including the authentication information in the query string of the object's URI. This is most useful when the object is private but designated users without an AWS account should be allowed to retrieve it via plain HTTP or BitTorrent. Basecamp uses this approach to store a company's files: the files are kept on S3 with a locked-down ACL, and when an authorized user requests a file, he is sent to a URI that includes a signature valid for a limited period of time. The format of these URIs is as follows:
/objectkey?AWSAccessKeyId=AWSAccessKeyId&Expires=Expires&Signature=Signature
The AWSAccessKeyId and Signature values are as described previously, while the Expires value is a POSIX-time-formatted value indicating when the authorization expires. The Expires value is also signed by the HMAC, so that the recipient cannot modify it undetected.
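Generating such a URI can be sketched like so, again with made-up keys; in this variant the Expires value takes the place of the Date header in the string that gets signed:

```ruby
require 'openssl'
require 'base64'
require 'cgi'

# Made-up credentials and a fixed expiration time for illustration.
access_key_id = 'MyAWSAccessKeyId'
secret_key    = 'MyAWSSecretAccessKey'
expires       = 1169532000  # POSIX time after which the URI stops working
resource      = '/images.example.com/hello.jpg'

# Same shape as the header-based signature, with Expires in the
# Date slot, so the expiration time is covered by the HMAC.
string_to_sign = ['GET', '', '', expires.to_s, resource].join("\n")
hmac      = OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha1'),
                                 secret_key, string_to_sign)
signature = CGI.escape(Base64.encode64(hmac).strip)

uri = "http://s3.amazonaws.com#{resource}" \
      "?AWSAccessKeyId=#{access_key_id}&Expires=#{expires}&Signature=#{signature}"
```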
S3 has a truly RESTful HTTP interface, in which the URIs correspond to resources only, the proper HTTP methods are used according to their semantics, and status codes are used appropriately. There are three types of resources in the S3 system:
The service resource represents the Amazon S3 service as a whole; its well-known URI is http://s3.amazonaws.com/. This resource supports only one HTTP method:
GET service
Returns a list of all buckets owned by the currently authenticated user.
A bucket resource represents one bucket belonging to the authenticated user. It can be accessed through the following URIs:

http://s3.amazonaws.com/{bucket-key}/
http://{bucket-key}.s3.amazonaws.com/
http://{bucket-key}/ (if the key is a valid DNS name with a CNAME pointing to s3.amazonaws.com)
A bucket resource supports the following three methods:
PUT bucket
Creates a bucket with the given name. (Because the client gets to choose the name, this is accomplished with PUT to the resource itself, rather than POST to the parent.) Attempting to create a bucket that already exists returns an HTTP status code of 409 Conflict.
GET bucket
Retrieves a list of the objects contained in the specified bucket. Takes a prefix parameter in the query string to list only the keys that begin with a given string.
DELETE bucket
Deletes the specified bucket. Only the bucket's owner may delete a bucket. A bucket can be deleted only if it is empty; attempting to delete a nonempty bucket will cause an error with an HTTP status code of 409 Conflict.
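The mapping from these three operations to HTTP requests can be sketched with Ruby's standard Net::HTTP library. No request is actually sent here; the bucket name is hypothetical and the Authorization value is a placeholder rather than a real signature:

```ruby
require 'net/http'
require 'time'

# One bucket, three operations, three HTTP methods.
bucket_path = '/images.example.com/'

create = Net::HTTP::Put.new(bucket_path)                 # PUT bucket
list   = Net::HTTP::Get.new(bucket_path + '?prefix=blog/') # GET bucket
delete = Net::HTTP::Delete.new(bucket_path)              # DELETE bucket

[create, list, delete].each do |request|
  request['Date']          = Time.now.httpdate
  # Placeholder; a real request carries the HMAC signature described above.
  request['Authorization'] = 'AWS MyAWSAccessKeyId:Signature'
end
```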
An object resource represents an object stored within a bucket. It is accessible at the following URIs:

http://s3.amazonaws.com/{bucket-key}/{object-key}
http://{bucket-key}.s3.amazonaws.com/{object-key}
http://{bucket-key}/{object-key}

All object keys are qualified with their bucket key. An object resource supports the following four methods:
PUT object
Stores the given data at the location specified, creating a new object or overwriting an existing object.
GET object
Retrieves and returns the object at the specified location.
HEAD object
Returns the headers that would be returned from a GET request on this object, but with no body.
DELETE object
Deletes the object at the given location. By analogy to Unix file permissions, you must have WRITE access on a bucket to delete objects within it. Deleting a nonexistent object is not an error, but is effectively a no-op.
Marcel Molina, Jr.'s AWS::S3 library (http://amazon.rubyforge.org/) is the most popular client for S3. Its design was inspired by ActiveRecord, and it is simple and elegant:
require 'aws/s3'  # gem install aws-s3

AWS::S3::Base.establish_connection!(
  :access_key_id     => 'MyAWSAccessKeyId',
  :secret_access_key => 'MyAWSSecretAccessKey'
)

image_bucket = AWS::S3::Bucket.create('images.example.com')

AWS::S3::S3Object.store(
  'hello.jpg',             # key
  File.read('hello.jpg'),  # value
  'images.example.com',    # bucket name
  :content_type => 'image/jpeg',
  :access       => :public_read
)
The s3fuse project (http://sourceforge.net/projects/s3fuse/) is an implementation of an S3 client using FUSE (a Linux filesystem framework that runs in userspace rather than kernel space). This makes it possible to mount an S3 bucket as a Linux filesystem and use it transparently within unmodified applications.
Park Place, by why the lucky stiff (http://code.whytheluckystiff.net/parkplace), is a nearly complete clone of the Amazon S3 web service. It is perfect for developing and testing S3 applications without requiring an S3 account or payment. It does not support S3's SOAP interface, but it supports almost everything else, including distributing objects with BitTorrent.
Park Place is written using the excellent Camping web microframework, also by why the lucky stiff (http://code.whytheluckystiff.net/camping). Camping is a very stripped-down Ruby framework modeled after Rails but taking less than 4 kb of source (packed).
Incidentally, the Camping source is a great place to learn Ruby meta-programming inside and out.