Representing Data in SimpleDB

The SimpleDB service stores all content as text, including the attribute values that store your data. The service does not recognize data types in the same way that a relational database does. This feature makes the service more flexible, because you can store any values you like without having to worry about whether they match a predefined schema; however, it also means that the service is only able to compare or sort values based on lexicographical (alphabetical) ordering. Whereas a traditional database can compare various data types based on a full understanding of what the particular type means, SimpleDB is oblivious to the standard data types and will assume that an alphabetical ordering always makes sense.

If you intend to perform queries that use comparison operators, such as less-than and greater-than, you will have to carefully encode any nontextual data you store in the service so that its lexicographical ordering is the same as the expected ordering for data of that type. You will also need to be able to decode these text values when you retrieve them from SimpleDB.
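
To make the problem concrete, the following short Ruby snippet (an illustration only, not part of the SimpleDB class) sorts the same values first as numbers and then as text, which is how SimpleDB would see them if they were stored without any encoding:

```ruby
# Numbers compared as numbers behave as expected.
numbers = [9, 10, -5, 100]
sorted_numerically = numbers.sort          # [-5, 9, 10, 100]

# The same values compared as text, character by character.
sorted_as_text = numbers.map(&:to_s).sort  # ["-5", "10", "100", "9"]

puts sorted_numerically.inspect
puts sorted_as_text.inspect

# "10" sorts before "9" because the character "1" precedes "9".
puts "10" < "9"                            # true
```

A greater-than query against the unencoded text would therefore treat 9 as larger than both 10 and 100.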

In this section we will define methods to encode and decode the most commonly used data types: Boolean, date, integer, and float. Our data-type encodings are designed to meet two criteria:

  1. The encoded values retain the expected ordering when sorted lexicographically.

  2. Encoded values include a special prefix character (!), to make it easy to recognize them, and a second character to identify the data type the encoded value represents.

The encoding of data types into text strings is an advanced topic that we cannot discuss in depth in this book. The encoding techniques we present here are intended to meet the needs of most SimpleDB users, but there are bound to be some situations in which the reader will have to implement an encoding format that better suits his or her application. Regardless of how you encode your data, make sure the end result sorts correctly as text for the full range of values you intend to use.

Our encoding of Boolean values is very simple: !b represents true and !B represents false. Because an uppercase B sorts before a lowercase b in ASCII order, false sorts before true. Example 13-10 and Example 13-11 define methods that will encode and decode Boolean values in the SimpleDB class.

This Boolean encoding is easy to use and to recognize:

# Encoding Boolean values
irb> sdb.encode_boolean(true)
=> "!b"
irb> sdb.encode_boolean(false)
=> "!B"
irb> sdb.encode_boolean(nil)
=> "!B"
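
As a rough illustration of how such a pair of methods might be written (a sketch, not the book's Example listings), the following standalone functions reproduce the output above. Treating every falsy value, including nil, as false is an assumption based on that output, and the error handling in decode_boolean is hypothetical.

```ruby
# Hypothetical standalone Boolean encoder.
# Truthy values become "!b"; false and nil become "!B".
def encode_boolean(value)
  value ? '!b' : '!B'
end

# Hypothetical standalone Boolean decoder.
# Returns true for "!b", false for "!B", and raises for anything else.
def decode_boolean(encoded)
  case encoded
  when '!b' then true
  when '!B' then false
  else raise ArgumentError, "Not an encoded Boolean: #{encoded.inspect}"
  end
end

puts encode_boolean(true)   # !b
puts encode_boolean(nil)    # !B
puts decode_boolean('!B')   # false
```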

We use the ISO 8601 date format to store dates, because this format was designed such that lexicographical order corresponds to chronological order in all but a few cases (such as years before 0000 or after 9999, which do not fit the standard four-digit year). To ensure that encoded dates can be properly compared, we always convert dates to the UTC time zone. Example 13-12 and Example 13-13 define the methods that encode and decode dates.

These date strings should look very familiar, because the AWS services use the ISO 8601 date format extensively.

irb> sdb.encode_date(Time.now)
=> "!d2008-01-03T05:14:39Z"

irb> sdb.decode_date('!d2008-01-03T05:12:50Z')
=> Thu Jan 03 05:12:50 UTC 2008
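
A minimal sketch of this date encoding, written as standalone functions rather than SimpleDB class methods and relying on Ruby's standard time library, might look like the following. It is an illustration under those assumptions, not the book's listing, and the error handling is hypothetical.

```ruby
require 'time'  # adds Time#iso8601 and Time.iso8601

# Hypothetical standalone date encoder: convert to UTC, then format
# as ISO 8601 with the "!d" prefix.
def encode_date(value)
  '!d' + value.getutc.iso8601
end

# Hypothetical standalone date decoder: strip the two-character prefix
# and parse the ISO 8601 remainder.
def decode_date(encoded)
  unless encoded.start_with?('!d')
    raise ArgumentError, "Not an encoded date: #{encoded.inspect}"
  end
  Time.iso8601(encoded[2..-1])
end

earlier = encode_date(Time.utc(2008, 1, 3, 5, 12, 50))
# A local time two hours ahead of UTC; it is 05:14:39 in UTC.
later = encode_date(Time.new(2008, 1, 3, 7, 14, 39, '+02:00'))

puts earlier          # !d2008-01-03T05:12:50Z
puts later            # !d2008-01-03T05:14:39Z
puts earlier < later  # true
```

The second value shows why the conversion to UTC matters: without it, the local hour 07 would be compared directly against the UTC hour 05.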

Integer values do not sort well when converted to text, because the ordering is affected by the number of digits in the string and by the presence of a minus sign for negative values. To encode positive integers to text strings, we add zeros to the beginning of the string so that all integer strings are the same length. Encoding negative numbers is more difficult. In this case we record the number as the difference between the actual value and the largest value that can be represented using our formatting scheme, given a limit on how many digits can be included. Positive and negative numbers are identified with the prefixes !i and !I, respectively; because an uppercase I sorts before a lowercase i, every negative value sorts ahead of every positive one. Example 13-14 and Example 13-15 define methods that encode and decode integer values.
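
The scheme just described can be sketched as follows. This is an illustration, not the book's listing: the 18-digit limit (MAX_DIGITS) and the use of 10**MAX_DIGITS as the reference value for negative numbers (OFFSET) are assumptions chosen for this example.

```ruby
# Hypothetical standalone integer encoder and decoder.
MAX_DIGITS = 18              # assumed limit on the number of digits
OFFSET = 10**MAX_DIGITS      # assumed reference value for negatives

def encode_integer(value)
  raise ArgumentError, "Integer out of range: #{value}" if value.abs >= OFFSET
  if value >= 0
    # Zero-pad so every positive value has the same length.
    '!i' + value.to_s.rjust(MAX_DIGITS, '0')
  else
    # Store the distance from OFFSET, so -1 becomes 999...999.
    '!I' + (OFFSET + value).to_s.rjust(MAX_DIGITS, '0')
  end
end

def decode_integer(encoded)
  digits = encoded[2..-1].to_i
  case encoded[0, 2]
  when '!i' then digits
  when '!I' then digits - OFFSET
  else raise ArgumentError, "Not an encoded integer: #{encoded.inspect}"
  end
end

values  = [25, -3, 1000, -250, 0]
encoded = values.map { |v| encode_integer(v) }

puts encode_integer(25)   # !i000000000000000025
puts encode_integer(-3)   # !I999999999999999997

# Sorting the text form yields the numeric order.
puts encoded.sort.map { |e| decode_integer(e) }.inspect  # [-250, -3, 0, 25, 1000]
```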

The floating-point data type is the most difficult one to encode into text values, because we must handle three separate components: the number’s sign, exponent, and fraction. To encode a float’s sign, we use the !f and !F prefixes to represent positive and negative values respectively. We store the exponent as a zero-padded integer with an offset value added to convert negative exponents into positive values. The fraction component is stored as an integer using the same technique described above to handle positive and negative values. If the fraction component is too large to fit in the space allowed, we reduce the precision by rounding the value.

Example 13-16 defines a method that encodes a positive or negative floating-point value, while Example 13-17 defines a method that decodes it. By default, the number of digits allocated to the float’s exponent is 2, which allows exponent values between –50 and 49 to be encoded. The default number of digits allocated to the float’s fraction is 15, which is comparable to the 15 to 17 significant decimal digits offered by the double-precision floating-point type of most languages.
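
As a rough illustration of the float scheme (a sketch, not the book's listings), the standalone functions below were written to reproduce the encoded values shown in this section. Several details are assumptions inferred from those values: zero is special-cased to an all-zero exponent and fraction, the exponent of a negative number is negated before the offset of 50 is added, and the constant names and error handling are hypothetical.

```ruby
EXP_DIGITS = 2                     # digits in the exponent field
PRECISION  = 15                    # digits in the fraction field
EXP_OFFSET = (10**EXP_DIGITS) / 2  # 50: makes stored exponents non-negative

def encode_float(value)
  value = Float(value)
  # Assumption: zero uses an all-zero exponent field and fraction.
  return "!f#{'0' * EXP_DIGITS}!#{'0' * PRECISION}" if value.zero?

  # "%.14e" yields 15 significant digits, already rounded,
  # for example "1.23456789012346e+19".
  mantissa, exp = format("%.#{PRECISION - 1}e", value.abs).split('e')
  digits   = mantissa.delete('.').to_i  # fraction as a 15-digit integer
  exponent = exp.to_i + 1               # normalized form: 0.ddd x 10^exponent

  if value > 0
    prefix, field, fraction = '!f', exponent + EXP_OFFSET, digits
  else
    # Negate the exponent and complement the fraction so that larger
    # magnitudes sort first among negative values.
    prefix, field, fraction = '!F', EXP_OFFSET - exponent, 10**PRECISION - digits
  end
  unless (0...10**EXP_DIGITS).cover?(field)
    raise ArgumentError, "Exponent out of range: #{value}"
  end

  prefix + field.to_s.rjust(EXP_DIGITS, '0') + '!' +
    fraction.to_s.rjust(PRECISION, '0')
end

def decode_float(encoded)
  field, fraction = encoded[2..-1].split('!').map(&:to_i)
  case encoded[0, 2]
  when '!f'
    sign, exponent = 1, field - EXP_OFFSET
  when '!F'
    sign, exponent, fraction = -1, EXP_OFFSET - field, 10**PRECISION - fraction
  else
    raise ArgumentError, "Not an encoded float: #{encoded.inspect}"
  end
  sign * Float("0.#{fraction.to_s.rjust(PRECISION, '0')}e#{exponent}")
end

puts encode_float(12345678901234567890)      # !f70!123456789012346
puts encode_float(-0.12345678901234567890)   # !F50!876543210987654
puts decode_float('!F50!876543210987654')    # -0.123456789012346
```

Letting format do the rounding keeps the sketch short, at the cost of handling only values that Ruby can represent as a Float.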

More encoding examples are shown here to further illustrate working with floating-point values.

irb> sdb.encode_float(0.0)
=> "!f00!000000000000000"
# Exponent field: 00 (zero does not use the offset of 50), 15-digit fraction: 000000000000000

irb> sdb.encode_float(12345678901234567890)
=> "!f70!123456789012346"
# Exponent: 20, Rounded 15-digit fraction: 123456789012346

irb> sdb.encode_float(0.12345678901234567890)
=> "!f50!123456789012346"
# Exponent: 0, Rounded 15-digit fraction: 123456789012346

irb> sdb.encode_float(-12345678901234567890)
=> "!F30!876543210987654"
# Exponent: 20, negated to -20 because the value is negative, 15-digit fraction difference: 876543210987654

irb> sdb.encode_float(-0.12345678901234567890)
=> "!F50!876543210987654"
# Exponent: 0, 15-digit fraction difference: 876543210987654

# Confirm the encoded values sort correctly
irb> ["!f00!000000000000000","!f70!123456789012346","!f50!123456789012346",
irb>  "!F30!876543210987654","!F50!876543210987654"].sort

=> ["!F30!876543210987654",  # -12345678901234567890
    "!F50!876543210987654",  # -0.12345678901234567890
    "!f00!000000000000000",  # 0.0
    "!f50!123456789012346",  # 0.12345678901234567890
    "!f70!123456789012346"]  # 12345678901234567890

To allow the SimpleDB class to automatically encode and decode attribute values on the fly, in Example 13-18 and Example 13-19 we will define the methods encode_attribute_value and decode_attribute_value. These are called by the existing class methods when attribute values are set (see Example 13-6) or retrieved (see Example 13-7).
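
One way such a pair of dispatching methods could be structured is sketched below. This is an illustration rather than the book's listing: it handles only Booleans and dates to stay short (integers and floats would be further branches of the same shape), it is written as standalone functions, and it assumes that plain string values never begin with one of the reserved prefixes.

```ruby
require 'time'  # adds Time#iso8601 and Time.iso8601

# Hypothetical encoder: choose an encoding from the Ruby class of the value.
def encode_attribute_value(value)
  case value
  when true, false then value ? '!b' : '!B'
  when Time        then '!d' + value.getutc.iso8601
  else value.to_s  # anything else is stored as plain text
  end
end

# Hypothetical decoder: use the prefix to decide how to decode stored text.
def decode_attribute_value(text)
  return true  if text == '!b'
  return false if text == '!B'
  return Time.iso8601(text[2..-1]) if text.start_with?('!d')
  text  # no recognized prefix: return the text unchanged
end

item = { 'title' => 'Example', 'published' => true,
         'created' => Time.utc(2008, 1, 3, 5, 14, 39) }

# Encode every attribute value before storing it...
stored = item.map { |name, value| [name, encode_attribute_value(value)] }.to_h
puts stored.inspect

# ...and decode every value after retrieving it.
restored = stored.map { |name, text| [name, decode_attribute_value(text)] }.to_h
puts restored == item  # true
```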