Representing Data in SimpleDB

The SimpleDB service stores all content as text, including the attribute values that store your data. The service does not recognize data types in the same way that a relational database does. This feature makes the service more flexible, because you can store any values you like without having to worry about whether they match a predefined schema; however, it also means that the service is only able to compare or sort values based on lexicographical (alphabetical) ordering. Whereas a traditional database can compare various data types based on a full understanding of what the particular type means, SimpleDB is oblivious to the standard data types and will assume that an alphabetical ordering always makes sense.

If you intend to perform queries that use comparison operators, such as less-than and greater-than, you will have to carefully encode any nontextual data you store in the service so that its lexicographical ordering is the same as the expected ordering for data of that type. You will also need to be able to decode these text values when you retrieve them from SimpleDB.
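To see why plain text comparison is a problem, consider the following minimal sketch. It uses only core Ruby and is not part of the SimpleDB class; the sample numbers are arbitrary.

```ruby
# Numbers compared as numbers behave as expected.
numeric_order = [9, 10, 100].sort

# The same values stored as text are compared character by character,
# so "10" and "100" both sort ahead of "9".
text_order = %w[9 10 100].sort

# Zero-padding every value to a fixed width restores the numeric order.
padded_order = [9, 10, 100].map { |n| format('%03d', n) }.sort

puts numeric_order.inspect  # [9, 10, 100]
puts text_order.inspect     # ["10", "100", "9"]
puts padded_order.inspect   # ["009", "010", "100"]
```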

In this section we will define methods to encode and decode the most commonly used data types: Boolean, date, integer, and float. Our data-type encodings are designed to meet two criteria:

The encoded values retain the expected ordering when sorted lexicographically.
Encoded values include a special prefix character (!), to make it easy to recognize them, and a second character to identify the data type the encoded value represents.

The encoding of data types into text strings is an advanced topic that we cannot discuss in depth in this book. The encoding techniques we present here are intended to meet the needs of most SimpleDB users, but there are bound to be some situations in which the reader will have to implement an encoding format that better suits his or her application. Regardless of how you encode your data, make sure the end result sorts correctly as text for the full range of values you intend to use.
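One way to check that an encoding sorts correctly is to compare the order of the raw values with the order of their encoded strings. The sketch below assumes a hypothetical helper named order_preserved?, which is not part of the SimpleDB class, and uses a simple offset-and-pad encoding purely for illustration.

```ruby
# Hypothetical helper: returns true when the encoded strings, taken in
# ascending value order, are already in ascending text order.
def order_preserved?(values)
  encoded_in_value_order = values.sort.map { |v| yield(v) }
  encoded_in_value_order == encoded_in_value_order.sort
end

samples = [-100, -3, 0, 7, 25, 99]

# Plain to_s fails, because "25" sorts before "7" as text.
naive_ok = order_preserved?(samples) { |v| v.to_s }

# Illustrative encoding only: shift by 100 and pad to three digits.
padded_ok = order_preserved?(samples) { |v| format('%03d', v + 100) }

puts naive_ok   # false
puts padded_ok  # true
```

Running a check like this over the smallest, largest, and boundary values you expect to store is a cheap way to catch ordering mistakes early.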

Boolean Encoding

Our encoding of Boolean values is very simple: !b represents true and !B represents false. Example 13-10 and Example 13-11 define methods that will encode and decode Boolean values in the SimpleDB class.

Example 13-10. Encode Boolean value: SimpleDB.rb

def encode_boolean(value)
  if value
    return '!b'
  else
    return '!B'
  end
end

Example 13-11. Decode Boolean value: SimpleDB.rb

def decode_boolean(value_str)
  if value_str == '!B'
    return false
  elsif value_str == '!b'
    return true
  else
    raise "Cannot decode boolean from string: #{value_str}"
  end
end

This Boolean encoding is easy to use and to recognize:

# Encoding boolean values
irb> sdb.encode_boolean(true)
=> "!b"
irb> sdb.encode_boolean(false)
=> "!B"
irb> sdb.encode_boolean(nil)
=> "!B"
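The choice of an uppercase letter for one value and a lowercase letter for the other also determines how the encoded values sort. The short sketch below uses only core Ruby string comparison to illustrate this; the same reasoning applies to the !I/!i and !F/!f prefixes used later in this section.

```ruby
# Uppercase letters precede lowercase letters in ASCII order, so the
# uppercase-prefixed string in each pair sorts first.
prefix_pairs = [['!B', '!b'], ['!I', '!i'], ['!F', '!f']]

uppercase_first = prefix_pairs.all? { |upper, lower| upper < lower }

puts uppercase_first                 # true
puts ['!b', '!B'].sort.inspect       # ["!B", "!b"], i.e. false, then true
```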

Date Encoding

We use the ISO 8601 date format to store dates, because this format was designed so that lexicographical order corresponds to chronological order in all but a few cases (such as years before 0000 or after 9999, which fall outside the four-digit year format). To ensure that encoded dates can be properly compared, we always convert dates to the UTC time zone. Example 13-12 and Example 13-13 define the methods that encode and decode date values.

Example 13-12. Encode date value: SimpleDB.rb

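def encode_date(value)
  # Time#iso8601 is provided by the 'time' standard library (require 'time').
  # getutc returns a UTC copy of the time, so all encoded dates share one zone.
  return "!d" + value.getutc.iso8601
end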

Example 13-13. Decode date value: SimpleDB.rb

def decode_date(value_str)
  if value_str[0..1] == '!d'
    return Time.parse(value_str[2..-1])
  else
    raise "Cannot decode date from string: #{value_str}"
  end
end

These date strings should look very familiar, because the AWS services use the ISO 8601 date format extensively.

irb> sdb.encode_date(Time.now)
=> "!d2008-01-03T05:14:39Z"

irb> sdb.decode_date('!d2008-01-03T05:12:50Z')
=> Thu Jan 03 05:12:50 UTC 2008
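The following sketch shows why the conversion to UTC matters. It is a standalone illustration that uses only Ruby's time standard library; the two sample times are arbitrary and represent the same instant in different time zones.

```ruby
require 'time'  # adds Time#iso8601

# Two representations of the same instant.
utc_time   = Time.utc(2008, 1, 3, 5, 14, 39)
local_time = Time.new(2008, 1, 3, 0, 14, 39, '-05:00')

# Without normalization the strings differ, so a text comparison would
# treat the two values as different moments.
puts utc_time.iso8601            # 2008-01-03T05:14:39Z
puts local_time.iso8601          # 2008-01-03T00:14:39-05:00

# After converting to UTC, both produce the same sortable string.
puts local_time.getutc.iso8601   # 2008-01-03T05:14:39Z
```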

Integer Encoding

Integer values do not sort well when converted to text, because the ordering is affected by the number of digits in the string and by the presence of a minus sign for negative values. To encode positive integers as text strings, we add zeros to the beginning of the string so that all integer strings are the same length. Encoding negative numbers is more difficult. In this case we record the number as an offset from the upper bound of our formatting scheme (10 raised to the maximum number of digits), so that more negative values produce smaller encoded numbers. With a two-digit limit, for example, -3 is stored as 100 - 3 = 97. Zero and positive numbers are identified with the prefix !i and negative numbers with the prefix !I; because uppercase letters sort before lowercase letters, negative values sort ahead of positive ones. Example 13-14 and Example 13-15 define methods that encode and decode integer values.

Example 13-14. Encode integer value: SimpleDB.rb

def encode_integer(value, max_digits=18)
  upper_bound = (10 ** max_digits)

  if value >= upper_bound or value < -upper_bound
    raise "Integer #{value} is outside encoding range (-#{upper_bound} " +
      "to #{upper_bound - 1})"
  end

  if value < 0
    return "!I" + format("%0#{max_digits}d", upper_bound + value)
  else
    return "!i" + format("%0#{max_digits}d", value)
  end
end

Example 13-15. Decode integer value: SimpleDB.rb

def decode_integer(value_str)
  if value_str[0..1] == '!I'
    # Encoded value is a negative integer
    max_digits = value_str.size - 2
    upper_bound = (10 ** max_digits)

    return value_str[2..-1].to_i - upper_bound
  elsif value_str[0..1] == '!i'
    # Encoded value is a positive integer
    return value_str[2..-1].to_i
  else
    raise "Cannot decode integer from string: #{value_str}"
  end
end

Some example encodings may make clearer how the integer encoding format produces strings that sort in the correct order.

# Maximum number of digits allowed in encoded strings (default is 18)
irb> max_digits = 2

irb> sdb.encode_integer(7, max_digits)
=> "!i07"

irb> sdb.encode_integer(25, max_digits)
=> "!i25"

irb> sdb.encode_integer(-3, max_digits)
=> "!I97"

irb> sdb.encode_integer(-100, max_digits)
=> "!I00"

# Confirm the encoded values sort correctly
irb> ["!i07", "!i25", "!I97", "!I00"].sort
=> ["!I00", "!I97", "!i07", "!i25"]
# i.e. -100,   -3,      7,     25
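The arithmetic behind the negative encoding can be traced by hand. The sketch below is a standalone illustration that repeats the same offset calculation outside the SimpleDB class, using a two-digit limit and arbitrary sample values.

```ruby
max_digits  = 2
upper_bound = 10**max_digits   # 100

negatives = [-1, -3, -50, -100]

# Each negative value is stored as upper_bound + value, zero-padded.
encoded = negatives.map { |n| format("!I%0#{max_digits}d", upper_bound + n) }
puts encoded.inspect        # ["!I99", "!I97", "!I50", "!I00"]

# Sorting the strings puts the most negative value first.
puts encoded.sort.inspect   # ["!I00", "!I50", "!I97", "!I99"]

# Subtracting upper_bound again recovers the original values.
decoded = encoded.map { |s| s[2..-1].to_i - upper_bound }
puts decoded.inspect        # [-1, -3, -50, -100]
```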

Float Encoding

The floating-point data type is the most difficult one to encode into text values, because we must handle three separate components: the number’s sign, exponent, and fraction. To encode a float’s sign, we use the !f and !F prefixes to represent positive and negative values respectively. We store the exponent as a zero-padded integer with an offset value added to convert negative exponents into positive values. The fraction component is stored as an integer using the same technique described above to handle positive and negative values. If the fraction component is too large to fit in the space allowed, we reduce the precision by rounding the value.

Example 13-16 defines a method that encodes a positive or negative floating-point value, while Example 13-17 defines a method that decodes it. By default, the number of digits allocated to the float’s exponent is 2, which allows exponent values between –50 and 49 to be encoded. The default number of digits allocated to the float’s fraction is 15, which is close to the 15 to 17 significant decimal digits that a double-precision float can represent; values with more digits are rounded to fit.
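As a rough illustration of the components involved, the sketch below shows how Ruby's BigDecimal#split breaks a number into its sign, digits, and exponent, and how a midpoint offset turns the exponent into a non-negative, fixed-width field. It is a standalone example with arbitrary sample values, not part of the SimpleDB class.

```ruby
require 'bigdecimal'

# split returns [sign, significant digits, base, exponent], where the
# value equals sign * 0.<digits> * base**exponent.
sign, digits, base, exponent = BigDecimal('123.45').split
puts [sign, digits, base, exponent].inspect   # [1, "12345", 10, 3]

# With two exponent digits, the midpoint offset is 50.
exp_midpoint = (10**2) / 2

# Positive values add the exponent to the midpoint, so larger
# magnitudes sort later.
positive_field = format('%02d', exp_midpoint + exponent)
puts positive_field                           # 53

# Negative values subtract the exponent instead, so larger magnitudes
# sort earlier: -4200 (exponent 4) precedes -42 (exponent 2).
exp_large = BigDecimal('-4200').split[3]
exp_small = BigDecimal('-42').split[3]
large_field = format('%02d', exp_midpoint - exp_large)
small_field = format('%02d', exp_midpoint - exp_small)
puts [large_field, small_field].inspect       # ["46", "48"]
```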

Example 13-16. Encode float value: SimpleDB.rb

def encode_float(value, max_exp_digits=2, max_precision_digits=15)
  exp_midpoint = (10 ** max_exp_digits) / 2

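  # BigDecimal is provided by the 'bigdecimal' standard library.
  # split returns the sign, the significant digits as a string, the base (10),
  # and the exponent, where value = sign * 0.<digits> * 10**exponent.
  sign, fraction, base, exponent = BigDecimal(value.to_s).split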

  if exponent >= exp_midpoint or exponent < -exp_midpoint
    raise "Exponent #{exponent} is outside encoding range " +
      "(-#{exp_midpoint} " + "to #{exp_midpoint - 1})"
  end

  if fraction.size > max_precision_digits
    # Round fraction value if it exceeds allowed precision.
    fraction_str = fraction[0...max_precision_digits] + '.' +
                   fraction[max_precision_digits..-1]
    fraction = BigDecimal(fraction_str).round(0).split[1]
  elsif fraction.size < max_precision_digits
    # Right-pad fraction with zeros if it is too short.
    fraction = fraction + ('0' * (max_precision_digits - fraction.size))
  end

  # Zero is a special case: use the smallest allowed exponent so that its
  # encoded exponent field is 0 and it sorts before every positive value
  exponent = -exp_midpoint if value == 0

  if sign == 1
    return format("!f%0#{max_exp_digits}d", exp_midpoint + exponent) +
      format("!%0#{max_precision_digits}d", fraction.to_i)
  else
    fraction_upper_bound = (10 ** max_precision_digits)
    diff_fraction = fraction_upper_bound - BigDecimal(fraction)
    return format("!F%0#{max_exp_digits}d", exp_midpoint - exponent) +
      format("!%0#{max_precision_digits}d", diff_fraction)
  end
end

Example 13-17. Decode float value: SimpleDB.rb

def decode_float(value_str)
  prefix = value_str[0..1]

  if prefix != '!f' and prefix != '!F'
    raise "Cannot decode float from string: #{value_str}"
  end

  value_str =~ /![fF]([0-9]+)!([0-9]+)/
  exp_str = $1
  fraction_str = $2

  max_exp_digits = exp_str.size
  exp_midpoint = (10 ** max_exp_digits) / 2
  max_precision_digits = fraction_str.size

  if prefix == '!F'
    sign = -1
    exp = exp_midpoint - exp_str.to_i

    fraction_upper_bound = (10 ** max_precision_digits)
    fraction = fraction_upper_bound - BigDecimal(fraction_str)
  else
    sign = 1
    exp = exp_str.to_i - exp_midpoint

    fraction = BigDecimal(fraction_str)
  end

  return sign * "0.#{fraction.to_i}".to_f * (10 ** exp)
end

More encoding examples are shown here to further illustrate working with floating-point values.

irb> sdb.encode_float(0.0)
=> "!f00!000000000000000"
# Zero is a special case: exponent field 00, 15-digit fraction: 000000000000000

irb> sdb.encode_float(12345678901234567890)
=> "!f70!123456789012346"
# Exponent: 20, Rounded 15-digit fraction: 123456789012346

irb> sdb.encode_float(0.12345678901234567890)
=> "!f50!123456789012346"
# Exponent: 0, Rounded 15-digit fraction: 123456789012346

irb> sdb.encode_float(-12345678901234567890)
=> "!F30!876543210987654"
# Exponent: 20 (stored as 50 - 20 = 30), 15-digit fraction difference: 876543210987654

irb> sdb.encode_float(-0.12345678901234567890)
=> "!F50!876543210987654"
# Exponent: 0, 15-digit fraction difference: 876543210987654

# Confirm the encoded values sort correctly
irb> ["!f00!000000000000000","!f70!123456789012346","!f50!123456789012346",
irb>  "!F30!876543210987654","!F50!876543210987654"].sort

=> ["!F30!876543210987654",  # -12345678901234567890
    "!F50!876543210987654",  # -0.12345678901234567890
    "!f00!000000000000000",  # 0.0
    "!f50!123456789012346",  # 0.12345678901234567890
    "!f70!123456789012346"]  # 12345678901234567890

Note

The documentation and code samples provided by Amazon describe alternative strategies for encoding integers and floating-point numbers. You may prefer Amazon’s approach to ours, because it is more straightforward, though it requires that you know in advance the largest negative numbers you will need to store.

Automated Encoding and Decoding of Values

To allow the SimpleDB class to automatically encode and decode attribute values on the fly, in Example 13-18 and Example 13-19 we will define the methods encode_attribute_value and decode_attribute_value. These are called by the existing class methods when attribute values are set (see Example 13-6) or retrieved (see Example 13-7).

Example 13-18. Encode an attribute value of any type: SimpleDB.rb

def encode_attribute_value(value)
  if value == true or value == false
    return encode_boolean(value)
  elsif value.is_a? Time
    return encode_date(value)
  elsif value.is_a? Integer
    return encode_integer(value)
  elsif value.is_a? Numeric
    return encode_float(value)
  else
    # No type-specific encoding is available, so we simply convert
    # the value to a string.
    return value.to_s
  end
end

Example 13-19. Decode an attribute value of any type: SimpleDB.rb

def decode_attribute_value(value_str)
  return '' if value_str.nil?

  # Check whether the '!' flag is present to indicate an encoded value
  return value_str if value_str[0..0] != '!'

  prefix = value_str[0..1].downcase
  if prefix == '!b'
    return decode_boolean(value_str)
  elsif prefix == '!d'
    return decode_date(value_str)
  elsif prefix == '!i'
    return decode_integer(value_str)
  elsif prefix == '!f'
    return decode_float(value_str)
  else
    return value_str
  end
end
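The order of the type checks in Example 13-18 matters, because every Integer is also a Numeric. The sketch below uses a hypothetical helper named encoder_for, which is not part of the SimpleDB class, to mirror those checks and report which encoder would be selected for a few sample values.

```ruby
# Hypothetical helper: applies the same checks, in the same order, as
# encode_attribute_value, but returns a label instead of encoding.
def encoder_for(value)
  if value == true or value == false
    :boolean
  elsif value.is_a? Time
    :date
  elsif value.is_a? Integer
    :integer   # must be tested before Numeric
  elsif value.is_a? Numeric
    :float
  else
    :string    # falls back to value.to_s
  end
end

puts 42.is_a?(Numeric)       # true, so the Integer test has to come first
puts encoder_for(42)         # integer
puts encoder_for(3.14)       # float
puts encoder_for(false)      # boolean
puts encoder_for(Time.now)   # date
puts encoder_for('plain')    # string
```

If the Numeric test came first, integers would be routed to the float encoder and stored with the !f prefix rather than !i.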