Converting text to bytes

If we need to convert incoming bytes into Unicode, we're clearly also going to have situations where we convert outgoing Unicode into byte sequences. This is done with the encode method on the str class, which, like the decode method, takes the name of a character set. The following code creates a Unicode string and encodes it in different character sets:

    characters = "cliché"
    print(characters.encode("UTF-8"))
    print(characters.encode("latin-1"))
    print(characters.encode("CP437"))
    print(characters.encode("ascii"))

The first three encodings each produce a different byte sequence for the accented character. The fourth one can't even handle that character:

    b'clich\xc3\xa9'
    b'clich\xe9'
    b'clich\x82'
    Traceback (most recent call last):
      File "1261_10_16_decode_unicode.py", line 5, in <module>
        print(characters.encode("ascii"))
    UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 5: ordinal not in range(128)  

Now you should understand the importance of encodings! The accented character is represented by different bytes in each encoding (two of them in UTF-8); if we use the wrong encoding when we are decoding bytes to text, we get the wrong character.
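To make that risk concrete, here is a minimal sketch that decodes the bytes from the earlier example with a mismatched character set. The variable names are illustrative only and do not come from the original example.

```python
# Encode the sample word with two of the encodings used above.
utf8_bytes = "cliché".encode("UTF-8")      # b'clich\xc3\xa9'
latin1_bytes = "cliché".encode("latin-1")  # b'clich\xe9'

# Decoding with the matching encoding round-trips cleanly.
right = latin1_bytes.decode("latin-1")

# Decoding with a mismatched encoding raises no error in these cases;
# it silently yields the wrong characters.
wrong_cp437 = latin1_bytes.decode("CP437")
wrong_latin1 = utf8_bytes.decode("latin-1")

print(right)         # cliché
print(wrong_cp437)   # clichΘ
print(wrong_latin1)  # clichÃ©
```

Note that neither mismatch raises an exception, which is why this kind of bug tends to surface as garbled text rather than as a crash.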

The exception in the last case is not always the desired behavior; there may be cases where we want the unknown characters to be handled in a different way. The encode method takes an optional string argument named errors that defines how such characters should be handled. This string can be one of the following: strict, replace, ignore, or xmlcharrefreplace.

The strict strategy is the default we just saw. When a character is encountered that does not have a valid representation in the requested encoding, an exception is raised. When the replace strategy is used, the character is replaced with a substitute; when encoding, this is a question mark, while decoding with Python's built-in codecs inserts the Unicode replacement character (U+FFFD), which is often displayed as a question mark in a diamond or as an empty box. The ignore strategy simply discards any characters it can't encode, while the xmlcharrefreplace strategy creates an XML character reference representing the Unicode character. This can be useful when converting unknown strings for use in an XML document. Here's how each of the strategies affects our sample word:

    Strategy             Result of applying "cliché".encode("ascii", strategy)
    replace              b'clich?'
    ignore               b'clich'
    xmlcharrefreplace    b'clich&#233;'
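The strategies described above can be tried out with a short sketch like the following. The names word, results, and strict_raised are illustrative only.

```python
word = "cliché"

# Collect the result of each non-default error strategy.
results = {
    strategy: word.encode("ascii", errors=strategy)
    for strategy in ("replace", "ignore", "xmlcharrefreplace")
}
for strategy, encoded in results.items():
    print(strategy, encoded)

# The default strategy, "strict", raises instead of returning bytes.
try:
    word.encode("ascii", errors="strict")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True
print("strict raised:", strict_raised)
```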

It is possible to call the str.encode and bytes.decode methods without passing an encoding name. In Python 3, the encoding then defaults to UTF-8, regardless of the operating system and locale or regional settings; you can confirm this using the sys.getdefaultencoding() function. It is usually a good idea to specify the encoding explicitly, though, since doing so documents which encoding the code expects, other interfaces (such as the built-in open function) have historically picked a platform-dependent default instead, and the program may one day be extended to work on text from a wider variety of sources.
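The defaults can be inspected directly. The following is a minimal sketch with illustrative variable names; the values printed depend on the interpreter and platform, although a Python 3 interpreter is expected to report utf-8 for the first one.

```python
import locale
import sys

# Default used by str.encode() and bytes.decode() when no name is given.
default_name = sys.getdefaultencoding()
print(default_name)

# Locale-dependent encoding that some file APIs may fall back on.
print(locale.getpreferredencoding(False))

# Omitting the name is equivalent to passing the default explicitly.
implicit = "cliché".encode()
explicit = "cliché".encode(default_name)
print(implicit == explicit)
```

If the two printed names differ on your machine, that is a good argument for always passing the encoding explicitly.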

If you are encoding text and don't know which encoding to use, it is best to use UTF-8. UTF-8 is able to represent any Unicode character. In modern software, it is the de facto standard encoding for ensuring that documents in any language, or even multiple languages, can be exchanged. The various other possible encodings are useful for legacy documents or in regions that still use different character sets by default.

The UTF-8 encoding uses one byte to represent each ASCII character, and two to four bytes for every other character. UTF-8 is special because it is backwards-compatible with ASCII; any ASCII document encoded using UTF-8 will be identical to the original ASCII document.
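Both properties are easy to check. This is a small sketch; the sample characters and variable names are chosen purely for illustration.

```python
# Number of UTF-8 bytes needed for characters with increasing code points.
samples = ["a", "é", "€", "😀"]
byte_counts = {ch: len(ch.encode("UTF-8")) for ch in samples}
print(byte_counts)

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
ascii_text = "plain text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("UTF-8")
print(same_bytes)
```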

I can never remember whether to use encode or decode to convert from binary bytes to Unicode. I always wished these methods were named to_binary and from_binary instead. If you have the same problem, try mentally replacing the word code with binary; enbinary and debinary are pretty close to to_binary and from_binary. I have saved a lot of time by not looking up the method help files since devising this mnemonic.