Iterate over the characters in the URL looking for a percent symbol followed by two hexadecimal digits. When such a sequence is encountered, combine the hexadecimal digits to obtain the character with which to replace the entire sequence. For example, in the ASCII character set, the letter "A" has the value 0x41, which could be encoded as "%41".
RFC 1738 defines the syntax for URLs. Section 2.2 of that document also defines the rules for encoding characters in a URL. While some characters must always be encoded, any character may be encoded. Essentially, this means that before you do anything with a URL—whether you need to parse the URL into pieces (i.e., username, password, host, and so on), match portions of the URL against a whitelist or blacklist, or something else entirely—you need to decode it.
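To see why, consider a sketch of a naive filter (the function name naive_reject() is hypothetical) that looks for ".." in the raw, still-encoded URL. An attacker who writes the dots as "%2E%2E" slips past the check, even though the decoded URL contains exactly the sequence being blacklisted:

#include <string.h>

/* Sketch only: a blacklist check applied before decoding.  Because the
 * comparison is textual, "%2E%2E" and ".." do not match, and the filter
 * is bypassed.  Decode first, then match.
 */
static int naive_reject(const char *url) {
  return strstr(url, "..") != 0;   /* 1 = reject, 0 = allow */
}

/* naive_reject("/files/../etc/passwd")      returns 1 (rejected)         */
/* naive_reject("/files/%2E%2E/etc/passwd")  returns 0 (wrongly allowed)  */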
The problem is that you must make certain that you never decode a URL that has already been decoded; otherwise, you will be vulnerable to double-encoding attacks. Suppose that the URL contains the sequence "%25%34%31". Decoded once, the result is "%41" because "%25" is the encoding for the percent symbol, "%34" is the encoding for the character "4", and "%31" is the encoding for the character "1". Decoded twice, the result is "A".
At first glance, this may seem harmless, but what if you were to decode repeatedly until there were no more escaped characters? You would make it impossible to represent certain sequences of characters, such as a literal "%41", in a URL at all. The purpose of encoding in the first place is to allow the use of characters that have special meaning or that cannot be represented visually.
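To make that concrete, here is a sketch of the double-decoding trap (it calls the spc_decode_url() function presented later in this recipe; the demonstration function name and the input "%2541" are ours):

#include <stdlib.h>

/* Demonstration only: decoding "%2541" once yields "%41", which is the
 * correct result.  Running a second pass over that result yields "A", so
 * the caller can no longer tell whether the sender meant the literal
 * four-character string "%41" or the single letter "A".
 */
void double_decode_demo(void) {
  size_t n;
  char   *once, *twice;

  if (!(once = spc_decode_url("%2541", &n))) return;   /* once  == "%41" */
  if ((twice = spc_decode_url(once, &n)))               /* twice == "A"   */
    free(twice);
  free(once);
}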
Another potential problem with encoding that is limited primarily to C and C++ is that a NULL-terminator can be encoded anywhere in the URL. There are several approaches to dealing with this problem. One is to treat the decoded string as a binary array rather than a C-style string; another is to use the SafeStr library described in Recipe 3.4, because it gives no special significance to any one character.
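As a small sketch of the first approach (the helper name is ours), a length-aware search such as memchr() operates on the decoded bytes directly and is not fooled by an embedded NULL:

#include <string.h>

/* Sketch only: given the decoded bytes and their true length (excluding
 * the terminating NULL), report whether any NULL was encoded in the
 * original URL.
 */
static int has_embedded_null(const char *buf, size_t nbytes) {
  return memchr(buf, 0, nbytes) != 0;
}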
You can use the following spc_decode_url() function to decode a URL. It returns a dynamically allocated copy of the URL in decoded form. The result will be NULL-terminated, so it may be treated as a C-style string, but it may contain embedded NULLs as well. You can determine whether it contains embedded NULLs by comparing the number of bytes spc_decode_url() indicates that it returns with the result of calling strlen() on the decoded URL. If the URL contains embedded NULLs, the result from strlen() will be less than the number of bytes indicated by spc_decode_url().
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* Convert a single hexadecimal digit character to its numeric value. */
#define SPC_BASE16_TO_10(x) (((x) >= '0' && (x) <= '9') ? ((x) - '0') : \
                             (toupper((x)) - 'A' + 10))

char *spc_decode_url(const char *url, size_t *nbytes) {
  char       *out, *ptr;
  const char *c;

  if (!(out = ptr = strdup(url))) return 0;
  for (c = url;  *c;  c++) {
    /* Copy ordinary characters; replace a valid %XX escape with the byte
     * it encodes and skip over the two hexadecimal digits.
     */
    if (*c != '%' || !isxdigit(c[1]) || !isxdigit(c[2])) *ptr++ = *c;
    else {
      *ptr++ = (SPC_BASE16_TO_10(c[1]) * 16) + SPC_BASE16_TO_10(c[2]);
      c += 2;
    }
  }
  *ptr = 0;
  if (nbytes) *nbytes = (ptr - out);   /* does not include null byte */
  return out;
}
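As a usage sketch of that check (the input string is ours), the following fragment decodes a URL containing an encoded NULL and compares the byte count reported by spc_decode_url() with the result of strlen():

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
  size_t nbytes;
  char   *decoded;

  /* "%00" decodes to an embedded NULL, so strlen() sees only "abc" while
   * spc_decode_url() reports all seven decoded bytes.
   */
  if (!(decoded = spc_decode_url("abc%00def", &nbytes))) return 1;
  if (strlen(decoded) < nbytes)
    printf("embedded NULL: strlen() = %lu, decoded bytes = %lu\n",
           (unsigned long)strlen(decoded), (unsigned long)nbytes);
  free(decoded);
  return 0;
}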
RFC 1738: Uniform Resource Locators (URL)