Iterate over the characters in the URL looking for a percent symbol followed by two hexadecimal digits. When such a sequence is encountered, combine the hexadecimal digits to obtain the character with which to replace the entire sequence. For example, in the ASCII character set, the letter "A" has the value 0x41, which could be encoded as "%41".
RFC 1738 defines the syntax for URLs. Section 2.2 of that document also defines the rules for encoding characters in a URL. While some characters must always be encoded, any character may be encoded. Essentially, this means that before you do anything with a URL—whether you need to parse the URL into pieces (i.e., username, password, host, and so on), match portions of the URL against a whitelist or blacklist, or something else entirely—you need to decode it.
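To see why, consider a sketch of a naive filter (the function name naive_reject() is hypothetical) that looks for ".." in the raw, still-encoded URL. An attacker who writes the dots as "%2E%2E" slips past the check, even though the decoded URL contains exactly the sequence being blacklisted:

#include <string.h>

/* Sketch only: a blacklist check applied before decoding.  Because the
 * comparison is textual, "%2E%2E" and ".." do not match, and the filter
 * is bypassed.  Decode first, then match.
 */
static int naive_reject(const char *url) {
  return strstr(url, "..") != 0;   /* 1 = reject, 0 = allow */
}

/* naive_reject("/files/../etc/passwd")      returns 1 (rejected)         */
/* naive_reject("/files/%2E%2E/etc/passwd")  returns 0 (wrongly allowed)  */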
The problem is that you must make certain that you never decode a URL that has already been decoded; otherwise, you will be vulnerable to double-encoding attacks. Suppose that the URL contains the sequence "%25%34%31". Decoded once, the result is "%41" because "%25" is the encoding for the percent symbol, "%34" is the encoding for the character "4", and "%31" is the encoding for the character "1". Decoded twice, the result is "A".
At first glance, this may seem harmless, but what if you were to decode repeatedly until there were no more escaped characters? You would make it impossible to represent certain sequences of characters, such as a literal "%41", in a URL at all. The purpose of encoding in the first place is to allow the use of characters that have special meaning or that cannot be represented visually.
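To make that concrete, here is a sketch of the double-decoding trap (it calls the spc_decode_url() function presented later in this recipe; the demonstration function name and the input "%2541" are ours):

#include <stdlib.h>

/* Demonstration only: decoding "%2541" once yields "%41", which is the
 * correct result.  Running a second pass over that result yields "A", so
 * the caller can no longer tell whether the sender meant the literal
 * four-character string "%41" or the single letter "A".
 */
void double_decode_demo(void) {
  size_t n;
  char   *once, *twice;

  if (!(once = spc_decode_url("%2541", &n))) return;   /* once  == "%41" */
  if ((twice = spc_decode_url(once, &n)))               /* twice == "A"   */
    free(twice);
  free(once);
}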
Another potential problem with encoding that is limited primarily to C and C++ is that a NULL-terminator can be encoded anywhere in the URL. There are several approaches to dealing with this problem. One is to treat the decoded string as a binary array rather than a C-style string; another is to use the SafeStr library described in Recipe 3.4, because it gives no special significance to any one character.
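As a small sketch of the first approach (the helper name is ours), a length-aware search such as memchr() operates on the decoded bytes directly and is not fooled by an embedded NULL:

#include <string.h>

/* Sketch only: given the decoded bytes and their true length (excluding
 * the terminating NULL), report whether any NULL was encoded in the
 * original URL.
 */
static int has_embedded_null(const char *buf, size_t nbytes) {
  return memchr(buf, 0, nbytes) != 0;
}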
You can use the following spc_decode_url() function to decode a URL. It returns a dynamically allocated copy of the URL in decoded form. The result will be NULL-terminated, so it may be treated as a C-style string, but it may contain embedded NULLs as well. You can determine whether it contains embedded NULLs by comparing the number of bytes spc_decode_url() indicates that it returns with the result of calling strlen() on the decoded URL. If the URL contains embedded NULLs, the result from strlen() will be less than the number of bytes indicated by spc_decode_url().
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* Convert a single hexadecimal digit character to its numeric value. */
#define SPC_BASE16_TO_10(x) (((x) >= '0' && (x) <= '9') ? ((x) - '0') : \
                             (toupper((x)) - 'A' + 10))

char *spc_decode_url(const char *url, size_t *nbytes) {
  char       *out, *ptr;
  const char *c;

  if (!(out = ptr = strdup(url))) return 0;
  for (c = url;  *c;  c++) {
    /* Copy ordinary characters; replace a valid %XX escape with the byte
     * it encodes and skip over the two hexadecimal digits.
     */
    if (*c != '%' || !isxdigit(c[1]) || !isxdigit(c[2])) *ptr++ = *c;
    else {
      *ptr++ = (SPC_BASE16_TO_10(c[1]) * 16) + SPC_BASE16_TO_10(c[2]);
      c += 2;
    }
  }
  *ptr = 0;
  if (nbytes) *nbytes = (ptr - out);   /* does not include null byte */
  return out;
}
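As a usage sketch of that check (the input string is ours), the following fragment decodes a URL containing an encoded NULL and compares the byte count reported by spc_decode_url() with the result of strlen():

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
  size_t nbytes;
  char   *decoded;

  /* "%00" decodes to an embedded NULL, so strlen() sees only "abc" while
   * spc_decode_url() reports all seven decoded bytes.
   */
  if (!(decoded = spc_decode_url("abc%00def", &nbytes))) return 1;
  if (strlen(decoded) < nbytes)
    printf("embedded NULL: strlen() = %lu, decoded bytes = %lu\n",
           (unsigned long)strlen(decoded), (unsigned long)nbytes);
  free(decoded);
  return 0;
}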
RFC 1738: Uniform Resource Locators (URL)