Sometimes insight into the person responsible for a message or web site can come from the language they use to express themselves. Part of this is glaringly obvious. If someone sends me an email in Korean, then it is a good bet that she is Korean. But in the case of English, the most common language used on the Internet, you cannot assume that to be the author’s native language.
But careful examination may reveal clues about that language. In most cases, these will add weight to other clues about location and nationality. In others, they disagree with other evidence, suggesting that the author is using a computer in a foreign country or that he is a resident in that country.
Email is usually the richest source of this type of clue. Here you want to look at the headers Content-Transfer-Encoding and Content-Type. These occur in the main block of mail headers or in each block of a multipart message. Here is a simple example:
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset="iso-8859-1"
The Content-Type header is the more important of the two, but it helps to know a little about content encoding first.
The original specification for email was set up to handle only the 128 characters of the ASCII character set, which can be encoded in 7 bits. That was fine for basic messages in English or other languages that use this basic character set. But for languages with even a few special characters, such as German umlauts or French accented characters, the specification was too rigid. The solution was to encode each byte outside the 7-bit range as a pair of hexadecimal digits, preceded by a special character, the equals sign (=), which tells email software where these special codes start. This type of encoding is called quoted-printable and is very widely used in email today. Mail readers handle it transparently, but it makes things difficult for anyone looking through the source of a message, as this example of some encoded Korean text illustrates:
<TITLE>=C3=BB=B1=B9=C8=AF<br>*** =C0=CC=B9=CC=C1=F6=BA=B8=B1=
=E2=B8=A6=B4=AD==B7=AF=C1=D6=BC=BC=BF=E4 ***<br>***=C0=CC=
=B9=CC=C1=F6=BA=B8=B1=E2=B8=A6=B4=AD=B7=AF=C1=D6=BC=BC=BF=E4=
***<br></TITLE>
The Perl script shown in Example 9-1 will convert quoted-printable text into plain ASCII, as far as it can. Decoded characters that are not in the ASCII character set will not be displayed, so use the script with caution. Even so, it can help you read an otherwise indecipherable message. This type of encoding can be used as a form of obfuscation, like the techniques described in Chapter 4, but in most cases it is used legitimately to handle international characters.
Example 9-1. convert_quoted_printable.pl
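#!/usr/bin/perl -w
# Decode the quoted-printable text in a mail file, or on STDIN if no file is given.
use strict;

if (@ARGV > 1) {
    die "Usage: $0 <mail file>\n";
}
my $file = @ARGV ? $ARGV[0] : '-';

my $input;
if ($file eq '-') {
    $input = \*STDIN;
} else {
    open $input, '<', $file
        or die "$0: Unable to open file $file: $!\n";
}

my $lastline = '';
while (my $line = <$input>) {
    # A trailing "=" is a soft line break: strip it and join with the next line.
    # Test for it before decoding so that a decoded "=3D" is not mistaken for one.
    my $soft_break = ($line =~ s/=\r?\n?\z//);

    # Replace each =XX escape with the byte that the two hex digits encode.
    $line =~ s/=([0-9A-Fa-f]{2})/chr hex $1/ge;

    $lastline .= $line;
    next if $soft_break;
    print $lastline;
    $lastline = '';
}
print $lastline;    # flush text left over if the input ended on a soft break
close $input;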
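As an alternative to decoding by hand, Perl's standard distribution includes the MIME::QuotedPrint module. The following is a minimal sketch, assuming the encoded text is already held in a string; the sample string is invented for illustration. It prints the decoded bytes as hexadecimal so that non-ASCII values stay visible on any terminal.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Invented sample: two iso-8859-1 bytes (FC and DF) split by a soft line break.
my $encoded = "Gr=FC=\n=DFe\n";

# decode_qp returns raw bytes; it knows nothing about the character set.
my $bytes = decode_qp($encoded);

# Print each decoded byte as two hex digits.
print join(' ', map { sprintf '%02X', ord($_) } split //, $bytes), "\n";
```

The decoded bytes only become readable text once you know which character set they belong to, which is what the Content-Type header tells you.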
A more recent alternative to quoted-printable, declared with the Content-Transfer-Encoding value 8bit, uses all 8 bits in a byte to carry information. This encoding is more compact and allows you to read the source of an email message in its native alphabet, given an appropriate mail client.
Content encoding lets you transfer international characters via the ASCII-based mail protocol. But if you are to get your message across, you need to tell the receiving mail client what language those codes represent. That is defined in the Character Set, also known as the Code Page. All this machinery falls under the term Internationalization. Be very grateful that other people have figured out how to do this, so you don’t have to! Fortunately, we are only concerned with character sets, although that is complicated enough.
A character set is basically a lookup table that maps a code into a font character. Modern mail clients come with a collection of these sets and the ability to display them. A mail message that wants to display German characters, for example, will encode those characters and include a mail header that specifies which character set should be used.
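To make the lookup-table idea concrete, here is a small sketch that uses Perl's Encode module to decode the same single byte under three different character sets. The choice of byte and of character sets is mine, purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# One byte, three lookup tables.
my $byte = "\xE4";

for my $charset ('iso-8859-1', 'iso-8859-5', 'windows-1251') {
    my $char = decode($charset, $byte);
    # Print the Unicode code point that this charset assigns to byte E4.
    printf "%-12s byte E4 -> U+%04X\n", $charset, ord($char);
}
```

The byte E4 maps to a Latin letter under iso-8859-1 and to two different Cyrillic letters under iso-8859-5 and windows-1251, which is why a message must declare its character set.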
There are a lot of character sets. Many more, in fact, than there are languages in the world. You can learn more about them from this online tutorial http://www.w3.org/International/tutorials/tutorial-char-enc/, and see a list of them at http://www.iana.org/assignments/character-sets.
The character set that should be used with a specific email message is defined in the charset attribute of the Content-Type header.
Content-Type: text/html; charset="iso-8859-1"
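If you need to pull this attribute out of many messages, a short regular expression is usually enough. The sketch below assumes each header has already been read as a single line (it does not handle headers folded across several lines), and the function name charset_from_header is my own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the lowercased charset named in a Content-Type header line,
# or undef if the line is not a Content-Type header or has no charset.
sub charset_from_header {
    my ($header) = @_;
    return unless $header =~ /^Content-Type:/i;
    return lc $1 if $header =~ /;\s*charset\s*=\s*"?([^";\s]+)"?/i;
    return;
}

my $charset = charset_from_header('Content-Type: text/html; charset="iso-8859-1"');
print "$charset\n" if defined $charset;
```

Lowercasing the result makes later comparisons simpler, because charset names are not case sensitive.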
Probably the most commonly used character set is iso-8859-1, which covers what linguists call the Latin alphabet. It includes all the characters in the English alphabet and most of those needed to represent the majority of Western European languages, as well as Swahili and Afrikaans. More interesting from the forensics perspective are the character sets for other alphabets such as Cyrillic, Arabic, Hebrew, Korean, and so on.
If your language does not use the Latin alphabet, then you will most likely set your operating system to use the appropriate character set. When you send an email message, that set is declared in the Content-Type header. Most character sets can represent the core English alphabet in addition to their own characters, so you often find English text sent in one of these alternative character sets. By looking for that mismatch, you may be able to identify the native language of the author. This pair of headers, taken from a phishing email, is a good example:
Content-Transfer-Encoding: Base64
Content-Type: text/html; charset="windows-1251"
Decoded from its Base64 representation, the content was a message in English that appeared to come from a bank. The character set is defined as windows-1251. Microsoft has defined a number of its own character sets; this one is used for Cyrillic alphabets. That is a strong indication that the author speaks one of the Slavic languages written in Cyrillic, such as Russian, Bulgarian, or Ukrainian.
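The mismatch described here can be checked mechanically. The following sketch, which uses the standard MIME::Base64 module, flags a message whose declared character set is non-Latin but whose decoded body contains only 7-bit ASCII. The list of uninformative character sets, the function name, and the sample body are all assumptions made for illustration, not an established method.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64 encode_base64);

# Charsets that say little about a non-Latin native language.
# This short list is illustrative and far from complete.
my %uninformative = map { $_ => 1 }
    qw(us-ascii iso-8859-1 iso-8859-15 windows-1252 utf-8);

# Return 1 when the declared charset is outside the list above but the
# Base64-decoded body holds nothing except 7-bit ASCII bytes.
sub ascii_only_in_foreign_charset {
    my ($charset, $base64_body) = @_;
    return 0 if $uninformative{ lc $charset };
    my $bytes = decode_base64($base64_body);
    return 0 unless length $bytes;
    return $bytes =~ /[^\x00-\x7F]/ ? 0 : 1;
}

# Hypothetical message body, encoded here so that the example is self-contained.
my $body = encode_base64("Dear customer, please verify your account details.\n");
print "windows-1251 declared, but the body is pure ASCII\n"
    if ascii_only_in_foreign_charset('windows-1251', $body);
```

A positive result is only a hint: it shows that the sender's software defaults to that character set, not where the sender actually is.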
Software used to create web pages will also define the appropriate character set, typically in a meta tag. Of these three examples, the first defines a Cyrillic character set and the other two define Korean character sets.
<meta http-equiv="Content-Type" content="text/html; charset= iso-8859-5">
<meta http-equiv="Content-Type" Content="text-html; charset=ks_c_5601-1987">
<meta http-equiv="Content-Type" content="text/html; charset=euc_kr">
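A simple pattern match is enough to pull these declarations out of a saved page. The sketch below is deliberately loose, since real pages vary in case, spacing, and quoting; the function name charsets_from_html is my own, and a full HTML parser would be more robust.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect every charset value declared inside the <meta> tags of an HTML page.
sub charsets_from_html {
    my ($html) = @_;
    my @found;
    while ($html =~ /<meta\b[^>]*?charset\s*=\s*["']?\s*([\w.:-]+)/gi) {
        push @found, lc $1;
    }
    return @found;
}

my $page = join "\n",
    '<meta http-equiv="Content-Type" content="text/html; charset= iso-8859-5">',
    '<meta http-equiv="Content-Type" Content="text-html; charset=ks_c_5601-1987">',
    '<meta http-equiv="Content-Type" content="text/html; charset=euc_kr">';

print "$_\n" for charsets_from_html($page);
```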
There are so many character sets in use that I can’t list them all here. Running a Google search on the name is probably the easiest way to find out more about a specific set.
Interestingly, some of the unusual sets that I have encountered turn out to be bogus. iso-4238-5, iso-7981-6, iso-2426-6, and iso-9110-9 do not match any character set in any list that I can find and produce no hits with Google. They all occurred in spam emails, so they may have been placed there as a way to avoid spam filters. However, they may do the spammer more harm than good in that they could serve as distinctive signature strings for this source of spam.
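One quick first check on a suspicious name is to ask Perl's Encode module whether it recognizes it. This is only a sketch: Encode does not know every registered character set, so an unrecognized name is a prompt to check the IANA list, not proof that the name is bogus. The function name known_charset is my own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Return 1 if the local Encode installation recognizes the charset name.
sub known_charset {
    my ($name) = @_;
    return defined(find_encoding($name)) ? 1 : 0;
}

for my $name (qw(iso-8859-1 windows-1251 iso-4238-5 iso-7981-6 iso-2426-6 iso-9110-9)) {
    printf "%-14s %s\n", $name, known_charset($name) ? 'known' : 'not recognized';
}
```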
If you have looked at much spam, you will be familiar with the poor English usage in much of it. This alone may indicate a source outside the main English-speaking countries, but trying to infer any more detail than this is effectively impossible. However, if you are able to access the source code of a script, then you may get lucky. Assuming that no one else will look at their code, programmers may be tempted to use variable names and strings in their native language. Example 5-4 illustrates this with the use of the word “parola” in place of “password,” suggesting that the author is Italian. The U.K. Honeynet Project has identified a Romanian connection to a script it uncovered, based on the variable names $mesaj and $muie (http://www.honeynet.org/papers/phishing/). Such examples are rare but are very rewarding when you find them.
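If you have a lot of script source to review, you could automate a first pass over the variable names. The sketch below is purely illustrative: the two-entry word list simply restates the examples mentioned above, the function name language_hints is my own, and a useful check would need a far larger dictionary and a human to interpret the hits.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tiny illustrative word list based on the examples in the text.
my %hint = (
    parola => 'Italian',
    mesaj  => 'Romanian',
);

# Return a hash of variable names found in the source that appear in the
# word list, mapped to the language each one hints at.
sub language_hints {
    my ($source) = @_;
    my %seen;
    # Match Perl- or PHP-style variable names such as $mesaj.
    while ($source =~ /\$([A-Za-z_]\w*)/g) {
        my $name = lc $1;
        $seen{$name} = $hint{$name} if exists $hint{$name};
    }
    return %seen;
}

# Hypothetical fragment of script source.
my %hits = language_hints('$mesaj = "Subject: " . $subject;');
print "$_: $hits{$_}\n" for sort keys %hits;
```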