Language

Sometimes insight into the person responsible for a message or web site can come from the language they use to express themselves. Part of this is glaringly obvious. If someone sends me an email in Korean, then it is a good bet that she is Korean. But in the case of English, the most common language used on the Internet, you cannot assume that to be the author’s native language.

But careful examination may reveal clues about the author's native language. In most cases, these will add weight to other clues about location and nationality. In others, they will conflict with the other evidence, suggesting that the author is using a computer in a foreign country or is living there.

Email is usually the richest source of this type of clue. Here you want to look at the headers Content-Transfer-Encoding and Content-Type. These occur in the main block of mail headers or in each block of a multipart message. Here is a simple example:

    Content-Transfer-Encoding: quoted-printable
    Content-Type: text/html; charset="iso-8859-1"

The Content-Type header is the more important of the two, but it helps to know a little about content encoding first.

The original specification for email was only set up to handle the 128 characters of the ASCII character set, which can be encoded in 7 bits. That was fine for basic messages in English or other languages that used this basic character set. But for languages with even a few special characters, such as a German umlaut or French accented characters, the specification was too rigid. The solution was to encode each byte outside that range as an equals sign (=) followed by two hexadecimal digits, all drawn from the 7-bit alphabet. The equals sign tells email software where one of these special codes starts. This type of encoding is called quoted-printable and is very widely used in email today. Mail readers handle it transparently, but it makes things difficult for anyone looking through the source of a message, as this example of some encoded Korean text illustrates:

    <TITLE>=C3=BB=B1=B9=C8=AF<br>*** =C0=CC=B9=CC=C1=F6=BA=B8=B1=
    =E2=B8=A6=B4=AD=B7=AF=C1=D6=BC=BC=BF=E4 ***<br>***=C0=CC=
    =B9=CC=C1=F6=BA=B8=B1=E2=B8=A6=B4=AD=B7=AF=C1=D6=BC=BC=BF=E4=
     ***<br></TITLE>

The Perl script shown in Example 9-1 converts quoted-printable text into plain ASCII, as far as it can. Decoded characters that are not in the ASCII character set will not display correctly, so use the script with caution. Even so, it helps you read an otherwise indecipherable message. This type of encoding can be used as a form of obfuscation, such as those described in Chapter 4, but in most cases it is used legitimately for handling international characters.

Example 9-1. convert_quoted_printable.pl

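#!/usr/bin/perl -w
# Decode quoted-printable text from a mail file, or from standard input
use strict;

if(@ARGV > 1) {
   die "Usage: $0 <mail file>\n";
} elsif(@ARGV == 0) {
   $ARGV[0] = '-';    # no file given: read from standard input
}
my $lastline = '';
open INPUT, "< $ARGV[0]" or die "$0: Unable to open file $ARGV[0]\n";
while(<INPUT>) {
    my $line = $_;
    # A trailing '=' is a soft line break: the text continues on the next line.
    # Remove it before decoding so that a decoded '=3D' is not mistaken for one.
    my $soft_break = ($line =~ s/=[ \t]*\r?\n?\z//);
    # Replace each =XX code with the byte it represents
    $line =~ s/=([0-9A-Fa-f]{2})/chr hex $1/ge;
    if($soft_break) {
       $lastline .= $line;
    } else {
       print $lastline . $line;
       $lastline = '';
    }
}
# Print anything left over if the input ended on a soft line break
print $lastline if $lastline ne '';
close INPUT;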
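
Example 9-1 stops at the raw bytes. To see the text itself, those bytes also have to be interpreted in the right character set. The following is a minimal sketch of that extra step, written in Python rather than Perl and using its standard quopri module. The helper name decode_qp_text is mine, and the choice of euc_kr for the sample is an assumption, since the message fragment shown earlier does not say which Korean character set it uses.

```python
import quopri

def decode_qp_text(encoded: bytes, charset: str) -> str:
    """Decode quoted-printable bytes, then interpret them in the given charset."""
    raw = quopri.decodestring(encoded)   # each =XX code becomes a single byte
    # errors="replace" substitutes U+FFFD for bytes that are invalid in the charset
    return raw.decode(charset, errors="replace")

# Fragment taken from the encoded Korean sample; euc_kr is an assumed charset
sample = b"=C0=CC=B9=CC=C1=F6"
text = decode_qp_text(sample, "euc_kr")
print(text)
```

If the wrong character set is supplied, the result will be either replacement characters or plausible-looking nonsense, so it is worth trying the likely candidates in turn.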

A more recent alternative to quoted-printable, the 8bit transfer encoding, uses all 8 bits in a byte to carry information. It is more compact and lets you read the source of an email message in the native alphabet, given an appropriate mail client.

Content encoding lets you transfer international characters via the ASCII-based mail protocol. But if you are to get your message across, you need to tell the receiving mail client what language those codes represent. That is defined in the Character Set, also known as the Code Page. All this machinery falls under the term Internationalization. Be very grateful that other people have figured out how to do this, so you don’t have to! Fortunately, we are only concerned with character sets, although that is complicated enough.

A character set is basically a lookup table that maps each numeric code to a character. Modern mail clients come with a collection of these sets and the ability to display them. A mail message that wants to display German characters, for example, will encode those characters and include a mail header that specifies which character set should be used.
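
The lookup table idea is easy to see by pushing one byte value through several character sets. This short Python sketch is only an illustration; the three sets were picked by me because they appear elsewhere in this section.

```python
# One byte value interpreted through three different lookup tables
byte = b"\xe4"
for charset in ("iso-8859-1", "windows-1251", "iso-8859-5"):
    char = byte.decode(charset)
    print(f"{charset:>12}: {char} (U+{ord(char):04X})")
```

The same byte comes out as a Latin letter with an umlaut in iso-8859-1 and as two different Cyrillic letters in the other two sets, which is why a message is unreadable if the wrong set is applied.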

There are a lot of character sets, often several for a single language or alphabet. You can learn more about them from the online tutorial at http://www.w3.org/International/tutorials/tutorial-char-enc/, and see a list of them at http://www.iana.org/assignments/character-sets.

The character set that should be used with a specific email message is defined in the charset attribute of the Content-Type header.

    Content-Type: text/html; charset="iso-8859-1"
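
Pulling these headers out of a message by hand soon becomes tedious, especially with multipart mail. Here is a minimal Python sketch that uses the standard email package to list the content type, character set, and transfer encoding of every part. The function name charset_report and the sample message are hypothetical, made up for this illustration.

```python
from email import message_from_string

def charset_report(raw_message: str):
    """Return (content type, charset, transfer encoding) for each message part."""
    msg = message_from_string(raw_message)
    report = []
    for part in msg.walk():              # the message itself, then any subparts
        if part.is_multipart():
            continue                     # containers carry no text of their own
        report.append((
            part.get_content_type(),
            part.get_content_charset(),  # lowercased charset, or None if absent
            part.get("Content-Transfer-Encoding"),
        ))
    return report

# Hypothetical message built around the two headers shown earlier
example = (
    "Subject: test\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    'Content-Type: text/html; charset="iso-8859-1"\n'
    "\n"
    "<html><body>caf=E9</body></html>\n"
)
for row in charset_report(example):
    print(row)
```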

Probably the most common character set used is iso-8859-1, also known as Latin-1, which covers the Latin alphabet. It includes all the characters in the English alphabet and most of those needed to represent the majority of Western European languages, as well as Swahili and Afrikaans. More interesting from the forensics perspective are the sets for other alphabets such as Cyrillic, Arabic, Hebrew, Korean, and so on.

If your language does not use the Latin alphabet, then you will most likely set your operating system to use the appropriate character set. When you send an email message, that set is named in the Content-Type header. Most character sets can represent the core English alphabet in addition to their own characters, so you often find English text labeled with one of these alternative character sets. By looking for that mismatch, you may be able to identify the native language of the author. This pair of headers, taken from a phishing email, is a good example:

    Content-Transfer-Encoding: Base64
    Content-Type: text/html; charset="windows-1251"
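
A check for this kind of mismatch, plain English text labeled with a non-Latin character set, can be automated. The Python sketch below is a rough illustration only. The function name, the short list of "unrevealing" character sets, and the sample message are all my own assumptions, and it handles only single-part messages.

```python
import base64
from email import message_from_string

# Charsets that say little about the sender's language; an illustrative, incomplete list
UNREVEALING_CHARSETS = {"us-ascii", "iso-8859-1", "iso-8859-15", "windows-1252", "utf-8"}

def ascii_only_in_foreign_charset(raw_message: str) -> bool:
    """True if a single-part message names a charset outside the list above
    but its decoded body contains nothing beyond 7-bit ASCII."""
    msg = message_from_string(raw_message)
    charset = msg.get_content_charset()
    if charset is None or charset in UNREVEALING_CHARSETS:
        return False
    body = msg.get_payload(decode=True)   # undoes Base64 or quoted-printable
    return bool(body) and all(b < 128 for b in body)

# Hypothetical message that reuses the two headers shown above
html = b"<html><body>Dear customer, please verify your account.</body></html>"
phish = (
    "Content-Transfer-Encoding: Base64\n"
    'Content-Type: text/html; charset="windows-1251"\n'
    "\n"
    + base64.b64encode(html).decode("ascii") + "\n"
)
print(ascii_only_in_foreign_charset(phish))
```

A positive result is a hint and no more than that, since plenty of legitimate senders who use a non-Latin system setup also write in English.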

Decoded from its Base64 representation, the content was a message in English that appeared to come from a bank. The character set is defined as windows-1251. Microsoft has defined a number of its own character sets; this one happens to be used for Cyrillic alphabets. That is a strong indication that the author speaks one of the Slavic languages, such as Russian, Bulgarian, or Ukrainian.

Software used to create web pages will also define the appropriate character set, typically as a meta tag. In these three examples, the first defines a Cyrillic character set and the other two define variants of the Korean character set.

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-5">
    <meta http-equiv="Content-Type" content="text/html; charset=ks_c_5601-1987">
    <meta http-equiv="Content-Type" content="text/html; charset=euc_kr">
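
When you have a pile of saved web pages to examine, the character set declarations can be pulled out with a regular expression. This Python sketch is deliberately loose, because real pages contain stray spaces, odd capitalization, and missing quotes. The function name meta_charsets and the sample page are hypothetical.

```python
import re

# charset=... inside a tag; tolerates quotes and stray spaces around the name
CHARSET_RE = re.compile(r"""charset\s*=\s*["']?\s*([\w.:-]+)""", re.IGNORECASE)

def meta_charsets(html: str):
    """Return the lowercased charset names declared in <meta> tags."""
    names = []
    for tag in re.findall(r"<meta\b[^>]*>", html, re.IGNORECASE):
        match = CHARSET_RE.search(tag)
        if match:
            names.append(match.group(1).lower())
    return names

# Hypothetical page containing one of the declarations shown above
page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=euc_kr"></head></html>')
print(meta_charsets(page))
```

A regular expression is a shortcut here, not a real HTML parser, so treat an empty result as "nothing found by this pattern" and look at the page source yourself.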

There are so many character sets in use that I can’t list them all here. Running a Google search on the name is probably the easiest way to find out more about a specific set.

Interestingly, some of the unusual sets that I have encountered turn out to be bogus. iso-4238-5, iso-7981-6, iso-2426-6, and iso-9110-9 do not match any character set in any list that I can find and produce no hits with Google. They all occurred in spam emails, so they may have been placed there as a way to avoid spam filters. However, they may do the spammer more harm than good in that they could serve as distinctive signature strings for this source of spam.
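
A quick first test for a suspicious name is to ask a programming language whether it recognizes it. The Python sketch below does this with the standard codecs module. Bear in mind that Python's codec registry is not the IANA list, so a name that Python does not know is only a hint that it may be bogus. The function name known_to_python is my own.

```python
import codecs

def known_to_python(charset: str) -> bool:
    """True if Python's codec registry recognizes the charset name."""
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False

for name in ("iso-8859-1", "windows-1251", "iso-4238-5", "iso-9110-9"):
    print(name, known_to_python(name))
```

Names that fail this test are worth checking against the IANA list before you conclude that they were invented.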

If you have looked at much spam, you will be familiar with the poor English in many of these messages. This alone may indicate a source outside the main English-speaking countries, but trying to infer any more detail than this is effectively impossible. However, if you are able to access the source code of a script, then you may get lucky. Assuming that no one else will look at their code, programmers may be tempted to use variable names and strings in their native language. Example 5-4 illustrates this with the use of the word “parola” in place of “password,” suggesting that the author is Italian. The U.K. Honeynet Project identified a Romanian connection to a script they uncovered, based on the variable names $mesaj and $muie (http://www.honeynet.org/papers/phishing/). Such examples are rare but are very rewarding when you find them.