Chapter 3. Email

The vast majority of the scams that you might want to investigate are initiated by an email message, so it is only natural that these messages are a major target for forensic analysis. In this chapter, I will show you how to dissect message headers and distinguish between the real and forged information contained therein. I will show you how to trace spam back to its source and describe the approaches that spammers use to make that as difficult as possible. Then I will move on to the contents of email messages and show how you can safely extract attachments that may contain viruses or spyware.

Message Headers

The content of an email message is what first gets our attention but, in terms of forensics, the header block is the most interesting. Every message contains a series of header lines that instruct mail servers where to deliver it, tell mail readers how to process its content, and provide a record of the path taken by the message from its source to its destination. One reference on headers is RFC 2076 (Common Internet Message Headers), which can be found at http://rfc.net/rfc2076.html, but, as you will see, there is considerable variation in their format.

The fundamental flaw with email is that certain headers can be forged. This is what allows spam and all the other scams to flourish, even in the face of sophisticated filters and detection software. In looking at messages that are of interest to you, you need to understand what header information can be forged and what you can rely on. Let’s start by looking at the headers for a simple, legitimate message. The following is an email sent from my machine to a Gmail account at Google. I have deleted a few of the Gmail-specific headers and modified the addresses to protect privacy.

    Delivered-To: XYZ@gmail.com
    Return-Path: <ABC@craic.com>
    Received: by 10.54.18.32 with SMTP id 32cs2945wrr;
            Fri, 25 Feb 2005 15:27:07 -0800 (PST)
    Received: by 10.54.7.40 with SMTP id 40mr65062wrg;
            Fri, 25 Feb 2005 15:27:05 -0800 (PST)
    Received: from gateway.craic.com
            (gateway.craic.com [208.12.16.5])
            by mx.gmail.com
            with ESMTP id 9si124319wrl.2005.02.25.15.26.58;
            Fri, 25 Feb 2005 15:27:04 -0800 (PST)
    Received: from [192.168.2.7] (nexus.craic.com [208.12.16.2])
            by gateway.craic.com (8.11.6/8.11.6)
            with ESMTP id j1PNQvl31568
            for <XYZ@gmail.com>;
            Fri, 25 Feb 2005 15:26:58 -0800
    Message-ID: <421FB441.8030406@craic.com>
    Date: Fri, 25 Feb 2005 15:26:57 -0800
    From: ABC <ABC@craic.com>
    User-Agent: Mozilla Thunderbird 0.9 (X11/20041103)
    X-Accept-Language: en-us, en
    MIME-Version: 1.0
    To: XYZ@gmail.com
    Subject: Test
    Content-Type: text/plain; charset=ISO-8859-1; format=flowed
    Content-Transfer-Encoding: 7bit

    This is a test

These headers are usually hidden in common email clients, but you can reveal them easily enough, for example by selecting View → Message Source in Mozilla Thunderbird.

Message headers fall into five classes. The basic addressing information is contained in the From and To lines, and information about the content is contained in the Subject line and those that begin with Content. The path taken from the sender through to delivery is recorded in the Received lines, and the unique identity of this message is captured in the Date and Message-ID lines. Ancillary information that might be useful for the email client is usually found in headers that begin with X-. The specific headers can vary widely according to the email client that was used to create the messages.
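The five classes can be sketched in a few lines of Python using the standard library's email parser. This is only an illustration: the function names classify and group_headers and the class labels are made up for this example, and the sample is a shortened copy of the test message shown earlier. Headers that fit none of the five classes are collected under "other".

```python
from email import message_from_string

# Shortened copy of the test message shown earlier in this chapter.
RAW = """\
Delivered-To: XYZ@gmail.com
Return-Path: <ABC@craic.com>
Received: from gateway.craic.com (gateway.craic.com [208.12.16.5])
        by mx.gmail.com with ESMTP id 9si124319wrl.2005.02.25.15.26.58;
        Fri, 25 Feb 2005 15:27:04 -0800 (PST)
Message-ID: <421FB441.8030406@craic.com>
Date: Fri, 25 Feb 2005 15:26:57 -0800
From: ABC <ABC@craic.com>
User-Agent: Mozilla Thunderbird 0.9 (X11/20041103)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: XYZ@gmail.com
Subject: Test
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

This is a test
"""


def classify(name):
    """Map a header name to one of the five classes (labels are illustrative)."""
    lower = name.lower()
    if lower in ("from", "to"):
        return "addressing"
    if lower == "subject" or lower.startswith("content"):
        return "content"
    if lower == "received":
        return "path"
    if lower in ("date", "message-id"):
        return "identity"
    if lower.startswith("x-"):
        return "ancillary"
    return "other"  # anything the five classes do not cover


def group_headers(raw):
    """Return a dict of class label -> header names, in message order."""
    msg = message_from_string(raw)
    groups = {}
    for name, _value in msg.items():
        groups.setdefault(classify(name), []).append(name)
    return groups


if __name__ == "__main__":
    for label, names in group_headers(RAW).items():
        print(label, names)
```

Run against your own messages, a grouping like this makes it easy to see which headers a particular client or server has added beyond the basic set.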

Looking at this example, you see that ABC@craic.com has sent a simple test message to XYZ@gmail.com. From the User-Agent header, you know that user ABC sent the message from the Mozilla Thunderbird email client.

The most interesting headers are the Received headers. In a legitimate email, each one of these represents a step taken by the message between two mail servers, or between a mail client and a server. Each server that handles the message adds a new header at the top, so the most recent step appears first and the oldest appears last. By looking at these headers, you should be able to trace the complete path taken by a message, either forward from its source or backward from its destination.

Servers in this context are called Mail Transfer Agents (MTAs), and the majority of these communicate through either the Simple Mail Transfer Protocol (SMTP) or the Extended Simple Mail Transfer Protocol (ESMTP). In spite of Internet standards, the format used for Received headers is variable. In most cases, it takes this form:

    Received: from string (hostname [host IP address])
              by recipient host
              with protocol id message ID
              for recipient;
              timestamp

string: This is typically the hostname of the sending MTA, but it can be anything.
hostname: The hostname of the MTA if it can be determined by reverse DNS lookup on the IP address.
host IP address: The IP address of the sending MTA.
recipient host: This is typically the hostname of the receiving MTA. It is sometimes followed by the version of the MTA software running on that host.
protocol: The mail transfer protocol that was used for the transfer, such as SMTP.
message ID: A unique identifier for this transfer that can be searched for in the log files on the recipient MTA.
recipient: The email address of the recipient.
timestamp: The date and time at which the message was received by the MTA.

Note the use of parentheses and square brackets around the sending MTA. This will help distinguish truth from fiction when you look at forged headers.

Look at this example:

    Received: from biotech.craic.com (biotech.craic.com [208.12.16.3])
            by gateway.craic.com (8.11.6/8.11.6)
            with ESMTP id j21IBV720506
            for <XYZ@craic.com>;
            Tue, 1 Mar 2005 10:11:31 -0800

The numeric IP address in the square brackets defines the sending MTA, and a reverse DNS lookup by the receiving MTA has identified this machine as biotech.craic.com. That hostname is repeated in the string that precedes the parentheses. The message has been received by gateway.craic.com. There is no need for the IP address, since that MTA implicitly knows its own hostname. The version of the MTA software used is included here. The protocol used is ESMTP and the unique ID that follows should also appear in the log files on that server. The format of these IDs is arbitrary. This header includes the intended recipient for this message, although many headers do not. Finally, there is a timestamp that tells when the message was transferred, including the time difference from Greenwich Mean Time (GMT), which in this case is minus eight hours because the server is located in Seattle.
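The fields just described can be pulled apart with a regular expression. The Python sketch below handles only the common form shown above; since the format of Received headers varies, treat the pattern as a starting point and not as a general parser. The names RECEIVED_RE and parse_received, and the field names in the result, are chosen for this example.

```python
import re

# Pattern for the common "from ... by ... with ... id ... [for ...]; timestamp"
# layout. Headers in other layouts simply will not match.
RECEIVED_RE = re.compile(
    r"from\s+(?P<helo>\S+)\s+"                                 # string given by sender
    r"\((?P<rdns>[^\s\[\]]+)?\s*\[(?P<ip>[^\]]+)\]\)\s+"       # (hostname [IP address])
    r"by\s+(?P<by>\S+)(?:\s+\((?P<mta_version>[^)]*)\))?\s+"   # receiving MTA
    r"with\s+(?P<protocol>\S+)\s+id\s+(?P<id>[^\s;]+)"         # protocol and transfer ID
    r"(?:\s+for\s+<?(?P<recipient>[^>;\s]+)>?)?\s*;\s*"        # optional recipient
    r"(?P<timestamp>.+)$"
)


def parse_received(value):
    """Return a dict of fields, or None if the header has a different layout."""
    flat = " ".join(value.split())  # undo header folding
    match = RECEIVED_RE.match(flat)
    return match.groupdict() if match else None


if __name__ == "__main__":
    example = """from biotech.craic.com (biotech.craic.com [208.12.16.3])
            by gateway.craic.com (8.11.6/8.11.6)
            with ESMTP id j21IBV720506
            for <XYZ@craic.com>;
            Tue, 1 Mar 2005 10:11:31 -0800"""
    for field, content in parse_received(example).items():
        print(field, "=", content)
```

A header such as the Gmail internal ones earlier in the chapter, which begin with "by" and have no "from" clause, returns None here. That is deliberate: it is safer to flag an unfamiliar layout for manual inspection than to guess at its fields.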

The string that precedes the parentheses on the from line is a favorite target for forging and it is worth understanding where this comes from. An SMTP or ESMTP transfer is initiated when the sending MTA identifies itself to the receiver. It does so by sending the string HELO, or EHLO in the case of ESMTP, followed by an identifying string. This can be anything the sender chooses and is the string that appears in the Received header. If the source of the message is a Linux system, then the default value for this string is taken from that system’s hostname in the file /etc/hosts. Changing that value will forge the apparent source of a message from that system.
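Since the HELO or EHLO string is chosen by the sender while the hostname in parentheses comes from a lookup by the receiver, comparing the two is a quick first check. The Python sketch below does this for headers in the common layout; the names FROM_RE and helo_matches_rdns are assumptions made for this example.

```python
import re

# Matches the "from string (hostname [IP address])" part of a Received header.
FROM_RE = re.compile(
    r"from\s+(?P<helo>\S+)\s+"
    r"\((?P<rdns>[^\s\[\]]+)?\s*\[(?P<ip>[^\]]+)\]\)"
)


def helo_matches_rdns(received_value):
    """Compare the sender-supplied string with the reverse DNS hostname.

    Returns True or False, or None when the header has no usable
    "from" clause or no hostname inside the parentheses.
    """
    flat = " ".join(received_value.split())
    match = FROM_RE.match(flat)
    if match is None or match.group("rdns") is None:
        return None
    return match.group("helo").lower() == match.group("rdns").lower()


if __name__ == "__main__":
    headers = [
        "from biotech.craic.com (biotech.craic.com [208.12.16.3]) by gateway.craic.com",
        "from [192.168.2.7] (nexus.craic.com [208.12.16.2]) by gateway.craic.com",
    ]
    for header in headers:
        print(helo_matches_rdns(header), header)
```

A mismatch is a prompt for closer inspection, not proof of forgery. The second header comes from the legitimate test message at the start of this chapter and still fails the comparison.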

Now you know how to read these headers, so you can retrace the steps taken by the example message, starting with the last Received header and working back to the first. The message appears to be sent from nexus to gateway. This is only partly correct. nexus happens to be a firewall between an internal network and the Internet. So gateway sees nexus as the source even though the real origin is behind that firewall. In this instance, you can identify that machine from the preceding string [192.168.2.7], but that will not generally be the case. The message is transferred to mx.gmail.com, then to IP address 10.54.7.40, and finally to 10.54.18.32. You can tell that these two addresses are part of Google’s private network because those numbers fall within one of the ranges of IP addresses that are reserved for internal networks.
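Checking whether an address falls in a reserved range does not have to be done by eye. Python's standard ipaddress module can do it, as in this short sketch using the addresses from the example message. The helper name is_internal is made up for this example. Note that the module's is_private test covers more than the ranges used for internal networks, including loopback and link-local addresses.

```python
import ipaddress


def is_internal(addr):
    """Return True if addr lies in a range reserved for private use."""
    return ipaddress.ip_address(addr).is_private


if __name__ == "__main__":
    # Addresses taken from the Received headers of the example message.
    for addr in ("192.168.2.7", "208.12.16.2", "208.12.16.5",
                 "10.54.7.40", "10.54.18.32"):
        print(addr, "private" if is_internal(addr) else "public")
```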

Look at the time difference between the first and last header and see that it took nine seconds to deliver the message. Timestamps are extremely useful in assessing the performance of mail transfers, and a discrepancy in a series of them is often a clear indication that one or more headers have been forged. But timestamps are only as accurate as the clocks from which they derive. Keeping your system clocks synchronized using the Network Time Protocol (NTP) is strongly encouraged. You can find more information about this at http://www.ntp.org.
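The nine-second figure can be checked by parsing the timestamps at the end of the oldest and newest Received headers. The following Python sketch uses the standard library's parsedate_to_datetime; the function names received_time and transit_seconds are assumptions made for this example, and the code assumes that the timestamp follows the last semicolon in the header.

```python
import re
from email.utils import parsedate_to_datetime


def received_time(header_value):
    """Parse the timestamp that follows the last semicolon of a Received header."""
    stamp = header_value.rsplit(";", 1)[-1]
    # Drop trailing comments such as "(PST)" before parsing.
    stamp = re.sub(r"\([^)]*\)", "", stamp).strip()
    return parsedate_to_datetime(stamp)


def transit_seconds(oldest, newest):
    """Seconds elapsed between two Received headers (negative if out of order)."""
    return (received_time(newest) - received_time(oldest)).total_seconds()


if __name__ == "__main__":
    oldest = ("from [192.168.2.7] (nexus.craic.com [208.12.16.2]) "
              "by gateway.craic.com (8.11.6/8.11.6) with ESMTP id j1PNQvl31568 "
              "for <XYZ@gmail.com>; Fri, 25 Feb 2005 15:26:58 -0800")
    newest = ("by 10.54.18.32 with SMTP id 32cs2945wrr; "
              "Fri, 25 Feb 2005 15:27:07 -0800 (PST)")
    print(transit_seconds(oldest, newest))
```

Because the parsed values carry their time zone offsets, the subtraction works even when the servers along the route are in different zones. A negative result between consecutive hops is the kind of discrepancy worth investigating, bearing in mind that clock drift can cause small ones on its own.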

There is one other header to which you need to pay special attention. As well as the unique ID assigned by each MTA along the delivery route, the message itself has a second ID that is carried with it throughout its passage. For example:

    Message-ID: <421FB441.8030406@craic.com>

This Message-ID tag was assigned by the mail client used to create the message. These IDs allow you to search for a given message in the log files on multiple servers.

Take a look at some of the legitimate messages in your own Inbox and get a feel for the variation in headers and the steps that messages have to take to get from one place to another.