The vast majority of the scams that you might want to investigate are initiated by an email message. So it is only natural that these messages are a major target for forensic analysis. In this chapter, I will show you how to dissect message headers and distinguish between the real and forged information contained therein. I will show how you go about tracking back spam to its source and the approaches that spammers use to make that as difficult as possible. Then I will move on to the contents of email messages and show how you can safely extract attachments that may contain viruses or spyware.
The content of an email message is what first gets our attention but, in terms of forensics, the header block is the most interesting. Every message contains a series of header lines that instruct mail servers where to deliver it, tell mail readers how to process its content, and provide a record of the path taken by the message from its source to its destination. One reference on headers is RFC 2076 (Common Internet Message Headers), which can be found at http://rfc.net/rfc2076.html, but, as you will see, there is considerable variation in their format.
The fundamental flaw with email is that certain headers can be forged. This is what allows spam and all the other scams to flourish, even in the face of sophisticated filters and detection software. In looking at messages that are of interest to you, you need to understand what header information can be forged and what you can rely on. Let’s start by looking at the headers for a simple, legitimate message. The following is an email sent from my machine to a Gmail account at Google. I have deleted a few of the Gmail-specific headers and modified the addresses to protect privacy.
Delivered-To: XYZ@gmail.com
Return-Path: <ABC@craic.com>
Received: by 10.54.18.32 with SMTP id 32cs2945wrr;
        Fri, 25 Feb 2005 15:27:07 -0800 (PST)
Received: by 10.54.7.40 with SMTP id 40mr65062wrg;
        Fri, 25 Feb 2005 15:27:05 -0800 (PST)
Received: from gateway.craic.com (gateway.craic.com [208.12.16.5])
        by mx.gmail.com with ESMTP id 9si124319wrl.2005.02.25.15.26.58;
        Fri, 25 Feb 2005 15:27:04 -0800 (PST)
Received: from [192.168.2.7] (nexus.craic.com [208.12.16.2])
        by gateway.craic.com (8.11.6/8.11.6) with ESMTP id j1PNQvl31568
        for <XYZ@gmail.com>; Fri, 25 Feb 2005 15:26:58 -0800
Message-ID: <421FB441.8030406@craic.com>
Date: Fri, 25 Feb 2005 15:26:57 -0800
From: ABC <ABC@craic.com>
User-Agent: Mozilla Thunderbird 0.9 (X11/20041103)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: XYZ@gmail.com
Subject: Test
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

This is a test
These headers are usually hidden in common email clients, but you can reveal them easily enough—for example, by selecting View → Message Source in Mozilla Thunderbird.
Message headers fall into five classes. The basic addressing information is contained in the From and To lines, and information about the content is contained in the Subject line and those that begin with Content. The path taken from the sender through to delivery is recorded in the Received lines, and the unique identity of this message is captured in the Date and Message-ID lines. Ancillary information that might be useful for the email client is usually found in headers that begin with X-. The specific headers can vary widely according to the email client that was used to create the messages.
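To make the five classes concrete, here is a minimal sketch that sorts the headers of an abbreviated copy of the example message into those groups. It assumes Python and its standard email module; the classify function and its class labels are my own illustrative names, not part of any standard, and headers that fit none of the five classes are simply labeled "other".

```python
from email import message_from_string

# Abbreviated copy of the example message from this chapter.
RAW = """\
Delivered-To: XYZ@gmail.com
Return-Path: <ABC@craic.com>
Received: from gateway.craic.com (gateway.craic.com [208.12.16.5])
        by mx.gmail.com with ESMTP id 9si124319wrl.2005.02.25.15.26.58;
        Fri, 25 Feb 2005 15:27:04 -0800 (PST)
Message-ID: <421FB441.8030406@craic.com>
Date: Fri, 25 Feb 2005 15:26:57 -0800
From: ABC <ABC@craic.com>
User-Agent: Mozilla Thunderbird 0.9 (X11/20041103)
X-Accept-Language: en-us, en
To: XYZ@gmail.com
Subject: Test
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

This is a test
"""


def classify(name):
    """Map a header name to one of the five classes (hypothetical labels)."""
    lower = name.lower()
    if lower in ("from", "to"):
        return "addressing"
    if lower == "subject" or lower.startswith("content-"):
        return "content"
    if lower == "received":
        return "path"
    if lower in ("date", "message-id"):
        return "identity"
    if lower.startswith("x-"):
        return "ancillary"
    return "other"  # e.g. Delivered-To, Return-Path, User-Agent


msg = message_from_string(RAW)

# Group header names by class, preserving the order in which they appear.
classes = {}
for name, _value in msg.items():
    classes.setdefault(classify(name), []).append(name)

for cls, names in classes.items():
    print(f"{cls}: {', '.join(names)}")
```

Real messages will contain many headers that land in "other", which is in keeping with the variation between mail clients.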
Looking at this example, you see that ABC@craic.com has sent a simple test message to XYZ@gmail.com. From the User-Agent header, you know that user ABC sent the message from the Mozilla Thunderbird email client.
The most interesting headers are the Received headers. In a legitimate email, each one of these represents a step taken by the message between two mail servers, or between a mail client and a server. With each additional step taken, a new header is added to the top of the message. By looking at these headers, you should be able to trace the complete path taken by a message from its source to its destination and vice versa.
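Because each new Received header is added at the top, reading them from the bottom up gives the hops in the order they happened. The following is a rough sketch of that idea, assuming Python's standard email module and using the Received lines from the example message; the "by" pattern is a simplification of my own and will not cope with every format you meet.

```python
import re
from email import message_from_string

# Received headers from the example message, in the order they appear.
RAW = """\
Received: by 10.54.18.32 with SMTP id 32cs2945wrr;
        Fri, 25 Feb 2005 15:27:07 -0800 (PST)
Received: by 10.54.7.40 with SMTP id 40mr65062wrg;
        Fri, 25 Feb 2005 15:27:05 -0800 (PST)
Received: from gateway.craic.com (gateway.craic.com [208.12.16.5])
        by mx.gmail.com with ESMTP id 9si124319wrl.2005.02.25.15.26.58;
        Fri, 25 Feb 2005 15:27:04 -0800 (PST)
Received: from [192.168.2.7] (nexus.craic.com [208.12.16.2])
        by gateway.craic.com (8.11.6/8.11.6) with ESMTP id j1PNQvl31568
        for <XYZ@gmail.com>; Fri, 25 Feb 2005 15:26:58 -0800
Subject: Test

This is a test
"""

msg = message_from_string(RAW)
received = msg.get_all("Received", [])

# Walk the headers from the bottom (first hop) to the top (last hop)
# and record the host that claims to have received the message.
hops = []
for value in reversed(received):
    flat = " ".join(value.split())  # undo header line folding
    match = re.search(r"(?:^|\s)by\s+(\S+)", flat)
    hops.append(match.group(1) if match else None)

print(" -> ".join(str(hop) for hop in hops))
```

Keep in mind that this only reports what the headers claim, which is trustworthy for a legitimate message but not for a forged one.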
Servers in this context are called Mail Transfer Agents (MTAs), and the majority of these communicate through either the Simple Mail Transfer Protocol (SMTP) or the Extended Simple Mail Transfer Protocol (ESMTP). In spite of Internet standards, the format used for Received headers is variable. In most cases, it takes this form:
Received: from string (hostname [host IP address]) by recipient host with protocol id message ID for recipient; timestamp

string
    This is typically the hostname of the sending MTA, but it can be anything.
hostname
    The hostname of the MTA if it can be determined by reverse DNS lookup on the IP address.
host IP address
    The IP address of the sending MTA.
recipient host
    This is typically the hostname of the receiving MTA. It is sometimes followed by the version of the MTA software running on that host.
protocol
    The mail transfer protocol that was used for the transfer, such as SMTP.
message ID
    A unique identifier for this transfer that can be searched for in the log files on the recipient MTA.
recipient
    The email address of the recipient.
timestamp
    The date and time at which the message was received by the MTA.
Note the use of parentheses and square brackets around the sending MTA. This will help distinguish truth from fiction when you look at forged headers.
Look at this example:
Received: from biotech.craic.com (biotech.craic.com [208.12.16.3]) by gateway.craic.com (8.11.6/8.11.6) with ESMTP id j21IBV720506 for <XYZ@craic.com>; Tue, 1 Mar 2005 10:11:31 -0800
The numeric IP address in the square brackets defines the sending MTA, and a reverse DNS lookup by the receiving MTA has identified this machine as biotech.craic.com. That hostname is repeated in the string that precedes the parentheses. The message has been received by gateway.craic.com. There is no need for the IP address, since that MTA implicitly knows its own hostname. The version of the MTA software used is included here. The protocol used is ESMTP and the unique ID that follows should also appear in the log files on that server. The format of these IDs is arbitrary. This header includes the intended recipient for this message, although many headers do not. Finally, there is a timestamp that tells when the message was transferred, including the time difference from Greenwich Mean Time (GMT), which in this case is minus eight hours because the server is located in Seattle.
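The breakdown above can be sketched as a regular expression that pulls the common form of a Received header apart into its fields. This is an illustration in Python rather than a complete parser: the pattern and the group names are my own, they assume the layout described in this section, and headers in other layouts will simply fail to match.

```python
import re

# Pattern for the common layout only:
# from STRING (HOSTNAME [IP]) by HOST (VERSION) with PROTOCOL id ID for RECIPIENT; TIMESTAMP
RECEIVED_RE = re.compile(
    r"from\s+(?P<string>\S+)\s+"
    r"\((?P<hostname>\S+)\s+\[(?P<ip>[^\]]+)\]\)\s+"
    r"by\s+(?P<recipient_host>\S+)"
    r"(?:\s+\((?P<mta_version>[^)]*)\))?\s+"
    r"with\s+(?P<protocol>\S+)\s+"
    r"id\s+(?P<message_id>[^\s;]+)"
    r"(?:\s+for\s+<?(?P<recipient>[^>;\s]+)>?)?"
    r"\s*;\s*(?P<timestamp>.+)"
)


def parse_received(value):
    """Return a dict of fields, or None if the header is in another format."""
    flat = " ".join(value.split())  # collapse folded lines into one
    match = RECEIVED_RE.match(flat)
    return match.groupdict() if match else None


EXAMPLE = (
    "from biotech.craic.com (biotech.craic.com [208.12.16.3]) "
    "by gateway.craic.com (8.11.6/8.11.6) with ESMTP id j21IBV720506 "
    "for <XYZ@craic.com>; Tue, 1 Mar 2005 10:11:31 -0800"
)

fields = parse_received(EXAMPLE)
for name, value in fields.items():
    print(f"{name:15} {value}")
```

The "for" clause and the software version are optional in the pattern because, as noted above, many headers omit them.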
The string that precedes the parentheses on the from line is a favorite target for forging and it is worth understanding where this comes from. An SMTP or ESMTP transfer is initiated when the sending MTA identifies itself to the receiver. It does so by sending the string HELO, or EHLO in the case of ESMTP, followed by an identifying string. This can be anything the sender chooses and is the string that appears in the Received header. If the source of the message is a Linux system, then the default value for this string is taken from that system’s hostname in the file /etc/hosts. Changing that value will forge the apparent source of a message from that system.
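Since the HELO string is chosen by the sender while the name in parentheses comes from the receiver's own reverse DNS lookup, comparing the two is a simple first check. Here is a minimal sketch in Python. The function name is my own, and the second header is a hypothetical example that I constructed using reserved documentation names and addresses; it is not taken from a real message.

```python
import re

# Captures the sender-supplied string and the receiver's reverse DNS result.
FROM_RE = re.compile(
    r"from\s+(?P<helo>\S+)\s+\((?P<rdns>\S+)\s+\[(?P<ip>[^\]]+)\]\)"
)


def helo_agrees(received_value):
    """True or False if the two names agree or differ; None if unparseable."""
    flat = " ".join(received_value.split())
    match = FROM_RE.match(flat)
    if match is None:
        return None
    return match.group("helo").lower() == match.group("rdns").lower()


# Header from this chapter: the two names agree.
GENUINE = (
    "from biotech.craic.com (biotech.craic.com [208.12.16.3]) "
    "by gateway.craic.com (8.11.6/8.11.6) with ESMTP id j21IBV720506 "
    "for <XYZ@craic.com>; Tue, 1 Mar 2005 10:11:31 -0800"
)

# HYPOTHETICAL header, invented for illustration only.
SUSPECT = (
    "from mail.bank.example (dsl-host.example.net [203.0.113.9]) "
    "by gateway.craic.com (8.11.6/8.11.6) with ESMTP id HYPOTHETICAL1; "
    "Tue, 1 Mar 2005 10:11:31 -0800"
)

print(helo_agrees(GENUINE))  # names match
print(helo_agrees(SUSPECT))  # names differ, worth a closer look
```

A mismatch is a reason to look more closely, not proof of forgery. Machines behind a firewall or with a misconfigured hostname will often announce a name that differs from their reverse DNS entry.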
Now you know how to read these headers, so you can retrace the steps taken by the example message, starting with the last Received header and working back to the first.
The message appears to be sent from nexus to gateway. This is only partly correct. nexus happens to be a firewall between an internal network and the Internet. So gateway sees nexus as the source even though the real origin is behind that firewall. In this instance, you can identify that machine from the preceding string [192.168.2.7], but that will not generally be the case. The message is transferred to mx.gmail.com, then to IP address 10.54.7.40, and finally to 10.54.18.32. You can tell that these two addresses are part of Google’s private network because those numbers fall within one of the ranges of IP addresses that are reserved for internal networks.
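The reserved ranges in question are the three private blocks defined in RFC 1918. As a small sketch, assuming Python's standard ipaddress module, you can test each address from the example headers against them. The helper name is my own.

```python
import ipaddress

# The three private address blocks defined in RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def private_block(address):
    """Return the RFC 1918 block containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for block in RFC1918_BLOCKS:
        if ip in block:
            return str(block)
    return None


# Addresses that appear in the example Received headers.
for address in ["10.54.18.32", "10.54.7.40", "192.168.2.7",
                "208.12.16.5", "208.12.16.2"]:
    block = private_block(address)
    label = f"private ({block})" if block else "public"
    print(f"{address:14} {label}")
```

A private address tells you the hop took place inside someone's internal network, but because anyone can use these ranges, it cannot by itself tell you whose network that was.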
Look at the time difference between the first and last header and see that it took nine seconds to deliver the message. Timestamps are extremely useful in assessing the performance of mail transfers, and a discrepancy in a series of them is often a clear indication that one or more headers have been forged. But timestamps are only as accurate as the clocks from which they derive. Keeping your system clocks synchronized using the Network Time Protocol (NTP) is strongly encouraged. You can find more information about this at http://www.ntp.org.
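The timestamp arithmetic can be checked with a short sketch, assuming Python's standard email.utils module. It takes the text after the final semicolon of each Received header from the example message, computes the elapsed time between the first and last hop, and confirms that the timestamps never run backward. The helper name is my own.

```python
import re
from email.utils import parsedate_to_datetime

# Received header values from the example message, top to bottom.
RECEIVED = [
    "by 10.54.18.32 with SMTP id 32cs2945wrr; "
    "Fri, 25 Feb 2005 15:27:07 -0800 (PST)",
    "by 10.54.7.40 with SMTP id 40mr65062wrg; "
    "Fri, 25 Feb 2005 15:27:05 -0800 (PST)",
    "from gateway.craic.com (gateway.craic.com [208.12.16.5]) "
    "by mx.gmail.com with ESMTP id 9si124319wrl.2005.02.25.15.26.58; "
    "Fri, 25 Feb 2005 15:27:04 -0800 (PST)",
    "from [192.168.2.7] (nexus.craic.com [208.12.16.2]) "
    "by gateway.craic.com (8.11.6/8.11.6) with ESMTP id j1PNQvl31568 "
    "for <XYZ@gmail.com>; Fri, 25 Feb 2005 15:26:58 -0800",
]


def hop_time(value):
    """Parse the timestamp that follows the last semicolon in the header."""
    stamp = value.rsplit(";", 1)[1]
    stamp = re.sub(r"\([^)]*\)", "", stamp).strip()  # drop "(PST)" comments
    return parsedate_to_datetime(stamp)


# Reverse the list so that the earliest hop comes first.
times = [hop_time(value) for value in reversed(RECEIVED)]

elapsed = times[-1] - times[0]
in_order = all(a <= b for a, b in zip(times, times[1:]))

print(f"elapsed seconds: {elapsed.total_seconds()}")
print(f"timestamps in order: {in_order}")
```

Small out-of-order differences between hops can come from unsynchronized clocks rather than forgery, so treat a failed ordering check as a prompt for closer inspection.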
There is one other header to which you need to pay special attention. As well as the unique ID assigned by each MTA along the delivery route, the message itself has a second ID that is carried with it throughout its passage. For example:
Message-ID: <421FB441.8030406@craic.com>
This Message-ID tag was assigned by the mail client used to create the message. These IDs allow you to search for a given message in the log files on multiple servers.
Take a look at some of the legitimate messages in your own Inbox and get a feel for the variation in headers and the steps that messages have to take to get from one place to another.