Chapter 14. Webbots That Read Email

When a webbot can read email, it’s easier for it to communicate with the outside world.^[44] Webbots capable of reading email can take instruction via email commands, share data with portable devices such as iPads and Chromebooks, and filter messages for content.

For example, if package-tracking information is sent to an email account that a webbot can access, the webbot can parse incoming email from the carrier to track delivery status. Such a webbot could also send email warnings when shipments are late, communicate shipping charges to your corporate accounting software, or create reports that analyze a company’s use of overnight shipping.

The POP3 Protocol

Of the many protocols for reading email from mail servers, I selected Post Office Protocol 3 (POP3) for this task because of its simplicity and near-universal support among mail servers. POP3 instructions are also easy to perform in any Telnet or standard TCP/IP terminal program.^[45] Executing POP3 commands by hand in Telnet will give you an understanding of the protocol, and we will later convert those commands into PHP routines that any webbot may execute.

Logging into a POP3 Mail Server

Example 14-1 shows how to connect to a POP3 mail server through a Telnet client. Simply enter telnet, followed by the mail server name and the port number (110 is the standard port for unencrypted POP3). The mail server should reply with a message similar to the one in Example 14-1.

Example 14-1. Making a Telnet connection to a POP3 mail server

telnet mail.server.net 110
+OK <9238.1142228@mail2.server.net>

The reply shown in Example 14-1 says that you’ve made a connection to the POP3 mail server and that it is waiting for its next command, which should be your attempt to log in. Example 14-2 shows the process for logging in to a POP3 mail server.
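A webbot can make the same connection without Telnet. The following is a minimal sketch, assuming PHP's built-in fsockopen() function and plain (unencrypted) POP3 on port 110. The function names pop3_is_ok() and pop3_connect() are hypothetical, chosen only for illustration.

```php
<?php
// Sketch only: these function names are hypothetical, not part of any library.

/**
 * Returns true when a POP3 status line starts with the +OK indicator.
 */
function pop3_is_ok($line)
{
    return strncmp($line, "+OK", 3) === 0;
}

/**
 * Opens a TCP connection to a POP3 server and returns the stream,
 * or false if the connection or the greeting fails.
 */
function pop3_connect($host, $port = 110, $timeout = 10)
{
    $stream = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($stream === false) {
        return false;
    }
    // The server speaks first: read its one-line greeting.
    $greeting = fgets($stream, 1024);
    if ($greeting === false || !pop3_is_ok($greeting)) {
        fclose($stream);
        return false;
    }
    return $stream;
}
```

A call such as pop3_connect("mail.server.net") would return a stream that is ready for the login commands, provided the server answers with a +OK greeting.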

Example 14-2. Successful authentication to a POP3 mail server

user me@server.com
+OK
pass xxxxxxxx
+OK

When you try this, be sure to substitute your email account in place of me@server.com and the password associated with your account for xxxxxxxx.

If authentication fails, the mail server should return an authentication failure message, as shown in Example 14-3.

Example 14-3. POP3 authentication failure

-ERR authorization failed
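The login exchange can be scripted in the same way. This sketch assumes $stream is an already-open connection to the mail server (for example, one returned by fsockopen()). The names pop3_command() and pop3_login() are hypothetical.

```php
<?php
// Sketch only: pop3_command() and pop3_login() are hypothetical names.

/**
 * Sends one command line and returns the server's one-line status reply
 * with the trailing CRLF removed, or false if the connection is lost.
 */
function pop3_command($stream, $command)
{
    fwrite($stream, $command . "\r\n");
    $reply = fgets($stream, 1024);
    return $reply === false ? false : rtrim($reply, "\r\n");
}

/**
 * Sends USER and PASS in turn; returns true only if both are accepted.
 */
function pop3_login($stream, $user, $pass)
{
    foreach (array("USER " . $user, "PASS " . $pass) as $command) {
        $reply = pop3_command($stream, $command);
        if ($reply === false || strncmp($reply, "+OK", 3) !== 0) {
            return false;   // for example, "-ERR authorization failed"
        }
    }
    return true;
}
```

Stopping at the first rejected command means the password is never sent if the server refuses the user name.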

Reading Mail from a POP3 Mail Server

Before you can download email messages from a POP3 mail server, you’ll need to execute a LIST command. The mail server will then respond with the number of messages on the server.

The POP3 LIST Command

The LIST command will also reveal the size of the email messages and, more importantly, how to reference individual email messages on the server.

The response to the LIST command contains a line for every available message for the specified account. Each line consists of a sequential mail ID number, followed by the size of the message in bytes. Example 14-4 shows the results of a LIST command on an account with two pieces of email.

Example 14-4. Results of a POP3 LIST command

LIST
+OK
1 2398
2 2023
.

The server’s reply to the LIST command tells us that there are two messages on the server for the specified account. We can also tell that message 1 is the larger message, at 2,398 bytes, and that message 2 is 2,023 bytes in length. Beyond that, we don’t know anything specific about any of these messages.
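A webbot needs the same information in a form it can loop over. As a rough sketch, the hypothetical helper parse_list_response() below turns the text of a LIST reply into an array keyed by mail ID. It assumes the whole reply has already been read into a string.

```php
<?php
// Sketch only: parse_list_response() is a hypothetical helper.

/**
 * Converts the text of a LIST response into an array of
 * mail ID => size in bytes. Returns false on an -ERR reply.
 */
function parse_list_response($response)
{
    $lines = preg_split('/\r\n|\n/', trim($response));
    $status = array_shift($lines);
    if (strncmp($status, "+OK", 3) !== 0) {
        return false;
    }
    $messages = array();
    foreach ($lines as $line) {
        if ($line === ".") {
            break;                      // end of the multi-line response
        }
        if (preg_match('/^(\d+) (\d+)$/', $line, $m)) {
            $messages[(int) $m[1]] = (int) $m[2];
        }
    }
    return $messages;
}

// The reply from Example 14-4 becomes array(1 => 2398, 2 => 2023).
$list = parse_list_response("+OK\r\n1 2398\r\n2 2023\r\n.\r\n");
```

The array keys are the mail IDs to use with later commands, and count($list) gives the number of messages waiting.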

The last line in the response is the end-of-message indicator. Servers terminate multiline POP3 responses, such as the replies to LIST and RETR, with a line containing only a period; single-line replies such as +OK have no such terminator.
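When reading from a script, the terminating period tells the webbot when to stop. The sketch below assumes $stream is an open connection whose +OK status line has already been read. The name pop3_read_multiline() is hypothetical. The POP3 specification (RFC 1939) also has servers prepend an extra period to any content line that begins with a period, so the sketch removes it.

```php
<?php
// Sketch only: pop3_read_multiline() is a hypothetical helper.

/**
 * Reads the lines of a multi-line POP3 response from an open stream
 * until the terminating line that holds only a period.
 */
function pop3_read_multiline($stream)
{
    $body = "";
    while (($line = fgets($stream)) !== false) {
        if (rtrim($line, "\r\n") === ".") {
            break;                       // terminator reached
        }
        if ($line[0] === ".") {
            $line = substr($line, 1);    // drop the period the server prepended
        }
        $body .= $line;
    }
    return $body;
}
```

Checking for the terminator before removing a leading period keeps a content line that merely starts with a period from ending the read early.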

The POP3 RETR Command

To read a specific message, enter RETR followed by a space and the mail ID received from the LIST command. The command in Example 14-5 requests message 1.

Example 14-5. Requesting a message from the server

RETR 1

The mail server should respond to the RETR command with a string of characters resembling the contents of Example 14-6.

Example 14-6. A raw email message read from the server using the RETR POP3 command

+OK 2398 octets
Return-Path: <returnpath@server.com>
Delivered-To: me@server.com
Received: (qmail 73301 invoked from network); 19 Feb 2011 20:55:31 -0000
Received: from mail2.server.net
          by mail1.server.net (qmail-ldap-1.03) with compressed QMQP; 19 Feb 2006 20:55:31 -0000
Delivered-To: CLUSTERHOST mail2.server.net me@server.com
Received: (qmail 50923 invoked from network); 19 Feb 2011 20:55:31 -0000
Received: by simscan 1.1.0 ppid: 50907, pid: 50912, t: 2.8647s
         scanners: attach: 1.1.0 clamav: 0.86.1/m:34/d:1107 spam: 3.0.4
Received: from web30515.mail.mud.server.com
          (envelope-sender <sender@server.com>)
          by mail2.server.net (qmail-ldap-1.03) with SMTP
          for <me@server.com>; 19 Feb 2011 20:55:28 -0000
Received: (qmail 7734 invoked by uid 60001); 19 Feb 2011 20:55:26 -0000
Message-ID: <20060219205526.7732.qmail@web30515.mail.mud.server.com>
Date: Sun, 19 Feb 2011 12:55:26 -0800 (PST)
From: mike schrenk <sender@server.com>
Subject: Hey, Can you read this email?
To: mike schrenk <me@server.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="0-349883719-1140382526=:7581"
Content-Transfer-Encoding: 8bit
X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail2.server.com
X-Spam-Level:
X-Spam-Status: No, score=0.9 required=17.0 tests=HTML_00_10,HTML_MESSAGE,
        HTML_SHORT_LENGTH autolearn=no version=3.0.4

--0-349883719-1140382526=:7581
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit

This is an email sent from my Yahoo! email account.
--0-349883719-1140382526=:7581
Content-Type: text/html; charset=iso-8859-1
Content-Transfer-Encoding: 8bit

This is an email sent from my Yahoo! email account.<br><BR><BR>
--0-349883719-1140382526=:7581--
.

As you can see, even a short email message has a lot of overhead. Most of the returned information has little to do with the actual text of a message. For example, the email message retrieved in Example 14-6 doesn’t appear until over halfway down the listing. The rest of the text returned by the mail server consists of headers, which tell the mail client the path the message took, which services touched it (like SpamAssassin), how to display or handle the message, to whom to send replies, and so forth.
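The top-level headers and the rest of the message are separated by the first blank line, which gives a webbot a simple place to cut. This is a minimal sketch that assumes the +OK status line and the terminating period have already been removed. The name split_raw_message() is hypothetical.

```php
<?php
// Sketch only: split_raw_message() is a hypothetical helper.

/**
 * Splits a raw message at the first blank line into
 * array(header block, body). The body is "" if no blank line exists.
 */
function split_raw_message($raw_message)
{
    $parts = preg_split('/\r?\n\r?\n/', $raw_message, 2);
    return array($parts[0], isset($parts[1]) ? $parts[1] : "");
}
```

Limiting the split to two pieces keeps any later blank lines inside the body intact.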

These headers include some familiar information such as the subject header, the to and from values, and the MIME version. You can easily parse this information with the return_between() function found in the LIB_parse library (see Chapter 4), as shown in Example 14-7.

Example 14-7. Parsing header values

$ret_path = return_between($raw_message, "Return-Path: ", "\n", EXCL );
$deliver_to = return_between($raw_message, "Delivered-To: ", "\n", EXCL );
$date = return_between($raw_message, "Date: ", "\n", EXCL );
$from = return_between($raw_message, "From: ", "\n", EXCL );
$subject = return_between($raw_message, "Subject: ", "\n", EXCL );

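In the raw message shown in Example 14-6, each header value sits between its header name and a \n (line feed) character. Note that the header name must be followed by a colon (:) and a space, as these words may appear elsewhere in the raw message returned from the mail server. Because mail servers end each line with a carriage return and a line feed (\r\n), a value parsed this way may still carry a trailing \r, which trim() removes.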
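Some headers, such as the Received lines in Example 14-6, continue onto indented lines. The sketch below is one way to handle those cases without LIB_parse. It uses a regular expression instead of return_between(), and get_header_value() is a hypothetical stand-in rather than a library function.

```php
<?php
// Sketch only: get_header_value() is a hypothetical stand-in, not part of LIB_parse.

/**
 * Returns the value of the first header with the given name, or null.
 * Continuation lines (those starting with a space or tab) are joined
 * to the line before them, and the trailing \r is dropped.
 */
function get_header_value($header_block, $name)
{
    // Unfold: a line break followed by whitespace continues the previous header.
    $unfolded = preg_replace('/\r?\n[ \t]+/', ' ', $header_block);
    // Anchor the name to the start of a line so "To" cannot match "Delivered-To".
    $pattern = '/^' . preg_quote($name, '/') . ':[ \t]*(.*?)\r?$/mi';
    if (preg_match($pattern, $unfolded, $match)) {
        return $match[1];
    }
    return null;
}
```

Anchoring the header name to the start of a line serves the same purpose as including the colon and space in the search string: it avoids matching the same word elsewhere in the message.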

Parsing the actual message is more involved, as shown in Example 14-8.

Example 14-8. Parsing the actual message from a raw POP3 response

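// Find the Content-Type header, which declares the multipart boundary
$content_type = return_between($raw_message, "Content-Type: ", "\n", EXCL);
$boundary = get_attribute($content_type, "boundary");
// The first message part lies between the first two boundary delimiters
$raw_msg = return_between($raw_message, "--".$boundary, "--".$boundary, EXCL );
// The message separator is two carriage return/line feed pairs
$msg_separator = chr(13).chr(10).chr(13).chr(10);
$clean_msg = return_between($raw_msg, $msg_separator, $msg_separator, EXCL );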

When parsing the message, you must first identify the Content-Type, which holds the boundaries describing where the message is found. The Content-Type is further parsed with the get_attribute() function, to obtain the actual boundary value.^[46] Finally, the text defined within the boundaries may contain additional information that tells the client how to display the content of the message. This information, if it exists, is removed by parsing only what’s within the message separator, a combination of carriage returns and line feeds.
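The same idea can be sketched with PHP's built-in string functions alone. The hypothetical helper get_text_part() below is an alternative illustration, not the LIB_parse approach. It assumes a multipart message whose parts each begin with their own headers followed by a blank line, and it returns only the first text/plain part.

```php
<?php
// Sketch only: get_text_part() is a hypothetical helper.

/**
 * Returns the body of the first text/plain part of a multipart message,
 * or null if no boundary or no text/plain part is found.
 */
function get_text_part($raw_message)
{
    if (!preg_match('/boundary="?([^"\r\n;]+)"?/i', $raw_message, $m)) {
        return null;
    }
    // In the body, each delimiter line is the boundary prefixed with "--".
    $sections = explode("--" . $m[1], $raw_message);
    array_shift($sections);                 // drop everything before the first delimiter
    foreach ($sections as $section) {
        // A part's own headers end at the first blank line.
        $pieces = preg_split('/\r?\n\r?\n/', $section, 2);
        if (count($pieces) === 2 && stripos($pieces[0], "text/plain") !== false) {
            return rtrim($pieces[1], "\r\n");
        }
    }
    return null;
}
```

Applied to the message in Example 14-6, this approach would be expected to return the single sentence of the plain-text part and skip the HTML copy.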

Other Useful POP3 Commands

The DELE command (followed by the mail ID) marks a message for deletion, and the QUIT command ends the session. Example 14-9 demonstrates both commands.

Example 14-9. Using the POP3 DELE and QUIT commands

DELE 8
+OK
QUIT
+OK

When you use DELE, the deleted message is only marked for deletion and not actually deleted. The deletion doesn’t occur until you execute a QUIT command and your server session ends.

Note

If you’ve accidentally marked a message with the DELE command and wish to retain it when you quit, enter RSET. RSET takes no message number; it removes the deletion mark from every message marked during the session, so nothing is deleted when you issue the QUIT command (retention is the default condition).
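A webbot can combine these commands so that a failed deletion leaves the mailbox untouched. The following sketch assumes $stream is an authenticated POP3 connection. The names pop3_send() and pop3_delete_messages() are hypothetical. Per the POP3 specification (RFC 1939), RSET is sent without an argument and clears every deletion mark made during the session.

```php
<?php
// Sketch only: pop3_send() and pop3_delete_messages() are hypothetical names.

/** Sends one command and returns true if the reply starts with +OK. */
function pop3_send($stream, $command)
{
    fwrite($stream, $command . "\r\n");
    $reply = fgets($stream, 1024);
    return $reply !== false && strncmp($reply, "+OK", 3) === 0;
}

/**
 * Marks each mail ID for deletion. If any DELE fails, RSET clears every
 * mark so that QUIT removes nothing. Returns true if all marks succeeded.
 */
function pop3_delete_messages($stream, array $mail_ids)
{
    $all_marked = true;
    foreach ($mail_ids as $id) {
        if (!pop3_send($stream, "DELE " . (int) $id)) {
            $all_marked = false;
            break;
        }
    }
    if (!$all_marked) {
        pop3_send($stream, "RSET");   // no argument; unmarks everything
    }
    pop3_send($stream, "QUIT");       // marked messages are deleted as the session ends
    return $all_marked;
}
```

Treating the deletions as all-or-nothing is a design choice made for this sketch; a webbot that prefers to delete whatever it can would simply skip the RSET step.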

^[44] See Chapter 15 to learn how to send email with webbots and spiders.

^[45] Telnet clients are available for Windows, Mac OS X, Linux, and Unix, although some newer releases don’t install or enable one by default.

^[46] The actual boundary, which defines the message, is prefixed with -- characters to distinguish the actual boundary from where it is defined.