Chapter 14. Webbots That Read Email

When a webbot can read email, it’s easier for it to communicate with the outside world.[44] Webbots capable of reading email can take instruction via email commands, share data with portable devices such as iPads and Chromebooks, and filter messages for content.

For example, if package-tracking information is sent to an email account that a webbot can access, the webbot can parse incoming email from the carrier to track delivery status. Such a webbot could also send email warnings when shipments are late, communicate shipping charges to your corporate accounting software, or create reports that analyze a company’s use of overnight shipping.

Of the many protocols for reading email from mail servers, I selected Post Office Protocol 3 (POP3) for this task because of its simplicity and near-universal support among mail servers. POP3 instructions are also easy to perform in any Telnet or standard TCP/IP terminal program.[45] Executing POP3 commands by hand in Telnet is a good way to understand them before we convert them into PHP routines that any webbot may execute.

Example 14-1 shows how to connect to a POP3 mail server through a Telnet client. Simply enter telnet, followed by the mail server name and the port number (110 is the standard port for unencrypted POP3). The mail server should reply with a message similar to the one in Example 14-1.

The reply shown in Example 14-1 says that you’ve made a connection to the POP3 mail server and that it is waiting for its next command, which should be your attempt to log in. Example 14-2 shows the process for logging in to a POP3 mail server.

When you try this, be sure to substitute your own email account for the account name shown in the example, and the password associated with your account for xxxxxxxx.

If authentication fails, the mail server should return an authentication failure message, as shown in Example 14-3.
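
The login exchange described above can be sketched in a few lines. This is a minimal illustration written in Python rather than the PHP used for this book's libraries, and it rests on these assumptions: the function names is_ok() and pop3_login() are hypothetical, the connection is already open, and the reply text after +OK in the simulated session is invented. What it relies on from the protocol itself is only that POP3 uses the USER and PASS commands and that every reply begins with either +OK or -ERR.

```python
import io

def is_ok(reply_line: bytes) -> bool:
    """Return True for a positive (+OK) reply and False for a negative (-ERR) one."""
    text = reply_line.decode("ascii", errors="replace").strip()
    if text.startswith("+OK"):
        return True
    if text.startswith("-ERR"):
        return False
    raise ValueError(f"Not a POP3 status line: {text!r}")

def pop3_login(reader, writer, user: str, password: str) -> bool:
    """Send USER and PASS over an already-open connection.

    reader and writer are binary file-like objects (hypothetical parameters).
    Returns True only if the greeting and both commands are answered with +OK.
    """
    if not is_ok(reader.readline()):            # greeting sent on connect
        return False
    for command in (f"USER {user}", f"PASS {password}"):
        writer.write(command.encode("ascii") + b"\r\n")   # commands end in CRLF
        writer.flush()
        if not is_ok(reader.readline()):
            return False
    return True

# Simulated session: canned replies stand in for a real mail server.
fake_server = io.BytesIO(b"+OK POP3 ready\r\n+OK\r\n+OK logged in\r\n")
sent = io.BytesIO()
accepted = pop3_login(fake_server, sent, "me@server.com", "xxxxxxxx")
print(accepted, sent.getvalue())
```

Against a live server, the same function could be given the file object returned by socket.create_connection((host, 110)).makefile("rwb") for both reader and writer. Python's standard poplib module wraps these steps if you would rather not handle the protocol by hand.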

Before you download email messages from a POP3 mail server, you’ll need to execute a LIST command. The mail server will then respond with a status line followed by one line per message, each giving the message’s number and its size in octets.
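
A webbot needs the LIST reply in a form it can loop over. The following Python sketch shows one way to do that, assuming the reply has already been read into a string. The function name parse_list_reply() is hypothetical, and the sample reply follows the format given in the POP3 specification rather than output from a particular server.

```python
def parse_list_reply(reply: str) -> dict:
    """Map each message number to its size in octets."""
    lines = reply.splitlines()
    if not lines or not lines[0].startswith("+OK"):
        raise ValueError("LIST command failed")
    sizes = {}
    for line in lines[1:]:
        if line == ".":              # a lone dot ends a multi-line reply
            break
        if not line.strip():         # ignore stray blank lines
            continue
        number, size = line.split()[:2]
        sizes[int(number)] = int(size)
    return sizes

# Sample reply in the format described by the POP3 specification.
sample = "+OK 2 messages (320 octets)\r\n1 120\r\n2 200\r\n.\r\n"
mailbox = parse_list_reply(sample)
print(len(mailbox), "messages:", mailbox)
```

The keys of the returned dictionary are the message numbers to pass to RETR, and its length is the number of messages waiting on the server.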

To read a specific message, enter RETR followed by a space and the mail ID received from the LIST command. The command in Example 14-5 requests message 1.

The mail server should respond to the RETR command with a string of characters resembling the contents of Example 14-6.

Example 14-6. A raw email message read from the server using the RETR POP3 command

+OK 2398 octets
Return-Path: <returnpath@server.com>
Delivered-To: me@server.com
Received: (qmail 73301 invoked from network); 19 Feb 2011 20:55:31 -0000
Received: from mail2.server.net
          by mail1.server.net (qmail-ldap-1.03) with compressed QMQP;
          19 Feb 2006 20:55:31 -0000
Delivered-To: CLUSTERHOST mail2.server.net me@server.com
Received: (qmail 50923 invoked from network); 19 Feb 2011 20:55:31 -0000
Received: by simscan 1.1.0 ppid: 50907, pid: 50912, t: 2.8647s
         scanners: attach: 1.1.0 clamav: 0.86.1/m:34/d:1107 spam: 3.0.4
Received: from web30515.mail.mud.server.com
          (envelope-sender <sender@server.com>)
          by mail2.server.net (qmail-ldap-1.03) with SMTP
          for <me@server.com>; 19 Feb 2011 20:55:28 -0000
Received: (qmail 7734 invoked by uid 60001); 19 Feb 2011 20:55:26 -0000
Message-ID: <20060219205526.7732.qmail@web30515.mail.mud.server.com>
Date: Sun, 19 Feb 2011 12:55:26 -0800 (PST)
From: mike schrenk <sender@server.com>
Subject: Hey, Can you read this email?
To: mike schrenk <me@server.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="0-349883719-1140382526=:7581"
Content-Transfer-Encoding: 8bit
X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail2.server.com
X-Spam-Level:
X-Spam-Status: No, score=0.9 required=17.0 tests=HTML_00_10,HTML_MESSAGE,
        HTML_SHORT_LENGTH autolearn=no version=3.0.4

--0-349883719-1140382526=:7581
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit

This is an email sent from my Yahoo! email account.
--0-349883719-1140382526=:7581
Content-Type: text/html; charset=iso-8859-1
Content-Transfer-Encoding: 8bit

This is an email sent from my Yahoo! email account.<br><BR><BR
--0-349883719-1140382526=:7581--
.

As you can see, even a short email message has a lot of overhead. Most of the returned information has little to do with the actual text of a message. For example, the email message retrieved in Example 14-6 doesn’t appear until over halfway down the listing. The rest of the text returned by the mail server consists of headers, which tell the mail client the path the message took, which services touched it (like SpamAssassin), how to display or handle the message, to whom to send replies, and so forth.
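
Before parsing individual headers, it helps to separate the reply into its three pieces: the status line, the header block, and the body. The Python sketch below is one way to do this, under the assumption that the whole RETR reply is already in a string. The function name split_retr_reply() is hypothetical. It relies on two facts about the formats involved: a line containing only a period ends a multi-line POP3 reply (a leading period on any other line is an escape that must be removed), and a blank line separates an email's headers from its body.

```python
def split_retr_reply(raw: str):
    """Return (status_line, header_block, body) from a raw RETR reply."""
    lines = raw.replace("\r\n", "\n").split("\n")
    status, content = lines[0], []
    for line in lines[1:]:
        if line == ".":                 # lone dot: end of the reply
            break
        if line.startswith("."):        # undo the server's dot-stuffing
            line = line[1:]
        content.append(line)
    # The first blank line separates the headers from the message body.
    headers, _, body = "\n".join(content).partition("\n\n")
    return status, headers, body

# A shortened, made-up reply in the same shape as Example 14-6.
raw_reply = (
    "+OK 120 octets\r\n"
    "Subject: Hi\r\n"
    "To: me@server.com\r\n"
    "\r\n"
    "Line one\r\n"
    "..hidden\r\n"
    ".\r\n"
)
status, headers, body = split_retr_reply(raw_reply)
print(status)
print(headers)
print(body)
```

Working on the header block alone also reduces the risk of matching a header name that happens to appear in the text of the message.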

These headers include some familiar information such as the subject header, the to and from values, and the MIME version. You can easily parse this information with the return_between() function found in the LIB_parse library (see Chapter 4), as shown in Example 14-7.

Each header value in Example 14-6 lies between its header name and a \n (newline) character. Note that the header name must be followed by a colon (:) and a space, as these words may appear elsewhere in the raw message returned from the mail server.
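
The idea of reading a value between a header name and the end of its line can be sketched as follows. This is a Python stand-in, not the LIB_parse implementation: return_between_sketch() and get_header() are hypothetical names, and the stand-in takes only three arguments. One refinement is an assumption of this sketch rather than something the text above requires: the search is anchored to the start of a line, because searching for "To: " alone would first match the end of "Delivered-To: " in Example 14-6.

```python
def return_between_sketch(text: str, start: str, stop: str) -> str:
    """Return the text between the first `start` and the next `stop`."""
    begin = text.find(start)
    if begin == -1:
        return ""
    begin += len(start)
    end = text.find(stop, begin)
    return text[begin:] if end == -1 else text[begin:end]

def get_header(raw_message: str, name: str) -> str:
    """Return the value of the first header called `name`, or "" if absent."""
    # Prefixing "\n" lets the very first header be matched at a line start too.
    value = return_between_sketch("\n" + raw_message, "\n" + name + ": ", "\n")
    return value.strip()             # drops a trailing "\r", if any

# Header lines copied from Example 14-6.
raw = (
    "Delivered-To: me@server.com\r\n"
    "From: mike schrenk <sender@server.com>\r\n"
    "Subject: Hey, Can you read this email?\r\n"
    "To: mike schrenk <me@server.com>\r\n"
    "\r\n"
    "This is an email sent from my Yahoo! email account.\r\n"
)
print(get_header(raw, "Subject"))
print(get_header(raw, "To"))
```

This sketch reads only the first line of a header. A long header that continues on indented lines, such as X-Spam-Status in Example 14-6, would need the continuation lines joined first.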

Parsing the actual message is more involved, as shown in Example 14-8.

When parsing the message, you must first identify the Content-Type header, which defines the boundary string marking where each part of the message begins and ends. The Content-Type is further parsed with the get_attribute() function to obtain the actual boundary value.[46] Finally, the text defined within the boundaries may contain additional information that tells the client how to display the content of the message. This information, if it exists, is removed by parsing only what follows the message separator, a combination of carriage returns and line feeds.
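
The three steps just described (find the boundary, split the message on it, and discard each part's own headers) can be sketched in Python as shown below. The names get_boundary() and extract_text_part() are hypothetical and are not the LIB_parse functions. The sketch assumes a single level of multipart nesting and plain 8bit content, as in Example 14-6, whose lines are used as the sample.

```python
import re

def get_boundary(header_block: str) -> str:
    """Return the boundary value from a Content-Type header, or ""."""
    match = re.search(r'boundary="?([^";\r\n]+)"?', header_block, re.IGNORECASE)
    return match.group(1) if match else ""

def extract_text_part(message: str, wanted: str = "text/plain") -> str:
    """Return the body of the first part whose headers mention `wanted`."""
    message = message.replace("\r\n", "\n")
    headers, _, body = message.partition("\n\n")
    boundary = get_boundary(headers)
    if not boundary:
        return body.strip()                  # not a multipart message
    delimiter = "--" + boundary              # boundaries are prefixed with --
    for part in body.split(delimiter)[1:]:
        if part.startswith("--"):            # closing delimiter reached
            break
        # A blank line separates each part's headers from its content.
        part_headers, _, part_body = part.lstrip("\n").partition("\n\n")
        if wanted in part_headers.lower():
            return part_body.strip()
    return ""

# Sample assembled from lines of Example 14-6.
sample_message = (
    'Content-Type: multipart/alternative; boundary="0-349883719-1140382526=:7581"\r\n'
    "Content-Transfer-Encoding: 8bit\r\n"
    "\r\n"
    "--0-349883719-1140382526=:7581\r\n"
    "Content-Type: text/plain; charset=iso-8859-1\r\n"
    "Content-Transfer-Encoding: 8bit\r\n"
    "\r\n"
    "This is an email sent from my Yahoo! email account.\r\n"
    "--0-349883719-1140382526=:7581\r\n"
    "Content-Type: text/html; charset=iso-8859-1\r\n"
    "Content-Transfer-Encoding: 8bit\r\n"
    "\r\n"
    "This is an email sent from my Yahoo! email account.<br>\r\n"
    "--0-349883719-1140382526=:7581--\r\n"
)
plain_text = extract_text_part(sample_message)
print(plain_text)
```

Messages with nested parts, attachments, or base64 and quoted-printable encodings need more care than this. In Python, the standard email package handles those cases.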



[44] See Chapter 15 to learn how to send email with webbots and spiders.

[45] Telnet clients are available for Windows, Mac OS X, Linux, and Unix, although some recent operating system versions require you to enable or install the client first.

[46] The actual boundary, which defines the message, is prefixed with -- characters to distinguish the actual boundary from where it is defined.