When a webbot can read email, it’s easier for it to communicate with the outside world.[44] Webbots capable of reading email can take instructions via email commands, share data with portable devices such as iPads and Chromebooks, and filter messages for content.
For example, if package-tracking information is sent to an email account that a webbot can access, the webbot can parse incoming email from the carrier to track delivery status. Such a webbot could also send email warnings when shipments are late, communicate shipping charges to your corporate accounting software, or create reports that analyze a company’s use of overnight shipping.
Of the many protocols for reading email from mail servers, I selected Post Office Protocol 3 (POP3) for this task because of its simplicity and near-universal support among mail servers. POP3 instructions are also easy to perform in any Telnet or standard TCP/IP terminal program.[45] Executing POP3 commands by hand in Telnet is a good way to learn them before we convert them into PHP routines that any webbot may execute.
Example 14-1 shows how to connect to a POP3 mail server through a Telnet client. Simply enter telnet, followed by the mail server name and the port number (110 is the standard port for POP3). The mail server should reply with a message similar to the one in Example 14-1.
Example 14-1. Making a Telnet connection to a POP3 mail server
    telnet mail.server.net 110
    +OK <9238.1142228@mail2.server.net>
The reply shown in Example 14-1 says that you’ve made a connection to the POP3 mail server and that it is waiting for its next command, which should be your attempt to log in. Example 14-2 shows the process for logging in to a POP3 mail server.
Example 14-2. Successful authentication to a POP3 mail server
    user me@server.com
    +OK
    pass xxxxxxxx
    +OK
When you try this, be sure to substitute your email account in place of me@server.com and the password associated with your account for xxxxxxxx.
If authentication fails, the mail server should return an authentication failure message, as shown in Example 14-3.
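The login exchange can also be sketched in PHP. This is a minimal sketch rather than the routines developed later: the function names pop3_send(), pop3_is_ok(), and pop3_login() are hypothetical, and the code assumes $stream is an already-open connection whose greeting line has been read. It relies on the POP3 convention that successful replies begin with +OK and failed ones begin with -ERR.

```php
<?php
// Hypothetical helper: send one POP3 command and return the server's
// single-line reply with the trailing CRLF removed.
function pop3_send($stream, string $command): string
{
    fwrite($stream, $command . "\r\n");   // POP3 commands end with CRLF
    $reply = fgets($stream);              // read one status line
    return $reply === false ? '' : rtrim($reply, "\r\n");
}

// Hypothetical helper: true when a reply starts with the +OK indicator.
function pop3_is_ok(string $reply): bool
{
    return strncmp($reply, '+OK', 3) === 0;
}

// Log in with USER and PASS; returns false as soon as the server answers
// with anything other than +OK (for example, -ERR after a bad password).
function pop3_login($stream, string $user, string $pass): bool
{
    if (!pop3_is_ok(pop3_send($stream, 'USER ' . $user))) {
        return false;
    }
    return pop3_is_ok(pop3_send($stream, 'PASS ' . $pass));
}

// Usage sketch (needs a reachable server, so it is left commented out):
// $stream = fsockopen('mail.server.net', 110, $errno, $errstr, 30);
// fgets($stream);                        // discard the +OK greeting
// $logged_in = pop3_login($stream, 'me@server.com', 'xxxxxxxx');
```

Keeping the +OK test in its own function means every later command can reuse the same check.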
Before you download email messages from a POP3 mail server, you’ll want to execute a LIST command. The mail server’s response tells you how many messages are on the server. The LIST command will also reveal the size of the email messages and, more importantly, how to reference individual email messages on the server.
The response to the LIST command contains a line for every available message for the specified account. Each line consists of a sequential mail ID number, followed by the size of the message in bytes. Example 14-4 shows the results of a LIST command on an account with two pieces of email.
The server’s reply to the LIST command tells us that there are two messages on the server for the specified account. We can also tell that message 1 is the larger message, at 2,398 bytes, and that message 2 is 2,023 bytes in length. Beyond that, we don’t know anything specific about any of these messages.
The last line in the response is the end-of-message indicator. Servers terminate multi-line POP3 responses, such as the replies to LIST and RETR, with a line containing only a period.
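A webbot needs the LIST reply as data rather than as text. The following is a minimal sketch of one way to do that, assuming the whole reply is already in a string; parse_list_response() is a hypothetical name, and the sample reply is built from the two message sizes discussed above, with the wording after +OK assumed because it varies by server.

```php
<?php
// Hypothetical helper: turn a raw LIST reply into [mail ID => size in bytes].
function parse_list_response(string $response): array
{
    $messages = [];
    // Split on CRLF (or bare LF) so each array element is one reply line.
    $lines = preg_split('/\r?\n/', trim($response));
    array_shift($lines);                  // drop the "+OK ..." status line
    foreach ($lines as $line) {
        if ($line === '.') {              // a lone period ends the listing
            break;
        }
        // Each remaining line is "<mail ID> <size in bytes>".
        if (preg_match('/^(\d+) (\d+)/', $line, $m)) {
            $messages[(int)$m[1]] = (int)$m[2];
        }
    }
    return $messages;
}

// Sample reply; the text after +OK is an assumption.
$sample = "+OK 2 messages\r\n1 2398\r\n2 2023\r\n.\r\n";
$messages = parse_list_response($sample);
```

The array keys are the mail IDs that later commands use to reference individual messages.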
To read a specific message, enter RETR followed by a space and the mail ID received from the LIST command. The command in Example 14-5 requests message 1.
The mail server should respond to the RETR command with a string of characters resembling the contents of Example 14-6.
Example 14-6. A raw email message read from the server using the RETR POP3 command
    +OK 2398 octets
    Return-Path: <returnpath@server.com>
    Delivered-To: me@server.com
    Received: (qmail 73301 invoked from network); 19 Feb 2011 20:55:31 -0000
    Received: from mail2.server.net by mail1.server.net (qmail-ldap-1.03) with compressed QMQP; 19 Feb 2006 20:55:31 -0000
    Delivered-To: CLUSTERHOST mail2.server.net me@server.com
    Received: (qmail 50923 invoked from network); 19 Feb 2011 20:55:31 -0000
    Received: by simscan 1.1.0 ppid: 50907, pid: 50912, t: 2.8647s scanners: attach: 1.1.0 clamav: 0.86.1/m:34/d:1107 spam: 3.0.4
    Received: from web30515.mail.mud.server.com (envelope-sender <sender@server.com>) by mail2.server.net (qmail-ldap-1.03) with SMTP for <me@server.com>; 19 Feb 2011 20:55:28 -0000
    Received: (qmail 7734 invoked by uid 60001); 19 Feb 2011 20:55:26 -0000
    Message-ID: <20060219205526.7732.qmail@web30515.mail.mud.server.com>
    Date: Sun, 19 Feb 2011 12:55:26 -0800 (PST)
    From: mike schrenk <sender@server.com>
    Subject: Hey, Can you read this email?
    To: mike schrenk <me@server.com>
    MIME-Version: 1.0
    Content-Type: multipart/alternative; boundary="0-349883719-1140382526=:7581"
    Content-Transfer-Encoding: 8bit
    X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail2.server.com
    X-Spam-Level:
    X-Spam-Status: No, score=0.9 required=17.0 tests=HTML_00_10,HTML_MESSAGE, HTML_SHORT_LENGTH autolearn=no version=3.0.4

    --0-349883719-1140382526=:7581
    Content-Type: text/plain; charset=iso-8859-1
    Content-Transfer-Encoding: 8bit

    This is an email sent from my Yahoo! email account.

    --0-349883719-1140382526=:7581
    Content-Type: text/html; charset=iso-8859-1
    Content-Transfer-Encoding: 8bit

    This is an email sent from my Yahoo! email account.<br><BR><BR
    --0-349883719-1140382526=:7581--
    .
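The raw reply in Example 14-6 ends with a line containing only a period, so a webbot has to keep reading until it sees that line. Here is a minimal sketch of such a reader, assuming $stream is an open connection and the +OK status line has already been read; pop3_read_multiline() is a hypothetical name. It also undoes the POP3 rule that a server prepends an extra period to any message line that itself begins with a period.

```php
<?php
// Hypothetical helper: read a multi-line POP3 reply (such as the answer to
// RETR) from an open stream, stopping at the line that holds only a period.
function pop3_read_multiline($stream): string
{
    $body = '';
    while (($line = fgets($stream)) !== false) {
        if (rtrim($line, "\r\n") === '.') {   // terminator: end of reply
            break;
        }
        // The server prepends a period to any message line that begins
        // with one, so it cannot be mistaken for the terminator. Strip it.
        if (strncmp($line, '.', 1) === 0) {
            $line = substr($line, 1);
        }
        $body .= $line;
    }
    return $body;
}
```

Reading line by line avoids guessing the message length from the octet count in the status line.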
As you can see, even a short email message has a lot of overhead. Most of the returned information has little to do with the actual text of a message. For example, the email message retrieved in Example 14-6 doesn’t appear until over halfway down the listing. The rest of the text returned by the mail server consists of headers, which tell the mail client the path the message took, which services touched it (like SpamAssassin), how to display or handle the message, to whom to send replies, and so forth.
These headers include some familiar information such as the subject header, the to and from values, and the MIME version. You can easily parse this information with the return_between() function found in the LIB_parse library (see Chapter 4), as shown in Example 14-7.
Example 14-7. Parsing header values
    $ret_path = return_between($raw_message, "Return-Path: ", "\n", EXCL );
    $deliver_to = return_between($raw_message, "Delivered-To: ", "\n", EXCL );
    $date = return_between($raw_message, "Date: ", "\n", EXCL );
    $from = return_between($raw_message, "From: ", "\n", EXCL );
    $subject = return_between($raw_message, "Subject: ", "\n", EXCL );
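Each header value in Example 14-6 sits between its header name and a \n (newline) character. Note that the search string must include the colon (:) and the space that follow the header name, as these words may appear elsewhere in the raw message returned from the mail server. Because mail servers end each line with a carriage return and a line feed, a parsed value may retain a trailing carriage return, which trim() removes.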
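If LIB_parse isn’t at hand, the same lookup can be sketched with PHP’s built-in regular expression functions. This is an alternative technique, not the book’s return_between() approach: get_header_value() is a hypothetical name, and the sketch assumes each header fits on one line. Anchoring the name to the start of a line means that searching for To does not match Delivered-To.

```php
<?php
// Hypothetical helper: return the value of the first header line that starts
// with "$name: ", or null when no such line exists.
function get_header_value(string $raw_message, string $name): ?string
{
    // ^ with the m flag anchors the name to the start of a line, so the same
    // word inside another header's name or value is ignored.
    $pattern = '/^' . preg_quote($name, '/') . ': (.*)$/mi';
    if (preg_match($pattern, $raw_message, $match)) {
        return rtrim($match[1], "\r");    // drop the CR left by a CRLF ending
    }
    return null;
}

// A few header lines taken from Example 14-6.
$raw = "Return-Path: <returnpath@server.com>\r\n"
     . "Delivered-To: me@server.com\r\n"
     . "From: mike schrenk <sender@server.com>\r\n"
     . "Subject: Hey, Can you read this email?\r\n"
     . "To: mike schrenk <me@server.com>\r\n";

$subject = get_header_value($raw, 'Subject');
$to      = get_header_value($raw, 'To');
```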
Parsing the actual message is more involved, as shown in Example 14-8.
Example 14-8. Parsing the actual message from a raw POP3 response
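    // The top-level Content-Type header declares the MIME boundary
    $content_type = return_between($raw_message, "Content-Type: ", "\n", EXCL);
    $boundary = get_attribute($content_type, "boundary");
    // The first message part lies between two boundary markers, each prefixed with --
    $raw_msg = return_between($raw_message, "--".$boundary, "--".$boundary, EXCL );
    // A blank line (CR LF CR LF) separates the part's headers from its text
    $msg_separator = chr(13).chr(10).chr(13).chr(10);
    $clean_msg = return_between($raw_msg, $msg_separator, $msg_separator, EXCL );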
When parsing the message, you must first identify the Content-Type, which holds the boundaries describing where the message is found. The Content-Type is further parsed with the get_attribute() function, to obtain the actual boundary value.[46] Finally, the text defined within the boundaries may contain additional information that tells the client how to display the content of the message. This information, if it exists, is removed by parsing only what’s within the message separator, a combination of carriage returns and line feeds.
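The same steps (find the boundary, isolate the first part, skip that part’s own headers) can be sketched with plain PHP string functions. This is an illustration under stated assumptions, not the LIB_parse method: first_mime_part() is a hypothetical name, the boundary is assumed to be quoted in the Content-Type header, and lines are assumed to end with a carriage return and line feed.

```php
<?php
// Hypothetical helper: return the text of the first MIME part of a
// multipart message, or null when no boundary can be found.
function first_mime_part(string $raw_message): ?string
{
    // The boundary is declared as boundary="..." in the Content-Type header.
    if (!preg_match('/boundary="([^"]+)"/i', $raw_message, $m)) {
        return null;
    }
    $delimiter = '--' . $m[1];            // boundaries in the body carry a -- prefix

    // Element 0 is everything before the first delimiter (the headers);
    // element 1 is the first part, with its own headers on top.
    $sections = explode($delimiter, $raw_message);
    if (count($sections) < 2) {
        return null;
    }

    // A blank line (CRLF CRLF) separates the part's headers from its text.
    $pieces = explode("\r\n\r\n", $sections[1], 2);
    return isset($pieces[1]) ? trim($pieces[1]) : null;
}

// A shortened version of the message in Example 14-6.
$raw = "MIME-Version: 1.0\r\n"
     . "Content-Type: multipart/alternative; boundary=\"0-349883719-1140382526=:7581\"\r\n"
     . "\r\n"
     . "--0-349883719-1140382526=:7581\r\n"
     . "Content-Type: text/plain; charset=iso-8859-1\r\n"
     . "Content-Transfer-Encoding: 8bit\r\n"
     . "\r\n"
     . "This is an email sent from my Yahoo! email account.\r\n"
     . "--0-349883719-1140382526=:7581--\r\n";

$clean_msg = first_mime_part($raw);
```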
The DELE command (followed by the mail ID) marks a message for deletion, and the QUIT command ends the session. Example 14-9 shows demonstrations of both the DELE and QUIT commands.
When you use DELE, the deleted message is only marked for deletion and not actually deleted. The deletion doesn’t occur until you execute a QUIT command and your server session ends.
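The mark-then-commit behavior of DELE and QUIT can be sketched as a single PHP routine. This is a minimal sketch under assumptions: pop3_delete_and_quit() is a hypothetical name, and $stream is assumed to be an open, authenticated connection.

```php
<?php
// Hypothetical helper: mark each mail ID for deletion, then send QUIT so the
// server commits the deletions. Returns true only if every reply was +OK.
function pop3_delete_and_quit($stream, array $mail_ids): bool
{
    $commands = [];
    foreach ($mail_ids as $id) {
        $commands[] = 'DELE ' . (int)$id; // marks the message; nothing is removed yet
    }
    $commands[] = 'QUIT';                 // deletions take effect when the session ends

    $all_ok = true;
    foreach ($commands as $command) {
        fwrite($stream, $command . "\r\n");
        $reply = fgets($stream);          // one status line per command
        if ($reply === false || strncmp($reply, '+OK', 3) !== 0) {
            $all_ok = false;
        }
    }
    return $all_ok;
}
```

A webbot that should leave messages on the server would simply send QUIT without any DELE commands.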
[44] See Chapter 15 to learn how to send email with webbots and spiders.
[45] Telnet clients are available for Windows, Mac OS X, Linux, and Unix distributions, though on some recent versions the client must be installed or enabled first.
[46] The actual boundary, which defines the message, is prefixed with -- characters to distinguish the actual boundary from where it is defined.