Basic Authentication

The most common form of online authentication is basic authentication. Basic authentication is a dialogue between the webserver and the browsing agent in which login credentials are requested and processed, as shown in Figure 20-1.

Web pages subject to basic authentication exist in what’s called a realm. Generally, a realm covers all web pages in the current server directory as well as the web pages in its subdirectories. Fortunately, browsers shield people from many of the details shown in Figure 20-1. Once you authenticate yourself with a browser, it appears that you don’t re-authenticate yourself when accessing other pages within the realm. In reality, the dialogue from Figure 20-1 happens for each page downloaded within the realm; your browser automatically resubmits your authentication credentials without asking you again for your username and password. A webbot gets no such help. When accessing a basic authenticated website with a webbot, you will need to send your login credentials every time the webbot requests a page within the authenticated realm, as shown later in the example script.

Figure 20-1. Basic authentication dialogue
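To make the dialogue more concrete: a server protecting a realm typically answers an unauthenticated request with a 401 status and a WWW-Authenticate: Basic header naming the realm, and the client repeats the request with an Authorization header. The sketch below builds that header by hand. The function name basic_auth_header() is hypothetical and used only for illustration; the credentials are the ones the book supplies for its test area.

```php
<?php
# Build the Authorization header a browsing agent sends for basic authentication.
# basic_auth_header() is a hypothetical helper; base64_encode() is built into PHP.
function basic_auth_header($username, $password)
{
    // The credentials are joined with a colon and Base64 encoded;
    // nothing is hashed or encrypted
    return "Authorization: Basic " . base64_encode($username . ":" . $password);
}

# Credentials for the book's basic authentication test area
$header = basic_auth_header("webbot", "sp1der3");
echo $header . "\n";
```

Because Base64 is an encoding rather than a cipher, anyone who sees this header can recover the username and password with base64_decode().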

Before you write an auto-authenticating webbot, you should first visit the target website and manually authenticate yourself into the site with a browser. This way you can validate your login credentials and learn about the target site before you design your webbot. When you request a web page from the book’s basic authentication test area, your browser will initially present a login form for entering usernames and passwords, as shown in Figure 20-2.

Figure 20-2. Basic authentication login form

After entering your username and password, you will gain access to a simple set of practice pages (shown in Figure 20-3) for testing auto-authenticating webbots and basic authentication. You should familiarize yourself with these simple pages before reading further.

Figure 20-3. Basic authentication test pages

The commands required to download a web page with basic authentication are very similar to those required to download a page without authentication. The only change is that you need to configure the CURLOPT_USERPWD option to pass the login credentials to PHP/CURL. The format for login credentials is the username and password separated by a colon, as shown in Example 20-1.

Example 20-1. The minimal code required to access the basic authentication test pages

<?php
# Define target page
$target = "http://www.WebbotsSpidersScreenScrapers.com/basic_authentication/index.php";

# Define login credentials for this page
$credentials = "webbot:sp1der3";

# Create the PHP/CURL session
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $target);             // Define target site
curl_setopt($ch, CURLOPT_USERPWD, $credentials);    // Send credentials
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);     // Return page in string

# Echo page
$page = curl_exec($ch);                             // Place web page into a string
echo $page;                                         // Echo downloaded page

# Close the PHP/CURL session
curl_close($ch);
?>
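Since a webbot must present its credentials for every page in the realm, it can be convenient to configure one PHP/CURL session and reuse it. The following is a minimal sketch of that idea, not part of the book's library: fetch_realm_pages() is a hypothetical helper, and it also records the HTTP status code so the webbot can tell a rejected login (401) from a successful download.

```php
<?php
# Hypothetical helper: download several pages from one realm,
# presenting the same credentials with every request
function fetch_realm_pages($urls, $credentials)
{
    $results = array();
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERPWD, $credentials);    // Credentials stay set on this session
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);     // Return page in string

    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);            // Define next target page
        $page = curl_exec($ch);                         // FALSE if the transfer failed
        $results[$url] = array(
            "status" => curl_getinfo($ch, CURLINFO_HTTP_CODE),  // 401 means the login was rejected
            "page"   => $page
        );
    }
    return $results;
}

# Usage (requires network access):
# $target = "http://www.WebbotsSpidersScreenScrapers.com/basic_authentication/index.php";
# $pages  = fetch_realm_pages(array($target), "webbot:sp1der3");
# echo $pages[$target]["status"];
```

Checking the status code is a design choice worth keeping even in small webbots, because a page returned with a 401 status still contains HTML and can otherwise be mistaken for real content.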

Once the favored form of authentication, basic authentication is losing out to other techniques because it is weaker. For example, with basic authentication, there is no way to log out without closing your browser. There is also no way to change the appearance of the authentication form because the browser creates it. Basic authentication is also not very secure: the browser sends the login credentials to the server with only Base64 encoding, which is trivially reversible and therefore equivalent to cleartext unless the connection itself is encrypted.

Digest authentication is an improvement over basic authentication. Unlike basic authentication, digest authentication never sends the password itself. Instead, it sends a 128-bit MD5 hash computed from the username, the realm, the password, and a one-time value supplied by the server. (MD5 is a hashing function, not encryption.) Unfortunately, support for digest authentication is spotty, especially with older browsers.
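To illustrate what a digest looks like, the sketch below computes the response value a client would send under digest authentication as described in RFC 2617 (with qop set to auth). The function name digest_response() is hypothetical, and the sample values come from the worked example in that RFC, not from the book's test site, which uses basic authentication.

```php
<?php
# Hypothetical helper: compute the digest authentication "response" value
# (RFC 2617, qop="auth"); the password is hashed and never sent
function digest_response($username, $realm, $password, $method, $uri,
                         $nonce, $nc, $cnonce, $qop = "auth")
{
    $ha1 = md5($username . ":" . $realm . ":" . $password);   // Hash of the secret
    $ha2 = md5($method . ":" . $uri);                         // Hash of the request
    // Combine both hashes with the server's nonce and the client's counters
    return md5($ha1 . ":" . $nonce . ":" . $nc . ":" . $cnonce . ":" . $qop . ":" . $ha2);
}

# Sample values from the RFC 2617 example (assumed here for illustration only)
$response = digest_response("Mufasa", "testrealm@host.com", "Circle Of Life",
                            "GET", "/dir/index.html",
                            "dcd98b7102dd2f0e8b11d0f600bfb0c093", "00000001", "0a4f113b");
echo $response . "\n";       // 32 hexadecimal characters, or 128 bits
```

A webbot rarely needs to do this arithmetic itself. With PHP/CURL, setting CURLOPT_HTTPAUTH to CURLAUTH_DIGEST alongside CURLOPT_USERPWD should let the library carry out the exchange.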