Chapter 21. Advanced Cookie Management

In the previous chapter, you learned how to use cookies to authenticate webbots to access password-protected websites. This chapter further explores cookies and the challenges they present to webbot developers.

Cookies are small pieces of ASCII data that websites store on your computer. Without cookies, websites cannot distinguish between new visitors and those who visit on a daily basis. Cookies add persistence, the ability to identify people who have previously visited the site, to an otherwise stateless environment. Through the magic of cookies, web designers can write scripts that recognize people’s preferences, shipping addresses, login status, and other personal information.

There are two types of cookies. Temporary cookies are stored in RAM and expire when the client closes his or her browser; permanent cookies live on the client’s hard drive and exist until they reach their expiration date (which may be so far into the future that they’ll outlive the computer they’re on). For example, consider the script in Example 21-1, which writes one temporary cookie and one permanent cookie that expires in one hour.

Example 21-1 shows the cookies’ names, values, and expiration dates, as required. Figure 21-1 and Figure 21-2 show how the cookies written by the script in Example 21-1 appear in the privacy settings of a browser. Go ahead and load the URL, http://www.WebbotsSpidersScreenScrapers.com/Listing_21_1.php, and check the cookie status for yourself.
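The difference between the two cookie types comes down to whether an expiration date travels with the cookie. The following Python sketch is not the chapter's PHP listing; it is a minimal illustration, using only Python's standard library, of the headers a server might send for one temporary cookie and one permanent cookie that expires in one hour. The cookie names and values are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
from http.cookies import SimpleCookie


def build_cookies(now):
    """Return one temporary and one permanent cookie (hypothetical names)."""
    cookies = SimpleCookie()

    # Temporary cookie: no expiration attribute, so a browser keeps it
    # only until the browser is closed.
    cookies["temporary_cookie"] = "66"

    # Permanent cookie: carries an explicit expiration date, here one
    # hour after `now`, formatted as an HTTP date in GMT.
    cookies["permanent_cookie"] = "88"
    expires_at = now + timedelta(hours=1)
    cookies["permanent_cookie"]["expires"] = format_datetime(expires_at, usegmt=True)
    return cookies


# Print the headers as a server would send them.
for morsel in build_cookies(datetime.now(timezone.utc)).values():
    print(morsel.output(header="Set-Cookie:"))
```

Only the second header includes an `expires` attribute; that attribute is what tells the browser to write the cookie to disk rather than hold it in memory.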

Browsers and webservers exchange cookies in HTTP headers. When a browser requests a web page from a webserver, it checks whether it has any cookies previously stored by that web page’s domain. If it finds any, it sends those cookies to the webserver in the HTTP header of the fetch request. When you execute the PHP/CURL command in Example 21-2, you can see the cookies as they appear in the returned header.
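To make the header exchange concrete, here is a minimal Python sketch (again, a stand-in rather than the chapter's PHP/CURL code) that reads the `Set-Cookie` lines from a response header and builds the `Cookie` line a client would send on its next request to the same domain. The sample header text and cookie names are made up for illustration, and the sketch deliberately ignores expiration, domain, and path matching.

```python
from http.cookies import SimpleCookie


def cookie_header_from_response(raw_headers):
    """Build the value of a Cookie request header from raw response headers."""
    jar = SimpleCookie()
    for line in raw_headers.splitlines():
        # Split "Name: value" at the first colon only, since cookie
        # expiration dates contain colons of their own.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "set-cookie":
            jar.load(value.strip())
    # Only name=value pairs go back to the server; attributes such as
    # expires and path stay on the client.
    return "; ".join(f"{m.key}={m.value}" for m in jar.values())


# Hypothetical response header, as a server might return it.
sample_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Set-Cookie: temporary_cookie=66; path=/\r\n"
    "Set-Cookie: permanent_cookie=88; expires=Thu, 01 Jan 2026 01:00:00 GMT; path=/\r\n"
)

print("Cookie: " + cookie_header_from_response(sample_response))
# prints: Cookie: temporary_cookie=66; permanent_cookie=88
```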

A browser will not change a cookie's contents on its own initiative; it only discards a cookie when the cookie expires or when the user erases it using the browser’s privacy settings. Servers, however, may write new information to cookies every time they deliver a web page. These new cookie values are passed to the web browser in the HTTP header, along with the requested web page. According to the specification, a browser will only expose cookies to the domain that wrote them. Webbots, however, are not bound by these rules and can manipulate cookies as needed.
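The contrast between a rule-following browser and a webbot can be sketched in a few lines of Python. This is an illustrative model only, not how any particular browser or the book's libraries store cookies: the `store` dictionary, the `cookies_for_host` helper, and the domain names are all hypothetical, and the domain check is a simplified version of the matching that real clients perform.

```python
def cookies_for_host(store, host):
    """Return the cookies a rule-following browser would send to `host`."""
    host = host.lower()
    visible = {}
    for domain, cookies in store.items():
        domain = domain.lower().lstrip(".")
        # Expose a cookie only to the domain that wrote it (or a subdomain).
        if host == domain or host.endswith("." + domain):
            visible.update(cookies)
    return visible


# Hypothetical cookie store, keyed by the domain that wrote each cookie.
store = {
    "example.com": {"login_status": "guest"},
    "other.example.net": {"tracking_id": "xyz"},
}

# A browser sends only the matching domain's cookies.
print(cookies_for_host(store, "www.example.com"))  # {'login_status': 'guest'}

# A webbot owns its cookie store outright. It can rewrite a value the
# server set...
store["example.com"]["login_status"] = "member"

# ...or build a Cookie header from cookies that a different domain wrote.
forced_header = "; ".join(f"{k}={v}" for k, v in store["other.example.net"].items())
print("Cookie: " + forced_header)  # Cookie: tracking_id=xyz
```

Nothing in HTTP stops a client from doing either of the last two steps; the restrictions live in the browser, which is why a webbot can ignore them.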