Chapter 18. Securing Your Web Service

In this chapter, we’ll look at technical issues involving entities outside your organization that you need to consider if you hope to offer a reliable web service.

The most powerful technique for providing a reliable service is redundancy: using multiple systems, Internet connections, routes, web servers, and locations to protect your services from failure. Let’s look at what can happen if you don’t provide sufficient redundancy.

It was Monday, July 10, and all of Martha’s Vineyard was full of vacation cheer. But for Simson Garfinkel, who had spent the weekend on the Massachusetts island, things were not so good. At 9:30 a.m., Simson was in the middle of downloading his email when suddenly IP connectivity between his house on Martha’s Vineyard and his house on the mainland was interrupted. No amount of pinging, traceroutes, or praying, for that matter, would make it come back.

What had happened to the vaunted reliability of the Walden network that we described in Chapter 2? In a word, it was gone. The Walden network had been decommissioned on June 30, when Simson moved from Cambridge to Belmont. Because Belmont’s telephone central office was wired for DSL, Simson was able to move his Megapath Networks DSL line, but Belmont’s cable plant had not been upgraded for data, so he had to leave his cable modem behind. So long, redundancy! On the plus side, with the money that Simson saved by canceling the cable modem, he was able to upgrade the DSL line from 384 Kbps to 1.1 Mbps without incurring an additional monthly fee. The faster line offered a huge improvement in performance, but the new setup also represented an increase in risk: Simson now had only one upstream provider. He had taken a gamble, trading redundancy for performance . . . and he had lost.

Simson caught a 10:45 a.m. ferry from Oak Bluffs to Woods Hole, then drove up to Belmont. With a few more detours and delays, he made it to Belmont at 4:00 p.m. Then he called Megapath Networks, his ISP, and waited on hold for more than 30 minutes. Finally, at 4:45 p.m. a technician answered the phone. The technician’s advice was straightforward: turn off the DSL router, wait a minute, then turn it back on. When that didn’t work, the technician put Simson on hold and called Rhythms, the CLEC that Megapath had contracted with to provide Simson’s DSL service. After another 30 minutes on hold, a Rhythms technician picked up the phone. That engineer did a line test, asked Simson to unplug his DSL router from the phone line, then did another line test. At last the verdict was in: “It seems that your DSL line recently got a whole lot shorter,” the engineer said.

How could a telephone line suddenly get shorter? Quite easily, it turns out. Sometimes, when a telephone installer working for the incumbent local exchange company (the ILEC) is looking for a pair of wires to run a telephone circuit from one place to another, the linesman will listen to the line with a telephone set to see if it is in use or not. This sort of practice is especially common in neighborhoods like Cambridge and Belmont, where the telephone wires can be decades old and the telephone company’s records are less than perfect. If the telephone installer hears a dial tone or a telephone conversation on a pair of wires, he knows that the pair is in use. But DSL lines don’t have dial tones or conversations—they simply have a hiss of data. In all likelihood, a linesman needed a pair for a new telephone that was being installed, clicked onto Simson’s DSL line, heard nothing, and reallocated the pair to somebody else.

Two hours later, Simson called Megapath back, which in turn called Rhythms. “Good thing you called!” he was told. Apparently nothing had been done—one of the companies, either Megapath or Rhythms, had a policy of not escalating service outage complaints until the subscriber called back a second time. Rhythms finally made a call to Verizon and was told that a new line would be installed within 48 hours.

Fortunately, things went better the next day. An installer from Verizon showed up at 10:00 a.m. That installer discovered that the DSL line had been cut by a junior Verizon employee who was installing Simson’s fax line. By 2:00 p.m. the DSL line was back up. Unfortunately, by that time, Simson’s email had already started to bounce, because many ISPs configure their mail systems to return email to the sender if the destination computer is unavailable for more than 24 hours.

What’s the lesson of this story?

If your Internet service provider promises “99% uptime,” you should probably look for another ISP. With 365 days in a year, 99% uptime translates to 3 days, 15 hours and 36 minutes of downtime every year. Much better is a commitment of “5 nines” reliability, or 99.999%—that translates to only 5 minutes and 15 seconds of downtime in a year. For many users, 5 minutes of downtime each year is an acceptable level of performance.
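To see where these figures come from, here is a minimal Python sketch that converts an uptime percentage into the downtime it allows over a 365-day year. (The yearly_downtime helper is purely illustrative, not part of any ISP tool or standard library.)

    # Convert an uptime percentage into the yearly downtime it allows.
    def yearly_downtime(uptime_percent):
        seconds_per_year = 365 * 24 * 60 * 60
        down_seconds = int(seconds_per_year * (1 - uptime_percent / 100))
        days, rem = divmod(down_seconds, 86400)
        hours, rem = divmod(rem, 3600)
        minutes, seconds = divmod(rem, 60)
        return f"{days}d {hours}h {minutes}m {seconds}s"

    print(yearly_downtime(99.0))      # 3d 15h 36m 0s
    print(yearly_downtime(99.999))    # 0d 0h 5m 15s

Running the sketch reproduces the numbers quoted above: "99% uptime" permits more than three and a half days of outage each year, while "five nines" permits barely five minutes.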

Unfortunately, the problem with uptime commitments is that they are only promises—you can still have downtime. Even companies that offer so-called Service Level Agreements (SLAs) can’t prevent your network connection from going down if the line is physically cut by a backhoe; all they can do is refund your money or make some kind of restitution after an outage. If Internet connectivity is critical for your service’s continued operation, you should obtain redundant Internet connections from separate organizations or provide for backup services that will take over in the event that you are disconnected.

Deploying independent, redundant systems is the safest and ultimately the least expensive way to improve the reliability of your information services. Consider: instead of spending $1000/month for a T1 line that promises less than 6 hours of downtime a month, you could purchase three less-reliable DSL circuits for $200/month from providers that promise 48-hour response time. Not only would this save you money, you would actually have a higher predicted level of reliability. In this case, if the T1 were actually down for 6 hours a month, there would be a 0.83% chance of the line being down during any given hour. For each DSL line, assuming one outage a month that lasts the full 48-hour response window (48 of the month’s 720 hours), the chance that it would be down during any given hour would be 6.6%. But if the DSL lines are truly independent—coming in from different providers, using different CLECs and different backbones—then the chance that all three would be down during the same hour is 0.066 × 0.066 × 0.066, or about 0.029%, making the three DSL lines almost 30 times more reliable than the single T1. The three 1.1 Mbps DSL lines would also offer nearly twice the bandwidth of a single T1 when all are operating normally.
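The comparison rests on the multiplication rule for independent failures: three independent lines are all down at once only when each one happens to be down at the same time. The short Python sketch below, using the same assumed outage figures as the example above (6 hours/month for the T1, 48 hours/month for each DSL line), reproduces the arithmetic.

    # Compare one T1 line against three independent DSL lines,
    # using the outage assumptions from the example in the text.
    hours_per_month = 30 * 24             # 720 hours in a 30-day month
    p_t1_down = 6 / hours_per_month       # T1 down 6 hours/month -> 0.83% per hour
    p_dsl_down = 48 / hours_per_month     # each DSL down 48 hours/month -> 6.7% per hour

    # Independent failures: all three DSL lines are down at once only when
    # each one is down, so the hourly probabilities multiply.
    p_all_dsl_down = p_dsl_down ** 3

    print(f"T1 unavailable in any given hour: {p_t1_down:.4%}")
    print(f"All three DSL lines down at once: {p_all_dsl_down:.4%}")
    print(f"Improvement factor:               {p_t1_down / p_all_dsl_down:.0f}x")

The sketch prints roughly 0.83%, 0.03%, and a 28x improvement; the 0.029% figure quoted above comes from rounding each hourly probability to 6.6% before multiplying. Either way, the conclusion is the same: the three cheap, independent lines fail together far less often than the single expensive line fails alone.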

While redundancy is a powerful tool, deploying redundant systems can also be a difficult balancing act. Coordinating multiply redundant systems is both a technical and a managerial challenge. Testing the systems can also present problems. If you are going to deploy redundant systems, here are some things that you should consider: