Redundancy and Wireless

Around Vineyard.NET’s third anniversary, I started worrying about the possibility of a fire—and what it would do to our company. At that time, Vineyard.NET was located completely within my house, a 150-year-old wood frame building. One careless night with wine and some candles, and the whole ISP could be history.

I had heard of something called business continuity insurance, so I called my insurance agent and asked what it was all about. He said that I would need to prepare a description of Vineyard.NET for the underwriters, the sort of problems that we could encounter, how much lost revenue each month we would have, and how we would recover after a loss. I chuckled; he was asking me for the very sort of disaster recovery plan that we advocate in earlier chapters of this book.

Vineyard.NET’s first disaster recovery plan was pretty pathetic. “Well, basically we would set up shop in a new building, buy all new equipment, and have the phone company pull in new circuits,” I explained to my insurance agent.

“So you would be down for a month? How much money would you lose?” he asked.

“Well, we would probably be down for 45 days, because these high-speed circuits can take a long time to get installed,” I said. “But by that time, all of our customers would have left and gone elsewhere. So it would basically wipe out the business.”

I realized that we didn’t have an insurable risk. Before I could expect an insurance company to stand behind my company, we needed to improve the company’s disaster planning so that there was something to stand behind.

Given that my primary concern was the possibility of fire, and my secondary concern was the possibility of theft, the logical thing for Vineyard.NET to do was to set up a second machine room in a second building. We had a location: the basement of our largest reseller, Educomp. All we needed to do was to get the phone company to pull a second 100-pair network cable to that location, put some spare equipment there, and be all ready for the eventual fire that we hoped would never come.

A quick call to the phone company revealed that this plan was more complicated than it seemed. NYNEX[228] said that it would not put facilities in a location without an order. Furthermore, once we placed the order, getting the facilities installed would require that a new conduit be installed between our new building and the manhole in the street; the operation would cost thousands of dollars and require shutting down the street again. And this time, we would need to pay for the work ourselves, as the lines were being installed in a commercial facility.

We thought of this expense as the first installment of our insurance policy.

While the phone company was working on providing facilities to our new location, we started working on the second part of the problem: figuring out a way to tie together the two machine rooms. Several approaches presented themselves.

We decided to go with the wireless approach. The first equipment that we tried came from an Israeli company named Breezecom. This equipment operated at 2.4 GHz using the 802.11 frequency-hopping standard. After a few months of trials, we gave up on the Breezecom equipment: it simply was not reliable enough. Our next try was with hardware from a company called C-Spec. The hardware was basically a 486 PC with a Lucent 915 MHz frequency-hopping Wavelan card and special software that C-Spec had written. The C-Spec equipment cost more than the Breezecom, but it worked without problems.

Vineyard.NET’s largest reseller was extremely happy with its new high-speed wireless connection to the Internet; for the previous three years, Educomp’s only connection to the Internet had been multiple dialup connections. But for Educomp, getting the wireless to work had been quite easy: the wireless system was an Ethernet bridge, so all we needed to do was to plug one wireless system into Vineyard.NET’s Ethernet and plug the second one into Educomp’s. Getting the wireless system to be usable for Vineyard.NET required considerably more work.

The first question that we were faced with, of course, was “What do we want to do with the backup site?” We knew that we wanted it to be our backup site, so we decided that we needed a backup computer system there. We took an old PC that we had upgraded, put some big SCSI hard drives on it, and put it in a rack in Educomp’s basement. We added to the setup a rack of 16 modems and a Cisco 2516 router. Normally, the Cisco would simply be an access server. But if our main building ever burned, we would have the phone company jumper the T1 to the new location and we could use the 2516 as our upstream router as well.

Once we had the computer at the backup site operational, our next order of business was to make it truly functional. We set up a series of jobs on our primary computer that would automatically back up the hard drives to the backup system on a daily basis. Then we set up another job that would copy over our most critical files—the accounting files, people’s email, and so on—on an hourly basis.
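The hourly copy of critical files can be sketched in a few lines of Python. This is only an illustration, not the jobs we actually ran: it assumes that the backup machine's disk is reachable as an ordinary directory (for example, over a network mount), that some scheduler such as cron starts the script, and that the function names and any paths passed to them are hypothetical.

```python
import os
import shutil


def needs_copy(src, dst):
    """Return True if dst is missing, differs in size, or is older than src."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    # Compare whole seconds so coarse filesystem timestamps do not cause
    # a file to be copied again on every run.
    return s.st_size != d.st_size or int(s.st_mtime) > int(d.st_mtime)


def sync_tree(src_root, dst_root):
    """Copy new or changed files from src_root into dst_root.

    Returns the relative paths of the files that were copied.
    Files deleted from src_root are deliberately left in dst_root,
    so an accidental deletion can still be recovered.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel_dir))
        os.makedirs(dst_dir, exist_ok=True)
        for name in sorted(filenames):
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            if needs_copy(src, dst):
                shutil.copy2(src, dst)  # copy2 also preserves timestamps
                copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return copied
```

Run hourly against the mail and accounting directories, a script like this moves only what has changed since the last pass, which is what makes an hourly schedule practical next to the full daily backup.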

Although the backup system was designed to help us survive a fire, we quickly realized that having a secondary system would also make it possible for us to survive a server crash, something that was far more likely. In the event that our primary server crashed, we wanted the backup server to be able to take over from the primary. This meant that it needed to be able to serve web pages, accept mail, and generally pretend to be the primary system.

To make this illusion successful, we gave the backup computer the IP address of our secondary nameserver. We set up a copy of our web server so that the backup computer could serve the web pages for all of our customers. We further modified the system so that some of the scripts would notice if they were running on the backup system and, if so, refuse to run. We decided that it was simply easier to prevent users from changing their passwords or account options while the backup server was in service, rather than try to figure out how to propagate the changes from the backup system back to the primary system.
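A script can tell which machine it is on by checking the hostname. The following Python sketch shows the idea under stated assumptions: the hostname `backup.example.net` and the `change_password` function are hypothetical stand-ins, not the names or scripts used at Vineyard.NET.

```python
import socket

# Hypothetical hostname of the backup server; adjust for the real site.
BACKUP_HOSTNAMES = {"backup.example.net"}


def running_on_backup(hostname=None, backup_hostnames=BACKUP_HOSTNAMES):
    """Return True if this script is executing on the backup machine."""
    if hostname is None:
        hostname = socket.gethostname()
    return hostname.lower() in backup_hostnames


def change_password(user, new_password, hostname=None):
    """Refuse account changes on the backup server.

    Changes made there would have to be propagated back to the
    primary, so the sketch simply declines to make them.
    """
    if running_on_backup(hostname):
        return "refused: account changes are disabled on the backup server"
    # The real password update would happen here on the primary.
    return f"password updated for {user}"
```

The check keys on the machine's own hostname rather than its IP address, so it keeps working even when the backup machine has taken over the primary's address.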

Finally, we waited.

Over the following two years we had very little use for an online backup server beyond file recovery: whenever we accidentally deleted a file, we could restore it from the backup computer.

Then in the fall of 1998, our backup system got its first real test. One afternoon, everything on our primary computer started to go haywire. Our ls and du commands were dumping core. We thought that we were either under attack or had suffered a really serious hardware problem. But then we noticed that other, far more complicated programs were working fine: we were able to log into the computer using ssh, and emacs still worked perfectly.

We tried to debug the problem with BSDI, but nobody had heard of the problems that we were having. We explained that we clearly had an operating system bug; we needed help. The best help that BSDI could give us, I said, was the source code to the ls program. I could then compile the program with debug symbols, see where the crash was, and figure out what we had done to trigger the problem.

But BSDI refused to give us the source code to the system that we had: “We just don’t do that,” I was told. Vineyard.NET could purchase the source code, but they would not give it to us, not even to help us find an operating system bug.

After an hour of screwing around with BSDI, we decided that we were on our own. For the first time since we had built the system, we switched over to run completely off the backup system. Doing the switchover was far easier than I thought it would be: we simply copied the current mail files from our primary system to the backup system, then we halted the primary system and gave its IP address to the backup. Suddenly Vineyard.NET was up and running again. Our customers never found out that we were running on a machine with a fraction of the capacity of the primary.

An hour later, an engineer at BSDI called me back. He said that he couldn't give me the source code, but he could give me a specially compiled version of the ls command with debug symbols left in. I ran the program, it crashed, and I examined the core dump. According to the core dump, the program had crashed while calling getgrent(), a function that reads through the /etc/group file. I examined the file and discovered that it had some trailing blank lines. I removed the blank lines and the problem went away. Apparently the extra blank lines had tickled a bug in the BSDI shared library. Programs like emacs and ssh were not affected because our copies of these programs had been compiled and linked before we had upgraded our system to the 3.0 release of the operating system, so they were using the 2.0 shared library, which did not have the bug.
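A check for this kind of problem is easy to automate. The Python sketch below is an illustration only: the function names are hypothetical, and it works on the text of a group file rather than touching /etc/group directly, so it can be tried safely on a copy.

```python
def find_blank_lines(text):
    """Return the 1-based numbers of blank or whitespace-only lines."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if not line.strip()
    ]


def strip_blank_lines(text):
    """Return the text with every blank line removed.

    The result ends with exactly one newline unless it is empty.
    """
    kept = [line for line in text.splitlines() if line.strip()]
    return ("\n".join(kept) + "\n") if kept else ""


if __name__ == "__main__":
    # Sample contents resembling a group file with two trailing blank lines.
    sample = "wheel:*:0:root\nstaff:*:20:root\n\n\n"
    print(find_blank_lines(sample))
    print(repr(strip_blank_lines(sample)))
```

Running a check like this after every edit to a system file would flag stray blank lines before a library routine has the chance to trip over them.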

With the problem diagnosed, we now could switch back to our primary system. There was only one problem—we couldn’t figure out a way to do this without interrupting service. At 4:00 a.m. that night, we turned off our SMTP server, copied everybody’s mail files back to the primary system, and moved back the IP addresses.



[228] Bell Atlantic and NYNEX merged in 1997; the combined company kept the Bell Atlantic name.