The original “project brief” was to demonstrate that books could be formatted in HTML with acceptable results. Having proved that, the next goal was to make “The Great Books” accessible. This was a good goal in the early days of the project, notwithstanding the controversies over what constitutes a great book, because others had already done the work of digitising most of the texts, making them readily available for adding to our collection.
Since those early days, the selection has been broadened to encompass several literary genres (gothic, sci-fi, etc.), as well as travel, science, and books that others have asked for (where possible).
We originally conceived of a great books program, similar to those running elsewhere. This is as good a starting point as any for selecting texts. But the main driver turned out to be the availability of texts. Although we initially expected to be scanning works ourselves, we don’t (yet) have the resources. And there’s no point scanning works when plain text copies are freely available from elsewhere. The intention is to eventually digitise selected works from our own print collection.
Otherwise, the main selection policy is “what interests me” at any particular moment.
Where do the books come from? If I’d scanned and OCR’d every book myself, there would be perhaps only a hundred or so, rather than the 3,000+ available today. Scanning is a lot of work. Also, you need a copy of a printed edition in order to scan, and despite working in a University library with two million volumes, I don’t have ready access to everything I might want. Proof-reading and correction after OCR is even more time-consuming than the scanning. And I’m only one guy, so . . . in most cases, I’ve taken the source text from some other site providing free public domain texts.
Chief of these is the venerable Project Gutenberg, which I’ve been following since they had only a couple of thousand books. They now have over 40,000, thanks largely to the efforts of the Distributed Proofreaders project, which very cleverly taps into the crowd-sourcing power of the internet to do the OCR proof-reading and correction. Neither is perfect: Project Gutenberg has a lot of duplication; sometimes they have incomplete multi-volume works — volume 2 without volume 1, for example; some complex works are a mess that requires considerable effort from me to untangle and produce a well-structured book. Distributed Proofreaders also has some eccentricities that can cause me difficulties, particularly with notes, quotations and verse. But by and large, together they have produced a marvellous resource for constructing ebooks.
I also make good use of the Australian and Canadian versions of Project Gutenberg, for material that’s copyright in the USA and therefore unavailable on the US site.
In the early days, I made use of some other early etext collections: Wiretap (still going); MIT’s Classics Archive (now being resurrected from the dead); the ERIS Project (defunct) from Virginia Tech. Many of these works, I subsequently found, had been added to Project Gutenberg.
When these sources have failed me, I’ve often been able to locate an ebook version of a title on some other site. I have taken these and reformatted them with a clear conscience, since the original text is in the public domain, while being (privately) grateful to the site owner for making the work available. Usually, for the reasons stated in the introduction, the formatting on the source site is horrible, so I’m liberating the book from unreadable formats to make it more accessible. Also, many of these sites are “fragile”, in the sense of existing at the whim of the site owner, and may at any time disappear as a result of financial pressures or lack of energy. I have seen a number of sites disappear in this way. (There were some useful sources on the old Geocities site, which sadly vanished when Yahoo decided to shut it down.) So my work is also rescuing ebooks from potential oblivion.
When all else fails, I turn to the Internet Archive, which has a huge collection of works, but scanned images only, derived from the original Google Books and Microsoft Books projects in concert with a number of University and large public library collections. This is an amazing resource, with all manner of obscure and almost unobtainable texts. On the downside, it is also incredibly badly organised, and it sometimes requires considerable effort to sort out one volume or edition from another: basically, you have to examine each work to figure it out. Also, some of the scans are better than others (and some are so bad as to be worthless): some of the earliest Google Books project scans are very rough, intended only as material for OCR, whereas later scanning efforts seem to be much better quality. Those from Cornell University are usually of the best quality. But, if you can find a good scan, you’ll probably find that the text version (the raw OCR output) also available from the Internet Archive will be worth using, and I’ve made a few good ebooks from these (after a few hours of correction).
Production can be broken down into a number of steps:
Scanning a print book. This is tedious work; the most demanding part is making sure you don't miss a page. Keeping the pages aligned is also important. In theory, you can scan around 10 pages per minute with a decent scanner and a good pair of hands. So a 300-page novel should take around half an hour. In practice, you'll probably manage that book in a couple of hours, spread over a day. Fortunately, someone else has probably already done the scanning, so you mostly won't need this step.
Conversion to text (OCR). Once the book has been scanned, you need to convert the page images into text. There are plenty of tools to do this, sometimes bundled with scanning software. Personally, I use a Linux application called tesseract, which generally does an outstanding job. But whatever software you use, you're going to get errors. OCR is notorious for mixing up "be" with "he", "c" with "o", "th" with "tli", etc. The clarity of the original scan, and the typeface used in the book, are crucial. If the letters are too close together, too small, or italicised, you will get more errors. If the page was dirty, or scribbled on or underlined by a reader, then you will get more errors. Unless you can find a cleaner text to scan, you're stuck with fixing these errors.
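By way of illustration, running OCR over a whole directory of page scans is easy to script. The sketch below is only a minimal example, not part of the actual workflow: it assumes the page images are named page-001.png, page-002.png and so on in a scans directory (those names, and the output file, are assumptions made for the example), runs tesseract over each image, and glues the resulting text files into one raw file.

```perl
#!/usr/bin/perl
# Minimal sketch: run tesseract over a directory of page scans and
# concatenate the results into a single raw text file. The directory
# layout and file names are assumptions; adjust to suit your own scans.
use strict;
use warnings;

my $scan_dir = 'scans';                      # assumed location of the page images
my @pages    = sort glob("$scan_dir/page-*.png");

open my $out, '>', 'raw-ocr.txt' or die "Cannot write raw-ocr.txt: $!";
for my $page (@pages) {
    (my $base = $page) =~ s/\.png$//;
    # tesseract writes its output to "$base.txt"
    system('tesseract', $page, $base) == 0
        or warn "tesseract failed on $page\n";
    open my $in, '<', "$base.txt" or next;
    print {$out} <$in>;                      # append this page's text
    close $in;
}
close $out;
```

Which brings us to step 3: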
Proof-reading/correction. Now that you have the raw text of your book, you need to correct all those OCR errors. This is as simple as it sounds: open the text in your editor of choice, preferably with spell-check highlighting turned on. Then you work through, page by page, fixing all the errors, probably with reference to the original page scans where the correct text is not obvious. This can take ages, so I find it's a good activity for an evening in front of the TV. Probably with a glass of wine.
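If the book is long, it can help to know where to look first. Purely as an illustration (the patterns below are hypothetical examples, not a recommended or exhaustive list), a few lines of script can flag lines containing the kinds of OCR confusions mentioned above:

```perl
#!/usr/bin/perl
# Illustrative helper: print the line numbers of lines that contain
# common-looking OCR confusions, as a rough guide for manual correction.
# The patterns are examples only, not an exhaustive or recommended list.
use strict;
use warnings;

my @suspects = (
    qr/tli/,                    # "th" misread as "tli"
    qr/\b[a-z]+\d+[a-z]+\b/i,   # digits inside a word, e.g. "c0uld"
    qr/\brn[a-z]/,              # "m" misread as "rn" at the start of a word
);

while (my $line = <>) {
    for my $re (@suspects) {
        if ($line =~ $re) {
            print "$.: $line";  # $. is the current line number
            last;
        }
    }
}
```

Run over the raw OCR text, this simply lists the suspicious lines; the corrections themselves still have to be made by hand against the page scans.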
Adding structure. Once corrected, you can proceed to the next stage, which is to apply some structure to the book. Remember, at this stage you just have a plain-text file, so you will need to go through and identify chapter headings, quotations, verse, and all the other things mentioned in previous sections of this work. And it's at this point that you need to decide whether to retain the original page numbers (useful if your book has an index) or discard them. I usually discard them for straight-forward fiction, keep them for complex non-fiction works. If I'm keeping them, I like to wrap the page number in braces, as {pg32}, which makes them easy to identify later.
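As a small illustration of why the braces help, a marker like {pg32} can be picked up later with a single regular expression. The sketch below is just an example (the file handling and the HTML-anchor variant are assumptions, not part of the actual workflow): it strips the markers out, or, with the alternative line, turns each one into an HTML anchor.

```perl
#!/usr/bin/perl
# Example only: find {pgNN} markers in a structured text file and either
# remove them or turn them into HTML anchors. Reads standard input or a
# file named on the command line, writes to standard output.
use strict;
use warnings;

while (my $line = <>) {
    # Drop the markers entirely ...
    $line =~ s/\{pg(\d+)\}//g;
    # ... or, instead, keep them as anchors for later linking:
    # $line =~ s{\{pg(\d+)\}}{<a id="page$1"></a>}g;
    print $line;
}
```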
Building the ebook. With a clean, well-structured plain text file, we can convert the text to HTML. I use a locally-written Perl script to do this, and then more scripts to split the HTML version into multiple parts, usually by chapter. Finally, a “front section” with table of contents is created linking them all together.
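Rather than reproduce the actual scripts, the sketch below just gives the flavour of the splitting stage. It assumes the converted book is a single HTML file in which every chapter begins with an <h2> heading; the file names and the bare-bones page template are assumptions made for the example.

```perl
#!/usr/bin/perl
# Simplified sketch: split a single-file HTML book into one file per
# chapter and build a contents page linking them together. Assumes each
# chapter starts with an <h2> heading.
use strict;
use warnings;

open my $in, '<', 'book.html' or die "Cannot read book.html: $!";
my $html = do { local $/; <$in> };              # slurp the whole file
close $in;

# Split into chunks, each beginning with its <h2> heading.
my @chunks = split /(?=<h2>)/, $html;
shift @chunks if @chunks and $chunks[0] !~ /<h2>/;   # drop any preamble

my (@toc, $n);
for my $chunk (@chunks) {
    $n++;
    my ($title) = $chunk =~ m{<h2>(.*?)</h2>}s;
    $title = "Chapter $n" unless defined $title;
    my $file = sprintf 'chapter%02d.html', $n;
    push @toc, qq(<li><a href="$file">$title</a></li>);

    open my $out, '>', $file or die "Cannot write $file: $!";
    print {$out} "<html><body>\n$chunk\n</body></html>\n";
    close $out;
}

# A minimal "front section" linking the chapters together.
open my $toc_fh, '>', 'index.html' or die "Cannot write index.html: $!";
print {$toc_fh} "<html><body><h1>Contents</h1>\n<ul>\n",
                join("\n", @toc), "\n</ul>\n</body></html>\n";
close $toc_fh;
```

Run against book.html, this writes chapter01.html, chapter02.html and so on, plus an index.html contents page; a real conversion also has to cope with front matter, notes and less tidy markup.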