In any discussion of document imaging, we must also talk about mass storage. The images captured by scanners take a lot of space. Currently, the preferred medium is the optical disk. The advent of economical, high-capacity optical disks was one of the critical technological advances that enabled the imaging industry to become a reality. The main technology used in the imaging domain is called WORM, for Write Once Read Many. WORM disks can store from 1.2 to 10 gigabytes of data on a single cartridge. The write-once limitation may actually be an advantage: the data are physically impossible to erase, which suits the archival function of most imaging applications.
The other optical technologies, magneto-optical (MO) and CD-ROM, are not appropriate for document imaging applications, but prices keep falling, so check current costs. MO disks can be read and written many times, but they are probably too expensive for the high data volumes needed; this may change in the future. CD-ROMs require a factory to master and replicate the information, but they are the media of choice for distributing many copies of large volumes of information. In fact, desktop CD-ROM production is now a reality, and low-volume production is practical.
Significant imaging applications often require terabytes of on-line storage. The cost-effective solution for keeping all this information accessible is to use optical jukeboxes. Just like the old Wurlitzer in the corner diner, an optical jukebox contains a set of cartridges that are swapped, one at a time, into a drive. A wide variety of jukeboxes are available, ranging in size from a toaster to a refrigerator. Some hold WORM cartridges, some hold MO (rewritable) cartridges, and some contain both types. Jukebox capacities depend largely on the number of cartridges they can hold, from 10 to 2,000, and storage capacities can reach 12 terabytes in a single jukebox. A jukebox that large, however, is well beyond desktop size.
Let's also remember the wonderful world of micrographics. COM (Computer Output Microform) systems are still alive and kicking. Even today, images can be stored in a cost-effective way using microfilm, microfiche, and the ever-popular aperture card. Microfilm is accepted throughout the world as a legal archival copy.
Access to images stored in these analog media is labor-intensive and basically awkward. You can't even do simple text searches. Digital information, by contrast, can take advantage of new storage technologies. You can't shrink microfilm images, but as digital storage technology improves, more and more information can be packed into less and less space. Witness the new developments in next-generation CD-ROMs. (See Section 7.4.1 CD-ROM in Chapter 7, Applying Standards, for more information on DVDs.) Some far-out technologies mentioned in the press are digital paper and holographic memory, with storage capacities several orders of magnitude greater than existing optical media.
The capabilities of document imaging systems are constantly expanding. Coupled with the improvements in text retrieval and networks, it seems likely that imaging systems present us with an important opportunity to add value to information.
"Few things are harder to put up with than the annoyance of a good example." -Mark Twain
In this chapter we will take a brief look at a number of real-life projects. Each project has its own idiosyncrasies and unique constraints, but so do all publishing efforts. These case studies shed some light on the actual experiences of completed projects. These are not toy projects but full-fledged productions with real-world deadlines and problems.
The initial information on many of these projects came from participants' answers to variations of the questions shown on the next page.
In some cases, the information came from e-mail conversations, published documents, Web sites, the Internet, or other mass-media sources.
In each case study, quotes are in indented paragraphs, and editorial comments [like this comment] are placed in brackets. For example:
The end result of the project was totally mind-blowing. It will revolutionize [in the project manager's opinion] the chicken farming business, making it one of the growth industries of the '90s.
In several cases I simply paraphrased the material from acknowledged sources. In a few cases, I did some minor editing of quotes sent in an informal style. In all cases, I did not change the meaning.
9.1 The CAJUN Project
The CD-ROM Acrobat Journal Using Networks (CAJUN) project is an experiment in publishing journal articles using Acrobat. The resulting PDF files are disseminated via network and on CD-ROM.
The CAJUN project and its follow-on, CAJUN II, are efforts to move the publication of electronic journals toward a genuine technical and business reality. The project is a collaboration between the Electronic Publishing Research Group at the University of Nottingham in the United Kingdom and John Wiley & Sons, Ltd., the publishers. Initially the Wiley journal Electronic Publishing - Origination, Dissemination and Design (EP-odd) was the focus. A second journal, Chapman and Hall's Optical and Quantum Electronics (OQE), was also used for the project as a non-computer-science title. The bulk of this case study comes from the paper "Journal publishing with Acrobat: the CAJUN project" by Philip N. Smith, David F. Brailsford, David R. Evans, Leon Harrison, Steve G. Probets and Peter E. Sutton, in the Proceedings of Electronic Publishing 94, Darmstadt, Germany. The authors are all members of the Electronic Publishing Research Group at the University of Nottingham.
[Figure: The CAJUN Project home page at the University of Nottingham]
The CAJUN project has expanded to a set of nine journals. These can be found at http://www.ep.cs.nott.ac.uk/wiley/journals/. Each journal has, or soon will have, its own home page:
• Electronic Publishing - Origination, Dissemination and Design
• Software Practice and Experience
• Software Testing Verification and Reliability
• Visualisation and Computer Animation
• Concurrency Practice and Experience
• Software Process - Improvement and Practice
Acrobat was chosen for the project because it had the potential of being a flexible, platform-independent standard, documented in the public domain, rather than yet another proprietary system. The Acrobat publishing process is also capable of using existing PostScript documents. The following figure, redrawn from the previously cited paper, illustrates the Acrobat publishing process:
[Figure: The Acrobat publishing process]
Acrobat has the ability to support hypertext links. In the Acrobat viewer application these links are displayed as text with a black box drawn around it. This default display is ugly and clashes with the surrounding text. The CAJUN team decided to add PostScript color to the linked text and effectively replace the default display. This was accomplished using the pdfmark PostScript operator, part of the PDF enhancements to PostScript. After beta-testing the displays, the CAJUN team decided on the following:
Buttons for links should be displayed in dark blue rather than being enclosed in a box. The shade of blue chosen is easy to distinguish on screen yet is dark enough to print clearly on a 300-dpi monochrome printer. Users of monochrome screen displays, probably small in number, will find it hard to distinguish the coloured text, but will at least know that all textual references are linked up.
Link destinations should bring up a view of the document with the target material positioned at the top of the screen. The view should be at the current magnification, as this is the most comfortable for the reader.
PDF bookmarks should be provided for all section headings at all levels. The destination view, after following a bookmark, should be the same as for other links (i.e., beginning of section at top of page and current magnification).
One "principal aim of the project is to automate the process of generating pdfmarks from the troff or LaTeX source."
The solution we have adopted involves delving into the output routines of TeX to intercept the page just before output. ...when version 2 of pdfmark becomes available, with an optional argument for 'source page number,' it will be possible to position all pdfmarks at the image of the PostScript file.
Generation of pdfmarks for bookmark entries also requires extra processing in order to arrange them into a hierarchy reflecting that of the sections in the paper. This is also done with extra PostScript procedures defined in the prologue.
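To give a flavour of what such marks look like, here is a minimal hand-written sketch of the general kind of pdfmark fragments the project generates automatically. The coordinates, colour value, destination name, and section title below are invented for illustration and are not taken from the CAJUN code; the opening guard line is the idiom Adobe documents for files that may be printed on interpreters without pdfmark support.

    % Minimal sketch only: coordinates, names, and the colour value are
    % invented, not taken from the CAJUN project's actual PostScript.
    % Define pdfmark as a no-op on interpreters that do not know it.
    /pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse

    % Paint the reference text in dark blue instead of relying on
    % Acrobat's default black box around the link.
    /Times-Roman 10 selectfont
    0 0 0.6 setrgbcolor
    72 500 moveto (see Smith et al. [3]) show
    0 0 0 setrgbcolor                  % back to black for body text

    % A borderless link annotation over that text.
    [ /Rect [70 492 200 512]           % bounding box of the coloured text
      /Border [0 0 0]                  % zero-width border: no ugly box
      /Dest /ref3                      % hypothetical named destination
      /Subtype /Link
    /ANN pdfmark

    % The named destination, emitted while the target page is current:
    % target material at the top of the window, current zoom retained.
    [ /Dest /ref3 /View [/XYZ 0 760 null] /DEST pdfmark

    % A bookmark (outline entry) for a section heading.
    [ /Title (3 Implementation) /Dest /ref3 /OUT pdfmark

    showpage

In the CAJUN workflow, of course, such marks are not written by hand; as described above, they are generated by macros and extra PostScript procedures hooked into the troff and LaTeX output.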
Clearly the CAJUN project has demonstrated the feasibility of automatically generating electronic documents that retain page fidelity. The process, however, is not simple; it required experienced document processing professionals. Although it was research, the lessons learned will apply to a wide range of practical publishing problems. As Acrobat becomes more widely available and integrated with the Web, the procedures the CAJUN project has invented will become even more valuable.
9.2 Text Encoding Initiative
The Text Encoding Initiative (TEI) illustrates how a large, complex set of document specifications and standards was created for the humanities.
Most of the information about this project came from publicly available information sources, particularly the TEI ListServer at the University of Illinois at Chicago. In addition, useful pointers and information were provided by Nancy Ide, a member of the TEI steering committee and organizer of the original planning conference, and Lou Burnard, one of the two editors of the Guidelines.
The Text Encoding Initiative (TEI) arose out of a planning conference convened by the Association for Computers and the Humanities (ACH) at Poughkeepsie, New York, in November 1987. It is an international research project, sponsored jointly by the ACH, the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing (ALLC), with the further participation of several other organizations and learned societies.
From this meeting there emerged a set of resolutions on the necessity and feasibility of defining a set of guidelines to facilitate both the interchange of existing encoded texts and the creation of newly encoded texts. The resolutions stated that the guidelines would specify both what features should be encoded and how they should be encoded, as well as suggesting ways of describing the resulting encoding scheme and its relationship with pre-existing schemes. Compatibility with existing schemes would be sought where possible, and in particular, ISO standard 8879, Standard Generalized Markup Language (SGML), would provide the basic syntax for the guidelines if feasible.
After the Vassar meeting, the ACH, the Association for Literary and Linguistic Computing (ALLC), and the Association for Computational Linguistics (ACL) joined as co-sponsors of the project and defined a four-year work plan to achieve the project's goals. Funding for the work plan has since been provided by substantial grants from the American National Endowment for the Humanities and the European Economic Community.
The goal of the TEI is to develop and disseminate a clearly defined format for the interchange of machine-readable texts among researchers, so as to allow easier and more
efficient sharing of resources for textual computing and natural language processing. This interchange format is intended to specify how texts should be encoded or marked up so that they can be shared by different research projects for different purposes. The use of specific types of delimiters to distinguish markup from text, the use of specific delimiter characters, and the use of specific tags to express specific information, are all specified by this interchange format, which is closely based on the international standard ISO 8879, Standard Generalized Markup Language (SGML).
In addition, the TEI has taken on the task of making recommendations for the encoding of new texts using the interchange format. These recommendations give specific advice on what textual features to encode, when encoding texts from scratch, to help ensure that the text, once encoded, will be useful not only for the original encoder, but also for other researchers with other purposes. In these Guidelines, the requirements of the interchange format are expressed as imperatives; other recommendations, which are not binding for TEI-conformant texts, are marked as such by the use of such phrases as 'should' or 'where possible.'
The current Guidelines thus have two closely related goals: to define a format for text interchange and to recommend specific practices in the encoding of new texts. The format and recommendations may be used for interchange, for local processing of any texts, or for creation of new texts. (1)
Some of the design goals are that the guidelines should:(2)
1) suffice to represent the textual features needed for research
2) be simple, clear, and concrete
3) be easy for researchers to use without special-purpose software
4) allow the rigorous definition and efficient processing of texts
5) provide for user-defined extensions
6) conform to existing and emergent standards

PROJECT ORGANIZATION
The TEI is a project of significant size and scope, so it requires a formal structure. The project has three organizational units: a Steering Committee, an Advisory Board, and Working Committees.
A six-member steering committee, with two representatives from each of the three sponsoring organizations, guides the project.
The advisory board includes representatives from the American Anthropological Association, American Historical Association, American Philological Association, American Society for Information Science, Association for Computing Machinery-SIG for Information Retrieval, Association for Documentary Editing, Association for History and Computing, Association Internationale Bible et Informatique, Canadian Linguistic Association, Dictionary Society of North America, Electronic Publishing SIG, International Federation of Library Associations and Institutions, Linguistic Society of America, and Modern Language Association. The advisory board will be asked to endorse the guidelines when they are completed.
Two editors coordinate the work of the project's four Working Committees.
Committee 1: Text Documentation
Committee 2: Text Representation
Committee 3: Text Analysis and Interpretation
Committee 4: Metalanguage and Syntax Issues
The organization and process that created the guidelines are similar to those for the development of national and international standards. The process is open; public comments are solicited and encouraged. The Guidelines emerge from a consensus reached in the Working Committees.
From the outset, the TEI has sought to involve the community of potential users in shaping the content and format of the Guidelines. Before the final report is published, it is essential that diverse groups of scholars and others involved in the encoding of literary and linguistic materials be given the opportunity to test and comment on our proposals. This draft begins the first of what we know will be several cycles of recommendation, review, and revision.(3)
The Document Type Definition (DTD) development process was carried out mostly by the two individuals who are the editors of the Guidelines, C.M. Sperberg-McQueen and Lou Burnard. The TEI committees, especially the Metalanguage Committee, provided a great deal of support and help.
The tools to develop the DTDs were "Every SGML editor we could lay our hands on: specifically MarkIt, Author/Editor, Exoterica, and more recently VM2."(4)
An initial four-year grant from the U.S. National Endowment for the Humanities, the European Economic Community, and the Andrew W. Mellon Foundation got the project off the ground. The project is jointly sponsored by the ACH, ACL, and ALLC (see the BACKGROUND section). Substantial new funding was obtained for Phase 2 of the project.
A significant draft of the TEI Guidelines, entitled Guidelines for the Encoding and Interchange of Machine-Readable Texts, was published in 1990. The second version of the TEI Guidelines was published in mid-1992. Given the work's scope and scale, the second version (known as P2) was distributed as a series of parts. Each part is made available when completed through the usual distribution channels (for example, the ListServer and direct requests to the project).
Version 3 (known as P3) is distributed on the TEI Web site at http://www-tei.uic.edu:80/orgs/tei/. The Guidelines are available in SGML or plain ASCII. In addition, you can now purchase a CD-ROM of the Guidelines, which includes DynaText browsing software for the Mac and MS Windows PCs.
Along with the Guidelines, a series of DTDs was produced. These DTDs are as follows: TEI.1 - The main DTD, which:
...defines some useful entities, then defines the element TEI.1 and includes files with various specialized parts of the document type definition.
TEIhdr1 - Header for TEI documents
A TEI header comprises a file description, a set of encoding declarations, and a revision history (change log).
TEI.wsd - TEI writing system declaration
A TEI writing system declaration documents the character set being used in a document, whether for local work or for interchange.
TEIbase1 - Basic structure for conventional prose.
TEIfron - Element declarations for front matter.
Front matter comprises a title page and a series of other front-matter elements. Generic "front.part" tags may be used for other pieces which occur in a given text; they can occur before or between any others.
TEI.1 - Main document type declaration file.
Definitions for phrase-level elements
TEIcrys1 - Declarations for paragraph-level crystals.
Citations, Structured Citations, Lists of Citations occur between or within paragraphs.
TEIling1 - Declarations for linguistic analysis.
TEIrend1 - Declarations for typographic rendition.
TEItc - Declarations for critical apparatus
TEIdram1 - Basic structure for dramatic texts (An alternate base DTD to be used in lieu of TEIbase1)
Several working papers are available. The Metalanguage Committee has a definition of "TEI conformance," a topic of concern to potential developers of TEI-conformant software.
The Guidelines were mailed to interested parties at no cost. Electronic mail also plays a major role in the dissemination of documents and timely information. The address of the listserver is listserv@UICVM.BITNET or listserv@uicvm.uic.edu.
Send a mail message with the line GET TEI-L FILELIST to obtain a list of currently available information. (See Section 7.4.4 Electronic Mail in Chapter 7, Applying Standards, for more information on listservers.)
The Guidelines and the DTDs that have been completed are available to the public via this listserver.
Meetings take place approximately twice a year to coordinate participants' activities. The meetings alternate between the United States and Europe.
Given the very large scale of the TEI Guidelines, tutorial and introductory material was needed. Two documents in particular define TEI subsets that aim to introduce users gently to the full TEI tag set. The first describes a version called "BareBones SGML," a skeletal but clean subset of the full TEI encoding scheme. Familiarity with this minimal tag set should lead to the second subset, called "TEI Lite."
TEI Lite: An Introduction to Text Encoding for Interchange, by Lou Burnard and C.M. Sperberg-McQueen, is a document (on the TEI Web site) describing a useful subset of TEI. It states:
The present document describes a manageable selection from the extensive set of SGML elements and recommendations resulting from those design goals, which is called TEI Lite.
In selecting from the several hundred SGML elements defined by the full TEI scheme, we have tried to identify a useful 'starter set', comprising the elements which almost every user should know about. Experience working with TEI Lite will be invaluable in understanding the full TEI DTD and in knowing which optional parts of the full DTD are necessary for work with particular types of text.
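To give a feel for the kind of encoding involved, the fragment below is a minimal sketch of a TEI Lite style document. The element names (teiHeader, fileDesc, text, body, and so on) come from the TEI scheme, but the content, the DOCTYPE system identifier, and the particular selection of elements shown are invented for illustration rather than copied from the Guidelines.

    <!-- A minimal sketch of a TEI Lite encoding; the content is invented. -->
    <!DOCTYPE TEI.2 SYSTEM "teilite.dtd">
    <TEI.2>
      <teiHeader>
        <fileDesc>
          <titleStmt>
            <title>A Sample Poem: an electronic transcription</title>
          </titleStmt>
          <publicationStmt>
            <p>Distributed for demonstration purposes only.</p>
          </publicationStmt>
          <sourceDesc>
            <p>Transcribed from a printed edition (details omitted).</p>
          </sourceDesc>
        </fileDesc>
      </teiHeader>
      <text>
        <body>
          <div type="poem">
            <head>A Sample Poem</head>
            <lg type="stanza">
              <l>A first line of verse, marked as a line,</l>
              <l>and a second line to close the stanza.</l>
            </lg>
          </div>
        </body>
      </text>
    </TEI.2>

The point of the exercise is that the tags record what each piece of text is (a title, a stanza, a verse line) rather than how it should look, which is what makes the encoding reusable across projects and software.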
Getting a large collection of researchers to agree on anything is a monumental task. While there may be disagreements on the content of the Guidelines, it is clear that very substantial progress has been made toward the goals initially set at the start of the project. SGML was used, as it should be, as an enabling technology. There was no reason to reinvent a markup scheme when a rich extensible one already existed. Furthermore, the TEI community was able to take advantage of commercially available tools. In fact, donation of SGML parsing tools by the SEMA Group and Software Exoterica significantly aided the development of the Guidelines.
To date, the TEI Guidelines are the most significant large-scale effort at developing a semantically meaningful tag set. The use of the tags is up to the users. The Guidelines are, in effect, an extension of markup technology: meaningful use of the tags is not dictated by the tags themselves; rather, they enable meaningful analysis.
The project and its results should both serve as models for other projects involved in aspects of text analysis. The open, participatory methods used by the project, along with a number of key funded positions, enabled the project to produce real results in a reasonable time frame. The tags and DTDs form a good basis for discussing the issues of textual analysis.
9.3 SGML: The Standard and Handbook
This case study is about the production of two intimately related documents. The first document is SGML, the ISO standard itself. The second document is The SGML Handbook, a book about the ISO standard.
The product of any standards-making activity is a document. Since SGML is a standard for the electronic publishing industry, it was only fitting (and elegant) that the ISO standard document be produced with the aid of SGML. The SGML Handbook (hereafter called the Handbook) is a book, published six years after the ISO standard, that contains the entire text of the SGML standard. The Handbook presents the SGML standard in two complete ways: in its original order with annotations and in a logically structured overview. In addition, it presents SGML with tutorial material explaining basic and advanced concepts.
SGML started out as an ANSI project called "Computer Languages for the Processing of Text" back in 1978. IBM's Charles Goldfarb made a presentation to the project committee, which eventually changed focus to the development of a language for the description, rather than processing, of text. The project was eventually taken over by the ISO committee TC97/SC18/WG8.
The technical development and evolution of SGML was the result of the work of many people. However, during the entire development cycle, Charles Goldfarb was the principal designer and editor. Coordination of the document was handled through conventional mailings of printed output, phone conversations, and working group meetings. The actual document processing system used for most of the process was IBM's Document Composition Facility (DCF), which used GML (Generalized Markup Language), the predecessor to SGML. GML provided a head start in the development of SGML, but it was by no means the same thing. Toward the end of the development process, it was possible for a conversion program to translate GML into SGML.
Approximately one year before completion of the standard, Goldfarb went to the ISO Central Secretariat to clarify some language issues and formatting concerns. This meeting laid the groundwork for changing the way ISO does business. Goldfarb, Anders Berglund of CERN, and Keith Brannon of ISO clearly demonstrated the true value of the SGML approach. They formatted the marked-up text of the standard in five different ways in a period of about 30 minutes without touching the contents. Today, ISO is committed to producing all its standards using SGML.
The famous proof-of-the-pudding SGML story goes roughly as follows: Goldfarb shipped the unformatted standard to Anders Berglund at ISO via a computer network (the Internet). Three days later, Goldfarb boarded a plane to Geneva. Upon arriving at the Geneva Hilton, he was greeted with, "Oh, Dr. Goldfarb, there is a package here for you." It was the complete SGML standard, formatted to ISO specifications.
Although the production of the SGML standard was an SGML first, many of the Handbook’s formatting requirements were more demanding and interesting.
The SGML Handbook was published in 1990 by Oxford University Press. The book's editing and formatting were done mainly by Yuri Rubinsky, the late president of SoftQuad Inc., a major SGML vendor. The Handbook is unusual in several ways. First and foremost, it contains the entire text of ISO 8879, the SGML standard.
Second, it contains a novel cross-referencing system, which consists of roughly 3000 visual "buttons" that look like the one illustrated here [figure omitted] and are quite effective. In the button illustrated, the [128] refers to a syntax production (part of the definition of the standard), the 410 is a page number, and the 17 is a line number of ISO text. ISO text is text that originated in the actual standard and is typographically distinct from the new material in the Handbook. The cross-referencing mechanism is made more usable by the book's physical construction, which includes two ribbon bookmarks. The original "syntax production with built-in cross-reference" idea came from Harvey Bingham of Interleaf Corporation. David Slocombe of SoftQuad, Inc., improved it with formatting-for-readability of the syntax productions.
The integration of several ISO-owned texts with new material is unique. The SGML standard was originally tagged with SGML, and the same tagged text was imported into this book.
A new DTD was developed for the book that contained all the structure used by the standard plus the new structure needed. The ISO text was not modified in any way, with the exception of the addition of the cross-reference buttons. Most of these buttons were generated automatically from the SGML markup; Yuri Rubinsky, the editor, added the rest.
Standards are explicit codifications of technology or a set of knowledge. They are written with precision and are generally difficult to read. The strict legalese in which they are written is necessary because standards can become legally binding. In addition, the standards-making process forces a certain structure and style upon the document. The Handbook is a reorganized version of SGML with readable explanations and commentary throughout. The primary contribution of this book is to make a difficult topic accessible.
The Handbook includes three sources of ISO-owned information, which are interwoven with new material.
1. The entire text of ISO 8879.
2. The 1988 amendment to ISO 8879.
3. The document "ISO/IEC/JTC1/SC18/WG8/N1035: Recommendations for a Possible Revision of ISO 8879."
From a legal point of view, the text of an ISO standard is copyrighted by ISO. The legal nature and use of standards require that ISO protect its standards to ensure that no unauthorized changes of content or intent are introduced. ISO also must continue to obtain revenue from the sale of standards; like all organizations, it must have resources.
Two major legal issues were resolved to bring the book into existence. First, ISO receives a significant royalty from Handbook sales. This seems reasonable; since the book contains the entire text plus commentary, why would anyone buy a copy of the standard directly from ISO? Second, and technically much more interesting, is the use and formatting of the three ISO documents. ISO had to be convinced that the content of the standard was not going to be modified in any way when it appeared in the Handbook. Technically, this was accomplished by using the same SGML-marked-up electronic files used to create the standard.
The 1988 amendment to the standard, which contains a small number of replacement sections, and the recommended changes from the N1035 document, are used in the structured overview and annotations. Change bars indicate that the material came from the amendment or N1035, rather than the 1986 standard.
The standard and much of the Handbook were keyed in using IBM's DCF system. A program written by Wayne Wohler of IBM was used to convert the original GML documents into SGML documents.
The Handbook was actually produced by Yuri Rubinsky, who was sent dictation tapes for much of the new material. These were transcribed into SoftQuad's Author/Editor (an SGML-sensitive word processor). The resulting files, comprising the original ISO 8879 standard, the amendment, N1035, and the new material, were merged. The merged files were validated both by Goldfarb, using a parser he wrote during the development of SGML, and by SoftQuad's SGML parser. This ensured that the files were conformant to SGML.
The final formatted version of the Handbook was produced using SoftQuad's publishing software, which provides the connection between the structural representation captured in SGML and the visual representation of the document. In fact, this software converts SGML markup into troff. (See Section 4.1.2 Language Characteristics in Chapter 4, Form and Function of Document Processors, for more information on troff.)
The SGML Handbook sets a new standard for the publication of standards. One aspect of the Handbook is clear: it demonstrates the way all standards should be treated when printed. The republication of standards by knowledgeable participants should be encouraged by the standards-making organizations. ISO, Goldfarb, and Rubinsky all deserve congratulations for breaking new ground with the publication of The SGML Handbook.
9.4 Project Gutenberg

Now this is a project with a mission. Almost single-handedly, Michael Hart, an English professor at the University of Illinois campus in Urbana-Champaign (UIUC), has assembled and continues to assemble a collection of electronic texts for simple and wide distribution.
According to an article about the project: (5)
In describing the genesis of Project Gutenberg, Mr. Hart waxes lyrical. It was 1971, he was a student at the University of Illinois, and, through computer-operator friends, gained access to its mainframe from midnight to 8 a.m. 'The old computer rooms had an aura of mystery, church and magic,' he recalls. 'You were a computer god.' But what to do with all those millions of microseconds ticking away? Fishing around in his backpack, he found a copy of the Declaration of Independence, began typing, and Project Gutenberg was born.
The purpose of Project Gutenberg is to encourage the creation and distribution of English language electronic texts. We prefer the texts to be made available in pure ASCII formats so they would be most easily converted to use in various hardware and software. A file of this nature will also be made available in various markup formats as it is used in various environments. However, we accept files in ANY format, and will do our best to provide them in all.
Our goal is to provide a collection of 10,000 of the most used books by the year 2000, and to reduce, and we do mean reduce, the effective costs to the user to a price of approximately one cent per book, plus the cost of media and of shipping and handling. Thus, we hope the entire cost of libraries of this nature will be about $100 plus the price of the disks and CDROMS and mailing.
This project makes use of a list server at LISTSERV@UIUCVMD.BITNET. (See Section 7.4.4 Electronic Mail in Chapter 7, Applying Standards, for more information on list servers.)
Of course, a number of Web sites now point to the texts, and several FTP sites make the texts available.(6)
Project Gutenberg has evolved into a mini-crusade. There is no real organization, just one determined individual leading the rest. When it comes to resources, most of the work has been done over the years by volunteers either typing or scanning the material. In some cases, Hart paid some people out of his own pocket. The equipment has been scrounged, and computer time has been borrowed from UIUC.
All text entered into the system must be free of copyright restrictions. This is a fairly stiff constraint; in fact, several almost-completed books had to be canceled due to copyright restrictions.
DOCUMENT PROCESSING AND WORK FLOW
In general, the flow of work consisted of finding the work, checking copyright, input via typing or scanning, spell checking, proofing, and proofing again.
The result is a publicly accessible collection of copyright-free texts. Some representative examples of the texts are:
• Alice’s Adventures in Wonderland - Lewis Carroll
• Through the Looking Glass - Lewis Carroll
• The World Factbook 1990 - CIA
• The Night Before Christmas (A Visit from St. Nicholas) - Clement Clarke Moore
• The U.S. Constitution in troff format
Project Gutenberg uses shareware and public domain distribution channels. Speeches also spread the word. Over the years the distribution has, along with the technology, become more sophisticated. Walnut Creek CDROM, a CD-ROM publisher, sells a CD of all the Gutenberg texts, updated twice a year; you can find it at http://www.cdrom.com. Many texts are scattered around the Web; simply use your favorite Internet search mechanism to find a few. You could start with the Walnut Creek FTP site, which has just about everything, at ftp://ftp.cdrom.com/pub/gutenberg/.
The world insists that you go sloooooowly. The "NOT INVENTED HERE" syndrome is HUGE. Lots of people resist for other reasons.
One interesting lesson of this project is that sometimes persistence and missionary zeal do bring about converts. After 20 years of typing away in obscurity, Michael Hart got his 15 minutes of fame with a recent front-page article in the Wall Street Journal. That article marked the first time the Internet appeared on the front page of a major national or international paper. His conviction that lots of text should be available free seems to be catching on. The press has run dozens of articles about the project; in fact, an online file lists 125 articles. (7)
9.5 Oxford English Dictionary
The Oxford English Dictionary (OED) is the definitive English dictionary used by scholars and linguists. In 1989, the long-awaited Second Edition of the OED was completed. Production of the dictionary, a 20-volume set, is a monumental task. It is now available both in hard copy and in two electronic forms: a CD-ROM and a manipulable database.
Most of the information for this case study comes from an interview with Frank Tompa, a professor at the University of Waterloo in Canada. Tompa is intimately familiar with the software used in the development of the Second Edition of the OED.
Back in 1983, the folks at Oxford University Press (OUP), which owns the OED, put out a request for proposals to create the Second Edition of the OED. The goal of the New Oxford English Dictionary Project was to integrate and update the Oxford English Dictionary and Supplement; the Supplement was a four-volume set of new material. Printed and electronic forms of the dictionary were to be produced. The project was eventually broken out into three parts, all coordinated by OUP: data capture, integration, and database management.
The entire project involved the manipulation of three sets of data: (1) the original 1st edition, (2) a supplement to the first edition, and (3) the new material that would form the changes for the second edition.
The initial task was to place the three sets of data into electronic form. International Computaprint Corporation (ICC), a subsidiary of Reed International, won the contract for data capture. The entire text was keyed in, and the staff at OUP handled the editing, proofreading, and overall quality control. The keyboarding work took 18 months and more than 120 keyboarders, and the proofreading effort was massive. The text was tagged with an ICC proprietary procedural markup form, which contained a lot of information concerning the look and layout of the document.
The second portion of the project was to take the three separate data sets and integrate
them into a single document, the Second Edition of the OED. This task was accomplished primarily through a significant donation of hardware, software, and people by IBM. Database tools and more than 20 people sorted, merged, and massaged the data into the final form of the Second Edition over a period of two to three years.
Until this point, the data capture and document integration parts of the project were concerned primarily with the production of a printed product, OED's 20 volumes.
Database management consisted of turning the procedurally marked-up text into a more maintainable, descriptively and structurally marked-up database. The product of this phase was not simply a single item but a set of data and tools with which to maintain and manage the data.
Staff members of the University of Waterloo were primarily responsible for the database management. Waterloo has a research staff devoted to text databases and a long history of research in this domain.
The University of Waterloo staff created and used three software tools: PAT, a text-searching utility; LECTOR, a text-display utility; and TTK, a translation utility.
In fact, during the project's first stage, OUP used the PAT text-searching tool to help proofread the material. For example, readers could enter the correct abbreviation for obscure as Obs. and ensure that the abbreviation was spelled correctly throughout a section of text, clearly a tedious and error-prone task if conducted purely by eye.
TTK, the Transduction ToolKit utility, is a tool that allowed the staff at Waterloo and OUP to convert the procedural markup into structural/descriptive markup. The structural/descriptive markup of the Second Edition is the source of the other electronic products.
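To illustrate the difference between the two styles of markup, the fragments below contrast a procedural, typography-oriented tagging of a dictionary entry with a descriptive, structure-oriented one. Both fragments are invented for illustration: the tag names do not reproduce ICC's proprietary markup or the actual tagging of OED entries.

    <!-- Procedural markup (hypothetical): the tags record only appearance. -->
    <bold>obscure</bold>, <italic>a.</italic> <newline>
    <smalltype>Not readily seen; hidden, dim.</smalltype>

    <!-- Descriptive markup (hypothetical): the tags record what each piece
         of text is, so software can retrieve every headword, part of
         speech, or definition directly. -->
    <entry>
      <headword>obscure</headword>
      <pos>a.</pos>
      <def>Not readily seen; hidden, dim.</def>
    </entry>

A translation tool such as TTK applies that kind of mapping consistently across the whole text, which is what makes the converted database a reliable source for new products.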
Some view the textmanipulation tools as useful enough to sell. A small spin-off company has been created in Waterloo called the Open Text Corporation (OTC). You can purchase PAT, LECTOR, TTK, and the PAT text database management system from the OTC.(8)
The staff at OUP maintain the database, and now they have the flexibility to check out and make changes to sections of the text. Simple and straightforward principles of good configuration management can now be applied to the dictionary. The dictionary is now manageable as a database. In the future, additional printed and electronic products can be
easily derived from the data.
Currently, the OED is available in two electronic forms. The first, a CD-ROM with an MS Windows front end, was created using the descriptively marked-up text as its source. Second, the raw descriptively marked-up text is available and is a widely used source of material for linguistic researchers and lexicographers.
A new product based on the database is, of course, a Web site. The site is under development at the time of writing, but you can visit and find out what's going on at http://www.oed.com. You will be able to subscribe to the OED, which will allow you to search for entries in the dictionary using sophisticated linguistic tools such as PAT.
[Figure: From the experimental Oxford English Dictionary server, under development]

LESSONS LEARNED
Clearly, Oxford University Press had the courage and foresight to take the plunge and convert a venerable product into an electronic form. Production of the Second Edition of the OED was a massive job, culminating in a 20-volume set containing 22,000 pages of material and defining over half a million words. I would also bet that the CD-ROM version of the OED is a popular product and that the ease with which new products can be generated from a maintainable database will repay the investment many times over. Access via the Web is the next logical extension of the electronic database and provides new business opportunities.
9.6 Voyager Expanded Books

The Voyager Company in Santa Monica, California, has created a product line called Expanded Books. This effort is tightly coupled with the introduction, success, and "feel" of Apple's laptop computer, the PowerBook. The material for this case study comes from the article "Books in a New Light" by Joe Matazzoni in the October 1992 issue of Publish magazine and from press release information provided by The Voyager Company. Although this was quite some time ago, Voyager remains the only producer of mass-market electronic books.
In the spring of 1991, Apple Computer sent The Voyager Company a prototype of its PowerBook portable computer, "which completely changed the equation," says producer Michael Cohen. As Cohen remembers, someone "put a few pages of a favorite novel up on screen. As he passed it around, people were saying, 'Damn if that doesn't look kind of like a book.'" Within an hour, the Expanded Books project was born.
In the early years Voyager produced Expanded Books at the rate of three per month. You really have to wonder why anyone would want to read a book on a computer screen. The Voyager folks say, "To be honest, we're not entirely sure yet."
One reason is what he [Cohen] calls the "argument from gravity." He explains: If you travel with a PowerBook, "you can carry ten books with you as easily as one."
The principal software technology used is Apple's HyperCard program. HyperCard presents the user with a "card" that belongs to a stack, and you can think of a stack of cards as a book. Paging from one card to another, by clicking with the mouse, is similar to flipping the pages of a book. The staff at Voyager has created an elegant user interface on top of the foundation HyperCard provides. Most of all, the page is simple: unlike many other HyperCard stacks and hypertext programs, there are not a lot of cryptic-looking symbols and buttons that perform unexpected actions.
The book metaphor is maintained in several ways. You can dog-ear the page corners to mark a page and place electronic paper clips for fast access to specific pages. You can also scribble (type) in the margins, a valuable feature for writing your own commentary. A small palette of controls is clear and simple; it lets the user page forward and backward, search for entries, and retrace where in the book they have been. Every single word can be selected, and selecting a word brings up a menu asking whether you want to search for the next or all occurrences of that word.
Voyager's Expanded Books are aimed at Apple's PowerBook but will work on any Macintosh with the all-purpose multimedia language HyperCard. ... As a page layout tool, though, HyperCard leaves a lot to be desired. For example, text doesn't flow from one card to another when you add or delete. And forget about more advanced functions like the ability to detect paragraph widows.
In true frontier fashion, the staff at The Voyager Company built what it couldn't buy. The designers created a HyperCard program called the Toolkit that provides text flow and widow detection as well as extensive features for building annotations and functions like printing a font report (useful if you must include fonts with your documents).
The Toolkit is available as a separate product. Clearly this company "does things right." The Toolkit can be used by organizations wishing to create their own Expanded Book series or, more importantly, as a relatively simple way to create on-line hypertext documents that will function on any Macintosh.
The Expanded Book series is a business venture. Products must sell in order to continue.
Voyager sells Expanded Books for about the same price as hardbound books. Most are $19.95, and a few multiple-titled volumes cost $24.95.
It is still unknown whether there is a real market or whether these are just a novelty.
Bob Stein, another Voyager partner, cites success of another type when evaluating the project to date. "I never thought it would get such broad acceptance among traditional publishers," he says. "A year ago, they didn't get it at all. Now every major publisher is either planning to publish something or has a high-level task force looking into it."
Stein's biggest frustration is the lack of a retail channel for electronic books. Most of Voyager's sales come through the firm's mail-order catalog. A few bookstores in the Los Angeles area sell Expanded Books, but, says Stein, selling electronic books to the general public "doesn't make much sense until Apple wants to support their efforts with some kind of hardware in the bookstores or some sort of point-of-sale display; it could even be cardboard."
Clearly, however, distribution is picking up. It is not uncommon to see a rack of Voyager Expanded Books in a bookstore. You can also view sections of Expanded Books, and purchase them, at the Voyager Web site at http://www.voyagerco.com.
"Book publishing has a reputation for being conservative when it comes to new technology. But Voyager partner Jonathan Turrell says he had no problem negotiating Expanded Book deals with about a dozen' rights holders ranging from independent authors who hold one book to Random House and everything in between."
The reason for his success, says Turrell, is that although the form his books take may be novel, the contracts he negotiates are not: "Basically, these deals look very familiar to people in the publishing industry. They're all royalty deals, so if they're successful, everyone shares in the success." Turrell says that publishers have been "very understanding about this being a beginning market" and have not demanded excessive advances. The royalties that Voyager pays, he says, are standard for the book industry.
"It's wrong to think that this is such a departure for publisher," Turrell observes. "They've been fragmenting their rights for a long time. I also think the publishing industry believes that, at some point, electronic publishing is going to be real. Expanded Books give them an interesting way to explore that market."
The fortuitous match of the PowerBook with Expanded Books is a classic case of 1 + 1 =
3. The clean integration of software and a friendly portable computer is a winning system. The Voyager staff clearly recognizes the value of traditional paper books and is not out to replace them.
"You'd be surprised at how many people assume that we are on a satanic mission to destroy the libraries and bookstores of the world." says Cohen [product producer Michael Cohen]. "In fact, we love books. This place is full of books. We're just trying to provide another way to enjoy them."