GENERALIZED MARKUP

Specific markup tells the system what to do with a document. In contrast, generalized markup describes the document to the system. Also called descriptive markup, generalized markup tells the system about the document elements. It does not tell the system what to do with that information. SGML is a standard for placing descriptive generalized markup in a document.

The act of placing markup into a document is time-consuming and unwieldy. However, several good tools make this process reasonable. (See Appendix A Resources for some tools.) Ideally, you would like to automate the markup process to the point of invisibility.

What is Markup?

One of the editors of the Text Encoding Initiative (TEI) (see Section 9.2 Text Encoding Initiative in Chapter 9 Case Studies), a project developing markup specifications for humanities scholars, says this about markup:

Why does the TEI encoding scheme matter?... It is a tool for scholars, but it has many applications, some of them commercial, as when it helps to reduce the documentation for a fighter plane from three tons of printed information to a disc of easily retrievable, cross-referenced electronic data. Markup, if one needs a fancy word, is a branch of hermeneutics, a system of explication. Markup makes explicit what was not so clearly arranged before. It allows huge amounts of data to become parsed character data, that is, meaningfully arranged data with tags that can help collect or arrange the data according to the needs of the retrieving user.(13)

CONTENT MARKUP

Content markup is the use of generalized markup to describe the semantic elements of a document. Strictly speaking, this is an application of generalized markup and, indeed, of SGML.


For example, you might have a recipe marked up with tags such as <INGREDIENTS>, <TEMPERATURE>, and <SERVINGS>. These tags describe the content, not the structure. You can imagine having hundreds of recipes in this form and integrating the information with a database. You could ask questions of this database to produce, for example, a shopping list for a particular set of recipes.
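A fragment of one such recipe might look something like this (a minimal sketch; the <RECIPE>, <TITLE>, and <ITEM> tags are invented here for illustration, alongside the tags mentioned above):

    <RECIPE>
    <TITLE>Skillet Cornbread</TITLE>
    <SERVINGS>8</SERVINGS>
    <TEMPERATURE>400 degrees F</TEMPERATURE>
    <INGREDIENTS>
    <ITEM>2 cups cornmeal</ITEM>
    <ITEM>1 cup buttermilk</ITEM>
    </INGREDIENTS>
    </RECIPE>

A database loaded with such tagged recipes could, for instance, collect every <ITEM> inside <INGREDIENTS> across a selected set of recipes to build the shopping list.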


This type of markup is the subject of a great deal of research. It gets very complicated very quickly. Often, it is difficult to clearly and unambiguously identify actions and objects in the real world.

Take the issue of naming an item. In a description of a new porch you're about to build, you could refer to a joist as (1) the 10th support from the left end, (2) the joist 150 inches from the left end, (3) the 3rd load-bearing support, (4) the corner support, or (5) the pink joist. Descriptions of objects often mix naming conventions; as a result, the markup of content is very difficult unless the text is highly structured and almost legalistic in nature. To deal with this issue, you must try to anticipate the way the document and content markup will be used. Alternatively, you must be willing to highly restrict the markup to do meaningful content markup. These issues are also part of the work of the Text Encoding Initiative. (See Section 9.2 Text Encoding Initiative in Chapter 9 Case Studies.)

4.3.2 Markup Creation

As you can easily imagine, the act of marking up text can be arduous. There are a number of ways to attack the problem. Tools to aid the markup process range from no automation to full automation.

The first markup method is brute force. You simply use a text editor to embed the markup at appropriate places in the text. In the next method, markup is still entered by hand, but with an editing tool that knows about allowable markup. In the case of SGML generalized markup, there are a number of structure editors. (See section Document Processors in the appendix Resources for a list of these editors.) These editors "know" what kind of markup is allowed at any particular place in the text. The user is allowed to enter only legal types of markup entities. This approach has several benefits. The mental overhead is greatly reduced, and you're assured of producing legally marked-up documents. Sometimes, however, having an electronic checker looking over your shoulder can be overly intrusive. Inevitably, you need to turn off the checking. From a user interface point of view, the better systems balance markup validation with ease of use.

A semiautomated approach is another way to enter markup. A document already in one publishing system's format is used as the input to an automatic markup process. For example, a FrameMaker document could be translated into HTML by using a converter to translate FrameMaker (MIF) markup into HTML tags.(14)

Although useful, this conversion approach has limits. You must start with a highly structured document. More problematic is the reality that document structure is often implicit in specific markup systems. The fact that a sentence is bold and all capitalized may imply that it is the start of a section, but it does not state so explicitly. The structure must be inferred based on the particular style used to format the document, rather than on an explicit command that says: <THIS IS THE START OF A SECTION>. If a figure or caption also contains a sentence that is bold and all capitalized, the markup system would misinterpret it as the start of a section.

Generalized markup using SGML defines the structure of a document. Troff documents use a form of specific markup. Often you must use implicit assumptions about troff documents to complete the translation to SGML. The same is true for documents in TeX or MS Word.
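The inference problem can be made concrete. Consider a heading that a troff document renders simply by switching fonts (the .ft requests are standard troff; the SGML tags below are illustrative, not from any particular DTD):

    .ft B
    GETTING STARTED
    .ft R

A converter must guess that this bold, all-capitalized line is a section title and emit something explicit, such as:

    <SECTION><TITLE>Getting Started</TITLE>

Nothing in the troff source states that intent; the guess rests entirely on the formatting conventions of one document design.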

Finally, you can use automated markup systems that use document images from scanners as input. The software is told what to expect and creates the markup based on those expectations. As in the previous FrameMaker example, highly structured documents can be successfully translated, but poorly structured documents are much more difficult.(15) A newsletter cannot be fed into the scanner when the software is expecting a particular kind of technical report, at least not if you expect meaningful results.


Chapter 5: Document Standards

"A committee is a group that keeps the minutes and loses hours." - Milton Berle

Standards for electronic publishing can profoundly affect the publishing process. All aspects of the process, from design to authoring through production, can be influenced by the use or misuse of standards.

First, let's define the key terms. A standard is a set of agreed-upon procedures or data formats that you use to accomplish a task. Standards become part of the software tools you use to get your work done. This chapter will examine both de facto (informal) and formal publishing standards. It will also explore document exchange, the motivation behind a great deal of document standards work. Document interchange seeks to answer the question: How can I give you my electronic document and know that you can use it?

Publishing is evolving into one of many forms of information dissemination. On-line reading and browsing (also known as Web surfing), hypertext, hypermedia, and CD-ROM-based delivery mechanisms are realities when the proper standards are implemented. Thus, the standards themselves should be viewed as enabling technologies. They lower the risk of trying a new technology. If everyone uses a particular standard, you take less risk publishing a document that uses that standard.

One longterm goal of publishers is to create customized publications from repositories of textual information. The content, stored in a database, is the raw material that is refined into information products.

The term information refineries, used to describe this process, is an apt analogy. Raw "crude" text is poured in the top of a processing chain and out flow numerous products. CD-ROMs, on-line information services, Web sites, customized textbooks, personalized newspapers, and more are all potential products of these repositories of information.

The widespread use of formal publishing standards may permit the establishment of such information refineries. However, it is also clear that not all types of text are suited to be sliced, diced, mixed, and tossed to produce an arbitrary salad bowl of products. While it is tempting to use the refinery paradigm, we must not be seduced into inappropriate applications. Sometimes, the author intended a book to be read as a whole. Similarly, sometimes the content must be read in context.

It is useful to consider two types of standards: document standards and the graphics standards commonly used to represent the graphics included as part of a document. Often, new standards refer to existing standards rather than reinvent an area already covered by a standard. Combinations of appropriately chosen document and graphics standards can provide powerful solutions to the many complex problems of electronic publishing.

5.1 De Facto Standards

"There is no monument dedicated to the memory of a committee." - Lester J. Pourciau

Sometimes, when a product becomes very popular and widespread, the data formats for text or graphics used by that product become a de facto standard. When appropriate for your particular application, a de facto standard format is an easy and convenient way to exchange information. For example, in the Computer Aided Design (CAD) world, AutoCAD's DXF format is the de facto standard for CAD data on PCs. Adobe's PostScript is another de facto standard.

These specifications should not be confused with formal, official standards. True standards, formal standards, are generally developed over periods of 2 to 10 years by committees of technical experts. The committees work under the sponsorship of national or international standards-making bodies such as the American National Standards Institute (ANSI), the International Telegraph and Telephone Consultative Committee or Comite Consultatif International Telegraphique et Telephonique (CCITT), the European Computer Manufacturers Association (ECMA), or the International Organization for Standardization (ISO). The formal standards-making process is excruciatingly painstaking and slow, but it's the best way to address all concerns. (For more discussion about formal standards, see Section 5.2 Formal Standards later in this chapter.)

One significant difference between de facto and formal standards is that de facto standards are often proprietary. The exact structure of a data format (a de facto standard) may well be a trade secret. PostScript Type 1 fonts were a tightly held secret until 1990. In that year, Apple and Microsoft announced the TrueType font format, a shot across the bow at Adobe's PostScript monopoly.(1) Subsequently, the Type 1 font specification was made public. (Seems as though a little competition is useful.) Yet, keep in mind that PostScript would never have come into existence through the formal standards-making process, with the political and technical compromises that so often are a reality.

5.1.1 Document Processors

The classic document processing systems, some of which are still quite popular, are batch-language oriented. The intuitive appeal of WYSIWYG systems must sometimes give way to the sheer volume of processing necessary for documents consisting of thousands of pages. In fact, sometimes documents are so mundane and routine that you don't want to look at them (for example, documentation for hundreds of similarly structured subroutines). Let's briefly examine three systems: Scribe, troff, and TeX.

Scribe was a groundbreaking document processor, the creation of Brian Reid, formerly of Carnegie Mellon University. He single-handedly revolutionized the field of document processing with his doctoral dissertation, Scribe.(2) Along with the overall ability to format text according to markup instructions, Scribe introduced the notion of styles. A Scribe document does not contain detailed formatting instructions. Documents can be created and printed according to a particular format such as "Thesis," "Report," or "Letter."

From an early Scribe manual:

To use Scribe, you prepare a manuscript file using a text editor. You process this manuscript file through Scribe to generate a document file, which you then print on some convenient printing machine to get paper copy.

Scribe controls the words, lines, pages, spacing, headings, footings, footnotes, numbering, tables of contents, indexes and more. It has a data base full of document format definitions, which tell it the rules for formatting a document in a particular style. Under normal circumstances, writers need not concern themselves with the details of formatting, because Scribe does it for them.

The manuscript document an author creates has markup statements throughout. These statements describe the various components of the document to the Scribe processor. The descriptive markup the author places in the document is interpreted and formatted by the Scribe document processor. Scribe has generally been superseded by TeX and troff. Nevertheless, it remains an important document processing system.
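A sketch of what such a manuscript might look like (reconstructed from the flavor of early Scribe documentation; the exact commands varied across versions and document types):

    @Make(Report)
    @Chapter(Getting Started)
    This paragraph is plain body text. Scribe chooses the
    font, margins, and spacing from the Report definition.
    @Begin(Itemize)
    Styles live in the format database, not in the manuscript.
    Switching from Report to Thesis reformats everything.
    @End(Itemize)

Note that nothing here says how a chapter title should look; the format definition supplies all of that.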

In the UNIX world of document processing, troff is king. Actually, document processing applications were among the first serious UNIX applications and one of the motivations behind its creation.(3) Created by Joseph Ossanna, troff is first and foremost a typesetting system. Troff processes the markup that an author must embed into a document as formatting instructions. The modular nature of UNIX, coupled with the power of troff, has led to a number of troff preprocessors: eqn for typesetting equations, tbl for tables, and pic for line drawings. Grap, a little language to specify graphs, is actually a pic preprocessor. Each of these preprocessors is a little language in and of itself. (See Section 4.1.3 Specialized Languages in Chapter 4 Form and Function of Document Processors for illustrations and a discussion of these preprocessors.)

It is common to see a command line such as

cat doc.txt | pic | tbl | eqn | troff -mm

to produce the printed copy of a paper. (In UNIX, the | symbol is a "pipe," which directs the output of the command on its left to the input of the command on its right.) cat doc.txt sends the file (doc.txt) through pic, which interprets drawing commands; then tbl, which interprets table-making commands; and then eqn, which interprets equations. The output of these preprocessors is input to troff, which does the actual typesetting according to the mm macro package.
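A small sketch of the kind of source such a pipeline consumes (the .H and .P requests are mm macros for a heading and a paragraph; .TS/.TE and .EQ/.EN bracket tbl and eqn input, and tbl data columns are tab-separated):

    .H 1 "Results"
    .P
    The measurements appear in the following table.
    .TS
    center box;
    c c.
    Trial	Result
    1	42
    .TE
    .EQ
    E = m c sup 2
    .EN

Each preprocessor replaces its own bracketed region with low-level troff typesetting commands and passes everything else through untouched.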

TeX is one of the premier document processing systems in existence. It is arguably the most popular batch-language-oriented document processing system. It is available on virtually any computing platform and can be legally obtained for free. An extensive series of books by Donald Knuth (the author of TeX) documents the source code and functionality of TeX.(4) Commercially supported implementations can also be purchased for platforms such as the IBM PC.

LaTeX, a macro preprocessing system used with TeX, is the primary way documents are authored.(5) LaTeX uses the concept of style files to encapsulate the commands and processing instructions that format particular document elements.
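A minimal LaTeX document shows the division of labor (a sketch using the standard article class; older LaTeX 2.09 documents would name a style in the first line instead):

    \documentclass{article}
    \begin{document}
    \section{Getting Started}
    The author marks this up as a section; the class and
    style files decide the fonts, numbering, and spacing.
    \end{document}

As with Scribe, the markup names document elements and leaves their appearance to the style.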

Troff and TeX are used as the basis for internal publishing standards by a number of large organizations. AT&T's UNIX documentation and OSF's (Open Software Foundation) software documentation originate as troff documents. TeX is used by the American Mathematical Society for a number of publications, and the electronic publishing magazine EP-ODD uses TeX and troff as the principal means for electronic submissions.

5.1.2 PostScript

PostScript is THE defacto standard page description language because of its extremely wide market penetration. It has evolved into more than simply a way of describing marks on a page. The thorough way in which it handles graphics and fonts, along with the consistency and quality of its implementations, has led PostScript into many areas. Document exchange and on-line document displays are two of the more prominent ones. PostScript, in combination with Apple's LaserWriter, effectively started the desktop publishing phenomenon.
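Unlike the descriptive systems above, PostScript is an imperative programming language for placing marks on a page. A complete, minimal page description looks roughly like this:

    %!PS
    /Helvetica findfont 14 scalefont setfont
    72 720 moveto
    (Hello from PostScript) show
    showpage

The operators select and scale a font, move the current point (in units of 1/72 inch from the lower-left corner), paint the string, and emit the page.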

For several years, PostScript was available only as a language that ran inside a printer. The printer's manufacturer had to license PostScript from Adobe. Close conformance to the PostScript specifications was guaranteed, because Adobe made sure that a particular implementation of PostScript worked correctly for a particular printer. This proprietary conformance testing is one way to ensure consistent implementations of software. However, it depends on the honor of a particular vendor (not that I'm implying that any vendor would lead us astray, of course).

As more and more PostScript printers became available, PostScript became a reasonable medium for document exchange. For example, if I send you a PostScript document, I have a high degree of confidence that the document you print will be correct. However, a PostScript document is not generally considered a revisable form of the document and is difficult to edit. (For a more thorough discussion of the issues involved with document exchange, see Section 7.2 Document Exchange in Chapter 7 Applying Standards.)

Like any commercial product, PostScript is evolving to meet new requirements and fix old problems. PostScript Level 2 addresses many past complaints, such as poor memory management and limited color support. PostScript Level 2 also offers several other interesting features. One of its significant improvements over its predecessor is in the area of color manipulation.(6) Full support for the CMYK (4-color printing) color model should make life easier for color printing.

The fundamental change of PostScript Level 2 is the incorporation of the CIE color model (see Section 6.3.1 Pure Color Models in Chapter 6 Media and Document Integration). The CIE color space specifies a mathematical relationship of color to human perception and is, therefore, independent of any output device. PostScript Level 2 provides a mechanism (called CIEBasedABC) that enables developers to map the CIE color space to a particular output device.

The extensions needed for Display PostScript have been included in PostScript Level 2, allowing the same PostScript interpreter to be used for either printing or display applications. True WYSIWYG displays are all the more likely if the same software is used both to display a document on the screen and to print it on paper.

In the area of data compression, PostScript Level 2 also offers significant improvements. Level 2 includes a new filter mechanism that accepts a choice of compression algorithms, including JPEG (Joint Photographic Experts Group) and LZW (Lempel-Ziv-Welch).

A second-generation PostScript, PDF (Portable Document Format), is the core of the Acrobat product line from Adobe. Its principal difference is optimization for display, as its primary function is the on-line display of documents. One key to this technology is a new font technology called "Multiple Master." The new fonts work with Adobe Type Manager (ATM) to "mimic the style, weight, and spacing of the document's original faces automatically." PDF also stores a document as a series of randomly accessible pages, facilitating hypertext links. (See Section 7.4.2 Electronic Page Delivery in the chapter Applying Standards for more information on Acrobat.)

5.1.3 Lots O' Formats

Software vendors have defined many document and graphics formats. They have made many, but not all, of the specifications public. Vendors of open specifications correctly reason that publicizing their formats will encourage the creation of new software products that use those formats, and thus sell more of their own products. A few of the more popular formats are discussed next.

DCA/RFT

DCA/RFT, the Document Content Architecture/Revisable Form Text, commonly referred to simply as DCA, is the format used by IBM's DisplayWrite. It is capable of representing a document with one or two master formats.

Graphics are possible inside a DCA document via a special Inserted Escaped Graphic identifier. This identifier lets a document treat a graphic as a block located in the text.

DCA has an automatic numbering scheme that can be used to specify the numbering style of footnotes. It is possible, using this feature, to allow the user to define custom numbering sequences.

RTF

Microsoft has defined RTF, the Rich Text Format, for use by its principal publishing product, MS Word. On the Macintosh, it is the most commonly used document exchange format. Many products include import filters to allow input of text in this format.

MS Word has grown up to be the king of the hill of word processors. As such, RTF is a widely used interchange format. However, many word processors and document publishing systems seem to have trouble reading and writing proper RTF files. MS Word, not surprisingly, is clearly the most reliable program to read and write RTF files. There are also a number of RTF-to-HTML converters around. For example, check out rtftohtml at http://www.sunpack.com/RTF/rtftohtml_overview.html.
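RTF is plain text built from backslash control words and brace-delimited groups. A minimal, hand-written sketch of a file:

    {\rtf1\ansi\deff0
    {\fonttbl{\f0 Times New Roman;}}
    \f0\fs24 Plain text with {\b bold} and {\i italic} words.\par
    }

Here \fs24 sets a 12-point size (RTF measures font size in half-points), and the braces scope the bold and italic property changes. Real exports from word processors are far more verbose, which is one reason implementations disagree.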

WORDPERFECT

Oh, how the mighty have fallen. A mere 3 or 4 years ago, WordPerfect was the undisputed leader of word processing packages. Now, after a buyout by Novell, which then sold it to Corel, WordPerfect is struggling to keep market share and is playing catch-up with the formidable marketing clout of Microsoft and its MS Word.

One particularly effective aspect of WordPerfect's design is the use of multiple views. A user can view the document in three ways. The normal view shows mainly the textual content with minor highlighting, color changes for font variations, and a few other visual cues. The show-codes view lets you see and edit all the hidden control codes used by the system. In this view, users can get into the nitty-gritty when required. The third view is the print-preview mode, which is most useful for displaying the relationship of inset graphics to the text. Of course, a fourth view is implicit: the printed document itself.

5.1.4 Dealing with Formats

Can I get document X into system Y? Will my WordPerfect system accept this vital MS Word document?

The answers to these questions depend on a variety of factors. The specific document processor may or may not support the import or export of a number of formats. Even if a format is reportedly supported, the import/export function often does not do a complete translation. Even a successful translation will almost always produce a document that must undergo extensive editing. Style and paragraph tags are usually lost, even if the overall formatting was translated successfully. Unfortunately, you must understand more than you might care to know when transferring a document from one format to another.

ASCII is the most interchangeable format for documents. Unfortunately, this lowest common denominator of document interchange, the text-only option, has some problems. The difficulties are not with ASCII but with the different ways in which computing platforms treat lines of text. There is no standard for the end-of-line (EOL) character. For example, UNIX computers use a line feed (LF) as the EOL. PCs use a carriage return (CR) and line feed in that order. Macintoshes use a CR, while VMS systems use a character count rather than a particular character.
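In byte terms (LF is 0x0A, CR is 0x0D), the same line of text is stored differently on each platform:

    UNIX:        some text LF
    PC (DOS):    some text CR LF
    Macintosh:   some text CR
    VMS:         <record length> some text

A UNIX tool reading the PC file sees a stray CR at the end of every line; a Macintosh file read on UNIX appears to be one enormous line.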

These different EOL characters are usually not that much trouble; however, in this age of networked distributed computing, with disks on server machines shared across many computing platforms, things can get ugly. Text on one platform will often not display correctly when the text file came from another platform. The networking software often takes care of these disparities, but not always. This issue becomes more significant when dealing with "write once" media or CD-ROMs, which are intended to provide data to many platforms.

When dealing with formats, it is crucial to be conscious of certain basic categories of information. Graphics (vector and bitmapped), font usage, style usage, global information, and properties are some of the basic functional categories of information that must be converted by a translator. Also, keep in mind that document fidelity is a difficult goal to attain.


5.2 Formal Standards

"The only thing that saves us from the bureaucracy is its inefficiency." - Eugene McCarthy

There are two major formal electronic publishing standards, Standard Generalized Markup Language (SGML) and Office Document Architecture (ODA). There is a world of misunderstanding about both. This section reduces that misunderstanding. With the arrival of the Web, SGML has relegated ODA to the role of a bit player. HTML, the language of the Web, is an application of SGML.

Along with these major standards are a number of associated supporting standards for fonts and character encodings. Consider SGML and HTML to be the leading actor and actress in an electronic publishing play. The character bit parts are played by ISO 646 (the 7-bit coded character set).(7)


Formal standards are open and not under the control of one company or, for international standards, one country. Reaching international consensus takes time. The intent is to give all participating parties a fair say in the technical results. Individual companies and countries all jockey to shape the standards in the mold of their own products and interests, since doing so gives them an advantage when bringing products to market.

The process of making a standard is long and incredibly painstaking. The time and difficulty are the prices for a truly open process. De facto standards are often technologically more sophisticated and come to market faster than formal standards. Unfortunately, everyone pays a price for the monopoly sometimes gained by the holder of the de facto standard.

One other important difference between de facto and formal standards is in the area of conformance testing. Conformance testing(8) is the way to ensure that a vendor's claim that a product meets the functionality of a standard is indeed valid. The conformance tests themselves are performed by an accredited neutral party. True interoperability can happen only when conforming implementations of a standard exist. De facto graphic standards such as TIFF and PICT exist in dozens of slightly different variations. My TIFF file won't necessarily be usable by your software that supposedly reads TIFF files. Formal standards implemented by conforming systems are the best way to guarantee interoperable functionality.

Aside from its inherently open nature, the real clout of a formal standard comes from its use in procurement specifications. By itself, a standard is not legally binding. However, when a procurement specification states that standard XYZ must be used, lots of people start paying attention to standard XYZ. Of course, de facto standards could be used in a procurement specification, but formal standards provide a more competitive, open basis for the procurement.

Considering the enormous difficulty of creating a formal standard, you might think that they would be few and far between. Nothing could be further from the truth, however.

Another ISO standard is the so-called Font Standard, ISO 9541. It was greatly influenced by Adobe: publishing the Type 1 specification enabled Adobe to become a prime technical contributor to the formal standard.

5.3 SGML

The most important formal standard in electronic publishing is the Standard Generalized Markup Language (SGML). For years, people invented different mechanisms for marking up text with such things as typesetter control commands, printer codes, structural hints, and probably anything else you can imagine. SGML is a standard way of embedding tags within the body of a document to create all sorts of markup.

SGML is also one of the more misunderstood standards. It allows the creation of a set of tags with which you can clearly and unambiguously define the structure of a document. It does not, however, address the issues of document layout: how it looks. Let's examine this notion of document structure, because it is fundamental to understanding SGML.

Take this book (please!). It consists of three major elements: the front matter, the main body, and the back matter. The front matter contains the title page, the preface, and the table of contents. The main body contains the chapters. The back matter contains the appendixes and the index. Chapters consist of sections, and many sections consist of subsections. Most books have this familiar structure. This structure and its many variations can be precisely described by SGML.

Structure and style are different aspects of a document. SGML addresses the structure only. This is one of its great strengths and, some say, its great weakness. SGML does a wonderful job of capturing the content of a document and enabling that content to be manipulated at a later time. However, the information a document carries due to its visual design and layout is essentially lost unless careful, additional steps are taken. Additional style standards and specifications exist and are being implemented to address this issue. (See Section 5.3.3 DSSSL for a discussion of the major style standard.) One particular document structure can be visually represented in many different styles.

Let's take a look at how the structure of a typical office letter might appear. The names associated with each item in the letter are tags.
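A sketch of such a letter, with illustrative tag names (no standard letter DTD is implied):

    <LETTER>
    <DATE>March 4, 1996</DATE>
    <ADDRESSEE>Ms. Jane Smith</ADDRESSEE>
    <SALUTATION>Dear Ms. Smith:</SALUTATION>
    <BODY>
    <PARA>Thank you for your inquiry.</PARA>
    </BODY>
    <CLOSING>Sincerely,</CLOSING>
    <SIGNATURE>John Jones</SIGNATURE>
    </LETTER>

Each tag names what an item is (a salutation, a closing), never how it should be typeset.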


SGML is many things to many people. In the preface to The SGML Handbook, Charles F. Goldfarb,(9) the "father" of SGML, states that:

It is a tagging language.

It handles logical structures.

It is a file linking and addressing scheme.

It is a data base language for text.

It is a foundation for multimedia and hypertext.

It is a syntax for text processing style sheets.

It allows coded text to be reused in ways not anticipated by the coder.

It is a document representation language for any architecture.

It is a notation for any kind of structure.

It is a metalanguage for defining document types.

It represents hierarchies.

It is an extensible document description language.

It is a standard for communicating among different hardware platforms and software applications.

SGML is and does all of these things, but the whole is much more than the sum of the parts. Moreover, it is only by understanding the whole that one can make the best use of SGML.

5.3.1 Speaking of Metalanguages

SGML is an important standard, and one worth understanding. Most important, a moderate understanding of the basic concepts will lead you to appropriate uses of SGML.


Now it's time for a little jargon. Metalanguages: what are they and why should you care? First, as the name implies, a metalanguage is a language. Languages are used to describe all sorts of things. A metalanguage describes another language. SGML has the ability to describe languages that describe document structures. A complete SGML file contains not only the markedup text, but also the definition of that markup. This markup definition, or language, is called a document type definition or DTD. Another very powerful aspect of metalanguages in the computing world is the fact that programs can be automatically constructed using metalanguages.


An analogous, real-world example is the universal remote-control device that can be taught to control any VCR. You teach it the language of your particular VCR so that it can speak your VCR's language. The act of teaching the universal remote to speak one particular VCR's language is analogous to creating a program from a metalanguage. The result is a specialized language that performs the required functions. It's compact and exactly what you need. Most importantly, metalanguages (and universal remote controls) are flexible. Change your VCR, and you simply reteach the remote. Change the markup language, and regenerate your markup processor.

Back in computer land, let's say that I want to write a new computer language. Let's also say that this new language could be expressed using a metalanguage. It might then be possible to use a metalanguage tool to create interpreters or compilers for my new language. This type of language technology exists and is widely used. The metalanguage is called BNF (Backus-Naur Form), and the tools to generate new interpreters are called parser generators or compiler-compilers. One common tool in this domain is called YACC (Yet Another Compiler Compiler).

At this point you may rightly ask, "So what? Who cares?" Well, in the SGML domain, some clever people have written a computer tool called the Amsterdam SGML Parser (ASP-SGML).(10) It is analogous to YACC, but it creates an SGML parser for a specific type of document, the one described by the SGML metalanguage. The motivation for all this work is flexibility. The easier it is to modify a parser, the easier it is for end users to tailor the language to their particular needs. In addition, this approach opens up the creation of inexpensive specialized SGML parsers for specific applications.
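To make the idea concrete, here is an illustrative BNF-style grammar for a tiny document language (the rule names are invented for this example):

    letter     ::= date address salutation body closing
    body       ::= paragraph { paragraph }
    paragraph  ::= sentence { sentence }

A parser generator reads rules like these and produces a program that accepts exactly the documents the grammar describes; a DTD plays precisely this role for an SGML parser.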


5.3.2 Document Type Definition (DTD)

There are two parts to an SGML representation of a document. The first is called a document type definition (DTD). The second is the content of the document itself, with all the markup tags defined in the DTD. A key and elegant feature of SGML is that it is extensible. A DTD can define new tags and tags based on other tags.

The DTD defines elements, and the document content provides specific instances of those elements. Put another way, the DTD is like a mold, and the text is like the metal poured into that mold. An element is a thing, a placeholder, a name. An instance of an element is the object itself. For example, a DTD may define "chap" as the element for a chapter. In this book, World Wide Web is a particular instance of a "chap."
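A fragment of what such a DTD might look like (a hedged sketch; the element names are illustrative):

    <!ELEMENT chap    - - (title, section+)>
    <!ELEMENT title   - - (#PCDATA)>
    <!ELEMENT section - - (title, para+)>
    <!ELEMENT para    - O (#PCDATA)>

Read the first declaration as: a chap consists of a title followed by one or more sections. The two dashes state that both start and end tags are required, while the O in the last line permits omitting a paragraph's end tag.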

DTDs themselves are becoming standardized. Standard DTDs are the key to meaningful document interchange. SGML without a DTD is like listening to someone speaking a foreign language. When you listen, you can recognize that a language is being spoken and that words, phrases, and sentences are in the speech, but you just can't understand anything. The DTD enables an SGML parser to understand the stream of text being processed. Of course, this analogy shouldn't be taken too far. The parser doesn't "understand" anything, and the only "meaning" that can be extracted is a valid document structure.


This discussion should make one thing clear: putting all this markup in a document is a lot of work. Techniques and commercial products are available for converting existing documents into SGML marked-up documents. SGML conversion is also THE area in which a clear understanding of what SGML is and is not is vital.

"SGML-Like" (And We Could Get "Sort of Married" Too...), an article(11) by Bill Zoellick of Avalanche Development Company, now part of Interleaf, aptly points out the dangers of systems and consultants who ignore the fundamentals of SGML markup. He states that SGML markup without a DTD, as some systems attempt, is not really valid markup. The DTD "is the coin of the realm. If you do not have a DTD, you cannot play the SGML game." The DTD imposes a set of rules and a structure that a document must follow. It also allows another SGML system to interpret (parse) your document according to valid SGML rules. Marking up a document without following the rules specified by a DTD is problematic and should not be called SGML. SGML parser/verifiers can analyze a document and its markup for conformance to a DTD. Of course, fixing the document may be no simple task; that's all the more reason to do it the right way from the start.

5.3.3 DSSSL

The Document Style Semantics and Specification Language (DSSSL, pronounced "dissel") is the style sheet standard. The formal DSSSL standard is known as ISO/IEC 10179:1995. DSSSL is intended to work with SGML or other document formats "for which a property set can be defined according to the Property Set Definition Requirements of ISO/IEC 10744."(12) DSSSL consists primarily of two mini-languages: a transformation language and a style language.

DSSSL's main function is to associate formatting information with document elements. DSSSL includes a transformation language used to describe the association of formatting information with the logical elements of a hierarchically structured document, such as an SGML-tagged document. It has a general mechanism for mapping the tree structure represented by an SGML DTD into another tree structure representing the visual composition of a document.

Like any other computer language, DSSSL is a language with a specific syntax and a collection of keywords. It will allow a designer to associate visual style descriptions with SGML elements. For example, using DSSSL, you can instruct a formatting processor to print all elements with the tag <SECTION> using Futura oblique bold with a 14-point size. The logical element <SECTION> would be associated, via a DSSSL specification, with the formatting information of the font Futura oblique bold 14 point.
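In DSSSL's Scheme-based syntax, that association might be written roughly as follows (a sketch in the style of DSSSL construction rules; the exact characteristic names supported vary by implementation):

    (element SECTION
      (make paragraph
        font-family-name: "Futura"
        font-weight: 'bold
        font-posture: 'oblique
        font-size: 14pt))

The construction rule fires for every SECTION element and builds a paragraph flow object carrying the specified typographic characteristics.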

DSSSL's second principal part is the style language. It will allow you to associate visual styles with collections of elements that are not specifically identified as logical elements. For example, the first paragraphs of a chapter may require different typography from all the other paragraphs. You could format such a paragraph with a larger font size and with a large initial cap. The knowledge that such a paragraph is still a paragraph can be maintained, however. At a later time, when you insert a new first paragraph, DSSSL takes care of the formatting.

DSSSL Conceptual Model(13)

DSSSL's complexity and generality make complete implementations difficult and scarce. However, the future standards team of SGML and DSSSL should prove immensely powerful.

The DSSSL Transformation Process(14)

One method being used to create usable forms of DSSSL is to implement a subset known as DSSSL Online. DSSSL Online is being proposed as the basis of a common style sheet language for the Web.

5.3.4 HyTime

"If computers are the wave of the future, then hypertext is the surfboard." - Ted Nelson

HyTime, ISO/IEC standard 10744, is the international standard to represent linked and time-based information. It is the hypertext standard and is an extension of SGML.

Hypertext is a maturing technology that allows you to jump from one piece of text to another using an on-line reading system.(15) For a good overview of this field, check out Hypertext & Hypermedia by Jakob Nielsen (Academic Press, 1990). Formally, HyTime is called the Hypermedia/Time-based Structuring Language.

The HyTime standard is an application of SGML. It uses the capabilities of SGML to formalize the representation of a linking capability, music, and time.

If this sounds like an odd combination, you're right. The initial HyTime work was motivated by an interest in representing music; therefore, time was very important. Musical events led to a connection with hypermedia, and now both a music description standard and a hypermedia standard are being developed by closely related groups. Eventually, the work was split into HyTime and SMDL (Standard Music Description Language). HyTime is an application of SGML, and SMDL is an application of HyTime (you got that clear now?).(16)

Some of the objectives of HyTime are to support music publishing, business presentations, and computer-assisted instruction with music and sound effects. HyTime has three key concepts: events, links, and time. An event is some action, possibly external to the application (that is, initiated by a user). A link is a connection between one or more "anchors" in the information. Time is represented as durations of various sorts within a very flexible framework.
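HyTime expresses such links through SGML "architectural forms": a DTD declares that one of its own element types plays a standard HyTime role. A rough sketch (the xref element name is invented for illustration; clink is HyTime's contextual link form):

    <!ELEMENT xref    - O EMPTY>
    <!ATTLIST xref
              HyTime  NAME   #FIXED  clink
              linkend IDREF  #REQUIRED>

An <xref linkend=time-model> in the document instance is then recognizable to any HyTime engine as a contextual link whose anchor is the element with the matching ID.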

One key conceptual contribution of HyTime is its model of time. Time can be represented in two ways, musical and real. Musical time is abstract and analogous to the logical structure of a document. Real time is concrete and analogous to the physical formatted document. These two time representations can be synchronized using HyTime.


HyTime's model of time is intended for nonmusical as well as musical applications. It uses the concept of a unit of music time called a virtual time unit (VTU). A VTU is not inherently equivalent to any real number of seconds or any other concrete measure of time. For example, if an application were video-based, the VTU might be defined as the time for one video frame. For a musical application, the unit might be an eighth note.

Keep in mind that these concepts are nothing revolutionary. However, HyTime formalizes the representation of these concepts using SGML. It should be possible to represent and interchange the information represented in HyTime with systems that interpret HyTime information. Hypermedia is desperate for such an interchange standard. The Web has somewhat hijacked this more rigorous hypertext model, and it remains to be seen whether HyTime will have a significant impact on the Web.