Appraising your data
Searching for schemas
Separating your data into categories
Developing a strategy for data
Testing your data design
It’s important to make sure that your markup fits your content the way (a) puzzle pieces fit together, (b) peas and carrots go together, or (c) a hand fits in a glove. (Choose your metaphor.)
You can create perfectly written XML, but if your perfect XML doesn’t fit your content, all that work isn’t going to do diddly for you. This chapter is devoted to helping you get a handle on the content that you’re creating so you can use XML to describe it well. Content analysis isn’t nearly as scary as it sounds; a little analysis early on (tell us what you see in these ink blots) can save you from going loco later.
After you assess your content, you can create a taxonomy — no, not the part where you mount deer heads on the wall, but rather a naming scheme: You break your content down into categories and subcategories according to a well-thought-out plan.
The process of becoming best friends with your content is often called content analysis or information analysis. Whatever name it goes by, analysis requires breaking down content into bite-size chunks to see exactly what pieces are going to become key components when you describe the data with a markup language (in this case, XML).
Taking a close look at the flow of information in your business will help you identify the components of your content. For example, what data is collected when a customer places an order? What kind of inventory information do you maintain? Do you use a catalog of your products? Do you use a database? What happens to all this information you are amassing? Each different process is a specialized use of information.
If you’re already familiar with the information that qualifies as content, then you’ve already got a leg up on the process. If you’re unfamiliar with the content, however, take some time to talk to those people who create or frequently process the data. Find out
What users do with individual pieces of information.
What data users think is impossible to live without (and why).
What data is unnecessary or optional (and why).
Gather enough information to sufficiently understand what the key components of the content are, why the content was created, and what’s needed to make the content useful to the people who created it.
To get started analyzing data, you need to gather up several samples of the data content to work with so that you can create as complete a composite (a collection made up of distinct parts) of the key data components as possible.
The more complete your collection of samples is, the better chance you have of creating markup that fits all your content. Here are some ideas:
Get data from multiple sources: If you’re working with data for a business, be sure to gather invoices, receipts, and other data from multiple vendors or customers. One vendor may exclude vital info that another vendor includes.
Get a lot of data: If you need to describe data that will eventually go into an existing database, see whether you can get sample data that’s already in the database so that you can be sure that your markup and the database’s requirements match.
You may have to make modifications to the database to make sure that all the available information is gathered and used to its fullest extent.
Get a lot of data from multiple sources: If you need to describe complex reports, lay your hands on several different reports, written by different people if possible.
You’re getting the drift, aren’t you?
Because your content is ultimately destined for a processing system of some kind, you should talk with the people building that system to see what their data requirements are for it (assuming there’s no predefined DTD or schema already in place). You want your markup to work with their system; a little communication up front about their needs and expectations goes a long way toward avoiding a complete rework of your DTD or schema.
It’s important that you look around for predefined schemas and DTDs before you try to create your own. If you find one that meets your needs, you can save yourself a lot of time by building on existing markup that at least one other person or group is using — and you know that much of your new markup already works. (If you’re trying to work with an established system such as ASP.NET, for example, you won’t have a choice; you have to use that particular DTD to make your instructions work with that system.)
For more about the .NET Framework that underlies ASP.NET, see http://msdn.microsoft.com/netframework/programming/fundamentals/default.aspx
Lots and lots of DTDs and schemas are already available for your use. For example, the DTD used by the Open Financial Exchange (OFX) is freely available online. OFX enables online exchange of financial information between banks, businesses, and consumers. OFX accomplishes this goal by using XML to describe bank data and then transfer that data electronically via the Internet. OFX came about through an alliance among CheckFree, Intuit, and Microsoft. Because these three major players (and the banking organizations) can agree on a single format to describe banking data, information exchange is as easy as pie. They chose XML because it’s a standard and is becoming the de facto format for data exchange. To discover more juicy stuff about OFX, check out www.ofx.net.
When you create a document according to a DTD or schema, you use a predefined structure that specifies how the components of markup (elements, attributes, and such) should be used to describe a particular kind of content. Predefined DTDs and schemas usually come from a couple of different sources:
Industry groups or organizations that want to establish a common format for standard data — OFX is a perfect example of this source. Another good example is the Chemical Markup Language (CML), created by chemists to describe chemical equations.
Application builders who created their systems to run with content described by a particular set of markup. For example, the ColdFusion Markup Language (CFML), created by Allaire/Macromedia, defines a particular set of markup for describing applications written to run in the ColdFusion system. ASP.NET from Microsoft also uses a similar predefined flavor of XML for creating Active Server Pages (ASP).
In the “early days” — in terms of XML, that means a few years ago — several schema repositories were available online at sites such as www.Biztalk.org and www.schema.net. You could search for a schema or DTD, or add one of your own to the repository. Microsoft’s BizTalk schema repository ended in 2002 and is no longer available — and at least for now, schema.net is no longer active.
That doesn’t mean public schemas and DTDs aren’t obtainable; they’re just harder to find. One schema repository still exists, hosted by OASIS (the Organization for the Advancement of Structured Information Standards) at www.xml.org/xml/registry.jsp. In addition, OASIS provides a comprehensive list of proposed XML applications and industry initiatives at www.oasis-open.org/cover/xml.html#applications, which is also a great resource for finding schemas.
When you’re trying to decide whether you need to build a new DTD or schema for your content or use an existing one, remember that the most important issue is the way that the markup fits your content. The whole point of using XML is to make your content as accessible to a system as possible. That goal is thwarted when you force your content into an existing markup scheme that doesn’t accurately reflect the content.
When we developed our hypothetical book-selling business, we went through the same data-analysis process we’re sharing with you. After we gathered our documents (invoices, inventory reports, mailing lists) and familiarized ourselves with them, we took a good hard look at what we learned about our content. Here’s what we came up with:
Books can be categorized in a number of different ways, including:
• Author
• Title
• Publication date
• Publisher
• Edition
• Language
• Number of pages
• Size
• Type: Fiction, Nonfiction
• Genre: Historical, Fantasy, Biography, Mystery . . . and so forth
• Special features: illustrations, color plates, ornate end papers, leather binding . . . and so on
• Format: Paperback, Hardback, Audio, Large Print, New, Used
• Price: Retail, Wholesale
• ISBN
The customer information we collect includes:
• First Name
• Last Name
• Address
• City
• State
• Zip Code
• E-mail Address
• Phone Number
The sales information we gather in addition to customer information includes:
• Date
• Item Number
• Price
• Total Cost
We also do (at least in our hypothetical world) both direct retail sales online (from our online catalog) and traditional wholesale to four brick-and-mortar department stores.
When we analyzed our content, we made some judgments about what information we needed to collect. Many possible categories — genre, number of pages, size — were not useful information for our specific book business, so we chose to exclude them from our taxonomy strategy.
In the end, we discovered that the book business can be very complex and have a variety of component types. Some components are consistent across all books (such as author, title, publisher), but others are found only in some (such as illustrations). We created our book business to help you understand XML, not to produce an overly elaborate markup language that covered all the bases. (We left special features out of the fray, for example.) That decision was as much a part of the content-analysis process as discovering that illustrations are a possible content element. Knowing the purpose of your markup can help you keep your goals in sight (and in check).
XML content can be divided into two main groups: data-intensive and document- or text-intensive.
On the data end of the spectrum, you find collections of data like those that reside in a database. Each collection consists of a more or less arbitrary number of record structures, in which each record contains
A unique identifier or key: This value, unique to each record, helps locate individual records. For example, an ISBN could serve as a unique identifier for each book in a book collection.
A common collection of named, organized values: Think of an address book, a card catalog in a library, or a set of medical records in your doctor’s office. For example, each card in a card catalog contains the same categories of information: title, author, publisher, publication date, keywords, and description.
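Here is a minimal sketch of what such a record collection could look like in XML. The element names (catalog, card, pubDate, and so on) are made up for illustration, the second card holds placeholder values only, and the isbn attribute stands in for the unique key. This is not the markup we settle on later in the chapter.

```xml
<?xml version="1.0"?>
<!-- Hypothetical card catalog: every card carries the same named values. -->
<catalog>
  <!-- The isbn attribute plays the role of the unique key for this record. -->
  <card isbn="0385504209">
    <title>The Da Vinci Code</title>
    <author>Brown, Dan</author>
    <publisher>Doubleday</publisher>
    <pubDate>2003</pubDate>
    <keywords>thriller, mystery</keywords>
    <description>Illustrative description text goes here.</description>
  </card>
  <!-- Placeholder record: same structure, different key. -->
  <card isbn="0000000000">
    <title>Placeholder Title</title>
    <author>Placeholder, Author</author>
    <publisher>Placeholder Press</publisher>
    <pubDate>2000</pubDate>
    <keywords>placeholder</keywords>
    <description>Placeholder description.</description>
  </card>
</catalog>
```

Notice that the two cards differ only in their values; the structure repeats exactly, which is the hallmark of data-intensive content.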
On the document or text end of that continuum, the content to be captured and represented fits typical notions of text or hypertext materials — that is, a collection of words, graphics, and other information meant to be read or viewed as a structured object. Examples on this end of the spectrum include books, articles, magazines, narratives, training materials, and so forth.
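For contrast, a text-intensive document might look something like the following sketch. All the element names here (article, section, heading, para, emphasis) are assumptions chosen for illustration, not part of any particular DTD.

```xml
<?xml version="1.0"?>
<!-- Hypothetical text-intensive document: markup wraps running prose. -->
<article>
  <title>Getting to Know Your Content</title>
  <section>
    <heading>Why analyze?</heading>
    <!-- Text and markup are mixed inside the paragraph. -->
    <para>A little analysis early on can save you from a
      <emphasis>complete rework</emphasis> of your markup later.</para>
  </section>
</article>
```

Here the structure follows the flow of the text (sections, headings, paragraphs) rather than a fixed set of fields repeated record after record.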
Then, too, XML can capture and represent data that describes other collections of data — for example, start and stop dates for time-sensitive files, status information, modification data, and so forth. That handy capability makes all kinds of helpful information easy to describe and use — whether stored in a document or data collection.
After you look at your content, you can start breaking it down into categories and subcategories. (If you haven’t already made decisions about what content to include, this process will also help you make those judgments.)
Here’s how we broke it down for our hypothetical book business:
Book
• Item Number
• Title
• Author
• Publisher
• Price
• Content Type
• Format
• ISBN
Sales
• Item Number
• Price
• Shipping
• Total Cost
• Date
• Source
Customer
• Customer Number
• First Name
• Last Name
• Address
• City
• State
• Zip Code
• E-mail Address
• Phone Number
As you can see, some subcategories show up under more than one major category. In particular, Item Number appears as a subcategory in both the Book and the Sales categories. The Item Number is unique to each copy of a book, which makes it easy to keep track of sales and inventory.
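A rough first pass at turning these categories into markup might look like the sketch below, with one element per category and one child element per subcategory. The element names (including the bookstore wrapper and the sale element) are our own illustrative guesses at this stage, and the sample values are borrowed from the bookstore listing that appears later in this chapter.

```xml
<?xml version="1.0"?>
<!-- First-pass sketch: categories become elements, subcategories become children. -->
<bookstore>
  <book>
    <itemNumber>0385504209-1</itemNumber>
    <title>The Da Vinci Code</title>
    <author>Brown, Dan</author>
    <publisher>Doubleday</publisher>
    <price>$24.95</price>
    <contentType>Fiction</contentType>
    <format>Hardback</format>
    <isbn>0385504209</isbn>
  </book>
  <sale>
    <!-- Same item number as the book above: this shared value links the two. -->
    <itemNumber>0385504209-1</itemNumber>
    <price>$24.95</price>
    <shipping>$5.00</shipping>
    <totalCost>$29.95</totalCost>
    <date>January 12, 2005</date>
    <source>Retail</source>
  </sale>
  <customer>
    <custNumber>5594</custNumber>
    <firstName>Joe</firstName>
    <lastName>Blow</lastName>
    <address>52 Joetta Lane</address>
    <city>Cottage Grove</city>
    <state>OR</state>
    <zip>97424</zip>
    <email>jblow@pacinfo.com</email>
    <phone>767-3333</phone>
  </customer>
</bookstore>
```

A literal translation like this is only a starting point; expect the structure to shift as you test it against real documents.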
You might be surprised by this tidbit, but one of the best ways to start testing your taxonomy is to jump in and write some markup that describes how it should be used — after you have a good understanding of what it takes to create and use the content, of course. What you start with may only slightly resemble your finished markup language, but you do have to start somewhere.
To create your markup, pick an invoice and start creating elements. Every XML document has one root element that contains all the other elements in the document. In our own initial round of markup, we used book as the root element because we thought that each book would have its own document. After giving it some thought, we realized that we might want to include several books in one document (such as an invoice for more than one book). Thus we made books the root element and set the book element to delineate each individual book in a document.
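The difference is easy to see in a stripped-down sketch. With books as the root element, one document can hold any number of book elements. The second title below is a placeholder, not a real entry from our inventory.

```xml
<?xml version="1.0"?>
<!-- books is the single root element; each book element marks off one book. -->
<books>
  <book>
    <title>The Da Vinci Code</title>
  </book>
  <book>
    <title>Placeholder Second Title</title>
  </book>
</books>
```

Had we kept book as the root element, the second book would have needed a document of its own, because an XML document can have only one root element.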
We’re not going to lie to you: A lot of this stuff is plain old-fashioned trial and error. As you work with your markup, experiment with using combinations of elements and attributes until you get the best results. For example, initially, we used two nested elements to specify the content type for a book:
<book>
  <contentType>Fiction</contentType>
</book>
This option would work very well if we thought that a book could have more than one type of content to work with. The markup would use as many contentType elements within the book element as there were categories, with at least one required.
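As a sketch of how that element-based option could be written down, the document below carries a small DTD in its internal subset. The DTD is our own minimal assumption, written only to show the "at least one" rule, and the two-category book (say, an anthology that mixes both types) is hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!-- The plus sign means one or more contentType elements are required. -->
  <!ELEMENT book (contentType+)>
  <!ELEMENT contentType (#PCDATA)>
]>
<book>
  <contentType>Fiction</contentType>
  <contentType>Nonfiction</contentType>
</book>
```

Because contentType holds plain text here, nothing stops someone from typing any category name they like; the DTD controls only how many times the element appears.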
In the end, we decided to go with contentType as an attribute of the book element instead, as shown here:
<book contentType="Fiction"/>
We decided on this route because we thought that we’d want to predefine the category names and require that valid documents choose one of the names from the list in our DTD or schema. This choice narrows the category to one but allows us to enforce category names.
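In DTD terms, that kind of predefined list is an enumerated attribute. The sketch below is a minimal, assumed DTD fragment (not our finished one) in which contentType must be present and must be one of the listed names.

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book EMPTY>
  <!-- Only the listed names are valid; #REQUIRED means one must be supplied. -->
  <!ATTLIST book contentType (Fiction | Nonfiction) #REQUIRED>
]>
<book contentType="Fiction"/>
```

A validating parser would reject a value such as contentType="Mystery" because it isn’t in the list, and because an element can carry a given attribute only once, each book gets exactly one content type.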
As you become more comfortable with content analysis, you’ll know instinctively that some data components work best as attributes and other data components work better as elements. As you discover the details of the XML syntax for elements and attributes — and how they work together (see Part III) — you develop a firm basis for deciding what should be an element and what should be an attribute.
The best way to test your final (or final draft) markup is to apply it to as many content samples as you can lay your hands on. With each test, you may find something that you need to tweak or change outright. However, after much testing, you’ll end up with a final product that serves you well.
In a perfect world, you would have talked with the system’s developer early in the process to find out what content the system needs to work with, using that knowledge while conducting data analysis. (We’ll pretend that’s exactly what you did.) Show your markup to the system developers and make sure it has the information they were expecting; expect more tweaks and changes. Feed sample documents into the system and see what happens. Tweak and change some more. Listing 3-1 shows the final draft of our bookstore markup.
Listing 3-1: bookstore.xml
<?xml version="1.0" standalone="yes"?>
<books>
  <book contentType="Fiction" format="Hardback">
    <bookInfo>
      <title>The Da Vinci Code</title>
      <author>Brown, Dan</author>
      <publisher>Doubleday</publisher>
      <isbn>0385504209</isbn>
    </bookInfo>
    <salesInfo>
      <price priceType="Retail">$24.95</price>
      <itemNumber>0385504209-1</itemNumber>
      <date>January 12, 2005</date>
      <source sourceType="Retail" />
      <shipping>$5.00</shipping>
      <cost>$29.95</cost>
    </salesInfo>
  </book>
  <totalCost>$29.95</totalCost>
  <customer custType="newRetail">
    <custNumber>5594</custNumber>
    <lastName>Blow</lastName>
    <firstName>Joe</firstName>
    <address>52 Joetta Lane</address>
    <city>Cottage Grove</city>
    <state>OR</state>
    <zip>97424</zip>
    <phone>767-3333</phone>
    <email>jblow@pacinfo.com</email>
  </customer>
</books>
Our document went through lots of changes from our initial look at categories to our final-draft version of the markup. We deleted some subcategories and added some new ones. And you can expect even more changes as you test out your markup and design a DTD or schema for validating it.
If you want to play the eXtensible Markup Language (XML) game, you have to know the rules. But the X in XML means eXtensible; the element names you can use and define are unlimited. That is, you get to make up as many (or as few) rules as you want or need to make the markup do what you want it to. For example, you can create a document definition for a bookstore to define precisely what kind of data can go into any future XML documents that adhere to your definition.
The rules that you create with XML can dictate which elements make up an XML document, which kinds of content these elements can contain, and how such elements may be ordered. Document descriptions even support rules about which elements are optional, which ones are required, and how many times certain elements can (or must) appear.
Creating XML document descriptions enables you to state the rules that a whole class of documents must follow.
The two main forms of XML document descriptions in use today are DTDs and XML schemas — and there’s more about both in Part III.
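To give you a taste of what such rules look like, here is a small sketch that uses a DTD in the document’s internal subset. The particular rules are assumptions made up for this example (our real bookstore markup may well differ); the point is only to show order, required and optional elements, and repetition.

```xml
<?xml version="1.0"?>
<!DOCTYPE books [
  <!-- books must contain one or more book elements. -->
  <!ELEMENT books (book+)>
  <!-- Order matters: title, then one or more authors, then publisher.
       The question mark makes edition optional. -->
  <!ELEMENT book (title, author+, publisher, edition?)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT publisher (#PCDATA)>
  <!ELEMENT edition (#PCDATA)>
]>
<books>
  <book>
    <title>The Da Vinci Code</title>
    <author>Brown, Dan</author>
    <publisher>Doubleday</publisher>
  </book>
</books>
```

The sample book is valid under these rules even though it has no edition element, because edition is optional. Leave out the publisher, or put the author before the title, and a validating parser would complain.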
When you’ve got a pretty firm handle on all the ins and outs of content analysis, it’s time to tackle the rules for creating XML markup. Chapter 4 makes that transition into XML syntax via another markup language, XHTML.