Chapter 3

Slicing and Dicing Data Categories: The Art of Taxonomy

In This Chapter

bullet Appraising your data

bullet Searching for schemas

bullet Separating your data into categories

bullet Developing a strategy for data

bullet Testing your data design

It’s important to make sure that your markup fits your content the way (a) puzzle pieces fit together, (b) peas and carrots go together, or (c) a hand fits in a glove. (Choose your metaphor.)

You can create perfectly written XML, but if your perfect XML doesn’t fit your content, all that work isn’t going to do diddly for you. This chapter is devoted to helping you get a handle on the content that you’re creating so you can use XML to describe it well. Content analysis isn’t nearly as scary as it sounds; a little analysis early on (tell us what you see in these ink blots) can save you from going loco later.

After you assess your content, you can create a taxonomy — no, not the part where you mount deer heads on the wall, but rather a naming scheme: You break your content down into categories and subcategories according to a well-thought-out plan.

Taking Stock of Your Data

The process of becoming best friends with your content is often called content analysis or information analysis. Whatever name it goes by, analysis requires breaking down content into bite-size chunks to see exactly what pieces are going to become key components when you describe the data with a markup language (in this case, XML).

Tip

When we use the term components, we’re referring to types of data that run throughout a document. (Titles and authors are two key components of a book description, for example.) Until you have a good handle on the components of your content, you can’t create markup that fits it — or even use an existing markup language to describe it.

Looking at business practices and partners

Taking a close look at the flow of information in your business will help you identify the components of your content. For example, what data is collected when a customer places an order? What kind of inventory information do you maintain? Do you use a catalog of your products? Do you use a database? What happens to all this information you are amassing? Each different process is a specialized use of information.

If you’re already familiar with the information that qualifies as content, then you’ve already got a leg up on the process. If you’re unfamiliar with the content, however, take some time to talk to those people who create or frequently process the data. Find out

bullet What users do with individual pieces of information.

bullet What data users think is impossible to live without (and why).

bullet What data is unnecessary or optional (and why).

Gather enough information to sufficiently understand what the key components of the content are, why the content was created, and what’s needed to make the content useful to the people who created it.

Gathering some content

To get started analyzing data, you need to gather up several samples of the data content to work with so that you can create as complete a composite (a collection made up of distinct parts) of the key data components as possible.

The more complete your collection of samples is, the better chance you have of creating markup that fits all your content. Here are some ideas:

bullet Get data from multiple sources: If you’re working with data for a business, be sure to gather invoices, receipts, and other data from multiple vendors or customers. One vendor may exclude vital info that another vendor includes.

bullet Get a lot of data: If you need to describe data that will eventually go into an existing database, see whether you can get sample data that’s already in the database so that you can be sure that your markup and the database’s requirements match.

Remember

You may have to make modifications to the database to make sure that all the available information is gathered and used to its fullest extent.

bullet Get a lot of data from multiple sources: If you need to describe complex reports, lay your hands on several different reports, written by different people if possible.

You’re getting the drift, aren’t you?

Tip

To create a complete picture, try to find at least five or six samples to work with.

Because your content is ultimately destined for a processing system of some kind, you should talk with the people building that system to see what their data requirements are for it (assuming there’s no predefined DTD or schema already in place). You want your markup to work with their system; a little communication up front about their needs and expectations goes a long way toward avoiding a complete rework of your DTD or schema.

Tip

For more information on DTDs and schemas, see Part III (Chapters 8 – 11) of this book.

Checking whether a DTD or schema already exists

It’s important that you look around for predefined schemas and DTDs before you try to create your own. If you find one that meets your needs, you can save yourself a lot of time by building on existing markup that at least one other person or group is using — and you know that much of your new markup already works. (If you’re trying to work with an established system such as ASP.NET, for example, you won’t have a choice; you have to use that particular DTD to make your instructions work with that system.)

TechnicalStuff

ASP.NET is the next generation of ASP (Active Server Pages) and is part of Microsoft’s .NET framework (a programming model for developing and using XML Web services). For more details on XML and Web services, see Chapter 15. For more information on the .NET framework, see

http://msdn.microsoft.com/netframework/programming/fundamentals/default.aspx

Lots and lots of DTDs and schemas are already available for your use. For example, the DTD used by the Open Financial Exchange (OFX) is freely available online. OFX enables online exchange of financial information between banks, businesses, and consumers. OFX accomplishes this goal by using XML to describe bank data and then transferring that data electronically via the Internet. OFX came about through an alliance among CheckFree, Intuit, and Microsoft. Because these three major players (and the banking organizations) can agree on a single format to describe banking data, information exchange is as easy as pie. They chose XML because it’s a standard and is becoming the de facto format for data exchange. To discover more juicy stuff about OFX, check out www.ofx.net.

When you create a document according to a DTD or schema, you use a pre-defined structure that specifies how the components of markup (elements, attributes, and such) should be used to describe a particular kind of content. Predefined DTDs and schemas usually come from a couple of different sources:

bullet Industry groups or organizations that want to establish a common format for standard data — OFX is a perfect example of this source. Another good example is the Chemical Markup Language (CML), created by chemists to describe chemical equations.

bullet Application builders who created their systems to run with content described by a particular set of markup. For example, the ColdFusion Markup Language (CFML), created by Allaire/Macromedia, defines a particular set of markup for describing applications written to run in the ColdFusion system. ASP.NET from Microsoft also uses a similar predefined flavor of XML for creating Active Server Pages (ASP).
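A document usually signals which predefined DTD it follows with a document type declaration near the top of the file. The sketch below shows the general shape only; the DTD address and the element names are hypothetical, made up for illustration.

```xml
<?xml version="1.0"?>
<!-- Hypothetical example: this DTD location and these element names are invented -->
<!DOCTYPE invoice SYSTEM "http://www.example.com/dtds/invoice.dtd">
<invoice>
  <customer>Joe Blow</customer>
  <total>29.95</total>
</invoice>
```

A validating parser would fetch the DTD named in the declaration and check the document against it.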

Searching for a schema repository

In the “early days” — in terms of XML, that means a few years ago — several schema repositories were available online at sites such as www.Biztalk.org and www.schema.net. You could search for a schema or DTD, or add one of your own to the repository. Microsoft’s BizTalk schema repository ended in 2002 and is no longer available — and at least for now, schema.net is no longer active.

That doesn’t mean public schemas and DTDs aren’t obtainable; they’re just harder to find. One schema repository still exists, hosted by OASIS (the Organization for the Advancement of Structured Information Standards) at www.xml.org/xml/registry.jsp. In addition, OASIS provides a comprehensive list of proposed XML applications and industry initiatives at www.oasis-open.org/cover/xml.html#applications, which is also a great resource for finding schemas.

Remember

Industry groups and associations are good sources of information about what schemas or DTDs are used in specific industries.

When you’re trying to decide whether you need to build a new DTD or schema for your content or use an existing one, remember that the most important issue is the way that the markup fits your content. The whole point of using XML is to make your content as accessible to a system as possible. That goal is thwarted when you force your content into an existing markup scheme that doesn’t accurately reflect the content.

Tip

Content analysis with XML in mind is much easier when you have a handle on the ins and outs of XML Schemas and DTDs and how to put them together. Once again, keep what you read here in mind as you check out DTDs and schemas in Part III.

Breaking Down Data in Different Ways

When we developed our hypothetical book-selling business, we went through the same data-analysis process we’re sharing with you. After we gathered our documents (invoices, inventory reports, mailing lists) and familiarized ourselves with them, we took a good hard look at what we learned about our content. Here’s what we came up with:

bullet Books can be categorized in a number of different ways, including:

• Author

• Title

• Publication date

• Publisher

• Edition

• Language

• Number of pages

• Size

• Type: Fiction, Nonfiction

• Genre: Historical, Fantasy, Biography, Mystery . . . and so forth

• Special features: illustrations, color plates, ornate end papers, leather binding . . . and so on

• Format: Paperback, Hardback, Audio, Large Print, New, Used

• Price: Retail, Wholesale

• ISBN

bullet The customer information we collect includes:

• First Name

• Last Name

• Address

• City

• State

• Zip Code

• E-mail Address

• Phone Number

bullet The sales information we gather in addition to customer information includes:

• Date

• Item Number

• Price

• Total Cost

We also do (at least in our hypothetical world) both direct retail sales online (from our online catalog) and traditional wholesale to four brick-and-mortar department stores.

Winnowing out the wheat from the chaff

When we analyzed our content, we made some judgments about what information we needed to collect. Many possible categories — genre, number of pages, size — were not useful information for our specific book business, so we chose to exclude them from our taxonomy strategy.

In the end, we discovered that the book business can be very complex and have a variety of component types. Some components are consistent across all books (such as author, title, publisher), but others are found only in some (such as illustrations). We created our book business to help you understand XML, not to produce an overly elaborate markup language that covered all the bases. (We left special features out of the fray, for example.) That decision was as much a part of the content-analysis process as discovering that illustrations are a possible content element. Knowing the purpose of your markup can help you keep your goals in sight and in check.

Types of data that can be stored in XML

XML content can be divided into two main groups: data-intensive and document- or text-intensive.

On the data end of the spectrum, you find collections of data like those that reside in a database. Each collection consists of a more or less arbitrary number of record structures, in which each record contains

bullet A unique identifier or key: This value, unique to each record, helps you locate individual records. For example, an ISBN could serve as a unique identifier for each book in a book collection.

bullet A common collection of named, organized values: Think of an address book, a card catalog in a library, or a set of medical records in your doctor’s office. For example, each card in a card catalog contains the same categories of information: title, author, publisher, publication date, keywords, and description.
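Here is a rough sketch of what such a record collection might look like in XML, with the ISBN acting as the key for each record. The element and attribute names are our own choices for illustration, and the second record holds obvious placeholder values rather than a real book.

```xml
<?xml version="1.0"?>
<books>
  <!-- Each record carries the same named values; isbn serves as the key -->
  <book isbn="0385504209">
    <title>The Da Vinci Code</title>
    <author>Brown, Dan</author>
    <publisher>Doubleday</publisher>
  </book>
  <!-- Placeholder record: these values are made up -->
  <book isbn="0000000000">
    <title>Sample Title</title>
    <author>Author, Sample</author>
    <publisher>Sample Press</publisher>
  </book>
</books>
```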

On the document or text end of that continuum, the content to be captured and represented fits typical notions of text or hypertext materials — that is, a collection of words, graphics, and other information meant to be read or viewed as a structured object. Examples on this end of the spectrum include books, articles, magazines, narratives, training materials, and so forth.
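By contrast, text-intensive content tends to mix running text with inline markup. The fragment below is a made-up illustration (the element names and the wording are ours), showing how elements can sit inside sentences instead of wrapping tidy, separate values.

```xml
<?xml version="1.0"?>
<article>
  <title>Choosing a Markup Strategy</title>
  <!-- Text and child elements are interleaved inside para -->
  <para>Content analysis <emphasis>always</emphasis> comes before markup.
  See <link href="http://www.example.com/">the sample site</link> for details.</para>
</article>
```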

Then, too, XML can capture and represent data that describes other collections of data — for example, start and stop dates for time-sensitive files, status information, modification data, and so forth. That handy capability makes all kinds of helpful information easy to describe and use — whether stored in a document or data collection.
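As a small sketch of data that describes other data, the fragment below records status and date information about a file. Every element name and value here is hypothetical, chosen only to show the idea.

```xml
<?xml version="1.0"?>
<!-- Hypothetical description of a file; all names and values are invented -->
<fileInfo>
  <fileName>bookstore.xml</fileName>
  <status>draft</status>
  <startDate>2005-01-01</startDate>
  <stopDate>2005-12-31</stopDate>
  <lastModified>2005-01-12</lastModified>
</fileInfo>
```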

Tip

As you explore the kinds of data and documents that XML can capture and represent, remember that the term XML document embraces a whole lot more than text. XML can handle many kinds of data. In particular, it can accommodate (or point to) binary information — and that means it can supply data to other computer applications outside XML’s control. Thus, an XML document can reference anything that a computer can represent — including video, graphics, multimedia, and other specialized kinds of data!
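One way XML 1.0 lets a document point to binary data is through an unparsed entity declared in the DTD. The sketch below assumes a hypothetical image file and made-up element names; it shows the mechanism, not a recommended design.

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (title, cover)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT cover EMPTY>
  <!-- The image attribute must name an unparsed entity -->
  <!ATTLIST cover image ENTITY #REQUIRED>
  <!-- The notation tells applications what kind of data the entity holds -->
  <!NOTATION jpeg SYSTEM "image/jpeg">
  <!-- Hypothetical file path; NDATA marks the entity as unparsed (binary) -->
  <!ENTITY coverArt SYSTEM "covers/0385504209.jpg" NDATA jpeg>
]>
<book>
  <title>The Da Vinci Code</title>
  <cover image="coverArt"/>
</book>
```

The XML parser never reads the image itself; it simply passes the entity's location and notation along to the application.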

Developing Your Taxonomy

After you look at your content, you can start breaking it down into categories and subcategories. (If you haven’t already made decisions about what content to include, this process will also help you make those judgments.)

Here’s how we broke it down for our hypothetical book business:

bullet Book

• Item Number

• Title

• Author

• Publisher

• Price

• Content Type

• Format

• ISBN

bullet Sales

• Item Number

• Price

• Shipping

• Total Cost

• Date

• Source

bullet Customer

• Customer Number

• First Name

• Last Name

• Address

• City

• State

• Zip Code

• E-mail Address

• Phone Number

As you can see, some subcategories show up under more than one major category. In particular, Item Number appears as a subcategory in both the Book and the Sales categories. The Item Number is unique to each copy of a book, which makes it easy to keep track of sales and inventory.
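To make the shared subcategory concrete, here is a minimal sketch in which the same Item Number appears in a book record and in a sales record, which is what lets a system match one to the other. The bookstore and sale element names are assumptions made for this illustration.

```xml
<?xml version="1.0"?>
<bookstore>
  <book>
    <itemNumber>0385504209-1</itemNumber>
    <title>The Da Vinci Code</title>
  </book>
  <sale>
    <!-- Same value as in the book record above -->
    <itemNumber>0385504209-1</itemNumber>
    <date>January 12, 2005</date>
  </sale>
</bookstore>
```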

Testing Your Taxonomy

You might be surprised by this tidbit, but one of the best ways to start testing your taxonomy is to jump in and write some markup that describes how it should be used — after you have a good understanding of what it takes to create and use the content, of course. What you start with may only slightly resemble your finished markup language, but you do have to start somewhere.

Remember

During this process of writing markup, you’re really doing a detailed analysis of the content, which means that at the end of the day you’re going to have two (count ’em, two) results: a solid content analysis and a working draft of the markup that you need to describe it.

To create your markup, pick an invoice and start creating elements. Every XML document has one root element that contains all the other elements in the document. In our own initial round of markup, we used book as the root element because we thought that each book would have its own document. After giving it some thought, we realized that we might want to include several books in one document (such as an invoice for more than one book). Thus we made books the root element and set the book element to delineate each individual book in a document.
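A bare-bones sketch of that arrangement looks something like this, with books as the single root element and one book element per title. The second title is a placeholder.

```xml
<?xml version="1.0"?>
<!-- books is the one root element; each book delineates a single title -->
<books>
  <book>
    <title>The Da Vinci Code</title>
  </book>
  <book>
    <title>Placeholder Second Title</title>
  </book>
</books>
```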

Using trial and error for the best fit

We’re not going to lie to you: A lot of this stuff is plain old-fashioned trial and error. As you work with your markup, experiment with using combinations of elements and attributes until you get the best results. For example, initially, we used two nested elements to specify the content type for a book:

<book>

  <contentType>Fiction</contentType>

</book>

This option would work very well if we thought that a book could have more than one type of content to work with. The markup would use as many contentType elements within the book element as there were categories, with at least one required.
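If we had gone that way, a DTD rule along the following lines could express "one or more contentType elements." This is only a sketch, and the second category value is there purely to show repetition.

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!-- The plus sign means one or more contentType elements are required -->
  <!ELEMENT book (contentType+)>
  <!ELEMENT contentType (#PCDATA)>
]>
<book>
  <contentType>Fiction</contentType>
  <contentType>Mystery</contentType>
</book>
```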

In the end, we decided to go with contentType as an attribute of the book element instead, as shown here:

<book contentType="Fiction"/>

We decided on this route because we thought that we’d want to predefine the category names and require that valid documents choose one of the names from the list in our DTD or schema. This choice narrows the category to one but allows us to enforce category names.
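In a DTD, that kind of predefined list can be written as an enumerated attribute. The sketch below assumes just two allowed names, Fiction and Nonfiction, and simplifies book to an empty element; a real DTD for the bookstore would declare more.

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book EMPTY>
  <!-- contentType is required and must be one of the listed names -->
  <!ATTLIST book contentType (Fiction | Nonfiction) #REQUIRED>
]>
<book contentType="Fiction"/>
```

A validating parser would reject a document that used any other value, such as contentType="Poetry".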

As you become more comfortable with content analysis, you’ll know instinctively that some data components work best as attributes and other data components work better as elements. As you discover the details of the XML syntax for elements and attributes — and how they work together (see Part III) — you develop a firm basis for deciding what should be an element and what should be an attribute.

Remember

While creating your initial markup, you may find that you have new questions about the content that you need to answer before going on. That’s okay. (We might even say that’s a good thing, but that’s because we’re perfectionists.) Just keep in mind that analysis is part science and part intuition.

Testing your content analysis

The best way to test your final (or final draft) markup is to apply it to as many content samples as you can lay your hands on. With each test, you may find something that you need to tweak or change outright. However, after much testing, you’ll end up with a final product that serves you well.

In a perfect world, you would have talked with the system’s developer early in the process to find out what content the system needs to work with, using that knowledge while conducting data analysis. (We’ll pretend that’s exactly what you did.) Show your markup to the system developers and make sure it has the information they were expecting; expect more tweaks and changes. Feed sample documents into the system and see what happens. Tweak and change some more. Listing 3-1 shows the final draft of our bookstore markup.

Listing 3-1: bookstore.xml

<?xml version="1.0" standalone="yes"?>

<books>

 <book contentType="Fiction" format="Hardback">

  <bookInfo>

   <title>The Da Vinci Code</title>

   <author>Brown, Dan</author>

   <publisher>Doubleday</publisher>

   <isbn>0385504209</isbn>

  </bookInfo>

  <salesInfo>

   <price priceType="Retail">$24.95</price>

   <itemNumber>0385504209-1</itemNumber>

   <date>January 12, 2005</date>

   <source sourceType="Retail" />

   <shipping>$5.00</shipping>

   <cost>$29.95</cost>

  </salesInfo>

 </book>

 <totalCost>$29.95</totalCost>

 <customer custType="newRetail">

  <custNumber>5594</custNumber>

  <lastName>Blow</lastName>

  <firstName>Joe</firstName>

  <address>52 Joetta Lane</address>

  <city>Cottage Grove</city>

  <state>OR</state>

  <zip>97424</zip>

  <phone>767-3333</phone>

  <email>jblow@pacinfo.com</email>

 </customer>

</books>

TechnicalStuff

The first line in our code, <?xml version="1.0" standalone="yes"?>, is an XML declaration. You’ll learn all about XML declarations and all the other details of XML syntax in Chapter 5.

Our document went through lots of changes from our initial look at categories to our final-draft version of the markup. We deleted some subcategories and added some new ones. And you can expect even more changes as you test out your markup and design a DTD or schema for validating it.

Looking Ahead to Validation

If you want to play the eXtensible Markup Language (XML) game, you have to know the rules. But the X in XML means eXtensible; the element names you can use and define are unlimited. That is, you get to make up as many (or as few) rules as you want or need to make the markup do what you want it to. For example, you can create a document definition for a bookstore to define precisely what kind of data can go into any future XML documents that adhere to your definition.

The rules that you create with XML can dictate which elements make up an XML document, which kinds of content these elements can contain, and how such elements may be ordered. Document descriptions even support rules about which elements are optional, which ones are required, and how many times certain elements can (or must) appear.

Creating XML document descriptions enables you to state the rules that a whole class of documents must follow.
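As a simplified sketch of such rules, the DTD below fixes the order of a book's parts, requires a title, allows one or more authors, and makes the publisher optional. It is a cut-down illustration, not the full description of our bookstore markup.

```xml
<?xml version="1.0"?>
<!DOCTYPE books [
  <!-- books must contain one or more book elements -->
  <!ELEMENT books (book+)>
  <!-- In order: one title, one or more authors, then an optional publisher -->
  <!ELEMENT book (title, author+, publisher?)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT publisher (#PCDATA)>
]>
<books>
  <book>
    <title>The Da Vinci Code</title>
    <author>Brown, Dan</author>
    <publisher>Doubleday</publisher>
  </book>
</books>
```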

The two main forms of XML document descriptions in use today are DTDs and XML schemas — and there’s more about both in Part III.

Tip

DTDs work well for validating XML with text-intensive content, while XML schemas work well for validating XML with data-intensive content.
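One reason for the difference is that XML Schema has built-in data types, whereas a DTD can only say that an element holds text. The sketch below uses a hypothetical sale element to show typed values; the element names are assumptions for illustration.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="sale">
    <xs:complexType>
      <xs:sequence>
        <!-- A plain number such as 24.95, with no currency symbol -->
        <xs:element name="price" type="xs:decimal"/>
        <!-- An ISO-style date such as 2005-01-12 -->
        <xs:element name="date" type="xs:date"/>
        <xs:element name="custNumber" type="xs:positiveInteger"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```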

Remember

Before you can actually validate your XML document, you need to make sure it’s well formed; in other words, that it follows the rules of XML syntax. You’ll learn these rules in Chapters 4 and 5. After your XML document is well formed, you can then validate it against your XML document description (i.e., your DTD or schema) to make sure that your document follows the rules in your document description. There are pros and cons to validating your documents, and you’ll find out about all the angles to consider in Part III.
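To give a quick feel for the difference, compare the two fragments below. The first breaks a basic syntax rule, so no parser will accept it, let alone validate it; the second is well formed and is ready to be checked against a DTD or schema.

```xml
<!-- Not well formed: title and author overlap instead of nesting -->
<book><title>The Da Vinci Code<author></title>Brown, Dan</author></book>

<!-- Well formed: every element closes inside its parent -->
<book><title>The Da Vinci Code</title><author>Brown, Dan</author></book>
```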

When you’ve got a pretty firm handle on all the ins and outs of content analysis, it’s time to tackle the rules for creating XML markup. Chapter 4 makes that transition into XML syntax via another markup language, XHTML.