Chapter 4. FDs and BCNF (Informal)

It is downright sinful to teach the abstract before the concrete

—Z. A. Melzak: Companion to Concrete Mathematics

As we saw in the previous chapter, Boyce/Codd normal form (BCNF for short) is defined in terms of functional dependencies. In fact, BCNF is really the normal form with respect to functional dependencies (just as—to get ahead of ourselves for a moment—5NF is really the normal form with respect to join dependencies). The overall purpose of the present chapter is to explain this observation; as the chapter title indicates, however, the various explanations and associated definitions are all (intentionally, of course) a little informal at this stage. (Informal, but not inaccurate; I won’t tell any deliberate lies.) A more formal treatment of the material appears in the next chapter.

FIRST NORMAL FORM

To begin at the beginning: Let relation r have attributes A1, ..., An, of types T1, ..., Tn, respectively. By definition, then, if tuple t appears in relation r, then the value of attribute Ai in t is of type Ti (i = 1, ..., n). For example, if r is the current value of the shipments relvar SP (see Figure 1-1 in Chapter 1), then every tuple in r has an SNO value that’s of type CHAR, a PNO value that’s also of type CHAR, and a QTY value that’s of type INTEGER.

Now I can give a precise definition of first normal form:^[34]

Definition: Let relation r have attributes A1, ..., An, of types T1, ..., Tn, respectively. Then r is in first normal form (1NF) if and only if, for all tuples t appearing in r, the value of attribute Ai in t is of type Ti (i = 1, ..., n).

In other words, every relation is in 1NF, by definition! To say it in different words, 1NF just means each tuple in the relation contains exactly one value, of the appropriate type, for each attribute. Observe in particular that 1NF places no limitations on what those attribute types are allowed to be.^[35] They can even be relation types; that is, relations with relation valued attributes—RVAs for short—are legal (you might be surprised to hear this, but it’s true). An example is given in Figure 4-1.

Figure 4-1. A relation with a relation valued attribute

I’ll have more to say about RVAs in just a moment, but first I need to get a couple of small points out of the way. To start with, I need to define what it means for a relation to be normalized:

Definition: Relation r is normalized if and only if it’s in 1NF.

In other words, normalized and first normal form mean exactly the same thing—all normalized relations are in 1NF, all 1NF relations are normalized. The reason for this slightly strange state of affairs is that normalized was the original (historical) term; the term 1NF wasn’t introduced until people started talking about 2NF and higher levels of normalization, when a term was needed to describe relations that weren’t in one of those higher normal forms. Of course, it’s common nowadays for the term normalized to be used to mean some higher normal form (often 3NF specifically); indeed, I’ve used it that way myself in earlier chapters, as you might have noticed. Strictly speaking, however, that usage is sloppy and incorrect, and it’s probably better avoided unless there’s no chance of confusion.

Turning to my second “small point”: Observe now that all of the discussions in this section so far (the definitions in particular) have been framed in terms of relations, not relvars. But since every relation that can ever be assigned to a relvar is in 1NF by definition, no harm is done if we extend the 1NF concept in the obvious way to apply to relvars as well—and it’s desirable to do so, because (as we’ll see) all of the other normal forms are defined to apply to relvars, not relations. In fact, it could be argued that the reason 1NF is defined in terms of relations and not relvars has to do with the fact that it was, regrettably, many years before that distinction (i.e., the distinction between relations and relvars) was explicitly drawn, anyway.

Back to RVAs. I’ve said, in effect, that relvars with RVAs are legal; but now I need to add that from a design point of view, at least, such relvars are usually (not always) contraindicated. Now, this fact doesn’t mean you should avoid RVAs entirely (in particular, there’s no problem with query results that include RVAs)—it just means we don’t usually want RVAs “designed into the database,” as it were. I don’t want to get into a lot of detail on this issue in this book; let me just say that relvars with RVAs tend to look very much like the hierarchic structures found in older, nonrelational systems like IMS,^[36] and all of the old problems that used to arise with hierarchies therefore raise their head once again. Here for reference is a list of some of those problems:

The fundamental point is that hierarchies are asymmetric; thus, while they might make some tasks “easier,” they certainly make others more difficult.
As a specific illustration of the previous point, queries in particular are asymmetric, as well as being more complicated than their symmetric counterparts. For example, consider what’s involved in formulating the queries “Get part numbers for parts supplied by supplier S2” and “Get supplier numbers for suppliers who supply part P2” against the relation of Figure 4-1. The natural language versions of these queries are symmetric with respect to each other, but their formulations in SQL—or Tutorial D, or some other formal language—most certainly aren’t (exercise for the reader).
Similar remarks apply to integrity constraints.
Similar remarks apply to updates, but more so.
There’s no guidance, in general, as to how to choose the “best” hierarchy.
Even “natural” hierarchies like organization charts and bill of materials structures are still best represented, usually, by nonhierarchic designs.

Well, by now you might be wondering, if all relvars are in 1NF by definition, what it might possibly mean not to be in 1NF. Perhaps surprisingly, this question does have a sensible answer. The point is, today’s commercial DBMSs don’t properly support relvars (or relations) at all—instead, they support a construct that for convenience I’ll call a table, though by that term I don’t necessarily mean to limit myself to the kinds of tables found in SQL systems specifically. And tables, as opposed to relvars, might indeed not be in 1NF. To elaborate:

Definition: A table is in first normal form (1NF)—equivalently, such a table is normalized—if and only if it’s a direct and faithful representation of some relvar.

So of course the question is: What does it mean for a table to be a direct and faithful representation of a relvar? There are five basic requirements, all of which are immediate consequences of the fact that the value of a relvar at any given time is (of course) always a relation specifically:

There’s no top to bottom ordering to the rows.
There’s no left to right ordering to the columns.
There are no duplicate rows.
Every row and column intersection contains exactly one value of the applicable type, and nothing else.
All columns are regular (see below).

Requirements 1-3 are self-explanatory,^[37] but the other two merit a little more explanation, perhaps. Here’s an example of a table that violates Requirement 4:

The violation occurs because values in the PNO column aren’t individual part numbers as such but, rather, groups of part numbers (the group for S2 contains two part numbers, that for S3 contains one, and that for S4 contains three).

Note: The violation of Requirement 4 in the foregoing example would perhaps be clearer if column PNO, instead of being defined to be of type CHAR, was defined to be of some user defined type, perhaps also called PNO. Then it might be more obvious that values in that column weren’t of that type per se but were rather of type “PNO group.” Such considerations point the way to a reasonably precise definition of the term repeating group:

Definition: Column C is a repeating group column (also known as a multivalued attribute) if and only if it’s defined to be of type T but the values that appear in that column aren’t values of type T but are, rather, groups (in other words, sets or lists or arrays or ...) of values of type T.

If you’re still confused over the difference between repeating group columns and RVAs, take another look at Figure 4-1. The RVA in that figure—viz., attribute PQ—is not a repeating group column (relations don’t allow repeating groups!). Rather, it’s an attribute whose type happens to be a certain relation type—to be specific, and using Tutorial D syntax, the relation type

     RELATION { PNO CHAR , QTY INTEGER }

—and values of that attribute are, precisely, relations of this type. So the relation in that figure does abide by the definition of 1NF.

Incidentally, Requirement 4 also means nulls are prohibited (nulls aren’t values).

Turning now to Requirement 5 (“All columns are regular”): What this requirement means is, first, that every column has a name, unique among the column names that apply to the table in question; second, that no row is allowed to contain anything extra, over and above the regular column values prescribed under Requirement 4. For example, there are no “hidden” columns that can be accessed only by special operators instead of by regular column references (i.e., by column name), and there are no columns that cause invocations of regular operators on rows to have irregular effects. In particular, therefore, there are no identifiers other than regular key values (no hidden row IDs or “object IDs,” as are unfortunately found in some SQL products today), and no hidden timestamps as are found in certain “temporal database” proposals in the literature.

To sum up: If any of the five requirements are violated, the table in question doesn’t “directly and faithfully” represent a relvar, and all bets are off. In particular, relational operators such as join are no longer guaranteed to work as expected (as you’ll already know if, as I assume, you’re familiar with SQL). The relational model deals with relations (meaning, more precisely, relation values and relation variables), and relations only.

^[34] One reviewer accused me of rewriting history with this definition. Guilty as charged, perhaps—but I do have my reasons; to be specific, earlier “definitions” of the concept were all, in my opinion, either too vague to be useful or flat out wrong. See SQL and Relational Theory for further discussion.

^[35]The relational model does, though. To paraphrase the answer to Exercise 2.2 in Appendix D, there are two small exceptions, both of which I’ll simplify just slightly here: First, no relation in the database can have an attribute of any pointer type; second, if relation r is of type T, then no attribute of r can itself be of type T (think about it!). However, these exceptions have nothing to do with 1NF as such.

^[36]And, perhaps more to the point, newer ones like XML (see Exercise 4.12).

^[37]Though I note in passing that Requirement 2 in particular effectively means SQL tables are never normalized—except, possibly, if they happen to have just one column. However, the disciplines recommended in SQL and Relational Theory allow you, among other things, to treat such tables (for the most part) as if they were normalized after all.