If you can’t beat them, join them
Just as Boyce/Codd normal form is defined in terms of functional dependencies, so fifth normal form (5NF) is defined in terms of join dependencies (JDs);[87] as noted in Chapter 4, in fact, 5NF is the normal form with respect to JDs, just as BCNF is the normal form with respect to FDs. And the treatment of these ideas in this part of the book therefore parallels the treatment of BCNF and FDs in Part II. In other words, I plan to treat the material both formally, in Chapter 10, and informally in the present chapter.
Let me immediately add that although 5NF is indeed “the” normal form with respect to JDs, this state of affairs shouldn’t be taken to mean that 5NF is the ultimate goal of the normalization process. Au contraire, in fact: There are at least two other normal forms that have a better claim to that title, as we’ll see in Chapter 13. From a pedagogical point of view, however—as well as from a historical one—I think it’s desirable to discuss 5NF in detail first. (I mention this point simply in order to avoid giving a false impression; one of my reviewers felt I should have presented the material in a different sequence, but I don’t agree.)
Now, in previous writings I’ve tended to regard JDs as if they were just a generalized kind of FD. I now think this perception is wrong, or at least misleading; I now think it’s better to regard JDs as a completely different phenomenon. Of course, FDs and JDs are both dependencies (i.e., constraints), and they do resemble each other in certain respects; in particular, the fact that a certain JD holds in relvar R implies that R can be nonloss decomposed in certain ways, just as the fact that a certain FD holds in relvar R also implies that R can be nonloss decomposed in certain ways. It’s also true that every FD implies a JD, so that if some FD F holds in relvar R, a certain JD J holds in R also. But not all JDs are implied by FDs; in fact, to speak very loosely—but I must emphasize that what I’m about to say is extremely imprecise—we might say that 5NF has to do with JDs that aren’t implied by FDs. That is, it’s if some relvar R is in BCNF, but is subject to some JD that’s not implied by FDs, that the notion of 5NF might be relevant.
Now, a relvar is in BCNF if and only if all FDs to which it’s subject are implied by keys. As you’d probably expect, therefore, a relvar is in 5NF if and only if all JDs to which it’s subject are implied by keys.[88] However, this latter notion—i.e., the notion of JDs being implied by keys—is a bit trickier to pin down than its FD counterpart; in fact, there’s a very rich theory surrounding these ideas, as we’ll soon see, and some of that theory can be a little overwhelming (not to say confusing) at first. You need to keep a clear head! As someone much more knowledgeable in these matters than I am once said to me: JDs are very mysterious.
So much for the preamble; now let’s get down to specifics.
Most of the time in this book so far, I’ve been making a tacit assumption: namely, I’ve been assuming that when we decompose some relvar, we always do so by replacing that relvar by exactly two of its projections.[89] (Note that Heath’s Theorem, which provides the formal underpinning for most of what I’ve had to say so far regarding nonloss decomposition, specifically addresses decomposition into exactly two projections.) What’s more, that assumption was fully warranted, so long as our target was only BCNF—in other words, it successfully carried us as far as that specific target. So you might be surprised to learn that there exist relvars that can’t be nonloss decomposed into two projections but can be nonloss decomposed into three (or maybe more than three).
As an aside, I note that, remarkably enough, Codd gave an example in 1969, in his very first paper on the relational model (see Appendix C), that showed he was aware of this possibility. However, that example was apparently overlooked by most of the paper’s original readers; certainly it seems to have come as a surprise to the research community when the possibility was rediscovered several years later (in 1977, to be precise).
Now, I said earlier, albeit loosely, that 5NF had to do with JDs that aren’t implied by FDs. I can now add, though again speaking very loosely, that it has to do with relvars that can’t be nonloss decomposed into two projections but can be nonloss decomposed into three or more. In other words, it’s when these circumstances arise—i.e., when there are JDs that aren’t implied by FDs, and relvars that can only be nonloss decomposed into more than two projections—that you do really have to come to grips with JDs and 5NF.
So what exactly do we mean when we say some JD holds in some relvar? Here’s a definition:
Definition: Let X1, ..., Xn be subsets of the heading H of relvar R; then the join dependency (JD)
holds in R if and only if R can be nonloss decomposed into its projections on X1, ..., Xn (i.e., if and only if every legal value r of R is equal to the join of the corresponding projections r1, ..., rn). X1, ..., Xn are the components of the JD, and the JD overall can be read as “star X1, ..., Xn” or “join X1, ..., Xn”—though I hasten to add that “join” really isn’t the mot juste here, because the join operator (at least as usually understood) joins relations, and X1, ..., Xn aren’t relations but headings.
By way of a simple example, consider the suppliers relvar S once again. As we know, that relvar is subject to the FD {CITY} → {STATUS}, and so Heath’s Theorem tells us it can be nonloss decomposed into its projections on {SNO,SNAME,CITY} and {CITY,STATUS}. In other words, the following JD holds in that relvar:
{ { SNO , SNAME , CITY } , { CITY , STATUS } }
Points arising:
Note that it follows from the definition that the union of the components X1, ..., Xn must be equal to H (i.e., every attribute of H must appear in at least one of those components), for otherwise R couldn’t possibly be equal to the join of the projections that correspond to those components.
Different writers use different symbols to denote a JD; I use a special kind of star, but the symbol ⋈ (“bow tie”) is more frequently encountered in the research literature.
It might help to point out that to say some JD holds is equivalent to saying that if we join the indicated projections together, we’ll never get any “spurious” tuples (as Exercise 3.2 called them).
The following observation might also be helpful. I’ll explain it in terms of a simple, though slightly abstract, example. Let relvar R have attributes A, B, C, and D (only), and let the JD {AB,BC,CD} (“Heath notation”—see Chapter 7) hold in R. Also, let me use the symbol “∈” to mean “appears in” (as in the answer to Exercise 5.4 in Appendix D). Then to say the given JD holds in R is equivalent to saying the following:[90]
if
EXISTSc1
( EXISTSd1
( (a
,b
,c1
,d1
) ∈R
) ) AND EXISTSa2
( EXISTSd2
( (a2
,b
,c
,d2
) ∈R
) ) AND EXISTSa3
( EXISTSb3
( (a3
,b3
,c
,d
) ∈R
) )
then
Explanation: Let there be a tuple in R with A = a and B = b and a tuple in R with B = b and C = c and a tuple in R with C = c and D = d. Then the tuples (a,b), (b,c), and (c,d) will appear in the projections of R on AB, BC, and CD, respectively, and so the tuple (a,b,c,d) will appear when we join these three projections together. Moreover, the converse is clearly true as well: If the tuple (a,b,c,d) appears in R, then the tuples (a,b), (b,c), and (c,d) will certainly appear in those three projections (and so that If in the foregoing formal statement could in fact be replaced by If and only if).
As a simple illustration of this last point, to say the following JD holds in relvar S—
{ { SNO , SNAME , CITY } , { CITY , STATUS } }
—is to say the tuple (s,n,t,c) appears in S if and only if there’s a tuple in S with SNO = s and SNAME = n and CITY = c and there’s a tuple in S with CITY = c and STATUS = t.
To continue with this example for a moment, the fact that the specified JD holds in relvar S is a logical consequence of Heath’s Theorem, as we know. In fact, we can now restate Heath’s Theorem as follows:
As stated earlier, therefore, FDs imply JDs—but not all JDs are implied by FDs, as we’ll see. Before I elaborate on this point, however, let me stress the requirement that the union of the components of a given JD must be equal to the pertinent heading. No analogous requirement applies to FDs; with FDs, the left and right sides don’t have to be such that their union is equal to the pertinent heading, they only have to be subsets of that heading. This distinction might help to illustrate the point (at least intuitively) that JDs and FDs really are different in kind, in a sense.
Now, the JD in the foregoing example—
{ { SNO , SNAME , CITY } , { CITY , STATUS } }
—is binary: It has two components, and it corresponds to a nonloss decomposition into two projections. Here by contrast is another JD that holds in relvar S:
{ { SNO , SNAME } , { SNO , CITY } , { CITY , STATUS } }
This one is ternary, but it’s derived, in effect, by “cascading” two binary ones:
First, we already know the binary JD {{SNO,SNAME,CITY},{CITY,STATUS} holds in S.
But the FD {SNO} → {SNAME} holds in the projection of S on {SNO,SNAME,CITY} (corresponding to one of the components of that binary JD),[91] and so the binary JD {{SNO,SNAME},{SNO,CITY}}, holds in that projection.
It follows that the given ternary JD holds in the original relvar. By contrast, in the section immediately following, I’ll give an example of a ternary JD that’s not derived by cascading binary ones, and hence an example of a relvar that can be nonloss decomposed into three projections and not into two.
[87] So is 4NF, but I’m going to ignore 4NF (mostly) until Chapter 12.
[88] This very informal definition is the only one I’ll be giving for 5NF in the present chapter.
[89] Of course, I’m referring here to an individual step in the overall process. Clearly, repeated steps—i.e., repeated individual decompositions—will yield a result consisting of more than two projections, in general, even if each individual decomposition is into just two.
[90] The keyword EXISTS in the following definition denotes the existential quantifier. Refer to SQL and Relational Theory if you need a tutorial on such matters.
[91] I’m appealing here to an easily proved theorem: viz., that the FD X → Y holds in some projection of relvar R if and only if it holds in R itself (see Exercise 12.5 in Chapter 12).