“We shape our buildings: thereafter, they shape us.”
—SIR WINSTON CHURCHILL
Most of you reading this book are probably working with an existing database structure implemented on your favorite (I hope) RDBMS program. It’s hard for me to assume, at this point, whether or not you—or the person who developed the database—really had the necessary knowledge and skills or the time to design the database properly. Assuming the worst, you probably have a number of tables that could use some fine-tuning. Fortunately, you’re about to learn some techniques that will help you get your database in shape and will ensure that you can easily retrieve the information you need from your tables.
You might wonder why I’m discussing database design topics in this book and why they’re included in a beginning chapter. The reason is simple: If you have a poorly designed database structure, many of the SQL statements you’ll learn to build in the remainder of the book will be, at best, difficult to implement or, at worst, relatively useless. However, if you have a well-designed database structure, the skills you learn in this book will serve you well.
This chapter will not teach you the intricacies of database design, but it will help you get your database in relatively good shape. I highly recommend that you read through this chapter so that you can make certain your table structures are sound.
Note: It is important to understand that I am about to discuss the logical design of the database. I’m not teaching you how to create or implement a database
in a database management system that supports SQL because, as I mentioned in the Introduction,
these subjects are beyond the scope of this book.
If your database structure isn’t sound, you’ll have problems retrieving seemingly simple information from your database, it will be difficult to work with your data, and you’ll cringe every time you need to add or delete columns in your tables. Other aspects of the database, such as data integrity, table relationships, and the ability to retrieve accurate information, are affected when you have poorly designed structures. These issues are just the tip of the iceberg. And it goes on! Make sure you have sound structures to avoid all this grief.
You can avoid many of these problems if you properly design your database from the beginning. Even if you’ve already designed your database, all is not lost. You can still apply the following techniques and gain the benefits of a sound structure. However, you must be aware that the quality of your final structures is in direct proportion to the amount of time you invest in fine-tuning them. The more care and patience you give to applying the techniques, the more you can guarantee your success.
Let’s now turn to the first order of business in shaping up your structures: working with the columns.
Because columns are the most basic structures in a database, you must ensure that they are in tip-top shape before you begin fine-tuning the tables as a whole. Fixing the columns usually will eliminate some existing problems with a given table and help you avoid any potential problems that might have arisen.
As you learned in the previous chapter, a column represents a characteristic of the subject of the table to which it belongs. If you give the column an appropriate name, you should be able to identify the characteristic it’s supposed to represent. A name that is ambiguous, vague, or unclear is a sure sign of trouble and suggests that the purpose of the column has not been carefully thought out. Use the following checklist to test each of your column names:
• Is the name descriptive and meaningful to your entire organization? If users in several departments are going to work with this database, make certain you choose a name that is meaningful to everyone who accesses this column. Semantics is a funny thing, and if you use a word that has a different meaning to different groups of people, you’re just inviting trouble.
• Is the column name clear and unambiguous? PhoneNumber is a column name that can be very misleading. What kind of phone number is this column supposed to represent? A home phone? A work phone? A cellular phone? Learn to be specific. If you need to record each of these types of phone numbers, then create HomePhone, WorkPhone, and CellPhone columns.
Note: An argument could be made that HomePhone, WorkPhone, and CellPhone are actually a
repeating group that should be moved to a separate table that could hold multiple
phone numbers for the related customer or employee. Such a table would also have a
column to indicate the type of phone, and it would be possible for each related person
or company to have an unlimited list of phone numbers. You’ll see more about this
when I discuss table structures in the next section.
In addition to making your column names clear and unambiguous, be sure that you don’t use the same column name in several tables. Let’s say you have three tables called Customers, Vendors, and Employees. No doubt you will have City and State columns in each of these tables, and the columns will have the same names in all three tables. There isn’t a problem with this until you have to refer to one particular column. How do you distinguish between, say, the City column in the Vendors table, the City column in the Customers table, and the City column in the Employees table? The answer is simple: Add a short prefix to each of the column names. For example, use the name VendCity in the Vendors table, CustCity in the Customers table, and EmpCity in the Employees table. Now you can easily make a clear reference to any of these columns. (You can use this technique on any generic column such as FirstName, LastName, and Address.)
Here’s the main thing to remember: Make sure that each column in your database has a unique name and that it appears only once in the entire database structure. The only exception to this rule is when a column is being used to establish a relationship between two tables.
Note: The degree to which you use prefixes within a table is a matter of style. When a
table contains generic column names, some database designers will choose to prefix
the generic names only, while others elect to prefix all of the column names within the table. Regardless of the prefix method you use, it
is very important that you use it consistently throughout the database structure.
• Did you use an acronym or abbreviation as a column name? If you did, change it! Acronyms can be hard to decipher and are easily misunderstood. Imagine seeing a column named CAD_SW. How would you know what the column represents? Use abbreviations sparingly, and handle them with care. Use an abbreviation only if it supplements or positively enhances the column name. It shouldn’t detract from the meaning of the column name.
• Did you use a name that implicitly or explicitly identifies more than one characteristic? These types of names are easy to spot because they typically use the words and or or. Column names that contain a back slash (\), a hyphen (-), or an ampersand (&) are dead giveaways as well. If you have columns with names such as Phone\Fax or Area or Location, review the data that they store and determine whether you need to deconstruct them into smaller, distinct columns.
Note: The SQL Standard defines a regular identifier as a name that must begin with a letter and can contain only letters, numbers, and
the underscore character. Spaces are not allowed. It also defines a delimited identifier as a name—surrounded by double quotes—that must start with a letter and can contain
letters, numbers, the underscore character, spaces, and a very specific set of special
characters. I recommend that you use the regular identifier naming convention exclusively
for your column names because many SQL implementations support only the regular identifier
naming convention.
After using this checklist to revise your column names, you have one task left: Make certain you use the singular form of the column name. A column with a plural name such as Categories implies that it might contain two or more values for any given row, which is not a good idea. A column name is singular because it represents a single characteristic of the subject of the table to which it belongs. A table name, on the other hand, is plural because it represents a collection of similar objects or events. You can distinguish table names from column names quite easily when you use this naming convention.
Note: Although I recommended that you use the SQL Standard naming convention, keep in mind
that the column names might change when you (or the database developer in charge of
implementing the database) begin implementing the database into a specific RDBMS application.
The names will need to conform to the naming convention that developers commonly use
for the RDBMS.
Now that you’ve straightened out the column names, let’s focus on the structure of the column itself. You might be fairly sure that your columns are sound, but you can still do a few things to make certain they’re built as efficiently as possible. Test your columns against the following checklist to determine whether or not your columns need a little more work:
• Make sure the column represents a specific characteristic of the subject of the table. The idea here is to determine whether the column truly belongs in the table. If it isn’t germane to the table, remove it, or perhaps move it to another table. The only exceptions to this rule occur when the column is being used to establish a relationship between this table and other tables in the database or when it has been added to the table in support of some task required by a database application. For example, in the Classes table in Figure 2-1, the StaffLastName and StaffFirstName columns are unnecessary because of the presence of the StaffID column. StaffID is being used to establish a relationship between the Classes table and the Staff table, and you can view data from both tables simultaneously by using a view or an SQL SELECT query. If you have unnecessary columns in your tables, you can either remove them completely or use them as the basis of a new table if they don’t appear anywhere else in the database structure. (I’ll show you how to do this later in this chapter.)
• Make certain that the column contains only a single value. A column that can potentially store several instances of the same type of value is known as a multivalued column. (A column that contains multiple phone numbers is an example of a multivalued column.) Likewise, a column that can potentially store two or more distinct values is known as a multipart column. (A column that contains both an item number and an item description is an example of a multipart column.) Multivalued and multipart columns can wreak havoc in your database, especially when you try to edit, delete, or sort the data. When you ensure that each column stores only a single value, you go a long way toward guaranteeing data integrity and accurate information. But for the time being, just try to identify any multivalued or multipart columns and make a note of them. You’ll learn how to resolve them in the next section.
• Make sure the column does not store the result of a calculation or concatenation. Calculated columns are not allowed in a properly designed table. The issue here is the value of the calculated column itself. A column, unlike a cell in a spreadsheet, does not store an actual calculation. When the value of any part of the calculation changes, the result value stored in the column is not updated. The only ways to update the value are to do so manually or to write some procedural code that will do it automatically. Either way, it is incumbent on the user or you, the developer, to make certain the value is updated. The preferred way to work with a calculation, however, is to incorporate it into a SELECT statement. You’ll learn the advantages of dealing with calculations in this manner when you get to Chapter 5, “Getting More Than Simple Columns.”
• Make certain the column appears only once in the entire database. If you’ve made the common mistake of inserting the same column (for example, CompanyName) into several tables within the database, you’re going to have a problem with inconsistent data. This occurs when you change the value of the column in one table and then you forget to make the same modification wherever else the same column appears. Avoid this problem entirely by ensuring that a column appears only once in the entire database structure. (The only exception to this rule is when you’re using a column to establish a relationship between two tables.)
Note: The most recent versions of some commercially available database management systems
allow you to define a column that is the result of a calculated expression. You can
define calculated columns if your database system has this feature, but be aware that
the database system requires additional resources to keep the calculated value current
any time the value of one of the columns in the expression changes or you fetch a
row containing a calculated column.
As I mentioned earlier, multipart and multivalued columns will wreak havoc with data integrity, so you need to resolve them to avoid any potential problems. Deciding which to resolve first is purely arbitrary, so I’ll begin with multipart columns.
You’ll know if you have a multipart column by answering some very simple questions: “Can I take the current value of this column and break it up into smaller, more distinct parts?” “Will I have problems extracting a specific piece of information because it is buried in a column containing other information?” If your answer to either question is “Yes,” you have a multipart column. Figure 2-2 shows a poorly designed table with several multipart columns.
The Customers table shown in the figure contains two multipart columns: CustomerName, and StreetAddress. There’s also one column that is potentially multipart: PhoneNumber. How can you sort the data by last name or ZIP Code or search on state? You can’t because these values are embedded in columns that contain other information. You can see that each column can be broken into smaller columns. For example, CustomerName can be broken into two distinct columns—CustFirstName and CustLastName. (Note that I’m using the naming convention discussed earlier in this chapter when I add the prefix Cust to the FirstName and LastName columns.) When you identify a multipart column in a table, determine how many parts there are to the value it stores, and then break the column into as many smaller columns as appropriate. Figure 2-3 shows how to resolve two of the multipart columns in the Customers table.
Note: Along with breaking down CustomerName and StreetAddress, it might also be a good
idea in a database storing phone numbers in North America to break PhoneNumber into
two distinct columns—area code and the local phone number. In other countries, separating
out the city code portion of the phone number might be useful. In truth, most business
databases store a phone number as one column, but separating out the area or city
code might be important for databases that analyze demographic data. Unfortunately,
I couldn’t demonstrate this in Figure 2-3 due to space limitations.
Sometimes you might have difficulty recognizing a multipart column. Take a look at the Instruments table shown in Figure 2-4. At first glance, there do not seem to be any multipart columns. On closer inspection, however, you will see that InstrumentID is actually a multipart column. The value stored in this column represents two distinct pieces of information: the category to which the instrument belongs—such as AMP (amplifier), GUIT (guitar), and MFX (multi-effects unit)—and its identification number. You should separate these two values and store them in their own columns to ensure data integrity. Imagine the difficulty of updating this column if the MFX category changed to MFU. You would have to write code to parse the value in this column, test for the existence of MFX, and then replace it with MFU if it does exist within the parsed value. It’s not so much that you couldn’t do this, but you’d definitely be working harder than necessary, and you shouldn’t have to go through this at all if your database is properly designed. When you have columns such as the one in this example, break them into smaller columns so that you will have sound, efficient column structures.
Resolving multipart columns is not very hard at all, but resolving multivalued columns can be a little more difficult and will take some work. Fortunately, identifying a multivalued column is easy. Almost without exception, the data stored in this type of column contains some commas, semicolons, or other common separator characters. The separator characters are used to separate the various values within the column itself. Figure 2-5 shows an example of a multivalued column.
In this example, each pilot is certified to fly any number of planes, and those certifications are stored in a single column called Certifications. The manner in which the data is stored in this column is very troublesome because you are bound to encounter the same type of data integrity problems associated with multipart columns. When you look at the data more closely, you’ll see that it will be difficult for you to perform searches and sorts on this column in an SQL query. Before you can resolve this column in the appropriate manner, you must first understand the true relationship between a multivalued column and the table to which it is originally assigned.
The values in a multivalued column have a many-to-many relationship with every row in its parent table: One specific value in a multivalued column can be associated with any number of rows in the parent table, and a single row in the parent table can be associated with any number of values in the multivalued column. In Figure 2-5, for example, a specific aircraft in the Certifications column can be associated with any number of pilots, and a single pilot can be associated with any number of aircraft in the Certifications column. You resolve this many-to-many relationship as you would any other many-to-many relationship within the database—with a linking table.
You can create the linking table by using the multivalued column and a copy of the primary key column from the original table as the basis for the new table. Give the new linking table an appropriate name, and designate both columns as a composite primary key. (In this case, the combination of the values of both columns will uniquely identify each row within the new table.) Now you can associate the values of both columns in the linking table on a one-to-one basis. Figure 2-6 shows an example of this process using the Pilots table shown in Figure 2-5.
Contrast the entries for Sam Alborous (PilotID 25100) in both the old Pilots table and the new Pilot_Certifications table. The major advantage of the new linking table is that you can now associate any number of certifications with a single pilot. Asking certain types of questions is now much easier as well. For example, you can determine which pilots are certified to fly a Boeing 747 aircraft or retrieve a list of certifications for a specific pilot. You’ll also find that you can sort the data in any order you wish, without any adverse effects.
Note: Some database management systems—most notably Microsoft Office Access 2007 and later—allow
you to explicitly define multivalued columns. The database system does this, however,
by creating a hidden system table similar to the linking table shown in Figure 2-6. Frankly, I like to see and control my table designs, so I recommend that you create
the correct data structures yourself rather than depend on a feature in your database
system.
Your columns will be in good shape when you follow the procedures presented in this section. Now that you’ve refined the columns, let’s turn to our second order of business and take a look at the table structures.
Tables serve as the basis for any SQL query you create. You’ll soon find that poorly designed tables pose data integrity problems and are difficult to work with when you create multi-table SQL queries. As a result, you must make certain that your tables are structured as efficiently as possible so that you can easily retrieve the information you need.
In the section on columns, you learned how important it is for a column to have an appropriate name and why you should give serious thought to naming your columns. You’ll soon learn that the same applies to tables as well. By definition, a table should represent a single subject. If it represents more than one subject, it should be divided into smaller tables. The name of the table must clearly identify the subject the table represents. You can be confident that the subject of the table has not been carefully thought out if a table name is ambiguous, vague, or unclear. Make sure your table names are sound by checking them against the following checklist:
• Is the name unique and descriptive enough to be meaningful to your entire organization? Giving your table a unique name ensures that each table in the database represents a different subject and that everyone in the organization will understand what the table represents. Defining a unique and descriptive name does take some work on your part, but it’s well worth the effort in the long run.
• Does the name accurately, clearly, and unambiguously identify the subject of the table? When the table name is vague or ambiguous, you can bet that the table represents more than one subject. For example, Dates is a vague table name. It’s hard to determine exactly what this table represents unless you have a description of the table at hand. Let’s say this table appears in a database used by an entertainment agency. If you inspect this table closely, you’ll probably find that it contains dates for client meetings and booking dates for the agency’s stable of entertainers. This table clearly represents two subjects. You can resolve this issue by dividing the table into two new tables and give each table an appropriate name, such as Client_Meetings and Entertainer_Schedules.
• Does the name contain words that convey physical characteristics? Avoid using words such as File, Record, and Table in the table name because they introduce a level of confusion that you don’t need. A table name that includes this type of word is very likely to represent more than one subject. Consider the name Employee_Record. On the surface, there doesn’t appear to be any problem with this name. When you think about what an employee record is supposed to represent, however, you’ll realize that there are potential problems. The name contains a word that we’re trying hard to avoid, and it potentially represents three subjects: employees, departments, and payroll. With this in mind, split the original table (Employee_Record) into three new tables, one for each of the three subjects.
• Did you use an acronym or abbreviation as a table name? If the answer to this question is “Yes,” change the name right now! Abbreviations rarely convey the subject of the table, and acronyms are usually hard to decipher. Suppose your company database has a table named SC. How do you know what the table represents without knowing the meaning of the letters themselves? The fact is that you can’t easily identify the subject of the table. What’s more, you might find that the table means different things to different departments in the company. (Now, this is scary.) The folks in Personnel think it stands for Steering_Committees; the Information Systems staff believes it to be System_Configurations, and the people in Security insist that it represents Security_Codes. This example clearly illustrates why you should avoid using abbreviations and acronyms in a table name.
• Did you use a name that implicitly or explicitly identifies more than one subject? This is one of the most common mistakes you can make with a table name, and it is relatively easy to identify. This type of name typically contains the words and or or and characters such as the back slash (\), hyphen (-), or ampersand (&). Facility\Building and Department or Branch are typical examples. When you name a table in this manner, you must clearly identify whether it truly represents more than one subject. If it does, deconstruct it into smaller tables, and then give the new tables appropriate names.
Note: Remember that the SQL Standard defines a regular identifier as a name that must begin with a letter and can contain only letters, numbers, and
the underscore character. Spaces are not allowed. It also defines a delimited identifier as a name—surrounded with double quotes—that must start with a letter and can contain
letters, numbers, the underscore character spaces, and a very specific set of special
characters. I recommend that you use the regular identifier naming convention exclusively
for your table names because many SQL implementations support only the regular identifier
naming convention.
After you’ve finished revising your table names, you have one more task to perform: Check each table name again and make certain you used the plural form of the name. You use the plural form because a table stores a collection of instances of the subject of the table. For example, an Employees table stores the data for many employees, not just one employee. Using the plural form also helps you to distinguish a table name from a column name.
Note: The guideline for using a plural form for a table name is a particularly good one
while you’re working on the logical design of the database. It makes it very easy
to differentiate table names from column names, especially when you’re displaying
them on a projection screen or when you’ve written them all across a whiteboard in
a conference room.
Keep in mind, however, that the table names might change when you (or the database developer in charge of implementing the database) begin implementing the database into a specific RDBMS application. The names will then need to conform to the naming convention that developers commonly use for the RDBMS.
Let’s focus on the table structures now that you’ve revised the table names. It’s imperative that the tables are properly designed so that you can efficiently store data and retrieve accurate information. The time you spend ensuring your tables are well built will pay dividends when you need to create complex multi-table SQL queries. Use the following checklist to determine whether your table structures are sound:
• Make sure the table represents a single subject. Yes, I know, I’ve said this a number of times already, but I can’t overemphasize this point. As long as you guarantee that each of your tables represents a single subject, you greatly reduce the risk of potential data integrity problems. Also, remember that the subject represented by the table can be an object or event. By “object” I mean something that is tangible, such as employees, vendors, machines, buildings, or departments, whereas an “event” is something that happens at a given point in time that has characteristics you want to record. The best example of an event that everyone can relate to is a doctor’s appointment. Although you can’t explicitly touch a doctor’s appointment, it does have characteristics that you need to record, such as the appointment date, the appointment time, the patient’s blood pressure, and the patient’s temperature.
• Make certain each table has a primary key. You must assign a primary key to each table for two reasons. First, the primary key uniquely identifies each row within a table, and second, it is used in establishing table relationships. If you do not assign a primary key to each table, you will eventually have data integrity problems and problems with some types of multi-table SQL queries. You’ll learn some tips on how to define a proper primary key later in this chapter.
• Make sure the table does not contain any multipart or multivalued columns. Theoretically, you should have resolved these issues when you refined the column structures. Nonetheless, it’s still a good idea to review the columns one last time to ensure that you’ve completely removed every multipart or multivalued column.
• Make sure there are no calculated columns in the table. Although you might believe that your current table structures are free of calculated columns, you might have overlooked one or two during the column refinement process. This is a good time to take another look at the table structures and remove any calculated columns you might have missed.
• Make certain the table is free of any unnecessary duplicate columns. One of the hallmarks of a poorly designed table is the inclusion of duplicate columns from other tables. You might feel compelled to add duplicate columns to a table for one of two reasons: 1) to provide reference information or 2) to indicate multiple occurrences of a particular type of value. Remember that earlier I talked about HomePhone, WorkPhone, and CellPhone potentially being repeating or duplicate columns. These duplicate columns raise various difficulties when you work with the data and attempt to retrieve information from the table. Let’s now take a look at how to deal with duplicate columns.
Possibly the hardest part of ensuring well built structures is dealing with duplicate columns. Here are a couple of examples that demonstrate the proper way to resolve tables that contain duplicate columns.
Figure 2-7 illustrates an example of a table containing duplicate columns that supply reference information.
In this case, StaffLastName and StaffFirstName appear in the Classes table so that a person viewing the table can see the name of the instructor for a given class. These columns are unnecessary because of the one-to-many relationship that exists between the Classes and Staff tables. (A single staff member can teach any number of classes, but a single class is taught by one staff member.) StaffID establishes the relationship between these tables, and the relationship itself lets you view data from both tables simultaneously in an SQL query. With this in mind, you can confidently remove the StaffLastName and StaffFirstName columns from the Classes table without any adverse effects. Figure 2-8 shows the revised Classes table structure.
Keeping these unnecessary columns in the table automatically introduces a major problem with inconsistent data. You must ensure that the values of the StaffLastName and StaffFirstName columns in the Classes table always match their counterparts in the Staff table. For example, say a female staff member marries and decides to use her married name as her legal name from that day forward. Not only do you have to be certain to make the appropriate change to her row in the Staff table, but you must ensure that every occurrence of her name in the Classes table changes as well. Again, it’s possible to do this (at least, technically), but you’re working much harder than is necessary. Besides, one of the major premises behind using a relational database is that you should enter a piece of data only once in the entire database. (The only exception to this rule is when you’re using a column to establish a relationship between two tables.) As always, the best course of action is to remove all duplicate columns from the tables in your database.
Figure 2-9 shows another clear example of a table containing duplicate columns. This example illustrates how duplicate columns are mistakenly used to indicate multiple occurrences of a particular type of value. In this case, the three Committee columns are ostensibly used to show the names of the committees in which the employee participates.
Figure 2-9 A table with duplicate columns used to indicate multiple occurrences of a particular type of value
It’s relatively easy to see why these duplicate columns will create problems. One problem concerns the actual number of Committee columns in the table. What if a few employees end up belonging to four committees? For that matter, how can you tell exactly how many Committee columns you’re going to need? If it turns out that several employees participate in more than three committees, you’ll need to add more Committee columns to the table.
A second problem pertains to retrieving information from the table. How do you retrieve those employees who are currently in the ISO 9000 committee? It’s not impossible, but you’ll have difficulty retrieving this information. You must execute three separate queries (or build a search condition that tests three separate columns) in order to answer the question accurately because you cannot be certain in which of the three Committee columns the value ISO 9000 is stored. Now you’re expending more time and effort than is truly necessary.
A third problem concerns sorting the data. You cannot sort the data by committee in any practical fashion, and there’s no way that you’ll get the committee names to line up correctly in alphabetical order. Although these might seem like minor problems, they can be quite frustrating when you’re trying to get an overall view of the data in some orderly manner.
If you study the Employees table in Figure 2-9 closely, you’ll soon realize that there is a many-to-many relationship between the employees and committees to which they belong. A single employee can belong to any number of committees, and a single committee can be composed of any number of employees. You can, therefore, resolve these duplicate columns in the same manner that you would resolve any other many-to-many relationship—by creating a linking table. In the case of the Employees table, create the linking table by using a copy of the primary key (EmployeeID) and a single Committee column. Give the new table an appropriate name, such as Committee_Members, designate both the EmployeeID and Committee columns as a composite primary key, remove the Committee columns from the Employees table, and you’re done. (You’ll learn more about primary keys later in this chapter.) Figure 2-10 shows the revised Employees table and the new Committee_Members table.
You’ve resolved the duplicate columns that were in the original Employees table, but you’re not quite finished yet. Keeping in mind that there is a many-to-many relationship between the employees and the committees to which they belong, you might very well ask, “Where is the Committees table?” There isn’t one—yet! Chances are that a committee has some other characteristics that you need to record, such as the name of the room where the committee meets and the day of the month that the meeting is held. It would be a good idea for you to create a real Committees table that includes columns such as CommitteeID, CommitteeName, MeetingRoom, and MeetingDay. When you finish creating the new table, replace the Committee column in the Committee_Members table with the CommitteeID column from the new Committees table. The final structures appear in Figure 2-11.
You gain a real advantage by structuring the tables in this manner because you can now associate a single member with any number of committees or a single committee with any number of employees. You can then use an SQL query to view information from all three tables simultaneously.
Let’s revisit the problem I mentioned earlier about multiple, though uniquely named, columns potentially being a set of duplicate columns. Consider the table shown in Figure 2-12.
What potential problems do you see? First, there’s wasted space in the table when an employee doesn’t have a particular type of phone. But there’s an even bigger problem. Can you guess what it is? What do you do if an employee has two home phones or a fax line? What about key employees who not only have a personal cell phone but also are given a cell phone by the company? The solution is to create a separate Phone_Numbers table and relate it back to the Employees table as shown in Figure 2-13.
With this new design, I can store an unlimited set of phone numbers for each employee. If I need to store a new phone type, all I need to do is define a data value for the PhoneType column. And notice that there’s no wasted storage space for a home phone number for employee 7008—there simply is no Home phone row for that employee. Notice also that each row in the Phone_Numbers table has a PhoneID with a unique value for each row. You’ll learn more about the importance of uniquely identifying each row in the next section.
You’re now close to completing the process of fine-tuning your table structures. The last order of business is to make certain that each row within a table can be uniquely identified and that the table itself can be identified throughout the entire database.
You learned in Chapter 1, “What Is Relational?” that the primary key is one of the most important keys in a table because it uniquely identifies each row within a table and officially identifies that table throughout the database. It also establishes a relationship between a pair of tables. You cannot underestimate the importance of the primary key—every table in your database must have one!
By definition, a primary key is a column or group of columns that uniquely identifies each row within a table. A primary key is known as a simple primary key (or just primary key for short) when it is composed of a single column. A primary key is known as a composite primary key when it is composed of two or more columns. Define a simple primary key when you can because it’s more efficient and is much easier to use when establishing a table relationship. Use a composite primary key only when it’s appropriate, such as when you’re defining and creating a linking table.
You can use an existing column or a combination of columns as the primary key as long as they satisfy all the criteria in the following checklist. When the column or columns that you propose to use as the primary key do not conform to all the criteria, use a different column or define a new column to act as the primary key for the table. Take some time now and use this checklist to determine whether each primary key in your database is sound:
• Do the columns uniquely identify each row in the table? Each row in a table represents an instance of the subject of the table. A good primary key ensures that you have a means of accurately identifying or referencing each row in this table from other tables in the database. It also helps you to avoid having duplicate rows within the table.
• Does this column or combination of columns contain unique values? As long as the values of the primary key are unique, you have a means of ensuring that there are no duplicate rows in the table.
• Will these columns ever contain unknown values? This is a very important question because a primary key cannot contain unknown values. You should disqualify this column immediately if you think it has even the slightest possibility of containing unknown values.
• Can the value of these columns ever be optional? You cannot use this column as the primary key if the answer to this question is “Yes.” If the value of the column can be optional, it implies that it might be unknown at some point. As you learned in the previous item, a primary key cannot contain unknown values.
• Is this a multipart column? It’s a good idea to ask yourself this question, although you should have eliminated all your multipart columns by now. If you missed a multipart column earlier, resolve it now and try to use another column as the primary key, or use the new separate columns together as a composite primary key.
• Can the value of these columns ever be modified? The values of primary key columns should remain static. You should never change the value of a column in a primary key unless you have a truly compelling reason to do so. When the value of the column is subject to arbitrary changes, it is difficult for the column to remain in conformance with the other points in this checklist.
As I stated earlier, a column or combination of columns must pass all the points on this checklist with flying colors before it can be used as a primary key. In Figure 2-14, PilotID serves as the primary key of the Pilots table. But the question is this: Does PilotID conform to all the points on the previous checklist? The primary key is sound if it does, but if it doesn’t, you must either modify it to conform to all the points on the checklist or select a different column as the primary key.
As a matter of fact, PilotID is a sound primary key because it does conform to all the points on the checklist. But what happens when you don’t have a column that can act as a primary key? Take the Employees table in Figure 2-15, for example. Is there a column in this table that can act as a primary key?
It’s very clear that this table doesn’t contain a column (or group of columns) that can be used as a primary key. With the exception of EmpPhone, every column contains duplicate values. EmpZip, EmpAreaCode, and EmpPhone all contain unknown values. Although you might be tempted to use the combination of EmpLastName and EmpFirstName, there’s no guarantee that you won’t employ a new person who is also named Jim Wilson or David Smith. It’s evident that there is no column you can use as the primary key for this table because the value of every column in the table is subject to arbitrary change.
What do you do now? You might be tempted to use some sort of national identity number associated with each employee—for example, a Social Security number in the U.S. or the Social Insurance number in Canada. Be aware that although it is rare, it is possible for two or more people to have the same number. When in doubt, the solution is to create an artificial primary key. This is an arbitrary column you define and add to the table for the sole purpose of using it as the table’s primary key. The advantage of adding this arbitrary column is that you can ensure that it conforms to all the points on the checklist. After you’ve added the column to the table, designate it as the primary key, and you’re done! That’s all there is to it. Figure 2-16 shows the Employees table with an artificial primary key called EmployeeID.
Figure 2-16 The Employees table with the new artificial primary key
Note: Although artificial primary keys are an easy way to solve the problem, they don’t
really guarantee that you won’t get duplicate data in your table. For example, if
someone adds a new row for a person named John Kennedy and provides a new unique artificial
EmployeeID value, how do you know that this second John Kennedy isn’t the same as
the employee 98002 already in the table?
The answer is to add a verification routine to your application code that checks for a potentially duplicate name and warns the user. In many database systems, you can write such validation code as something called a trigger that your database system automatically runs each time a row is changed, added, or deleted. However, discussing triggers is far beyond the scope of this book. Consult your database system documentation for details.
At this point, you’ve done everything you can to strengthen and fine-tune your table structures. Now I’ll take a look at how you can ensure that all your table relationships are sound.
In Chapter 1, you learned that a relationship exists between a pair of tables if rows in the first table are in some way associated with rows in the second table. You also learned that the relationship itself can be designated as one of three types: one-to-one, one-to-many, and many-to-many. And you learned that each type of relationship is established in a specific manner. Let’s review this for a moment.
Note: The diagram symbols shown in this section are part of the diagramming method presented
in Mike Hernandez’s book Database Design for Mere Mortals, Third Edition (Addison-Wesley, 2013). PK indicates a primary key column. FK indicates
a foreign key column. CPK indicates a column that is part of a composite primary key.
• You establish a one-to-one relationship by taking the primary key from the primary table and inserting it into the subordinate table, where it becomes a foreign key. This is a special type of relationship because in many cases the foreign key will also act as the primary key of the subordinate table. Figure 2-17 shows how to diagram this relationship.
• You establish a one-to-many relationship by taking the primary key of the table on the “one” side and inserting it into the table on the “many” side, where it becomes a foreign key. Figure 2-18 shows how to diagram this type of relationship.
• You establish a many-to-many relationship by creating a linking table. Define the linking table by taking a copy of the primary key of each table in the relationship and using them to form the structure of the new table. These columns commonly serve two distinct roles: Together, they form the composite primary key of the linking table; separately, they each serve as a foreign key. You would diagram this relationship as shown in Figure 2-19.
A many-to-many relationship is always resolved by using a linking table. In this example, Pilot_Certifications is the linking table. A single pilot can have any number of certifications, and a single certification can be associated with any number of pilots.
In order to make certain that the relationships among the tables in your database are really solid, you must establish relationship characteristics for each relationship. The characteristics you’re about to define indicate what will occur when you delete a row, the type of participation a table has within the relationship, and to what degree each table participates within the relationship.
It’s important to note that the related columns that you use to link two tables must be the same data type. For example, you can link a primary key of Int (Integer) data type only to a foreign key that is also Int. You cannot link a number to character or date. The one exception to this rule involves primary keys that are automatically generated by your database system, known as AutoNumber, Identity, Serial, or Auto_Increment, depending on the database system. For each of these, there is an underlying numeric data type—Long Integer in Microsoft Access and Int in most others—so it’s perfectly okay to link a primary key created this way to a column that is simply the underlying data type. So you can link an AutoNumber in Microsoft Access to a Number / Long Integer foreign key in a related table, or an Identity in Microsoft SQL Server to a column that is Int.
Before my discussion on relationship characteristics begins, I must make one point perfectly clear: I present the following characteristics within a generic and logical frame of reference. These characteristics are important because they allow you to enforce relationship integrity (referred to by some database systems as referential integrity). The manner in which you implement them, however, will vary from one database software program to another. You will have to study your database software’s documentation to determine whether these characteristics are supported and, if so, how you can implement them.
A deletion rule dictates what happens when a user makes a request to delete a row in the primary table of a one-to-one relationship or in the table on the “one” side of a one-to-many relationship. You can guard against orphaned rows by establishing this rule. (Orphaned rows are those rows in the subordinate table of a one-to-one relationship that don’t have related rows in the primary table or rows in the table on the “many” side of a one-to-many relationship that don’t have related rows in the table on the “one” side.)
You can set two types of deletion rules for a relationship: restrict and cascade.
• The restrict deletion rule does not allow you to delete the requested row when there are related rows in the subordinate table of a one-to-one relationship or in the table on the “many” side of a one-to-many relationship. You must delete any related rows before deleting the requested row. You’ll use this type of deletion rule as a matter of course. In database systems that allow you to define relationship rules, this is usually the default and sometimes the only option.
• When the cascade deletion rule is in force, deleting the row on the “one” side of a relationship causes the system to automatically delete any related rows in the subordinate table of a one-to-one relationship or in the table on the “many” side of a one-to-many relationship. Use this rule very judiciously, or you might wind up deleting rows you really wanted to keep! Not all database systems support cascade deletion.
Regardless of the type of deletion rule you use, always examine your relationship very carefully to determine which type of rule is appropriate. You can use a very simple question to help you decide which type of rule to use. First, select a pair of tables, and then ask yourself the following question: “If a row in [name of primary or ‘one’ side table] is deleted, should related rows in [name of subordinate or ‘many’ side table] be deleted as well?”
This question is framed in a generic sense so that you can understand the premise behind it. To apply this question, substitute the phrases within the square brackets with table names. Your question will look something like this: “If a row in the Committees table is deleted, should related rows in the Committee_Members table be deleted as well?”
Use a restrict deletion rule if the answer to this question is “No.” Otherwise, use the cascade deletion rule. In the end, the answer to this question greatly depends on how you use the data stored within the database. This is why you must study the relationship carefully and make certain you choose the right rule. Figure 2-20 shows how to diagram the deletion rule for this relationship. Note that you’ll use (R) for a restricted deletion rule and (C) for a cascade deletion rule.
When you establish a relationship between a pair of tables, each table participates in a particular manner. The type of participation assigned to a given table determines whether a row must exist in that table before you can enter a row into the other table. There are two types of participation:
• Mandatory—At least one row must exist in this table before you can enter any rows into the other table.
• Optional—There is no requirement for any rows to exist in this table before you enter any rows in the other table.
The type of participation you select for a pair of tables depends mostly on the business logic of your organization. For example, let’s assume you work for a large company consisting of several departments. Let’s also assume that you have an Employees table, a Departments table, and a Department_Employees table in the database you’ve created for your company. All relevant information about an employee is in the Employees table, and all relevant information about a department is in the Departments table. The Department_Employees table is a linking table that allows you to associate any number of departments with a given employee. Figure 2-21 shows these tables. (In this figure, I used simple arrows pointing to the “many” side of the relationship.)
In the last staff meeting, you were told to assign some of the staff to a new Research and Development department. Now here’s the problem: You want to make certain you add the new department to the Departments table so that you can assign staff to that department in the Department_Employees table. This is where the type of participation characteristic comes into play. Set the type of participation for the Departments table to mandatory and the type of participation for the Department_Employees table to optional. By establishing these settings, you ensure that a department must exist in the Departments table before you can assign any employees to that department in the Department_Employees table.
As with the deletion rule, study each relationship carefully to determine the appropriate type of participation setting for each table in the relationship. You would diagram the type of participation as shown in Figure 2-22.
Figure 2-22 Diagramming the type of participation for the Departments and Department_Employees tables
Now that you’ve determined how each table will participate in the relationship, you must figure out to what degree each will participate. You do this by determining the minimum and maximum number of rows in one table that can be related to a single row in the other table. This process is known as identifying a table’s degree of participation. The degree of participation for a given table is represented by two numbers that are separated with a comma and enclosed within parentheses. The first number indicates the minimum possible number of related rows, and the second number indicates the maximum possible number of related rows. For example, a degree of participation such as “(1,12)” indicates that the minimum number of rows that can be related is 1 and the maximum is 12.
The degree of participation you select for various tables in your database largely depends on how your organization views and uses the data. Let’s say that you’re a booking agent for a talent agency and that two of the tables in your database are Agents and Entertainers. Let’s further assume that there is a one-to-many relationship between these tables—one row in the Agents table can be related to many rows in the Entertainers table, but a single row in the Entertainers table can be related to only one row in the Agents table. In this case, I’ve ensured (in a general sense) that an entertainer is assigned to only one agent. (I definitely avoid the possibility of the entertainer playing one agent against another. This is a good thing.)
In nearly all cases, the maximum number of rows on the “many” side of a relationship will be infinite. However, in some cases your business rules might dictate that you limit this participation. One example would be to limit the number of students who can enroll in a class. In this example, let’s assume that the boss wants to ensure that all his agents have a fair shake at making good commissions and wants to keep the infighting between agents down to a bare minimum. So he sets a new policy stating that a single agent can represent a maximum of six entertainers. (Although he thinks it might not work in the long run, he wants to try it anyway.) In order to implement his new policy, he sets the degree of participation for both tables to the following:
Agents |
(1,1)—An entertainer can be associated with one and only one agent. |
Entertainers |
(0,6)—Although an agent doesn’t have to be associated with an entertainer at all, he or she cannot be associated with more than six entertainers at any given time. |
Figure 2-23 shows how to diagram the degree of participation for these tables.
After setting the degree of participation, you should decide how you want your database system to enforce the relationship. What you choose depends on the features provided by your database system. The simplest enforcement supported by most database systems is to restrict the values in the foreign key in the “many” table so that the user cannot enter a value that is not in the related “one” table. You can indicate this by placing the letter R in parentheses next to the relationship line pointing to the “one” table, as shown in Figure 2-24.
Figure 2-24 A diagram of all the relationship characteristics for the Agents and Entertainers tables
Some database systems allow you to define a rule that cascades (C) the key value from the “one” table to the “many” table if the user changes the value of the primary key in the “one” table. Essentially, the database system corrects the foreign key value in related rows in the “many” table when you change the value of the primary key in the “one” table. And some database systems provide a feature that automatically deletes (D) the rows in the “many” table when you delete a row in the “one” table. Check your database system documentation for details.
Note: To enforce degree of participation constraints, you’ll have to define one or more
triggers or constraints in your database definition (if your database system supports
these features).
By using the techniques you learned in this chapter, you make the necessary beginning steps toward ensuring a fundamental level of data integrity in your database. The next step is to begin studying the manner in which your organization views and uses its data so that you can establish and impose business rules for your database. But to really get the most from your database, you should go back to the beginning and run it through a thorough database design process using a good design methodology. Unfortunately, these topics are beyond the scope of this book. However, you can learn a good design methodology from books such as Database Design for Mere Mortals, Third Edition (Addison-Wesley, 2013) by Michael J. Hernandez or Database Systems: A Practical Approach to Design, Implementation, and Management, Sixth Edition (Addison-Wesley, 2014) by Thomas Connolly and Carolyn Begg. The point to remember is this: The more solid your database structure, the easier it will be both to extract information from the data in the database and to build applications programs for it.
I opened this chapter with a short discussion on why you should be concerned with having sound structures in your database. You learned that poorly designed tables can cause numerous problems, not the least of which concern data integrity.
Next, I discussed fine-tuning the columns in each table. You learned that giving your columns good names is very important because it ensures that each name is meaningful and actually helps you to find hidden problems with the column structure itself. You now know how to fine-tune your column structures by ensuring they conform to a few simple rules. These rules deal with issues such as guaranteeing that each column represents a single characteristic of the table’s subject, contains only a single value, and never stores a calculation. I also discussed the problems found in multipart and multivalued columns, and you learned how to resolve them properly.
Fine-tuning the tables was the next issue I addressed. You learned that the table names are just as important an issue as column names for many of the same reasons. You now know how to give your tables meaningful names and ensure that each table represents only a single subject. I then discussed a set of rules you can use to make certain each table structure is sound. Although some of the rules seemed to duplicate some of the efforts you made in fine-tuning your column structures, you learned that the rules used for fine-tuning the table structures actually add an extra level of insurance in making sure that the table structures are as absolutely sound as they can be.
The next subject I tackled was primary keys. You learned the importance of establishing a primary key for each table in your database. You now know that a primary key must conform to a specific set of characteristics and that the column that will act as the primary key of a table must be chosen very carefully. You also learned that you can create an artificial primary key if there is no column in the table that conforms to the complete set of characteristics for a primary key.
I closed this chapter with a discussion on establishing solid relationships. After reviewing the three types of relationships, you learned how to diagram each one. You then learned how to establish and diagram a deletion rule for the relationship. This rule is important because it helps you guard against orphaned rows. The last two topics I discussed were the type of participation and degree of participation for each table within the relationship. You learned that a table’s participation can be mandatory or optional and that you can set a specific range for the number of related rows between each table.
In the next chapter, you’ll learn a little bit about the history of SQL and how it evolved into the current version at press time, SQL:2016.