2 Ensuring Your Database Structure Is Sound

Most of you reading this book are probably working with an existing database structure implemented on your favorite (I hope) RDBMS program. It’s hard for me to assume, at this point, whether or not you—or the person who developed the database—really had the necessary knowledge and skills or the time to design the database properly. Assuming the worst, you probably have a number of tables that could use some fine-tuning. Fortunately, you’re about to learn some techniques that will help you get your database in shape and will ensure that you can easily retrieve the information you need from your tables.

This chapter will not teach you the intricacies of database design, but it will help you get your database in relatively good shape. I highly recommend that you read through this chapter so that you can make certain your table structures are sound.

You can avoid many of these problems if you properly design your database from the beginning. Even if you’ve already designed your database, all is not lost. You can still apply the following techniques and gain the benefits of a sound structure. However, you must be aware that the quality of your final structures is in direct proportion to the amount of time you invest in fine-tuning them. The more care and patience you give to applying the techniques, the more you can guarantee your success.

Let’s now turn to the first order of business in shaping up your structures: working with the columns.

• Is the name descriptive and meaningful to your entire organization? If users in several departments are going to work with this database, make certain you choose a name that is meaningful to everyone who accesses this column. Semantics is a funny thing, and if you use a word that has a different meaning to different groups of people, you’re just inviting trouble.

• Is the column name clear and unambiguous? PhoneNumber is a column name that can be very misleading. What kind of phone number is this column supposed to represent? A home phone? A work phone? A cellular phone? Learn to be specific. If you need to record each of these types of phone numbers, then create HomePhone, WorkPhone, and CellPhone columns.

In addition to making your column names clear and unambiguous, be sure that you don’t use the same column name in several tables. Let’s say you have three tables called Customers, Vendors, and Employees. No doubt you will have City and State columns in each of these tables, and the columns will have the same names in all three tables. There isn’t a problem with this until you have to refer to one particular column. How do you distinguish between, say, the City column in the Vendors table, the City column in the Customers table, and the City column in the Employees table? The answer is simple: Add a short prefix to each of the column names. For example, use the name VendCity in the Vendors table, CustCity in the Customers table, and EmpCity in the Employees table. Now you can easily make a clear reference to any of these columns. (You can use this technique on any generic column such as FirstName, LastName, and Address.)

Here’s the main thing to remember: Make sure that each column in your database has a unique name and that it appears only once in the entire database structure. The only exception to this rule is when a column is being used to establish a relationship between two tables.

• Did you use an acronym or abbreviation as a column name? If you did, change it! Acronyms can be hard to decipher and are easily misunderstood. Imagine seeing a column named CAD_SW. How would you know what the column represents? Use abbreviations sparingly, and handle them with care. Use an abbreviation only if it supplements or positively enhances the column name. It shouldn’t detract from the meaning of the column name.

• Did you use a name that implicitly or explicitly identifies more than one characteristic? These types of names are easy to spot because they typically use the words and or or. Column names that contain a back slash (\), a hyphen (-), or an ampersand (&) are dead giveaways as well. If you have columns with names such as Phone\Fax or Area or Location, review the data that they store and determine whether you need to deconstruct them into smaller, distinct columns.

After using this checklist to revise your column names, you have one task left: Make certain you use the singular form of the column name. A column with a plural name such as Categories implies that it might contain two or more values for any given row, which is not a good idea. A column name is singular because it represents a single characteristic of the subject of the table to which it belongs. A table name, on the other hand, is plural because it represents a collection of similar objects or events. You can distinguish table names from column names quite easily when you use this naming convention.

• Make sure the column represents a specific characteristic of the subject of the table. The idea here is to determine whether the column truly belongs in the table. If it isn’t germane to the table, remove it, or perhaps move it to another table. The only exceptions to this rule occur when the column is being used to establish a relationship between this table and other tables in the database or when it has been added to the table in support of some task required by a database application. For example, in the Classes table in Figure 2-1, the StaffLastName and StaffFirstName columns are unnecessary because of the presence of the StaffID column. StaffID is being used to establish a relationship between the Classes table and the Staff table, and you can view data from both tables simultaneously by using a view or an SQL SELECT query. If you have unnecessary columns in your tables, you can either remove them completely or use them as the basis of a new table if they don’t appear anywhere else in the database structure. (I’ll show you how to do this later in this chapter.)

Figure 2-1 A table with unnecessary columns

• Make certain that the column contains only a single value. A column that can potentially store several instances of the same type of value is known as a multivalued column. (A column that contains multiple phone numbers is an example of a multivalued column.) Likewise, a column that can potentially store two or more distinct values is known as a multipart column. (A column that contains both an item number and an item description is an example of a multipart column.) Multivalued and multipart columns can wreak havoc in your database, especially when you try to edit, delete, or sort the data. When you ensure that each column stores only a single value, you go a long way toward guaranteeing data integrity and accurate information. But for the time being, just try to identify any multivalued or multipart columns and make a note of them. You’ll learn how to resolve them in the next section.

• Make sure the column does not store the result of a calculation or concatenation. Calculated columns are not allowed in a properly designed table. The issue here is the value of the calculated column itself. A column, unlike a cell in a spreadsheet, does not store an actual calculation. When the value of any part of the calculation changes, the result value stored in the column is not updated. The only ways to update the value are to do so manually or to write some procedural code that will do it automatically. Either way, it is incumbent on the user or you, the developer, to make certain the value is updated. The preferred way to work with a calculation, however, is to incorporate it into a SELECT statement. You’ll learn the advantages of dealing with calculations in this manner when you get to Chapter 5, “Getting More Than Simple Columns.”

• Make certain the column appears only once in the entire database. If you’ve made the common mistake of inserting the same column (for example, CompanyName) into several tables within the database, you’re going to have a problem with inconsistent data. This occurs when you change the value of the column in one table and then you forget to make the same modification wherever else the same column appears. Avoid this problem entirely by ensuring that a column appears only once in the entire database structure. (The only exception to this rule is when you’re using a column to establish a relationship between two tables.)

You’ll know if you have a multipart column by answering some very simple questions: “Can I take the current value of this column and break it up into smaller, more distinct parts?” “Will I have problems extracting a specific piece of information because it is buried in a column containing other information?” If your answer to either question is “Yes,” you have a multipart column. Figure 2-2 shows a poorly designed table with several multipart columns.

Figure 2-2 A table with multipart columns

The Customers table shown in the figure contains two multipart columns: CustomerName, and StreetAddress. There’s also one column that is potentially multipart: PhoneNumber. How can you sort the data by last name or ZIP Code or search on state? You can’t because these values are embedded in columns that contain other information. You can see that each column can be broken into smaller columns. For example, CustomerName can be broken into two distinct columns—CustFirstName and CustLastName. (Note that I’m using the naming convention discussed earlier in this chapter when I add the prefix Cust to the FirstName and LastName columns.) When you identify a multipart column in a table, determine how many parts there are to the value it stores, and then break the column into as many smaller columns as appropriate. Figure 2-3 shows how to resolve two of the multipart columns in the Customers table.

Figure 2-3 The resolution of the multipart columns in the Customers table

Sometimes you might have difficulty recognizing a multipart column. Take a look at the Instruments table shown in Figure 2-4. At first glance, there do not seem to be any multipart columns. On closer inspection, however, you will see that InstrumentID is actually a multipart column. The value stored in this column represents two distinct pieces of information: the category to which the instrument belongs—such as AMP (amplifier), GUIT (guitar), and MFX (multi-effects unit)—and its identification number. You should separate these two values and store them in their own columns to ensure data integrity. Imagine the difficulty of updating this column if the MFX category changed to MFU. You would have to write code to parse the value in this column, test for the existence of MFX, and then replace it with MFU if it does exist within the parsed value. It’s not so much that you couldn’t do this, but you’d definitely be working harder than necessary, and you shouldn’t have to go through this at all if your database is properly designed. When you have columns such as the one in this example, break them into smaller columns so that you will have sound, efficient column structures.

Figure 2-4 An example of a subtle multipart column

Figure 2-5 A table with a multivalued column

In this example, each pilot is certified to fly any number of planes, and those certifications are stored in a single column called Certifications. The manner in which the data is stored in this column is very troublesome because you are bound to encounter the same type of data integrity problems associated with multipart columns. When you look at the data more closely, you’ll see that it will be difficult for you to perform searches and sorts on this column in an SQL query. Before you can resolve this column in the appropriate manner, you must first understand the true relationship between a multivalued column and the table to which it is originally assigned.

The values in a multivalued column have a many-to-many relationship with every row in its parent table: One specific value in a multivalued column can be associated with any number of rows in the parent table, and a single row in the parent table can be associated with any number of values in the multivalued column. In Figure 2-5, for example, a specific aircraft in the Certifications column can be associated with any number of pilots, and a single pilot can be associated with any number of aircraft in the Certifications column. You resolve this many-to-many relationship as you would any other many-to-many relationship within the database—with a linking table.

You can create the linking table by using the multivalued column and a copy of the primary key column from the original table as the basis for the new table. Give the new linking table an appropriate name, and designate both columns as a composite primary key. (In this case, the combination of the values of both columns will uniquely identify each row within the new table.) Now you can associate the values of both columns in the linking table on a one-to-one basis. Figure 2-6 shows an example of this process using the Pilots table shown in Figure 2-5.

Contrast the entries for Sam Alborous (PilotID 25100) in both the old Pilots table and the new Pilot_Certifications table. The major advantage of the new linking table is that you can now associate any number of certifications with a single pilot. Asking certain types of questions is now much easier as well. For example, you can determine which pilots are certified to fly a Boeing 747 aircraft or retrieve a list of certifications for a specific pilot. You’ll also find that you can sort the data in any order you wish, without any adverse effects.

Figure 2-6 Resolving a multivalued column by using a linking table

Your columns will be in good shape when you follow the procedures presented in this section. Now that you’ve refined the columns, let’s turn to our second order of business and take a look at the table structures.

• Is the name unique and descriptive enough to be meaningful to your entire organization? Giving your table a unique name ensures that each table in the database represents a different subject and that everyone in the organization will understand what the table represents. Defining a unique and descriptive name does take some work on your part, but it’s well worth the effort in the long run.

• Does the name accurately, clearly, and unambiguously identify the subject of the table? When the table name is vague or ambiguous, you can bet that the table represents more than one subject. For example, Dates is a vague table name. It’s hard to determine exactly what this table represents unless you have a description of the table at hand. Let’s say this table appears in a database used by an entertainment agency. If you inspect this table closely, you’ll probably find that it contains dates for client meetings and booking dates for the agency’s stable of entertainers. This table clearly represents two subjects. You can resolve this issue by dividing the table into two new tables and give each table an appropriate name, such as Client_Meetings and Entertainer_Schedules.

• Does the name contain words that convey physical characteristics? Avoid using words such as File, Record, and Table in the table name because they introduce a level of confusion that you don’t need. A table name that includes this type of word is very likely to represent more than one subject. Consider the name Employee_Record. On the surface, there doesn’t appear to be any problem with this name. When you think about what an employee record is supposed to represent, however, you’ll realize that there are potential problems. The name contains a word that we’re trying hard to avoid, and it potentially represents three subjects: employees, departments, and payroll. With this in mind, split the original table (Employee_Record) into three new tables, one for each of the three subjects.

• Did you use an acronym or abbreviation as a table name? If the answer to this question is “Yes,” change the name right now! Abbreviations rarely convey the subject of the table, and acronyms are usually hard to decipher. Suppose your company database has a table named SC. How do you know what the table represents without knowing the meaning of the letters themselves? The fact is that you can’t easily identify the subject of the table. What’s more, you might find that the table means different things to different departments in the company. (Now, this is scary.) The folks in Personnel think it stands for Steering_Committees; the Information Systems staff believes it to be System_Configurations, and the people in Security insist that it represents Security_Codes. This example clearly illustrates why you should avoid using abbreviations and acronyms in a table name.

• Did you use a name that implicitly or explicitly identifies more than one subject? This is one of the most common mistakes you can make with a table name, and it is relatively easy to identify. This type of name typically contains the words and or or and characters such as the back slash (\), hyphen (-), or ampersand (&). Facility\Building and Department or Branch are typical examples. When you name a table in this manner, you must clearly identify whether it truly represents more than one subject. If it does, deconstruct it into smaller tables, and then give the new tables appropriate names.

After you’ve finished revising your table names, you have one more task to perform: Check each table name again and make certain you used the plural form of the name. You use the plural form because a table stores a collection of instances of the subject of the table. For example, an Employees table stores the data for many employees, not just one employee. Using the plural form also helps you to distinguish a table name from a column name.

• Make sure the table represents a single subject. Yes, I know, I’ve said this a number of times already, but I can’t overemphasize this point. As long as you guarantee that each of your tables represents a single subject, you greatly reduce the risk of potential data integrity problems. Also, remember that the subject represented by the table can be an object or event. By “object” I mean something that is tangible, such as employees, vendors, machines, buildings, or departments, whereas an “event” is something that happens at a given point in time that has characteristics you want to record. The best example of an event that everyone can relate to is a doctor’s appointment. Although you can’t explicitly touch a doctor’s appointment, it does have characteristics that you need to record, such as the appointment date, the appointment time, the patient’s blood pressure, and the patient’s temperature.

• Make certain each table has a primary key. You must assign a primary key to each table for two reasons. First, the primary key uniquely identifies each row within a table, and second, it is used in establishing table relationships. If you do not assign a primary key to each table, you will eventually have data integrity problems and problems with some types of multi-table SQL queries. You’ll learn some tips on how to define a proper primary key later in this chapter.

• Make sure the table does not contain any multipart or multivalued columns. Theoretically, you should have resolved these issues when you refined the column structures. Nonetheless, it’s still a good idea to review the columns one last time to ensure that you’ve completely removed every multipart or multivalued column.

• Make sure there are no calculated columns in the table. Although you might believe that your current table structures are free of calculated columns, you might have overlooked one or two during the column refinement process. This is a good time to take another look at the table structures and remove any calculated columns you might have missed.

• Make certain the table is free of any unnecessary duplicate columns. One of the hallmarks of a poorly designed table is the inclusion of duplicate columns from other tables. You might feel compelled to add duplicate columns to a table for one of two reasons: 1) to provide reference information or 2) to indicate multiple occurrences of a particular type of value. Remember that earlier I talked about HomePhone, WorkPhone, and CellPhone potentially being repeating or duplicate columns. These duplicate columns raise various difficulties when you work with the data and attempt to retrieve information from the table. Let’s now take a look at how to deal with duplicate columns.

Figure 2-7 illustrates an example of a table containing duplicate columns that supply reference information.

Figure 2-7 A table with duplicate columns added for reference information

In this case, StaffLastName and StaffFirstName appear in the Classes table so that a person viewing the table can see the name of the instructor for a given class. These columns are unnecessary because of the one-to-many relationship that exists between the Classes and Staff tables. (A single staff member can teach any number of classes, but a single class is taught by one staff member.) StaffID establishes the relationship between these tables, and the relationship itself lets you view data from both tables simultaneously in an SQL query. With this in mind, you can confidently remove the StaffLastName and StaffFirstName columns from the Classes table without any adverse effects. Figure 2-8 shows the revised Classes table structure.

Figure 2-8 Resolving the duplicate reference columns

Keeping these unnecessary columns in the table automatically introduces a major problem with inconsistent data. You must ensure that the values of the StaffLastName and StaffFirstName columns in the Classes table always match their counterparts in the Staff table. For example, say a female staff member marries and decides to use her married name as her legal name from that day forward. Not only do you have to be certain to make the appropriate change to her row in the Staff table, but you must ensure that every occurrence of her name in the Classes table changes as well. Again, it’s possible to do this (at least, technically), but you’re working much harder than is necessary. Besides, one of the major premises behind using a relational database is that you should enter a piece of data only once in the entire database. (The only exception to this rule is when you’re using a column to establish a relationship between two tables.) As always, the best course of action is to remove all duplicate columns from the tables in your database.

Figure 2-9 shows another clear example of a table containing duplicate columns. This example illustrates how duplicate columns are mistakenly used to indicate multiple occurrences of a particular type of value. In this case, the three Committee columns are ostensibly used to show the names of the committees in which the employee participates.

Figure 2-9 A table with duplicate columns used to indicate multiple occurrences of a particular type of value

It’s relatively easy to see why these duplicate columns will create problems. One problem concerns the actual number of Committee columns in the table. What if a few employees end up belonging to four committees? For that matter, how can you tell exactly how many Committee columns you’re going to need? If it turns out that several employees participate in more than three committees, you’ll need to add more Committee columns to the table.

A second problem pertains to retrieving information from the table. How do you retrieve those employees who are currently in the ISO 9000 committee? It’s not impossible, but you’ll have difficulty retrieving this information. You must execute three separate queries (or build a search condition that tests three separate columns) in order to answer the question accurately because you cannot be certain in which of the three Committee columns the value ISO 9000 is stored. Now you’re expending more time and effort than is truly necessary.

A third problem concerns sorting the data. You cannot sort the data by committee in any practical fashion, and there’s no way that you’ll get the committee names to line up correctly in alphabetical order. Although these might seem like minor problems, they can be quite frustrating when you’re trying to get an overall view of the data in some orderly manner.

If you study the Employees table in Figure 2-9 closely, you’ll soon realize that there is a many-to-many relationship between the employees and committees to which they belong. A single employee can belong to any number of committees, and a single committee can be composed of any number of employees. You can, therefore, resolve these duplicate columns in the same manner that you would resolve any other many-to-many relationship—by creating a linking table. In the case of the Employees table, create the linking table by using a copy of the primary key (EmployeeID) and a single Committee column. Give the new table an appropriate name, such as Committee_Members, designate both the EmployeeID and Committee columns as a composite primary key, remove the Committee columns from the Employees table, and you’re done. (You’ll learn more about primary keys later in this chapter.) Figure 2-10 shows the revised Employees table and the new Committee_Members table.

Figure 2-10 The revised Employees table and the new Committee_Members table

You’ve resolved the duplicate columns that were in the original Employees table, but you’re not quite finished yet. Keeping in mind that there is a many-to-many relationship between the employees and the committees to which they belong, you might very well ask, “Where is the Committees table?” There isn’t one—yet! Chances are that a committee has some other characteristics that you need to record, such as the name of the room where the committee meets and the day of the month that the meeting is held. It would be a good idea for you to create a real Committees table that includes columns such as CommitteeID, CommitteeName, MeetingRoom, and MeetingDay. When you finish creating the new table, replace the Committee column in the Committee_Members table with the CommitteeID column from the new Committees table. The final structures appear in Figure 2-11.

Figure 2-11 The final Employees, Committee_Members, and Committees structures

You gain a real advantage by structuring the tables in this manner because you can now associate a single member with any number of committees or a single committee with any number of employees. You can then use an SQL query to view information from all three tables simultaneously.

Let’s revisit the problem I mentioned earlier about multiple, though uniquely named, columns potentially being a set of duplicate columns. Consider the table shown in Figure 2-12.

Figure 2-12 Repeating phone number columns in an Employees table

What potential problems do you see? First, there’s wasted space in the table when an employee doesn’t have a particular type of phone. But there’s an even bigger problem. Can you guess what it is? What do you do if an employee has two home phones or a fax line? What about key employees who not only have a personal cell phone but also are given a cell phone by the company? The solution is to create a separate Phone_Numbers table and relate it back to the Employees table as shown in Figure 2-13.

With this new design, I can store an unlimited set of phone numbers for each employee. If I need to store a new phone type, all I need to do is define a data value for the PhoneType column. And notice that there’s no wasted storage space for a home phone number for employee 7008—there simply is no Home phone row for that employee. Notice also that each row in the Phone_Numbers table has a PhoneID with a unique value for each row. You’ll learn more about the importance of uniquely identifying each row in the next section.

Figure 2-13 Solving the phone number problem with a separate related table

You’re now close to completing the process of fine-tuning your table structures. The last order of business is to make certain that each row within a table can be uniquely identified and that the table itself can be identified throughout the entire database.

By definition, a primary key is a column or group of columns that uniquely identifies each row within a table. A primary key is known as a simple primary key (or just primary key for short) when it is composed of a single column. A primary key is known as a composite primary key when it is composed of two or more columns. Define a simple primary key when you can because it’s more efficient and is much easier to use when establishing a table relationship. Use a composite primary key only when it’s appropriate, such as when you’re defining and creating a linking table.

You can use an existing column or a combination of columns as the primary key as long as they satisfy all the criteria in the following checklist. When the column or columns that you propose to use as the primary key do not conform to all the criteria, use a different column or define a new column to act as the primary key for the table. Take some time now and use this checklist to determine whether each primary key in your database is sound:

• Do the columns uniquely identify each row in the table? Each row in a table represents an instance of the subject of the table. A good primary key ensures that you have a means of accurately identifying or referencing each row in this table from other tables in the database. It also helps you to avoid having duplicate rows within the table.

• Does this column or combination of columns contain unique values? As long as the values of the primary key are unique, you have a means of ensuring that there are no duplicate rows in the table.

• Will these columns ever contain unknown values? This is a very important question because a primary key cannot contain unknown values. You should disqualify this column immediately if you think it has even the slightest possibility of containing unknown values.

• Can the value of these columns ever be optional? You cannot use this column as the primary key if the answer to this question is “Yes.” If the value of the column can be optional, it implies that it might be unknown at some point. As you learned in the previous item, a primary key cannot contain unknown values.

• Is this a multipart column? It’s a good idea to ask yourself this question, although you should have eliminated all your multipart columns by now. If you missed a multipart column earlier, resolve it now and try to use another column as the primary key, or use the new separate columns together as a composite primary key.

• Can the value of these columns ever be modified? The values of primary key columns should remain static. You should never change the value of a column in a primary key unless you have a truly compelling reason to do so. When the value of the column is subject to arbitrary changes, it is difficult for the column to remain in conformance with the other points in this checklist.

As I stated earlier, a column or combination of columns must pass all the points on this checklist with flying colors before it can be used as a primary key. In Figure 2-14, PilotID serves as the primary key of the Pilots table. But the question is this: Does PilotID conform to all the points on the previous checklist? The primary key is sound if it does, but if it doesn’t, you must either modify it to conform to all the points on the checklist or select a different column as the primary key.

Figure 2-14 Is PilotID a sound primary key?

As a matter of fact, PilotID is a sound primary key because it does conform to all the points on the checklist. But what happens when you don’t have a column that can act as a primary key? Take the Employees table in Figure 2-15, for example. Is there a column in this table that can act as a primary key?

It’s very clear that this table doesn’t contain a column (or group of columns) that can be used as a primary key. With the exception of EmpPhone, every column contains duplicate values. EmpZip, EmpAreaCode, and EmpPhone all contain unknown values. Although you might be tempted to use the combination of EmpLastName and EmpFirstName, there’s no guarantee that you won’t employ a new person who is also named Jim Wilson or David Smith. It’s evident that there is no column you can use as the primary key for this table because the value of every column in the table is subject to arbitrary change.

Figure 2-15 Does this table have a primary key?

What do you do now? You might be tempted to use some sort of national identity number associated with each employee—for example, a Social Security number in the U.S. or the Social Insurance number in Canada. Be aware that although it is rare, it is possible for two or more people to have the same number. When in doubt, the solution is to create an artificial primary key. This is an arbitrary column you define and add to the table for the sole purpose of using it as the table’s primary key. The advantage of adding this arbitrary column is that you can ensure that it conforms to all the points on the checklist. After you’ve added the column to the table, designate it as the primary key, and you’re done! That’s all there is to it. Figure 2-16 shows the Employees table with an artificial primary key called EmployeeID.

Figure 2-16 The Employees table with the new artificial primary key

Note: Although artificial primary keys are an easy way to solve the problem, they don’t really guarantee that you won’t get duplicate data in your table. For example, if someone adds a new row for a person named John Kennedy and provides a new unique artificial EmployeeID value, how do you know that this second John Kennedy isn’t the same as the employee 98002 already in the table?

The answer is to add a verification routine to your application code that checks for a potentially duplicate name and warns the user. In many database systems, you can write such validation code as something called a trigger that your database system automatically runs each time a row is changed, added, or deleted. However, discussing triggers is far beyond the scope of this book. Consult your database system documentation for details.

At this point, you’ve done everything you can to strengthen and fine-tune your table structures. Now I’ll take a look at how you can ensure that all your table relationships are sound.

Establishing Solid Relationships

In Chapter 1, you learned that a relationship exists between a pair of tables if rows in the first table are in some way associated with rows in the second table. You also learned that the relationship itself can be designated as one of three types: one-to-one, one-to-many, and many-to-many. And you learned that each type of relationship is established in a specific manner. Let’s review this for a moment.

Note: The diagram symbols shown in this section are part of the diagramming method presented in Mike Hernandez’s book Database Design for Mere Mortals, Third Edition (Addison-Wesley, 2013). PK indicates a primary key column. FK indicates a foreign key column. CPK indicates a column that is part of a composite primary key.

• You establish a one-to-one relationship by taking the primary key from the primary table and inserting it into the subordinate table, where it becomes a foreign key. This is a special type of relationship because in many cases the foreign key will also act as the primary key of the subordinate table. Figure 2-17 shows how to diagram this relationship.

Figure 2-17 Diagramming a one-to-one relationship

• You establish a one-to-many relationship by taking the primary key of the table on the “one” side and inserting it into the table on the “many” side, where it becomes a foreign key. Figure 2-18 shows how to diagram this type of relationship.

Figure 2-18 Diagramming a one-to-many relationship

• You establish a many-to-many relationship by creating a linking table. Define the linking table by taking a copy of the primary key of each table in the relationship and using them to form the structure of the new table. These columns commonly serve two distinct roles: Together, they form the composite primary key of the linking table; separately, they each serve as a foreign key. You would diagram this relationship as shown in Figure 2-19.

A many-to-many relationship is always resolved by using a linking table. In this example, Pilot_Certifications is the linking table. A single pilot can have any number of certifications, and a single certification can be associated with any number of pilots.

Figure 2-19 Diagramming a many-to-many relationship

In order to make certain that the relationships among the tables in your database are really solid, you must establish relationship characteristics for each relationship. The characteristics you’re about to define indicate what will occur when you delete a row, the type of participation a table has within the relationship, and to what degree each table participates within the relationship.

It’s important to note that the related columns that you use to link two tables must be the same data type. For example, you can link a primary key of Int (Integer) data type only to a foreign key that is also Int. You cannot link a number to character or date. The one exception to this rule involves primary keys that are automatically generated by your database system, known as AutoNumber, Identity, Serial, or Auto_Increment, depending on the database system. For each of these, there is an underlying numeric data type—Long Integer in Microsoft Access and Int in most others—so it’s perfectly okay to link a primary key created this way to a column that is simply the underlying data type. So you can link an AutoNumber in Microsoft Access to a Number / Long Integer foreign key in a related table, or an Identity in Microsoft SQL Server to a column that is Int.

Before my discussion on relationship characteristics begins, I must make one point perfectly clear: I present the following characteristics within a generic and logical frame of reference. These characteristics are important because they allow you to enforce relationship integrity (referred to by some database systems as referential integrity). The manner in which you implement them, however, will vary from one database software program to another. You will have to study your database software’s documentation to determine whether these characteristics are supported and, if so, how you can implement them.

You can set two types of deletion rules for a relationship: restrict and cascade.

• The restrict deletion rule does not allow you to delete the requested row when there are related rows in the subordinate table of a one-to-one relationship or in the table on the “many” side of a one-to-many relationship. You must delete any related rows before deleting the requested row. You’ll use this type of deletion rule as a matter of course. In database systems that allow you to define relationship rules, this is usually the default and sometimes the only option.

• When the cascade deletion rule is in force, deleting the row on the “one” side of a relationship causes the system to automatically delete any related rows in the subordinate table of a one-to-one relationship or in the table on the “many” side of a one-to-many relationship. Use this rule very judiciously, or you might wind up deleting rows you really wanted to keep! Not all database systems support cascade deletion.

Regardless of the type of deletion rule you use, always examine your relationship very carefully to determine which type of rule is appropriate. You can use a very simple question to help you decide which type of rule to use. First, select a pair of tables, and then ask yourself the following question: “If a row in [name of primary or ‘one’ side table] is deleted, should related rows in [name of subordinate or ‘many’ side table] be deleted as well?”

This question is framed in a generic sense so that you can understand the premise behind it. To apply this question, substitute the phrases within the square brackets with table names. Your question will look something like this: “If a row in the Committees table is deleted, should related rows in the Committee_Members table be deleted as well?”

Use a restrict deletion rule if the answer to this question is “No.” Otherwise, use the cascade deletion rule. In the end, the answer to this question greatly depends on how you use the data stored within the database. This is why you must study the relationship carefully and make certain you choose the right rule. Figure 2-20 shows how to diagram the deletion rule for this relationship. Note that you’ll use (R) for a restricted deletion rule and (C) for a cascade deletion rule.

Figure 2-20 Diagramming the deletion rule for the Committees and Committee_Members tables

• Mandatory—At least one row must exist in this table before you can enter any rows into the other table.

• Optional—There is no requirement for any rows to exist in this table before you enter any rows in the other table.

The type of participation you select for a pair of tables depends mostly on the business logic of your organization. For example, let’s assume you work for a large company consisting of several departments. Let’s also assume that you have an Employees table, a Departments table, and a Department_Employees table in the database you’ve created for your company. All relevant information about an employee is in the Employees table, and all relevant information about a department is in the Departments table. The Department_Employees table is a linking table that allows you to associate any number of departments with a given employee. Figure 2-21 shows these tables. (In this figure, I used simple arrows pointing to the “many” side of the relationship.)

Figure 2-21 The Employees, Departments, and Department_Employees tables

In the last staff meeting, you were told to assign some of the staff to a new Research and Development department. Now here’s the problem: You want to make certain you add the new department to the Departments table so that you can assign staff to that department in the Department_Employees table. This is where the type of participation characteristic comes into play. Set the type of participation for the Departments table to mandatory and the type of participation for the Department_Employees table to optional. By establishing these settings, you ensure that a department must exist in the Departments table before you can assign any employees to that department in the Department_Employees table.

As with the deletion rule, study each relationship carefully to determine the appropriate type of participation setting for each table in the relationship. You would diagram the type of participation as shown in Figure 2-22.

Figure 2-22 Diagramming the type of participation for the Departments and Department_Employees tables

Setting the Degree of Participation

Now that you’ve determined how each table will participate in the relationship, you must figure out to what degree each will participate. You do this by determining the minimum and maximum number of rows in one table that can be related to a single row in the other table. This process is known as identifying a table’s degree of participation. The degree of participation for a given table is represented by two numbers that are separated with a comma and enclosed within parentheses. The first number indicates the minimum possible number of related rows, and the second number indicates the maximum possible number of related rows. For example, a degree of participation such as “(1,12)” indicates that the minimum number of rows that can be related is 1 and the maximum is 12.

The degree of participation you select for various tables in your database largely depends on how your organization views and uses the data. Let’s say that you’re a booking agent for a talent agency and that two of the tables in your database are Agents and Entertainers. Let’s further assume that there is a one-to-many relationship between these tables—one row in the Agents table can be related to many rows in the Entertainers table, but a single row in the Entertainers table can be related to only one row in the Agents table. In this case, I’ve ensured (in a general sense) that an entertainer is assigned to only one agent. (I definitely avoid the possibility of the entertainer playing one agent against another. This is a good thing.)

In nearly all cases, the maximum number of rows on the “many” side of a relationship will be infinite. However, in some cases your business rules might dictate that you limit this participation. One example would be to limit the number of students who can enroll in a class. In this example, let’s assume that the boss wants to ensure that all his agents have a fair shake at making good commissions and wants to keep the infighting between agents down to a bare minimum. So he sets a new policy stating that a single agent can represent a maximum of six entertainers. (Although he thinks it might not work in the long run, he wants to try it anyway.) In order to implement his new policy, he sets the degree of participation for both tables to the following:

Figure 2-23 Diagramming the degree of participation for the Agents and Entertainers tables

After setting the degree of participation, you should decide how you want your database system to enforce the relationship. What you choose depends on the features provided by your database system. The simplest enforcement supported by most database systems is to restrict the values in the foreign key in the “many” table so that the user cannot enter a value that is not in the related “one” table. You can indicate this by placing the letter R in parentheses next to the relationship line pointing to the “one” table, as shown in Figure 2-24.

Figure 2-24 A diagram of all the relationship characteristics for the Agents and Entertainers tables

Some database systems allow you to define a rule that cascades (C) the key value from the “one” table to the “many” table if the user changes the value of the primary key in the “one” table. Essentially, the database system corrects the foreign key value in related rows in the “many” table when you change the value of the primary key in the “one” table. And some database systems provide a feature that automatically deletes (D) the rows in the “many” table when you delete a row in the “one” table. Check your database system documentation for details.

Is That All?

By using the techniques you learned in this chapter, you make the necessary beginning steps toward ensuring a fundamental level of data integrity in your database. The next step is to begin studying the manner in which your organization views and uses its data so that you can establish and impose business rules for your database. But to really get the most from your database, you should go back to the beginning and run it through a thorough database design process using a good design methodology. Unfortunately, these topics are beyond the scope of this book. However, you can learn a good design methodology from books such as Database Design for Mere Mortals, Third Edition (Addison-Wesley, 2013) by Michael J. Hernandez or Database Systems: A Practical Approach to Design, Implementation, and Management, Sixth Edition (Addison-Wesley, 2014) by Thomas Connolly and Carolyn Begg. The point to remember is this: The more solid your database structure, the easier it will be both to extract information from the data in the database and to build applications programs for it.

Next, I discussed fine-tuning the columns in each table. You learned that giving your columns good names is very important because it ensures that each name is meaningful and actually helps you to find hidden problems with the column structure itself. You now know how to fine-tune your column structures by ensuring they conform to a few simple rules. These rules deal with issues such as guaranteeing that each column represents a single characteristic of the table’s subject, contains only a single value, and never stores a calculation. I also discussed the problems found in multipart and multivalued columns, and you learned how to resolve them properly.

Fine-tuning the tables was the next issue I addressed. You learned that the table names are just as important an issue as column names for many of the same reasons. You now know how to give your tables meaningful names and ensure that each table represents only a single subject. I then discussed a set of rules you can use to make certain each table structure is sound. Although some of the rules seemed to duplicate some of the efforts you made in fine-tuning your column structures, you learned that the rules used for fine-tuning the table structures actually add an extra level of insurance in making sure that the table structures are as absolutely sound as they can be.

The next subject I tackled was primary keys. You learned the importance of establishing a primary key for each table in your database. You now know that a primary key must conform to a specific set of characteristics and that the column that will act as the primary key of a table must be chosen very carefully. You also learned that you can create an artificial primary key if there is no column in the table that conforms to the complete set of characteristics for a primary key.

I closed this chapter with a discussion on establishing solid relationships. After reviewing the three types of relationships, you learned how to diagram each one. You then learned how to establish and diagram a deletion rule for the relationship. This rule is important because it helps you guard against orphaned rows. The last two topics I discussed were the type of participation and degree of participation for each table within the relationship. You learned that a table’s participation can be mandatory or optional and that you can set a specific range for the number of related rows between each table.

In the next chapter, you’ll learn a little bit about the history of SQL and how it evolved into the current version at press time, SQL:2016.

2
Ensuring Your Database Structure Is Sound

Topics Covered in This Chapter

Why Is this Chapter Here?

Why Worry about Sound Structures?

Fine-Tuning Columns

What’s in a Name? (Part One)

Smoothing Out the Rough Edges

Resolving Multipart Columns

Resolving Multivalued Columns