Relational database design

The design of a relational database is usually managed in several stages. The first stage is to identify the relational schema for the tables.

To clarify the process, consider the design of a small database of personally owned books. Here is a reasonable schema:

Authors(id, lastName, firstName, yob)
Books(title, edition, hardcover, publisher, pubYear, isbn, numPages)
Publishers(id, name, city, country, url)

Most of the fields specified here are text type (String, in Java). The Authors.yob (for year of birth), Books.pubYear, and Books.numPages fields should be integer type. The primary keys are Authors.id, Books.isbn, and Publishers.id. A publisher's ID would be a short string, such as A-W for Addison-Wesley, or Wiley for John Wiley and Sons.

The next step is to identify the foreign key constraints. One is obvious: Books.publisher references Publishers.id.

But we can't use foreign keys to link the Authors and Books tables because their relationship is many-to-many: an author could write several books, and a book may have several authors. Foreign key relationships must be many-to-one. The solution is to add a link table that uses foreign keys to connect to these other two tables: AuthorsBooks(author, book).

This table has a two-field key: {author, book}. This is necessary because neither field by itself will have unique values. For example, it might contain these three rows:

Author

Book

JRHubb

978-0-07-147698-0

JRHubb

978-0-13-093374-0

AHuray

978-0-13-093374-0

Table 5-3. Link table for Library database

Now we can define the remaining foreign keys for our schema:

Our complete schema now has four tables, each with a primary key, and three foreign keys:

This schema is illustrated in Figure 5-2. The primary keys are underlined, and the foreign keys are indicated by arrows.

Now we have decided upon our schema (Figure 5-2). It shows the four tables Authors, Books, Publishers, and AuthorsBooks. Each table lists its fields, with its primary key field(s) underlined. The foreign key constraints are shown by arrows. For example, Books.isbn is a foreign key for the AuthorsBooks.book field.

The next step is to create the database. To do that, we must first choose a relational database system (RDBMS).

There are many good RDBMSs available. If you have access to a proprietary RDBMS provided by your school or employer, that might be your best choice. Among the most popular are Oracle RDB, Microsoft SQL Server, and IBM DB2. There are also several good, open-access (free) RDBMSs available. The most popular among them is MySQL, which is described in the Appendix.

Since we have been using the NetBeans IDE for our Java examples in this book, we will use its Java DB database system in this chapter. However, all the SQL commands shown in the listings here should run the same in MySQL, Oracle, SQL Server, or DB2. The NetBeans Java DB website (https://netbeans.org/kb/docs/ide/java-db.html) provides more information on the Java DB database system. The website NetBeans MySQL explains how to connect to the MySQL database system from NetBeans.

Database developers communicate with a RDBMS by means of SQL (Structured Query Language). Unlike programming languages such as Java and Python, SQL is primarily a declarative language. That means that it responds to individual commands, like the language R or OS commands. Later in the chapter, we will see how to use JDBC to execute SQL commands from within Java programs.

The RDBMS that comes with NetBeans is called Java DB. Follow these steps to create your Library database:

  1. Click on the Services tab in the upper left panel.
  2. Expand the Databases node.
  3. Right-click on the Java DB node and select Start Server from the drop-down menu. The server should respond with a couple of messages in the output panel.
  4. Right-click again on the Java DB node and this time select Create Database….
    Creating a database

    Figure 5-4. Creating the database

  5. In the Create Java DB Database panel, enter Library for Database Name, and then pick a username and a simple password that you can remember. If you want to also specify the location for your database, click on the Properties… button.
  6. A new node should now appear under Databases. Its label should be jdbc:derby://localhost:1527/Library. This represents your connection to the database. Its icon should look broken, indicating that the connection is currently closed.
  7. Right-click on that connection icon and select Connect…. This should cause the icon now to appear unbroken, thereby allowing you to expand that node.
    Creating a database

    Figure 5-5. Connected to the database

  8. Now right-click on your Library connection icon and select Execute Command…. This will activate the SQL command editor, allowing you to enter and execute SQL commands.
  9. Enter the create table command, exactly as shown in Listing 5-1. Notice the commas at the ends of the four lines 2-5, but not at the end of line 6 (in SQL, the comma is a separator, not a terminator). Also, in contrast to Java, SQL uses parentheses (lines 1 and 7) instead of braces to delimit the field definition list.
    Creating a database

    Listing 5-1. SQL Create table command

  10. Now right-click in the SQL editor window and select Run Statement. If there are no errors, the Output window will report that it executed successfully. Otherwise, go back and check your code against the one in Listing 5-1, correct it, and then run it again (you can also download this code from the Packt website). After it runs successfully, an icon for your new PUBLISHERS table will appear under the Tables node of your database, as shown in the following screenshot:
    Creating a database

    Figure 5-6. Publishers table created

The SQL query language is big and rather complex. This is due partly to its age—it's almost 50 years old. It's also partly because the language has evolved through many iterations, heavily influenced by powerfully competitive RDBMS vendors: IBM, Microsoft, and Oracle. But fear not; we'll be using only a small part of the language in this book.

SQL commands are separated into two groups: those that are used to define a database, and those that are used to access its data. The first is called the Data Definition Language (DDL), and the second is called the Data Manipulation Language (DML). The create database command used in step 4 previously, and the create table command used in Listing 5-1, are DDL statements.

Here are some other common DDL commands:

Here are a few common DML commands:

Continuing the creation of your Library database, the next step is to execute the create table command for your Books table, as shown in Listing 5-2. This has some new features that we didn't have in the Publishers table.

First, note that we are using the int datatype (lines 3, 6, and 8). This is the same as the int type in Java. In SQL, it can also be written as integer.

Secondly, in the declaration of the cover field (line 4), we are using a check clause to limit the possible values of that field to only 'HARD' and 'SOFT'. This is essentially a nominal datatype, comparable to an enumerated (enum) type in Java.

Also note that in SQL, all strings are delimited by apostrophes (the single quote character) instead of regular quotation marks as in Java: write 'HARD' not "HARD".

Finally, note the references clause in line 5. This enforces the foreign key: Books.publisher references Publishers.id. Notice that both fields, Books.publisher and Publishers.id, have the same datatype: char(4). A foreign key should have the same datatype as the primary key that it references.

Your database structure should now look like the display in the following screenshot, but with your username in place of JHUBBARD:

The SQL commands for creating the Authors table and the AuthorsBooks table are shown in Listing 5-3 and Listing 5-4:

In Listing 5-4, the two required foreign keys are specified by references clauses (lines 2 and 3). Also note that the two-field primary key is specified separately, at line 4. Single-field primary keys (for example, Authors.id) can be specified either way, as an appending clause or on a separate line. But multi-field keys must be specified separately this way.

All four of these SQL commands have been combined into a single file, named CreateTables.sql and shown in Listing 5-5. This is called an SQL script, which you can download from the Packt website.

The script begins with a seven-line header comment that identifies the file by name and states its purpose, identifies the book, the author, and the date it was written. In SQL, anything that appears on a line after a double dash is ignored by the SQL interpreter as a comment.

The four tables must first be deleted before they can be recreated. This is done with the drop table command at lines 9-12.

The tables must be dropped in the reverse order from which they were created. This accommodates the foreign key constraints. For example, the Books table depends on the Publishers table (see Figure 5-2), so it must be dropped before the Publishers table. Of course, the order for the creations of the tables is the reverse, for the same reason: The Publishers table must be created before the Books table is.

Notice that this script contains eight SQL statements: four drop and four create. The statements must be separated by semicolons, although a single statement can be run without a semicolon.

To run the script in the NetBeans SQL editor, right-click anywhere in the window and select Run File.

Data is inserted into a database table by means of the insert into statement. This is illustrated in the following screenshot:

To run this command, first open a new SQL editor window. Right-click on the database connection icon and select Execute Command…. Enter the SQL code as shown here. Then right-click anywhere in the window and select Run Statement.

To see that the data really has been inserted in to the table, right-click on the icon for the PUBLISHERS table (in the NetBeans Services tab) and select View Data…. An output window will appear:

Notice the label at the top of this output window: SELECT * FROM JHUBBARD.PU…. This is the first part of the complete query that was executed to produce this output. You can see that complete query in line 1 of the SQL editor window above it:

There are two versions of the insert into statement. One version is illustrated in Listing 5-6. It requires a value for each field that is defined in the table. If any of those values is unknown or does not exist, then a more detailed version of the insert into statement is needed. This second version simply lists each field for which values are being provided.

The insert statement executed in Listing 5-7 illustrates this second version:

This publisher has no URL, so that field must be left empty. Notice that the output in Figure 5-10 shows that field value as <NULL>:

One advantage of this version is that you can list the fields in any order, as long as the order of their values matches.

Data can also be inserted into tables using insert into statements in batch mode. This simply means combining many of the statements together, one after the other, in a single SQL file, and then running the file itself, as we did with the create table commands in Listing 5-5.

Run this batch file in the SQL Editor:

Then check the results:

The only problem with this approach is the tedium of writing all those insert into statements into the file. We shall see a more elegant solution later using JDBC instead.

An SQL database query is an SQL statement that requests information from the database. If the query is successful, it results in a dataset that is usually presented in tabular form; that is, a virtual table. Figure 5-9 shows an SQL query that was generated automatically when we clicked on View Data… in the drop down menu on the Publishers table. The key word to notice there is the first word, select.

The select statement has more clauses and variations than any other statement in SQL. It is fairly well covered on the Wikipedia page. We shall look at just a few variations in this section.

The query in Listing 5-9 returns the name, city, and country of those publishers who are not US companies, sorting the results by name.

Both the where clause and the order by clause are optional. The symbol <> means "not equal".

The query in Listing 5-10 shows that if you list two fields in an order by clause, the results will be sorted by the second field when tied in the first field.

Standard SQL has four categories of data types: numbers, character strings, bit strings, and temporal types.

The numeric types include int, smallInt, bigInt, double, and decimal(p,s), where p is the precision (the maximum number of digits) and s is the scale (the number of digits after the decimal point).

The character string types include char(n) and varchar(n), as we have already seen.

The two bit string types are bit(n) and bit varying(n). These can be interpreted as arrays of single bits—zeros and ones.

The temporal types include date, time, and timestamp. A timestamp is an instance in time that specifies the date and time, such as 2017-04-29 14:06. Notice that the ISO standard is used for dates; for example, April 29, 2017 is denoted as 2017-04-29.

Because of the competition among the major RDMS vendors, many SQL types have alternative names. For example, integer can be used in place of int, and character can be used in place of char. MySQL uses int and char.

The Java Database Connectivity (JDBC) API is a library of Java packages and classes that provide easy execution of SQL commands on a RDB from within a Java program. This API is called a database driver. It can be downloaded from the provider of your DBMS, for example:

If you are using Java DB in NetBeans, no download is necessary; JDBC is already installed. But the library does have to be added to your project.

To use JDBC in a NetBeans project, follow these steps to add it to your project:

To test your JDBC connection, run the simple test program, JDBCTester, which is shown in Listing 5-11:

The import statements on lines 8-11 can be inserted automatically when you write lines 20-21 (just click on the tiny red ball on the line number where the import is needed, and then select the import statement to be inserted). Notice that these objects, DriverManager, Connection, Statement, and SQLException are all instances of classes that are defined in the java.sql package.

The three constants defined at lines 14-16 are used by the DriverManager.getConnection() method at line 20. The URL locates and identifies the database. The USR and PWD constants should be initialized with the username and password that you chose when you created the database. See Figure 5-3.

A Statement object is instantiated at line 21. It will contain an SQL statement that can be executed against your database. That code will fit in the area where the blank line 22 is now.

The Java program in Listing 5-12 fills in the gap at line 22 of the program in Listing 5-11:

The SQL statement to be executed is defined as a String object at line 20. Line 23 passes that string to the executeQuery() method of our stmt object, and then saves the results in a ResultSet object.

A ResultSet is a table-like object that provides more than 70 getter methods to access its contents. Since we asked for two fields (name and city) from the Publishers table, and we know that it has six rows, we can conclude that this rs object will act like a six by two array of Strings. The while loop at lines 26-30 will iterate down that array, one row at a time. At each row, we read the two fields separately, at lines 25-27, and then print them at line 28. The output is shown in Figure 5-13:

In the previous example, we used a JDBC statement object to query the Library database. We can also use the more flexible PreparedStatement, which includes variables that can be assigned different values dynamically.

The program in Listing 5-13 shows how to insert new records into a database table. It reads the six lines of data from the CSV file shown in Figure 5-14, inserting the data in each line as a new record into the Authors table of our Library database:

To read from the file, we instantiate a File object at line 24 and a Scanner object at line 28.

At line 25, the SQL insert statement is coded into the sql string using four question marks (?) as place holders. These will be replaced by the actual data at lines 38-41. The PreparedStatement object, ps, is instantiated at line 28 using that sql string.

Each iteration of the while loop at lines 30-43 reads one line from the file. A separate scanner parses the line, reading the four CSV values at lines 33-37. Those four values are then pasted into the PreparedStatement at lines 37-40. On the third iteration, they would be the strings "WirthNik", "Wirth", "Niklaus", and the integer 1934, forming the complete SQL statement:

This is then executed at line 41.

Since each execution of that PreparedStatement changes only one row in the table, the rows counter is incremented (by one) at line 41. So, the output from line 45 is:

The program in Listing 5-13 suggests a general process, using JDBC PreparedStatement to load the complete database from external files. The following examples show how to do that:

The data file shown in Figure 5-15 contains data for 12 complete records for our Publishers table. The file has one record per line, separating the fields by the slash character (/). The program in the following screenshot loads all that data into the Publishers table:

The PreparedStatement is defined at line 24 as the constant string SQL, and the data file is defined as the File object DATA at line 23.

At lines 31-33, we first use anonymous Statement objects to execute delete from statements on the AuthorsBooks, Books, and Publishers tables to empty them. We want the Publishers table to be emptied so that it will eventually contain only those records that are in this data file. And we must first empty the AuthorsBooks and Books tables because of their foreign key constraints (see Figure 5-2). SQL will not allow a record to be deleted if it is referenced by another existing record.

The rest of the code in Listing 5-14 is like that in Listing 5-13, except for the processing of the url field. As we can see in lines 1 and 3 of the data file (Figure 5-15), some Publishers records do not have values for the url field (those publishers went out of business before the internet was invented). In situations like this, the database table should have the NULL value in those missing slots. In this case, we want the NULL value of the varchar() type. That is accomplished by the code at line 50:

We first use the conditional expression operator at line 44 to store the empty string in the url variable if the end-of-line has been reached in parsing the data file. Then the if statement at line 44 does the right thing, whether the url exists or not.

Data for the Authors table is shown in Figure 5-16:

These 25 records can be loaded into the Authors table using the program in Listing 5-15. It works the same way as the program in Listing 5-14.

The Packt website for this book has all these files and programs available for download. It also includes data files and programs for loading the other two tables (Books and AuthorsBooks) for the Library database.

A database view is essentially a virtual table. It is created by embedding a select statement within a create view statement. You can then apply select queries to that view as though it were an actual table or result set.

The query shown in Listing 5-16 lists those authors who have published books in the US. The from clause at line 2 lists all four tables because the query must link fields in the Authors table to fields in the Publishers table (see Figure 5-2). The three conditions at lines 3-5 correspond to the three foreign keys in the database. The order by clause at line 7 sorts the results alphabetically by last name, and the keyword distinct at line 1 removes duplicate rows from the output.

The output from this (or any) query is a virtual table. It has the same structure as an ordinary database table, but its data is not stored here as a separate table. A virtual table stores only references to real data that is stored in real tables elsewhere. Nevertheless, the concept of the virtual table is so useful that SQL allows us to name it and then use it as we would a real table. This is called a database view.

The view named AmericanAuthors is created in Listing 5-17. It simply prefaces the previous select statement with the code create view AmericanAuthors as in line 1.

You can see the resulting view object listing among the other database objects in the NetBeans Services tab window.

Now you can select from the AmericanAuthors view the same way you would from any of the four real tables in the database. The query in Figure 5-18 finds all the authors in that view whose first name begins with the letter J.

Notice the use of the SQL substring function in this query:

This returns the first letter of the firstName string.

In general, the call

will return the substring of the specified string of the specified length, beginning with the nth character. But in SQL, counting begins with 1, not 0 as in Java, so the first character is character number 1.

Views are dynamic. If the contents of the tables from which the view was created change, then subsequent output from queries of that view will change accordingly.

For example, suppose we correct the first name of the author GERSTING from Judith to Judith L. That is done with the SQL update command, as shown in Listing 5-19. Then, if we query our AmericanAuthors view again, we see the change there.

A view is a virtual table—it looks like a table, but it has no data itself. You can run select statements on it, but you cannot run update statements on a view. To change the data, you must update the table where the data resides.

A view is a database object that acts like a virtual table for queries. A subquery is essentially a temporary view that is defined within a query and exists only for that query. So, you can think of a subquery as a temporary virtual table.

The statement in Listing 5-21 creates a view named EuropeanBooks that lists the isbn codes of all the books that were published outside of the US.

The statement in Listing 5-22 contains a subquery at lines 4-5 that will return the virtual table of isbns of all those European books. Then the main select statement, beginning on line 1, returns the authors of those books.

Note that it is not necessary to use a view in the subquery. But it does make the query easier to understand. Like using a method in a Java program, the name of the method itself helps the code clarify what it is doing.

The query in Listing 5-23 illustrates two SQL features. It uses the aggregate function avg(), which returns the average of the value of the specified field, numPages in this case. The query also specifies a label for the output: Average, which appears (in all capital letters) at the top of that output column. The output is the single value 447.

There are five aggregate functions in standard SQL: avg(), sum(), count(), max(), and min().

The query in Listing 5-24 uses the previous code as a subquery, listing those books whose numPages value is above average. It also illustrates two other SQL features. It shows how a clause can follow a subquery; in this case, the order by clause. It also shows how to sort the output in descending order using the desc keyword.

A database table index is an object that is bound to a database table that facilitates searching that table.

Although the actual implementation of an index is usually proprietary and designed by the RDBMS vendor, it can be thought of as a separate file containing a multiway search tree that can locate keyed records in logarithmic time. This means that its lookup time is proportional to the logarithm of the number of records in the table. For example, if a lookup takes three probes into a table of 10,000 records, then it should be able to do a similar lookup into a table of 100,000,000 records in six probes (because log n2 = 2 log n).

Indexes take up extra space, and they take a bit of time to be updated when the table is changed by insertions or deletions. But if a large table is used mostly for lookups, then the database administrator is strongly advised to create an index for it.

A database index is usually implemented as a B-tree (invented by Rudolf Bayer and Ed McCreight when working at Boeing Research Labs in 1971), which stores key values and records addresses. The following figure illustrates this data structure:

This could be an index on a table whose key is a two-digit positive integer. To find the record with key value 94, for example, the search begins at the root of the tree, which is the node at the top. That node contains 61 and 82. Since the target value is greater than those values, the search continues down the link on the right of 82, to the node containing 90 and 96. The target is between those two values, so the search continues down the link that is between them, to the next node. That node contains 90 and 92, so the search follows the link on the right of 92 and finally reaches the leaf node that contains the address of the record with the target key value, 94.

The B-tree in Figure 5-18 has 13 key nodes containing a total of 31 keys at three levels, each node containing two four keys. But B-tree nodes can be configured to contain more than four nodes each. In fact, nodes with 20-40 keys each are not unusual. A B-tree with four levels of 40-key nodes could index a table holding 100,000,000,000 records. And yet, having only four levels means only four probes per search—that's nearly instantaneous!

The Java DB database system that comes with the NetBeans IDE automatically indexes each table that is created with a primary key. As with every object in a database, it has a name. As you can see in Figure 5-19, this index is named SQL170430152720920 in the author's version of the database:

If you are using an RDBMS that does not automatically index its keyed tables, then you can create an index with the SQL create index command:

This will create an index named AuthorsIndex on the id field of the Authors table.