Lesson 6
Diving into SQL

SQL is designed as a universal language to manage data stored in a relational database management system. In this lesson, we will look at some of the features of SQL before jumping into using it heavily to manage data. This includes the topic of working with null values and how to use them in MySQL. We will also take a look at indexing.

INTRODUCTION TO SQL

SQL stands for Structured Query Language, and it can be pronounced like “sequel” or using the letters S‐Q‐L. It is designed to work with any relational database management system (RDBMS), which includes MySQL Server, Microsoft SQL Server, and Oracle Database, among others. Because it works universally across RDBMSs, programming languages such as Java, C#, and PHP support the use of SQL to retrieve and manage data stored in an RDBMS.

As relational database systems have become more prominent and widespread, RDBMS‐specific versions of SQL have been developed to extend its abilities and improve performance within those systems. As an example, Microsoft SQL Server uses a variation named Transact‐SQL (T‐SQL), while Oracle Database uses Procedural Language/SQL (PL/SQL). However, the standard SQL statements covered in this book will run regardless of the system you are using. The SQL variations are typically significant only for complex processes or for data types that are not universally handled across RDBMSs, such as DateTime.

Because standard SQL is universal, it is useful for any software developer in any language to understand how to use it. Having a solid foundation on the basics will also make it easier for you to learn any of the variants, should the need arise in the future.

In SQL, the term statement is often used to refer to any complete command. The term query technically refers to a statement that retrieves data from a database, but many developers also use the term query to refer to any SQL statement. GUIs like MySQL Workbench also typically use query in this way.

Additionally, similar to programming languages, SQL has a set of keywords that are reserved words with special meaning to SQL. For example, SELECT is a SQL keyword used to select data. Keywords are used within statements and queries. A final term to be aware of is clause. A clause is simply a section of a query.

SQL SYNTAX

SQL follows a rigid syntax, where each keyword and clause must be in a specific order. The most common statement is a SELECT statement, which is used to retrieve data from one or more tables in a database. The SELECT statement uses the following basic syntax:

 SELECT field1, field2, field3
 FROM table1
 WHERE criteria
 ORDER BY field1, field2;
 

This statement will retrieve the data from field1, field2, and field3 whose values match the criteria statement in the WHERE clause in a table named table1. The results will be sorted by the values in field1 and field2. Each of these clauses must be included in this order for the query to work correctly.

As SQL statements are introduced in this lesson, keep in mind that the order in which the clauses are written is part of the overall syntax, and a statement may not work if the syntax is not strictly followed.

Semicolon

A standard SQL statement always ends in a semicolon, although RDBMS‐specific versions may vary.

  • MySQL requires the use of a semicolon or /g at the end of each statement.
  • Recent versions of T‐SQL (used in Microsoft SQL Server) do not require semicolons, but they do support them, so it doesn't hurt to include them, especially since SQL Server documentation normally includes a warning that semicolons may be required in future versions.

Basically, it's a good idea to always end a SQL statement with a semicolon, even if it may not be strictly necessary. If you are using MySQL and do not include a required semicolon or /g, it will simply wait patiently until you provide the appropriate syntax.

Line Breaks and Indents

Neither line breaks nor indents are required within SQL statements, although you are encouraged to use them to improve readability in the code. As an example, a table could be created using a single‐line command that ends with a semicolon.

 create table `Client` (ClientId char(36) primary key, FirstName varchar(50) not null, LastName varchar(50) not null, BirthDate date null, Address varchar(256) null, City varchar(100) null, StateAbbr char(2) null, PostalCode varchar(10) null, foreign key fk_Client_StateAbbr (StateAbbr) references State(StateAbbr));
 

However, this format is hard to read, and as a result, it's hard to verify that all the expected fields are included and set up correctly. The same statement can be written with each field on a separate line, making it much easier to verify that the table's settings are correct.

 CREATE TABLE `Client` (
     ClientId CHAR(36) PRIMARY KEY,
     FirstName VARCHAR(50) NOT NULL,
     LastName VARCHAR(50) NOT NULL,
     BirthDate DATE NULL,
     Address VARCHAR(256) NULL,
     City VARCHAR(100) NULL,
     StateAbbr CHAR(2) NULL,
     PostalCode VARCHAR(10) NULL,
     FOREIGN KEY fk_Client_StateAbbr (StateAbbr)
         REFERENCES State(StateAbbr)
 );
 

You can also see how much more readable the statement is with the line breaks. This is why each clause of a SQL statement is typically written on a separate line. For example, the following SELECT statement will retrieve data from named fields in the named table:

 SELECT FirstName, LastName, Address FROM Client;
 

If the statement is written on two separate lines, it can be easier to read.

 SELECT FirstName, LastName, Address
 FROM Client;
 

SQL statements can be complex. When you are writing your own SQL statements, you should add line breaks and indents where they make the most sense to you. If you are working on a team, remember that the line breaks and indents should be meaningful to anyone else who will read the code.

Letter Case

SQL keywords are not case‐sensitive, so you can use SELECT, select, or even SeLect interchangeably when typing SQL statements. That said, it is common to write keywords in ALL CAPS as a way of distinguishing keywords from other parts of a statement, mainly to improve readability.

For example, in the following statement, it is easy to see that SELECT and FROM are keywords.

 SELECT FirstName, LastName, Address
 FROM Client;
 

When referencing parts of a database, it is best practice to use the same case used to name tables, fields, keys, or other objects in the database. Different development teams use different naming conventions within a database, and while some SQL servers are not case‐sensitive at all, others require that names use the same case used in the database itself. In addition, if you are using a case‐sensitive language like Java or C# to access data in a database, you will need to use a case consistent with that system.

In other words, if a table is named Client in the database, you should use Client in SQL statements that reference that table. A statement may run even if you use client instead, but it's good to get in the habit of matching the case for database objects.

Commas

SQL uses commas to separate like objects in a series. For example, in the following CREATE TABLE statement, there is a comma after each field definition:

 CREATE TABLE `Client` (
     ClientId CHAR(36) PRIMARY KEY,
     FirstName VARCHAR(50) NOT NULL,
     LastName VARCHAR(50) NOT NULL,
     BirthDate DATE NULL,
     Address VARCHAR(256) NULL,
     City VARCHAR(100) NULL,
     StateAbbr CHAR(2) NULL,
     PostalCode VARCHAR(10) NULL,
     FOREIGN KEY fk_Client_StateAbbr (StateAbbr)
         REFERENCES State(StateAbbr)
 );
 

Similarly, the following SELECT statement names three fields in the SELECT clause, using commas to separate the field names:

 SELECT FirstName, LastName, Address
 FROM Client;
 

Note that there is no comma after the last item in the series. If you were to add a comma after Address in the last statement, SQL would interpret FROM as another field name and throw an error because there is no field named FROM.

Spaces

SQL requires a space after each logical word in a statement, including keywords and object names. Spaces are not required around commas or parentheses, although they can be used to improve readability.

As a general best practice for any RDBMS, names of objects like tables, fields, and indexes should not include spaces. However, if the database designer included them anyway, the name of that object must be inside quotation marks, like this:

 SELECT "First Name", "Last Name", Address
 FROM Client;
 

Quotation Marks

When used, quotation marks may be double quotes (“ “) or single quotes (' '), as long as they are used in pairs. In the previous example, you saw how quotes can be used to identify fields whose names include spaces or other unusual characters, but they are also used to identify string values when adding data to a table or when using a string in a conditional statement. For example, if you wanted to find a client with the last name Smith, either of the following queries will work:

 SELECT FirstName, LastName, Address
 FROM Client
 WHERE LastName = "smith";
 

or

 SELECT FirstName, LastName, Address
 FROM Client
 WHERE LastName = 'smith';
 

You cannot, however, mismatch them. The following query will not work:

 SELECT FirstName, LastName, Address
 FROM Client
 WHERE LastName = 'smith";
 

Spelling

We've saved spelling for the last of the SQL syntax topics because it's both the most obvious and the most common source of errors. All keywords must be spelled correctly, and all object names must be spelled exactly as they appear in the database (even if they might be misspelled in the database). If a statement doesn't run as expected, double‐check the spelling of every word.

WORKING WITH NULL VALUES

In addition to understanding the syntax for SQL statements, a critical concept to understand in any database is that of null values. When a table is defined, one of the key characteristics of any field is whether a value is required for that field in each record. A variety of terms are used for that concept, including required (which means that there must be a value for every record), nullable (which means that a value is optional), and the official definition NOT NULL, which means that a value is required.

Essentially, a null value is an empty value: it contains no value at all.

Null vs. Zero

null is not the same as zero. Zero is a value, while null is not. Let's look at a couple of examples.

Suppose you are working on a candy vendor's database, and the Product table includes a Price field. Table 6.1 contains a few sample records.

Table 6.1 Product Table for Candy Vendor's Database

ProductID ProductName Price
001 Cherry Lollipop 0.5
002 Honey Bit 0
003 Chocolate Toffee

In looking at the Product table, if a customer purchases a cherry lollipop, they will be charged 50 cents. From the table, you can see that honey bits are free because their price is 0. The question to ponder, however, is how much would the store charge for a chocolate toffee?

Because the price is null, we don't know what its price is. Furthermore, we don't know why there is no value. It could be that the data entry person forgot to include the price. It could also be that for whatever reason someone chose not to determine the price when the product was added to the database. If we want to sell chocolate toffees, though, a real price will eventually need to be defined.

Nullable Fields

When a table for any RDBMS is defined, it must be determined whether each field in that table is nullable. The decision mostly lies in how the database will be used. Remember, a nullable field is a field that makes it optional to include a value.

The one hard and fast rule about nullables is that any field used in a primary key is not nullable. Entity integrity requires that there is a value for every primary key, and this is a default setting when a field is defined as a primary key.

All other fields in a table are nullable by default. In other words, if you don't set a NOT NULL attribute on a field, SQL will allow users to leave those fields empty. This means that it is part of the database designer's job to identify which fields are nullable for the purposes of that database when they define the tables.

In the previous candy shop example, the price field should be set to NOT NULL, if only so that it is known how much to charge customers for each type of candy. ProductName should also be required, to avoid entries like this:

ProductID ProductName Price
004 3

In this example, you can see that product 004 costs $3, but we have no idea what the name of the candy is. In this example, it should be clear that all three fields should have the NOT NULL attribute. The first field, ProductID, cannot be null because it is the primary key. Neither of the other fields should be null for the reasons just described. The table could be defined with the SQL code shown in Listing 6.1.

The code in Listing 16.1 creates a new table called Product that is defined to have three fields: ProductID, which is a string and the primary key; ProductName, which is a variable string of characters up to 25 characters long; and Price, which contains a floating‐point number. All three field definitions include NOT NULL. The inclusion of NOT NULL would not prevent someone from adding a price of 0, but it would prevent someone from not including the product name or the price.

In other cases, though, nulls are more acceptable. For example, consider a personal contact list, like the one provided for most smartphones. Because these are for individual use, all the fields are nullable, allowing the user to enter only the data they have and need for each person or company in the contact list. Here are some examples:

  • John contacts Mary about an item she is selling on Craigslist. Mary wants John's phone number so that she can text him (and know who it is when he texts her), but she doesn't need a last name or address. That entry will include only a first name and a mobile number, with possibly a note about the item on Craigslist.
  • Jack wants to add a plumbing company to his contact list so that he can call the company when he needs service. He doesn't have a specific person's name, so he enters the name of the company in the Company field, leaving the name fields empty. He then adds the company's phone number and website but leaves all remaining fields empty.
  • Marcos finds and buys a collectible baseball card from a seller on Etsy. He wants to keep track of the seller in case they have more cards he wants to buy in the future, so he adds the seller's first and last name to his contact list, along with the name of the store, the email address, and the store's web address.
  • Elizabeth adds a good friend to her contact list, including the person's first and last name, mailing address, email address, birthday, and phone number.

All of these examples include null values for fields that are not important for that entry. When designing a database, you must consider the purpose of the database itself, as well as the importance of each field to individual users of the database.

Consequences of Null Values

Choosing to allow nulls in a table has consequences on the efficiency of the database. A phone's contact list likely does not use a relational database structure (even though the concepts are the same), so it can allow as many nulls as the user wants. However, relational databases are highly structured, so nulls have more impact.

RDBMSs set aside storage space for each field in a record, based on the defined size of the field. For example, an INT field requires 4 bytes of space, while a DATETIME field requires 8 bytes of space. Taken individually, these seem like very small numbers, but when you consider that a table can have hundreds of thousands of records, it multiplies quickly.

Note that this is reserved space, not used space, and the RDBMS will use that much storage regardless of the actual value stored in each of those fields, and even if nothing is stored there. Each time a record is added that includes a null value, space is set aside that will not be used.

In the long run, if there are relatively few nullable fields in a table and if the data is likely to use fields even when they are nullable, this is a reasonable trade‐off, because it allows data to quickly be added to those fields if they are to be used in the future.

If, however, a table has a considerable number of nullable fields, it might be worth creating a separate table that includes only those fields, allowing users to create records with those fields only as necessary.

As an example, let's extend the candy shop database. Most of the products are straightforward with a name and a price. Let's branch out the inventory a bit to sell stickers as well as candy. For candy, we want to know the name and price, but also the flavor, the size of the package, the texture, and other candy‐related attributes. For stickers, we might want to know the size, but flavor and texture would be irrelevant. If all these attributes are put into one table, each record will include many null values, as shown in Table 6.2.

Table 6.2 Product Table for Candy and Stickers

ProductID ProductName Price Weight Width Length Flavor Texture
001 Cherry Lollipop 0.5 1.5 oz null null cherry Hard
002 Honey Bit 0.75 3 oz null null Honey Chewy
003 Chocolate Toffee 2.50 6 oz null null Chocolate Hard
004 Sponge Bob .99 null 3 3 null null
005 Robot .99 null 2 4 null null
006 Red hots 1.19 1.2 oz Cinnamon hard
007 Unicorn 1.19 null 4 3 null null

To solve this problem, the attributes could be split across multiple tables. The Product table would include the attributes common to all products.

  • ProductID
  • ProductName
  • Price

We would also have a Sticker table, which includes only the attributes relevant to stickers.

  • ProductID
  • Width
  • Length

Finally, we would have a Candy table that includes only the attributes for candy.

  • ProductID
  • Flavor
  • PackageWeight
  • Texture

You can see that each table includes ProductID to identify what specific product each record describes, but the result is that the number of null values is reduced by using only the table that is appropriate for each specific table. Table 6.3, Table 6.4, and Table 6.5 show the data from reorganizing Table 6.2.

Table 6.3 New Product Table

ProductID ProductName Price
001 Cherry Lollipop 0.5
002 Honey Bit 0.75
003 Chocolate Toffee 2.50
004 Sponge Bob .99
005 Robot .99
006 Red hots 1.19
007 Unicorn 1.19

Table 6.4 Sticker Table

ProductID Length Width
004 3 3
005 2 4
007 4 3

Table 6.5 Candy Table

ProductID PackageWeight Flavor Texture
001 1.5 oz cherry Hard
002 3 oz Honey Chewy
003 6 oz Chocolate Hard
006 1.2 oz Cinnamon hard

Understanding nulls and how they impact the efficiency of a database is an important part of database design. However, as with the denormalization processes, you have to consider how much efficiency reducing null fields gives us. If you routinely pull related data from two or more tables, it might make more sense to combine those fields into a single table to improve the amount of time required to retrieve data.

Using the example given in this lesson, if the majority of the stock is candy while stickers make up a very small percentage, it might make more sense to put everything in one table anyway, knowing that some of the fields will be empty. That would improve the amount of time it takes to retrieve data from that table, while trading off the amount of empty space that the nulls will create in storage.

WORKING WITH INDEXES

Indexing is a complex but essential process in any relational database system. Experienced database developers can use indexing to improve the efficiency of retrieving data from a database, but new database designers must understand what it is and how it affects design decisions.

Chances are good that at some point in your life, you have used the index of a book to find out more information on a topic included in that book. Most textbooks include an index of major topics, for example, but recipe books use indexes to help users find recipes based on ingredient or cuisine, while atlases use indexes to help users find specific maps based on location name.

A good index has the following characteristics:

  • It is easy to find. In printed books, the index is typically at the end of a book, but it might also appear at the beginning. It is never located in a random location somewhere in between.
  • It includes only values the user is likely to look for. A textbook index will tell you what pages to look at for specific concepts, but it does not include every word that might appear in the book. Consider how long an index would be if every instance of the words the or that were included or if a cookbook index listed every recipe that includes salt.
  • It has the indexed values sorted in a way that makes it easy for the user to find specific values. This sorting is normally in alphabetical order. The index values might also be grouped so that subtopics appear under major topics, such as countries in Europe or recipes that use beef as a major ingredient.

Relational databases use index values in much the same way and for the same reasons. Before covering the indexing of fields, let's first talk about storage.

Primary vs. Secondary Storage

For relational databases, it is important to understand the difference between primary storage and secondary storage. Primary storage includes memory or RAM in a computer system, which tends to store data for only a short time. Secondary storage includes things such as hard drives, flash drives, and solid‐state drives, which can store things for a longer period of time including when the computer is not powered. For the sake of clarity, the term memory will be used to refer to a system's short‐term storage and storage to refer to long‐term storage.

Memory is the space where a computer holds the instructions and data it is currently working with. Memory is significantly faster than other forms of storage, which makes it ideal for those things that the computer needs immediately. However, memory is one of the most expensive components of a computer system, so how much is available is often limited to reduce those costs. For this reason, there is normally significantly less memory in any computer system than there is storage. Memory is also volatile, which means that as soon as the program using memory closes (or the computer is shut down for any reason), all data and instructions stored in memory are erased.

Storage, especially hard drive storage, is relatively inexpensive these days. It is also nonvolatile, meaning that anything we store there will be available the next time the computer is logged into or an application is started, barring physical damage to the drive. This makes it ideal for storing data that will be wanted for use again later. However, storage can be very slow both in terms of writing/recording data and retrieving data that is stored on the disk, which makes it less than ideal if users need to retrieve data from a large database.

To improve a database's efficiency, you need to take advantage of both technologies: data most likely to be used should be in memory when the database opens, while data that isn't likely to be used remains in storage until it is needed.

This is where indexing fields come into play.

Indexing Fields

Just like the index entries in a book tell you what page to look at for a specific term, map, or recipe, an indexed field tells the database where to look for related data when the user wants to retrieve data using an indexed value. When fields are indexed, the database engine is being told that the values in those fields need to be loaded into memory (only to the extent that memory is available) when the database opens. Unindexed data remains on the hard drive until it is retrieved by the user, and only the relevant data is retrieved at that time.

A gut reaction to finding out that data stored in memory can be accessed more quickly than data in storage is to try to index everything, so that everything is always immediately available to the user. Remember, though, that most systems have a limited amount of memory. If you try to add too much content to memory, that extra content goes into a swap disk, a portion of the hard drive that most systems use for memory overflow. Because that process puts the data back on the hard drive, you end up back where you started, and you gain no efficiency with that approach. In essence, this is equivalent to a textbook's index that includes every word that appears in the book, including minor words like the and that.

Instead, an experienced database designer will look at how each field is used and index only those fields that users are most likely to need. For example, in a customer database, a user is more likely to look up people by their last name than by their first name, so the last name field is a better candidate for indexing than the first name is. The database engine will automatically load all values in an indexed field into memory when the database is opened, making it faster to find those values. Users can retrieve the first name associated with any given last name more quickly than if the last name was not indexed.

This approach does not prevent users from searching by first name rather than by last name, but the search will be slower because the database will have to search through storage rather than through memory.

Default Indexes

By default, an RDBMS automatically indexes key fields, including primary and foreign keys. These fields are essential for retrieving data across tables, and indexing those fields helps make queries more efficient.

Guidelines state that the order of the rows themselves is not important in a table. While no table should be set up so that the values in one row are dependent on another row in the same table, indexes typically control the order in which the rows appear in a search.

When a table's indexed fields are loaded into memory, the database sorts the values in the order as specified in the index definition—alphabetically if the indexed values are strings or numerically if the values are numbers. This means that the values are in a predictable order, making searches much more efficient. You can see this when you retrieve data from a table without specifying a sort order: regardless of the order in which the records are added to a table, the results will default to being sorted by the primary key values.

This sort order serves the same purpose as putting a book's index in alphabetical order. As a user, you do not have to look at every value in the index to find the one you want. Chances are good that you skip through the index by column or page until you are close to what you want to find and then pay attention to individual values only when you are within a few records of the term you are trying to find. Database searches work the same way, using common search algorithms to quickly find specific values, rather than reading each individual value until it finds the one it needs.

This is also why having too many indexed fields slows down the write processes of saving new data to the database. Indexed fields are stored in a predictable order to make it easier to retrieve data based on those indexes. Every time a new indexed value is added to a table, the database must reorganize the data in that table to match the expected order. Adding records using an auto‐incremented key helps improve write efficiency because it guarantees that each new record will be added at the end of the primary table.

Unique and Nonunique Indexes

Indexes can also be used to control allowable values in a column. For example, a primary key index by definition verifies that each value used in that field is unique within the table. A foreign key index does not check for the unique quality, but it does automatically enforce referential integrity. This means that whenever a value is entered as a foreign key, the database engine will confirm that the value references the primary key of a related table, which helps improve data integrity across tables.

Unique indexes can also be used in other fields. For example, you might want to verify that no two people in a list of employees have the same Social Security number or email address. This can be done by setting a unique index on the field that should be unique, without also defining that field as a primary key.

An index on a last name field, however, should not be unique, mainly because many people can have the same last name. If last name is frequently used for searches, the database designer can add a nonunique index to that field.

SUMMARY

SQL is designed to be a very human‐friendly language (for English speakers, anyway), and most of its syntax rules are easy to use. While there are a few variations from one version of SQL to the next, the guidelines here will apply to virtually every version of SQL.

When creating your database's tables, it is important to review the fields that will be included. Because fields in a table can contain null values by default, a design review should include determining whether fields should be tagged as NOT NULL so that a value is required. For primary key fields, NOT NULL is required. Additionally, if your fields might contain a large number of null elements, then you might need to redesign your database to reduce the amount of storage space needed.

Relational databases generally require large amounts of long‐term storage, which means that searching through databases is inherently slow. Efficiency of searches can be improved by assigning indexes to frequently used fields.

Data in indexed fields, including primary and foreign keys, is automatically loaded into memory when a database is opened, meaning that the database can quickly access those values to make searches more efficient. In addition, database designers can identify additional fields to act as indexes, especially fields that users are likely to access frequently.

EXERCISES

The following exercises are provided to allow you to experiment with concepts presented in this lesson:

Exercise 6.1: Remember Your Lines

Reformat the following line of SQL code so that it is presented with a cleaner presentation that is easier to read.

 SELECT Last, First, Email, MobileNumber FROM Contacts WHERE Age>= 21 ORDERED BY last, First;
 

Exercise 6.2: Contact Questions

Look at the following table:

CREATE TABLE Contact (
     ID    INT NOT NULL PRIMARY KEY,
     Last  VARCHAR(50) NOT NULL,
     First VARCHAR(40),
     Age   INT, 
     Email VARCHAR(100),
     MobileNumber VARCHAR(12),
     HomeNumber   VARCHAR(12),
     WorkNumber   VARCHAR(12)
 );
 

Answer the following questions:

  • Which fields are required in the table?
  • What is the data type of each field?
  • What is the longest last name that can be stored in a field?
  • Is age required?
  • What would need to be changed to make WorkNumber required?

Exercise 6.3: Missing Contact

In this lesson, you learned about null fields. The database for contacts shown in Table 6.6 allows for null fields. This database, however, has a lot of wasted space. Redesign this database so that it uses multiple tables in a manner to be more efficient with storage space.

Table 6.6 Contacts Table

ID Last First Email MobileNumber HomeNumber WorkNumber Fax
c001 Jones John john@bogus.com 317‐555‐1212 317‐555‐1213 317‐555‐1214 null
c002 Buford Bob Null null null 415‐555‐3333 null
c003 Smith Sam Sam@bogus.com null 415‐555‐1212 null null
c004 Michaels Mitch Null 415‐555‐2121 null null null
c005 Andrews Adam Adam@bogus.com 698‐555‐1212 null null null
c006 Finkelstein Fred Null null 217‐555‐4340 null null
c007 Black Brent brent@bogus.com null null null null