SQL is designed as a universal language to manage data stored in a relational database management system. In this lesson, we will look at some of the features of SQL before jumping into using it heavily to manage data. This includes the topic of working with null values and how to use them in MySQL. We will also take a look at indexing.
SQL stands for Structured Query Language, and it can be pronounced like “sequel” or using the letters S‐Q‐L. It is designed to work with any relational database management system (RDBMS), which includes MySQL Server, Microsoft SQL Server, and Oracle Database, among others. Because it works universally across RDBMSs, programming languages such as Java, C#, and PHP support the use of SQL to retrieve and manage data stored in an RDBMS.
As relational database systems have become more prominent and widespread, RDBMS‐specific versions of SQL have been developed to extend its abilities and improve performance within those systems. As an example, Microsoft SQL Server uses a variation named Transact‐SQL (T‐SQL), while Oracle Database uses Procedural Language/SQL (PL/SQL). However, the standard SQL statements covered in this book will run regardless of the system you are using. The SQL variations are typically significant only for complex processes or for data types that are not universally handled across RDBMSs, such as DateTime.
Because standard SQL is universal, it is useful for any software developer in any language to understand how to use it. Having a solid foundation on the basics will also make it easier for you to learn any of the variants, should the need arise in the future.
In SQL, the term statement is often used to refer to any complete command. The term query technically refers to a statement that retrieves data from a database, but many developers also use the term query to refer to any SQL statement. GUIs like MySQL Workbench also typically use query in this way.
Additionally, similar to programming languages, SQL has a set of keywords that are reserved words with special meaning to SQL. For example, SELECT
is a SQL keyword used to select data. Keywords are used within statements and queries. A final term to be aware of is clause. A clause is simply a section of a query.
SQL follows a rigid syntax, where each keyword and clause must be in a specific order. The most common statement is a SELECT
statement, which is used to retrieve data from one or more tables in a database. The SELECT
statement uses the following basic syntax:
SELECT field1, field2, field3
FROM table1
WHERE criteria
ORDER BY field1, field2;
This statement will retrieve the data from field1, field2, and field3 whose values match the criteria statement in the WHERE
clause in a table named table1. The results will be sorted by the values in field1 and field2. Each of these clauses must be included in this order for the query to work correctly.
As SQL statements are introduced in this lesson, keep in mind that the order in which the clauses are written is part of the overall syntax, and a statement may not work if the syntax is not strictly followed.
A standard SQL statement always ends in a semicolon, although RDBMS‐specific versions may vary.
/g
at the end of each statement.Basically, it's a good idea to always end a SQL statement with a semicolon, even if it may not be strictly necessary. If you are using MySQL and do not include a required semicolon or /g
, it will simply wait patiently until you provide the appropriate syntax.
Neither line breaks nor indents are required within SQL statements, although you are encouraged to use them to improve readability in the code. As an example, a table could be created using a single‐line command that ends with a semicolon.
create table `Client` (ClientId char(36) primary key, FirstName varchar(50) not null, LastName varchar(50) not null, BirthDate date null, Address varchar(256) null, City varchar(100) null, StateAbbr char(2) null, PostalCode varchar(10) null, foreign key fk_Client_StateAbbr (StateAbbr) references State(StateAbbr));
However, this format is hard to read, and as a result, it's hard to verify that all the expected fields are included and set up correctly. The same statement can be written with each field on a separate line, making it much easier to verify that the table's settings are correct.
CREATE TABLE `Client` (
ClientId CHAR(36) PRIMARY KEY,
FirstName VARCHAR(50) NOT NULL,
LastName VARCHAR(50) NOT NULL,
BirthDate DATE NULL,
Address VARCHAR(256) NULL,
City VARCHAR(100) NULL,
StateAbbr CHAR(2) NULL,
PostalCode VARCHAR(10) NULL,
FOREIGN KEY fk_Client_StateAbbr (StateAbbr)
REFERENCES State(StateAbbr)
);
You can also see how much more readable the statement is with the line breaks. This is why each clause of a SQL statement is typically written on a separate line. For example, the following SELECT
statement will retrieve data from named fields in the named table:
SELECT FirstName, LastName, Address FROM Client;
If the statement is written on two separate lines, it can be easier to read.
SELECT FirstName, LastName, Address
FROM Client;
SQL statements can be complex. When you are writing your own SQL statements, you should add line breaks and indents where they make the most sense to you. If you are working on a team, remember that the line breaks and indents should be meaningful to anyone else who will read the code.
SQL keywords are not case‐sensitive, so you can use SELECT
, select
, or even SeLect
interchangeably when typing SQL statements. That said, it is common to write keywords in ALL CAPS as a way of distinguishing keywords from other parts of a statement, mainly to improve readability.
For example, in the following statement, it is easy to see that SELECT
and FROM
are keywords.
SELECT FirstName, LastName, Address
FROM Client;
When referencing parts of a database, it is best practice to use the same case used to name tables, fields, keys, or other objects in the database. Different development teams use different naming conventions within a database, and while some SQL servers are not case‐sensitive at all, others require that names use the same case used in the database itself. In addition, if you are using a case‐sensitive language like Java or C# to access data in a database, you will need to use a case consistent with that system.
In other words, if a table is named Client
in the database, you should use Client
in SQL statements that reference that table. A statement may run even if you use client
instead, but it's good to get in the habit of matching the case for database objects.
SQL uses commas to separate like objects in a series. For example, in the following CREATE TABLE
statement, there is a comma after each field definition:
CREATE TABLE `Client` (
ClientId CHAR(36) PRIMARY KEY,
FirstName VARCHAR(50) NOT NULL,
LastName VARCHAR(50) NOT NULL,
BirthDate DATE NULL,
Address VARCHAR(256) NULL,
City VARCHAR(100) NULL,
StateAbbr CHAR(2) NULL,
PostalCode VARCHAR(10) NULL,
FOREIGN KEY fk_Client_StateAbbr (StateAbbr)
REFERENCES State(StateAbbr)
);
Similarly, the following SELECT
statement names three fields in the SELECT
clause, using commas to separate the field names:
SELECT FirstName, LastName, Address
FROM Client;
Note that there is no comma after the last item in the series. If you were to add a comma after Address
in the last statement, SQL would interpret FROM
as another field name and throw an error because there is no field named FROM
.
SQL requires a space after each logical word in a statement, including keywords and object names. Spaces are not required around commas or parentheses, although they can be used to improve readability.
As a general best practice for any RDBMS, names of objects like tables, fields, and indexes should not include spaces. However, if the database designer included them anyway, the name of that object must be inside quotation marks, like this:
SELECT "First Name", "Last Name", Address
FROM Client;
When used, quotation marks may be double quotes (“ “) or single quotes (' '), as long as they are used in pairs. In the previous example, you saw how quotes can be used to identify fields whose names include spaces or other unusual characters, but they are also used to identify string values when adding data to a table or when using a string in a conditional statement. For example, if you wanted to find a client with the last name Smith, either of the following queries will work:
SELECT FirstName, LastName, Address
FROM Client
WHERE LastName = "smith";
or
SELECT FirstName, LastName, Address
FROM Client
WHERE LastName = 'smith';
You cannot, however, mismatch them. The following query will not work:
SELECT FirstName, LastName, Address
FROM Client
WHERE LastName = 'smith";
We've saved spelling for the last of the SQL syntax topics because it's both the most obvious and the most common source of errors. All keywords must be spelled correctly, and all object names must be spelled exactly as they appear in the database (even if they might be misspelled in the database). If a statement doesn't run as expected, double‐check the spelling of every word.
In addition to understanding the syntax for SQL statements, a critical concept to understand in any database is that of null values. When a table is defined, one of the key characteristics of any field is whether a value is required for that field in each record. A variety of terms are used for that concept, including required (which means that there must be a value for every record), nullable (which means that a value is optional), and the official definition NOT NULL
, which means that a value is required.
Essentially, a null value is an empty value: it contains no value at all.
null
is not the same as zero. Zero is a value, while null
is not. Let's look at a couple of examples.
Suppose you are working on a candy vendor's database, and the Product table includes a Price field. Table 6.1 contains a few sample records.
Table 6.1 Product Table for Candy Vendor's Database
ProductID | ProductName | Price |
---|---|---|
001 | Cherry Lollipop | 0.5 |
002 | Honey Bit | 0 |
003 | Chocolate Toffee |
In looking at the Product table, if a customer purchases a cherry lollipop, they will be charged 50 cents. From the table, you can see that honey bits are free because their price is 0. The question to ponder, however, is how much would the store charge for a chocolate toffee?
Because the price is null, we don't know what its price is. Furthermore, we don't know why there is no value. It could be that the data entry person forgot to include the price. It could also be that for whatever reason someone chose not to determine the price when the product was added to the database. If we want to sell chocolate toffees, though, a real price will eventually need to be defined.
When a table for any RDBMS is defined, it must be determined whether each field in that table is nullable. The decision mostly lies in how the database will be used. Remember, a nullable field is a field that makes it optional to include a value.
The one hard and fast rule about nullables is that any field used in a primary key is not nullable. Entity integrity requires that there is a value for every primary key, and this is a default setting when a field is defined as a primary key.
All other fields in a table are nullable by default. In other words, if you don't set a NOT NULL
attribute on a field, SQL will allow users to leave those fields empty. This means that it is part of the database designer's job to identify which fields are nullable for the purposes of that database when they define the tables.
In the previous candy shop example, the price field should be set to NOT NULL
, if only so that it is known how much to charge customers for each type of candy. ProductName should also be required, to avoid entries like this:
ProductID | ProductName | Price |
---|---|---|
004 | 3 |
In this example, you can see that product 004 costs $3, but we have no idea what the name of the candy is. In this example, it should be clear that all three fields should have the NOT NULL
attribute. The first field, ProductID, cannot be null because it is the primary key. Neither of the other fields should be null for the reasons just described. The table could be defined with the SQL code shown in Listing 6.1.
The code in Listing 16.1 creates a new table called Product that is defined to have three fields: ProductID, which is a string and the primary key; ProductName, which is a variable string of characters up to 25 characters long; and Price, which contains a floating‐point number. All three field definitions include NOT NULL
. The inclusion of NOT NULL
would not prevent someone from adding a price of 0, but it would prevent someone from not including the product name or the price.
In other cases, though, nulls are more acceptable. For example, consider a personal contact list, like the one provided for most smartphones. Because these are for individual use, all the fields are nullable, allowing the user to enter only the data they have and need for each person or company in the contact list. Here are some examples:
All of these examples include null values for fields that are not important for that entry. When designing a database, you must consider the purpose of the database itself, as well as the importance of each field to individual users of the database.
Choosing to allow nulls in a table has consequences on the efficiency of the database. A phone's contact list likely does not use a relational database structure (even though the concepts are the same), so it can allow as many nulls as the user wants. However, relational databases are highly structured, so nulls have more impact.
RDBMSs set aside storage space for each field in a record, based on the defined size of the field. For example, an INT field requires 4 bytes of space, while a DATETIME field requires 8 bytes of space. Taken individually, these seem like very small numbers, but when you consider that a table can have hundreds of thousands of records, it multiplies quickly.
Note that this is reserved space, not used space, and the RDBMS will use that much storage regardless of the actual value stored in each of those fields, and even if nothing is stored there. Each time a record is added that includes a null value, space is set aside that will not be used.
In the long run, if there are relatively few nullable fields in a table and if the data is likely to use fields even when they are nullable, this is a reasonable trade‐off, because it allows data to quickly be added to those fields if they are to be used in the future.
If, however, a table has a considerable number of nullable fields, it might be worth creating a separate table that includes only those fields, allowing users to create records with those fields only as necessary.
As an example, let's extend the candy shop database. Most of the products are straightforward with a name and a price. Let's branch out the inventory a bit to sell stickers as well as candy. For candy, we want to know the name and price, but also the flavor, the size of the package, the texture, and other candy‐related attributes. For stickers, we might want to know the size, but flavor and texture would be irrelevant. If all these attributes are put into one table, each record will include many null values, as shown in Table 6.2.
Table 6.2 Product Table for Candy and Stickers
ProductID | ProductName | Price | Weight | Width | Length | Flavor | Texture |
---|---|---|---|---|---|---|---|
001 | Cherry Lollipop | 0.5 | 1.5 oz | null | null | cherry | Hard |
002 | Honey Bit | 0.75 | 3 oz | null | null | Honey | Chewy |
003 | Chocolate Toffee | 2.50 | 6 oz | null | null | Chocolate | Hard |
004 | Sponge Bob | .99 | null | 3 | 3 | null | null |
005 | Robot | .99 | null | 2 | 4 | null | null |
006 | Red hots | 1.19 | 1.2 oz | Cinnamon | hard | ||
007 | Unicorn | 1.19 | null | 4 | 3 | null | null |
To solve this problem, the attributes could be split across multiple tables. The Product table would include the attributes common to all products.
We would also have a Sticker table, which includes only the attributes relevant to stickers.
Finally, we would have a Candy table that includes only the attributes for candy.
You can see that each table includes ProductID to identify what specific product each record describes, but the result is that the number of null values is reduced by using only the table that is appropriate for each specific table. Table 6.3, Table 6.4, and Table 6.5 show the data from reorganizing Table 6.2.
Table 6.3 New Product Table
ProductID | ProductName | Price |
---|---|---|
001 | Cherry Lollipop | 0.5 |
002 | Honey Bit | 0.75 |
003 | Chocolate Toffee | 2.50 |
004 | Sponge Bob | .99 |
005 | Robot | .99 |
006 | Red hots | 1.19 |
007 | Unicorn | 1.19 |
Table 6.4 Sticker Table
ProductID | Length | Width |
---|---|---|
004 | 3 | 3 |
005 | 2 | 4 |
007 | 4 | 3 |
Table 6.5 Candy Table
ProductID | PackageWeight | Flavor | Texture |
---|---|---|---|
001 | 1.5 oz | cherry | Hard |
002 | 3 oz | Honey | Chewy |
003 | 6 oz | Chocolate | Hard |
006 | 1.2 oz | Cinnamon | hard |
Understanding nulls and how they impact the efficiency of a database is an important part of database design. However, as with the denormalization processes, you have to consider how much efficiency reducing null fields gives us. If you routinely pull related data from two or more tables, it might make more sense to combine those fields into a single table to improve the amount of time required to retrieve data.
Using the example given in this lesson, if the majority of the stock is candy while stickers make up a very small percentage, it might make more sense to put everything in one table anyway, knowing that some of the fields will be empty. That would improve the amount of time it takes to retrieve data from that table, while trading off the amount of empty space that the nulls will create in storage.
Indexing is a complex but essential process in any relational database system. Experienced database developers can use indexing to improve the efficiency of retrieving data from a database, but new database designers must understand what it is and how it affects design decisions.
Chances are good that at some point in your life, you have used the index of a book to find out more information on a topic included in that book. Most textbooks include an index of major topics, for example, but recipe books use indexes to help users find recipes based on ingredient or cuisine, while atlases use indexes to help users find specific maps based on location name.
A good index has the following characteristics:
Relational databases use index values in much the same way and for the same reasons. Before covering the indexing of fields, let's first talk about storage.
For relational databases, it is important to understand the difference between primary storage and secondary storage. Primary storage includes memory or RAM in a computer system, which tends to store data for only a short time. Secondary storage includes things such as hard drives, flash drives, and solid‐state drives, which can store things for a longer period of time including when the computer is not powered. For the sake of clarity, the term memory will be used to refer to a system's short‐term storage and storage to refer to long‐term storage.
Memory is the space where a computer holds the instructions and data it is currently working with. Memory is significantly faster than other forms of storage, which makes it ideal for those things that the computer needs immediately. However, memory is one of the most expensive components of a computer system, so how much is available is often limited to reduce those costs. For this reason, there is normally significantly less memory in any computer system than there is storage. Memory is also volatile, which means that as soon as the program using memory closes (or the computer is shut down for any reason), all data and instructions stored in memory are erased.
Storage, especially hard drive storage, is relatively inexpensive these days. It is also nonvolatile, meaning that anything we store there will be available the next time the computer is logged into or an application is started, barring physical damage to the drive. This makes it ideal for storing data that will be wanted for use again later. However, storage can be very slow both in terms of writing/recording data and retrieving data that is stored on the disk, which makes it less than ideal if users need to retrieve data from a large database.
To improve a database's efficiency, you need to take advantage of both technologies: data most likely to be used should be in memory when the database opens, while data that isn't likely to be used remains in storage until it is needed.
This is where indexing fields come into play.
Just like the index entries in a book tell you what page to look at for a specific term, map, or recipe, an indexed field tells the database where to look for related data when the user wants to retrieve data using an indexed value. When fields are indexed, the database engine is being told that the values in those fields need to be loaded into memory (only to the extent that memory is available) when the database opens. Unindexed data remains on the hard drive until it is retrieved by the user, and only the relevant data is retrieved at that time.
A gut reaction to finding out that data stored in memory can be accessed more quickly than data in storage is to try to index everything, so that everything is always immediately available to the user. Remember, though, that most systems have a limited amount of memory. If you try to add too much content to memory, that extra content goes into a swap disk, a portion of the hard drive that most systems use for memory overflow. Because that process puts the data back on the hard drive, you end up back where you started, and you gain no efficiency with that approach. In essence, this is equivalent to a textbook's index that includes every word that appears in the book, including minor words like the and that.
Instead, an experienced database designer will look at how each field is used and index only those fields that users are most likely to need. For example, in a customer database, a user is more likely to look up people by their last name than by their first name, so the last name field is a better candidate for indexing than the first name is. The database engine will automatically load all values in an indexed field into memory when the database is opened, making it faster to find those values. Users can retrieve the first name associated with any given last name more quickly than if the last name was not indexed.
This approach does not prevent users from searching by first name rather than by last name, but the search will be slower because the database will have to search through storage rather than through memory.
By default, an RDBMS automatically indexes key fields, including primary and foreign keys. These fields are essential for retrieving data across tables, and indexing those fields helps make queries more efficient.
Guidelines state that the order of the rows themselves is not important in a table. While no table should be set up so that the values in one row are dependent on another row in the same table, indexes typically control the order in which the rows appear in a search.
When a table's indexed fields are loaded into memory, the database sorts the values in the order as specified in the index definition—alphabetically if the indexed values are strings or numerically if the values are numbers. This means that the values are in a predictable order, making searches much more efficient. You can see this when you retrieve data from a table without specifying a sort order: regardless of the order in which the records are added to a table, the results will default to being sorted by the primary key values.
This sort order serves the same purpose as putting a book's index in alphabetical order. As a user, you do not have to look at every value in the index to find the one you want. Chances are good that you skip through the index by column or page until you are close to what you want to find and then pay attention to individual values only when you are within a few records of the term you are trying to find. Database searches work the same way, using common search algorithms to quickly find specific values, rather than reading each individual value until it finds the one it needs.
This is also why having too many indexed fields slows down the write processes of saving new data to the database. Indexed fields are stored in a predictable order to make it easier to retrieve data based on those indexes. Every time a new indexed value is added to a table, the database must reorganize the data in that table to match the expected order. Adding records using an auto‐incremented key helps improve write efficiency because it guarantees that each new record will be added at the end of the primary table.
Indexes can also be used to control allowable values in a column. For example, a primary key index by definition verifies that each value used in that field is unique within the table. A foreign key index does not check for the unique quality, but it does automatically enforce referential integrity. This means that whenever a value is entered as a foreign key, the database engine will confirm that the value references the primary key of a related table, which helps improve data integrity across tables.
Unique indexes can also be used in other fields. For example, you might want to verify that no two people in a list of employees have the same Social Security number or email address. This can be done by setting a unique index on the field that should be unique, without also defining that field as a primary key.
An index on a last name field, however, should not be unique, mainly because many people can have the same last name. If last name is frequently used for searches, the database designer can add a nonunique index to that field.
SQL is designed to be a very human‐friendly language (for English speakers, anyway), and most of its syntax rules are easy to use. While there are a few variations from one version of SQL to the next, the guidelines here will apply to virtually every version of SQL.
When creating your database's tables, it is important to review the fields that will be included. Because fields in a table can contain null values by default, a design review should include determining whether fields should be tagged as NOT NULL
so that a value is required. For primary key fields, NOT NULL
is required. Additionally, if your fields might contain a large number of null elements, then you might need to redesign your database to reduce the amount of storage space needed.
Relational databases generally require large amounts of long‐term storage, which means that searching through databases is inherently slow. Efficiency of searches can be improved by assigning indexes to frequently used fields.
Data in indexed fields, including primary and foreign keys, is automatically loaded into memory when a database is opened, meaning that the database can quickly access those values to make searches more efficient. In addition, database designers can identify additional fields to act as indexes, especially fields that users are likely to access frequently.
The following exercises are provided to allow you to experiment with concepts presented in this lesson:
Reformat the following line of SQL code so that it is presented with a cleaner presentation that is easier to read.
SELECT Last, First, Email, MobileNumber FROM Contacts WHERE Age>= 21 ORDERED BY last, First;
Look at the following table:
CREATE TABLE Contact (
ID INT NOT NULL PRIMARY KEY,
Last VARCHAR(50) NOT NULL,
First VARCHAR(40),
Age INT,
Email VARCHAR(100),
MobileNumber VARCHAR(12),
HomeNumber VARCHAR(12),
WorkNumber VARCHAR(12)
);
Answer the following questions:
In this lesson, you learned about null fields. The database for contacts shown in Table 6.6 allows for null fields. This database, however, has a lot of wasted space. Redesign this database so that it uses multiple tables in a manner to be more efficient with storage space.
Table 6.6 Contacts Table
ID | Last | First | MobileNumber | HomeNumber | WorkNumber | Fax | |
---|---|---|---|---|---|---|---|
c001 | Jones | John | john@bogus.com | 317‐555‐1212 | 317‐555‐1213 | 317‐555‐1214 | null |
c002 | Buford | Bob | Null | null | null | 415‐555‐3333 | null |
c003 | Smith | Sam | Sam@bogus.com | null | 415‐555‐1212 | null | null |
c004 | Michaels | Mitch | Null | 415‐555‐2121 | null | null | null |
c005 | Andrews | Adam | Adam@bogus.com | 698‐555‐1212 | null | null | null |
c006 | Finkelstein | Fred | Null | null | 217‐555‐4340 | null | null |
c007 | Black | Brent | brent@bogus.com | null | null | null | null |