Lesson 15
Diving into Advanced SQL Topics

You've covered creating, reading, updating, and deleting tables and fields. You've also covered various aspects of selecting and manipulating the data within those databases and tables. There are a number of other actions and features you can tap into when working with SQL.

In this lesson, a variety of topics will be touched upon, each adding to the foundation you've already established. These topics are considered more advanced, but each is important in its own way.

ADDING SUBQUERIES

SQL is a specialized language that executes only in the context of a relational database. You can't create a video game with it. Despite that, SQL is incredibly powerful, flexible, and expressive. To get a sense of its flexibility, consider this: any value, table, or set of values can be replaced by a second, separate query.

A subquery is a syntactically correct, complete query that is embedded in another query to produce a value or tabular result set. Removed from its parent, a subquery is still valid, though it may use values from its parent to establish context. Subqueries are most often used in three places: inside the IN operator, anywhere a table can be used, and anywhere a value can be used.

To better understand subqueries, the TrackIt schema will be used in this lesson. If you already have this database set up in your MySQL Server instance, you are welcome to use it. If you do not have it or if you want to rebuild it for this lesson, you can run the trackit‐schema‐and‐data.sql script that can be found with the downloadable files for this book.

Subqueries in the IN Operator

Values in an IN operator can come from a query. What if you wanted to find all Workers who are assigned to a Project? You could grab every ProjectWorker.WorkerId with a query embedded in the IN clause and use the resulting values as the IN list. This is shown in Listing 15.1.
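
Listing 15.1 isn't reproduced here, but a minimal sketch of that kind of query, assuming the TrackIt Worker table has FirstName and LastName columns, might look like this:

 -- Workers assigned to at least one Project (sketch of the Listing 15.1 approach)
 SELECT
     WorkerId,
     FirstName,    -- column names assumed from the TrackIt schema
     LastName
 FROM Worker
 WHERE WorkerId IN (
     SELECT WorkerId
     FROM ProjectWorker
 );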

In this query, the identifier WorkerId means two different things depending on where it's mentioned. In the main query, it refers to Worker.WorkerId. Inside the IN, it refers to ProjectWorker.WorkerId. It's easy to get confused.

An alternative approach might be to JOIN Worker to ProjectWorker. That's a good idea, but it would return duplicate Workers because a Worker can be assigned to more than one Project. The IN approach, on the other hand, does not duplicate Workers. If a value occurs more than once in an IN list, the duplicates are ignored.
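
For comparison, a sketch of that JOIN alternative (column names again assumed) follows. Without the DISTINCT keyword it would repeat each Worker once per assigned Project:

 -- JOIN alternative (sketch): DISTINCT is needed to avoid duplicate Workers
 SELECT DISTINCT
     w.WorkerId,
     w.FirstName,
     w.LastName
 FROM Worker w
 INNER JOIN ProjectWorker pw ON pw.WorkerId = w.WorkerId;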

Subqueries for Tables

Any table named in a query can be replaced by a subquery. A secondary SELECT can be built on top of a subquery, or a subquery can be JOINed to a table. A subquery can even be JOINed to another subquery.

Some queries are impossible without a subquery. Consider grabbing both a Project and the first Task added to it. You could use GROUP BY ProjectId and SELECT MIN(TaskId), as shown in Listing 15.2.
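
Listing 15.2 isn't reproduced here, but a sketch of that grouped query, with the Project.Name and Task.ProjectId column names assumed, might look like the following:

 -- First TaskId per Project (sketch of the Listing 15.2 approach)
 SELECT
     p.ProjectId,
     p.Name,                   -- column name assumed
     MIN(t.TaskId) AS MinTaskId
 FROM Project p
 INNER JOIN Task t ON t.ProjectId = p.ProjectId
 GROUP BY p.ProjectId, p.Name;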

The solution in this listing identifies the first Task added, but then we're stuck. There's no way to fetch the first Task's fields. The only values that can be selected are grouped Project fields and Task aggregates. There's no aggregate function that grabs a field from a specific record.

A subquery solves the problem, as shown in Listing 15.3.

The original query becomes the subquery. It's joined to Task and given the alias g. Because a subquery doesn't have a name, an alias is required. Fields from the subquery are available for ON conditions, for WHERE conditions, and for the SELECT value list, and they retain their aliased names. It's a lot to look at, but if you loosen up your expectations, you start to see the tabular data hiding in the subquery.
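
Pulling those pieces together, a sketch of the Listing 15.3 approach might look like this (the Task.Title column name is assumed):

 -- Join the grouped subquery (aliased g) back to Task to fetch the full Task row
 SELECT
     g.ProjectId,
     g.Name,
     mt.TaskId,
     mt.Title                  -- column name assumed
 FROM (
     SELECT
         p.ProjectId,
         p.Name,
         MIN(t.TaskId) AS MinTaskId
     FROM Project p
     INNER JOIN Task t ON t.ProjectId = p.ProjectId
     GROUP BY p.ProjectId, p.Name
 ) g
 INNER JOIN Task mt ON mt.TaskId = g.MinTaskId;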

Subqueries for Values

Any field or calculated value can be replaced by a subquery. In effect, the subquery becomes the calculation. As an example, the query in Listing 15.4 fetches Workers and counts their assigned Projects.
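
The exact query in Listing 15.4 isn't reproduced here, but a sketch of that style, with the count subquery embedded in the SELECT value list, might look like this:

 -- Count of assigned Projects per Worker via a value subquery (sketch)
 SELECT
     w.WorkerId,
     w.FirstName,               -- column names assumed
     w.LastName,
     (SELECT COUNT(*)
      FROM ProjectWorker
      WHERE WorkerId = w.WorkerId) AS ProjectCount
 FROM Worker w;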

The subquery is embedded directly in the SELECT value list. Be careful with identifiers. Here, the unqualified WorkerId refers to ProjectWorker.WorkerId, while w.WorkerId aliases back to Worker.WorkerId. If you omitted the alias (WHERE WorkerId = WorkerId), every Worker would have a ProjectCount of 165 (all ProjectWorker records). Listing 15.5 presents another way to solve the Project/MIN Task problem.

This solution is probably a bad idea, though, because value subqueries don't perform well. If a query is defined in the SELECT list, it is run once for every record in the result. If you have 1,000 records, your subquery runs 1,000 times. If you have one million records, your subquery runs one million times. Generally, that's bad.

You can always achieve the same result with other techniques. You should always be a good database citizen and avoid executing a subquery for each record.

WORKING WITH VIEWS

A view is a named query that's stored in a database. Once it's stored, other queries can build on it. A view can be treated like a table anywhere in a SELECT statement. You can also think of it as a named subquery.

A bit of DDL is needed to create a view. It follows the DDL pattern shown here:

 CREATE objectType objectName 
 

The view's query follows the AS keyword, as shown in Listing 15.6.

You can see in this listing that a view called ProjectNameWithMinTaskId is created. The definition of the view follows the AS keyword. The code after the AS should look familiar: it is simply a standard SELECT statement.
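
A sketch of that view definition, reusing the earlier grouped subquery (column names assumed), might look like the following:

 -- Sketch of the Listing 15.6 view: each Project's name with its first TaskId
 CREATE VIEW ProjectNameWithMinTaskId AS
 SELECT
     p.ProjectId,
     p.Name,
     MIN(t.TaskId) AS MinTaskId
 FROM Project p
 INNER JOIN Task t ON t.ProjectId = p.ProjectId
 GROUP BY p.ProjectId, p.Name;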

With the view created, it can now be used as a data source.

 SELECT * FROM ProjectNameWithMinTaskId;
 

You can also build more complex queries on top of it, as shown in Listing 15.7.
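
For instance, a sketch of the kind of query Listing 15.7 demonstrates, joining the view back to Task to pull in the first Task's fields (Title and DueDate are assumed column names):

 -- Build on the view just like a table (sketch)
 SELECT
     v.Name,
     t.TaskId,
     t.Title,
     t.DueDate
 FROM ProjectNameWithMinTaskId v
 INNER JOIN Task t ON t.TaskId = v.MinTaskId
 ORDER BY v.Name;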

Like any technology, views have advantages and disadvantages. The advantages of using views include the following:

  • Encapsulating complex joins can reduce code complexity and increase code reuse.
  • Views can be secured separately from a table, for example, to grant user access to a view without granting access to the tables underneath.
  • Views can limit columns and rows shown to some users.

There are, however, also disadvantages to views.

  • Within MySQL, views are not a permanent structure. While the view's definition is always available, the result set it produces is not stored. In other words, the table created by the view is generated each time the view is accessed.
  • Just because a view appears simple doesn't mean the underlying data model is. A view with simple results may be very expensive to run.
  • Developers can be tempted to build more and more on top of views because views are easy to understand. As views are joined to views that are joined to other views, performance issues may arise.
  • If you are using a database other than MySQL, you'll want to confirm whether the tables generated by views remain or are re‐created each time as well.

UNDERSTANDING TRANSACTIONS

Because database processes can involve several individual steps that rely on each other, good database design must support the concept of a transaction: an operation that includes multiple individual steps, all of which must complete successfully for the transaction itself to be successful.

Transaction Example

A transaction is a set of operations that together form one indivisible operation. This means that a transaction must succeed or fail as a single operation.

Consider an example of a flat transaction where money is transferred between two accounts. This involves performing two different operations: taking money from one account and adding money to another account. These two operations must be completed together and either succeed or fail together.

Before starting the transfer, the system will check the available balance in the source account to ensure that the balance is higher than the amount to be transferred. Next, the system will perform two operations: deduct the transfer amount from the source account and add the same amount to the target account. Because these two operations must happen together, it is considered to be a single transaction. If the second step fails for any reason, then both operations must be reversed.

To understand this, consider the following example. Assume a customer has a balance of $400 in Account A and wants to transfer $200 to Account B. First, the amount is removed from Account A. The remaining balance in Account A is now $200. The next step is to add the $200 to Account B. Now imagine that during the second step, a computer glitch occurs, and the system cannot finish adding the $200 to Account B. In this scenario, the customer just lost $200 from Account A that did not transfer to Account B.

This is problematic and can be avoided by using transactions. In this case, the transaction includes both operations: the first one deducts the amount from Account A, and the second one adds the amount to Account B. A transaction is defined as a single indivisible operation, which means that if one of the two operations fails, then the other step will fail as well, and the transaction will abort. Aborting the transaction will cancel whatever processing happened during the transaction. In this example, the second operation fails, so the transaction will cancel the first operation that deducted the amount from Account A. We thus ensure that the transaction succeeds only if each operation in the transaction succeeds.

Let's consider a more complicated example that uses a transaction with more than two operations. When you purchase an item from an online retailer, the purchase transaction must include each of the following steps:

  1. Check the available quantity of the item to ensure that there are enough to fulfill the purchase request.
  2. Deduct 1 from the available quantity.
  3. Check that the customer has valid credit card information in the database.
  4. Receive authorization from the credit card merchant account to transfer the purchase amount to the retailer.
  5. Generate an order number in the database to keep track of the purchase.

These steps are considered a single transaction, and for the transaction to complete successfully, each of the steps must be successful. If the merchant account declines the customer's credit card in step 4, step 2 is also rejected, and the available quantity of that item will be reset to the original value. Similarly, if there is a glitch in the database and the system cannot generate an order number, the entire sale is canceled. Not only will the available quantity of the item reset, but the payment through the merchant account will also be canceled.

In a robust system, the code may include gates that will give the user the opportunity to fix errors as they occur, rather than blindly canceling everything as soon as an error is detected. For example, if the customer's credit card information is invalid, the system could prompt the user for another form of payment.

Transactions are widely used in databases and software development, and they represent an important concept that guarantees consistency and recovery of data in case of an error. A transaction can succeed, which means that the operations within the transaction all executed successfully. A transaction can also fail, which means that one or more operations within the transaction failed.

The execution of a transaction typically works as follows:

  1. Initialize the transaction.
  2. Perform operations.
  3. Do either a commit or an abort.
    • Commit: The operations are successfully executed, and the data is submitted.
    • Abort: If one of the operations in the transaction fails, then all the operations are rolled back, and the data goes back to its original state.

ACID

The acronym ACID was covered in Lesson 1, “Exploring Relational Databases and SQL.” You learned that ACID is an acronym for atomicity, consistency, isolation, and durability. A transaction must satisfy these same four properties. You can refer to Lesson 1 for more details, but here is a summary of the four properties:

  • Atomicity: Either all operations in the transaction succeed or all of them fail. When the transaction fails, the state of the data that the operation was being applied on will remain unaffected by the transaction.
  • Consistency: A transaction should not have any adverse effect on the data. If the data is consistent prior to the transaction, then it should remain consistent after the transaction.
  • Isolation: Transactions occur independently of each other and cannot interfere with each other. This means that two transactions cannot operate on the same data at the same time.
  • Durability: The changes made by the individual operations are permanent if the transaction is successful.

In the case where the operating system needs to execute several transactions concurrently, the transactions must be scheduled appropriately to avoid problems. One option is to use serialization, which is the process of executing several transactions one after the other so that they occur sequentially. The first transaction is executed, then the second one, and so on. In other words, the transactions are executed one after the other without any interleaving of the instructions from different transactions. Serialization allows each transaction to execute safely, without interference from other transactions. However, this approach can be inefficient because it can leave resources idle, waiting while the system executes the operations of the current transaction.

To avoid this inefficiency, it is common to interleave instructions from one transaction with instructions from other transactions, so that several transactions execute concurrently. However, when we interleave the execution of different transactions, two operations from different transactions could work on the same data at the same time, which is problematic if it breaks the consistency rule. This means that mechanisms must also be defined that allow operations from multiple transactions to be interleaved while still ensuring the consistency of the data.

A transaction can have several states, and the different states form a finite state machine. A finite state machine is a system that can have a finite number of states. Figure 15.1 shows the different states of a transaction.


Figure 15.1 Flowchart showing possible transaction states

The states shown in Figure 15.1 include the following:

  • Active: A transaction is active during its execution. This is the initial state of any transaction.
  • Partially committed: Once a transaction executes its final operation, it is considered partially committed.
  • Committed: Once a transaction has successfully executed all its operations, if there are no conflicts with other active transactions, the operations are permanently committed.
  • Failed: A transaction fails when one of its operations cannot complete successfully.
  • Aborted: If the transaction fails, then the transaction manager rolls back the data to its original state, and the operations are aborted.

In MySQL, a transaction can be built using the START TRANSACTION statement. Listing 15.8 shows how to perform three operations as one transaction. This mimics a typical transfer of money between checking and savings accounts.

In other situations, MySQL automatically saves new data as it is added to a table or when existing data is modified or deleted, and you do not need to formally commit those changes. Within a transaction, however, you must use the COMMIT keyword to save the changes made by the UPDATE statements. If any of the statements fails, the transaction can be rolled back so that none of the changes is saved to the database.
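
The statements in Listing 15.8 aren't reproduced here, but a minimal sketch of that kind of transfer, assuming a hypothetical Account table with AccountId and Balance columns, follows this pattern:

 -- Transfer $200 from checking (account 1) to savings (account 2) as one transaction
 START TRANSACTION;

 -- 1. Confirm the source account holds enough money
 SELECT Balance
 FROM Account
 WHERE AccountId = 1;

 -- 2. Deduct the transfer amount from checking
 UPDATE Account
 SET Balance = Balance - 200.00
 WHERE AccountId = 1;

 -- 3. Add the same amount to savings
 UPDATE Account
 SET Balance = Balance + 200.00
 WHERE AccountId = 2;

 COMMIT;   -- or ROLLBACK; to undo all three steps if anything goes wrong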

SCHEMA OPTIMIZATION

During the design and development of data engineering solutions, the main constraint is scalability. Companies are constantly acquiring new data that must be organized, processed, and used to improve business intelligence. Data engineers must always think about scaling when designing any data solution. In the case of MySQL databases, there are design considerations that can dramatically optimize the performance of the databases and, consequently, any data routines that use databases.

There are different strategies for optimizing the schema of MySQL tables, including choosing optimal data types and using proper indexing to improve the performance of read and write operations. These techniques can be used to design databases for extract‐transform‐load (ETL) processes as well as data‐intensive applications.

Choosing Optimal Data Types

MySQL has an extensive list of data types that can be used to represent data. For a complete list of the data types available in MySQL, please see MySQL's page on Data Types at dev.mysql.com/doc/refman/5.7/en/data-types.html.

Choosing the appropriate data type for a column can improve the performance dramatically. In general, a few guidelines can be used to choose the appropriate data type.

  • Use the smallest reasonable data type: If the age of a customer is being stored, there is no need to use a 32‐bit integer. Using bigger data types requires more space on the disk, memory, and the CPU cache. Moreover, smaller data types require fewer CPU cycles to be processed, which makes them faster. When choosing a data type, always go with the smallest data type that can hold your information. However, make sure that you account for future issues with the range of the data type chosen. Underestimating the range could result in a time‐consuming process to alter the table to change the range of the data type.
  • Keep it simple: When choosing a data type, always go with the simplest possible. For instance, it is better to store an IP address in an integer field than a varchar field because comparing numbers requires fewer CPU cycles than comparing characters. Dates and times should also be stored in the MySQL built‐in date type because it is optimized to compare dates.
  • Consider nulls: When designing tables, a NOT NULL column should always be favored over a nullable column. Queries with nullable columns are harder for MySQL to optimize. Moreover, a nullable column requires more storage space and extra processing by MySQL. Instead of using null values, other values can be used to represent a missing value. For instance, an impossible or meaningless value like 0 could be used for cases where the age of a customer is not defined, because this is easier to process and optimize than using a null value. Columns that are used for indexing should always be NOT NULL to avoid the extra processing MySQL performs to index null values.

When choosing a data type, start by identifying the class of data that is appropriate to represent it. The classes in MySQL are numeric, string, date, and so on. This choice is typically obvious: a customer's name should be represented by a string, and an age should be represented by a numeric type. Once you have identified the appropriate class, you can choose the specific data type that will be used to represent the data. For example, there are several integer types in MySQL, and each type has a specific range, precision, and storage requirement.

For integers, there are the types TINYINT, SMALLINT, MEDIUMINT, INT, and BIGINT, which require 8, 16, 24, 32, and 64 bits of storage, respectively. Integer types can also have an UNSIGNED attribute, which restricts the column to nonnegative values. A data engineer needs to envision the data to be stored, decide whether the values can be negative, and use the appropriate data type.

MySQL also includes numeric types for floating numbers: FLOAT, DOUBLE, and DECIMAL.

  • DECIMAL stores exact fractional numbers. (It is ideal for financial applications or when exact math is needed.)
  • Floating‐point types use less space than DECIMAL to store the same range of values.

For strings, MySQL offers the VARCHAR and CHAR types.

  • VARCHAR is used for variable‐length strings: VARCHAR uses only as much storage as needed for specific values, which in many situations takes less storage than the CHAR type, especially for short strings and empty strings. The VARCHAR type helps performance by reducing disk space waste. However, because of the dynamic size of the VARCHAR type, increasing the length of a string stored in VARCHAR type can cause the data to be moved to another location on the disk so it can fit the new size. This causes some performance issues.
  • CHAR is used for fixed‐length strings: CHAR is ideal for storing short strings. For instance, CHAR can be used to store MD5 hash values of user passwords because every MD5 digest has the same length (128 bits, or 32 hexadecimal characters). The CHAR type is also faster for updates because the size of the string is always known in advance, so data does not need to be moved around to fit a new value.

Choosing the appropriate data type for a column requires knowledge of the different data types offered by MySQL as well as how these data types are stored and processed by MySQL. A data engineer must choose the right data type that provides the best performance of information processing.
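
As a concrete illustration of these guidelines, here is a hypothetical customer table that favors small, simple, NOT NULL types; the table and column names are only examples, not part of any schema used in this book:

 -- Hypothetical table illustrating small, simple, NOT NULL data types
 CREATE TABLE customer (
     customer_id   INT UNSIGNED     NOT NULL AUTO_INCREMENT,
     full_name     VARCHAR(100)     NOT NULL,             -- variable-length string
     age           TINYINT UNSIGNED NOT NULL DEFAULT 0,   -- 0 = age not provided
     ip_address    INT UNSIGNED     NOT NULL,             -- IPv4 address stored as an integer
     created_at    DATETIME         NOT NULL,             -- built-in date/time type
     password_md5  CHAR(32)         NOT NULL,             -- fixed-length hex MD5 digest
     PRIMARY KEY (customer_id)
 );

MySQL's built‐in INET_ATON() and INET_NTOA() functions convert between the dotted string form and the integer form of an IPv4 address, which is how the integer ip_address column above would be populated and read.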

Indexing

Indexes are an important factor in MySQL performance. Indexes, also known as keys, are data structures used within MySQL to speed up data lookups and improve the performance of read operations. For big data, it is important to index the data so that data can be retrieved quickly. For example, the user database of Facebook contains more than two billion users. Looking up their information requires indexing the table to allow data retrieval to be done in a reasonable time.

MySQL uses different types of indexes, which are appropriate for different situations. Indexes are implemented in the storage engine of MySQL. There are different storage engines available on MySQL. If no storage engine is specified, MySQL uses the default InnoDB storage engine, which supports primary and foreign keys. For more complex databases, it might be better to designate a different storage engine that will handle the data and relationships a little differently.

For this lesson, two types of indexes will be covered: B‐tree indexes and hash indexes.

B‐Tree Indexes

B‐tree indexes use a B‐tree data structure to store the indexed data. This creates a hierarchical structure in which the root node points to child nodes, and the storage engine follows those pointers until it reaches the data it needs. Each node page holds keys along with pointers to its child pages, and each leaf page also points to the next leaf in the tree.

B‐tree indexes speed up data access because an entire table does not need to be scanned to find the data needed. Instead, searching starts at the root of the B‐tree and follows the appropriate pointers down to the data needed. The node pages keep track of the range of key values covered by each child, which is used to locate the needed data.

B‐tree indexes work well with data lookups by a full key value, a key range, or a key prefix. For instance, if first and last names are used as a key, as in Listing 15.9, then all people with the same first and last names can be easily located. All people who share the same first name can also be found quickly, because the first name is a prefix of the key.

In the example shown in Listing 15.9, an index is created in the final KEY clause based on the first name, last name, and date of birth. The order of the attributes in the key is important. Using this key, lookups by first name can be sped up, but not lookups by date of birth, because the key is built with the first name first.
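
A sketch of the kind of table Listing 15.9 defines, with column names and lengths assumed, might look like this:

 -- B-tree index on (first name, last name, date of birth); column order matters
 CREATE TABLE person (
     person_id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
     first_name    VARCHAR(50)  NOT NULL,
     last_name     VARCHAR(50)  NOT NULL,
     date_of_birth DATE         NOT NULL,
     PRIMARY KEY (person_id),
     KEY idx_name_dob (first_name, last_name, date_of_birth)
 ) ENGINE = InnoDB;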

Hash Indexes

A hash index uses a hash table to perform fast data lookups. However, unlike B‐tree indexes, hash indexes perform an exact lookup of the key, which means that every column of the key is used. In a hash index, the values of the columns that make up the key are hashed together to produce a single hash value for each row, and that hash value is then used to look up the data. If you want to search for a person with a specific first name, last name, and date of birth, you first compute the hash of the three values and then look it up in the hash table to find the matching row.

In Listing 15.9, the key could be used to look up people with a specific first name because the index starts with the first name column. With a hash index built on all three columns, all three values would be needed for the lookup; see Listing 15.10.

In this example, the first name is used as an index using hash indexes. MEMORY is used as the storage engine because hash indexing is the default type of indexing in the MEMORY engine. Moreover, MEMORY is the only storage engine that supports explicit hash indexing in MySQL. However, hash indexes can be used on other storage engines implicitly using some hacks or workarounds.
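
A sketch of the Listing 15.10 idea, again with assumed table and column names, might look like the following:

 -- Hash index on the first name; MEMORY is the only engine with explicit hash indexes
 CREATE TABLE person_lookup (
     person_id   INT UNSIGNED NOT NULL,
     first_name  VARCHAR(50)  NOT NULL,
     KEY idx_first_name (first_name) USING HASH
 ) ENGINE = MEMORY;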

Because the first name is defined as a hash index, the MEMORY storage engine will generate a hash table where each row in the person table has a hash value based on the first name. This hash value is used to look up data quickly. For instance, if you want to search for all people with the first name John, MySQL will first compute the hash value of John and then compare it to the hash table. The hash table is comprised of key‐value pairs, where the key is the hash, and the value is the row that has the same hash. By doing this, rows can be quickly found that have a known hash value.

Hash indexes are extremely fast for data lookups, but they do have some limitations. For instance, hash indexes cannot be used for sorting because the hash function doesn't preserve the order of the rows. Moreover, hash indexes support only equality comparisons (=); they cannot be used for range comparisons or pattern‐matching operators such as LIKE. On other storage engines, hash‐style lookups can be emulated using MySQL built‐in functions that compute a hash value for any column; an example is the CRC32 function.

For a more in‐depth look at indexing in MySQL and a comparison of the options, see Krzysztof Ksiazek's article “An Overview of MySQL Database Indexing” found at severalnines.com/blog/overview-mysql-database-indexing.

SUMMARY

In this lesson, a number of advanced topics were covered. First was coverage of subqueries, which are complete queries embedded inside another query. Subqueries can stand alone with few modifications. A subquery can provide values for an IN operator, behave like a table in a join, or evaluate to a single value in a SELECT list. Some query results are not possible without a subquery.

Views are named queries stored in the database. They operate like tables. Views can hide the complexity of a data model and provide an easy‐to‐use abstraction. Like any abstraction, they can also hide decisions that cause performance issues.

Transactions are sets of operations that together form one indivisible operation. This set of actions must succeed or fail as a single operation. Transactions are often described using the acronym ACID, which stands for atomicity, consistency, isolation, and durability. Transactions need to satisfy these four properties to be valid.

Two ways to help optimize your MySQL databases were presented. The first was to use the appropriate data type for each column; a data engineer must choose the data type that provides the best performance for processing the information. The second was to optimize access to your data through indexing. You learned about two types of indexing in this lesson: B‐tree indexes, which use a B‐tree data structure to store the index data, and hash indexes, which use a hash table.

In addition to optimizing data, as your systems get larger, you might need to apply replication and/or scaling. Within the lesson you learned about two types of replication, statement‐based replication and row‐based replication, before seeing a three‐step process for replicating data from the primary server to the secondary server in MySQL.

The lesson concluded by covering the concept of high availability of data. As mentioned, high availability is a characteristic of a system that can continue to operate even in the event of an error or failure. High availability depends on the application, and the constraints and thresholds for availability vary from one scenario to another.

EXERCISES

The following exercises are provided to allow you to experiment with concepts presented in this lesson:

Use the TrackIt schema from Lesson 11, “Adding JOIN Queries” to complete the following exercises using subqueries.

Exercise 15.1: Recent Tasks

Retrieve a list of all project names from the TrackIt database that includes the most recent task assigned to each project. The results should display the TaskId and the Task Title.

For your solution, no project should appear in the results more than once. Your query results should include 26 rows, including the following sample output:

ProjectName                   MaxTaskID   MaxTaskTitle
GameIt Accounts Payable       132         Construct user interface
GameIt Accounts Receivable    107         Construct front‐end components
GameIt Enterprise             182         Profile UI

Exercise 15.2: Before Grumps

Generate a list of tasks whose due date is on or before the due date for the project named Grumps. By using a subquery, you do not need to know the project's due date to generate the results. Write the query without including a specific date. The query results will include 513 rows, including the following sample output:

Title                                 DueDate
Log in                                2007‐02‐19
Refactor data store                   2015‐04‐04
Refactor service layer and classes    2018‐09‐03

Exercise 15.3: Project Due Dates

A view is a saved query that can be selected by other queries in the same way that a query can select a table. Note that the results of a view are dynamic, just like the results of a query, so a view will always retrieve the most recent version of the data. However, if you find yourself writing the same query over and over, you can save the query as a view and then run just the view. Views are especially useful for queries that include multiple joins across tables.

Create a view that displays a list of all project names and due dates, the title of each task associated with each project, and the first name and last name of each worker assigned to the tasks. Assign the view any name that makes sense to you.

The results will include 543 records, including the following sample output:

Name
GameIt Accounts Receivable
GameIt Accounts Receivable
GameIt Accounts Receivable

Exercise 15.4: The Work of Ealasaid Blinco

Use the view created in the previous exercise to generate a list of all tasks assigned to worker Ealasaid Blinco. The results will include 15 records, including the following sample output:

Name
GameIt Accounts Payable
GameIt Accounts Payable
GameIt Accounts Payable

Exercise 15.5: Other Databases

Look at other databases used in other parts of this book and identify places where a view might be useful as a shortcut to creating the same complicated query over and over again.