You've covered creating, reading, updating, and deleting tables and fields. You've also covered various aspects of selecting and manipulating the data within those databases and tables. There are a number of other actions and features you can tap into when working with SQL.
In this lesson a variety of topics will be touched upon that add to the foundation that you've already established. These topics are considered more advanced, but each is important in its own way.
SQL is a specialized language that executes only in the context of a relational database. You can't create a video game with it. Despite that, SQL is incredibly powerful, flexible, and expressive. To get a sense of its flexibility, consider this: any value, table, or set of values can be replaced by a second, separate query.
A subquery is a syntactically correct, complete query that is embedded in another query to produce a value or tabular result set. Removed from its parent, a subquery is still valid, though it may use values from its parent to establish context. Three main areas where subqueries are often used are inside an IN clause, in place of a table, and in place of a value.
To better understand subqueries, the TrackIt schema will be used in this lesson. If you already have this database set up in your MySQL Server instance, you are welcome to use it. If you do not have it or if you want to rebuild it for this lesson, you can run the trackit‐schema‐and‐data.sql script that can be found with the downloadable files for this book.
Values in an IN operator can come from a query. What if you wanted to find all Workers who are assigned to a Project? All the ProjectWorker.WorkerId values could be grabbed with a query embedded in an IN clause, and the resulting values would feed the IN operator. This is shown in Listing 15.1.
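A sketch of the kind of query Listing 15.1 describes follows; the Worker column names (FirstName, LastName) are assumptions about the TrackIt schema:

```sql
-- Workers assigned to at least one Project.
-- The subquery supplies the values for the IN operator.
SELECT
    WorkerId,       -- here this means Worker.WorkerId
    FirstName,      -- column names assumed from the TrackIt schema
    LastName
FROM Worker
WHERE WorkerId IN (
    SELECT WorkerId -- inside the IN, this means ProjectWorker.WorkerId
    FROM ProjectWorker
);
```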
In this query, the identifier WorkerId means two different things depending on where it's mentioned. In the main query, it refers to Worker.WorkerId. Inside the IN subquery, it refers to ProjectWorker.WorkerId. It's easy to get confused.
An alternative approach might be to JOIN Worker with ProjectWorker. That's a good idea, but it would return duplicate Workers because a Worker can be assigned to more than one Project. The IN approach, on the other hand, does not duplicate Workers. If a value occurs more than once in an IN list, the duplicate is ignored.
Any table named in a query can be replaced by a subquery. A secondary SELECT can be built on top of a subquery, or a subquery can be JOINed to a table. A subquery can even be JOINed to another subquery.
Some queries are impossible without a subquery. Consider grabbing both a Project and the first Task added to it. You could use GROUP BY ProjectId and SELECT MIN(TaskId), as shown in Listing 15.2.
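The grouped query in Listing 15.2 might look something like this sketch (Project and Task column names are assumed):

```sql
-- Find the first Task (lowest TaskId) added to each Project.
SELECT
    p.ProjectId,
    p.Name,                    -- project name column assumed
    MIN(t.TaskId) AS MinTaskId
FROM Project p
INNER JOIN Task t ON p.ProjectId = t.ProjectId
GROUP BY p.ProjectId, p.Name;
```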
The solution in this listing identifies the first Task added, but then we're stuck. There's no way to fetch the first Task's fields. The only values that can be selected are grouped Project fields and Task aggregates. There's no aggregate function that grabs a field from a specific record.
A subquery solves the problem, as shown in Listing 15.3.
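The shape of that solution might be sketched as follows, with table and column names assumed from the TrackIt schema:

```sql
-- The grouped query becomes a subquery, aliased g, and is joined
-- back to Task so the first Task's fields can be selected.
SELECT
    g.ProjectId,
    g.Name,
    t.TaskId,
    t.Title                    -- Task column name assumed
FROM (
    SELECT p.ProjectId, p.Name, MIN(t.TaskId) AS MinTaskId
    FROM Project p
    INNER JOIN Task t ON p.ProjectId = t.ProjectId
    GROUP BY p.ProjectId, p.Name
) g
INNER JOIN Task t ON g.MinTaskId = t.TaskId;
```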
The original query becomes the subquery. It's joined to Task and given the alias g. Because a subquery doesn't have a name, an alias is required. Fields from the subquery are available in ON conditions, in WHERE conditions, and for selection in the SELECT value list, and they retain their aliased names. It's a lot to look at, but if you loosen up your expectations, you start to see the table/tabular data hiding in the subquery.
Any field or calculated value can be replaced by a subquery. In effect, the subquery becomes the calculation. As an example, the query in Listing 15.4 fetches Workers and counts their assigned Projects.
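A sketch of such a query follows; the Worker column names are assumptions:

```sql
-- Count each Worker's assigned Projects with a value subquery.
SELECT
    w.WorkerId,
    w.FirstName,
    w.LastName,
    (SELECT COUNT(*)
     FROM ProjectWorker
     WHERE WorkerId = w.WorkerId) AS ProjectCount  -- the w. alias is essential
FROM Worker w;
```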
The subquery is embedded directly in the SELECT value list. Be careful with identifiers. Here, WorkerId refers to ProjectWorker.WorkerId, and w.WorkerId aliases back to Worker.WorkerId. If you missed the alias (WHERE WorkerId = WorkerId), every Worker would have a ProjectCount of 165 (all ProjectWorker records). Listing 15.5 presents another way to solve the Project/MIN Task problem.
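That alternative, using a value subquery, might be sketched like this (column names assumed):

```sql
-- The first Task per Project, fetched with a value subquery.
SELECT
    p.ProjectId,
    p.Name,
    (SELECT MIN(t.TaskId)
     FROM Task t
     WHERE t.ProjectId = p.ProjectId) AS MinTaskId  -- runs once per Project row
FROM Project p;
```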
This solution is probably a bad idea, though, because value subqueries don't perform well. If a query is defined in the SELECT value list, it runs once for every record in the result. If you have 1,000 records, your subquery runs 1,000 times. If you have one million records, it runs one million times. Generally, that's bad.
You can always achieve the same result with other techniques. You should always be a good database citizen and avoid executing a subquery for each record.
A view is a named query that's stored in a database. Once it's stored, other queries can build on it. A view can be treated like a table anywhere in a SELECT statement. You can also think of it as a named subquery.
A bit of DDL is needed to create a view. It follows the DDL pattern shown here:

CREATE objectType objectName

The view's query follows the AS keyword, as shown in Listing 15.6.
You can see in this listing that a view called ProjectNameWithMinTaskId is created. The definition of the view follows the AS keyword. The code after the AS should look familiar: it is simply a standard SELECT statement.
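The view definition in Listing 15.6 might look something like this sketch (the query body and column names are assumptions based on the earlier Project/MIN Task example):

```sql
CREATE VIEW ProjectNameWithMinTaskId AS
SELECT
    p.ProjectId,
    p.Name,
    MIN(t.TaskId) AS MinTaskId
FROM Project p
INNER JOIN Task t ON p.ProjectId = t.ProjectId
GROUP BY p.ProjectId, p.Name;
```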
With the view created, it can now be used as a data source.
SELECT * FROM ProjectNameWithMinTaskId;
You can also build more complex queries on top of it, as shown in Listing 15.7.
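For example, a query in the spirit of Listing 15.7 might join the view back to Task to pull the first Task's fields (view and Task column names assumed):

```sql
-- Treat the view like a table and join it to Task.
SELECT
    v.Name,
    t.TaskId,
    t.Title
FROM ProjectNameWithMinTaskId v
INNER JOIN Task t ON v.MinTaskId = t.TaskId
ORDER BY v.Name;
```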
Like any technology, views have advantages and disadvantages. The advantages of using views include the following:

- They can hide the complexity of a data model behind an easy-to-use abstraction.
- A query that is written over and over can be saved once and reused by name.

There are, however, also disadvantages to views. Like any abstraction, a view can hide decisions that cause performance issues.
Because database processes can involve several individual steps that rely on each other, good database design must support the concept of a transaction: an operation that includes multiple individual steps, all of which must complete successfully for the transaction itself to be successful.
A transaction is a set of operations that together form one indivisible operation. This means that a transaction must succeed or fail as a single operation.
Consider an example of a flat transaction where money is transferred between two accounts. This involves performing two different operations: taking money from one account and adding money to another account. These two operations must be completed together and either succeed or fail together.
Before starting the transfer, the system will check the available balance in the source account to ensure that the balance is higher than the amount to be transferred. Next, the system will perform two operations: deduct the transfer amount from the source account and add the same amount to the target account. Because these two operations must happen together, it is considered to be a single transaction. If the second step fails for any reason, then both operations must be reversed.
To understand this, consider the following example. Assume a customer has a balance of $400 in Account A and wants to transfer $200 to Account B. First, the amount is removed from Account A. The remaining balance in Account A is now $200. The next step is to add the $200 to Account B. Now imagine that during the second step, a computer glitch occurs, and the system cannot finish adding the $200 to Account B. In this scenario, the customer just lost $200 from Account A that did not transfer to Account B.
This is problematic and can be avoided by using transactions. In this case, the transaction includes both operations: the first one deducts the amount from Account A, and the second one adds the amount to Account B. A transaction is defined as a single indivisible operation, which means that if one of the two operations fails, then the other step will fail as well, and the transaction will abort. Aborting the transaction will cancel whatever processing happened during the transaction. In this example, the second operation fails, so the transaction will cancel the first operation that deducted the amount from Account A. We thus ensure that the transaction succeeds only if each operation in the transaction succeeds.
Let's consider a more complicated example that uses a transaction with more than two operations. When you purchase an item from an online retailer, the purchase transaction must include each of the following steps:
These steps are considered a single transaction, and for the transaction to complete successfully, each of the steps must be successful. If the merchant account declines the customer's credit card in step 4, step 2 is also rejected, and the available quantity of that item will be reset to the original value. Similarly, if there is a glitch in the database and the system cannot generate an order number, the entire sale is canceled. Not only will the available quantity of the item reset, but the payment through the merchant account will also be canceled.
In a robust system, the code may include gates that will give the user the opportunity to fix errors as they occur, rather than blindly canceling everything as soon as an error is detected. For example, if the customer's credit card information is invalid, the system could prompt the user for another form of payment.
Transactions are widely used in databases and software development, and they represent an important concept that guarantees consistency and recovery of data in case of an error. A transaction can succeed, which means that the operations within the transaction all executed successfully. A transaction can also fail, which means that one or more operations within the transaction failed.
The execution of a transaction typically works as follows:
The acronym ACID was covered in Lesson 1, “Exploring Relational Databases and SQL.” You learned that ACID is an acronym for atomicity, consistency, isolation, and durability. A transaction must satisfy these same four properties. You can refer to Lesson 1 for more details, but here is a summary of the four properties:

- Atomicity: all of the operations in the transaction complete, or none of them do.
- Consistency: the transaction takes the database from one valid state to another valid state.
- Isolation: concurrent transactions do not see one another's intermediate, uncommitted changes.
- Durability: once a transaction commits, its changes survive a subsequent system failure.
In the case where the operating system is needed to execute several transactions concurrently, the transactions must be scheduled appropriately to avoid problems. One option is to use serialization, which is the process of executing several transactions one after the other so transactions occur sequentially. The first transaction is executed, then the second one is executed, and so on. In other words, the transactions are executed one after the other without any interleaving of the instructions from different transactions. Serialization allows each transaction to execute safely, without interference from other transactions. However, this approach can be inefficient because it can leave resources in a waiting state, doing nothing while the system executes other operations in the transaction.
To avoid this inefficiency, it is common to interleave instructions from one transaction with instructions from other transactions, so that several transactions execute concurrently. However, when interleaved transactions work on the same data, their operations can conflict and break the consistency rule. This means that mechanisms must also be defined that allow operations from multiple transactions to be interleaved while still ensuring the consistency of the data.
A transaction can have several states, and the different states form a finite state machine. A finite state machine is a system that can have a finite number of states. Figure 15.1 shows the different states of a transaction.
The states shown in Figure 15.1 include the following:

- Active: the transaction is executing its operations.
- Partially committed: the final operation has executed, but the changes have not yet been made permanent.
- Committed: the transaction completed successfully, and its changes are permanent.
- Failed: normal execution can no longer proceed.
- Aborted: the transaction has been rolled back, and the database restored to its state before the transaction started.
In MySQL, a transaction can be built using the START TRANSACTION statement. Listing 15.8 shows how to perform three operations as one transaction. This mimics a typical transfer of money between checking and savings accounts.
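A sketch of the kind of transaction Listing 15.8 describes follows; the Account table, its columns, and the account ids are assumptions:

```sql
START TRANSACTION;

-- Operation 1: confirm the source (checking) account can cover the transfer.
SELECT Balance FROM Account WHERE AccountId = 1;

-- Operations 2 and 3: move the money between the two accounts.
UPDATE Account SET Balance = Balance - 200 WHERE AccountId = 1;  -- checking
UPDATE Account SET Balance = Balance + 200 WHERE AccountId = 2;  -- savings

COMMIT;  -- make all three operations permanent; ROLLBACK would undo them
```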
Normally, MySQL automatically saves new data as it is added to a table or when existing data is modified or deleted, and you do not need to formally commit those changes. In a transaction, however, you must use the COMMIT keyword to save the changes defined by the SELECT and UPDATE statements. If any of those statements fails, the COMMIT also fails, and none of the changes is saved to the database.
During the design and development of data engineering solutions, the main constraint is scalability. Companies are constantly acquiring new data that must be organized, processed, and used to improve business intelligence. Data engineers must always think about scaling when designing any data solution. In the case of MySQL databases, there are design considerations that can dramatically optimize the performance of the databases and, consequently, any data routines that use databases.
There are different strategies to optimize the schema of MySQL tables that include choosing the optimal data types as well as using proper indexing to improve the performance of read and write operations. These techniques can be used to design databases that can be used in extract‐transform‐load (ETL) processes as well as data‐intensive applications.
MySQL has an extensive list of data types that can be used to represent data. For a complete list of the data types available in MySQL, please see MySQL's page on Data Types at dev.mysql.com/doc/refman/5.7/en/data-types.html.
Choosing the appropriate data type for a column can improve the performance dramatically. In general, a few guidelines can be used to choose the appropriate data type.
- Declare columns NOT NULL where possible to avoid the extra processing MySQL performs to index null values.

When choosing a data type, start by identifying the class of data that is appropriate to represent it. The classes in MySQL include numeric, string, and date types. This choice is typically obvious: a customer's name should be represented by a string, and an age should be represented by a numeric type. Once you have identified the appropriate class, you can choose the specific data type that will be used to represent the data. There are several integer types in MySQL, and each type has a specific range, precision, and storage space needed to store the data.
For integers, there are types TINYINT, SMALLINT, MEDIUMINT, INT, and BIGINT, which require respectively 8, 16, 24, 32, and 64 bits of storage. Moreover, integer types can have an unsigned attribute, which limits the integer to be only positive. A data engineer needs to envision the data to be stored and then decide if the data will be positive or negative and use the appropriate data type.
MySQL also includes numeric types for floating numbers: FLOAT, DOUBLE, and DECIMAL.
For strings, MySQL offers the VARCHAR and CHAR types.
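A hypothetical table sketch illustrating these choices (the table and its columns are invented for illustration):

```sql
CREATE TABLE customer (
    CustomerId  INT UNSIGNED     NOT NULL AUTO_INCREMENT, -- ids are never negative
    Name        VARCHAR(100)     NOT NULL, -- variable-length string
    CountryCode CHAR(2)          NOT NULL, -- fixed-length string
    Age         TINYINT UNSIGNED NOT NULL, -- 0-255 is plenty for an age
    Balance     DECIMAL(10, 2)   NOT NULL, -- exact values for money, not FLOAT
    CreatedAt   DATETIME         NOT NULL,
    PRIMARY KEY (CustomerId)
);
```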
Choosing the appropriate data type for a column requires knowledge of the different data types offered by MySQL as well as how these data types are stored and processed by MySQL. A data engineer must choose the right data type that provides the best performance of information processing.
Indexes are an important factor in MySQL performance. Indexes, also known as keys, are data structures used within MySQL to speed up data lookups and improve the performance of read operations. For big data, it is important to index the data so that data can be retrieved quickly. For example, the user database of Facebook contains more than two billion users. Looking up their information requires indexing the table to allow data retrieval to be done in a reasonable time.
MySQL uses different types of indexes, which are appropriate for different situations. Indexes are implemented in the storage engine of MySQL. There are different storage engines available on MySQL. If no storage engine is specified, MySQL uses the default InnoDB storage engine, which supports primary and foreign keys. For more complex databases, it might be better to designate a different storage engine that will handle the data and relationships a little differently.
For this lesson, two types of indexes will be covered. The first is B‐tree indexes. The second is hash indexes.
B‐tree indexes use a B‐tree data structure to store data. This creates a hierarchical structure of data where the root nodes point to the next child nodes. The storage engine follows those pointers until it reaches the data needed. Each node in the tree has a key along with a pointer to the child page and the next leaf in the tree.
B‐tree indexes speed up data access because an entire table does not need to be scanned to find the data needed. Instead, searching starts at the root of the B‐tree and uses the right pointer to access the data needed. The node pages keep track of the range of each index, which is used to locate the needed data.
B‐tree indexes work well with data lookups by a full key value, a key range, or a key prefix. For instance, if first and last names are used as a key such as in the case in Listing 15.9, then all people with the same first and last names can be easily located. All people who have the same first name or the same last name can also be found quickly.
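Listing 15.9 might look something like this sketch; the person table's column names are assumptions:

```sql
CREATE TABLE person (
    PersonId    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    FirstName   VARCHAR(50)  NOT NULL,
    LastName    VARCHAR(50)  NOT NULL,
    DateOfBirth DATE         NOT NULL,
    PRIMARY KEY (PersonId),
    -- InnoDB builds this secondary key as a B-tree by default.
    KEY idx_name_dob (FirstName, LastName, DateOfBirth)
);
```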
In the example shown in Listing 15.9, an index is created in the final KEY clause based on the first name, last name, and date of birth. The order of the attributes in the key is important. Using this key, lookups by first name can be sped up, but lookups by date of birth cannot, because the key is built with the first name first.
A hash index uses a hash table to perform fast data lookups. However, unlike B‐tree indexes, hash indexes perform an exact lookup of the key. This means that every column of the key is used. In a hash index, the values of the columns used in the key are hashed together to produce a unique value for each row, and this hash value can then be used to look up data. If you want to search for a person with a specific first name, last name, and date of birth, you first compute the hashed value of the three columns and then search the hash table for the row that has the same hash value.
In Listing 15.9, the key could be used to look up people with specific first names because the index starts with the first name column. With a hash index, all the columns in the key would be needed in the lookup; see Listing 15.10.
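Listing 15.10 might be sketched as follows, redefining the person table on the MEMORY engine with a hash key on the first name (column names assumed):

```sql
CREATE TABLE person (
    PersonId    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    FirstName   VARCHAR(50)  NOT NULL,
    LastName    VARCHAR(50)  NOT NULL,
    DateOfBirth DATE         NOT NULL,
    PRIMARY KEY (PersonId),
    KEY idx_first_name (FirstName) USING HASH
) ENGINE = MEMORY;  -- the only engine with explicit hash index support
```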
In this example, the first name is used as an index using hash indexes. MEMORY is used as the storage engine because hash indexing is the default type of indexing in the MEMORY engine. Moreover, MEMORY is the only storage engine that supports explicit hash indexing in MySQL. However, hash indexes can be used on other storage engines implicitly using some hacks or workarounds.
Because the first name is defined as a hash index, the MEMORY storage engine will generate a hash table where each row in the person table has a hash value based on the first name. This hash value is used to look up data quickly. For instance, if you want to search for all people with the first name John, MySQL will first compute the hash value of John and then compare it to the hash table. The hash table is comprised of key‐value pairs, where the key is the hash, and the value is the row that has the same hash. By doing this, rows can be quickly found that have a known hash value.
Hash indexes are extremely fast for data lookups, but they do have some limitations. For instance, hash indexes cannot be used for sorting because the hash function doesn't preserve the order of the rows. Moreover, hash indexes can be used only for equality comparisons (=); they cannot be used with other SQL operators such as IN or LIKE. On other storage engines, hash indexes can be simulated using MySQL built‐in functions that compute a hash value for any column; an example is the CRC32 function.
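One common workaround is to store the hash in its own indexed column, a sketch of which follows (the person table and its columns are assumptions carried over from Listing 15.9):

```sql
-- Simulate a hash index on an engine like InnoDB: store a CRC32 hash
-- of the name columns and index the small integer instead.
ALTER TABLE person
    ADD COLUMN name_hash INT UNSIGNED,
    ADD KEY idx_name_hash (name_hash);

UPDATE person SET name_hash = CRC32(CONCAT(FirstName, LastName));

-- Compare the hash first, then the real columns to rule out collisions.
SELECT *
FROM person
WHERE name_hash = CRC32(CONCAT('John', 'Smith'))
  AND FirstName = 'John' AND LastName = 'Smith';
```

Note that the hash column must be kept in sync (for example, by triggers or application code) whenever the underlying name columns change.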
For a more in‐depth look at indexing in MySQL and a comparison of the options, see Krzysztof Ksiazek's article “An Overview of MySQL Database Indexing” found at severalnines.com/blog/overview-mysql-database-indexing.
In this lesson, a number of advanced topics were covered. First was coverage of subqueries, which are complete queries embedded inside another query. Subqueries can stand alone with few modifications. A subquery can provide values for an IN operator, behave like a table in a join, or evaluate to a single value in a SELECT list. Some query results are not possible without a subquery.
Views are named queries stored in the database. They operate like tables. Views can hide the complexity of a data model and provide an easy‐to‐use abstraction. Like any abstraction, they can also hide decisions that cause performance issues.
Transactions are sets of operations that together form one indivisible operation. This set of actions must succeed or fail as a single operation. Transactions are often described using the acronym ACID, which stands for atomicity, consistency, isolation, and durability. Transactions need to satisfy these four properties to be valid.
Two ways to help optimize your MySQL databases were presented. The first was to use the appropriate data type for columns; a data engineer must choose the data type that provides the best information-processing performance. The second was to optimize queries into your data through the use of indexing. You learned about two types of indexing in this lesson: B‐tree indexing, which uses a B‐tree data structure to store data, and hash indexing, which uses a hash table.
In addition to optimizing data, as your systems get larger, you might need to apply replication and/or scaling. Within the lesson you learned about two types of replication, statement‐based replication and row‐based replication, before seeing a three‐step process for replicating data from the primary to the secondary server on MySQL.
The lesson concluded by covering the concept of high availability of data. As mentioned, high availability is a characteristic of a system that can continue to operate even in the event of an error or failure. High availability depends on the application, and the constraints and thresholds for availability vary from one scenario to another.
The following exercises are provided to allow you to experiment with concepts presented in this lesson:
Use the TrackIt schema from Lesson 11, “Adding JOIN Queries” to complete the following exercises using subqueries.
Retrieve a list of all project names from the TrackIt database that includes the most recent task assigned to each project. The results should display the TaskId and the Task Title.
For your solution, no project should appear in the results more than once. Your query results should include 26 rows, including the following sample output:
ProjectName | MaxTaskID | MaxTaskTitle |
---|---|---|
GameIt Accounts Payable | 132 | Construct user interface |
GameIt Accounts Receivable | 107 | Construct front‐end components |
GameIt Enterprise | 182 | Profile UI |
Generate a list of tasks whose due date is on or before the due date for the project named Grumps. By using a subquery, you do not need to know the project's due date to generate the results; write the query without including a specific date. The query results will include 513 rows, including the following sample output:
Title | DueDate |
---|---|
Log in | 2007‐02‐19 |
Refactor data store | 2015‐04‐04 |
Refactor service layer and classes | 2018‐09‐03 |
A view is a saved query that can be selected by other queries in the same way that a query can select a table. Note that the results of a view are dynamic, just like the results of a query, so a view will always retrieve the most recent version of the data. However, if you find yourself writing the same query over and over, you can save the query as a view and then run just the view. Views are especially useful for queries that include multiple joins across tables.
Create a view that displays a list of all project names and due dates, the title of each task associated with each project, and the first name and last name of each worker assigned to the tasks. Assign the view any name that makes sense to you.
The results will include 543 records, including the following sample output:
Name |
---|
GameIt Accounts Receivable |
GameIt Accounts Receivable |
GameIt Accounts Receivable |
Use the view created in the previous exercise to generate a list of all tasks assigned to worker Ealasaid Blinco. The results will include 15 records, including the following sample output:
Name |
---|
GameIt Accounts Payable |
GameIt Accounts Payable |
GameIt Accounts Payable |
Look at other databases used in other parts of this book and identify places where a view might be useful as a shortcut to creating the same complicated query over and over again.