Chapter 5. Relational Databases

In Chapter 2, Data Preprocessing, we looked at some standard ways that data is stored. We saw that small unstructured datasets are often stored as text files, using white space, tabs, or commas to separate the data fields. Small, structured datasets are better handled by formats such as XML and JSON.

A database is a large, usually structured data collection that is accessed by an independent software system.

In this chapter, we will look at relational databases and the relational database systems that manage them. In Chapter 10, Working with NoSQL Databases, we will examine non-relational databases.

The relation data model

A relational database (RDB) is a database that stores its data in tables that are related by certain structural constraints. The word relational comes from the mathematical concept of a relation, which is essentially the same thing as a table. The precise definition follows.

A domain is a set of data values of the same data type, usually integer, decimal number, or text, but could be Boolean (True/False), nominal, or URL, among others. If D₁, D₂,…, D_n are domains, then their Cartesian product is the set D₁, D₂,…, D_n of all n-component sequences t = (d₁, d₂,…, d_n ), where each di ∈ Di. Such sequences are called tuples (as in octuples when n = 8). A tuple is like a vector, except that the components of a tuple may be of different types; the components of a vector are usually just numbers. A relation is a subset of a Cartesian product of domains; that is, a set of tuples all from the same domain sequence (D₁, D₂,…, D_n), which is called the schema for the relation. We think of a relation as an ordinary table, where the rows are the tuples and the columns correspond to the domains.

The following table shows an example of a relation whose domain schema could be integer, text, text, date, text, email. It has five tuples.

ID	Last Name	First Name	Date of Birth	Job Title	Email
49103	Adams	Jane	1975-09-02	CEO	jadams@xyz.com
15584	Baker	John	1991-03-17	Data Analyst	jbaker@xyz.com
34953	Cohen	Adam	1978-11-24	HR Director	acohen@xyz.com
23098	Davis	Rose	1983-05-12	IT Manager	rdavis@xyz.com
83822	Evans	Sara	1992-10-10	Data Analyst	sevans@xyz.com

Table 5-1. A database relation for employees

Each column has a unique name (for example, Date of Birth), which is shown at the top of the column when it is displayed. In addition, the table itself must have a name; this one is Employees.

Since tables and relations are practically the same thing, we will usually refer to them as tables, with the correspondingly simplified terms rows for its tuples, columns for its domains, attributes for its column names, and fields for its tuples' component values. A relational table is called an instance of its schema.

Note that a relation is a set of tuples, and as such, it is unordered. In other words, the order of the rows in a database table is irrelevant. The order of the columns is also irrelevant, since each column has a unique column name (for example, Date of Birth) and a type.

In addition to its domain schema, the definition of a table also usually includes a key designation. A primary key for a relation is a set of attributes whose values must be unique among the rows in the table. For Table 5-1, the obvious choice for the key attribute would be the ID field.