Chapter 5. Relational Databases

In Chapter 2, Data Preprocessing, we looked at some standard ways that data is stored. We saw that small unstructured datasets are often stored as text files, using white space, tabs, or commas to separate the data fields. Small, structured datasets are better handled by formats such as XML and JSON.

A database is a large, usually structured data collection that is accessed by an independent software system.

In this chapter, we will look at relational databases and the relational database systems that manage them. In Chapter 10, Working with NoSQL Databases, we will examine non-relational databases.

A relational database (RDB) is a database that stores its data in tables that are related by certain structural constraints. The word relational comes from the mathematical concept of a relation, which is essentially the same thing as a table. The precise definition follows.

A domain is a set of data values of the same data type, usually integer, decimal number, or text, but could be Boolean (True/False), nominal, or URL, among others. If D1, D2 ,…, Dn are domains, then their Cartesian product is the set D1, D2,…, Dn of all n-component sequences t = (d1, d2 ,…, dn ), where each di Di. Such sequences are called tuples (as in octuples when n = 8). A tuple is like a vector, except that the components of a tuple may be of different types; the components of a vector are usually just numbers. A relation is a subset of a Cartesian product of domains; that is, a set of tuples all from the same domain sequence (D1, D2 ,…, Dn), which is called the schema for the relation. We think of a relation as an ordinary table, where the rows are the tuples and the columns correspond to the domains.

The following table shows an example of a relation whose domain schema could be integer, text, text, date, text, email. It has five tuples.

ID

Last Name

First Name

Date of Birth

Job Title

Email

49103

Adams

Jane

1975-09-02

CEO

jadams@xyz.com

15584

Baker

John

1991-03-17

Data Analyst

jbaker@xyz.com

34953

Cohen

Adam

1978-11-24

HR Director

acohen@xyz.com

23098

Davis

Rose

1983-05-12

IT Manager

rdavis@xyz.com

83822

Evans

Sara

1992-10-10

Data Analyst

sevans@xyz.com

Each column has a unique name (for example, Date of Birth), which is shown at the top of the column when it is displayed. In addition, the table itself must have a name; this one is Employees.

Since tables and relations are practically the same thing, we will usually refer to them as tables, with the correspondingly simplified terms rows for its tuples, columns for its domains, attributes for its column names, and fields for its tuples' component values. A relational table is called an instance of its schema.

Note that a relation is a set of tuples, and as such, it is unordered. In other words, the order of the rows in a database table is irrelevant. The order of the columns is also irrelevant, since each column has a unique column name (for example, Date of Birth) and a type.

In addition to its domain schema, the definition of a table also usually includes a key designation. A primary key for a relation is a set of attributes whose values must be unique among the rows in the table. For Table 5-1, the obvious choice for the key attribute would be the ID field.