In Chapter 2, Data Preprocessing, we looked at some standard ways that data is stored. We saw that small unstructured datasets are often stored as text files, using white space, tabs, or commas to separate the data fields. Small, structured datasets are better handled by formats such as XML and JSON.
A database is a large, usually structured data collection that is accessed by an independent software system.
In this chapter, we will look at relational databases and the relational database systems that manage them. In Chapter 10, Working with NoSQL Databases, we will examine non-relational databases.
A relational database (RDB) is a database that stores its data in tables that are related by certain structural constraints. The word relational comes from the mathematical concept of a relation, which is essentially the same thing as a table. The precise definition follows.
A domain
is a set of data values of the same data type, usually integer, decimal number, or text, but could be Boolean (True
/False
), nominal, or URL, among others. If D1, D2 ,…, Dn are domains, then their Cartesian product
is the set D1, D2,…, Dn of all n-component sequences t = (d1, d2 ,…, dn ), where each di
∈
Di. Such sequences are called tuples (as in octuples when n = 8). A tuple is like a vector, except that the components of a tuple may be of different types; the components of a vector are usually just numbers. A relation is a subset of a Cartesian product of domains; that is, a set of tuples all from the same domain sequence (D1, D2 ,…, Dn), which is called
the schema for the relation. We think of a relation as an ordinary table, where the rows are the tuples and the columns correspond to the domains.
The following table shows an example of a relation whose domain schema could be integer, text, text, date, text, email. It has five tuples.
ID |
Last Name |
First Name |
Date of Birth |
Job Title |
|
---|---|---|---|---|---|
49103 |
Adams |
Jane |
1975-09-02 |
CEO |
jadams@xyz.com |
15584 |
Baker |
John |
1991-03-17 |
Data Analyst |
jbaker@xyz.com |
34953 |
Cohen |
Adam |
1978-11-24 |
HR Director |
acohen@xyz.com |
23098 |
Davis |
Rose |
1983-05-12 |
IT Manager |
rdavis@xyz.com |
83822 |
Evans |
Sara |
1992-10-10 |
Data Analyst |
sevans@xyz.com |
Table 5-1. A database relation for employees
Each column has a unique name (for example, Date of Birth), which is shown at the top of the column when it is displayed. In addition, the table itself must have a name; this one is Employees
.
Since tables and relations are practically the same thing, we will usually refer to them as tables, with the correspondingly simplified terms rows for its tuples, columns for its domains, attributes for its column names, and fields for its tuples' component values. A relational table is called an instance of its schema.
Note that a relation is a set of tuples, and as such, it is unordered. In other words, the order of the rows in a database table is irrelevant. The order of the columns is also irrelevant, since each column has a unique column name (for example, Date of Birth) and a type.
In addition to its domain schema, the definition of a table also usually includes a key designation. A primary key for a relation is a set of attributes whose values must be unique among the rows in the table. For Table 5-1, the obvious choice for the key attribute would be the ID field.