Upon completion of this chapter, you should understand the following:
- The restructuring performed by the transformation operations of transpose, pivot, and melt.
- How to recognize when data is not in tidy form and what transformations and operations should be used to normalize it.
- How to use transformations and operations to achieve common presentational goals for a data set or subset.
- How to perform transformations in pandas:
  - Transpose,
  - Pivot and Pivot Table,
  - Melt.
- How to use a sequence of pandas operations to achieve normalization and presentational goals.
9.1 Tabular Model Constraints
Recall that constraints are limitations on the organizational structure, the relationships, and the values entailed in a data set. For the tabular model in general, the structural constraints are articulated by the precepts of tidy data, repeated from Sect. 6.1 here:
1. Each column represents exactly one variable of the data set (TidyData1).
2. Each row represents exactly one unique (relational) mapping that maps from a set of givens (the values of the independent variables) to the values of the dependent variables (TidyData2).
3. Exactly one table is used for each set of mappings involving the same independent variables (TidyData3).
For a particular data set, we might, additionally, have data set specific constraints like the following examples:
A column named age must consist of integer values.
A column named income is a categorical variable and must be one of the strings, "high", "middle", or "low".
A column named year must not have any missing values.
If an income column is "low", a salary column must be a float with values less than or equal to 24000.0.
The first three of these are intra-column value constraints, and the last is an inter-column constraint, which, for this structure, says something about a relationship between two variables. These additional constraints might be required for the sound and correct interpretation of the data set for analysis.
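As a concrete sketch of what checking such constraints might look like in pandas (the frame df here is hypothetical, built only to illustrate the four constraints above):

```python
import pandas as pd

# Hypothetical data frame illustrating the constraints above
df = pd.DataFrame({
    'age':    [34, 29, 41],
    'income': ['high', 'low', 'middle'],
    'year':   [2015, 2016, 2017],
    'salary': [98000.0, 21500.0, 56000.0],
})

# Intra-column value constraints
assert pd.api.types.is_integer_dtype(df['age'])
assert df['income'].isin(['high', 'middle', 'low']).all()
assert df['year'].notna().all()

# Inter-column constraint: a "low" income row must have salary <= 24000.0
low_rows = df['income'] == 'low'
assert (df.loc[low_rows, 'salary'] <= 24000.0).all()
```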
When data is created by a provider source, it often fails to adhere to the structural constraints of tidy data, or to intra- and inter-column constraints, or to both. In the tabular model, there is neither checking nor enforcement to ensure any limitations on the data are met.
In order to check whether a given data set conforms to these kinds of limitations, there must be some means to specify the limitations themselves. However, in the tabular model, our data arrives in files or through network streams that are almost always in a “record per line” and “fields delimited by a special character” format, conveying the data itself. There is no additional information conveying limitations, as there is with the relational and hierarchical models. So in the tabular model, all constraints hold by convention (a common understanding), by being specified in some outside documentation, or both. Outside documentation describing the form and assumptions of the data is called a data dictionary and often conveys metadata.
Since any creator of data can choose how to take data and put it into a row and column tabular form, the biggest concern is whether or not the data conform to the structural limitations of the model. In other words, we must always be asking the question of whether or not the data violates any of TidyData1, TidyData2, or TidyData3. Any intra- and inter-column constraints are data set specific, and, without additional information, cannot be addressed in a general manner.
In Sect. 6.1, we illustrated, for the topnames data set, four different representations of the same set of data values, all of which violated one or more of the tidy data principles. It may be worth reviewing those examples before continuing on in this chapter. The main observation is this: just because data is organized in rows and columns does not mean that it conforms to the tabular model. Data can be organized in rows and columns in many different ways, and many of those ways violate the tidy data requirements.
Since intra- and inter-column constraints are data set specific and cannot be handled in a general manner, the goal of the current chapter is to better understand the three principles of tidy data, to help recognize when data does not conform to the model, and to give some tools and examples so that, in the eventuality of having data in an un-tidy form, we can transform it into its tidy data equivalent.
In Sect. 9.2, we discuss some advanced operations that allow us to transform one tabular structure into another tabular structure. These, along with the operations of Chaps. 7 and 8, will give us the tools we need in order to work with data in various un-tidy forms and to transform and restructure it. In Sect. 9.3, we present a series of vignettes that illustrate many commonly occurring un-tidy forms and give examples of the normalization process. Finally, in Sect. 9.4, we generalize some of the characteristics seen in the normalization examples and articulate some of the “red flag signs” that can help us to recognize when data is not tidy.
9.1.1 Reading Questions
Give 2–3 reasonable column constraints for the data frame topnames.
Give 2–3 reasonable column constraints for the data frame indicators.
Give an example from a previous time in your life when you have needed to reference information from a data dictionary.
Describe one of the examples of an un-tidy way to display the topnames data set, and explain what specifically makes it un-tidy.
9.1.2 Exercises
The data set ratings.csv has information on several individuals and the ratings they provided for each of two restaurants (called “A” and “B”). Is this data in tidy form? Explain.
Explore the data provided as ratings.csv, and then draw the functional dependency that it should have when in tidy form. Hint: there should be only one independent variable.
The data set restaurants_gender.csv has information on several individuals and the ratings they provided for each of two restaurants. Is this data in tidy form? Explain.
This question concerns popular songs, based on a public domain data set known as billboard:
A portion of this data is hosted on the book web page, and another snippet is visible here:
Is this data in tidy form? Justify your answer.
This question concerns the data set matches.csv hosted on the book web page and keeping data on sports match-ups. Each row is a team, and the columns tell how well a particular team did against each opponent. Does this data conform to the tidy data restrictions? Justify your answer.
This question concerns data from the World Bank about population per country from 2000 to 2018, including the total population, the population growth, the urban population, and the urban population growth, for each country and each year.
0. Read world_bank_pop.csv into a data frame dfwb.
1. Give a reason why dfwb fails to represent TidyData.
Explore the data provided as world_bank_pop.csv, and then draw the functional dependency.
Download the World Health Organization’s “Global Tuberculosis Report” data set as a CSV file from
Look at the CSV file. Is it in tidy data form?
Consider the data set members.csv on the book web page. Please list all steps that would be required to make this data tidy.
9.2 Tabular Transformations
Transformations are the process of converting from one tabular form into another tabular form, while maintaining the information and values represented. We should note that, while these are viewed as complex operations of the tabular data model, they are simply algorithms that are constructed from the more basic operations: selection, projection, iteration, and so forth, which we have already learned. They are well defined and, if we needed to, we could construct these operations for ourselves. But because of their common utility, packages such as pandas in Python, or dplyr in R, have written them for us and allow us to leverage their implementation.
While our focus in this chapter on constraints will be on using these transformations to normalize a data set from an un-tidy form, these tools can also be used for some types of analysis, as well as for communication, manipulating the data into a form suited for presentation. Because of this, the discussion in this section will be general, conveying the effect of each transformation, while also highlighting data situations where each transformation should be used.
9.2.1 Transpose
In linear algebra, a transpose is described as reflecting a matrix over its main diagonal. The columns of the matrix become the rows, and the rows become the columns, and an element that was at row i and column j becomes the element at row j and column i. The same transformation can be used to swap the rows and columns of a data frame, in case they are provided incorrectly, with variables as the rows and individuals as the columns.
In pandas, the transpose operates so that the row labels become the column labels and vice versa, so the transpose is almost always used in situations where the row labels are a meaningful index.
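A minimal sketch (the individuals Alice and Bob are hypothetical) of correcting a frame that arrives with variables as rows:

```python
import pandas as pd

# Variables arrived as the rows, individuals as the columns
df = pd.DataFrame({'Alice': [34, 62.0], 'Bob': [29, 80.5]},
                  index=['age', 'weight'])

# The transpose swaps the roles: row labels become column labels
corrected = df.T          # equivalently, df.transpose()
```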
9.2.2 Melt
The melt operation takes a subset of the columns, the melt columns, and uses the column labels of the melt columns as table entry values under a single new column, called the new variable. The operation uses the values from the original melt columns to become a single new value column in the transformation. This operation is particularly useful when the names of columns in a data set are, in fact, values in the data set, and we need to create a new column for these values.
Figure 9.1 highlights the transformation of row i in the original data frame, with column labels w, x, y, and z and values in row i of w′,x′, y′, and z′. For example, in our un-tidy topnames data, i could be name, the melt columns could each be a year (e.g., w = 2013, x = 2014, etc.), and the red values could be the top names (e.g., w′ = “Sophia,” x′ = “Emma,” etc.).
In the transformed result, this single row i becomes four rows, corresponding to the original columns (w, x, y, and z) and the values of w′, x′, y′, and z′ are gathered in a single new column. The same transformation is applied to every row, e.g., a row of counts (with one column per year) is reshaped into a column named count with one row per year.
The transformed result has three kinds of columns:

- the non-melt column, I, which remains and will repeat its values in the rows of the melted result,
- the new variable column, which we name K, whose values come from the column labels in the original, and
- the new value column, which we name V, whose values come from the values in the melt columns of the original.
The rows of the result are obtained by expanding, for each row in the original, n rows in the result, where n is the number of melt columns. The n rows will repeat the I value, will use exactly one of the column labels for the K column, and will use a lookup of I and the particular melt column to get the value for V.
We repeat example1 and note the n = 3 columns to be melted:
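The display of example1 is not reproduced here, so the following is a plausible reconstruction: the non-melt column is I, the melt columns are labeled 1, 2, and 3, and an entry such as A.1 stands for the value at row A, column 1.

```python
import pandas as pd

# Abstract frame: rows A, B, C; melt columns '1', '2', '3'
example1 = pd.DataFrame({
    'I': ['A.I', 'B.I', 'C.I'],
    '1': ['A.1', 'B.1', 'C.1'],
    '2': ['A.2', 'B.2', 'C.2'],
    '3': ['A.3', 'B.3', 'C.3'],
})

# Melt, naming the new variable column K and the new value column V
meltresult1 = example1.melt(id_vars='I', var_name='K', value_name='V')
```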
The values of K are exactly the names of the melted columns. And the values of V are exactly the data values from inside table example1. The table meltresult1 has a default integer row index that we use to help describe the result of the melt operation. Rows 0, 3, and 6 in meltresult1 are transformations of row A in the original table. Note that the entry A.I appears n = 3 times in column I. In row 0, the column 1 from the original has become value 1 under new variable column K, and the value in row A, column 1 of the original, namely A.1, has become the value under column V for row 0. Similarly, row 3 is obtained from row A and column 2 of the original, and row 6 is obtained from row A and column 3 of the original.
To specify a melt, then, we must:

- allow the function to partition the columns into the melt columns and the non-melt columns,
- provide a name for the new variable column, and
- provide a name for the new value column.
In the pandas melt() method of a DataFrame, the id_vars= named parameter allows us to specify the non-melt column or columns, the value_vars= named parameter allows us to specify the melt columns, var_name= provides the name for the new variable column, and value_name= provides the name for the new value column. Since we are partitioning the columns between those to be melted and those not melted, we only need to specify one of id_vars and value_vars, and the operation can determine the other. If we do not specify var_name or value_name, melt() will use the generic defaults of variable and value for the new column labels.
Top baby names by year
| Sex | 2015 | 2016 | 2017 |
|---|---|---|---|
| Female | Emma | Emma | Emma |
| Male | Noah | Noah | Liam |
Baby names melted
| Sex | Year | Name |
|---|---|---|
| Female | 2015 | Emma |
| Male | 2015 | Noah |
| Female | 2016 | Emma |
| Male | 2016 | Noah |
| Female | 2017 | Emma |
| Male | 2017 | Liam |
In code, the Python variable names_by_year refers to the original data set:
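A sketch of that frame and the corresponding melt() call (the names year and name for the new columns are our choices):

```python
import pandas as pd

names_by_year = pd.DataFrame({
    'sex':  ['Female', 'Male'],
    '2015': ['Emma', 'Noah'],
    '2016': ['Emma', 'Noah'],
    '2017': ['Emma', 'Liam'],
})

# Melt the year columns into a single year/name pair of columns
melted = names_by_year.melt(id_vars='sex', var_name='year',
                            value_name='name')
```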
In this case, we specify only the non-melt column of sex as id_vars, and pandas uses the remaining columns for the melt columns. As usual, the values of the non-melt column have been repeated, explaining why we have three rows with “Female” and three with “Male.”
The name of melt for the operation is intended to convey the idea that some of the columns are “melted” and become rows in the result, transforming a wider table into a table with fewer columns, but with more rows. In pandas, this operation is also performed by the stack method, which uses the row label index as the non-melt column and all other columns as the columns to be melted. Other languages, like R, have a gather() function to perform the same operation.
When there is more than one column that makes up the non-melt columns, all the values of the non-melt columns are repeated in each of the rows of the result generated from the same row of the source. We show this with the abstract example2, where we have two non-melt columns, I and J:
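A plausible reconstruction of example2 and its melt:

```python
import pandas as pd

# Same shape as example1, but with two non-melt columns, I and J
example2 = pd.DataFrame({
    'I': ['A.I', 'B.I', 'C.I'],
    'J': ['A.J', 'B.J', 'C.J'],
    '1': ['A.1', 'B.1', 'C.1'],
    '2': ['A.2', 'B.2', 'C.2'],
    '3': ['A.3', 'B.3', 'C.3'],
})

# Both non-melt columns are passed as id_vars and repeat in the result
meltresult2 = example2.melt(id_vars=['I', 'J'], var_name='K',
                            value_name='V')
```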
Again, rows with indices 0, 3, and 6 are the melted result from row A in the original, and all of these have the same values for the non-melt columns, I and J. Note how this introduces the redundancy of having the I as well as J values repeated in the result.
In this section, we have focused on melt(), but for a developer comfortable with single- and multi-level indices, stack() (and unstack()) can be useful for accomplishing the same goals. Additional information about pandas facilities for transformation/reshaping can be found in the user guide [45] and in the references for melt() [41] and stack() [44].
9.2.2.1 [Optional] Stack Examples
This optional section presents simple examples of the use of stack() to accomplish the same goal as explored with melt().
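For instance, reusing example1 from earlier, a minimal sketch sets I as the row label index and stacks the remaining columns:

```python
# Set I as the row index; stack() then melts all remaining columns,
# using their labels as an inner level of the row Index
stacked = example1.set_index('I').stack()
```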
In this case, the result is one-dimensional—we are left with only one non-index column, and so the result is a Series object, albeit one that has a two-level Index. We have seen this exact kind of data frame in the context of topnames in Sect. 6.3.2. If K consisted of year data (say, 2014, 2015, and 2016), and I consisted of sex data, then A.I might be Male (with the three years as the inner index) and B.I might be Female (again representing three rows, one per year). The values A.1, A.2, etc. would be the applicant numbers for top births by sex and year. The reader who is still confused is encouraged to pause and draw the table we have just described.
9.2.3 Pivot
Shortly, we will show how data of this shape can be made tidy using a pivot transformation; first, we describe the pivot transformation in general. In Fig. 9.2 we show, among many possible rows in an original table on the left, two sets of rows that, using matching values in the column annotated as “Index”, will be combined into a single row of the result. In the column annotated as “Pivot,” we see three unique values, x, y, and z, that are repeated between each of the two sets of rows (just like pop, gdp, and life). So there are six values in the column annotated as “Value.”1 The values themselves are denoted with single and double prime marks (e.g., because the pop of Canada can be different from the pop of the USA) and must be included in the transformed result.
Furthermore, the values in the resulting table (after the pivot transformation) can be characterized as follows. At the intersection of an “Index” value and one of the values from the “Pivot,” we find the corresponding values from the “Value” column of the source. For example, in the source (ind_to_pivot), the “Value” associated with Canada and life was 82.30, and that is the same value we see after the pivot, in the row indexed by CAN and the column life.
So, in this example, there are six distinct values from column V that we want to represent in the final table, two per each of A, B, and C. For each set of rows to combine, the values in the pivot column (K.1 and K.2) distinguish the rows from each other, and the same pivot values are repeated in each set. There is exactly one column, I, that uniquely identifies the set of rows to collapse. Think of I and K as two independent variables that yield a dependent variable V.
This is, in fact, the exact command we used to convert from ind_to_pivot to ind_corrected:
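Since the frames are not reproduced here, the following sketch reconstructs ind_to_pivot with plausible values; only the Canada/life entry of 82.30 is confirmed by the discussion above, the USA figures come from other tables in this chapter, and the remaining Canadian figures are illustrative:

```python
import pandas as pd

ind_to_pivot = pd.DataFrame({
    'code':      ['CAN', 'CAN', 'CAN', 'USA', 'USA', 'USA'],
    'indicator': ['pop', 'gdp', 'life'] * 2,
    'value':     [37.06, 1716.26, 82.30, 325.15, 19485.39, 78.54],
})

# Collapse each set of three rows into one row per code, with one
# column per unique indicator value
ind_corrected = ind_to_pivot.pivot(index='code', columns='indicator',
                                   values='value')
```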
We will return to an analogous example in Sect. 9.3, where we will also learn how to drop a level of a multi-level index.
Here, the “Index” column is still I, and the pivot column is still K, but now there are two “Value” columns, V1 and V2, each with unique values to be represented in a pivoted result. In this case, beyond the row label index, we expect four columns, with one for each pairing of the pivot column values K.1 and K.2 with the two value columns V1 and V2.
The two-level index of the result of a pivot is helpful here, as it allows a naming for the combinations of value column and pivot column in the result.
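As a concrete sketch (our own construction, assuming an indicators-style frame in which pop and gdp play the roles of the two value columns, and year, with three distinct values, is the pivot column):

```python
import pandas as pd

indicators2 = pd.DataFrame({
    'code': ['CHN'] * 3 + ['IND'] * 3,
    'year': [2015, 2016, 2017] * 2,
    'pop':  [1371.22, 1378.66, 1386.40, 1310.15, 1324.51, 1338.66],
    'gdp':  [11015.54, 11137.95, 12143.49, 2103.59, 2290.43, 2652.55],
})

# Omitting values= pivots every remaining column at once, producing
# two-level column labels: (pop, 2015), ..., (gdp, 2017)
wide = indicators2.pivot(index='code', columns='year')
```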
This results in the two-level column label/index and the six desired columns.
For many data sets, there may not always be exactly the same set of rows in an original to be combined into a single row in the result. Suppose, for example, that the indicators data set has years 2010, 2015, and 2018 for China, but just years 2010 and 2015 for India and Great Britain.
The new indicators data frame would look like this:
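A sketch of this uneven frame and the effect of pivoting it (the 2015 populations match earlier tables; the 2010 and 2018 figures are illustrative placeholders):

```python
import pandas as pd

indicators_uneven = pd.DataFrame({
    'code': ['CHN', 'CHN', 'CHN', 'IND', 'IND', 'GBR', 'GBR'],
    'year': [2010, 2015, 2018, 2010, 2015, 2010, 2015],
    'pop':  [1337.70, 1371.22, 1392.73, 1234.28, 1310.15, 62.77, 65.13],
})

# Combinations with no source row become NaN in the result:
# here, IND and GBR have no value in the 2018 column
wide = indicators_uneven.pivot(index='code', columns='year',
                               values='pop')
```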
9.2.3.1 Pivot Table
Here, we would want and expect the result after a pivot to have two rows, one for A and the other for B. But, unlike in the prior examples, we have subsets of rows of the same index with the same value of the pivot variable. For the A rows, there are two rows with a P value of x. But, in the result of a pivot, we would have only one A row and only one x column (for each of V1 and V2). Similarly, there are two B rows with the w pivot.
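The frame is not reproduced here; the following is a plausible reconstruction, with the (A, x) and (B, w) values chosen to match the arithmetic described next (the V1 values for (B, w) are assumptions consistent with a mean of 6):

```python
import pandas as pd

# Duplicate (I, P) combinations would make a plain pivot() raise an
# error; pivot_table aggregates them instead (default aggfunc='mean')
df = pd.DataFrame({
    'I':  ['A', 'A', 'B', 'B'],
    'P':  ['x', 'x', 'w', 'w'],
    'V1': [10, 5, 4, 8],            # the 4 and 8 are assumed (mean 6)
    'V2': [42.5, 18.0, 10.2, 14.3],
})

averaged = df.pivot_table(index='I', columns='P')
```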
We can see that the value for A and x under V1 is the mean of 10 and 5, and the value for B and w under V2 is the mean of 10.2 and 14.3.
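Passing a different aggregation function changes how duplicates are combined; for example, summing instead of averaging:

```python
# aggfunc='sum' adds the duplicated values instead of averaging them
summed = df.pivot_table(index='I', columns='P', aggfunc='sum')
```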
For example, the ( A,x) value of V2 is now 60.5 = 42.5 + 18.
The result builds a multi-level Index in both the row and the column dimensions. We may need to reset the row label/index or perhaps drop the outer level of the result, depending on what further operations are required.
Details for the pivot() and pivot_table() operations are given in the full documentation of the methods of pandas DataFrame objects [42, 43].
9.2.4 Reading Questions
Give a real-world example where you might want to take the transpose. Hint: normally, the rows are the individuals and the columns are traits, so think of a situation where you might want to switch your focus to the traits rather than the individuals.
Does the repetition created by the melt operation mean that the resulting data frame is un-tidy?
Is it possible to do a melt with no id_vars?
The reading gives an example of a data frame that melts into our usual topnames example. Give an example of a data frame that melts into indicators[['code','year','pop']]. Hint: your example should have lots of columns.
Write a sentence describing the number of rows after a melt, based on the number before the melt and the number of melt columns. Does it matter how many non-melt columns there are?
Read the documentation for stack, including how to stack only a subset of the columns. Then, apply stack to the pandas data frame created from this list of lists:
Looking at the Pivot Operation Figure, how can you tell how many new columns and new rows will be present after a pivot operation, based on the columns index, pivot, and value in the original data frame?
In the data frame pivot_ex1, could you have done pivot with other choices of index columns? If so, explain in detail what would have happened.
When doing a pivot, do the arguments index,columns,values have to partition the original columns? Explain.
In the indicators example, does the line
Explain.
Why does it make sense that the default for pivot_table is to take the average of the repeated values?
Carry out the arithmetic to explain why the values in the pivot_table examples are different, i.e., explain where 30.25, 7.5, 6, and 12.25 come from in the first table and where 60.5 and 24.5 come from in the second table.
In the last example of pivot_table, explain what would happen if you chose columns = 'value' instead of columns = 'indicator'.
Consider the original topnames data frame, with year,sex,name, count. Show how to use pivot_table to get a version with sex and year as indices, columns given by name, and with counts as the entries in the data frame.
9.2.5 Exercises
Make the following into a pandas data frame.
Suppose further that we have determined that columns B and D are really values of a variable called X. What transformation/reshaping operation should be used to obtain a tidy version of this data?
At a minimum, what parameter arguments would be needed for this operation to do its job?
Give the column headers of the transformed data set and give at least two rows.
Make the following into a pandas data frame.
If the values one and two from column foo should head columns (so it takes more than one row to interpret a single observation), and the values themselves come from the baz column, what transformation/reshaping operation should be used to obtain a tidy version of this data?
What parameter arguments would be needed for this operation to do its job?
Give the resultant transformed data set, including column headers.
Refer back to ratings.csv and the un-tidy data frame you made in the previous section. Make the data frame tidy, storing your result as ratings_tidy.
Refer back to the restaurants_gender.csv and associated data frame from the previous section. Pivot the restaurants_gender data into a matrix presentation with restaurant down one axis and gender across the other axis. This makes it easy to aggregate female ratings and male ratings. Store the result as rest_mat.
This question concerns the billboard data introduced in the exercises of the previous section. Download this data as a CSV file, read it into a DataFrame, and melt the data frame to make it tidy.
This question concerns the data set matches.csv hosted in the data directory, keeping data on sports match-ups. Each row is a team, and the columns tell how well a particular team did against each opponent.
1. Read matches.csv into a data frame df.
2. Transform the data frame to make it tidy, storing the result as df_tidy. Your data frame should have no NaN values.
This question concerns data from the World Bank about population per country from 2000 to 2018, including the total population, the population growth, the urban population, and the urban population growth, for each country and each year.

0. Read world_bank_pop.csv into a data frame dfwb.
1. Transform the data frame to make it tidy and store the result as dfwb_tidy.
Consider the original topnames data frame, with year,sex,name,count. Show how to use pivot_table to get a version with sex and year as indices, columns given by name, and with counts as the entries in the data frame. Call the result dfPivoted.
With reference to the exercise above, is dfPivoted in tidy data form? Explain your answer.
9.3 Normalization: A Series of Vignettes
With the full suite of transformation operations to help, we are now ready to look at a primary objective of this chapter: given a data set that does not conform to the principles of tidy data, construct a sequence of operations, as an algorithm, to normalize the data into tidy data form.
Because data may be “messy” in an almost infinite variety of ways, the current section is presented as a series of example vignettes, where we show some of the more common violations of tidy data and discuss how our operations from Chaps. 7 and 8 and our transformations from Sect. 9.2 may be employed to resolve the violation(s).
Before any attempt at normalization for a data set, it is of critical importance to understand your data. You must be able to separate the values from the variables. You should be able to argue that the variables are different measures. Variables identified as categorical should be assessed on whether they indeed provide different categories that partition observations or whether the “categories” could be variables themselves. Variables should be identified as dependent or independent. Functional dependencies should be written down and examined critically to understand whether dependent variables do, in fact, depend on all the independent variables. If one or more dependent variables are determined by a subset of the independent variables, these should be given separate functional dependencies and therefore separate tables.
9.3.1 Column Values as Mashup
TidyData1 states that each column must represent exactly one variable. A violation occurs if, in a given data set, either a column is not actually a variable or if a column is in violation because of the exactly one stipulation. Any time the values in a column are some kind of composition of values of more than one variable, a violation has occurred. A composition might mean a mashup, or it could be the result of a column that is a combination of two other columns, or even an aggregation of multiple columns.
9.3.1.1 Example: Code and Country Mashup
At an abstract level, we need to introduce new columns that contain, parsed out, the variable values from the current country column, and then replace that column with the two new columns. We write our solution as an algorithm.
1. If necessary, reset the index to allow manipulation of country as a column.
2. Define two functions that, given a value of country, yield the code part and the country part.
3. Apply the function to get a code to the current country column, yielding a codes vector.
4. Apply the function to get a country to the current country column, yielding a countries vector.
5. Drop the current country column.
6. Add codes as a column.
7. Add countries as a column.
8. If needed, set the index to be the new code column.
9. If needed, change the order by projecting columns in the desired order.
The corresponding code to realize this algorithm is as follows:
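The original frame is not shown, so the sketch below assumes (hypothetically) that the mashup values look like "CHN: China": a three-letter code, a colon and a space, and then the country name. The slice positions 3 and 5 depend on that assumed format.

```python
import pandas as pd

working = pd.DataFrame({
    'country': ['CHN: China', 'IND: India', 'USA: United States'],
    'pop':     [1386.40, 1338.66, 325.15],
})                                                   # 1. country already a column

get_code = lambda v: v[:3]                           # 2. parsing functions
get_country = lambda v: v[5:]

codes = working['country'].apply(get_code)           # 3. codes vector
countries = working['country'].apply(get_country)    # 4. countries vector

working = working.drop(columns='country')            # 5. drop the mashup column
working['code'] = codes                              # 6. add codes
working['country'] = countries                       # 7. add countries

working = working.set_index('code')                  # 8. index by code
working = working[['country', 'pop']]                # 9. desired column order
```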
In the final version of working, we see the normalized data set. In this example, we could also have made country the index and had code as a dependent variable.
9.3.1.2 Example: Year and Month Mashup
Metropolis tourism
| Date | Visitors | Rooms | Occupancy | Revenue |
|---|---|---|---|---|
| Jan-2015 | 508800 | 28693 | 0.857 | 26124000 |
| Feb-2015 | 450500 | 27866 | 0.656 | 25618000 |
| Mar-2015 | 783900 | 27717 | 0.940 | 27317000 |
| Apr-2015 | 545100 | 27566 | 0.887 | 23254000 |
| May-2015 | 602100 | 27656 | 0.891 | 20700000 |
Such a composite date column causes at least two problems:

1. When dates are strings, even a simple sort operation, which occurs as a lexicographic ordering of the string characters involved, can yield unexpected results. In this case, a sort would put Apr-2015, not Jan-2015, as the first entry. Similar problems can occur even if numeric months and days are used, depending on the ordering.
2. When data is in a time series over multiple months and years, analysis often wants to GroupBy the years of the data set, to characterize year over year information, or to GroupBy the months of the data set, to characterize monthly or seasonal trends.
To make time series encoded as composite strings tidy and normalize a data set, we have two choices. First, if a package supports it, we could make use of a DateTime object type. These objects are designed to abstract the nature of years, months, days, hours, minutes, seconds into a single object that can be sorted and, in packages like pandas, can be used in a GroupBy which specifies the frequency (like year or month) of the group. Second, we can use the approach employed above, splitting the string into its component types and replacing the composite column with the individual columns.
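A minimal sketch of the first choice, using the date and visitors columns of the Metropolis tourism table:

```python
import pandas as pd

tourism = pd.DataFrame({
    'date':     ['Jan-2015', 'Feb-2015', 'Mar-2015', 'Apr-2015', 'May-2015'],
    'visitors': [508800, 450500, 783900, 545100, 602100],
})

# Parse the composite strings into DateTime values ('%b-%Y' means an
# abbreviated month name, a hyphen, and a four-digit year)
tourism['date'] = pd.to_datetime(tourism['date'], format='%b-%Y')

# Sorting is now chronological, and grouping by year or month works
tourism = tourism.sort_values('date')
by_month = tourism.groupby(tourism['date'].dt.month)['visitors'].sum()
```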
9.3.2 One Relational Mapping per Row
TidyData2 states that each row represents exactly one mapping from the values of the independent variables to the values of the dependent variables. This means that if, for a representation, more than one row is employed to give the value of dependent variables for the same independent variable, the data is not in tidy form. In the next two examples, we will look at variations of having multiple rows used to represent a single relational mapping.
9.3.2.1 Example: One Value Column and One Index Column
Indicator data
| Code | Indicator | Value |
|---|---|---|
| CHN | Pop | 1386.40 |
| CHN | Gdp | 12143.49 |
| CHN | Exports | 2280.09 |
| GBR | Pop | 66.06 |
| GBR | Gdp | 2637.87 |
| GBR | Exports | 441.11 |
| IND | Pop | 1338.66 |
| IND | Gdp | 2652.55 |
| IND | Exports | 296.21 |
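A pivot collapses the three rows per code into a single row, with one column per unique indicator value; a sketch built from the table above:

```python
import pandas as pd

indicator_data = pd.DataFrame({
    'code':      ['CHN', 'CHN', 'CHN', 'GBR', 'GBR', 'GBR',
                  'IND', 'IND', 'IND'],
    'indicator': ['pop', 'gdp', 'exports'] * 3,
    'value':     [1386.40, 12143.49, 2280.09, 66.06, 2637.87, 441.11,
                  1338.66, 2652.55, 296.21],
})

tidy = indicator_data.pivot(index='code', columns='indicator',
                            values='value')
```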
This yields the desired tidy form of the data.
It is worth noting that, when TidyData2 was violated by having more than one row per mapping, TidyData1 was also violated: neither indicator nor value in the original data frame satisfied the condition of a column representing exactly one variable.
9.3.2.2 Example: One Value Column and Two Index Columns
Indicator data
| Year | Code | Indicator | Value |
|---|---|---|---|
| 2015 | CHN | Pop | 1371.22 |
| 2015 | CHN | Gdp | 11015.54 |
| 2015 | GBR | Pop | 65.13 |
| 2015 | GBR | Gdp | 2896.42 |
| 2015 | IND | Pop | 1310.15 |
| 2015 | IND | Gdp | 2103.59 |
| 2017 | CHN | Pop | 1386.40 |
| 2017 | CHN | Gdp | 12143.49 |
| 2017 | GBR | Pop | 66.06 |
| 2017 | GBR | Gdp | 2637.87 |
| 2017 | IND | Pop | 1338.66 |
| 2017 | IND | Gdp | 2652.55 |
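With two independent variables, both participate in the pivot index. A sketch built from the table above (pandas 1.1 and later accept a list for index=; earlier versions can use pivot_table):

```python
import pandas as pd

indicator2 = pd.DataFrame({
    'year':      [2015] * 6 + [2017] * 6,
    'code':      ['CHN', 'CHN', 'GBR', 'GBR', 'IND', 'IND'] * 2,
    'indicator': ['pop', 'gdp'] * 6,
    'value':     [1371.22, 11015.54, 65.13, 2896.42, 1310.15, 2103.59,
                  1386.40, 12143.49, 66.06, 2637.87, 1338.66, 2652.55],
})

tidy = indicator2.pivot(index=['year', 'code'], columns='indicator',
                        values='value')
```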
9.3.3 Columns as Values and Mashups
Another frequent violation of TidyData1 occurs when the column labels represent values of a variable instead of a single variable of the data set. Sometimes this is easy to see, like when the column labels are, in fact, numeric values,2 but sometimes it is less obvious, like when the column name incorporates a variable name along with a value in a mashup as the column name. The next two examples look at variations of this scenario.
9.3.3.1 Example: Single Variable with Multiple Years
Population by year
| Code | pop2014 | pop2015 | pop2016 | pop2017 |
|---|---|---|---|---|
| CHN | 1364.27 | 1371.22 | 1378.66 | 1386.40 |
| GBR | 64.61 | 65.13 | 65.60 | 66.06 |
| IND | 1295.60 | 1310.15 | 1324.51 | 1338.66 |
| USA | 318.39 | 320.74 | 323.07 | 325.15 |
To normalize this data, we need to start with a melt of the table using all the “popXXXX” columns as the melt columns. The value column of the result will have collected all the values from the original table, as desired. But the “New Var” in the result will still have the mashup of the string "pop" composed with the value of the year, so this will need to be resolved to obtain the integer year value. We write our solution as an algorithm.
1. If necessary, reset the index to allow the use of the code index as a column.
2. Melt the data frame, so that all columns except the code column transform into rows, where the value in the column becomes the pop column, and the previous column name becomes a value in a column named colname.
3. Define a function that, given a value of colname, yields the year value.
4. Apply the function to the colname column and assign to the data frame as a column named year.
5. Perform the cleanup of dropping the colname column and setting the index based on two independent variables (year and code); if desired, change the row order so that the index shows grouping in a more friendly presentation.
With the original data from Table 9.6 as a data frame referred to by pop_columns, this algorithm translates into the following code and yields the normalized result named tidy:
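The following is a sketch of that code, reconstructing pop_columns from Table 9.6:

```python
import pandas as pd

pop_columns = pd.DataFrame({
    'code':    ['CHN', 'GBR', 'IND', 'USA'],
    'pop2014': [1364.27, 64.61, 1295.60, 318.39],
    'pop2015': [1371.22, 65.13, 1310.15, 320.74],
    'pop2016': [1378.66, 65.60, 1324.51, 323.07],
    'pop2017': [1386.40, 66.06, 1338.66, 325.15],
}).set_index('code')

working = pop_columns.reset_index()                   # 1. code as a column
working = working.melt(id_vars='code',                # 2. melt popXXXX columns
                       var_name='colname', value_name='pop')

get_year = lambda name: int(name[3:])                 # 3. 'pop2014' -> 2014
working['year'] = working['colname'].apply(get_year)  # 4. add year column

tidy = (working.drop(columns='colname')               # 5. cleanup: drop, index,
               .set_index(['year', 'code'])           #    and sort for a friendly
               .sort_index())                         #    presentation
```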
An equivalent tidy representation could swap the levels of the row Index, and possibly perform a sort_index, if we desired code as the outer index and year as the inner index.
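A sketch of that equivalent form, continuing from tidy above:

```python
# Swap index levels so code is the outer level, then restore sorting
tidy_by_code = tidy.swaplevel().sort_index()
```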
9.3.3.2 Example: Multiple Variables with Multiple Years
Population and GDP by year
| Code | pop2015 | pop2016 | pop2017 | gdp2015 | gdp2016 | gdp2017 |
|---|---|---|---|---|---|---|
| CHN | 1371.22 | 1378.66 | 1386.40 | 11015.54 | 11137.95 | 12143.49 |
| GBR | 65.13 | 65.60 | 66.06 | 2896.42 | 2659.24 | 2637.87 |
| IND | 1310.15 | 1324.51 | 1338.66 | 2103.59 | 2290.43 | 2652.55 |
| USA | 320.74 | 323.07 | 325.15 | 18219.30 | 18707.19 | 19485.39 |
Like in the last example, we clearly need to start with a melt, which will transform the data set into a frame with columns for code, as the index, ind_year for the new variable column, and value for the value column. Also, like the previous example, the values in the ind_year column will be strings composed of indicator and year. But now, we will have strings like both 'pop2015' and 'gdp2015'. So now when we separate, we will get two columns, one for the indicator part and the other for the year part.
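A sketch of this melt-then-split-then-pivot sequence, reconstructing the frame from the table above under the hypothetical name pop_gdp:

```python
import pandas as pd

pop_gdp = pd.DataFrame({
    'code':    ['CHN', 'GBR', 'IND', 'USA'],
    'pop2015': [1371.22, 65.13, 1310.15, 320.74],
    'pop2016': [1378.66, 65.60, 1324.51, 323.07],
    'pop2017': [1386.40, 66.06, 1338.66, 325.15],
    'gdp2015': [11015.54, 2896.42, 2103.59, 18219.30],
    'gdp2016': [11137.95, 2659.24, 2290.43, 18707.19],
    'gdp2017': [12143.49, 2637.87, 2652.55, 19485.39],
})

melted = pop_gdp.melt(id_vars='code', var_name='ind_year',
                      value_name='value')

# Separate the mashup: the first three characters are the indicator,
# the rest is the year
melted['indicator'] = melted['ind_year'].apply(lambda s: s[:3])
melted['year'] = melted['ind_year'].apply(lambda s: int(s[3:]))
melted = melted.drop(columns='ind_year')

# Pivot the indicator part back out into pop and gdp columns
# (pandas >= 1.1; earlier versions can use pivot_table)
tidy = melted.pivot(index=['year', 'code'], columns='indicator',
                    values='value')
```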
For this example, the original data set, as a data frame, could have been normalized in a couple of other ways than what was presented here.
First, we could have taken the original data frame and, using two different column projections, created a data frame with all the population data and another data frame with all the GDP data. Each of these could have been transformed into a partially tidy data set, with population or GDP indexed by code and year. Then, using the table combination techniques of Sect. 8.3, these partial data sets could be combined into a tidy whole.
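A sketch of this first alternative, reusing pop_gdp from above:

```python
# Project and normalize the population part
pop_long = pop_gdp[['code', 'pop2015', 'pop2016', 'pop2017']].melt(
    id_vars='code', var_name='colname', value_name='pop')
pop_long['year'] = pop_long['colname'].apply(lambda s: int(s[3:]))
pop_long = pop_long.drop(columns='colname')

# Project and normalize the GDP part
gdp_long = pop_gdp[['code', 'gdp2015', 'gdp2016', 'gdp2017']].melt(
    id_vars='code', var_name='colname', value_name='gdp')
gdp_long['year'] = gdp_long['colname'].apply(lambda s: int(s[3:]))
gdp_long = gdp_long.drop(columns='colname')

# Combine the partial data sets (Sect. 8.3) into a tidy whole
tidy_alt = pop_long.merge(gdp_long, on=['code', 'year'])
```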
A second alternative would involve the use of a two-level Index in the column dimension. If we transformed the data frame, through manipulation of the column Index, so that it had one level of the indicator and the other level of the year, then the pandas unstack( ) method, which generically is really transforming column Index levels into row Index levels, could be used to unstack just the year level. This is a more advanced topic specific to pandas, so we have chosen to omit it.
9.3.4 Exactly One Table per Logical Mapping
Up to now, the examples of messy data we have looked at have all involved TidyData1 and/or TidyData2. The other constraint of this model is TidyData3, which requires that exactly one table be used for a set of mappings (i.e., rows) involving the same combination of independent variables. To better understand this, it helps to think in terms of functional dependencies, where the combination of independent variables is specified by the left-hand side of the relationship.
9.3.4.1 Example: Variable Values as Two Tables
Top female baby names
| Year | Name | Count |
|---|---|---|
| 2018 | Emma | 18688 |
| 2017 | Emma | 19800 |
| 2016 | Emma | 19496 |
| 2015 | Emma | 20455 |
| 2014 | Emma | 20936 |
| 2013 | Sophia | 21223 |
Top male baby names
| Year | Name | Count |
|---|---|---|
| 2018 | Liam | 19837 |
| 2017 | Liam | 18798 |
| 2016 | Noah | 19117 |
| 2015 | Noah | 19635 |
| 2014 | Noah | 19305 |
| 2013 | Noah | 18257 |
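The fix is to make the table names into values of a sex variable and then stack the rows; a sketch reconstructing the two tables:

```python
import pandas as pd

topfemale = pd.DataFrame({
    'year':  [2018, 2017, 2016, 2015, 2014, 2013],
    'name':  ['Emma', 'Emma', 'Emma', 'Emma', 'Emma', 'Sophia'],
    'count': [18688, 19800, 19496, 20455, 20936, 21223],
})
topmale = pd.DataFrame({
    'year':  [2018, 2017, 2016, 2015, 2014, 2013],
    'name':  ['Liam', 'Liam', 'Noah', 'Noah', 'Noah', 'Noah'],
    'count': [19837, 18798, 19117, 19635, 19305, 18257],
})

# The table names were really values of a categorical variable, sex
topfemale['sex'] = 'Female'
topmale['sex'] = 'Male'

tidy = pd.concat([topfemale, topmale], axis=0)
```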
9.3.4.2 Example: Separate Logical Mappings in a Single Table
Combined table
| Year | Code | Country | Land | Pop | Gdp |
|---|---|---|---|---|---|
| 2000 | CHN | China | 9388210 | 1262.64 | 1211.35 |
| 2000 | IND | India | 2973190 | 1056.58 | 468.39 |
| 2000 | USA | United States | 9147420 | 282.16 | 10252.35 |
| 2017 | CHN | China | 9388210 | 1386.40 | 12143.49 |
| 2017 | IND | India | 2973190 | 1338.66 | 2652.55 |
| 2017 | USA | United States | 9147420 | 325.15 | 19485.39 |
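A sketch of the fix: project each logical mapping into its own table, using drop_duplicates() on the countries table because country and land repeat for every year:

```python
import pandas as pd

mixed_table = pd.DataFrame({
    'year':    [2000, 2000, 2000, 2017, 2017, 2017],
    'code':    ['CHN', 'IND', 'USA'] * 2,
    'country': ['China', 'India', 'United States'] * 2,
    'land':    [9388210, 2973190, 9147420] * 2,
    'pop':     [1262.64, 1056.58, 282.16, 1386.40, 1338.66, 325.15],
    'gdp':     [1211.35, 468.39, 10252.35, 12143.49, 2652.55, 19485.39],
})

# country and land depend only on code: their own table, de-duplicated
countries = mixed_table[['code', 'country', 'land']].drop_duplicates()

# pop and gdp depend on both year and code
indicators = mixed_table[['year', 'code', 'pop', 'gdp']]
```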
The resulting two tables of countries and indicators are now a tidy data set. We will explore these tools further in the exercises at the end of the chapter.
9.3.5 Reading Questions
In the country example, explain the 3 and 5 that appear in the lambda expression. Illustrate with an example.
Give an example of a way to report dates numerically, but where sorting gives a result different from chronological order.
Give an example of a real-world seasonal trend you might want to study by grouping data by month.
In the un-tidy example of indicators (with three rows for each mapping from a country to its indicators), explain why it is a bad idea to have a single column with three types of separate measures. Rather than speaking in generalities, give a specific example of something that could easily go wrong in such a representation.
Often in scientific inquiry, it is valuable to think about the units of various quantities, e.g., distance being measured in miles. Please write a sentence about how the consideration of units can be useful in determining if a given column satisfies TidyData1.
Consider the example of Population by Year, where normalization resulted in an index of ['year','code']. If the index had been set as ['code','year'] instead, what would the result after sorting look like?
In the example Population and GDP by Year, the data was normalized with a melt followed by a pivot, and alternatives were given after. Could you have normalized it by pivoting first and then doing a melt? Why or why not?
In the Top Baby Names example, what is the difference between pd.concat([topfemale, topmale], axis=0) and pd.concat([topmale, topfemale], axis=0)?
In the countries example (with pop, gdp, country, land), is any information lost with the projections countries = mixed_table[['code', 'country', 'land']] and indicators = mixed_table[['code', 'pop', 'gdp']]?
Describe the relationship between countries, indicators, and mixed_table using the language of joins.
How do you think the drop_duplicates() command is implemented?
9.4 Recognizing Messy Data
This section attempts to arm the reader with some tools for recognizing violations of one or more of the structural constraints of tidy data (i.e., TidyData1, TidyData2, and TidyData3). We frame these as red flags—characteristics of the structure and the values of a given data set that may indicate a violation. To employ these red flags and truly assess whether or not a given data set is in tidy data form, we must start by understanding the data, as emphasized at the start of Sect. 9.3.
9.4.1 Focus on Each Column as Exactly One Variable (TidyData1)
The following is a sequence of questions we must ask ourselves.
- If a table has columns whose values are derived from other columns, e.g., GDP per capita.
- If a table has columns where the column labels incorporate multiple parts. The examples in the chapter include cases where there are mashups of a variable name and the value of another variable.
- If a table has column labels that are values of a variable. For example:
  - Column labels that are the values of years, months, or days.
  - Column labels that are the various categories of a single categorical variable; we could imagine topnames with a column for Female and another column for Male.
  - In a table with a single independent variable, the values of the independent variable are the column labels; consider the country codes in indicators that could be structured as the column labels, or the names of patients in a treatment data set.
- If the values in a column can be seen as a mashup, possibly using a hyphen (-) or slash (/) or comma (,) to divide parts of the mashup.
- If we have dates with multiple parts (year, month, and day) as strings in a single column. Note that a DateTime object in a column is allowed, as it allows proper sorting and GroupBy in the various dimensions of the time series (by year, month, and day of month).
- If it makes sense and we are inclined to aggregate values in more than one column for a single aggregate value, then this could indicate that multiple columns are, in fact, values for a single variable.
- If it does not make sense to aggregate in a column. This is particularly clear if the units of the values are different, but like units can still exhibit this red flag. Consider the open, close, high, and low stock price in a table of stocks. Taking an average of these values makes no sense, and so a single column of stock prices and another column of high/low/open/close would violate tidy data. Another way to see this red flag is if we are treating as categorical something that does not give a partitioning but is really a set of independent measures.
- If we are not treating as a variable something that gives a categorical partitioning of the data (e.g., Male/Female).
9.4.2 Focus on Each Row Giving Exactly One Mapping (TidyData2)
The following is a sequence of questions we must ask ourselves.
- If it makes sense to aggregate across a row, it may mean that the values are somehow compatible and could be values of the same variable, instead of a given row having values of different variables. This could be a subset of the values in a row, as opposed to the full row. In this case, the row labels would be the names of variables.
- If we can write down the functional dependencies and can see that a group of rows is needed to capture one mapping instance.
- If the name of a dependent variable is a value in a single column. Our “indicator” meta-variable with “values” of 'pop', 'gdp', 'cell', and 'life' is an example where it would take multiple rows to capture a single logical mapping.
- If the name of a column is too generic, like “value” or “variable.” This red flag often occurs at the same time as the previous point.
- The use of missing data (or some other encoding) to mean that there is not a value for this variable in this row because it is not applicable.
- Data rows that are dependent on other rows, like rows that are aggregations for a subset of the rows.
9.4.3 Focus on Each Table Representing One Data Set (TidyData3)
The following is a sequence of questions we must ask ourselves.
- If there are multiple tables and the names of the tables are values of a categorical variable, like we had with a Female and a Male table for the topnames data set.
- If there are multiple tables and the names of the tables name different dependent variables, particularly if they are dependent on the same independent variables. For instance, the indicators data set might have one table for population and another for GDP, with both being dependent on the independent variables of the year and country code. In this case, if we write down the functional dependencies, the different tables would have the same left-hand side and differ in the right-hand sides.
- If, when we understand the data and write down the functional dependencies, there are different functional dependencies with one having a subset of the variables on the left-hand side of another. Our example in the book had code, year -> pop, gdp but code -> country, land.
- If there is repeated data and, upon examination, the redundant data is because, in a row (instance mapping), the repeated data is dependent on less than the full set of independent variables.
9.4.4 Reading Questions
The first red flag for TidyData1 states “If a table has columns whose value is derived by other columns.” Does this description always lead to a violation of TidyData1? Illustrate with examples. This question hinges on how you interpret “derived,” so please be clear in what this term means.
Give an example, different from any in the text, of a table where the column labels incorporate multiple parts.
Give an example, different from any in the text, of a table where we might be inclined to aggregate values in more than one column for a single aggregate value.
Give an example, different from any in the text, of a table with a numerical column that it does not make sense to aggregate, e.g., because it contains values in different units. Your example should involve “treating as categorical something that does not give a partitioning but is really a set of independent measures.”
The last red flag listed for TidyData1 says “Not treating as a variable something that gives a categorical partitioning of the data (e.g., Male/Female).” Suppose you had a data set containing salary data for individuals of different hair colors. What would it look like to represent this data set in an un-tidy way, leading to this red flag? What would it look like to represent this data in a tidy way?
Give an example, different from any in the text, of a table where it makes sense to aggregate across a row, and give a specific TidyData assumption your example fails to satisfy.
Give an example, different from any in the text, of a table where a group of rows is needed to capture one mapping instance of the functional dependency.
Give an example, different from any in the text, of a table where the name of a dependent variable is a value in a single column.
Give an example, different from any in the text, of a table where missing data is used to mean that a variable is not applicable to a given row.
Referring to the language of the TidyData assumptions, why is it a violation to have “Data rows that are dependent on other rows, like rows that are aggregations for a subset of the rows”?
Give an example, different from any in the text, of a table where “there is repeated data and, upon examination, the redundant data is because, in a row (instance mapping), the repeated data is dependent on less than the full set of independent variables.”
9.4.5 Exercises
The data set members.csv has a mashup column containing both first and last names. Download this data from the book web page, read it into a data frame, and correct this mashup column.
The data set members.csv has a date column. Correct this as shown in the reading, by splitting it into three columns.
Now correct the date column in members.csv by creating a column of DateTime types.
The data set members.csv has another mashup column. Find it and correct it.
Read us_rent_income.csv from the book web page into a data frame, and then look at the data. Is it in tidy data form? Explain your answer. Note: this data set contains estimated income and rent in each state, as well as the margin of error for each of these quantities. This data came from the US Census.
Explore the data provided as us_rent_income.csv, and then draw the functional dependency.
Read us_rent_income.csv into a data frame (with “GEOID” as the index), and then transform as needed to make it tidy. Store the result as df_rent.
For each of the next four exercises, assess the provided data against our Tabular-Tidy data model constraints. If the data conforms, explain. If the data does not conform, explain why and describe the transformation(s) needed to bring the data set into conformance. Be clear about the columns that would be in the resultant data set.
Below is some (fake) data about the connection between religion and salary. This was inspired by real research done by the Pew Research Center.
| Religion | Under 20K | 20–40K | 40K–75K | Over 75K |
|---|---|---|---|---|
| Agnostic | 62 | 140 | 173 | 132 |
| Buddhist | 49 | 68 | 84 | 61 |
| Catholic | 1025 | 1432 | 1328 | 1113 |
| Protestant | 1413 | 2024 | 1661 | 1476 |
| Jewish | 39 | 50 | 39 | 91 |
Is it tidy? Assess.
Below is (fake) weather data, giving the minimum and maximum temperature recorded at each station on each day.
| Date | Element | Station | Value |
|---|---|---|---|
| 01/01/2016 | tempmin | TX1051 | 15.1 |
| 01/01/2016 | tempmax | TX1051 | 26.9 |
| 01/01/2016 | tempmin | TX1052 | 14.8 |
| 01/01/2016 | tempmax | TX1052 | 27.5 |
| 01/02/2016 | tempmin | TX1051 | 13.9 |
| 01/02/2016 | tempmax | TX1051 | 26.4 |
| 01/02/2016 | tempmin | TX1052 | 14.6 |
| 01/02/2016 | tempmax | TX1052 | 30.7 |
Is it tidy? Assess.
The following table gives public transit weekday passenger counts, in thousands.
| Mode | 2007–01 | 2007–02 | 2007–03 | 2007–04 | 2007–05 | 2007–06 |
|---|---|---|---|---|---|---|
| Boat | 4 | 3.6 | 40 | 4.3 | 4.9 | 5.8 |
| Bus | 335.81 | 338.65 | 339.86 | 352.16 | 354.37 | 350.54 |
| Commuter Rail | 142.2 | 138.5 | 137.7 | 139.5 | 139 | 143 |
| Heavy Rail | 435.29 | 448.71 | 458.58 | 472.21 | 474.57 | 477.32 |
| Light Rail | 227.231 | 240.22 | 241.44 | 255.57 | 248.62 | 246.10 |
Is it tidy? Assess.
Here is some (fake) stock price data by date.
| Date | Apple Stock Price | Yahoo Stock Price | Google Stock Price |
|---|---|---|---|
| 2010-04-01 | 177.25 | 179.10 | 178.62 |
| 2010-04-02 | 176.11 | 175.39 | 175.84 |
Is it tidy? Assess.
Similarly to the above, another way that stock data is commonly presented is with one row per company and one column per month. Would that form be tidy? If not, what transformations would be needed to fix it? Justify your answer.
Explore the data from the previous exercise, and then draw the functional dependency. Note: this data came from a survey of individuals about their religion and their income.
Read the data from the previous exercise into a data frame, and then transform as needed to make it tidy.
Download the World Health Organization’s “Global Tuberculosis Report” data set as a CSV file from
Then, wrangle this data into a tidy data frame. Hint: you will need to study the data dictionary carefully.