Upon completion of this chapter, you should understand the following:
Mutation operations to update, add, and delete data.
Operations to group and aggregate data.
The need for operations specific to handling missing data.
You should also be able to do the following:
Perform aggregations on single and multiple Series.
Aggregate and summarize information on group-based partitions of data frames.
Combine and update data frames.
Detect and mutate data frames to handle missing data.
Carry out these operations in pandas.
8.1 Aggregating and Grouping Data
In many analyses as well as in data exploration, we need to summarize data from a data frame in various ways. We may, for a single column, want to compute individual values that help characterize the data in that column, by finding its mean, median, max, min, and so forth; this is called aggregating. Sometimes we want a single aggregate for a column, but sometimes we want to compute multiple aggregates for a single column. We can extend this: if we have a data frame with multiple columns of numeric data, we may want to characterize those multiple columns in a single operation, perhaps finding the mean for one column and the max and min for another column.
When the variables of a data frame allow us to define different row subsets of the data, we often want to compute aggregates for one or more of these subsets. This can allow us to focus on and characterize a single subset, or it can allow us to draw comparisons between the different subsets. An example is grouping countries by their regions of the world and computing aggregates over different regions. This kind of grouping often relies on the independent variables of the data frame, or values of other columns in the data frame, like columns containing categorical variables.
The collection of operations that support such aggregation and grouping are the subject of the current subsection.
8.1.1 Aggregating Single Series
The simplest form of aggregation occurs when we have a single Series, often a single column projection from a DataFrame, and want to compute one or more functions over the values in the Series. The pandas Series class defines methods that operate on a Series, and these include most of the aggregation functions one might need. You will notice in the descriptions that aggregation functions generally omit missing values (values with a NaN entry in the DataFrame) from their computation.
Aggregation functions/methods
Method | Description |
---|---|
mean() | Arithmetic mean, not including missing values |
median() | Value occurring halfway through the population, omitting missing values |
sum() | Arithmetic sum of the non-missing values |
min() | Smallest value in the set |
max() | Largest value in the set |
nunique() | Number of unique values in the set |
size() | Size of the set, including missing values |
count() | Number of non-missing values in the set |
Many of these methods share a name with Python built-in functions, so some aggregations may be invoked by passing the Series as the argument to the built-in function, instead of invoking a method (with no argument) on a Series object. For example, if s is a Series of integers, then max(s) and s.max() both compute the maximum value in s.
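A quick sketch of the two invocation styles:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])
print(max(s))    # built-in function applied to the Series: 5
print(s.max())   # equivalent Series method: 5
```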
Aggregation-to-Series methods
Method | Description |
---|---|
unique() | Construct the collection of unique values from the source Series |
value_counts() | Construct a Series of integers capturing the number of times each unique value is found in a source Series; the Index of the new Series consists of the unique values from the source Series |
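Continuing the small example above, a sketch of both methods:

```python
print(s.unique())        # the distinct values, in order of appearance
print(s.value_counts())  # occurrence count per distinct value, indexed by value
```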
8.1.2 Aggregating a Data Frame
The left side of the figure depicts an original data frame with four columns, A, B, C, and D. We assume that column A is a level of an index, or a column with a categorical variable, and, by color and shading, we see three different “categories” of values for A. Let us assume that columns B and C are numerical and are the columns whose data we want to characterize through aggregation. Column D might be a string value and, by assumption, is not of interest in our aggregation. For example, suppose the source frame is our indicators data set, column A is the index (code), column B is the life expectancy column, and column C is the cell subscriber column.
The goal then is, starting with a source data frame, to perform an operation on the frame and get a result that is composed of (multiple) aggregations, e.g., the minimum life expectancy and median number of cell subscribers.
On the right side of the figure, column A is gone because we have aggregated over it. We no longer have one row per country, but now instead have one row per aggregation function. Figure 8.1 shows the general case, where, for any given column, we might apply multiple different aggregations. We show the result as a table in its own right, with columns corresponding to the columns from the original (B and C), and rows corresponding to different aggregation functions (f1 and f2, in this case). When, for any given column, we only have a single aggregation function, an alternative view of the result is as a vector, where the elements contain the particular aggregation for each of the requested columns. We label these as B′ and C′ to emphasize that they are different, by their aggregation, from the original columns.
If f1 were the min aggregation and f2 were the median aggregation, the entries in the result would allow us to look up, by the original column and aggregation function, each of the individual results. Suppose further that we wanted the min and median for the life column, but only the median for the cell column. The figure would still be a valid representation of the result, but at column cell and row min, there would be a missing value (NaN), because we did not request to compute the minimum number of cell phone subscribers.
In pandas, we use the agg() method of a DataFrame to perform this general form of the aggregation operation. The argument to agg() must convey, for each column of the original DataFrame, what aggregation function or functions to perform. One way of doing this is with a dictionary, where we map from a column name to a function or string function name, or to a list of specified functions.
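For example, a sketch of the dictionary form, assuming the indicators frame with numeric columns life and cell as described above; with a single function per column, the result is a Series:

```python
# Map each column to a single aggregation function
result = indicators.agg({'life': 'min', 'cell': 'median'})
```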
The Index of the Series reflects the names of the columns being aggregated, but the information on what aggregation is performed is not part of the result.
This is like the prior example: since there is only a single aggregation function per column, the result is a Series containing the results. Also notice that pandas will, whenever possible, perform the aggregation over all columns. Here, the max() aggregation, when applied to the string column of country names, results in the alphabetically last country name of the collection.
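A sketch of such a whole-frame aggregation:

```python
# max of every column; for the string column of country names this yields
# the alphabetically last name
print(indicators.max())
```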
8.1.3 Aggregating Selected Rows
Often, we wish to focus on a subset of a data frame and then characterize, through aggregation, the data within that subset. At one level, this is a composition of operations we have already seen:
- 1.
Create a subset by selecting just the rows in the data frame that form the subset, yielding a new data frame.
- 2.
Perform aggregation on the new subset data frame.
This “subset and aggregation” is also a critical piece in understanding the more general group partitioning and aggregation that is to follow, and so we will look at this limited example first.
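A sketch of these two steps, under the assumption that topnames0 is the topnames data with a default integer index and columns year, sex, name, and count:

```python
# Step 1: select the rows forming the subset (here, the Female rows)
female_rows = topnames0[topnames0['sex'] == 'Female']

# Step 2: aggregate; the double brackets project a one-column DataFrame
result = female_rows[['count']].agg(['max', 'median', 'nunique'])
```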
The result is a DataFrame with a single column, and the labels of the rows are the three aggregation functions.
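If we instead aggregate the single-column projection, which is a Series, a sketch:

```python
# A Series aggregated by a list of functions yields a Series
result2 = female_rows['count'].agg(['max', 'median', 'nunique'])
```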
The result is a Series, with the row labels denoting the requested aggregations.
Sometimes when we wish to define a row selection, it is based on part of the row Index, not based on a regular column. This can occur when we have a multi-level index. Consider our topnames DataFrame. We may want a selection of rows that consists of just one of the sex values, like just the Female rows, which would be a subset that includes half the rows of the original DataFrame. Or, we might be interested in a row selection of a single year, say 1961, in which case the subset would consist of exactly two rows.
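A sketch, assuming topnames carries a two-level row index of year and sex:

```python
female = topnames.xs('Female', level='sex')   # all Female rows
mary = female[female['name'] == 'Mary']       # years when 'Mary' was the top name
print(mary['count'].mean(), female['count'].mean())
```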
This allows us, in this case, to compare the application counts for when 'Mary' was the top baby name against the population of 'Female' entries overall.
8.1.4 General Partitioning and GroupBy
We call this partitioning to draw on the mathematical meaning: each and every row from the original is selected into exactly one of the sets. Following the figure, from left to right, we start on the left with a single data frame. The partitioning operation occurs, called a groupby operation (pronounced “group by”), yielding, logically, a set of data frames, each called a partition. Each of these partitions then has an aggregation performed upon it, and so, from each partition, we obtain a single row of values. Finally, the set of individual rows are combined to build a single data frame result. For our gender example, we would end up with three data frames after the groupby operation, one for each of “Male,” “Female,” and “Non-Binary.” The aggregation could then compute, for example, the number in each group, so that the resulting data frame would have one column (“number”) and three rows (“Male,” “Female,” and “Non-Binary”).
In general, for the aggregation, we need to yield a result that is a single row per partition. Each row contains a column corresponding to the column (or columns) that we use for the partitioning, known as the groupby column. By the definition of how the partitioning works, each discrete value in the groupby column yields exactly one partition and then exactly one row after the aggregation. So the values in the groupby column are unique, and in pandas, they form a row label index for the final combined result. In Fig. 8.3, A is the groupby column and so the unique values in A form the row labels in the final combined data frame.
For the rest of the columns in Fig. 8.3 that result from the aggregation, we use subscripts to denote that these are some aggregation function over the elements of some particular column and are not the same as the column itself. In the figure, we depict two different aggregations that are based on the B column, and one aggregation that is based on the C column. This would be like if, in our gender example, we originally had columns B and C for the salary and number of years of education and then in the aggregation computed the median salary and average number of years of education, for each of the three gender groups.
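A sketch of that gender example, where df, gender, salary, and years_ed are all assumed names:

```python
# One row per gender group; median of one column, mean of another
result = df.groupby('gender').agg({'salary': 'median', 'years_ed': 'mean'})
```

Turning to the topnames data, the partition counts described next can be obtained with a sketch like:

```python
groupby_name = topnames0.groupby('name')
groupby_sex = topnames0.groupby('sex')
groupby_year = topnames0.groupby('year')
print(len(groupby_name), len(groupby_sex), len(groupby_year))
```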
We see that partitioning by name yields 18 partition groups, which is the same as the number of unique names from the data frame. When we group by sex, we, as expected, get 2 partition groups, and when we group by year, we get 139 partition groups, one for each year in the data set.
To perform the aggregation step, we use the agg() method, defined for GroupBy objects in a fashion similar to the Series and DataFrame agg() methods. This lets us specify which columns and which aggregation methods to apply to each partition group. The result is the data frame obtained after both the aggregation step (to get a single row per partition group) and the combine step that concatenates these aggregation rows together.
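For instance, a sketch with two aggregations of the count column and one of the name column, partitioned by sex:

```python
result = topnames0.groupby('sex').agg({'count': ['mean', 'max'],
                                       'name': 'nunique'})
```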
The result shows how row and column Index objects are constructed when performing this kind of partition/aggregate/combine operation. The row labels come from the column specified in groupby(), with a unique row per discrete value from the column. The column labels form a two-level index, with the outer level determined by the column being aggregated, and the inner level determined by the set of aggregation functions applied to that column. This allows fully general aggregation over multiple columns and using multiple, different, aggregation functions.
8.1.5 Indicators Grouping Example
World indicators
Code | Country | Region | Income | Pop (millions) | Gdp (billions US$) |
---|---|---|---|---|---|
AGO | Angola | Sub-Saharan Africa | Lower middle income | 29.82 | 122.12 |
ARE | United Arab Emirates | Middle East & North Africa | High income | 9.49 | 382.58 |
ARG | Argentina | Latin America & Caribbean | Upper middle income | 44.04 | 642.70 |
AUS | Australia | East Asia & Pacific | High income | 24.60 | 1330.80 |
AUT | Austria | Europe & Central Asia | High income | 8.80 | 416.84 |
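A sketch of grouping this frame by region and counting the countries in each; the lowercase column names follow the running examples and are an assumption:

```python
region_counts = indicators.groupby('region')['country'].count()
print(region_counts.min(), region_counts.max())
```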
The regions range from just two countries in the “North America” region to 24 countries in the “Europe & Central Asia” region.
8.1.6 Reading Questions
Thinking back to what you know about data analysis, give an example when you might want multiple aggregates of a single column.
Give an example of an aggregate you might use on a column of string data.
At a high level, how do you think the method unique() is implemented?
At a high level, how do you think the method value_counts() is implemented?
In the example that calls max, median, and nunique, what would happen if you called nunique on topnames0 instead of topnames0['name']?
Give an example of a data set of the sort described in the figures, i.e., with columns A, B, C, and D, where A has three levels (and multiple rows of each level). Then, give an example of two possible aggregation functions f1 and f2.
Give an example of a dictionary mapping from column names to lists of functions (rather than function names). Use more than just built-in functions. Hint: you do not need parentheses because you are not calling the functions.
The command indicators.max() applies the function max to every column. How would you change this command to perform the max aggregation for all the columns except the country column? Use the agg method.
Why is it that, “when the condition for the row selection is based on a value of an index level, rather than a value of a column, we cannot use a Boolean Series to obtain our subset”?
Write a sentence to explain why len(groupby_name) is 18, and why len(groupby_year) is 139, in the topnames example.
8.1.7 Exercises
Use agg to return a data set with the max and min counts per year in topnames for males born in the 1960s. Is your data set a DataFrame or a Series, and how do you know?
Write code to add a new column popSize to indicators, which takes value “high” if pop > 1000, “low” if pop < 100, and “medium” otherwise.
Building on the question above, use groupby to partition indicators by this new column popSize.
Building on the question above, use agg to report the average population in each of the three groups.
Use one line of code to sort indicators by pop and gdp. Warning: pay attention to your inplace settings.
Modify the indicators2 example to find the maximum and minimum population and GDP in each of the three income categories.
Modify the indicators2 example to project onto the columns country, pop, gdp, region for all lower middle income countries. Then find the number of countries in each region of the resulting data frame.
This is a pencil and paper exercise. Suppose we have the following data set named df, and our goal is, for column C, to add together values for various row subsets of the data, and for column D, to average values from that column for various row subsets of the data.
A | B | C | D |
---|---|---|---|
foo | one | 0.4 | -1 |
foo | two | -1.5 | 0 |
bar | three | -1.1 | 1 |
bar | two | -0.1 | -1 |
foo | three | -1.0 | 1 |
Show the partitioning (i.e., the sets of rows) that would result if we did a GroupBy on the A column.
Give the result of
- 1.
dropping the A and C columns,
- 2.
grouping by the B column, and
- 3.
specifying aggregation by taking the mean() of the D column,
i.e., the result of
df[['B', 'D']].groupby('B').mean()
On the book web page, you will find a file sat.csv, based on data from www.onlinestatbook.com. This data contains the high school GPA, SAT math score, SAT verbal score, and college GPAs, for a number of students. Read the data into a pandas dataframe dfsat with column headers. Use agg to compute the average of each numeric column in the data set.
On the book web page, you will find a file education.csv, based on data hosted by www.census.gov. Each row is a US metropolitan area, with data on population (row[3]), number unemployed without a high school degree (row[15]), number unemployed with a terminal high school degree (row[29]), number unemployed with some college (row[43]), and number unemployed with a college degree (row[57]). Read the data into a pandas data frame dfedu with column headers. Then use agg to compute the total number of people in each unemployment category and to report the fraction of unemployed people from each category.
The reading showed three ways to group the topnames data but only carried out agg using one of them. Use grouping and aggregation to produce a DataFrame containing the average number of male applications and the average number of female applications.
Similarly to the previous exercise, produce a DataFrame containing the maximum number of applications (regardless of gender) each year.
Similarly to the previous exercise, find the year that had the maximum number of applications (regardless of gender).
To practice with a two-level index, use read_csv to read
8.2 Mutation Operations for a Data Frame
Up to this point, we have learned operations to project columns and select rows from a data frame, as well as partitioning and aggregating data frames, but none of the operations have modified an existing data frame. To achieve our needs for specific domain and analysis goals, and for normalizing data frames, like we do in Chap. 9, we also need to perform such update operations, as summarized previously in Sect. 7.1.
Because pandas is implemented to take advantage of many of the object-oriented facilities in Python, much of the syntax learned in Sect. 7.3 will be used in similar fashion for mutations, often with the specification of data frame rows, columns, and subsets on the left-hand side of an assignment.
In what follows, we will primarily use the indicators data set for each of the examples. Each of the examples will start from the same initial state, with the indicators data frame as shown initially in Table 7.1, with information on five countries from 2017, an index of code, and columns of country, pop, gdp, life, and cell. The first step in each example will be to copy the original data frame, since many mutations will change the data frame directly, and we want to maintain the original.
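For example, a sketch of this first step:

```python
ind2 = indicators.copy()   # work on a copy; keep the original intact
```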
8.2.1 Operations to Delete Columns and Rows
8.2.1.1 Single Column Deletion
When we wish to delete a single column from a DataFrame, the easiest method is to use the Python del statement, using an argument specifying the single column. This is accomplished with the access operator using a column label.
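A sketch, assuming we no longer need the cell column:

```python
ind2 = indicators.copy()
del ind2['cell']   # delete the single column labeled 'cell'
```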
8.2.1.2 Multiple Column Deletion
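To delete multiple columns at once, we use the drop() method; a sketch, with the choice of columns an assumption:

```python
ind2 = indicators.copy()
ind3 = ind2.drop(['life', 'cell'], axis=1)   # drop two columns at once
```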
The axis=1 argument indicates the “dimension” for what is to be dropped, in this case, 1 indicates the column dimension (and an axis of 0 would indicate the row dimension).
8.2.1.3 Row Deletion
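For row deletion, drop() works along axis 0, the default. With a multi-level index like that of topnames, we can name the index level that the labels refer to; a sketch:

```python
# Drop the years 1880 through 1899 from the 'year' level of the index
top2 = topnames.drop(range(1880, 1900), level='year')
```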
So in this example, we dropped all rows in the range of years from 1880 up to 1900 and did not have to specify the sex level.
8.2.2 Operation to Add a Column
One of the most common mutation operations is the addition of a new column to an existing data frame. The values of a new column are normally a Series (or other list-like set of values of the proper length) specified on the right-hand side of an assignment. We can also assign a scalar to a new column, which creates the column as a Series with the scalar value repeated for every row, e.g., adding a column with value “Male” for every row of a data set of male individuals, perhaps in preparation for combining the data set with a different data set of females.
The left-hand side of these assignments uses the same notation we use for a single column projection we learned previously: the name of the data frame followed by the access operator with a column label between the brackets.
So the right-hand side could use the single column projection of one or more existing columns and perform column-vector computations to realize a new Series. This is then assigned, using the column label on the left-hand side to give a name to the new column. In the following example, we compute a Series based on a scalar multiple of an existing Series to get a 15% growth in GDP. We also carry out a vector division of two Series to compute the number of cell phone subscriptions per person in a country.
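A sketch of both computations; the new column names growth and cellper are assumptions:

```python
ind2 = indicators.copy()
ind2['growth'] = ind2['gdp'] * 1.15           # scalar multiple: 15% GDP growth
ind2['cellper'] = ind2['cell'] / ind2['pop']  # element-wise vector division
```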
When pandas uses a list of column names for projection, the result is actually a view or reference to the columns as they reside in their original data frame. For read-only access to the data, this is much more efficient, particularly for large data sets. However, if, after projecting the columns, the result is then updated, perhaps by adding a new column, then pandas does not know if the programmer’s intent is to add a column to the subset data frame, or to add a column to the original data frame. The reader may be familiar with this phenomenon from a previous study of lists in Python, as an analogous issue motivates the use of the copy( ) function.
For example, when we set country_cell = indicators[['country', 'cell']], the new data frame country_cell remains tied to indicators, so that changes in indicators can cause changes in country_cell. This means that if we later add a column to country_cell, pandas cannot tell whether or not we also intend to add the same column to indicators, and it will issue a warning stating "A value is trying to be set on a copy of a slice from a DataFrame."
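The remedy, as the text suggests via copy(), is to make the copy explicit before mutating; cell2 below is a hypothetical column for illustration:

```python
country_cell = indicators[['country', 'cell']].copy()
country_cell['cell2'] = country_cell['cell'] * 2   # mutates only the copy
```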
8.2.3 Updating Columns
Instead of creating an entirely new column in a data frame, we often want to change values in an existing column.
8.2.3.1 Update Entire Column
Like creating new columns, the right-hand side of the assignment for a column update operation can utilize all the column projection and column-vector computation operations we have learned previously.
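For instance, a sketch that overwrites an existing column in place; the unit conversion is an assumption for illustration:

```python
ind2 = indicators.copy()
ind2['pop'] = ind2['pop'] * 1000000   # convert pop from millions to persons
```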
8.2.3.2 Selective Column Assignment
When we wish to change a subset of the values of an existing column, we need a mechanism to specify which row positions should be updated. In pandas, we use the same mechanism that we use to select row positions in a filter—the Boolean vector, where values in the vector of True mean that the column’s row position should be updated, and values in the vector of False mean that the column’s row position should be left at its current value.
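A sketch, adding 5 to the life expectancy for just two countries, per the example described next:

```python
ind2 = indicators.copy()
updates = ind2.index.isin(['CHN', 'USA'])   # Boolean vector: True rows get updated
ind2.loc[updates, 'life'] = ind2.loc[updates, 'life'] + 5
```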
We get the desired result where just 'CHN' and 'USA' rows of the column are updated. From our understanding of .loc and .iloc in Sect. 7.3.5, we can generalize our selective update and use .loc and .iloc and specify rows by explicit list and by slice as well as by Boolean Series.
8.2.4 Reading Questions
Why is it important to maintain a copy of the original data frame? Give an example of a previous time you needed to retain a copy of the original data.
Does the drop() method have the same functionality as iterated calls to del?
Give a real-world example where you would use drop() with axis=0.
After popping 'code' from ind2 into a variable code_series, how could we create a new column in the data frame with the same values as code_series? In other words, how could we put 'code' back into the data frame?
Check out the online documentation for the insert() command in pandas, and then write down a line of code that would insert a column named 'code', with values code_series, in position index 2, into ind2.
Can you use apply on a whole data frame instead of just a single column? Explain.
The reading highlighted the importance of the copy() method when working with a subset of the columns, as in country_cell = indicators[['country', 'cell']].copy(). Give another example of a time in your life when failing to work with a copy could confuse Python about where you meant to change something.
Run the following code, and then explain precisely why Python gets confused.
The reading shows how to update the 'CHN' and 'USA' rows of ind2 to add 5 to the values in the life expectancy column. Show how to do the same thing using iloc.
Read the online documentation about sort_values() in pandas, and then show how to mutate the ind2 data frame to sort it by the values in the life expectancy column.
8.2.5 Exercises
On the book web page, you will find members.csv, with (fake) information on a number of individuals in Ohio. We will use this for the next several exercises. Read this data set into a pandas DataFrame using read_csv. Name it members0, and do not include an index.
Repeat the above, but now do include an index, by specifying index_col in the constructor. Name your DataFrame members.
Write a projection that will isolate just the 'Phone' column into a variable phone_series. What type are the values in this column?
Write a selection that will isolate just those rows where the phone number starts with a 614 area code. Selection involves picking which rows will be shown.
In order to complete this:
- 1.
Write a lambda function that will, given a string, isolate the first three characters and compare these to '614'.
- 2.
Apply the lambda function to the column you isolated in the last question.
- 3.
Use the result as an index to the pandas DataFrame.
The next exercise asks you to split the Name and Address columns of the members data frame into separate columns. To complete this:
- 1.
Write a lambda function that will, given a string, split on space and select only the first element in the resultant list.
- 2.
Apply the lambda function to the Name column, and save the result.
- 3.
Access the data frame using a new column name (FName), and assign the result from step 2.
- 4.
Acquire the last name from the Name column and create a new column in the data frame called LName.
- 5.
Acquire the city from the Address column and create a new column in the data frame called City.
- 6.
Acquire the state from the Address column and create a new column in the data frame called State.
- 7.
Drop the original “Name” and “Address” columns.
The next exercise returns to the file sat.csv from the book web page, introduced in the exercises of Sect. 8.1. For this exercise:
- 1.
Read the data into a pandas data frame dfsat with column headers.
- 2.
Add a new column 'total_SAT' containing the total SAT score for each student, calling the resulting data frame dfsat2.
- 3.
Add a new column 'GPA_change' with the change in GPA from high school to college, calling the resulting data frame dfsat3.
- 4.
Update the column 'total_SAT' to scale it so that the max score is 2400 instead of 1600 (i.e., multiply each entry by 1.5).
- 5.
Delete the column 'total_SAT', calling the resulting data frame dfsat5.
Building on the exercise above, create a new column that partitions students by their high school GPA, into groups from 2.0 to 2.5, 2.5–3.0, 3.0–3.5, and 3.5–4.0, such that each group includes its upper bound but not its lower bound. Then use grouping and aggregation to compute the average of each numeric column in the data set.
On the book web page, you will find a file education.csv, based on data hosted by www.census.gov. Each row is a U.S. metropolitan area, with data on population (row[3]), number unemployed without a high school degree (row[15]), number unemployed with a terminal high school degree (row[29]), number unemployed with some college (row[43]), and number unemployed with a college degree (row[57]). Read the data into a pandas data frame dfedu with column headers. Create a new column 'total_unemployed' with the total number of unemployed people in each metro area (i.e., the sum of the columns just listed).
Building on the exercise above, use a regular expression to extract the state (e.g., 'TX') from the geography column, and add state as a new column in your data frame. Then use grouping and aggregation to compute the total number of people in each unemployment category, in each state.
Building on the exercises above, enrich your data frame with a column containing the population of each state and then use grouping and aggregation to compute the per capita total unemployment in each state.
8.3 Combining Tables
The operations associated with combining tables will, in pandas, use two source DataFrame objects with a goal of a single resulting DataFrame that has some union or intersection of the rows, columns, and values from the source data frames.
The need for combining tables comes up most often when:
We have one data set with a particular set of columns and need to append by adding additional rows, but with the same set of columns. This could be indicative of a “growing” data set, perhaps along a time dimension, where the variables remain the same, but additional rows mapping the independent variables to the dependent variables are to be appended.
We have two data sets with, logically, the same set of rows, but with different columns, and want to create a data set with those same rows and the columns from both the source data sets. Here, we are adding new variables with their corresponding values to the data set, e.g., to enrich an old data set with more information on each individual.
We have two tables that are related to each other through the values of a particular column or index level that already exist in both tables. Here we want to combine the tables so that the values of columns of the source tables are combined, but constrained to the matching values. We saw an example, in Chap. 6, of combining tables on classes and students to get information about which students are in which class sections.
We will deal with each of these cases in turn, showing a representative example, and limiting ourselves to two-table combinations. There are many variations possible when combining tables that are DataFrame objects. One source of variation comes about depending on whether or not we have a “meaningful” index for row labels or column labels. By this we mean, for instance, that for a row index, the index and its levels capture what makes rows unique vis-a-vis the independent variables of the data set.
In Chap. 12, we will examine more deeply some of the same table combination concepts in the context of the relational model.
8.3.1 Concatenating Data Frames Along the Row Dimension
For the examples and use cases we illustrate here, we make the assumption that, for the tables being combined, the source data frames represent different information. When combining along the row dimension, this implies that there is no intersection of value combinations of the independent variables (i.e., the same mapping is not present in both data frames). Because pandas is designed to be general purpose for more than data frames that conform to the tidy data standard, the operations to combine tables have many additional arguments that can perform, for instance, union or intersection of columns when this assumption is not met. But these variations will not be covered here.
When we want to concatenate along the row dimension, we are referring to the situation where we have two source data frames, each with the same set of columns, but, at least logically, rows that are non-intersecting between the two source data frames.
8.3.1.1 Meaningful Row Index
In normalized tidy data, a row index is meaningful if the index is composed of the independent variable(s) of the data set. So if a data set has one independent variable, the value of that variable is unique per row, and the remaining columns give the values of the dependent variables. For instance, the code variable is independent for the indicators data set, and the values of this variable form the row label index. In the topnames data set, the year and sex variables are independent, and the combinations of year and sex give the row labels for the data frame.
How the two source data frames are obtained or constructed does not matter; what matters is that they have the same columns, different rows, and a meaningful index whose row labels give the independent variable, code.
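A sketch, assuming indicators1 and indicators2 are two such frames, each indexed by code, with the same columns and disjoint rows:

```python
combined = pd.concat([indicators1, indicators2])
```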
The order of the result is based simply on the order of the data frames in the list, and no sorting occurs.
8.3.1.2 Meaningful Index with Levels
If we want to combine these into a single, tidy data frame, we are, in essence, adding an independent variable of year, so that each row in the result is uniquely identified by the combination of a year and a code.
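A sketch, where ind2017 and ind2018 are assumed single-year frames, each indexed by code; the keys= argument supplies the new outer year level:

```python
combined = pd.concat([ind2017, ind2018],
                     keys=[2017, 2018], names=['year', 'code'])
```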
8.3.1.3 No Meaningful Index
As we manipulate data frames, even ones that satisfy the structural constraints of tidy data, we often process through cycles where the index of a data frame has been reset, or, on first construction, has only the default integer index for the row labels. When we combine such tables through concat(), we need to understand what happens.
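A self-contained sketch of the default behavior and of the ignore_index= remedy:

```python
import pandas as pd

df_a = pd.DataFrame({'x': [1, 2]})
df_b = pd.DataFrame({'x': [3, 4]})
print(pd.concat([df_a, df_b]))                     # row labels 0, 1, 0, 1
print(pd.concat([df_a, df_b], ignore_index=True))  # row labels 0, 1, 2, 3
```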
8.3.2 Concatenating Data Frames Along the Column Dimension
For combining data frames along the column dimension, we make assumptions similar to those made in Sect. 8.3.1. When we are using tidy data and we want to combine tables, we assume that the two source data frames have the same rows and an entirely different set of columns. In tidy data, the stipulation of the same rows means that the values of the independent variables are the same, and here we assume that they are incorporated into a meaningful index, and the stipulation of different columns means the data frames have different dependent variables. Since we use the names of variables as the column labels, we rarely have columns whose Index is generated by pandas.
8.3.2.1 Single Level Row Index and New Columns
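A sketch, assuming cols1 and cols2 are frames with the same meaningful row index (say, code) and disjoint sets of columns:

```python
combined_cols = pd.concat([cols1, cols2], axis=1)
```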
8.3.2.2 Introducing a Column Level
Having discussed data frames with a two-level row index, we now briefly explain the concept of a two-level column index, and how to create one during a concatenation, again using the keys= named parameter. We will also use this as an opportunity to introduce an unfortunate common form of untidy data, where columns represent years. We will see in Chap. 9 how to wrangle such data into tidy form, with a column called year and years represented in the rows where they belong.
While the tables females and males are not tidy, they do satisfy the initial assumptions of this section that the two tables have the same rows (name and count) and the columns have all logically different data, even if some of the column names are the same.
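A sketch of the concatenation described next, with females and males as in the text:

```python
bysex = pd.concat([females, males], axis=1,
                  keys=['Female', 'Male'], names=['sex', None])
```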
The use of the argument axis=1 tells pandas that both the concatenation and the index setting are occurring in the column direction. The name for the new column level is sex. In Chap. 9, we will see how to extract data out of a column index and reshape the data into tidy form.
8.3.3 Joining/Merging Data Frames
When combining data from two frames becomes more complicated than concatenation along the row or column dimension, as with the example of combining data on students with data on classes (Fig. 5.1), we need a more powerful tool. The pandas package provides two variations, depending on the frames we are combining, and whether or not we match up rows of the two tables based on their row label index, or on values from a regular column.
One variation, join(), is a DataFrame method and uses the row label index for matching rows between the source frames. The other variation, merge(), is a function of the package and can use any column (or index level) for matching rows between the source frames. The combined frame has columns from both the original source frames with values populated based on the matching rows. If the two frames have columns with the same column name, the join/merge will include both, with the column names modified to distinguish the source frame of the overlapping column. If we had three pandas DataFrames representing the three tables of Fig. 5.1, we could join/merge schedule and classes along the column classRegNum to get a single table with the six distinct columns contained in these two tables.
8.3.3.1 Using Index Level
Suppose we have a one-level index, the code, in the indicators data set, and we have one data frame, join1, that has columns for imports and exports, and another data frame, join2, that has columns for the country name, country, and the land area of the country, land. These two data frames might, in fact, come from different sources and might have code row labels that are not identical.
Both have the same independent variable, code, and so we could want to combine the tables so that matching codes yield rows with the union of the dependent variable columns, e.g., GBR would now have data on imports, exports, the country name, and the land mass. It is essential to note that the code values between the tables may not all match up. In this case, join1 has a row for CAN that is not in join2, and join2 has rows for BEL and VNM that are not in join1. There are several choices for what to do with the partial data present for these three rows, and the choice is made based on the analysis one desires to carry out.
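A sketch of the left join, which is the default for the join() method:

```python
left_result = join1.join(join2)   # how='left' by default
```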
So we see the appropriate values for the matched rows, and missing values where the right-hand frame did not have a match to correspond to the left frame. In this situation, we think of the left frame as being dominant, and taking whatever values it can from the right frame, to enrich rows present in the left frame.
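Requesting the intersection instead:

```python
inner_result = join1.join(join2, how='inner')
```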
The result includes just the two rows where code is in common between the two frames, namely GBR and USA. In this case, we think of the new table as sitting between the two old tables, and drawing from both sides, whenever the same code is present in both.
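One way to obtain the result described next is to swap the operands, making join2 the dominant (left) frame; a sketch:

```python
swapped = join2.join(join1)   # join2 now contributes all of its rows
```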
Rows BEL and VNM are present in join2 but not in join1, so these are the ones that are augmented with missing values for the join1 columns.
There also exist join types of a right join where the right-hand frame has all its rows included, and we fill in with missing values for unmatched rows from the left-hand frame, and an outer join that includes all rows from both source frames.
Note how, in the result, we have redundant data in the country and land columns. Care needs to be taken so that redundant data does not cause inconsistency problems if data frames are updated, but the possibility exists that such updates might not be applied to all instances of redundant data. In Chap. 9, we discuss how such redundant data is a sign that two different tables have been combined but, by TidyData3, should be separate.
8.3.3.2 Using Specific Columns
Sometimes, we wish to combine two tables based on common values between the two tables, but the values are in a regular column and not part of an index. This could be because the column is, in fact, not an independent variable and would not be part of an index, or it could be because, for the sake of manipulation and transformation, we have a column that could be an index, or a level of an index, and is not. Both cases come up frequently. For example, with indicators, we could reasonably use either the country code or name as an index and (based on the specifics of the table we combine with) might need to combine using whichever is not the index.
Logically, the operation is like the join() we have examined above, but, without the extra knowledge of index levels, we need to call a different method, called merge(), and we need to provide additional information to make the table combination possible.
Consider the two data sets we have been working with in this chapter: topnames has, by year and by sex, the top name and its count of applications for US social security cards. In the full indicators data set, we have, among other things, the population for each of the countries, including the population for the USA. We might want to create a new data frame based primarily on the topnames data set, but where we augment each row with the US population for that year. That could be used, in an analysis, to divide the count of applicants by the population for that year to be able to see what percentage of the population is represented. This could give a more fair comparison as we try to compare between years, because an absolute count does not take into consideration the changes in the population over time.
Note that the units of the pop column are in persons, not in millions of persons, like our earlier examples.
The merge() function has, as its first two arguments, the two DataFrame objects to be combined, where the first argument is considered the “left” and the second argument is considered the “right.” The on= named argument specifies the name of a column expected to exist in both data frames and to be used for matching values. The how= named parameter allows us to specify the logical equivalent of an inner join, a left join, or a right join.
In the following code, we show both a left join/merge and an inner join/merge. Since topnames0 is the “left,” the first will give us all the rows of topnames0 and will add the pop column from matching rows in us_pop. For those rows and years where us_pop does not have population data (i.e., those years before 1960), the merge1 will have NaN indicating missing data.
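A sketch of the two calls, where on='year' is an assumption consistent with the matching just described:

```python
merge1 = pd.merge(topnames0, us_pop, on='year', how='left')
merge2 = pd.merge(topnames0, us_pop, on='year', how='inner')
```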
The merge() function is quite versatile and can be used for almost all the variations of combining tables covered in this section. It can also be used for hybrid situations where the criteria for matching rows between two tables might involve an index or index level from one source frame, and a regular column from the other source frame. See the pandas documentation for all the parameters that make this possible.
8.3.4 Reading Questions
Give a real-world example of two data sets with the same set of rows, but with different columns, that you might want to combine.
Give a real-world example of two tables that are related to each other through the values of a particular column, where you might want to combine the tables based on matching values.
What could go wrong if you try to naively concatenate two data frames along the row dimension, when they do contain rows with the same row index? Give an example.
Consider the code that concatenates indicators1 and indicators2 by row (axis=0). Suppose that, when defining indicators1, we use the code
indicators1 = indicators.loc[['CHN', 'IND', 'USA'], 'life']
instead of
indicators1 = indicators.loc[['CHN', 'IND', 'USA'], :'life']
Explain what happens in combined and why.
What happens if we use concat with ignore_index=True on a data frame that does have a meaningful index, like 'code' in indicators?
Consider the example of concatenating data frames along the column dimension. When you do this with a command like pd.concat([cols1, cols2], axis=1, sort=True), what field does it sort based on?
Give a real-world example, different from the one in the book, when you might want to combine two data sets with the same rows but different columns, along a column level. Does your example represent tidy data? Why or why not?
Is the inner join symmetric? In other words, does it matter if you do join1.join(join2, how="inner") or join2.join(join1, how="inner")?
Do you get the same data by doing join1.join(join2, how="left") versus by doing join2.join(join1, how="right")?
Give a real-world example, different from the one in the book, plus a reason you might want to do an inner join instead of a left or right join.
What happens if you use join when you meant to use merge, i.e., on a column that is not an index?
What happens if you use merge when you meant to use join, i.e., on a column that is an index?
8.3.5 Exercises
On the book web page, you will find two csv files, educationTop.csv and educationBottom.csv, both based on data hosted by www.census.gov. Both have the same columns, and each row is a US metropolitan area, with data on population, education, and unemployment. The first csv file contains metropolitan areas starting with A–K, and the second starting with L–Z. Read both into pandas data frames, using the column GEO.id2 as an index. Concatenate these two data frames along the row dimension (with the top one on top), and call the result educationDF.
On the book web page, you will find two csv files, educationLeft.csv and educationRight.csv, both based on data hosted by www.census.gov. Both have the same rows, and each row is a US metropolitan area, with data on population, education, and unemployment. The first has information on individuals without a college degree, and the second has information on individuals with a college degree. Read both into pandas data frames. Concatenate these two data frames along the column dimension, and call the result educationDF2.
On the book web page, you will find two csv files, educationLeftJ.csv and educationRightJ.csv, both based on data hosted by www.census.gov. In both, rows represent US metropolitan areas, with data on population, education, and unemployment. However, they do not have exactly the same set of rows, and the columns are totally different except for the index column Geography.
- 0.
Read both into pandas data frames, with names educationLeftJ and educationRightJ.
- 1.
Make a copy of educationLeftJ called educationLeftOriginal.
- 2.
Starting with educationLeftJ, do a left join to bring in the data from educationRightJ, storing your answer as dfJ.
- 3.
Starting with educationLeftOriginal, do an inner join to bring in the data from educationRightJ, storing your answer as dfJ2.
- 4.
Now read the original csv files in as eduLeft and eduRight with no meaningful index. Then, starting from eduLeft, do an inner merge along the column Geography, storing your answer as dfJ3.
Finish the application from the end of the section. This means, merge together topnames0 and us_pop, using whichever type of join is appropriate for this application, and then normalize the application counts by pop. Use the resulting data frame to find the highest and lowest percentage counts over the years, and think about why a social scientist might be interested in this kind of information.
8.4 Missing Data Handling
We conclude with a brief discussion of the issues that arise with missing data. As we have seen, both join and merge operations can easily produce missing data. So can operations that create a new column, e.g., life/cell if a row had 0 for cellphones. Lastly, many data sets encountered in the wild are missing data, as the following examples illustrate. Missing data can occur when:
Individuals do not respond to all questions on a survey.
Countries fail to maintain or report all their data to the World Bank.
An organization keeping personnel records does not know where everyone lives.
Laws prevent healthcare providers from disclosing certain types of data.
Different users of social media select different privacy settings, resulting in some individuals having only some of their data publicly visible.
Many other situations analogous to these.
When encountered in the real world, there is no true standard by which data sources denote (or “code”) missing data. For example, a provider of healthcare data may code missing white blood cell counts as 0, knowing that no reader versed in healthcare would ever think that an individual truly had no white blood cells. In this situation, blindly applying an aggregation function (like mean()) could produce a badly wrong result. Even worse, some providers code missing data using a string, like "N/A", "Not Applicable", "Not applic.", "NA", "N.A.", etc. Such a situation can entirely break simple code, e.g., code that reads from a CSV file into a native dictionary of lists and then invokes an aggregation function.
Thankfully, information on how missing data is coded is almost always contained in the metadata associated with the data set, sometimes called a codebook. Whenever investigating a new data source, the reader should seek out this metadata and learn how missing data is coded, before attempting to read the data into native data structures or pandas. If there is no associated metadata, then it is usually possible to figure out how missing data is coded by a combination of the following steps:
Careful inspection of the data in its original format.
Visualizations (e.g., histograms and scatterplots), to look for data values that stand out from the rest.
Consultation with an expert, to learn about conventions in the field, and about which data values look unusual or impossible.
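Once the coding is known, we can tell read_csv() about it up front; a sketch with a hypothetical file name and the string codes mentioned above:

```python
import pandas as pd

# na_values tells pandas which strings should be treated as missing
df = pd.read_csv('survey.csv',   # hypothetical file
                 na_values=['N/A', 'Not Applicable', 'Not applic.', 'N.A.'])
```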
If we prefer to fill in nan values with some specific scalar, this can be done with the fillna option, but we will not need that functionality. It would be appropriate, for example, if we had survey data and one column asked if individuals were willing to be contacted for a follow-up survey. If a survey recipient left that question blank, we could probably safely fill in the answer “No.”
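A sketch of that survey scenario, with follow_up a hypothetical column name:

```python
df['follow_up'] = df['follow_up'].fillna('No')   # blank answers become 'No'
```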
Lastly, it is common for a CSV file to have blank cells for missing data, e.g., the line "Michael","Male",,"35" means the third column is missing. The pandas read_csv() function will automatically read such missing cells in as nan in the relevant row–column location, and hence the methods of this section apply.
8.4.1 Reading Questions
Have you ever analyzed a data set with missing data? If so, how were missing values coded?
Have you ever analyzed a data set with an attached codebook? If so, describe the data set and codebook. What kind of information was in the codebook?
The developers of Python made a design decision that adding a number and nan should yield nan. The developers of pandas made a design decision that methods should ignore nan values. What considerations do you think went into those design decisions?
In the reading, we discussed several options for dealing with missing data, including replacing aggregates by nan when nan data is encountered, ignoring nan data, dropping nan data, and filling in a single scalar value for all nan data encountered. Can you think of other ways of dealing with missing data?
Imagine a scatterplot with a regression line, based on a data set of individuals where you know the age and salary of each. Suppose you want to predict the salary of a 32-year-old, but none of the individuals in your data set were 32. What technique could you use to predict the salary of a 32-year-old based on the other data that you do have? Could an approach like this be used to fill in missing data? Would it always work?
The book web page has a file, indicators.csv, with our indicators data for many countries, from 1960 forward. It has tons of missing data (e.g., because there were no cell phones in the 1960s). Use pandas to read this data into a DataFrame, then illustrate each of the functions from the reading on that DataFrame. Please use the cell column when you illustrate the fillna() function, and remember that there were no cellphones before 1980.