Exploring the tables

Next, let's explore the tables and check the numbers of rows and columns in each:

for k, v in dfs.items():
print(
k + ' - Number of rows: ' + str(v.shape[0]) +
', Number of columns: ' + str(v.shape[1])
)

The output is as follows:

hvbp_tps_11_07_2017.csv -  Number of rows: 2808, Number of columns: 16
hvbp_clinical_care_11_07_2017.csv -  Number of rows: 2808, Number of columns: 28
hvbp_safety_11_07_2017.csv -  Number of rows: 2808, Number of columns: 64
hvbp_efficiency_11_07_2017.csv -  Number of rows: 2808, Number of columns: 14
hvbp_hcahps_11_07_2017.csv -  Number of rows: 2808, Number of columns: 73

For the preceding cell, we used the items() method of dictionaries to iterate over each key-DataFrame pair in the dictionary of DataFrames.

The tables all have the same number of rows. Since each row corresponds to a hospital, it is safe to assume that the hospitals in all of the tables are the same (we will test that assumption shortly).

Any analysis we perform is limited due to the separation of the tables. Since all the hospitals are (assumed to be) the same, we can combine all the columns into one table. We will do that using the merge() function of pandas. Using pandas merge() is akin to using SQL JOIN (you learned about SQL JOIN in Chapter 4Computing Foundations – Databases). The merge is performed by specifying a common ID column present in both of the tables on which the rows will be matched. To see whether there is a common ID column in the five HVBP tables, we can print out the column names of each table:

for v in dfs.values():
for column in v.columns:
print(column)
print('\n')

If you scroll through the results, you'll notice the presence of the Provider Number column in all of the tables. Provider Number is a unique identifier that can be used to link the tables.