Exploring the data rows and columns

Let's explore the data. In the next cell, type the following:

print('Number of rows: ' + str(df.shape[0]))
print('Number of columns: ' + str(df.shape[1]))

The output is as follows:

Number of rows: 6825
Number of columns: 153

In the 2018 file, there should be 6,825 rows and 153 columns. Each row corresponds to a dialysis facility in the United States. Here we used the shape attribute of the DataFrame, which holds a tuple of the form (number of rows, number of columns).
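Because shape is an ordinary tuple, it can also be unpacked into two variables. The following is a minimal sketch that assumes a small stand-in DataFrame (toy_df, a hypothetical name) built from a few of the facility rows shown in this section, rather than the full 2018 file:

```python
import pandas as pd

# Hypothetical stand-in for df; the real DataFrame comes from the 2018 file.
toy_df = pd.DataFrame({
    'Facility Name': ['CHILDRENS HOSPITAL DIALYSIS',
                      'FMC CAPITOL CITY',
                      'GADSDEN DIALYSIS'],
    'CMS Certification Number (CCN)': [12306, 12500, 12501],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy_df.shape
print('Number of rows: ' + str(n_rows))
print('Number of columns: ' + str(n_cols))

# len() on a DataFrame gives the same row count as shape[0].
row_count = len(toy_df)
```

Unpacking is convenient when you need both numbers more than once, since it avoids repeating the index lookups.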

We can also preview what the DataFrame looks like by using the head() method. The head() method takes a parameter, n, that tells it how many rows to return from the top of the DataFrame. In the next cell, type the following and press Play:

print(df.head(n=5))

The output is as follows:

                    Facility Name  CMS Certification Number (CCN)  \
0     CHILDRENS HOSPITAL DIALYSIS                           12306   
1                FMC CAPITOL CITY                           12500   
2                GADSDEN DIALYSIS                           12501   
3  TUSCALOOSA UNIVERSITY DIALYSIS                           12502   
4                  PCD MONTGOMERY                           12505   

...

You should see some of the columns for the first five rows, such as facility name, address, and measure scores. Because the DataFrame is so wide, pandas prints an abbreviated set of columns, taking some from the beginning of the DataFrame and some from the end, separated by an ellipsis.
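If you would rather see every column in the preview, pandas provides a display.max_columns option. The following is a minimal sketch that uses a made-up wide DataFrame (wide_df, with placeholder column names) in place of df. Setting the option to None inside pd.option_context() lifts the column limit only within the with block:

```python
import pandas as pd

# Hypothetical stand-in for df: 2 rows and 30 placeholder columns.
column_names = ['col_{:02d}'.format(i) for i in range(30)]
wide_df = pd.DataFrame([list(range(30)), list(range(30, 60))],
                       columns=column_names)

# Setting display.max_columns to None removes the column limit,
# but only inside this with block; the previous setting is restored after.
with pd.option_context('display.max_columns', None):
    preview = str(wide_df.head(n=2))

print(preview)
```

With 153 columns the full preview is long and wraps across many lines, so the abbreviated default is often easier to read.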

Let's get a complete list of all 153 columns. In the next cell, type the following and press Play:

print(df.columns)

The output is as follows:

Index(['Facility Name', 'CMS Certification Number (CCN)', 'Alternate CCN 1',
       'Address 1', 'Address 2', 'City', 'State', 'Zip Code', 'Network',
       'VAT Catheter Measure Score',
       ...
       'STrR Improvement Measure Rate/Ratio',
       'STrR Improvement Period Numerator',
       'STrR Improvement Period Denominator', 'STrR Measure Score Applied',
       'National Avg STrR Measure Score', 'Total Performance Score',
       'PY2018 Payment Reduction Percentage', 'CMS Certification Date',
       'Ownership as of December 31, 2016', 'Date of Ownership Record Update'],
      dtype='object', length=153)

Here, we use the columns attribute of a DataFrame, which gives us access to the column names as a pandas Index object that behaves much like a list. Unfortunately, pandas again abbreviates the output, so we can't see all 153 columns. To see them all, we need to be more explicit and print each column name using a for loop:

for column in df.columns:
    print(column)

The output is as follows:

Facility Name
CMS Certification Number (CCN)
Alternate CCN 1
Address 1
Address 2
City
State
Zip Code
Network
VAT Catheter Measure Score
...

Now you'll see all 153 column names. Use the scrollbar to skim through all of them. You'll notice that each of the 16 measures has several columns associated with it, and there are also additional columns like demographic data and total performance scores.
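Rather than scrolling, you can also filter the column names in code. The following is a minimal sketch that assumes an empty stand-in DataFrame (toy_df, a hypothetical name) carrying only a handful of the column names shown above. It converts the Index to a plain Python list with tolist() and keeps the names that start with the STrR prefix:

```python
import pandas as pd

# Hypothetical stand-in for df with a few of the real column names.
column_names = [
    'Facility Name',
    'State',
    'VAT Catheter Measure Score',
    'STrR Improvement Measure Rate/Ratio',
    'STrR Improvement Period Numerator',
    'STrR Improvement Period Denominator',
    'Total Performance Score',
]
toy_df = pd.DataFrame(columns=column_names)

# df.columns is a pandas Index; tolist() converts it to a plain Python list.
all_columns = toy_df.columns.tolist()

# Keep only the columns whose names start with the 'STrR' prefix.
strr_columns = [name for name in all_columns if name.startswith('STrR')]
for name in strr_columns:
    print(name)
```

The same pattern works for any other measure, provided its columns share a common prefix.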

Now that we have gained a rough overview of our dataset, we can move on to more in-depth analysis.