Next, we import the fixed-width data file into Python as a pandas DataFrame in which every column is stored as a string, passing the list of column widths (width) created in the previous cell. We then assign the column names from the col_names list:

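df_ed = pd.read_fwf(
    HOME_PATH + 'ED2013',  # path to the fixed-width ED2013 data file
    widths=width,          # field widths, one entry per column
    header=None,           # the file has no header row
    dtype='str'            # read every field as a string
)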

df_ed.columns = col_names
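
To see why the columns are read as strings, here is a minimal sketch on a made-up two-record file. The field layout, the toy_file contents, and the toy_widths list are assumptions for illustration only and are not taken from the ED2013 documentation. It compares the default type inference with dtype=str:

```python
import io
import pandas as pd

# Hypothetical layout: 2-char month, 4-char arrival time, 1-char weekday.
toy_file = "0106473\n0118414\n"
toy_widths = [2, 4, 1]

# Default type inference parses '01' as the integer 1, dropping the leading zero.
df_inferred = pd.read_fwf(io.StringIO(toy_file), widths=toy_widths, header=None)

# dtype=str keeps each field exactly as it is written in the file.
df_str = pd.read_fwf(
    io.StringIO(toy_file), widths=toy_widths, header=None, dtype=str
)
df_str.columns = ["VMONTH", "ARRTIME", "VDAYR"]

print(df_inferred.iloc[0, 0])
print(df_str.loc[0, "VMONTH"], df_str.loc[0, "ARRTIME"])
```

Keeping the codes as strings means they can be compared directly against the zero-padded values listed in the documentation; numeric conversion can be done later, column by column.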

Let's print a preview of our dataset to confirm it was imported correctly:

print(df_ed.head(n=5))

The output should look similar to the following:

  VMONTH VDAYR ARRTIME WAITTIME   LOV  AGE AGER AGEDAYS RESIDNCE SEX ...   \
0     01     3    0647     0033  0058  046    4     -07       01   2 ...    
1     01     3    1841     0109  0150  056    4     -07       01   2 ...    
2     01     3    1333     0084  0198  037    3     -07       01   2 ...    
3     01     3    1401     0159  0276  007    1     -07       01   1 ...    
4     01     4    1947     0114  0248  053    4     -07       01   1 ...    
  
  RX12V3C1 RX12V3C2 RX12V3C3 RX12V3C4 SETTYPE  YEAR   CSTRATM   CPSUM   PATWT  \
0      nan      nan      nan      nan       3  2013  20113201  100020  002945   
1      nan      nan      nan      nan       3  2013  20113201  100020  002945   
2      nan      nan      nan      nan       3  2013  20113201  100020  002945   
3      nan      nan      nan      nan       3  2013  20113201  100020  002945   
4      nan      nan      nan      nan       3  2013  20113201  100020  002945

  EDWT  
0  nan  
1  nan  
2  nan  
3  nan  
4  nan  

[5 rows x 579 columns]

Comparing the column values with their meanings in the documentation confirms that the data has been imported correctly. The nan values correspond to blank fields in the data file.
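
The blank-to-nan behavior can be checked on a small example. The following is a sketch on invented data (the three-field layout and the toy_file string are hypothetical), showing that a field made up only of spaces is read as a missing value even when dtype=str is used, and that missing values can be counted per column:

```python
import io
import pandas as pd

# Hypothetical layout: 1-char code, 4-char drug code, 1-char flag.
# The second record leaves the 4-char field blank.
toy_file = "3RX012\n3    1\n"
toy_widths = [1, 4, 1]

df_toy = pd.read_fwf(
    io.StringIO(toy_file), widths=toy_widths, header=None, dtype=str
)

# Count the missing values in each column.
missing_per_column = df_toy.isna().sum()
print(missing_per_column)
```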

Finally, as another check, let's look at the dimensions of the DataFrame and confirm that it has 24,777 rows and 579 columns:

print(df_ed.shape)

The output should be the following:

(24777, 579)
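
If you prefer the notebook to stop on a mismatch rather than rely on reading the printed tuple, the check can be wrapped in a small helper. This is only a sketch: check_shape is a hypothetical function name, not part of pandas, and the toy DataFrame stands in for df_ed.

```python
import pandas as pd

def check_shape(df, expected_rows, expected_cols):
    """Raise ValueError if df does not have the expected dimensions."""
    expected = (expected_rows, expected_cols)
    if df.shape != expected:
        raise ValueError(f"expected shape {expected}, got {df.shape}")

# Toy stand-in for df_ed: 2 rows, 3 string columns.
toy = pd.DataFrame({"a": ["1", "2"], "b": ["3", "4"], "c": ["5", "6"]})
check_shape(toy, 2, 3)  # passes silently

# For the real data, the call would be: check_shape(df_ed, 24777, 579)
```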

Now that the data has been imported correctly, let's set up our response variable.