Next, we import the contents of the fixed-width data file into Python as a pandas DataFrame composed of string columns, using the width list created in the previous cell. Reading every field as a string preserves leading zeros in coded values such as 01 or 0647. We then name the columns using the col_names list:
df_ed = pd.read_fwf(
    HOME_PATH + 'ED2013',
    widths=width,
    header=None,
    dtype='str'
)
df_ed.columns = col_names
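To see what the widths and dtype arguments do, here is a minimal sketch that reads a tiny in-memory sample instead of the real file. The two sample records, the three-field layout, and the names demo_widths, demo_names, df_demo, and df_int are assumptions made up for illustration; they are not taken from the ED2013 file layout.

```python
import io

import pandas as pd

# Hypothetical layout: VMONTH (2 chars), VDAYR (1 char), ARRTIME (4 chars).
sample = "0130647\n0131841\n"
demo_widths = [2, 1, 4]
demo_names = ['VMONTH', 'VDAYR', 'ARRTIME']

# Read every field as a string so that leading zeros survive.
df_demo = pd.read_fwf(
    io.StringIO(sample),
    widths=demo_widths,
    header=None,
    dtype='str'
)
df_demo.columns = demo_names
print(df_demo)

# For comparison: without dtype, pandas infers integers and drops the zeros.
df_int = pd.read_fwf(io.StringIO(sample), widths=demo_widths, header=None)
print(df_int)
```

In the string version, VMONTH holds '01' and ARRTIME holds '0647'; in the inferred version the same fields become the integers 1 and 647, which no longer match the codes in the documentation.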
Let's print a preview of our dataset to confirm it was imported correctly:
print(df_ed.head(n=5))
The output should look similar to the following:
   VMONTH  VDAYR  ARRTIME  WAITTIME   LOV  AGE  AGER  AGEDAYS  RESIDNCE  SEX  ...  \
0      01      3     0647      0033  0058  046     4      -07        01    2  ...
1      01      3     1841      0109  0150  056     4      -07        01    2  ...
2      01      3     1333      0084  0198  037     3      -07        01    2  ...
3      01      3     1401      0159  0276  007     1      -07        01    1  ...
4      01      4     1947      0114  0248  053     4      -07        01    1  ...

   RX12V3C1  RX12V3C2  RX12V3C3  RX12V3C4  SETTYPE  YEAR   CSTRATM   CPSUM   PATWT  \
0       nan       nan       nan       nan        3  2013  20113201  100020  002945
1       nan       nan       nan       nan        3  2013  20113201  100020  002945
2       nan       nan       nan       nan        3  2013  20113201  100020  002945
3       nan       nan       nan       nan        3  2013  20113201  100020  002945
4       nan       nan       nan       nan        3  2013  20113201  100020  002945

   EDWT
0   nan
1   nan
2   nan
3   nan
4   nan

[5 rows x 579 columns]
Comparing the column values with their meanings in the documentation confirms that the data has been imported correctly. The nan values correspond to blank spaces in the data file.
Finally, as a further check, let's look at the dimensions of the DataFrame and confirm that it has 24,777 rows and 579 columns:
print(df_ed.shape)
The output should be the following:
(24777, 579)
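If you prefer a check that reports problems rather than one you compare by eye, the same verification can be sketched as a small helper. The function check_import and the stand-in DataFrame demo are hypothetical and shown only for illustration:

```python
import pandas as pd


def check_import(df, expected_shape, names):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if df.shape != expected_shape:
        problems.append(f"shape is {df.shape}, expected {expected_shape}")
    if list(df.columns) != list(names):
        problems.append("column names do not match the expected list")
    if len(set(names)) != len(names):
        problems.append("expected column names contain duplicates")
    return problems


# Tiny stand-in frame so the sketch runs without the real data file.
demo = pd.DataFrame([['01', '3'], ['01', '4']], columns=['VMONTH', 'VDAYR'])

ok = check_import(demo, (2, 2), ['VMONTH', 'VDAYR'])
bad = check_import(demo, (3, 2), ['VMONTH', 'VDAYR'])
print(ok)
print(bad)
```

For the dataset used here, the equivalent call would be check_import(df_ed, (24777, 579), col_names), which should return an empty list if the import succeeded.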
Now that the data has been imported correctly, let's set up our response variable.