As usual, scikit-learn saves the day by implementing this procedure in an easy-to-use transformer, so we don't have to go through the manual steps each time we wish to use this powerful technique:
- We can import it from scikit-learn's decomposition module:
# scikit-learn's version of PCA
from sklearn.decomposition import PCA
- To mimic the process we performed with the iris dataset, let's instantiate a PCA object with only two components:
# Like any other sklearn module, we first instantiate the class
pca = PCA(n_components=2)
- Now, we can fit our PCA to the data:
# fit the PCA to our data
pca.fit(iris_X)
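- As a side note, like every scikit-learn transformer, PCA also exposes a fit_transform method that performs both steps in a single call; a quick sketch of the equivalent shortcut (the iris_2d name here is just for illustration):
# fit the PCA and project the data in one call
iris_2d = PCA(n_components=2).fit_transform(iris_X)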
- Let's take a look at some of the attributes of the PCA object to see if they match up with what we achieved in our manual process. We'll start with the components_ attribute of our object to see if it matches our top_2_eigenvectors variable:
pca.components_
array([[ 0.36158968, -0.08226889,  0.85657211,  0.35884393],
       [ 0.65653988,  0.72971237, -0.1757674 , -0.07470647]])
# note that the second component is the negative of the one from the manual process
# this is because eigenvectors can be positive or negative
# It should have little to no effect on our machine learning pipelines
- Our two components match our previous variable, top_2_eigenvectors, almost exactly. We say almost because the second component is actually the negative of the eigenvector we calculated. This is fine because, mathematically, either sign of an eigenvector is equally valid, and both still achieve the primary goal of creating uncorrelated columns.
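- If you want to verify this programmatically, a quick sanity check (assuming the top_2_eigenvectors variable from our manual walkthrough is still in memory) is to compare the absolute values of the two matrices:
# the components should match our manual eigenvectors up to sign
np.allclose(np.abs(pca.components_), np.abs(top_2_eigenvectors))  # expect True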
- So far, this process is much less painful than what we were doing before. To complete the process, we need to use the transform method of the PCA object to project data onto our new two-dimensional plane:
pca.transform(iris_X)[:5,]
array([[-2.68420713,  0.32660731],
       [-2.71539062, -0.16955685],
       [-2.88981954, -0.13734561],
       [-2.7464372 , -0.31112432],
       [-2.72859298,  0.33392456]])
# sklearn's PCA centers the data (subtracts each column's mean) before projecting, so these numbers won't match our manual process.
- We can mimic this by altering a single line in our version to match:
# manually centering our data to match scikit-learn's implementation of PCA
np.dot(iris_X-mean_vector, top_2_eigenvectors.T)[:5,]
array([[-2.68420713, -0.32660731],
       [-2.71539062,  0.16955685],
       [-2.88981954,  0.13734561],
       [-2.7464372 ,  0.31112432],
       [-2.72859298, -0.33392456]])
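- We can confirm that the only remaining differences are the centering step and the sign of the second component with one more quick check (again assuming mean_vector and top_2_eigenvectors from the manual process are still defined):
# sklearn's projection and the centered manual projection agree up to sign
manual_projection = np.dot(iris_X - mean_vector, top_2_eigenvectors.T)
np.allclose(np.abs(pca.transform(iris_X)), np.abs(manual_projection))  # expect True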
- Let's make a quick plot of the projected iris data and compare what the dataset looks like before and after projecting onto our new coordinate system (a minimal sketch of the plot helper used here appears after the discussion of the output below):
# Plot the original and projected data
plot(iris_X, iris_y, "Original Iris Data", "sepal length (cm)", "sepal width (cm)")
plt.show()
plot(pca.transform(iris_X), iris_y, "Iris: Data projected onto first two PCA components", "PCA1", "PCA2")
The preceding code produces two scatter plots: the original iris data (sepal length versus sepal width) and the same data projected onto the first two principal components (PCA1 versus PCA2).
In our original dataset, we can see the irises in their original feature space along the first two columns. Notice that in the projected space, the flower clusters are more clearly separated from one another and the data appears slightly rotated, as if the clusters are standing upright. This happens because our principal components are aligned with the directions of greatest variance in the data, and it shows in our plots.
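The plot helper called in the preceding snippet was defined earlier in the chapter. If you are following along in a fresh session, a minimal stand-in that matches the way it is called here (the original helper's styling may differ) could look like this:
# minimal stand-in for the chapter's plot helper
import matplotlib.pyplot as plt

def plot(X, y, title, x_label, y_label):
    # scatter the first two columns of X, colored by class label
    for label in np.unique(y):
        plt.scatter(X[y == label, 0], X[y == label, 1], label=str(label), alpha=0.7)
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.legend(loc='best')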
We can extract the amount of variance explained by each component as we did in our manual example:
# percentage of variance in data explained by each component
# same as what we calculated earlier
pca.explained_variance_ratio_
array([ 0.92461621, 0.05301557])
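Together, these two components capture nearly 98% of the variance in the data. As an aside (not part of the original walkthrough), looking at the cumulative explained variance is a common way to decide how many components are worth keeping:
# cumulative variance captured as we add components
# first entry is about 0.925, both components together about 0.978
np.cumsum(pca.explained_variance_ratio_)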
Now that we can perform all of the basic functions with scikit-learn's PCA, let's use this information to demonstrate one of the main benefits of PCA: de-correlating features.
This is a major benefit because many machine learning models and preprocessing techniques assume that the input features are independent of one another, and PCA's de-correlated outputs go a long way toward satisfying that assumption.
To show this, let's create the correlation matrix of the original iris dataset and find the average linear correlation coefficient between each pair of features. Then, we will do the same for a PCA-projected dataset and compare the values. We expect the average correlation of the projected dataset to be much closer to zero, implying that the new columns are linearly uncorrelated.
Let's begin by calculating the correlation matrix of the original iris dataset:
- It will be a 4 x 4 matrix whose values represent the correlation coefficient between each pair of features:
# show how pca attempts to eliminate dependence between columns
# show the correlation matrix of the original dataset
np.corrcoef(iris_X.T)
array([[ 1.        , -0.10936925,  0.87175416,  0.81795363],
       [-0.10936925,  1.        , -0.4205161 , -0.35654409],
       [ 0.87175416, -0.4205161 ,  1.        ,  0.9627571 ],
       [ 0.81795363, -0.35654409,  0.9627571 ,  1.        ]])
- We will then extract the values above the diagonal of 1s and use them to find the average correlation between the features:
# correlation coefficients above the diagonal
np.corrcoef(iris_X.T)[[0, 0, 0, 1, 1], [1, 2, 3, 2, 3]]
array([-0.10936925, 0.87175416, 0.81795363, -0.4205161 , -0.35654409])
- Finally, we will take the mean of this array:
# average correlation of original iris dataset.
np.corrcoef(iris_X.T)[[0, 0, 0, 1, 1], [1, 2, 3, 2, 3]].mean()
0.16065567094168495
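Before moving on, note that the hand-picked index pairs above cover five of the six unique feature pairs; the (2, 3) pair, petal length versus petal width, is left out. If you want every entry above the diagonal, NumPy's triu_indices is a more general option (a small aside, not part of the original walkthrough):
# indices of every entry strictly above the diagonal of a 4 x 4 matrix
row_idx, col_idx = np.triu_indices(4, k=1)
# including the strongly correlated petal pair pulls the average up to roughly 0.29
np.corrcoef(iris_X.T)[row_idx, col_idx].mean()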
- The average correlation coefficient of the original features is .16, which is pretty small, but definitely not zero. Now, let's create a full PCA that captures all four principal components:
# capture all four principal components
full_pca = PCA(n_components=4)
# fit our PCA to the iris dataset
full_pca.fit(iris_X)
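- As a quick sanity check (not in the original text), with all four components kept, the explained variance ratios should sum to one, because no information has been discarded:
# all four components together account for 100% of the variance
full_pca.explained_variance_ratio_.sum()  # should be (essentially) 1.0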
- Once we've done this, we will use the same method as before and calculate the average correlation coefficient between the new, supposedly linearly independent columns:
pca_iris = full_pca.transform(iris_X)
# average correlation of the PCA-transformed iris dataset
# VERY close to 0 because the new columns are uncorrelated with one another
# This is an important consequence of performing an eigenvalue decomposition
np.corrcoef(pca_iris.T)[[0, 0, 0, 1, 1], [1, 2, 3, 2, 3]].mean()
7.2640855025557061e-17
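To drive the point home, we can also compare the entire correlation matrix of the projected data against the 4 x 4 identity matrix (a quick sketch reusing the pca_iris variable from above):
# off-diagonal correlations are numerically zero, diagonal entries are exactly 1
np.allclose(np.corrcoef(pca_iris.T), np.eye(4), atol=1e-10)  # expect True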
This shows how data projected onto its principal components ends up with uncorrelated features, which is generally helpful in machine learning.