Case study–Python for principal component analysis
A team member collects four pieces of information on three resins (A, B, and C): pH, solid content, viscosity, and particle size. She wants to analyze this dataset to find the best way to group it. She performs a principal component analysis (PCA) to reduce the data dimension, and she wants to make sure that the number of components is enough to explain at least 90 percent of the variance in the data.
By performing PCA, she reduces the number of variables from four to two. The first principal component accounts for 72.70 percent of the total variance and the second one accounts for 23.10 percent. Together, the two components contain 95.80 percent of the information.
Constructing the structure of principal component analysis follows these steps: [22]
  1. Construct an n x p multivariate data matrix X in which each row contains the p data points (one per variable) for one observation.
  2. Standardize the X matrix, so that the mean is zero and the variance is one for every column. Each column of this new matrix, Z, is a vector variable, zi (i = 1, 2, …, p).
  3. Derive the scalar algebra for the component score of the ith individual on the jth component, yj (j = 1, …, p):
yij = v'j * zi = v1j * z1i + v2j * z2i + … + vpj * zpi
Where:
v'j = the transpose of the jth column of the coefficient matrix V (defined in the next step)
zi = the vector of standardized values for the ith individual
  4. Derive the matrix for all y:
Y = Z * V
Where:
Y = the matrix for all y
V = the eigenvector matrix, a p x p coefficient matrix that carries the p-element variable z into the derived p-element variable y
Note: The mean of the ys is zero because my = V' * mz = V' * 0 = 0.
  5. Derive the dispersion matrix of y:
Dy = V' * Dz * V = V' * R * V
Where:
Dy = the dispersion matrix of y
Dz = the dispersion matrix of z
R = the correlation matrix for z
The above equation shows that the dispersion matrix Dz of a standardized variable is its correlation matrix.
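As a complement to the matrix algebra above, the following minimal NumPy sketch (using a small hypothetical data matrix, not the case study data) carries out the same construction: standardize X into Z, form the correlation matrix R, take its eigenvectors as V, and compute the component scores Y = Z * V.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))            # hypothetical n x p data matrix (n = 10, p = 4)

# Step 2: standardize each column to mean 0 and variance 1
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 5: the dispersion matrix of the standardized variables is the correlation matrix R
R = np.corrcoef(Z, rowvar=False)        # p x p

# The eigenvectors of R form the p x p coefficient matrix V
# (eigh returns eigenvalues in ascending order; for PCA the columns of V
# would normally be reordered by descending eigenvalue)
eigenvalues, V = np.linalg.eigh(R)

# Step 4: the matrix of component scores, Y = Z * V
Y = Z @ V

# The mean of the ys is zero (to within rounding), as noted above
print(np.allclose(Y.mean(axis=0), 0))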
Performing principal component analysis in Python involves the following steps.
Step 1. Import libraries.
In [1]:
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
Scikit-learn is a machine learning library featuring various classification, regression, and clustering algorithms such as support vector machines, random forests, gradient boosting, k-means, and DBSCAN. The sklearn.decomposition.PCA( ) function performs principal component analysis. The sklearn.preprocessing package provides several common utility functions and transformer classes. Its StandardScaler( ) function standardizes features by removing the mean and scaling to unit variance.
Step 2. Load and examine the dataset.
In [2]:
df = pd.read_excel('case pca.xlsx')
In [3]:
df.head()
Out[3]:
    pH  Solid  Viscosity  Particle Resin
0  6.2  49.59     1081.2       0.7     A
1  6.0  45.09     1081.2       0.7     A
2  5.8  46.89      991.1       0.7     A
3  5.7  45.99     1171.3       0.7     A
4  6.1  50.49     1081.2       0.7     A
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 5 columns):
pH           153 non-null float64
Solid        153 non-null float64
Viscosity    153 non-null float64
Particle     153 non-null float64
Resin        153 non-null object
dtypes: float64(4), object(1)
memory usage: 6.1+ KB
In [5]:
df.describe()
Out[5]:
pH
Solid
Viscosity
Particle
count
153.000000
153.000000
153.000000
153.000000
mean
6.944444
45.543007
3202.966013
1.694118
std
0.831603
3.878148
1590.520755
0.761943
min
5.400000
36.090000
720.800000
0.600000
25%
6.200000
43.290000
1261.400000
0.800000
50%
6.900000
45.090000
3694.100000
1.800000
75%
7.500000
47.790000
4414.900000
2.300000
max
9.000000
57.690000
6036.700000
3.00000 0
The dataset has five columns, and each of them has 153 data points. The data in the first four columns are continuous data, and the data in the last column are categorical.
Step 3. Separate the features from the Data Frame.
In [6]:
x = df.iloc[:, 0:4].values
PCA is affected by the scales of the features, so the features need to be scaled. In this case, the features are pH, Solid, Viscosity, and Particle, and they are separated from the Resin column.
The DataFrame.iloc indexer performs purely integer-position-based indexing for selection by position. The expression df.iloc[:, 0:4] slices out all rows and the first four columns of the Data Frame. The DataFrame.values attribute returns a NumPy representation of the Data Frame.
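An optional quick check of the resulting array shows one row per observation and one column per feature:
print(x.shape)   # (153, 4)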
Step 4. Standardize the features.
In [7]:
x = StandardScaler().fit_transform(x)
The StandardScaler( ) function standardizes the features onto unit scale (mean = 0 and variance = 1). The fit_transform( ) function fits to the data, then transforms it.
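An optional quick check confirms the scaling: after standardization, each column of x should have a mean of approximately zero and a standard deviation of approximately one.
print(x.mean(axis=0).round(6))   # approximately [0. 0. 0. 0.]
print(x.std(axis=0).round(6))    # approximately [1. 1. 1. 1.]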
Step 5. Perform PCA.
In [8]:
pca = PCA()
In [9]:
pc = pca.fit_transform(x)
The sklearn.decomposition.PCA( ) function performs principal component analysis for linear dimensionality reduction, using singular value decomposition of the data to project it to a lower-dimensional space. The pca.fit_transform( ) function fits the model to the data and returns the transformed data (the component scores).
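Because no n_components argument is passed here, all four components are retained. An optional shape check makes this explicit:
print(pc.shape)               # (153, 4): one score per observation per component
print(pca.components_.shape)  # (4, 4): one row of coefficients per component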
Step 6. Determine the intrinsic dimension of the dataset.
In [10]:
features = range(pca.n_components_)
In [11]:
plt.bar(features, pca.explained_variance_)
plt.xticks(features)
plt.ylabel('variance')
plt.xlabel('PCA feature')
plt.show()
The intrinsic dimension is the number of features needed to approximate the dataset. It is the essential idea behind dimension reduction. For PCA, the intrinsic dimension is the number of PCA features with significant variance.
Python's range( ) function creates a sequence of numbers; in this case, range(pca.n_components_) creates a sequence of PCA feature indices.
The bar plot shows that only the first two PCA features have significant variance. Therefore, the intrinsic dimension is set at two.
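The requirement stated at the beginning of the case study, that the retained components explain at least 90 percent of the variance, can also be checked numerically with the cumulative explained variance ratio (a short sketch, assuming NumPy is imported as np):
import numpy as np

print(np.cumsum(pca.explained_variance_ratio_))
# The second entry is about 0.958, so the first two components together
# already explain more than 90 percent of the variance.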
Step 7. Reduce the dimension with PCA.
In [12]:
pca = PCA(n_components=2)
In [13]:
pc = pca.fit_transform(x)
The n_components parameter is set to two to match the intrinsic dimension, and the PCA model is then refitted.
Step 8. Create a Data Frame with two principal components and examine it.
In [14]:
pcdf = pd.DataFrame(data = pc, columns = ['principal component 1', 'principal component 2'])
In [15]:
pcdf.head()
Out[15]:
   principal component 1  principal component 2
0              -2.260979               0.529975
1              -2.085498              -0.638749
2              -2.365671              -0.296291
3              -2.302377              -0.554145
4              -2.384538               0.701810
In [16]:
pcdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 2 columns):
principal component 1    153 non-null float64
principal component 2    153 non-null float64
dtypes: float64(2)
memory usage: 2.5 KB
The pandas.DataFrame( ) function creates a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled rows and columns. The new Data Frame has two columns, and each of them has 153 data points.
Step 9. Concatenate the Data Frame with the Resin column along axis = 1.
In [17]:
finaldf = pd.concat([pcdf, df[['Resin']]], axis = 1)
In [18]:
finaldf.head()
Out[18]:
   principal component 1  principal component 2 Resin
0              -2.260979               0.529975     A
1              -2.085498              -0.638749     A
2              -2.365671              -0.296291     A
3              -2.302377              -0.554145     A
4              -2.384538               0.701810     A
In [19]:
finaldf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 3 columns):
principal component 1    153 non-null float64
principal component 2    153 non-null float64
Resin                    153 non-null object
dtypes: float64(2), object(1)
memory usage: 3.7+ KB
The pandas.concat( ) function concatenates pandas objects along a particular axis with optional set logic along the other axes. In this case, the function concatenates the pcdf Data Frame with the Resin column of the original Data Frame to generate the final Data Frame that will be used to plot the data.
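Because pcdf and df share the same RangeIndex, an equivalent way to attach the Resin column is DataFrame.join( ); this is only an alternative, not the step used in the case study:
finaldf = pcdf.join(df['Resin'])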
Step 10. Create a two-dimensional scatter plot.
In [20]:
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 Component PCA', fontsize = 20)
resins = ['A', 'B', 'C']
colors = ['r', 'g', 'b']
for resin, color in zip(resins, colors):
    indicesToKeep = finaldf['Resin'] == resin
    ax.scatter(finaldf.loc[indicesToKeep, 'principal component 1'],
               finaldf.loc[indicesToKeep, 'principal component 2'],
               c = color, s = 50)
ax.legend(['A', 'B', 'C'], loc='lower right')
ax.grid()
The matplotlib.pyplot.legend( ) function places a legend on the axes, and the loc='lower right' parameter places the legend in the lower right corner of the figure. The matplotlib.axes.Axes.grid( ) function configures the grid lines.
The plot shows that the resin classes are well separated from each other.
Step 11. Calculate the percentage of information contained by the two principal components.
In [21] :
pca.explained_variance_ratio_
Out[21]:
array([0.72700289, 0.2310556 ])
As the four-dimensional feature space (pH, solid content, viscosity, and particle size) is reduced to a two-dimensional space, some of the variance (information) is lost. The pca.explained_variance_ratio_ attribute shows how much information (variance) is attributed to each of the principal components. The first principal component contains 72.70 percent of the variance, and the second principal component contains 23.10 percent. Together, the two components retain 95.80 percent of the information, which satisfies the requirement of explaining at least 90 percent of the variance.
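Summing the ratios gives the total information retained by the two components:
print(pca.explained_variance_ratio_.sum())   # about 0.958, i.e., 95.80 percent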