Principal Component Analysis (PCA): Essence and Case Study with Python

Xia Song
6 min read · Aug 16, 2020

Data is the fuel of the big data era, and we can extract insightful information from it. However, large volumes of data with a high number of dimensions can bury valuable knowledge. Therefore, data mining and feature engineering have become essential skills for uncovering the information hidden underneath the data.

1. What is PCA

For dimensionality reduction, Principal Component Analysis (PCA) is the most popular algorithm. PCA encodes the original features into a compact representation, so we can drop the "unimportant" parts while still retaining most of the useful information. PCA does not simply select some features and discard the rest from the original dataset: the principal components it produces are linear combinations of the original features, and these components are a good alternative representation of the original data.

PCA has become a popular algorithm in the data science field because it helps us:

1. Reduce dimensionality and speed up the training time of a learning algorithm.

2. Enhance model performance by using the new combined feature representation.

3. Visualize the data easily once the dimensionality is reduced.

2. Essence of PCA

The principle of the PCA algorithm is to create a set of new features from the raw data, rank them by the variance they capture, and keep the top-ranked ones as the principal components. Why is variance the most important criterion? Because features with more variance in their values give a machine learning model more predictive power. For example, suppose we predict car price with two features: color and mileage. If all the cars have the same color but different mileage, then color (a feature with zero variance) tells us nothing about price. We can only rely on mileage to predict the price: the higher the mileage, the lower the price.
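A minimal sketch of this idea with hypothetical numbers (not from the original example), just to show that a constant feature has zero variance and hence carries no signal:

import numpy as np

# hypothetical toy data: every car has the same color code, mileage varies
color = np.array([1, 1, 1, 1, 1])            # zero variance
mileage = np.array([20, 45, 80, 120, 150])   # in thousands of km

print(np.var(color))    # 0.0 -> cannot separate the cars at all
print(np.var(mileage))  # large -> this feature can explain price differences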

Fig 1. The basic principle of PCA

Fig 1 shows the basic principle of PCA: retain as much of the variance as possible, which equivalently means discard as little as possible. After combining the two original features (x1 and x2), the new feature U becomes the first principal component of the dataset, and V is the second principal component. The principal components transform the original data into a new space in which U explains most of the data variance and V explains only a small part of it.
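A small sketch of this picture in code, using synthetic 2D data (the variable names x1, x2, U_dir, V_dir are mine and simply mirror Fig 1):

import numpy as np

rng = np.random.default_rng(0)
# synthetic, strongly correlated 2D data (x2 roughly proportional to x1)
x1 = rng.normal(0, 1, 500)
x2 = 0.8 * x1 + rng.normal(0, 0.2, 500)
X2d = np.column_stack([x1, x2])

# center the data and compute the covariance matrix
X2d = X2d - X2d.mean(axis=0)
C = np.cov(X2d, rowvar=False)

# eigenvectors of C are the principal directions; eigenvalues are their variances
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
U_dir, V_dir = eigvecs[:, order[0]], eigvecs[:, order[1]]

# first ratio is close to 1 -> U explains most of the variance, V very little
print(eigvals[order] / eigvals.sum())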

3. Implementation of PCA

From the description above, we have a high-level understanding of the PCA algorithm. It is time to dive into the detailed implementation.

1. Organize all independent features into a matrix X, and center each feature by subtracting its mean so that every feature has zero mean. If the features are on different scales, also standardize each feature by dividing by its standard deviation.

2. Secondly, obtain the new vectors (for example, U and V in Fig 1). The concrete procedure is to compute the covariance matrix of the centered data: C = X^T X / (n - 1), where n is the number of samples.

3. Thirdly, solve the eigendecomposition problem. The main idea is to solve the equation Cx = λx, where C is the covariance matrix, x is an eigenvector, and λ is the corresponding eigenvalue, which measures the variance explained along that eigenvector.

4. Rank the eigenvalues obtained in step 3, and select the eigenvectors that correspond to the k largest eigenvalues, where k is the number of dimensions to keep.

5. Project the original dataset onto the subspace spanned by the selected eigenvectors (a minimal NumPy sketch of the whole procedure is shown right after this list).
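A minimal sketch of steps 1-5 in NumPy (the function and variable names are my own, assuming a samples-by-features matrix X; in practice scikit-learn's PCA is the better choice):

import numpy as np

def pca_eig(X, k):
    # step 1: center each feature (optionally also divide by the std to standardize)
    X_centered = X - X.mean(axis=0)

    # step 2: covariance matrix of the centered data, C = X^T X / (n - 1)
    C = np.cov(X_centered, rowvar=False)

    # step 3: eigendecomposition Cx = λx (eigh is suited to symmetric matrices)
    eigvals, eigvecs = np.linalg.eigh(C)

    # step 4: rank eigenvalues in descending order and keep the top-k eigenvectors
    order = np.argsort(eigvals)[::-1][:k]
    components = eigvecs[:, order]

    # step 5: project the centered data onto the selected subspace
    return X_centered @ components

# usage on random data: 100 samples, 10 features, reduced to 3 components
scores = pca_eig(np.random.rand(100, 10), k=3)
print(scores.shape)  # (100, 3)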

In practice, this implementation of PCA requires computing the full covariance matrix, which uses a lot of memory. There is another elegant algorithm that achieves the same result directly from the raw dataset without calculating the covariance matrix: Singular Value Decomposition (SVD). PCA via eigendecomposition requires a square covariance matrix, whereas SVD can decompose an arbitrary matrix with m rows and n columns into a set of vectors. In addition, it is possible to use a truncated SVD, which makes the computation much faster and more numerically stable; in fact, the popular machine learning library scikit-learn uses SVD inside its PCA implementation. The singular value decomposition is X = U Σ V^T.

The orthogonal matrix U is composed of the orthonormal eigenvectors of XX^T, the orthogonal matrix V is composed of the orthonormal eigenvectors of X^TX, and the diagonal matrix Σ contains the square roots of the positive eigenvalues of XX^T or X^TX (the two matrices share the same positive eigenvalues).

According to the equation X = U Σ V^T, and using the fact that V is orthogonal (V^T V = I), the following equation can be obtained: X V = U Σ V^T V = U Σ. Based on this equation, the principal component scores can be computed either as X V or, equivalently, as U Σ.
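A short sketch of PCA via SVD, assuming a centered samples-by-features matrix (variable names are mine; this mirrors the idea behind scikit-learn's implementation, not its exact code):

import numpy as np

X_demo = np.random.rand(100, 10)
X_centered = X_demo - X_demo.mean(axis=0)

# economy-size SVD: X = U * diag(S) * Vt
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 3  # number of components to keep
# principal component scores computed two equivalent ways: X V and U Σ
scores_xv = X_centered @ Vt[:k].T
scores_us = U[:, :k] * S[:k]

print(np.allclose(scores_xv, scores_us))  # True (up to floating point error)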

4. Case Study

The MNIST dataset is used in this project to demonstrate the implementation of PCA and to illustrate its effect on a high-dimensional dataset. The dataset contains 70,000 images of handwritten digits. Each image has 28x28 = 784 features, which represent grayscale pixel values.

  1. Import all packages
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

2. Load and understand data set

mnist = fetch_openml('mnist_784', as_frame=False)  # as_frame=False returns NumPy arrays instead of a DataFrame
for key in mnist:
    print(key)
data
target
frame
feature_names
target_names
DESCR
details
categories
url

Nine different keys, including ‘data’ and ‘target’, are contained in the MNIST dataset object; ‘data’ holds the independent variables and ‘target’ holds the dependent variable.

3. Data structure

X, y = mnist['data'], mnist['target']
print(X.shape, y.shape)
(70000, 784) (70000,)

4. Illustrate the first nine data points

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 6))
for i in range(9):
    image = X[i]
    image_pixels = image.reshape(28, 28)
    plt.subplot(3, 3, i + 1)
    plt.imshow(image_pixels)
    plt.title(f'Target Label {y[i]}', fontsize=12)
    plt.axis('off')
The images of the first nine data points.

5. Standardize the data set

PCA is sensitive to the scale of the data, so it is good practice to standardize the features of the dataset before applying the PCA algorithm.

scaler_train = StandardScaler()
scaler_train.fit(X)
X = scaler_train.transform(X)

6. PCA decomposition

pca = PCA(0.95)
pca.fit(X)
X_pca_reduceddimension = pca.transform(X)
pca.n_components_
332

X_pca_reduceddimension.shape
(70000, 332)

This decomposition shows that PCA is a powerful tool for reducing a high-dimensional dataset: although we keep 95% of the information (variance) in the original data, the dimensionality is reduced by around 58% (1 - 332/784).
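To see how the retained variance accumulates across components, one can inspect explained_variance_ratio_ on the fitted pca object above (a small add-on of mine, not from the original walkthrough):

import numpy as np

# cumulative fraction of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative[:5])   # variance captured by the first five components
print(cumulative[-1])   # should be at least the 0.95 threshold we asked for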

7. Compare images that keep different amounts of the original information

The most interesting part is that PCA has an inverse transform: through this function, the compressed representation can be approximately mapped back to the original high-dimensional space (784 features). We will compare images that keep 95%, 85%, and 75% of the original information to the raw images.

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15, 9))
m = 0
for i in range(3):
    for j in [1, 0.95, 0.85, 0.75]:
        m += 1
        plt.subplot(3, 4, m)
        if j == 1:
            # show the original (standardized) image with all 784 features
            image = X[i]
            plt.title('Original Data; Components 784')
        else:
            # keep enough components to explain a fraction j of the variance
            pca = PCA(j)
            pca.fit(X)
            X_pca = pca.transform(X)
            # map the compressed representation back to the 784-dimensional space
            X_approx = pca.inverse_transform(X_pca)
            image = X_approx[i]
            plt.title(f'Variance {j}; Components {pca.n_components_}')
        image_pixels = image.reshape(28, 28)
        plt.axis('off')
        plt.imshow(image_pixels)
Comparing images that keep different amounts of the original information (variance).

5. Summary

This post explained the principle of PCA and demonstrated its implementation and application. PCA is a powerful tool for reducing a high-dimensional dataset while still keeping the majority of the information in the data. If we utilize PCA properly, it will save us computing resources and time.
