Jan 27, 2015, by Sebastian Raschka

Principal Component Analysis (PCA) is a simple yet popular and useful linear transformation technique that is used in numerous applications, such as stock market predictions, the analysis of gene expression data, and many more. In this tutorial, we will see that PCA is not just a "black box", and we are going to unravel its internals in 3 basic steps.

This article just got a complete overhaul; the original version is still available at.

## Introduction

The sheer size of data in the modern age is not only a challenge for computer hardware but also a main bottleneck for the performance of many machine learning algorithms. The main goal of a PCA analysis is to identify patterns in data; PCA aims to detect the correlation between variables. The attempt to reduce the dimensionality only makes sense if a strong correlation between variables exists. In a nutshell, this is what PCA is all about: finding the directions of maximum variance in high-dimensional data and projecting it onto a smaller-dimensional subspace while retaining most of the information.

## PCA vs. LDA

Both Linear Discriminant Analysis (LDA) and PCA are linear transformation methods. PCA yields the directions (principal components) that maximize the variance of the data, whereas LDA additionally aims to find the directions that maximize the separation between different classes.

## Standardizing

Since PCA yields a feature subspace that maximizes the variance along the axes, we standardize the features to unit scale (mean = 0 and variance = 1) before performing the eigendecomposition:

```python
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)
```
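For readers who want to see what the standardization step does without scikit-learn, here is an equivalent plain-NumPy sketch on random stand-in data (the Iris loading code is not part of this excerpt; note that `StandardScaler` divides by the biased standard deviation, i.e. `ddof=0`):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(150, 4))  # stand-in for the raw Iris features

# center each feature to mean 0 and scale it to unit variance
# (ddof=0, matching sklearn's StandardScaler)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std.mean(axis=0), 0.0))  # True
print(np.allclose(X_std.std(axis=0), 1.0))   # True
```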
![PCA vs. LDA projection of the Iris dataset](http://scikit-learn.sourceforge.net/0.8/_images/plot_pca_vs_lda_11.png)
## 1 - Eigendecomposition: Computing Eigenvectors and Eigenvalues

The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the "core" of a PCA: the eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.

### Covariance Matrix

The classic approach to PCA is to perform the eigendecomposition on the covariance matrix $\Sigma$, a $d \times d$ matrix where each element represents the covariance between two features. The covariance between two features is calculated as follows:

$$\sigma_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right)$$

We can summarize the calculation of the covariance matrix via the following matrix equation:

$$\Sigma = \frac{1}{n-1}\left((\mathbf{X} - \mathbf{\bar{x}})^T(\mathbf{X} - \mathbf{\bar{x}})\right)$$

where $\mathbf{\bar{x}} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the mean vector. The mean vector is a $d$-dimensional vector where each value represents the sample mean of a feature column in the dataset.

Performing the eigendecomposition on the covariance matrix of the standardized Iris data yields:

```
Eigenvectors
[[ 0.52237162 -0.37231836 -0.72101681  0.26199559]
 [-0.26335492 -0.92555649  0.24203288 -0.12413481]
 [ 0.58125401 -0.02109478  0.14089226 -0.80115427]
 [ 0.56561105 -0.06541577  0.6338014   0.52354627]]

Eigenvalues
[ 2.93035378  0.92740362  0.14834223  0.02074601]
```

### Correlation Matrix

Especially in the field of finance, the correlation matrix is typically used instead of the covariance matrix.
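The covariance-matrix equation and the eigendecomposition above can be checked numerically. The sketch below (on random stand-in data, with illustrative variable names) computes the covariance matrix both via the matrix equation and with `np.cov`, then verifies the eigenvalue equation $\Sigma v = \lambda v$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))  # stand-in for the standardized Iris data

# covariance matrix via the matrix equation (1/(n-1)) (X - x̄)^T (X - x̄)
n = X.shape[0]
mean_vec = X.mean(axis=0)                      # d-dimensional mean vector
cov_mat = (X - mean_vec).T @ (X - mean_vec) / (n - 1)

# eigendecomposition (eigh is appropriate for symmetric matrices)
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)

# sanity checks: matches np.cov, and each eigenpair satisfies Σ v = λ v
print(np.allclose(cov_mat, np.cov(X.T)))                     # True
print(np.allclose(cov_mat @ eig_vecs, eig_vecs * eig_vals))  # True
```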
However, the eigendecomposition of the covariance matrix (if the input data was standardized) yields the same results as an eigendecomposition of the correlation matrix, since the correlation matrix can be understood as the normalized covariance matrix. The eigendecomposition of the standardized data based on the correlation matrix yields the same eigenvectors (eigenvectors are only defined up to sign):

```
[[-0.52237162 -0.37231836  0.72101681  0.26199559]
 [ 0.26335492 -0.92555649 -0.24203288 -0.12413481]
 [-0.58125401 -0.02109478 -0.14089226 -0.80115427]
 [-0.56561105 -0.06541577 -0.6338014   0.52354627]]
```

## 2 - Selecting Principal Components

### Sorting Eigenpairs

The typical goal of a PCA is to reduce the dimensionality of the original feature space by projecting it onto a smaller subspace, where the eigenvectors will form the axes. However, the eigenvectors only define the directions of the new axes, since they all have the same unit length 1. After sorting the eigenpairs by decreasing eigenvalue, we can plot how much variance each principal component explains:

```python
import numpy as np
import matplotlib.pyplot as plt

# explained variance ratios from the sorted eigenvalues (eig_vals from above)
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    plt.bar(range(4), var_exp, alpha=0.5, align='center',
            label='individual explained variance')
    plt.step(range(4), cum_var_exp, where='mid',
             label='cumulative explained variance')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()
```

The plot above clearly shows that most of the variance (72.77% of the variance, to be precise) can be explained by the first principal component alone. The second principal component still bears some information (23.03%), while the third and fourth principal components can safely be dropped without losing too much information. Together, the first two principal components contain 95.8% of the information.

### Projection Matrix

It's about time to get to the really interesting part: the construction of the projection matrix that will be used to transform the Iris data onto the new feature subspace.
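Using the eigenvalues and eigenvectors printed earlier, the sorting and stacking steps can be sketched as follows (the variable names `eig_pairs` and `matrix_w` are illustrative):

```python
import numpy as np

# eigenvalues and eigenvectors of the standardized Iris data (from the output above)
eig_vals = np.array([2.93035378, 0.92740362, 0.14834223, 0.02074601])
eig_vecs = np.array([[ 0.52237162, -0.37231836, -0.72101681,  0.26199559],
                     [-0.26335492, -0.92555649,  0.24203288, -0.12413481],
                     [ 0.58125401, -0.02109478,  0.14089226, -0.80115427],
                     [ 0.56561105, -0.06541577,  0.6338014 ,  0.52354627]])

# pair each eigenvalue with its eigenvector (a column of eig_vecs), sort descending
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)

# stack the top-2 eigenvectors column-wise into the 4x2 projection matrix W
matrix_w = np.hstack((eig_pairs[0][1].reshape(4, 1),
                      eig_pairs[1][1].reshape(4, 1)))
print(matrix_w.shape)  # (4, 2)
```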
Although the name "projection matrix" has a nice ring to it, it is basically just a matrix of our concatenated top k eigenvectors. Here, we are reducing the 4-dimensional feature space to a 2-dimensional feature subspace by choosing the "top 2" eigenvectors with the highest eigenvalues to construct our 4×2-dimensional eigenvector matrix $\mathbf{W}$. Via the projection $\mathbf{Y} = \mathbf{X} \cdot \mathbf{W}$, we can then plot the transformed samples:

```python
import matplotlib.pyplot as plt

Y = X_std.dot(matrix_w)  # project the standardized samples onto the new subspace

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    for lab, col in zip(('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),
                        ('blue', 'red', 'green')):
        plt.scatter(Y[y == lab, 0], Y[y == lab, 1], label=lab, c=col)
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.legend(loc='lower center')
    plt.tight_layout()
    plt.show()
```

Now, what we got after applying the linear PCA transformation is a lower-dimensional subspace (from 4D to 2D in this case), where the samples are "most spread" along the new feature axes.

## Shortcut - PCA in scikit-learn

For educational purposes, we went a long way to apply the PCA to the Iris dataset.
But luckily, there is already an implementation in scikit-learn.
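A minimal sketch of the scikit-learn shortcut, shown here on random stand-in data since the Iris loading code is not part of this excerpt; `sklearn.decomposition.PCA` performs the eigendecomposition, sorting, and projection in a single `fit_transform` call:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_std = rng.normal(size=(150, 4))  # stand-in for the standardized Iris data

# fit PCA and project onto the first two principal components in one step
sklearn_pca = PCA(n_components=2)
Y_sklearn = sklearn_pca.fit_transform(X_std)

print(Y_sklearn.shape)                              # (150, 2)
print(sklearn_pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

By construction, the first component carries at least as much variance as the second, which is easy to verify on the projected data.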