Principal Component Analysis (PCA) is a powerful technique for reducing a large set of variables into a low-dimensional space. This article will show the richness of information we can glean from data by simplifying and analyzing the feature space through PCA.
We will use the UCI ML Breast Cancer Wisconsin (Diagnostic) data set as an example, available at https://goo.gl/U2Uwz2. While the original authors focused on developing a decision tree using linear programming, I will show how today’s PCA libraries greatly simplify the understanding of the variables’ space. Therefore, the objective of this article is not to develop yet another machine learning classifier but to show how to use PCA to better understand the feature space. The technique generalizes to any real-world example where multiple correlated variables make standard scatterplot analysis unfeasible.
What is Principal Component Analysis?
I frequently use PCA in both supervised and unsupervised learning because it helps simplify and understand data sets with many variables. It starts by identifying the axis, called the first principal component, along which the observations vary the most. This line is also the line closest to the data: it minimizes the sum of squared distances to the data points.
For example, if we imagine data points distributed in a two-dimensional ellipse, the first principal component will lie along the ellipse’s major axis. Therefore, projecting the points onto this line yields components with the largest possible variance. Thus, this projection already reduces the complexity from two dimensions to one while capturing most of the original information. In other words, PCA helps simplify the data (reducing its dimensions) while losing as little as possible of the original information. Magic.
Indeed, PCA beautifully embodies the much-quoted aphorism “everything should be made as simple as possible, but not simpler,” often attributed to A. Einstein (although with no direct evidence in his writings).
Using the previous example, we can now visualize a second principal component as the minor axis of the ellipse, orthogonal to the first one, which describes the data variance along this independent direction. Generally speaking, if p features describe our data in a p-dimensional space, there will be p orthogonal axes along which we can completely represent our data.
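To make the ellipse picture concrete, here is a minimal sketch of my own (not part of the original analysis, with illustrative variable names) that generates a correlated two-dimensional cloud of points and checks that the first principal component aligns with the direction of largest spread:
import numpy as np
from sklearn.decomposition import PCA

# A correlated 2D Gaussian cloud: an elongated "ellipse" of points
rng = np.random.default_rng(0)
points = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500)

pca_2d = PCA(n_components=2)
pca_2d.fit(points)

# The first row of components_ points along the major axis of the ellipse,
# and the first explained variance ratio is the share of spread it captures
print(pca_2d.components_[0])
print(pca_2d.explained_variance_ratio_)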
Using Principal Component Analysis to simplify our data
The data set includes 569 labeled instances with 30 numerical attributes. The labels are the classes WDBC-Malignant and WDBC-Benign, while the features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
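Although the data set can be downloaded from the UCI link above, a convenient copy ships with Scikit-Learn. The sketch below is my addition (the variable names df and labels are just conventions reused in the rest of the code) and loads it into a pandas DataFrame:
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin Diagnostic data: 569 instances, 30 numeric features
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
labels = cancer.target  # 0 = malignant, 1 = benign
print(df.shape)  # (569, 30)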
During Exploratory Data Analysis, we usually examine two-dimensional scatterplots of the various variables, which, for p features, means $\binom{p}{2} = \frac{p(p-1)}{2}$ scatterplots. Unfortunately, when the number of variables is large, as in our case, this approach is impractical. In fact, for 30 features, there are 435 scatterplots: way too many to inspect. And even if we soldiered on and tried to analyze them, each one would contain only a tiny fraction of the information present in the data set.
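As a quick sanity check on that count (an aside of mine, not part of the original), the standard library confirms the combinatorics:
from math import comb

# Number of distinct pairwise scatterplots for p = 30 features
print(comb(30, 2))  # 435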
Therefore, when p is large, we need a low-dimensional representation of the data that captures as much information as possible, whether to show on a two-dimensional plot or to feed a predictive model. PCA is precisely the tool to do that.
The approach in practice
While we could easily handle the linear algebra behind PCA from scratch in Python, we will use the Scikit-Learn library that significantly streamlines the workflow.
Furthermore, since PCA maximizes variance, and variance depends on the scale of each variable, it is imperative to scale our variables before applying PCA.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# We scale our features with mean zero and unit variance
scaler = StandardScaler()
scaled_X = scaler.fit_transform(df)
# We fit the model to study the two principal components
pca_model = PCA(n_components=2)
pca_model.fit(scaled_X)
principal_components = pca_model.transform(scaled_X)
We should first ask: how much of the information is retained by these two dimensions instead of the 30 dimensions of the original feature space?
As mentioned, this level of information is normalized and available as the explained variance ratio:
pca_model.explained_variance_ratio_
array([4.42720256e-01, 1.89711820e-01])
It turns out that PC1, the first principal axis, explains 44% of the variance while PC2, the second axis, represents 19% of the variance.
Therefore our PCA greatly simplifies the analysis while retaining over 63% of the original information, which is excellent.
We can see how well the two classes are separated by plotting the data projected onto these two axes, using different colors for malignant and benign tumors according to the labels.
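A minimal plotting sketch, assuming the principal_components array and the labels defined earlier and using matplotlib (the original post shows the resulting figure rather than the code):
import matplotlib.pyplot as plt

# Scatter of the data projected onto the first two principal components,
# colored by the benign/malignant label
plt.figure(figsize=(8, 6))
plt.scatter(principal_components[:, 0], principal_components[:, 1], c=labels, cmap='coolwarm', alpha=0.6)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()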
Study of the feature space
By exploring the weights given to the features along each principal axis, also called the loading vectors, we can see which variables are correlated or more relevant.
To better clarify, PC1, the first loading vector, is the linear combination of features with the largest variance, while PC2 is the axis with maximal variance out of all linear combinations that are uncorrelated with PC1. Using a heatmap we can see which features are relatively significant in each of the two orthogonal axes.
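One way to build such a heatmap, sketched here with seaborn (my choice of library; any heatmap tool would do), is to wrap pca_model.components_ in a DataFrame indexed by the two components:
import seaborn as sns

# Rows are PC1 and PC2, columns are the original 30 features
loadings = pd.DataFrame(pca_model.components_, columns=df.columns, index=['PC1', 'PC2'])
plt.figure(figsize=(14, 3))
sns.heatmap(loadings, cmap='viridis')
plt.show()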
We can further explore the feature space by plotting the weights of each loading vector, labeled with a number for better clarity.
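One possible way to produce that plot (a sketch, with the numbering simply following the column order of the data set):
# Plot each feature's weight on PC1 against its weight on PC2,
# annotating each point with the feature's column index
plt.figure(figsize=(8, 6))
plt.scatter(pca_model.components_[0], pca_model.components_[1])
for i in range(len(df.columns)):
    plt.annotate(str(i), (pca_model.components_[0][i], pca_model.components_[1][i]))
plt.xlabel('PC1 weight')
plt.ylabel('PC2 weight')
plt.show()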
We see that PC1 puts much weight on the concavity variables 6, 7, 26, and 27 and much less importance on 11, 14, and 18 (texture, smoothness, and symmetry error). Variables such as mean radius, mean perimeter, mean area, worst radius, worst perimeter, and worst area are located close to each other, showing a clear and expected correlation (points 0, 2, 3, 20, 22, 23).
It is interesting to note that these variables are far from, and therefore largely uncorrelated with, the smoothness- and symmetry-related variables (28, 24, 8, 4). An oncology expert would probably find additional insights in these relationships between features. Generally speaking, as I explored in my article “Six critical rules for Data Science organizations design”, Data Scientists should never work in isolation from content experts; doing so is by far the most common mistake in many organizations.
Should we add additional dimensions to our PCA?
By running PCA with n_components=30, we can get an array with the explained variance contributed by each additional dimension. The cumulative sum of these contributions tells us how much of the overall information is retained as dimensions are added. A red line marks the choice of two axes used in this analysis, while the curve flattens towards one, meaning that the full information is kept only when the number of axes equals the number of initial features.
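A sketch of that cumulative curve, reusing the scaled data from above:
import numpy as np

# Fit PCA with all 30 components and accumulate the explained variance
pca_full = PCA(n_components=30)
pca_full.fit(scaled_X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

plt.figure(figsize=(8, 5))
plt.plot(range(1, 31), cumulative, marker='o')
plt.axvline(x=2, color='red')  # the two axes chosen in this analysis
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance')
plt.show()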
We have seen how two dimensions keep the benign and malignant tumor groups well separated, so in this case many supervised classification models, such as a Support Vector Machine (SVM), may work really well after PCA with n_components=2.
However, in other cases, you may want to use more axes to retain more information prior to feeding the transformed data into a machine learning engine.
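As an illustration of that workflow (a sketch, not a tuned model), scaling, PCA, and an SVM can be chained in a single Scikit-Learn pipeline, with n_components left as the knob that controls how much information is retained:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Scale, reduce to two components, then classify with an SVM
clf = make_pipeline(StandardScaler(), PCA(n_components=2), SVC())
scores = cross_val_score(clf, df, labels, cv=5)
print(scores.mean())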
Conclusions
We have seen how PCA is a great tool to simplify data sets with many features. During Exploratory Data Analysis with many variables involved, it is sometimes impossible and futile to try to analyze dozens of scatterplots, while PCA significantly simplifies the task by reducing the data to its most relevant dimensions. The general idea behind the technique is that not all features add significant information, especially when some of them are heavily correlated. Inspecting the weights of the loading vectors shines a powerful light on these correlations and can bring additional insights to content experts’ understanding. Once the data are understood, PCA can be used as a very effective component of a Machine Learning pipeline, coupled with most supervised classification algorithms.