In this section we show how to restore the original data. We showcase the procedure on the simplified food-texture data set. First we load the data, extract the two features Oil and Density, and then plot the data.
# Import libraries
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# Load and plot data
food = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/food-texture.csv")
# The data set contains 50 rows (observations) and 2 columns (features).
pca_toy = food[["Oil", "Density"]]
sns.scatterplot(x="Oil", y="Density", data=pca_toy)
<AxesSubplot:xlabel='Oil', ylabel='Density'>
In the next step we center and scale the data and compute the eigenvalues and eigenvectors of the data covariance matrix in order to extract the loading vector of the first principal component.
# Center and scale data
pca_toy_scaled = pca_toy.apply(lambda x: (x - x.mean()) / x.std(), axis=0)
# Calculate eigenvalues and eigenvectors
pca_toy_cov = pca_toy_scaled.cov()
pca_toy_eigenvalues, pca_toy_eigenvectors = np.linalg.eig(pca_toy_cov)
# Extract loading vector
pca_toy_loading = pca_toy_eigenvectors[:, :1]
pca_toy_loading
array([[ 0.70710678], [-0.70710678]])
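A caveat worth knowing: `np.linalg.eig` does not guarantee that the eigenvalues are returned in descending order. For the toy example above the first column happens to correspond to the dominant component, but in general it is safer to sort the eigenpairs explicitly. A minimal sketch, using an illustrative covariance matrix (not the exact covariance of the food-texture data):

```python
import numpy as np

# Illustrative covariance matrix of two standardized, negatively
# correlated features
cov = np.array([[1.0, -0.6],
                [-0.6, 1.0]])

eigenvalues, eigenvectors = np.linalg.eig(cov)

# Sort eigenpairs by descending eigenvalue so that column 0 always
# corresponds to the first principal component
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Loading vector of the first principal component
loading = eigenvectors[:, :1]
print(eigenvalues)  # largest eigenvalue first
```

After sorting, taking the first `k` columns of the eigenvector matrix is guaranteed to select the `k` components with the largest variance.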
Then, for the sake of simplicity, we keep only the first principal component and project the original two-dimensional data onto a one-dimensional feature space by multiplying the data matrix by the loading vector.
$$\mathbf{Z}_{n×k}=\mathbf{X}_{n×d}\mathbf{P}_{d×k}$$where $\mathbf{Z}$ is the projected data, $\mathbf{X}$ is the original data, and $\mathbf{P}$ is the loading vector. The subscript $n$ denotes the number of observations, $d$ the number of features, and $k$ the number of principal components.
# Project data onto one-dimensional feature space
pca_toy_projected = pca_toy_scaled @ pca_toy_loading
# Plot projected data
sns.scatterplot(x=np.arange(50), y=pca_toy_projected[0])
plt.axhline(y=0)
plt.xlabel("Index")
plt.ylabel("Projected data")
plt.show()
Done! We projected our original data set onto a one-dimensional feature space.
Before we recover the original data, remember that we get the original data back exactly only if we use all eigenvectors in the transformation. Since we reduced the number of eigenvectors in the final transformation, the retrieved data has lost some information.
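How much information is lost can be quantified from the eigenvalues, since each eigenvalue measures the variance along its component. A minimal sketch with illustrative eigenvalues (for standardized two-feature data the eigenvalues sum to 2; these are not the exact values of the food-texture data):

```python
import numpy as np

# Illustrative eigenvalues of a standardized two-feature
# covariance matrix (they sum to the number of features)
eigenvalues = np.array([1.6, 0.4])

# Fraction of the total variance captured by each component
explained = eigenvalues / eigenvalues.sum()
print(explained)  # -> [0.8 0.2]
```

With these values, keeping only the first component preserves 80% of the total variance; the remaining 20% is what the reconstruction below cannot recover.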
In order to re-project the data onto the original two-dimensional feature space we rearrange the equation from above
$$\mathbf{Z}=\mathbf{XP}⇒\mathbf{X}=\mathbf{ZP}^{-1}⇒\mathbf{X}=\mathbf{ZP}^{T}$$The inverse of the loading matrix ($\mathbf{P}^{−1}$) equals its transpose ($\mathbf{P}^{T}$), because its columns are the unit eigenvectors of the covariance matrix and thus form an orthonormal basis.
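This orthonormality is easy to verify numerically: for an orthonormal matrix, $\mathbf{P}^{T}\mathbf{P}$ is the identity. A quick check using the loading vectors from the toy example (values copied from the eigenvector output above):

```python
import numpy as np

# Full eigenvector matrix of the toy example; each column is a
# unit eigenvector, so the columns are orthonormal
P = np.array([[ 0.70710678, 0.70710678],
              [-0.70710678, 0.70710678]])

# For an orthonormal matrix, P.T @ P is (numerically) the identity,
# which is why the inverse equals the transpose
identity = P.T @ P
print(np.allclose(identity, np.eye(2)))  # True
```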
# Reconstruct and plot original data and projected data
to_reconstruct = pca_toy_projected @ pca_toy_loading.T
# calculate the mean and standard deviation of the original data
pca_toy_mean = pca_toy.mean()
pca_toy_std = pca_toy.std()
# multiply the reconstructed data with the standard deviation and add the mean
reconstructed = pd.DataFrame()
reconstructed["Oil"] = to_reconstruct[0] * pca_toy_std["Oil"] + pca_toy_mean["Oil"]
reconstructed["Density"] = to_reconstruct[1] * pca_toy_std["Density"] + pca_toy_mean["Density"]
# plot original data and reconstructed data
sns.scatterplot(x="Oil", y="Density", data=pca_toy)
sns.scatterplot(x=reconstructed["Oil"], y=reconstructed["Density"], marker="x", color="r")
plt.show()
Our data is re-projected to 50 rows (observations) and 2 columns (features) - perfect. Note that before comparing the result to the raw data, we first had to invert the standardization procedure, i.e. multiply the reconstructed data by the standard deviation and add the mean.
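The reconstruction error can also be measured directly: the mean squared distance between the original and the reconstructed points corresponds to the variance along the discarded component. A self-contained sketch on synthetic standardized data (so it runs without downloading the food-texture data set; `eigh` is used here because it returns eigenvalues in ascending order):

```python
import numpy as np

# Synthetic standardized data playing the role of the scaled toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigendecomposition of the covariance matrix; np.linalg.eigh sorts
# eigenvalues in ascending order, so the last column is the first PC
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
P = eigenvectors[:, [-1]]

# Project onto one dimension, then re-project back to two
X_hat = (X @ P) @ P.T

# Average squared reconstruction error per observation; it reflects
# the variance along the discarded (smallest) component
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse)
```

For centered data, the sum of squared residuals divided by $n-1$ equals the smallest eigenvalue exactly, which makes this a handy sanity check for any PCA reconstruction.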
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.