`Oil` and `Density`, and thereafter we plot the data.

In [1]:

```
# Import libraries
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# Load and plot data
food = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/food-texture.csv")
#The data set contains 50 rows (observations) and 2 columns (features).
pca_toy = food[["Oil", "Density"]]
sns.scatterplot(x="Oil", y="Density", data=pca_toy)
```

Out[1]:

<AxesSubplot:xlabel='Oil', ylabel='Density'>

In [3]:

```
# Center and scale data
pca_toy_scaled = pca_toy.apply(lambda x: (x - x.mean()) / x.std(), axis=0)
# Calculate eigenvalues and eigenvectors
pca_toy_cov = pca_toy_scaled.cov()
pca_toy_eigenvalues, pca_toy_eigenvectors = np.linalg.eig(pca_toy_cov)
# Extract loading vector
pca_toy_loading = pca_toy_eigenvectors[:, :1]
pca_toy_loading
```

Out[3]:

array([[ 0.70710678],
       [-0.70710678]])
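The entries $\pm 0.7071 \approx 1/\sqrt{2}$ are no accident: for two standardized features the covariance matrix always has the form $\begin{pmatrix}1 & r\\ r & 1\end{pmatrix}$, where $r$ is the correlation, and its unit eigenvectors are $(1,1)/\sqrt{2}$ and $(1,-1)/\sqrt{2}$ regardless of $r$. A minimal check with an arbitrary correlation (synthetic, not the food data set):

```python
import numpy as np

# Covariance matrix of two standardized features: ones on the
# diagonal, their correlation r off the diagonal
r = 0.37  # arbitrary correlation, chosen for illustration
cov = np.array([[1.0, r], [r, 1.0]])

eigenvalues, eigenvectors = np.linalg.eig(cov)

# The eigenvalues are 1 - r and 1 + r ...
print(np.sort(eigenvalues))
# ... and every eigenvector component has magnitude 1/sqrt(2)
print(np.abs(eigenvectors))
```

Only the eigenvalues depend on the strength of the correlation; the directions of the principal axes for two standardized features are fixed at 45°.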

Then, for the sake of simplicity, we pick only the first principal component and project the original two-dimensional data onto a one-dimensional feature space by multiplying the data matrix with the loading vector.

$$\mathbf{Z}_{n×k}=\mathbf{X}_{n×d}\mathbf{P}_{d×k}$$

where $\mathbf{Z}$ is the projected data, $\mathbf{X}$ is the original data, and $\mathbf{P}$ is the loading vector. The subscript $n$ denotes the number of observations, $d$ the number of features, and $k$ the number of principal components.
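The shape bookkeeping in the equation can be verified directly. A quick sketch with synthetic standard-normal data in place of the food data set ($n = 50$, $d = 2$, $k = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))             # n = 50 observations, d = 2 features
P = np.array([[0.70710678], [-0.70710678]])  # d x k loading matrix, k = 1

# Project: (n x d) @ (d x k) -> (n x k)
Z = X @ P
print(Z.shape)  # (50, 1)
```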

In [4]:

```
# Project data onto one-dimensional feature space
pca_toy_projected = pca_toy_scaled @ pca_toy_loading
# Plot projected data
sns.scatterplot(x=np.arange(50), y=pca_toy_projected[0])
plt.axhline(y=0)
plt.xlabel("Index")
plt.ylabel("Projected data")
plt.show()
```

Done! We projected our original data set onto a one-dimensional feature space.

Before we get the original data back, **remember that we recover the original data exactly only if we use all the eigenvectors in the transformation. Since we reduced the number of eigenvectors in the final transformation, the retrieved data has lost some information**.
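This claim is easy to verify numerically. The sketch below uses synthetic standardized data in place of the food data set: projecting with the full eigenvector matrix and back-projecting reproduces the data to machine precision, while keeping only the first eigenvector does not.

```python
import numpy as np

# Synthetic stand-in for the scaled data: 50 observations, 2 features
rng = np.random.default_rng(42)
X = rng.standard_normal((50, 2))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov = np.cov(X, rowvar=False)
_, P = np.linalg.eig(cov)  # full d x d eigenvector matrix

# Using ALL eigenvectors: projection followed by
# back-projection is lossless
X_full = (X @ P) @ P.T
print(np.allclose(X_full, X))   # True

# Using only the first eigenvector: information is lost
P1 = P[:, :1]
X_lossy = (X @ P1) @ P1.T
print(np.allclose(X_lossy, X))  # False
```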

In order to project the data back onto the original two-dimensional feature space, we rearrange the equation from above:

$$\mathbf{Z}=\mathbf{XP}⇒\mathbf{X}=\mathbf{ZP^{-1}}⇒\mathbf{X}=\mathbf{ZP^{T}}$$

The inverse of the loading matrix ($\mathbf{P}^{−1}$) is equal to its transpose ($\mathbf{P}^{T}$), because the columns of the matrix are the unit eigenvectors of the covariance matrix and thus are orthonormal.
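The orthonormality argument can also be checked in a couple of lines. For any symmetric matrix, and every covariance matrix is symmetric, the eigenvector matrix returned by `np.linalg.eig` is (numerically) orthogonal, so $\mathbf{P}\mathbf{P}^{T}=\mathbf{I}$:

```python
import numpy as np

# Eigenvectors of a symmetric (covariance) matrix are orthonormal,
# so P @ P.T is the identity and inv(P) equals P.T
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
_, P = np.linalg.eig(cov)

print(np.allclose(P @ P.T, np.eye(2)))     # True
print(np.allclose(np.linalg.inv(P), P.T))  # True
```

This is what lets us write $\mathbf{X}=\mathbf{ZP}^{T}$ instead of computing an explicit matrix inverse.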

In [5]:

```
# Reconstruct and plot original data and projected data
to_reconstruct = pca_toy_projected @ pca_toy_loading.T
# calculate the mean and standard deviation of the original data
pca_toy_mean = pca_toy.mean()
pca_toy_std = pca_toy.std()
# multiply the reconstructed data with the standard deviation and add the mean
reconstructed = pd.DataFrame()
reconstructed["Oil"] = to_reconstruct[0] * pca_toy_std["Oil"] + pca_toy_mean["Oil"]
reconstructed["Density"] = to_reconstruct[1] * pca_toy_std["Density"] + pca_toy_mean["Density"]
# plot original data and reconstructed data
sns.scatterplot(x="Oil", y="Density", data=pca_toy)
sns.scatterplot(x=reconstructed["Oil"], y=reconstructed["Density"], marker="x", color="r")
plt.show()
```

**Citation**

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.*