Linear feature scaling


Linear feature scaling comprises three different operations on variable spaces: translation (shifting), scaling (compression/stretching) and rotation (orientation change). These operations are generally used in machine learning to adjust distributions or metric units with respect to central tendency (translation) and/or variance (scaling).

Linear feature scaling does not change the main properties of the original distribution, e.g.:
- multi-modal distributions remain multi-modal
- skewed distributions do not become symmetric
- non-normal frequency distributions stay non-normal
- singly or doubly bounded scales merely get new boundaries
- a singular covariance matrix stays singular

Thus, methodological or algebraic limitations are preserved, even if signs or scales change!
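The preservation of shape can be checked numerically. The following is a minimal sketch in Python (standard library only), assuming a made-up right-skewed sample and the population skewness as the shape measure; the helper names `skewness` and `linear_transform` are illustrative and not part of the original material.

```python
import statistics


def skewness(values):
    """Population skewness: mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def linear_transform(values, s, c):
    """Apply x -> s * x + c to every value."""
    return [s * v + c for v in values]


# Hypothetical right-skewed sample (invented for illustration)
x = [1.0, 1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 9.5]

# Stretch by s = 2.5 and shift by c = -10
y = linear_transform(x, s=2.5, c=-10.0)

# Mirroring at the origin (s = -1) only flips the sign of the skewness
m = linear_transform(x, s=-1.0, c=0.0)

print("skewness of x:       ", skewness(x))
print("skewness of 2.5x-10: ", skewness(y))
print("skewness of -x:      ", skewness(m))
```

Under these assumptions the first two values agree up to floating-point rounding, and the third differs only in sign: the skewed sample stays skewed.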

This short video visualizes linear scaling:



Translation

On a 1-D feature scale, translation means shifting the origin by a constant \(c\): \[x\to x'=x+c,\quad c\in\mathbb R\]
For \(c>0\), \(x\) shifts towards \(+\infty\) (right shift) and for \(c<0\) towards \(-\infty\) (left shift).

A simple example is the conversion of temperature units between Kelvin \(x\in\mathbb R_+\) and degrees Celsius \(y\in]-273.15,+\infty[_\mathbb R\):
\[ x[K] \to y[°C]=x-273.15\] or the back transformation: \[ y[°C] \to x[K]=y+273.15\]

Note: Translation only changes the origin; it does not remove scale limits. Thus, translation cannot solve the algebraic limitations of bounded scales. Translation is an additive transformation!
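The temperature example can be sketched in a few lines of Python. The function names and the sample temperatures are assumptions made for illustration only.

```python
KELVIN_OFFSET = 273.15  # translation constant c between Kelvin and Celsius


def kelvin_to_celsius(x_kelvin):
    """Translation x -> y = x - 273.15."""
    return x_kelvin - KELVIN_OFFSET


def celsius_to_kelvin(y_celsius):
    """Back transformation y -> x = y + 273.15."""
    return y_celsius + KELVIN_OFFSET


# Hypothetical temperatures in Kelvin, including the lower bound 0 K
temps_k = [0.0, 273.15, 293.15, 373.15]
temps_c = [kelvin_to_celsius(t) for t in temps_k]

# The lower bound has moved from 0 to -273.15, but it is still a bound
print("lower bound in K: ", min(temps_k))
print("lower bound in °C:", min(temps_c))

# Differences between observations are unchanged by the shift
print("difference in K: ", temps_k[2] - temps_k[1])
print("difference in °C:", temps_c[2] - temps_c[1])
```

The sketch illustrates the note above: the shift relocates the boundary of the scale without removing it.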

Scaling

Scaling means expanding or compressing data spaces by a scalar \(s\in\mathbb R_+\): \[x\to x'=s\cdot x\] For \(s>1\) feature scales are expanded (stretched) and for \(0<s<1\) they are compressed. A simple example is the transformation of irradiance [W/m^2] to an 8-bit brightness value \([0,255]_{\mathbb N_{0}}\):

\[E_e[W/m^2]\to BV=\left\lfloor\frac{255}{\max(E_e)}E_{e}\right\rfloor\] Here, a positive real variable \(x\in\mathbb R_+\) is transformed to integers \(x'\in [0,255]_{\mathbb N_{0}}\) and thus becomes increasingly limited concerning algebraic and methodological approaches.

Further simple examples are currency exchange rates, kilometers to miles, grams to kilograms, etc.

Like translation, scaling (stretching/compression) cannot solve the algebraic limitations of bounded scales! Stretching/compression is a multiplicative transformation!
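The irradiance example can be sketched as follows. The irradiance values are invented, and the function name `to_brightness` is an assumption for illustration.

```python
import math


def to_brightness(irradiance):
    """Scale irradiance values to 8-bit brightness values in 0..255."""
    e_max = max(irradiance)
    # Multiplying before dividing keeps the arithmetic exact for the
    # sample values below, so the maximum maps to exactly 255.
    return [math.floor(255 * e / e_max) for e in irradiance]


# Hypothetical irradiance measurements in W/m^2
irradiance = [0.0, 125.0, 250.0, 500.0, 1000.0]
brightness = to_brightness(irradiance)

print(brightness)  # [0, 31, 63, 127, 255]
```

The lower bound 0 stays at 0 after scaling, and the floor operation discards the fractional part (for example, 31.875 becomes 31), which is why the resulting integer scale is more restricted than the original real-valued one.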

Rotation

On a 1-D feature scale, only a 180° rotation is possible, using \(s=-1\) (mirroring at the origin).



A linear transformation is the combination of translation, stretching and rotation. It can be expressed simply as: \[ x\to y= s\cdot x+c\]
A huge number of linear transformations are used for different purposes. Here, we present only the most common ones.

Standardization (z-scoring)

In statistics, standardization (or z-transformation) is the most common linear transformation, with \(s=\frac{1}{\sigma}\) and \(c=-\frac{\mu}{\sigma}\):

\[x\to z=\frac{x-\mu}{\sigma}=\frac{1}{\sigma}\cdot x-\frac{\mu}{\sigma}\] Due to its properties, such as mean \(\bar x_z=0\) and variance/standard deviation \(s^2_z=s_z=1\), the z-transformation is used for comparing variables independently of mean and variance. However, standardization preserves the shape properties of the original distribution.

Note: If x is non-normal, z(x) will be non-normal as well!
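A minimal sketch of the z-transformation in Python, assuming the population standard deviation for \(\sigma\) (the sample standard deviation would change the scaling factor slightly) and an invented data set:

```python
import statistics


def standardize(values):
    """z-transformation: z = (x - mu) / sigma."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population SD
    return [(v - mu) / sigma for v in values]


# Hypothetical sample with mean 5 and population standard deviation 2
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(x)

print("z-scores:", z)
print("mean of z:", statistics.fmean(z))
print("SD of z:  ", statistics.pstdev(z))
```

The standardized values have mean 0 and standard deviation 1, but their ordering and relative spacing, and hence the shape of the distribution, are the same as in `x`.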


Special Example of standardization: Stable Isotopes

Isotope ratios play an important role in paleoclimatology, sedimentology, biology and forensics, among other related disciplines. Thus, meaningful statistical treatment of isotope data is a crucial task for many applications.
Let’s have a look at the example of the stable oxygen isotope ratio, abbreviated as \(\delta^{18}O\):
\[\delta^{18}O=\frac{\left ( \frac{^{18}O}{^{16}O}\right )_{sample}-\left ( \frac{^{18}O}{^{16}O}\right )_{standard}}{\left ( \frac{^{18}O}{^{16}O}\right )_{standard}}\cdot 1000 [^0/_{00}]\]
Hence, we have a measured variable \(x=\left ( \frac{^{18}O}{^{16}O}\right )_{sample}\) transformed by a translation constant of \(c=-1000\) and a scaling factor:

\[s=\frac{1000}{\left ( \frac{^{18}O}{^{16}O}\right )_{standard}} [^0/_{00}]\] We may rewrite the definition above in terms of linear transformation: \[\delta^{18}O=s\cdot x+c=\frac{1000}{\left ( \frac{^{18}O}{^{16}O}\right )_{standard}}\left ( \frac{^{18}O}{^{16}O}\right )_{sample}-1000\]
Important limitations in terms of statistics:
From an algebraic and thus statistical point of view, the definition of \(\delta^{18}O\) comprises a very dangerous transformation:

1. A small, naturally positive ratio is shifted, so that negative values become possible!

2. Any ratio is algebraically centred at 1 (the neutral element of multiplication): \(\frac{a}{b}=1 \iff a=b\), but here the center is shifted to 0, the neutral element (center) of an additive space, while the variable remains meaningfully multiplicative!

3. This kind of definition for an isotope ratio suggests a variable space related to the field \((\mathbb R,+,\cdot)\), but it is definitely not!

However, these problems stay buried as long as the real-world data lie relatively close to the center of the potential data range. Thus, in most cases these variables will appear quasi-normal. But we should keep these important limitations in mind when we use, e.g., distance-based classification algorithms!
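The equivalence between the \(\delta^{18}O\) definition and its linear form \(s\cdot x+c\) can be checked with a short Python sketch. The reference ratio `R_STANDARD` is an assumed value of the order commonly quoted for ocean-water standards, and the sample ratios are invented; neither comes from the text above.

```python
# Assumed 18O/16O ratio of the reference standard (illustrative value)
R_STANDARD = 0.0020052


def delta_definition(r_sample, r_standard=R_STANDARD):
    """delta = (R_sample - R_standard) / R_standard * 1000, in per mille."""
    return (r_sample - r_standard) / r_standard * 1000


def delta_linear(r_sample, r_standard=R_STANDARD):
    """Same quantity written as a linear transformation s * x + c."""
    s = 1000 / r_standard  # scaling factor
    c = -1000              # translation constant
    return s * r_sample + c


# Hypothetical measured ratios: below, equal to and above the standard
ratios = [0.0019852, 0.0020052, 0.0020152]

for r in ratios:
    print(r, delta_definition(r), delta_linear(r))
```

All measured ratios are strictly positive, yet the first one yields a negative \(\delta^{18}O\) value, and a sample equal to the standard is mapped to 0 rather than to the multiplicative center 1. This illustrates points 1 and 2 above.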

Further feature scaling procedures

Min-Max feature scaling

The min-max normalization is a special case of the weight transformation after Klovan & Imbrie, 1971: \[w_i=\frac{x_i-a}{b-a}\] The min-max normalization is given by \(a=\min(x) \land b=\max(x)\). It transforms \(x\in[\min(x),\max(x)]_\mathbb R \to w\in [0,1]_\mathbb R\).

A further variation of the weight transformation is provided by Miesch, 1981.

If centering the feature space at 0 is desired, the related mean normalization can be applied:

\[x'=\frac{x-\bar x}{\max(x)-\min(x)}\] \[x\in[\min(x),\max(x)]_\mathbb R \to x'\in [-1,1]_\mathbb R\] or the median normalization: \[x'=\frac{x-\operatorname{median}(x)}{\max(x)-\min(x)}\] with a similar resulting feature scale.

Replacing the range with the interquartile range leads to the median-quartile normalization:

\[x'= \frac{x-Q_{50}}{Q_{75}-Q_{25}}\]
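The normalizations of this section can be compared side by side in a short Python sketch. The sample is invented, and the quartiles are computed with the "inclusive" method of `statistics.quantiles`, which is only one of several common quantile definitions and therefore an assumption.

```python
import statistics


def min_max(values):
    """Min-max normalization: maps the sample onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def mean_normalize(values):
    """Mean normalization: centered at the mean, scaled by the range."""
    mean = statistics.fmean(values)
    span = max(values) - min(values)
    return [(v - mean) / span for v in values]


def median_quartile(values):
    """Median-quartile normalization: (x - Q50) / (Q75 - Q25)."""
    # Assumption: "inclusive" quantile method; other methods give
    # slightly different quartiles for small samples.
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - q50) / (q75 - q25) for v in values]


# Hypothetical sample with one large value
x = [1.0, 2.0, 3.0, 4.0, 10.0]

print("min-max:        ", min_max(x))
print("mean norm.:     ", mean_normalize(x))
print("median-quartile:", median_quartile(x))
```

In all three results the large value 10 remains far from the rest of the sample: the normalizations change location and spread, not the shape of the distribution.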

Many more linear scaling approaches are recommended for different sample distributions and applications, but none of them can solve challenges such as non-normality, missing Euclidean properties, non-linear measures and others.
In most cases, real-world data have to be treated by non-linear transformations before appropriate statistical methods can be applied to obtain meaningful results.


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.