The basic logic of a one-way ANOVA is as follows: we take independent random samples from each group, compute the sample mean of each group, and then compare the variation of the sample means between the groups with the variation within the groups. Finally, based on a test statistic, we decide whether the means of all groups are equal or not.

Based on that logic we need quantitative **measures of
variability**. Therefore, we partition the total variability into
two segments: the **between group variability** and the
**within group variability**.

We introduce three quantitative measures of variation:

- Sum of squares total (SST)
- Sum of squares groups (SSG)
- Sum of squares error (SSE)

The **sum of squares total (SST)** is a measure of the
total variability of the variable. It is given by

\[SST = \sum_{i=1}^n(x_i-\bar x)^2\text{,}\]

where \(x_i\) denotes the \(i\)-th of the \(n\) observations pooled over all groups and \(\bar x\) the overall mean of all observations.
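The SST formula can be sketched in a few lines of Python. The data values and variable names below are made up for illustration and are not part of the original material.

```python
# Hypothetical data: three groups with invented measurements.
groups = {
    "A": [1, 3],
    "B": [4, 5, 6, 5],
    "C": [7, 9],
}

# Pool all observations, ignoring group membership.
observations = [x for values in groups.values() for x in values]
n = len(observations)
overall_mean = sum(observations) / n

# SST: sum of squared deviations of every observation from the overall mean.
sst = sum((x - overall_mean) ** 2 for x in observations)

print(f"n = {n}, overall mean = {overall_mean}, SST = {sst}")
```

Note that the group labels play no role here: the SST only depends on the pooled observations.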

The **sum of squares groups (SSG)** is a measure of the
variability between groups. It is the sum of the squared deviations of
the group means from the overall mean, each weighted by the sample size of its group:

\[SSG = \sum_{j=1}^k n_j(\bar x_j-\bar x)^2\text{.}\]

Here, \(k\) denotes the number of groups, \(n_j\) the sample size of group \(j\), \(\bar x_j\) the mean of group \(j\) and \(\bar x\) the overall mean of all observations.
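A minimal Python sketch of the SSG formula may look as follows. The data and all names are hypothetical and chosen only for illustration.

```python
# Hypothetical data: three groups with invented measurements.
groups = {
    "A": [1, 3],
    "B": [4, 5, 6, 5],
    "C": [7, 9],
}

n = sum(len(values) for values in groups.values())
overall_mean = sum(sum(values) for values in groups.values()) / n

# SSG: squared deviation of each group mean from the overall mean,
# weighted by the number of observations in that group.
ssg = 0.0
for name, values in groups.items():
    n_j = len(values)
    mean_j = sum(values) / n_j
    ssg += n_j * (mean_j - overall_mean) ** 2

print(f"SSG = {ssg}")
```

In this made-up data set the mean of group B coincides with the overall mean, so group B contributes nothing to the SSG, no matter how large it is.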

Finally, the **sum of squares error (SSE)** is a measure
of the variability within groups. It is associated with the unexplained
variability, which is the variability that cannot be explained by the
group variable. The sum of squares error is given by

\[SSE = \sum_{j=1}^k (n_j-1)s_j^2\text{,}\]

where \(n_j\) denotes the sample size of group \(j\) and \(s_j^2\) the sample variance of group \(j\). Since \((n_j-1)s_j^2\) equals the sum of squared deviations of the observations in group \(j\) from their group mean \(\bar x_j\), the \(SSE\) adds up the squared deviations within each group. Alternatively, one may calculate the \(SSE\) as the difference between \(SST\) and \(SSG\):

\[SSE = SST-SSG\text{.}\]
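Both routes to the SSE can be checked against each other numerically. The following Python sketch uses hypothetical data and names; `statistics.variance` from the standard library computes the sample variance with the \(n_j-1\) denominator.

```python
from statistics import variance

# Hypothetical data: three groups with invented measurements.
groups = {
    "A": [1, 3],
    "B": [4, 5, 6, 5],
    "C": [7, 9],
}

observations = [x for values in groups.values() for x in values]
n = len(observations)
overall_mean = sum(observations) / n

# Route 1: SSE from the group sizes and sample variances.
sse_from_variances = sum(
    (len(values) - 1) * variance(values) for values in groups.values()
)

# Route 2: SSE as the difference SST - SSG.
sst = sum((x - overall_mean) ** 2 for x in observations)
ssg = sum(
    len(values) * (sum(values) / len(values) - overall_mean) ** 2
    for values in groups.values()
)
sse_from_difference = sst - ssg

print(f"SSE (variances)  = {sse_from_variances}")
print(f"SSE (difference) = {sse_from_difference}")
```

Up to floating-point rounding, the two results agree, which illustrates the partition \(SST = SSG + SSE\).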

So far, we have calculated measures of the total variability \((SST)\), the between group variability \((SSG)\) and the within group variability \((SSE)\). In the next step, in order to obtain an average variability, we scale these measures by their degrees of freedom \((df)\), which depend on the sample size and the number of groups.

The **degrees of freedom** are defined for each
partition of variability (total, between group and within group
variability).

- Total variability

\[df_T = n-1\text{,}\]

where \(n\) denotes the overall sample size.

- Between group variability

\[df_G=k-1\text{,}\]

where \(k\) denotes the number of groups.

- Within group variability

\[df_E = n-k\text{.}\]
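As a small Python sketch with hypothetical group sizes, the three degrees of freedom can be computed as follows. The final check illustrates that the total degrees of freedom split into the between group and within group parts.

```python
# Hypothetical group sizes n_j for k = 3 groups (invented for illustration).
group_sizes = {"A": 2, "B": 4, "C": 2}

n = sum(group_sizes.values())  # overall sample size
k = len(group_sizes)           # number of groups

df_total = n - 1
df_groups = k - 1
df_error = n - k

# The degrees of freedom add up: (k - 1) + (n - k) = n - 1.
assert df_total == df_groups + df_error

print(f"df_T = {df_total}, df_G = {df_groups}, df_E = {df_error}")
```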

Now, we may calculate the **mean squares** for the
between group variability and the within group variability. Each mean
square is the corresponding sum of squares divided by its associated
degrees of freedom.

- Mean between group variability

\[MSG = \frac{SSG}{df_G}\]

- Mean within group variability

\[MSE = \frac{SSE}{df_E}\]
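The two mean squares are plain divisions, as this Python sketch shows. The sums of squares and sample sizes are hypothetical values chosen only for illustration.

```python
# Hypothetical inputs (invented for illustration).
ssg = 36.0  # sum of squares groups
sse = 6.0   # sum of squares error
n = 8       # overall sample size
k = 3       # number of groups

df_groups = k - 1
df_error = n - k
if df_groups < 1 or df_error < 1:
    raise ValueError("need at least two groups and more observations than groups")

# Mean squares: each sum of squares divided by its own degrees of freedom.
msg = ssg / df_groups
mse = sse / df_error

print(f"MSG = {msg}, MSE = {mse}")
```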

Finally, we compare the mean variation between the groups, \(MSG\), to the mean variation within the groups, \(MSE\). To do so, we calculate the ratio of the two, which is denoted as \(F\):

\[F= \frac{MSG}{MSE}\]
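The full chain from raw data to the \(F\)-statistic can be sketched in Python as one small function. The function name `f_statistic` and the data are hypothetical; this is an illustration of the formulas, not a replacement for a statistics package.

```python
def f_statistic(groups):
    """Return (F, df_G, df_E) for a dict mapping group names to observations."""
    n = sum(len(values) for values in groups.values())
    k = len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more observations than groups")

    overall_mean = sum(sum(values) for values in groups.values()) / n

    ssg = 0.0  # between group sum of squares
    sse = 0.0  # within group sum of squares
    for values in groups.values():
        mean_j = sum(values) / len(values)
        ssg += len(values) * (mean_j - overall_mean) ** 2
        sse += sum((x - mean_j) ** 2 for x in values)

    df_groups, df_error = k - 1, n - k
    msg = ssg / df_groups
    mse = sse / df_error
    return msg / mse, df_groups, df_error


# Hypothetical data: three groups with invented measurements.
example = {"A": [1, 3], "B": [4, 5, 6, 5], "C": [7, 9]}
f_value, df_g, df_e = f_statistic(example)
print(f"F = {f_value:.2f} with df = ({df_g}, {df_e})")
```

The sketch does not guard against the case in which all observations within every group are identical, where \(MSE = 0\) and the ratio is undefined.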

If the null hypothesis of equal group means holds, the \(F\)-statistic follows the
**\(F\)-distribution**
(named after Sir Ronald A. Fisher) with

\[df = (k-1, n-k)\text{,}\]

where \(k\) corresponds to the
number of groups and \(n\) to the
overall sample size. Large \(F\)-values indicate that the variation
between the group sample means is large relative to the variation within
the groups. Further, we may calculate the *p*-value for any given
\(F\)-value as the probability of observing a value at least as large
under this distribution. If the *p*-value is
small, the data provide convincing evidence that at least one pair of
group means differs. If the *p*-value is
large, the data do not provide convincing evidence that at least one
pair of group means differs, and the observed
differences in sample means are consistent with sampling variability (or
chance).
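To make the *p*-value concrete, the following Python sketch evaluates the upper tail of the \(F\)-distribution for one special case only. The Python standard library has no \(F\)-distribution, but for exactly three groups (\(k-1 = 2\)) the upper tail probability reduces to the closed form \(P(F > f) = \left(1 + \frac{2f}{n-k}\right)^{-(n-k)/2}\). The function name and the input values are hypothetical.

```python
def f_upper_tail_two_numerator_df(f_value, df_error):
    """Upper tail P(F > f_value) of the F-distribution with df = (2, df_error).

    Assumption of this sketch: the numerator degrees of freedom are exactly 2,
    i.e. there are three groups. The formula is NOT valid for other k.
    """
    if f_value < 0 or df_error <= 0:
        raise ValueError("f_value must be >= 0 and df_error must be > 0")
    return (1 + 2 * f_value / df_error) ** (-df_error / 2)


# Hypothetical result of a one-way ANOVA with k = 3 groups and n = 8.
f_value = 15.0
df_error = 5
p_value = f_upper_tail_two_numerator_df(f_value, df_error)
print(f"p-value = {p_value:.4f}")
```

For any other number of groups a statistics package is needed; in R, for instance, the upper tail is available through `pf(f, df1, df2, lower.tail = FALSE)`.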

As seen above, the one-way analysis of variance includes several
analytic steps. Therefore, a common way to display a one-way ANOVA is
the so-called **one-way ANOVA table**. The general design
of such a table is shown below:

\[ \begin{array}{|l|c|c|c|c|c|} \hline
\text{Source} & df & \text{Sum of Squares }(SS) & \text{Mean Squares }(MS) & F\text{-statistic} & p\text{-value}\\ \hline
\text{Group/Class} & k-1 & SSG & MSG=\frac{SSG}{k-1} & F = \frac{MSG}{MSE} & p\\
\text{Error/Residuals} & n-k & SSE & MSE=\frac{SSE}{n-k} & & \\ \hline
\text{Total} & n-1 & SST & & & \\ \hline
\end{array} \]
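The layout of the table can be reproduced with a short Python sketch. The data are hypothetical and contain exactly three groups, so that the *p*-value can be obtained from the closed form \(\left(1 + \frac{2F}{n-k}\right)^{-(n-k)/2}\), which holds only for \(k-1 = 2\); all names are made up for illustration.

```python
# Hypothetical data with exactly three groups (an assumption of this sketch).
groups = {"A": [1, 3], "B": [4, 5, 6, 5], "C": [7, 9]}

n = sum(len(v) for v in groups.values())
k = len(groups)
overall_mean = sum(sum(v) for v in groups.values()) / n

# Sums of squares.
ssg = sum(len(v) * (sum(v) / len(v) - overall_mean) ** 2 for v in groups.values())
sse = sum((x - sum(v) / len(v)) ** 2 for v in groups.values() for x in v)
sst = ssg + sse

# Degrees of freedom, mean squares and F-statistic.
df_g, df_e, df_t = k - 1, n - k, n - 1
msg, mse = ssg / df_g, sse / df_e
f_value = msg / mse

# Closed-form upper tail of the F-distribution, valid only for df_g == 2.
if df_g != 2:
    raise ValueError("the closed-form p-value below requires exactly three groups")
p_value = (1 + 2 * f_value / df_e) ** (-df_e / 2)

rows = [
    ("Group", df_g, ssg, f"{msg:.2f}", f"{f_value:.2f}", f"{p_value:.4f}"),
    ("Error", df_e, sse, f"{mse:.2f}", "", ""),
    ("Total", df_t, sst, "", "", ""),
]

print(f"{'Source':<8}{'df':>4}{'SS':>8}{'MS':>8}{'F':>8}{'p':>9}")
for source, df, ss, ms, f, p in rows:
    print(f"{source:<8}{df:>4}{ss:>8.2f}{ms:>8}{f:>8}{p:>9}")
```

As in the general table, the \(F\)-statistic and the *p*-value appear only in the group row, and the total row is the sum of the two rows above it.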

**Citation**

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: *Hartmann,
K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis
using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.*