30240_Exercises_multiple_linear

In this exercise we want to explore a climate data set from France. First, we download the data and get an overview.

clim <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/Climfrance.csv", sep = ";")
str(clim)

## 'data.frame':    36 obs. of  12 variables:
##  $ station          : chr  "Marseille" "Nice" "Cherbourg" "Bastia (Korsika)" ...
##  $ altitude         : chr  "3" "5" "8" "10" ...
##  $ lat              : num  43.5 43.7 49.6 42.5 47.2 ...
##  $ lon              : num  5.21 7.2 -1.63 9.48 -1.61 -1.71 -3.21 2.86 3.1 -0.7 ...
##  $ t_mean           : num  14.2 14.8 11.4 14.9 11.7 11.3 12.3 12.5 9.7 12.3 ...
##  $ t_max            : num  24.4 23.7 19.2 25 22.6 22.1 19.9 26.2 21.3 24.9 ...
##  $ t_min            : num  2.3 7.1 4.3 7.5 1.6 0.6 4.5 5.4 -0.2 0.6 ...
##  $ relhumidity      : int  68 72 80 70 82 82 89 66 82 81 ...
##  $ p_mean           : chr  "546" "862" "931" "735" ...
##  $ p_max24h         : int  86 147 62 201 62 45 81 186 49 54 ...
##  $ rainydays        : int  76 86 171 91 172 168 147 85 171 162 ...
##  $ sun_shine_hperyrs: int  2764 2775 1608 2603 1952 1805 2111 2644 1574 2052 ...

The data set consists of data for 36 climate stations in France. The data includes:

station name
altitude
latitude
longitude
mean temperature
maximum temperature
minimum temperature
relative humidity
mean precipitation
maximum precipitation within 24 hours
number of rainy days
number of sunshine hours per year

There is a problem with the variables “altitude” and “p_mean”. Obviously R cannot interpret the thousand separator. We have to clean up with gsub(). The function as.numeric() transforms the factor to numeric. You can also use as.integer() if there are no decimals included:

clim$altitude <- as.numeric(gsub(",", "", clim$altitude))
clim$p_mean <- as.numeric(gsub(",", "", clim$p_mean))
str(clim)

## 'data.frame':    36 obs. of  12 variables:
##  $ station          : chr  "Marseille" "Nice" "Cherbourg" "Bastia (Korsika)" ...
##  $ altitude         : num  3 5 8 10 26 35 37 43 44 47 ...
##  $ lat              : num  43.5 43.7 49.6 42.5 47.2 ...
##  $ lon              : num  5.21 7.2 -1.63 9.48 -1.61 -1.71 -3.21 2.86 3.1 -0.7 ...
##  $ t_mean           : num  14.2 14.8 11.4 14.9 11.7 11.3 12.3 12.5 9.7 12.3 ...
##  $ t_max            : num  24.4 23.7 19.2 25 22.6 22.1 19.9 26.2 21.3 24.9 ...
##  $ t_min            : num  2.3 7.1 4.3 7.5 1.6 0.6 4.5 5.4 -0.2 0.6 ...
##  $ relhumidity      : int  68 72 80 70 82 82 89 66 82 81 ...
##  $ p_mean           : num  546 862 931 735 741 669 637 639 637 900 ...
##  $ p_max24h         : int  86 147 62 201 62 45 81 186 49 54 ...
##  $ rainydays        : int  76 86 171 91 172 168 147 85 171 162 ...
##  $ sun_shine_hperyrs: int  2764 2775 1608 2603 1952 1805 2111 2644 1574 2052 ...

We want to explain the temperature by spatial attributes using multiple linear regression. Let us plot the stations on a map to get an idea of their locations.

G1 <- raster::getData(country = "France", level = 1)
library(ggplot2)
ggplot() +
  geom_polygon(
    data = G1,
    aes(x = long, y = lat, group = group),
    colour = "grey10", fill = "#fff7bc"
  ) +
  geom_point(
    data = clim,
    aes(x = lon, y = lat),
    alpha = .5,
    size = 2
  ) +
  theme_bw() +
  xlab("Longitude") +
  ylab("Latitude") +
  coord_map()

## Regions defined for each Polygons

Exercise 1: Test all three spatial attributes, i.e. latitude, longitude and altitude, as explanatory variables for the mean annual temperature.

First, we exclude the two high mountain extremes.

climfrar <- clim[1:34, ]

### Your code here ..

Model:

Show code

We use the lm() function to build the multiple linear regression model.

model <- lm(t_mean ~ altitude + lat + lon, data = climfrar)
model

## 
## Call:
## lm(formula = t_mean ~ altitude + lat + lon, data = climfrar)
## 
## Coefficients:
## (Intercept)     altitude          lat          lon  
##   37.265036    -0.006414    -0.533960     0.032101

Consequently, the linear model can be written as

\[\text{Mean Annual Temperature} = 37.265 + (-0.006) \times \text{Altitude} + (-0.534) \times \text{Latitude} +0.032 \times \text{Longitude}{.} \]

Test of significance:

Show code

summary(model)

## 
## Call:
## lm(formula = t_mean ~ altitude + lat + lon, data = climfrar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76492 -0.32755  0.04337  0.24787  2.58927 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.2650364  2.6220099  14.212 7.29e-15 ***
## altitude    -0.0064139  0.0008688  -7.383 3.17e-08 ***
## lat         -0.5339603  0.0557546  -9.577 1.24e-10 ***
## lon          0.0321010  0.0395728   0.811    0.424    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7308 on 30 degrees of freedom
## Multiple R-squared:  0.8329, Adjusted R-squared:  0.8162 
## F-statistic: 49.84 on 3 and 30 DF,  p-value: 9.112e-12

The test shows that the mean annual temperature in France depends significantly on altitude and latitude with p-values of \(3.17 \times 10⁻⁸\) and \(1.24 \times 10⁻¹⁰\) respectively, whereas longitude does not seem to be relevant.

Exercise 2: Exclude the non-significant variable and predict the temperature for Mont-Ventoux and Pic-du-midi. Compare the measured means concerning the prediction confidence bounces.

## Your code here ..

Model:

Show code

model_t_mean <- lm(t_mean ~ altitude + lat, data = climfrar)
summary(model_t_mean)

## 
## Call:
## lm(formula = t_mean ~ altitude + lat, data = climfrar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.79206 -0.27571 -0.00556  0.30536  2.71871 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.9147567  2.4828724   15.27 5.68e-16 ***
## altitude    -0.0062643  0.0008443   -7.42 2.34e-08 ***
## lat         -0.5465325  0.0532610  -10.26 1.72e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7268 on 31 degrees of freedom
## Multiple R-squared:  0.8292, Adjusted R-squared:  0.8182 
## F-statistic: 75.26 on 2 and 31 DF,  p-value: 1.268e-12

So, the linear model for the mean annual temperature is:

\[\text{Mean Annual Temperature} = 37.915 + (-0.006) \times \text{Altitude} + (-0.547) \times \text{Latitude}\]

Temperature prediction for Mont-Ventoux and Pic-du-midi:

Show code

pred_t_mean <- predict(model_t_mean, newdata = list("altitude" = c(1212, 2860), "lat" = c(44.2, 42.9)), interval = "p", level = 0.95)
pred_t_mean

##         fit       lwr      upr
## 1  6.165713  3.792341 8.539085
## 2 -3.447331 -8.347964 1.453302

The predicted mean for Mont-Ventoux is \(6.17°C\) with a \(95%\)-prediction-interval of \([3.79°C, 8.54°C]\). Since the measured mean is \(3.6°C\), our model is not accurate enough to reproduce the temperature for Mont-Ventoux.
The predicted mean for Pic-du-midi is \(-3.45°C\) with a \(95%\)-prediction-interval of \([-8.35°C, 1.45°C]\). So, our prediction still covers the measured mean of \(-1.2°C\).

Exercise 3: Evaluate the model results by i) a 3D-scatterplot and ii) by the summary output.

## Your code here ..

3D scatter plot:

Show code

library(scatterplot3d)

## Warning: package 'scatterplot3d' was built under R version 4.2.3

scatter_3d <- with(climfrar, scatterplot3d(altitude, lat, t_mean,
  pch = 16, highlight.3d = TRUE,
  angle = 45,
))
scatter_3d$plane3d(model_t_mean)

By inspecting the 3D scatter plot we see that most of the point data is very well approximated by the plane except for data points with low altitude and low latitude. This indicates that the linear model does not fit all of the data well.

Summary output:

Show code

summary(model_t_mean)

## 
## Call:
## lm(formula = t_mean ~ altitude + lat, data = climfrar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.79206 -0.27571 -0.00556  0.30536  2.71871 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.9147567  2.4828724   15.27 5.68e-16 ***
## altitude    -0.0062643  0.0008443   -7.42 2.34e-08 ***
## lat         -0.5465325  0.0532610  -10.26 1.72e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7268 on 31 degrees of freedom
## Multiple R-squared:  0.8292, Adjusted R-squared:  0.8182 
## F-statistic: 75.26 on 2 and 31 DF,  p-value: 1.268e-12

The minimum residuum from the regression hyper-plane is \(-1.79°C\) and the maximum residuum is \(2.72°C\). A good model fit should have a residual distribution that is symmetrical around zero. In our case, it is not strongly symmetrical. That means that the model predicts certain points that are far away from the observed points.
The Residual Standard Error is a measure of the quality of a linear regression fit. On average, the actual mean temperature can deviate from the regression plane by \(0.73°C\). Given that the mean of all data points is \(11.68°C\), the percentage error is \(6.2 \%\).
The adjusted R² value is \(0.82\). That means that roughly \(82\%\) of the variance found in the variable “t_mean” can be explained by the variables latitude and altitude.

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.