In this exercise we want to explore a climate data set from France. First, we download the data and get an overview.
clim <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/Climfrance.csv", sep = ";")
str(clim)
## 'data.frame': 36 obs. of 12 variables:
## $ station : chr "Marseille" "Nice" "Cherbourg" "Bastia (Korsika)" ...
## $ altitude : chr "3" "5" "8" "10" ...
## $ lat : num 43.5 43.7 49.6 42.5 47.2 ...
## $ lon : num 5.21 7.2 -1.63 9.48 -1.61 -1.71 -3.21 2.86 3.1 -0.7 ...
## $ t_mean : num 14.2 14.8 11.4 14.9 11.7 11.3 12.3 12.5 9.7 12.3 ...
## $ t_max : num 24.4 23.7 19.2 25 22.6 22.1 19.9 26.2 21.3 24.9 ...
## $ t_min : num 2.3 7.1 4.3 7.5 1.6 0.6 4.5 5.4 -0.2 0.6 ...
## $ relhumidity : int 68 72 80 70 82 82 89 66 82 81 ...
## $ p_mean : chr "546" "862" "931" "735" ...
## $ p_max24h : int 86 147 62 201 62 45 81 186 49 54 ...
## $ rainydays : int 76 86 171 91 172 168 147 85 171 162 ...
## $ sun_shine_hperyrs: int 2764 2775 1608 2603 1952 1805 2111 2644 1574 2052 ...
The data set consists of data for 36 climate stations in France. The data includes:
There is a problem with the variables “altitude” and “p_mean”.
Obviously R
cannot interpret the thousand separator. We
have to clean up with gsub()
. The function
as.numeric()
transforms the factor to numeric. You can also
use as.integer()
if there are no decimals included:
clim$altitude <- as.numeric(gsub(",", "", clim$altitude))
clim$p_mean <- as.numeric(gsub(",", "", clim$p_mean))
str(clim)
## 'data.frame': 36 obs. of 12 variables:
## $ station : chr "Marseille" "Nice" "Cherbourg" "Bastia (Korsika)" ...
## $ altitude : num 3 5 8 10 26 35 37 43 44 47 ...
## $ lat : num 43.5 43.7 49.6 42.5 47.2 ...
## $ lon : num 5.21 7.2 -1.63 9.48 -1.61 -1.71 -3.21 2.86 3.1 -0.7 ...
## $ t_mean : num 14.2 14.8 11.4 14.9 11.7 11.3 12.3 12.5 9.7 12.3 ...
## $ t_max : num 24.4 23.7 19.2 25 22.6 22.1 19.9 26.2 21.3 24.9 ...
## $ t_min : num 2.3 7.1 4.3 7.5 1.6 0.6 4.5 5.4 -0.2 0.6 ...
## $ relhumidity : int 68 72 80 70 82 82 89 66 82 81 ...
## $ p_mean : num 546 862 931 735 741 669 637 639 637 900 ...
## $ p_max24h : int 86 147 62 201 62 45 81 186 49 54 ...
## $ rainydays : int 76 86 171 91 172 168 147 85 171 162 ...
## $ sun_shine_hperyrs: int 2764 2775 1608 2603 1952 1805 2111 2644 1574 2052 ...
We want to explain the temperature by spatial attributes using multiple linear regression. Let us plot the stations on a map to get an idea of their locations.
G1 <- raster::getData(country = "France", level = 1)
library(ggplot2)
ggplot() +
geom_polygon(
data = G1,
aes(x = long, y = lat, group = group),
colour = "grey10", fill = "#fff7bc"
) +
geom_point(
data = clim,
aes(x = lon, y = lat),
alpha = .5,
size = 2
) +
theme_bw() +
xlab("Longitude") +
ylab("Latitude") +
coord_map()
## Regions defined for each Polygons
Exercise 1: Test all three spatial attributes, i.e. latitude, longitude and altitude, as explanatory variables for the mean annual temperature.
First, we exclude the two high mountain extremes.
climfrar <- clim[1:34, ]
### Your code here ..
Model:
We use the lm()
function to build the multiple linear
regression model.
model <- lm(t_mean ~ altitude + lat + lon, data = climfrar)
model
##
## Call:
## lm(formula = t_mean ~ altitude + lat + lon, data = climfrar)
##
## Coefficients:
## (Intercept) altitude lat lon
## 37.265036 -0.006414 -0.533960 0.032101
Consequently, the linear model can be written as
\[\text{Mean Annual Temperature} = 37.265 + (-0.006) \times \text{Altitude} + (-0.534) \times \text{Latitude} +0.032 \times \text{Longitude}{.} \]
summary(model)
##
## Call:
## lm(formula = t_mean ~ altitude + lat + lon, data = climfrar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76492 -0.32755 0.04337 0.24787 2.58927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2650364 2.6220099 14.212 7.29e-15 ***
## altitude -0.0064139 0.0008688 -7.383 3.17e-08 ***
## lat -0.5339603 0.0557546 -9.577 1.24e-10 ***
## lon 0.0321010 0.0395728 0.811 0.424
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7308 on 30 degrees of freedom
## Multiple R-squared: 0.8329, Adjusted R-squared: 0.8162
## F-statistic: 49.84 on 3 and 30 DF, p-value: 9.112e-12
The test shows that the mean annual temperature in France depends
significantly on altitude and latitude with p-values of \(3.17 \times 10⁻⁸\) and \(1.24 \times 10⁻¹⁰\) respectively, whereas
longitude does not seem to be relevant.
Exercise 2: Exclude the non-significant variable and predict the temperature for Mont-Ventoux and Pic-du-midi. Compare the measured means concerning the prediction confidence bounces.
## Your code here ..
Model:
model_t_mean <- lm(t_mean ~ altitude + lat, data = climfrar)
summary(model_t_mean)
##
## Call:
## lm(formula = t_mean ~ altitude + lat, data = climfrar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.79206 -0.27571 -0.00556 0.30536 2.71871
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.9147567 2.4828724 15.27 5.68e-16 ***
## altitude -0.0062643 0.0008443 -7.42 2.34e-08 ***
## lat -0.5465325 0.0532610 -10.26 1.72e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7268 on 31 degrees of freedom
## Multiple R-squared: 0.8292, Adjusted R-squared: 0.8182
## F-statistic: 75.26 on 2 and 31 DF, p-value: 1.268e-12
So, the linear model for the mean annual temperature is:
\[\text{Mean Annual Temperature} = 37.915 + (-0.006) \times \text{Altitude} + (-0.547) \times \text{Latitude}\]
pred_t_mean <- predict(model_t_mean, newdata = list("altitude" = c(1212, 2860), "lat" = c(44.2, 42.9)), interval = "p", level = 0.95)
pred_t_mean
## fit lwr upr
## 1 6.165713 3.792341 8.539085
## 2 -3.447331 -8.347964 1.453302
The predicted mean for Mont-Ventoux is \(6.17°C\) with a \(95%\)-prediction-interval of \([3.79°C, 8.54°C]\). Since the measured mean
is \(3.6°C\), our model is not accurate
enough to reproduce the temperature for Mont-Ventoux.
Exercise 3: Evaluate the model results by i) a 3D-scatterplot and ii) by the summary output.
## Your code here ..
3D scatter plot:
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 4.2.3
scatter_3d <- with(climfrar, scatterplot3d(altitude, lat, t_mean,
pch = 16, highlight.3d = TRUE,
angle = 45,
))
scatter_3d$plane3d(model_t_mean)
By inspecting the 3D scatter plot we see that most of the point data is
very well approximated by the plane except for data points with low
altitude and low latitude. This indicates that the linear model does not
fit all of the data well.
summary(model_t_mean)
##
## Call:
## lm(formula = t_mean ~ altitude + lat, data = climfrar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.79206 -0.27571 -0.00556 0.30536 2.71871
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.9147567 2.4828724 15.27 5.68e-16 ***
## altitude -0.0062643 0.0008443 -7.42 2.34e-08 ***
## lat -0.5465325 0.0532610 -10.26 1.72e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7268 on 31 degrees of freedom
## Multiple R-squared: 0.8292, Adjusted R-squared: 0.8182
## F-statistic: 75.26 on 2 and 31 DF, p-value: 1.268e-12
The minimum residuum from the regression hyper-plane is \(-1.79°C\) and the maximum residuum is \(2.72°C\). A good model fit should have a
residual distribution that is symmetrical around zero. In our case, it
is not strongly symmetrical. That means that the model predicts certain
points that are far away from the observed points.
The Residual Standard Error is a measure of the quality of a linear
regression fit. On average, the actual mean temperature can deviate from
the regression plane by \(0.73°C\).
Given that the mean of all data points is \(11.68°C\), the percentage error is \(6.2 \%\).
The adjusted R² value is \(0.82\). That
means that roughly \(82\%\) of the
variance found in the variable “t_mean” can be explained by the
variables latitude and altitude.
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.