In the following exercises, we analyze climate data from the Berlin-Dahlem climate station. You can download the data here or import it directly into your Python environment with the pd.read_csv() function from the pandas package, which stores the data as a DataFrame object.
Note: Ensure pandas and numpy are installed in your mamba environment!
import pandas as pd
import numpy as np
dahlem_weather = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/dahlem_station_weather_data.csv").dropna()
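Chaining `.dropna()` onto `pd.read_csv()` discards every row that contains at least one missing value. The following minimal sketch shows the effect on a small made-up CSV string; the column names mirror the dataset, but the values are invented for illustration and are not taken from the Dahlem data.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the numbers are invented for illustration only.
csv_text = """humidity,temperature,cloudiness,sunlight,season
4.1,2.5,6.0,1.2,winter
,15.3,4.0,7.5,spring
9.8,21.0,3.0,,summer
6.2,9.4,5.5,3.1,autumn
"""

raw = pd.read_csv(io.StringIO(csv_text))
clean = raw.dropna()  # drop every row containing at least one NaN

rows_dropped = len(raw) - len(clean)
print(f"rows before: {len(raw)}, after: {len(clean)}, dropped: {rows_dropped}")
```

Comparing the row counts before and after `.dropna()` is a quick way to see how much data the cleaning step removes.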
The dataset contains the following parameters:
| Column | Meaning |
|---|---|
| humidity | mean absolute humidity in g/kg. |
| temperature | mean daily temperature in 2 m above ground in °C. |
| cloudiness | mean daily cloud coverage in 1/8. |
| sunlight | mean daily sunlight hours. |
| season | winter = DJF, spring = MAM, summer = JJA, autumn = SON. |
We want to explore a possible correlation between temperature and humidity. Let’s have a look at the scatter plot to get a first impression of the relationship between the variables:
Note: Ensure matplotlib and seaborn are part of your mamba environment!
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12,7))
sns.scatterplot(
    data=dahlem_weather,
    x="temperature", y="humidity",
    hue="season"
).set(
    xlabel='Temperature in 2 m above ground (°C)',
    ylabel='Humidity (g/kg)'
)
Exercise 1: Calculate the Pearson correlation coefficient for the variables temperature and humidity, during summer (a) and winter (b)!
### your solution
from scipy.stats import pearsonr
summer = dahlem_weather.loc[dahlem_weather.season == "summer"]
print("r summer season: {}".
format(round(pearsonr(summer["temperature"], summer["humidity"]).statistic, 4)))
r summer season: 0.2913
winter = dahlem_weather.loc[dahlem_weather.season == "winter"]
print("r winter season: {}".
format(round(pearsonr(winter["temperature"], winter["humidity"]).statistic, 4)))
r winter season: 0.9724
Which is higher? Does the graph above support your findings?
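The coefficient returned by `pearsonr` can be reproduced by hand, which makes explicit what it measures: the covariance of the two variables scaled by the product of their standard deviations. The sketch below uses synthetic data (an assumption for illustration, not the Dahlem measurements) and compares the manual result with SciPy's.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data for illustration only (not the Dahlem measurements).
rng = np.random.default_rng(42)
x = rng.normal(loc=10.0, scale=5.0, size=200)
y = 0.4 * x + rng.normal(scale=2.0, size=200)

# Pearson's r: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_scipy = pearsonr(x, y)[0]  # first element is the correlation coefficient
print(f"manual: {r_manual:.4f}, scipy: {r_scipy:.4f}")
```

Both values should agree up to floating-point precision.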
Exercise 2: Plot a panel of correlation plots for the variables humidity, temperature, cloudiness and sunlight using the provided user-defined functions (UDFs)! Try different methods (Spearman, Pearson). Which one yields better results and why?
from scipy.stats import pearsonr, spearmanr

def reg_coef(x, y, label=None, color=None, **kwargs):
    ax = plt.gca()
    r, p = pearsonr(x, y)
    ax.annotate('r = {:.2f}'.format(r), xy=(0.5, 0.5), xycoords='axes fraction', ha='center', size=20)
    ax.set_axis_off()

def reg_spear_coef(x, y, label=None, color=None, **kwargs):
    ax = plt.gca()
    r, p = spearmanr(x, y)
    ax.annotate('r = {:.2f}'.format(r), xy=(0.5, 0.5), xycoords='axes fraction', ha='center', size=20)
    ax.set_axis_off()
### your solution
plt.figure(figsize=(12,7))
g = sns.PairGrid(dahlem_weather[["humidity", "temperature", "cloudiness", "sunlight"]])
g.map_diag(sns.histplot)
g.map_lower(sns.scatterplot)
g.map_upper(reg_coef)
### your solution
plt.figure(figsize=(12,7))
g = sns.PairGrid(dahlem_weather[["humidity", "temperature", "cloudiness", "sunlight"]])
g.map_diag(sns.histplot)
g.map_lower(sns.scatterplot)
g.map_upper(reg_spear_coef)
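When comparing the two methods in Exercise 2, it helps to recall what each coefficient measures: Pearson's r quantifies linear association, whereas Spearman's coefficient works on ranks and therefore captures any monotonic association. The following sketch illustrates the difference on a synthetic, strictly increasing but non-linear relation (an assumption for illustration, not the Dahlem data).

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic, strictly increasing but non-linear relation (illustration only).
x = np.linspace(0.0, 10.0, 50)
y = np.exp(x)

r_pearson = pearsonr(x, y)[0]    # sensitive to the curvature
r_spearman = spearmanr(x, y)[0]  # depends only on the ranks

print(f"Pearson: {r_pearson:.3f}, Spearman: {r_spearman:.3f}")
```

Because the ranks of x and y are identical here, Spearman's coefficient reaches its maximum, while Pearson's r stays clearly below it.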
Exercise 3: Conduct a correlation t-test with a confidence level of 99 % for the variables temperature and humidity! Differentiate between summer (a) and winter (b). Go through all 6 steps of hypothesis testing. Choose an appropriate coefficient to quantify the strength of the relation (Pearson or Spearman). What is your conclusion?
### your solution
from scipy.stats import spearmanr

def perform_and_print_cor_test(x, y, season, alpha=0.05):
    # Spearman's rank correlation tests for a monotonic (not necessarily linear) relation
    test_result = spearmanr(x, y)
    if test_result.pvalue < alpha:
        print("".join(["At a significance level of {} %, the data provide very strong evidence ",
                       "to conclude a monotonic relation between the temperature and the humidity for the {}. ",
                       "The results are statistically significant with a p-value of {}"]).
              format(int(alpha * 100), season, test_result.pvalue))
    else:
        print("".join(["At a significance level of {} %, the data provide no evidence for rejecting H0. ",
                       "Hence we cannot conclude a monotonic relation between ",
                       "the temperature and the humidity for the {}. The p-value is {}"]).
              format(int(alpha * 100), season, test_result.pvalue))

perform_and_print_cor_test(summer["temperature"], summer["humidity"], "summer", alpha=0.01)
print("\n")
perform_and_print_cor_test(winter["temperature"], winter["humidity"], "winter", alpha=0.01)
At a significance level of 1 %, the data provide no evidence for rejecting H0. Hence we cannot conclude a monotonic relation between the temperature and the humidity for the summer. The p-value is 0.041974605926915605

At a significance level of 1 %, the data provide very strong evidence to conclude a monotonic relation between the temperature and the humidity for the winter. The results are statistically significant with a p-value of 1.6007008282313898e-40
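For comparison, the Pearson-based correlation t-test can be written out step by step. The sketch below is one possible way to lay out the six steps, assuming the usual test statistic t = r · sqrt((n − 2) / (1 − r²)) with n − 2 degrees of freedom and a two-sided alternative. It runs on synthetic data (not the Dahlem measurements), and all variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only (not the Dahlem measurements).
rng = np.random.default_rng(0)
n = 60
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# Step 1: hypotheses. H0: rho = 0 (no linear correlation), HA: rho != 0.
# Step 2: significance level matching a 99 % confidence level.
alpha = 0.01

# Step 3: test statistic with n - 2 degrees of freedom.
r = stats.pearsonr(x, y)[0]
df = n - 2
t_stat = r * np.sqrt(df / (1.0 - r ** 2))

# Step 4: two-sided p-value from the t-distribution.
p_value = 2.0 * stats.t.sf(abs(t_stat), df)

# Step 5: decision.
reject_h0 = p_value < alpha

# Step 6: interpretation in words.
print(f"r = {r:.3f}, t = {t_stat:.3f}, p = {p_value:.3g}, reject H0: {reject_h0}")
```

The p-value computed this way should match the one that `pearsonr` returns as its second element, which is a convenient consistency check.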
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by email at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.