In the following exercises we analyze climate data from the Berlin-Dahlem climate station. You can download the data here or import it directly into your Python environment with the pandas function pd.read_csv(), which stores the data as a DataFrame object.

Note: Ensure pandas and numpy are installed in your mamba environment!

In [1]:
import pandas as pd
import numpy as np

dahlem_weather = pd.read_csv("https://userpage.fu-berlin.de/soga/data/raw-data/dahlem_station_weather_data.csv").dropna()

The dataset contains the following parameters:

Column        Meaning
humidity      mean absolute humidity in g/kg.
temperature   mean daily temperature in 2 m above ground in °C.
cloudiness    mean daily cloud coverage in 1/8.
sunlight      mean daily sunlight hours.
season        winter = DJF, spring = MAM, summer = JJA, autumn = SON.

We want to explore a possible correlation between temperature and humidity. Let’s have a look at the scatter plot to get a first impression of the relationship between the variables:

Note: Ensure matplotlib and seaborn are part of your mamba environment!

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,7))

sns.scatterplot(
    data=dahlem_weather,
    x="temperature", y="humidity", 
    hue="season"
).set(
    xlabel='Temperature in 2 m above ground (°C)',
    ylabel='Humidity (g/kg)'
)
Out[2]:
[Text(0.5, 0, 'Temperature in 2 m above ground (°C)'),
 Text(0, 0.5, 'Humidity (g/kg)')]

Exercise 1: Calculate the Pearson correlation coefficient for the variables temperature and humidity, during summer (a) and winter (b)!

In [ ]:
### your solution
In [3]:
from scipy.stats import pearsonr

summer = dahlem_weather.loc[dahlem_weather.season == "summer"]

print("r summer season: {}".
      format(round(pearsonr(summer["temperature"], summer["humidity"]).statistic, 4)))
r summer season: 0.2913
In [4]:
winter = dahlem_weather.loc[dahlem_weather.season == "winter"]

print("r winter season: {}".
      format(round(pearsonr(winter["temperature"], winter["humidity"]).statistic, 4)))
r winter season: 0.9724

Which coefficient is higher? Does the scatter plot above support your findings?
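To make the quantity behind pearsonr more tangible, the following minimal sketch computes Pearson's r by hand as the sample covariance divided by the product of the sample standard deviations and compares it with the SciPy result. The arrays temp and hum are synthetic stand-ins generated for illustration only, not the Dahlem measurements, and the chosen means, slopes and seed are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in data (hypothetical values, NOT the Dahlem measurements)
rng = np.random.default_rng(42)
temp = rng.normal(loc=18.0, scale=3.0, size=200)
hum = 5.0 + 0.3 * temp + rng.normal(scale=1.0, size=200)

# Pearson's r = cov(x, y) / (s_x * s_y), using sample estimates (ddof=1)
cov_xy = np.cov(temp, hum, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(temp, ddof=1) * np.std(hum, ddof=1))

# Reference value from SciPy (tuple unpacking works across SciPy versions)
r_scipy, _ = pearsonr(temp, hum)

print(f"manual: {r_manual:.4f}, scipy: {r_scipy:.4f}")
```

Both values should agree up to floating-point rounding, because the choice of ddof cancels out as long as it is used consistently in the numerator and the denominator.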

Exercise 2: Plot a panel of correlation plots for the variables humidity, temperature, cloudiness and sunlight using the user-defined functions (UDFs) provided below! Try different methods (Spearman, Pearson). Which one yields better results and why?

In [5]:
from scipy.stats import pearsonr, spearmanr
def reg_coef(x, y , label=None, color=None, **kwargs):
    ax = plt.gca()
    r,p = pearsonr(x,y)
    ax.annotate('r = {:.2f}'.format(r), xy=(0.5,0.5), xycoords='axes fraction', ha='center', size = 20)
    ax.set_axis_off()

def reg_spear_coef(x, y , label=None, color=None, **kwargs):
    ax = plt.gca()
    r,p = spearmanr(x,y)
    ax.annotate('r = {:.2f}'.format(r), xy=(0.5,0.5), xycoords='axes fraction', ha='center', size = 20)
    ax.set_axis_off()
In [ ]:
### your solution
In [6]:
plt.figure(figsize=(12,7))
g = sns.PairGrid(dahlem_weather[["humidity", "temperature", "cloudiness", "sunlight"]])
g.map_diag(sns.histplot)
g.map_lower(sns.scatterplot)
g.map_upper(reg_coef)
Out[6]:
<seaborn.axisgrid.PairGrid at 0x1ce3e3cfa30>
<Figure size 1200x700 with 0 Axes>
In [ ]:
### your solution
In [7]:
plt.figure(figsize=(12,7))
g = sns.PairGrid(dahlem_weather[["humidity", "temperature", "cloudiness", "sunlight"]])
g.map_diag(sns.histplot)
g.map_lower(sns.scatterplot)
g.map_upper(reg_spear_coef)
Out[7]:
<seaborn.axisgrid.PairGrid at 0x1ce3ecd3f10>
<Figure size 1200x700 with 0 Axes>
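As a hint for the question of which method is more suitable, the sketch below contrasts the two coefficients on a relation that is strictly increasing but clearly non-linear. The data (an exponential curve without noise) are an artificial example chosen for illustration and are not taken from the weather dataset.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Artificial example: y grows monotonically with x, but not linearly
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

# Pearson measures the strength of the linear relation
r_pearson, _ = pearsonr(x, y)

# Spearman works on ranks and therefore measures the monotonic relation
rho_spearman, _ = spearmanr(x, y)

print(f"Pearson r:    {r_pearson:.3f}")
print(f"Spearman rho: {rho_spearman:.3f}")
```

Because the ranks of x and y are identical here, Spearman's coefficient reaches 1, while Pearson's coefficient stays below 1. For variable pairs in the panel whose scatter plots look curved, a similar gap between the two methods can be expected.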

Exercise 3: Conduct a correlation t-test with a confidence level of 99 % for the variables temperature and humidity! Differentiate between summer (a) and winter (b). Go through all 6 steps of hypothesis testing. Choose an appropriate coefficient to quantify the strength of the relation (Pearson or Spearman). What is your conclusion?

In [ ]:
### your solution
In [8]:
from scipy.stats import spearmanr

def perform_and_print_cor_test(x, y, season, alpha = 0.05):
    # Spearman's rank correlation tests for a monotonic (not necessarily linear) relation
    test_result = spearmanr(x, y)
    
    if (test_result.pvalue < alpha):
        print("".join(["At a significance level of {} %, the data provide very strong evidence ",
                       "to conclude a monotonic relation between the temperature and the humidity for the {}. ",
                       "The results are statistically significant with a p-value of {}"]).
              format(round(alpha * 100), season, test_result.pvalue))
    else:
        print("".join(["At a significance level of {} %, the data do not provide sufficient evidence to reject H0. ",
                       "Hence we cannot conclude a monotonic relation between ",
                       "the temperature and the humidity for the {}. The p-value is {}"]).
             format(round(alpha * 100), season, test_result.pvalue))

perform_and_print_cor_test(summer["temperature"], summer["humidity"], "summer", alpha = 0.01)
print("\n")
# pass the matching season label for the winter subset
perform_and_print_cor_test(winter["temperature"], winter["humidity"], "winter", alpha = 0.01)
At a significance level of 1 %, the data do not provide sufficient evidence to reject H0. Hence we cannot conclude a monotonic relation between the temperature and the humidity for the summer. The p-value is 0.041974605926915605


At a significance level of 1 %, the data provide very strong evidence to conclude a monotonic relation between the temperature and the humidity for the winter. The results are statistically significant with a p-value of 1.6007008282313898e-40
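For the intermediate steps of the hypothesis test (test statistic, critical value, p-value), the following sketch spells out the classical correlation t-test for Pearson's r, with t = r * sqrt((n - 2) / (1 - r^2)) and n - 2 degrees of freedom. It is an illustration under stated assumptions: the data are randomly generated, and the sample size, slope and seed are hypothetical. The p-value reported by pearsonr serves as a cross-check.

```python
import numpy as np
from scipy import stats

# Randomly generated illustration data (hypothetical, NOT the Dahlem measurements)
rng = np.random.default_rng(0)
n = 90
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)

alpha = 0.01  # significance level, i.e. confidence level of 99 %

# Sample correlation coefficient and SciPy's two-sided p-value
r, p_scipy = stats.pearsonr(x, y)

# Test statistic of the correlation t-test with n - 2 degrees of freedom
df = n - 2
t_stat = r * np.sqrt(df / (1.0 - r**2))

# Critical value approach (two-sided test)
t_crit = stats.t.ppf(1.0 - alpha / 2.0, df)
reject_h0 = abs(t_stat) > t_crit

# p-value approach (two-sided test)
p_manual = 2.0 * stats.t.sf(abs(t_stat), df)

print(f"r = {r:.4f}, t = {t_stat:.4f}, critical value = {t_crit:.4f}")
print(f"p (manual) = {p_manual:.3e}, p (scipy) = {p_scipy:.3e}")
print(f"Reject H0 at alpha = {alpha}: {reject_h0}")
```

The critical value approach and the p-value approach lead to the same decision, so either one can be used in the exercise.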

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.