Python provides access to the t-distribution within the scipy.stats package
by the t.pdf()
, t.cdf()
, t.ppf()
and t.rvs()
functions. Apply the help()
function on these functions for further information or see the documentation here.
The t.rvs()
function generates random deviates of the t-distribution and is written as t.rvs(df, loc=0, scale=1, size=1)
. We may easily generate n
number of random samples (size). Recall that he number of degrees of freedom (df
) for a t-distribution is equal to the sample size minus one, that is,
# First, let's import all the needed libraries.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
n = 30
df = n - 1
stats.t.rvs(df, loc=0, scale=1, size=n)
array([ 0.95544828, 0.37078233, -0.18390197, -1.12389442, 0.66589859, 1.35468731, 0.4142192 , 0.9110111 , -0.37727128, 0.81129497, 0.34850557, 0.60532365, -0.20578076, -1.29671915, 0.13190164, -2.16378482, -0.26282869, -0.2769083 , -1.54400695, 0.28656318, -0.22978442, -0.01746411, -0.23847418, -0.83386572, 1.06076188, 0.91941202, -0.45496332, 0.29924681, -0.63563942, 1.19456363])
Further we may generate a very large number of random samples and plot them as a histogram.
n = 10000
df = n - 1
rand_t_samples = stats.t.rvs(df, loc=0, scale=1, size=n)
plt.figure(figsize=(10, 5))
plt.hist(
rand_t_samples, density=True, color="lightgrey", edgecolor="darkgrey", bins="scott"
)
plt.xlabel("samples")
plt.ylabel("Density")
plt.show()
By using the t.pdf()
function we may calculate the probability density function, and thus, the vertical distance between the horizontal axis and the t-curve at any point. For the purpose of demonstration we construct a t-distribution with $df=5$ and calculate the probability density function at $t = -4,-2,0,2,4$.
x = np.arange(-4, 5, 2)
stats.t.pdf(x, df=5)
array([0.00512373, 0.06509031, 0.37960669, 0.06509031, 0.00512373])
x = np.arange(-5, 5.01, 0.01)
yy = stats.t.pdf(x, df=5)
plt.figure(figsize=(10, 5))
plt.plot(x, yy, color="black")
plt.title("t-Distribution, df = 5")
plt.ylim(-0.01, 0.5)
plt.xlabel("t-values")
plt.ylabel("Probability density")
xpos = [-4, -2, 0, 2, 4]
ypos = [0.005, 0.065, round(max(yy), 3), 0.065, 0.005]
for px, py in zip(xpos, ypos):
plt.vlines(x=px, ymin=0, ymax=py, color="red", linestyle="--")
plt.text(px - 0.3, py + 0.025, f"{py}", fontsize=14)
plt.show()
Another very useful function is the t.cdf()
function, which returns the area under the t-curve for any given interval. Let us calculate the area under the curve for the intervals $j_i = (-\infty, -2], (-\infty, 0], (-\infty, 2]$ and $k_i = [-2, \infty),[0, \infty), [2, \infty)$ for a random variable following a t-distribution with $df=5$.
df = 5
ji = [-2, 0, 2]
stats.t.cdf(ji, df=df)
array([0.05096974, 0.5 , 0.94903026])
df = 5
fx = [-2, 0, 2]
fy = stats.t.pdf(fx, df)
out = stats.t.cdf(ji, df=df)
x = np.arange(-6, 6.01, 0.05)
yy = stats.t.pdf(x, df)
fig, ax = plt.subplots(1, 3, figsize=(14, 5))
plt.title("Area under the curve for $j_i$")
###### Plot 1
ax[0].plot(x, yy, color="black")
ax[0].set_ylim(-0.01, 0.45)
ax[0].set_ylabel("Probability density")
ax[0].fill_between(
x=x,
y1=yy,
where=(x <= -2),
color="red",
alpha=0.75,
)
ax[0].text(
1,
0.25,
f"Colored area is \n {round(out[0]*100,2)}%",
fontsize=12,
)
ax[0].text(
-4,
0.4,
"$P(X \leq j_i) =\\int_{-\infty}^{j_i}f(x)dx $",
fontsize=12,
)
#### Plot 2
ax[1].plot(x, yy, color="black")
ax[1].set_ylim(-0.01, 0.45)
ax[1].fill_between(
x=x,
y1=yy,
where=(x <= 0),
color="red",
alpha=0.75,
)
ax[1].text(
1,
0.25,
f"Colored area is \n {round(out[1]*100,2)}%",
fontsize=12,
)
ax[1].text(
-4,
0.4,
"$P(X \leq j_i) =\\int_{-\infty}^{j_i}f(x)dx $",
fontsize=12,
)
##### Plot 3
ax[2].plot(x, yy, color="black")
ax[2].set_ylim(-0.01, 0.45)
ax[2].fill_between(
x=x,
y1=yy,
where=(x <= 2),
color="red",
alpha=0.75,
)
ax[2].text(
1,
0.25,
f"Colored area is \n {round(out[2]*100,2)}%",
fontsize=12,
)
ax[2].text(
-4,
0.4,
"$P(X \leq j_i) =\\int_{-\infty}^{j_i}f(x)dx $",
fontsize=12,
)
plt.show()
df = 5
fx = [-2, 0, 2]
fy = stats.t.pdf(fx, df)
out = stats.t.cdf(ji, df=df)
x = np.arange(-6, 6.01, 0.05)
yy = stats.t.pdf(x, df)
fig, ax = plt.subplots(1, 3, figsize=(14, 5))
plt.title("Area under the curve for $k_i$")
###### Plot 1
ax[0].plot(x, yy, color="black")
ax[0].set_ylim(-0.01, 0.45)
ax[0].set_ylabel("Probability density")
ax[0].fill_between(
x=x,
y1=yy,
where=(x >= -2),
color="red",
alpha=0.75,
)
ax[0].text(
1,
0.25,
f"Colored area is \n {round(out[2]*100,2)}%",
fontsize=12,
)
ax[0].text(
-4,
0.4,
"$P(X \leq j_i) =\\int_{k_i}^{\infty}f(x)dx $",
fontsize=12,
)
#### Plot 2
ax[1].plot(x, yy, color="black")
ax[1].set_ylim(-0.01, 0.45)
ax[1].fill_between(
x=x,
y1=yy,
where=(x >= 0),
color="red",
alpha=0.75,
)
ax[1].text(
1,
0.25,
f"Colored area is \n {round(out[1]*100,2)}%",
fontsize=12,
)
ax[1].text(
-4,
0.4,
"$P(X \leq j_i) =\\int_{k_i}^{\infty}f(x)dx $",
fontsize=12,
)
##### Plot 3
ax[2].plot(x, yy, color="black")
ax[2].set_ylim(-0.01, 0.45)
ax[2].fill_between(
x=x,
y1=yy,
where=(x >= 2),
color="red",
alpha=0.75,
)
ax[2].text(
1,
0.25,
f"Colored area is \n {round(out[0]*100,2)}%",
fontsize=12,
)
ax[2].text(
-4,
0.4,
"$P(X \leq j_i) =\\int_{k_i}^{\infty}f(x)dx $",
fontsize=12,
)
plt.show()
The t.ppf(q, df)
function returns the quantile function, and thus it is the reversing function of t.cdf()
. For the intervals $j_i = (-\infty, -2], (-\infty, 0], (-\infty, 2]$ of a random variable following a t-distribution with $df=5$, the t.ppf()
function yields...
x = np.arange(-2, 2.01, 2)
ji = stats.t.cdf(x, 5)
ji
array([0.05096974, 0.5 , 0.94903026])
stats.t.ppf(ji, df=5)
array([-2.00000000e+00, 6.97600362e-17, 2.00000000e+00])
... and for the intervals $k_i = [-2, \infty),[0, \infty), [2, \infty)$ of a random variable following a t-distribution with $df=5$, the t.ppf()
function returns
x = np.arange(-2, 2.01, 2)
ki = 1 - stats.t.cdf(x, 5)
ki
array([0.94903026, 0.5 , 0.05096974])
stats.t.ppf(ki, df=5)
array([ 2.00000000e+00, 6.97600362e-17, -2.00000000e+00])
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.