Spatial statistics on point patterns are often concerned with determining how regular a distribution of points is, that is, for example, whether the points are clustered or evenly distributed. Questions such as these relate to the intensity or dispersion of the point pattern as a whole.
Intensity is the average density of points or in other words the expected number of points per unit area. The intensity may be constant (uniform) or may vary from location to location (non-uniform or inhomogeneous).
One approach to evaluate the intensity is to divide the study area into quadrats and count the number of points falling into each quadrat. If the points have uniform intensity and are completely random, then the quadrat counts should be Poisson random numbers with constant mean. We may test that hypothesis by using the $\chi^2$ goodness-of-fit test statistic
$$\chi^2 = \sum\frac{(\text{observed}-\text{expected})^2}{\text{expected}}$$
In Python we make use of the pointpats.quadrat_statistics module, including the qs.QStatistic() function. First, we load the Berlin city data sets we created in the previous section.
# First, let's import the needed libraries.
import matplotlib.pyplot as plt
import numpy as np
import random
import pandas as pd
import geopandas as gpd
import folium
# load berlin shapefile
berlin_district = gpd.read_file(
"http://userpage.fu-berlin.de/soga/soga-py/300/30800_spatial_point_patterns/berlin_district.zip"
)
# load data from the previous section
berlin_locations = gpd.read_file(
"http://userpage.fu-berlin.de/soga/soga-py/300/30800_spatial_point_patterns/berlin_locations.geojson"
)
berlin_random = gpd.read_file(
"http://userpage.fu-berlin.de/soga/soga-py/300/30800_spatial_point_patterns/berlin_random.geojson"
)
berlin_regular = gpd.read_file(
"http://userpage.fu-berlin.de/soga/soga-py/300/30800_spatial_point_patterns/berlin_regular.geojson"
)
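As a quick sanity check, we can, for example, print the number of points and the coordinate reference system (CRS) of each data set:
# quick sanity check: number of points and CRS of each data set
for label, gdf in [
    ("locations", berlin_locations),
    ("random", berlin_random),
    ("regular", berlin_regular),
]:
    print(label, len(gdf), gdf.crs)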
Quadrat statistics
The first set of techniques, quadrat statistics, gets its name from its approach of splitting the study area into small areas (quadrats). Once created, these "buckets" are used to examine the uniformity of counts across them. We employ the pointpats.quadrat_statistics module for this type of analysis. You can find the documentation of pointpats here.
import pointpats.quadrat_statistics as qs
For the qs.QStatistic() function, the coordinates must be provided as an array.
## reshape coordinates to right input format ([x1,y1], [x2,y2],....[xn, yn])
## extract coordinates from GeoPandas as array
coord_list_loc = np.array(
[(x, y) for x, y in zip(berlin_locations.geometry.x, berlin_locations.geometry.y)]
)
coord_list_ran = np.array(
[(x, y) for x, y in zip(berlin_random.geometry.x, berlin_random.geometry.y)]
)
coord_list_reg = np.array(
[(x, y) for x, y in zip(berlin_regular.geometry.x, berlin_regular.geometry.y)]
)
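Each array now has the (n, 2) shape expected by qs.QStatistic(), one [x, y] pair per point, which we can verify:
# verify the (n, 2) input format expected by qs.QStatistic()
print(coord_list_loc.shape, coord_list_ran.shape, coord_list_reg.shape)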
We compute the statistic for the random pattern and visualize the result using the QStatistic.plot() method.
qstat = qs.QStatistic(coord_list_ran)
qstat.plot()
By default, the qs.QStatistic() function uses a 3x3 grid. We see that the total count of points per square varies a lot. The $\chi^2$ test assesses how likely such a distribution of counts is if the underlying cell counts are uniform. We therefore expect the test to be statistically significant, indicated by a very small p-value:
print(f"chi2 ={qstat.chi2}, df ={qstat.df}, p-value ={qstat.chi2_pvalue} ")
chi2 =39.6822429906542, df =8, p-value =3.6711911449602187e-06
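To make the computation transparent, here is a minimal sketch that recomputes the statistic by hand: we bin the points into a 3x3 grid with numpy and apply the $\chi^2$ formula from above. (The bin edges here are the points' bounding box, which may differ slightly from the grid pointpats constructs internally, so the value can deviate somewhat.)
from scipy import stats

# bin the random points into a 3x3 grid and count points per quadrat
counts, _, _ = np.histogram2d(coord_list_ran[:, 0], coord_list_ran[:, 1], bins=3)
observed = counts.ravel()
expected = observed.mean()  # under uniform intensity every quadrat has the same expectation
chi2 = ((observed - expected) ** 2 / expected).sum()
p_value = stats.chi2.sf(chi2, df=len(observed) - 1)
print(f"chi2 = {chi2}, p-value = {p_value}")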
In contrast, our regular point process has nearly the same number of points in each square:
qstat = qs.QStatistic(coord_list_reg)
qstat.plot()
print(f"chi2 ={qstat.chi2}, df ={qstat.df}, p-value ={qstat.chi2_pvalue} ")
chi2 =8.666666666666668, df =8, p-value =0.37119089551026835
The high p-value $(p > 0.05)$ indicates a homogeneous Poisson process (HPP), where the intensity is a constant $\lambda(u) = \lambda = \frac{n}{\vert A \vert}$, with $n$ being the number of points observed in region $A$, and $\vert A \vert$ being the area of region $A$.
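In code, this constant-intensity estimate is a one-liner. A minimal sketch for the random pattern, assuming the layers use a projected CRS so that areas come out in square map units:
# constant intensity estimate: lambda = n / |A|
n = len(berlin_random)
area = berlin_district.geometry.area.sum()  # |A| in square map units (projected CRS assumed)
print(f"lambda = {n / area}")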
Let us repeat that once more for the berlin_locations data set, which seems to be highly clustered.
qstat = qs.QStatistic(coord_list_loc)
qstat.plot()
print(f"chi2 ={qstat.chi2}, df ={qstat.df}, p-value ={qstat.chi2_pvalue} ")
chi2 =3997.276414087514, df =8, p-value =0.0
The very low p-value indicates that the intensity is non-constant. Such a point process is referred to as an inhomogeneous Poisson process (IPP), and its intensity can be estimated non-parametrically, e.g., with kernel smoothing (Bivand et al. 2008).
There are different ways to estimate the intensity of an IPP. One of them is non-parametric estimation by means of kernel smoothing. If we observe $n$ points $\{x_i\}_{i=1}^n$, then the form of a kernel smoothing estimator is
$$\hat{\lambda}(x) = \frac{1}{h^2}\sum_{i=1}^n\frac{\kappa(\frac{\Vert x-x_i\Vert}{h})}{q(\Vert x \Vert)}\text{,} $$where $x_i \in \{x_1, x_2,..., x_n\}$ is an observed point, $h$ is the bandwidth, $q(\Vert x \Vert)$ is a border correction to compensate for observations missing due to edge effects, and $\kappa(u)$ is a bivariate and symmetrical kernel function.
A common choice, implemented for example in R's spatstat package, is the two-dimensional quartic kernel function:
$$ \kappa(u) = \begin{cases} \frac{3}{\pi}(1-\Vert u \Vert^2)^2 & \text{if } \Vert u \Vert < 1 \\ 0 & \text{otherwise} \end{cases} $$
where $\Vert u \Vert^2=u_1^2+u_2^2$ is the squared norm of the point $u=(u_1, u_2)$.
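For illustration, here is a minimal numpy sketch of this estimator with the quartic kernel, evaluated at a single location $x$; the border correction is omitted for simplicity (i.e. $q(\Vert x \Vert) = 1$), and the bandwidth of 5000 map units is an arbitrary choice:
def quartic_kernel(u):
    # two-dimensional quartic kernel: 3/pi * (1 - ||u||^2)^2 for ||u|| < 1, else 0
    norm2 = np.sum(u**2, axis=-1)
    return np.where(norm2 < 1, 3 / np.pi * (1 - norm2) ** 2, 0.0)

def intensity_at(x, points, h):
    # kernel estimate of lambda(x) without border correction (q = 1)
    return quartic_kernel((x - points) / h).sum() / h**2

# estimated intensity at the centroid of the random pattern
print(intensity_at(coord_list_ran.mean(axis=0), coord_list_ran, h=5000))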
There is no general rule for selecting the bandwidth $h$, which governs the level of smoothing. Small bandwidths result in spikier maps, large bandwidths in smoother maps. It is reasonable to try several values depending on the process under consideration and to choose a value that seems plausible (Bivand et al. 2008).
A kernel density plot is the graphical representation of kernel smoothing. In Python, the seaborn package provides a nice implementation for visualizing spatial kernel density plots. Here we use the kdeplot() function.
The seaborn.kdeplot() function provides an estimate of the intensity function of the point process that generated the point pattern data. The bw_method argument determines the smoothing bandwidth to use; we will pass it a value we call sigma. Sigma is taken as the standard deviation of the isotropic Gaussian kernel, corresponding to the bandwidth parameter. A small standard deviation results in a spikier density plot and a large standard deviation in a smoother one.
We code a loop to apply the sns.kdeplot() function to our three data sets, berlin_locations, berlin_regular, and berlin_random, with three different values for sigma $(0.2, 0.4, 0.9)$. First we import seaborn and look at a single call, sketched below; then we build the full grid of plots.
import seaborn as sns
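Before building the full grid of plots, a single call sketches the idea (sigma = 0.4 here is an arbitrary choice):
# single kernel density plot of the random pattern, overlaid on the district boundary
ax = sns.kdeplot(
    x=berlin_random.geometry.x.values,
    y=berlin_random.geometry.y.values,
    bw_method=0.4,
    fill=True,
    cmap="viridis_r",
)
berlin_district.plot(ax=ax, color="none", edgecolor="grey")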
# Set up a 3x3 figure: one row per data set, one column per bandwidth
f, axs = plt.subplots(3, 3, figsize=(14, 10))
axs = axs.ravel()
sigma = [0.2, 0.4, 0.9]
plotaxis = [0, 3, 6]  # index of the first subplot in each row
main = ["Locations", "Regular", "Random"]
# the data sets are ordered to match the row titles in `main`
for i, name, my_df in zip(
    plotaxis, main, [berlin_locations, berlin_regular, berlin_random]
):
    for j, s in zip(range(0, 3), sigma):
        sns.kdeplot(
            x=my_df.geometry.x.values,
            y=my_df.geometry.y.values,
            levels=30,
            bw_method=s,
            fill=True,
            alpha=0.6,
            cmap="viridis_r",
            ax=axs[j + i],
            legend=True,
            cbar=True,
        )
        # overlay the district boundary for orientation
        berlin_district.plot(ax=axs[j + i], color="none", edgecolor="grey")
        axs[j + i].set_axis_off()
        axs[j + i].set_title(f"{name}, sigma = {s}")
f.tight_layout()
Citation
The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.