Randomness and Clustering¶

Spatial statistics on point patterns are often concerned with determining how regular a distribution of points is, that is, whether the points are clustered or evenly spread. Questions such as these relate to the intensity or dispersion of the point pattern as a whole.

Intensity is the average density of points, in other words the expected number of points per unit area. The intensity may be constant (uniform) or may vary from location to location (non-uniform or inhomogeneous).

One approach to evaluate the intensity is to divide the study area into quadrats and count the number of points falling into each quadrat. If the points have uniform intensity and are completely random, the quadrat counts should be Poisson random numbers with constant mean. We may test that hypothesis with the $\chi^2$ goodness-of-fit test statistic

$$\chi^2 = \sum\frac{(\text{observed}-\text{expected})^2}{\text{expected}}$$
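
Before turning to a library, the statistic can be computed directly. Below is a minimal NumPy/SciPy sketch; quadrat_chi2 is a hypothetical helper (not part of any package) that bins an (n, 2) coordinate array into a regular grid and compares the counts to a constant expectation:

import numpy as np
from scipy import stats


def quadrat_chi2(points, nx=3, ny=3):
    """Chi-squared quadrat test for an (n, 2) array of x/y coordinates."""
    # bin the points into an nx-by-ny grid spanning their bounding box
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=(nx, ny))
    observed = counts.ravel()
    # under a completely random pattern every quadrat has the same expected count
    expected = observed.mean()
    chi2 = ((observed - expected) ** 2 / expected).sum()
    df = observed.size - 1
    p_value = stats.chi2.sf(chi2, df)  # upper-tail probability
    return chi2, df, p_value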

In Python we make use of the pointpats.quadrat_statistics module, including the qs.QStatistic() function. First, we load the Berlin city data sets we created in the previous section.

In [2]:
# First, let's import the needed libraries.
import matplotlib.pyplot as plt
import numpy as np
import random
import pandas as pd
import geopandas as gpd
import folium
In [3]:
# load berlin shapefile
berlin_district = gpd.read_file(
    "http://userpage.fu-berlin.de/soga/soga-py/300/30800_spatial_point_patterns/berlin_district.zip"
)
# load data from the previous section
berlin_locations = gpd.read_file(
    "http://userpage.fu-berlin.de/soga/soga-py/300/30800_spatial_point_patterns/berlin_locations.geojson"
)
berlin_random = gpd.read_file(
    "http://userpage.fu-berlin.de/soga/soga-py/300/30800_spatial_point_patterns/berlin_random.geojson"
)

berlin_regular = gpd.read_file(
    "http://userpage.fu-berlin.de/soga/soga-py/300/30800_spatial_point_patterns/berlin_regular.geojson"
)

Quadrat statistics¶

The first set of techniques, quadrat statistics, get their name from their approach of splitting the study area into small regions (quadrats). Once created, these "buckets" are used to examine the uniformity of counts across them. We employ the pointpats.quadrat_statistics module for this type of analysis.

The documentation of pointpats is available online.

In [4]:
import pointpats.quadrat_statistics as qs

For the qs.QStatistic() function, the coordinates must be provided as an array.

In [5]:
## reshape coordinates to the required input format: [[x1, y1], [x2, y2], ..., [xn, yn]]
## extract coordinates from the GeoDataFrames as NumPy arrays

coord_list_loc = np.array(
    [(x, y) for x, y in zip(berlin_locations.geometry.x, berlin_locations.geometry.y)]
)
coord_list_ran = np.array(
    [(x, y) for x, y in zip(berlin_random.geometry.x, berlin_random.geometry.y)]
)
coord_list_reg = np.array(
    [(x, y) for x, y in zip(berlin_regular.geometry.x, berlin_regular.geometry.y)]
)

We can visualize the result using the QStatistic.plot() method.

In [6]:
qstat = qs.QStatistic(coord_list_ran)
qstat.plot()

By default, the qs.QStatistic() function uses a 3x3 grid. We see that the total count of points per square varies a lot. The $\chi^2$ statistic tests how likely such a distribution of counts is under the hypothesis of uniform cell counts, so we expect the test to be statistically significant, indicated by a very small p-value:

In [7]:
print(f"chi2 ={qstat.chi2}, df ={qstat.df}, p-value ={qstat.chi2_pvalue} ")
chi2 =39.6822429906542, df =8, p-value =3.6711911449602187e-06 

In contrast, our regular point process has nearly the same number of points in each square:

In [8]:
qstat = qs.QStatistic(coord_list_reg)
qstat.plot()
In [9]:
print(f"chi2 ={qstat.chi2}, df ={qstat.df}, p-value ={qstat.chi2_pvalue} ")
chi2 =8.666666666666668, df =8, p-value =0.37119089551026835 

The high p-value $(p > 0.05)$ indicates a homogeneous Poisson process (HPP), where the intensity is a constant $\lambda(u) = \lambda = \frac{n}{\vert A \vert}$, with $n$ being the number of points observed in region $A$ and $\vert A \vert$ being the area of $A$.
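
For illustration, this constant intensity estimate is a simple division. A quick sketch, assuming berlin_district uses a projected CRS (so that .area returns a meaningful surface area):

# constant intensity estimate lambda = n / |A| for the random pattern,
# taking the Berlin district polygon as the study region A
n = len(berlin_random)
area = berlin_district.geometry.area.sum()
print(f"lambda = {n / area:.3e} points per unit area")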

Let us repeat that once more for the berlin_locations data set, which seems to be highly clustered.

In [10]:
qstat = qs.QStatistic(coord_list_loc)
qstat.plot()
In [11]:
print(f"chi2 ={qstat.chi2}, df ={qstat.df}, p-value ={qstat.chi2_pvalue} ")
chi2 =3997.276414087514, df =8, p-value =0.0 

The very low p-value indicates that the intensity is non-constant. Such a point process is referred to as an inhomogeneous Poisson process (IPP), and its intensity can be estimated non-parametrically, e.g., with kernel smoothing (Bivand et al. 2008).


Kernel density estimate¶

There are different ways to estimate the intensity of an IPP. One of them is non-parametric estimation by means of kernel smoothing. If we observe $n$ points $\{x_i\}_{i=1}^n$, then the form of a kernel smoothing estimator is

$$\hat{\lambda}(x) = \frac{1}{h^2}\sum_{i=1}^n\frac{\kappa(\frac{\Vert x-x_i\Vert}{h})}{q(\Vert x \Vert)}\text{,} $$

where $x_i \in \{x_1, x_2,..., x_n\}$ is an observed point, $h$ is the bandwidth, $q(\Vert x \Vert)$ is a border correction to compensate for observations missing due to edge effects, and $\kappa(u)$ is a bivariate and symmetrical kernel function.

A classical choice, implemented for example in R's spatial packages, is the two-dimensional quartic kernel function:

$$ \kappa(u) = \begin{cases} \frac{3}{\pi}(1-\Vert u \Vert^2)^2 & \text{if } \Vert u \Vert < 1 \\ 0 & \text{otherwise} \end{cases} $$

where $\Vert u \Vert^2=u_1^2+u_2^2$ is the squared norm of point $u=(u_1, u_2)$.
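
As a minimal sketch (not the implementation seaborn uses below), the estimator with the quartic kernel and without border correction (i.e., $q \equiv 1$) can be written as:

import numpy as np


def quartic_kernel(r):
    """Quartic kernel: (3 / pi) * (1 - r**2)**2 for r < 1, else 0."""
    return np.where(r < 1, 3 / np.pi * (1 - r**2) ** 2, 0.0)


def intensity_estimate(x, points, h):
    """Kernel intensity estimate at location x from an (n, 2) point array."""
    r = np.linalg.norm(points - x, axis=1) / h  # scaled distances ||x - x_i|| / h
    return quartic_kernel(r).sum() / h**2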

There is no general rule for selecting the bandwidth $h$, which governs the degree of smoothing. Small bandwidths result in spikier maps and large bandwidths result in smoother maps. It is reasonable to try several values for the process under consideration and choose a value that seems plausible (Bivand et al. 2008), as in the sketch below.
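
For instance, one could evaluate the sketch above at a single location for a few trial bandwidths and compare the results (the bandwidth values below are hypothetical and must be adapted to the units of the coordinates):

# evaluate the hand-rolled estimator at the centre of the random pattern
center = coord_list_ran.mean(axis=0)
for h in (500.0, 1000.0, 2000.0):
    print(f"h = {h}: lambda_hat = {intensity_estimate(center, coord_list_ran, h):.3e}")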


Kernel density plot (2-D)¶

A kernel density plot is the graphical representation of kernel smoothing. In Python, the seaborn package provides a nice implementation for visualizing spatial kernel density estimates. Here we use the kdeplot() function.

The seaborn.kdeplot() function provides an estimate of the intensity function of the point process that generated the point pattern data. The argument bw_method determines the smoothing bandwidth; we pass it a value we call sigma, which scales the bandwidth of the isotropic Gaussian kernel. A small value results in a spikier density plot, a large value in a smoother one.

We write a loop to apply the sns.kdeplot() function to our three data sets, berlin_locations, berlin_random, and berlin_regular, with three different values for sigma $(0.2, 0.4, 0.9)$.

In [12]:
import seaborn as sns
In [13]:
# inspect the names contained in the berlin_locations data set
berlin_locations.name
Out[13]:
0                Albert's
1          Bellboy Berlin
2              Newton Bar
3      No Limit Shishabar
4              en passant
              ...        
932            uni keller
933     Zur Schmalzstulle
934      spindler & klatt
935                  800A
936            Bon Vivant
Name: name, Length: 937, dtype: object
In [16]:
# Set up figure and axis
f, axs = plt.subplots(3, 3, figsize=(14, 10))

axs = axs.ravel()
sigma = [0.2, 0.4, 0.9]  # bandwidth values, one per column
plotaxis = [0, 3, 6]  # index of the first axis in each row
main = ["Locations", "Random", "Regular"]  # row titles, matching the data set order below


# one row per data set, one column per bandwidth
for i, name, my_df in zip(
    plotaxis, main, [berlin_locations, berlin_random, berlin_regular]
):
    for j, s in enumerate(sigma):
        sns.kdeplot(
            x=my_df.geometry.x.values,
            y=my_df.geometry.y.values,
            levels=30,
            bw_method=s,
            fill=True,
            alpha=0.6,
            cmap="viridis_r",
            ax=axs[j + i],
            legend=True,
            cbar=True,
        )
        berlin_district.plot(ax=axs[j + i], color="none", edgecolor="grey")
        axs[j + i].set_axis_off()
        axs[j + i].set_title(f"{name}, sigma = {s}")