Intensity is the average density of points or, in other words, the expected number of points per unit area. The intensity may be constant (uniform) or may vary from location to location (non-uniform or inhomogeneous).
One approach to evaluating the intensity is to divide the study area into quadrats and count the number of points falling into each quadrat. If the points have uniform intensity, and are completely random, then the quadrat counts should be Poisson random numbers with constant mean. We may test this hypothesis using the \(\chi^2\) goodness-of-fit test statistic
\[\chi^2 = \sum\frac{(\text{observed}-\text{expected})^2}{\text{expected}}.\]
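As a quick illustration of how the statistic is computed, here is a minimal sketch with made-up quadrat counts (the numbers are purely hypothetical); under CSR, the expected count is the same in every quadrat:
observed <- c(12, 8, 15, 9, 11, 5) # hypothetical quadrat counts
expected <- rep(mean(observed), length(observed)) # constant mean under CSR
chi2 <- sum((observed - expected)^2 / expected)
chi2 # 6
# compare against a chi-squared distribution with length(observed) - 1 df
pchisq(chi2, df = length(observed) - 1, lower.tail = FALSE) # approx. 0.31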
In R, we use the quadratcount() and quadrat.test() functions of the spatstat package. First, we load the Berlin city data sets we created in the previous section.
load(url("https://userpage.fu-berlin.de/soga/data/r-data/ppp_data_berlin.RData"))
load(url("https://userpage.fu-berlin.de/soga/data/r-data/berlin_district.RData"))
library(spatstat)
par(mar = c(0, 0, 1.5, 1.5),
cex.main = 2)
qc_locations <- quadratcount(ppp_locations, nx = 3, ny = 4)
plot(ppp_locations, pch = 3, cex = 0.6)
plot(qc_locations, add = TRUE, textargs = list(col = "red"))
quadrat.test(qc_locations)
##
## Chi-squared test of CSR using quadrat counts
##
## data:
## X2 = 1659.4, df = 10, p-value < 2.2e-16
## alternative hypothesis: two.sided
##
## Quadrats: 11 tiles (irregular windows)
## Tessellation is marked
The very low p-value indicates that the intensity is non-constant. Such a point process is referred to as an inhomogeneous Poisson process (IPP), and its intensity can be estimated non-parametrically, e.g. with kernel smoothing (Bivand et al. 2008).
Let us repeat this approach for the ppp_random data set.
Exercise: Evaluate the intensity of the data set ppp_random by dividing the area into quadrats, counting the points and performing the \(\chi^2\)-test. Which kind of Poisson point process do you expect?
## Your code here...
qc_random <- NULL
## configure figure margins accordingly
par(mar = c(0, 0, 1.5, 1.5),
cex.main = 2)
## count points in quadrats
qc_random <- quadratcount(ppp_random, nx = 3, ny = 4)
plot(ppp_random, pch = 3, cex = 0.6)
plot(qc_random, add = TRUE, textargs = list(col = "red"))
## perform Chi-squared test
quadrat.test(qc_random)
## Warning: Some expected counts are small; chi^2 approximation may be inaccurate
##
## Chi-squared test of CSR using quadrat counts
##
## data:
## X2 = 12.402, df = 10, p-value = 0.5181
## alternative hypothesis: two.sided
##
## Quadrats: 11 tiles (irregular windows)
## Tessellation is marked
The high p-value \((p > 0.05)\) gives no evidence against complete spatial randomness; the pattern is consistent with a homogeneous Poisson process (HPP).
The intensity of homogeneous Poisson processes is constant with \(\lambda(u) = \lambda = \frac{n}{\vert A \vert}\), where \(n\) is the number of points observed in region \(A\) and \(\vert A \vert\) is the area of region \(A\).
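In spatstat this estimate is a one-liner. The following minimal sketch, assuming the ppp_random object from above is still loaded, computes it directly:
n <- npoints(ppp_random) # number of points
A <- area.owin(Window(ppp_random)) # area of the observation window
n / A # estimated constant intensity
# the intensity() function returns the same value
intensity(ppp_random)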
There are different ways to estimate the intensity of an IPP. One of them is non-parametric estimation by means of kernel smoothing. If we observe \(n\) points \(\{x_i\}_{i=1}^n\), then the form of a kernel smoothing estimator is
\[\hat{\lambda}(x) = \frac{1}{h^2}\sum_{i=1}^n\frac{\kappa\left(\frac{x-x_i}{h}\right)}{q(\Vert x \Vert)}\text{,} \]
where \(x_i \in \{x_1, x_2,..., x_n\}\) are the observed points, \(h\) is the bandwidth, \(q(\Vert x \Vert)\) is a border correction that compensates for observations missing due to edge effects, and \(\kappa(u)\) is a bivariate, symmetric kernel function.
A common choice, implemented in R, is the two-dimensional quartic kernel function:
\[ \kappa(u) = \begin{cases} \frac{3}{\pi}(1-\Vert u \Vert^2)^2 & \text{if } \Vert u \Vert < 1 \\ 0 & \text{otherwise} \end{cases} \]
where \(\Vert u \Vert^2=u_1^2+u_2^2\) is the squared norm of point \(u=(u_1, u_2)\).
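To make the definition concrete, here is a minimal sketch of the quartic kernel as a plain R helper function (our own illustration, not a spatstat function):
## two-dimensional quartic kernel evaluated at u = (u1, u2)
quartic <- function(u1, u2) {
  r2 <- u1^2 + u2^2 # squared norm of u
  ifelse(r2 < 1, (3 / pi) * (1 - r2)^2, 0)
}
quartic(0, 0) # maximum at the origin: 3/pi, approx. 0.955
quartic(1, 0) # zero on and outside the unit disc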
There is no general rule for selecting the bandwidth \(h\), which governs the level of smoothing. Small bandwidths result in spikier maps, and large bandwidths result in smoother maps. It is reasonable to use several values depending on the process under consideration and then to choose a value that seems plausible (Bivand et al. 2008).
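If a data-driven choice is preferred, spatstat also provides bandwidth selectors, for example bw.diggle(), which chooses the bandwidth by cross-validation. A short sketch, again assuming ppp_random is available (density.ppp() is introduced below):
bw <- bw.diggle(ppp_random) # cross-validated bandwidth
bw
plot(density.ppp(ppp_random, sigma = bw), main = "sigma from bw.diggle")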
A kernel density plot is the graphical representation of kernel smoothing. Several functions are available in R and contributed packages to visualise kernel density plots. Here, we use the density.ppp() function from the spatstat package.
The density.ppp() function estimates the intensity function of the point process that generated the point pattern data. The argument sigma is taken as the standard deviation of the isotropic Gaussian kernel, corresponding to the bandwidth parameter. A small standard deviation results in a spikier density plot, and a large standard deviation results in a smoother density plot.
We code a loop to apply the density.ppp() function to each of our three data sets ppp_locations, ppp_regular and ppp_random, with three different values for sigma \((1000, 2500, 5000)\).
par(mfrow = c(3, 3), mar = c(0, 0, 1, 2))
sigma <- c(1000, 2500, 5000)
data <- list(ppp_locations, ppp_regular, ppp_random)
main <- c("Locations", "Regular", "Random")
for (i in 1:3) {
  for (j in 1:3) {
    ## kernel density estimate with bandwidth sigma[j]
    ds <- density.ppp(data[[i]], sigma = sigma[j])
    plot(ds, main = paste0(main[i], ", sigma: ", sigma[j]))
    ## overlay the observed points
    plot(data[[i]], add = TRUE, cex = 0.01)
  }
}
As an alternative to the non-parametric estimation of the intensity by means of kernel smoothing, we may propose a specific functional form for the intensity whose parameters are estimated by maximizing the likelihood of the point process (Baddeley 2010).
The log-likelihood for the homogeneous Poisson process with intensity \(\lambda\) is
\[\log L(\lambda; x) = n(x)\log \lambda-\lambda \,\text{area}(W)\;,\] where \(n(x)\) is the number of points in the data set \(x\) and \(W\) is the observation window. The maximum likelihood estimator of \(\lambda\) is \[\hat{\lambda} = \frac{n(x)}{\text{area}(W)}\;.\] The variance of \(\hat{\lambda}\) is
\[\text{var}[\hat{\lambda}] = \frac{\lambda}{\text{area}(W)}\]
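Plugging \(\hat{\lambda}\) into this expression yields a simple standard error estimate. A minimal sketch, assuming ppp_locations is loaded:
n <- npoints(ppp_locations)
A <- area.owin(Window(ppp_locations))
lambda_hat <- n / A # maximum likelihood estimate
sqrt(lambda_hat / A) # plug-in standard error of lambda-hat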
For an inhomogeneous Poisson process with intensity function \(\lambda_\theta(u)\) depending on a parameter \(\theta\), the log-likelihood for \(\theta\) is
\[\log L(\theta;x) = \sum_{i=1}^n \log \lambda_\theta(x_i)- \int_W \lambda_\theta(u)du\;,\]
where \(\int_W \lambda_\theta(u)du\) is the expected number of cases of the IPP with intensity \(\lambda_\theta(u)\) in region \(W\).
The maximum likelihood estimate \(\hat{\theta}\) is generally not analytically tractable, so it must be computed using numerical algorithms such as Newton's method (Baddeley 2010).
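To illustrate the idea, the sketch below fits a simple log-linear intensity \(\lambda_\theta((x,y)) = \exp(\theta_0 + \theta_1 x)\) to a simulated pattern by numerically maximizing the log-likelihood, approximating the integral over \(W\) on a pixel grid. This is only a crude illustration; the ppm() function introduced below uses a more sophisticated quadrature scheme.
set.seed(1)
W <- owin(c(0, 1), c(0, 1))
X <- rpoispp(function(x, y) exp(4 + 2 * x), win = W) # simulate an IPP
## grid of cell midpoints for approximating the integral over W
g <- expand.grid(x = seq(0.005, 0.995, by = 0.01),
                 y = seq(0.005, 0.995, by = 0.01))
negloglik <- function(theta) {
  term1 <- sum(theta[1] + theta[2] * X$x) # sum of log-intensities at the points
  term2 <- sum(exp(theta[1] + theta[2] * g$x)) * 0.01^2 # integral of the intensity
  -(term1 - term2)
}
optim(c(0, 0), negloglik)$par # should be close to (4, 2)
coef(ppm(X ~ x)) # compare with spatstat's fit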
To fit a model with the spatstat package, we need to describe how the point pattern intensity \(\lambda(u)\) depends on the spatial location \(u\) or on spatial covariates \(Z(u)\). We assume that the intensity function \(\lambda_\theta(u)\) is log-linear in the parameter \(\theta\):
\[\log\lambda_\theta(u)= \theta \cdot S(u)\;,\] where \(S(u)\) is a real-valued or vector-valued function of location \(u\). \(S(u)\) could be a function of the spatial coordinates of \(u\), or an observed covariate \(Z(u)\), or a mixture of both.
The spatstat package provides the ppm() function (point process model) to fit a point process model to an observed point pattern. The first argument of the ppm() function is a formula in the R language describing the spatial trend model to be fitted. It has the general form pattern ~ trend, where the left-hand side, pattern, is usually the name of a spatial point pattern to which the model should be fitted, and the right-hand side, trend, is an expression specifying the spatial trend of the model.
The table outlines the basic formulations of a point process model.
\[ \begin{array}{ll} \mathtt{R} & \text{Intensity} \\ \hline \mathtt{ppm(P \sim 1)} & \log \lambda((x,y)) = \theta_0\\ \mathtt{ppm(P \sim x)} & \log \lambda((x,y)) = \theta_0+\theta_1x\\ \mathtt{ppm(P \sim x+y)} & \log \lambda((x,y)) = \theta_0+\theta_1x+\theta_2y\\ \mathtt{ppm(P \sim polynom(x,y,3))} & \text{3rd order polynomial in } x \text{ and } y\\ \mathtt{ppm(P \sim I(y > 23))} & \text{different constants above and below the line } y = 23\\ \end{array} \]
The ppm() function documentation is quite extensive: type help(ppm) into your console for further information.
Now let us fit some models.
A homogeneous Poisson model, e.g. \(\lambda_\theta((x,y)) = \exp(\theta_0)\), can be fitted by
base_loc <- ppm(ppp_locations ~1)
base_loc
## Stationary Poisson process
## Fitted to point pattern dataset 'ppp_locations'
## Intensity: 1.04584e-06
## Estimate S.E. CI95.lo CI95.hi Ztest Zval
## log(lambda) -13.77069 0.032721 -13.83482 -13.70656 *** -420.8518
An inhomogeneous Poisson model with an intensity that is log-linear, that means linear in the Cartesian coordinates on the log scale, e.g. \(\lambda_\theta((x, y)) = \exp(\theta_0 + \theta_1x + \theta_2y)\), is fitted by
loglin_loc <- ppm(ppp_locations ~ x + y)
We can use the coef() function to extract the model coefficients.
coef(loglin_loc)
## (Intercept) x y
## -1.493209e+02 2.852400e-06 2.287357e-05
This specific model can be written in the form \(\lambda_\theta((x, y)) = \exp(-149.3209128 + 2.8524004\times 10^{-6}x + 2.2873568\times 10^{-5}y)\).
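We can check this form by evaluating the intensity at a single location by hand and comparing it with the model prediction. The coordinates below are hypothetical and assume the projected coordinate system of the Berlin data; adjust them to lie inside your observation window:
theta <- coef(loglin_loc)
x0 <- 390000; y0 <- 5820000 # hypothetical location within the window
exp(theta[1] + theta[2] * x0 + theta[3] * y0) # intensity computed by hand
predict(loglin_loc, locations = data.frame(x = x0, y = y0)) # model prediction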
An inhomogeneous Poisson model with an intensity that is log-quadratic, that means quadratic in the Cartesian coordinates \(x\) and \(y\) on the log scale, e.g. \(\lambda_\theta((x, y)) = \exp(\theta_0 + \theta_1x + \theta_2y+ \theta_3x^2 + \theta_4y^2 +\theta_5x y)\), is fitted by
logquad_loc <- ppm(ppp_locations ~ polynom(x, y, 2))
coef(logquad_loc)
## (Intercept) x y I(x^2) I(x * y)
## -1.379287e+06 8.132716e-02 4.622414e-01 -2.583714e-08 -6.873808e-09
## I(y^2)
## -3.919100e-08
So far, we have fitted three different models for the ppp_locations data set. For the purpose of comparison, we repeat this procedure and fit the same models to the random point data set ppp_random.
Exercise: Fit the previous three models on the random point data set ppp_random and compare them with the models for the ppp_locations data set.
## Your code here...
base_rand <- NULL
loglin_rand <- NULL
logquad_rand <- NULL
base_rand <- ppm(ppp_random ~1)
base_rand
## Stationary Poisson process
## Fitted to point pattern dataset 'ppp_random'
## Intensity: 2.138709e-07
## Estimate S.E. CI95.lo CI95.hi Ztest Zval
## log(lambda) -15.35789 0.07235746 -15.49971 -15.21608 *** -212.2503
loglin_rand <- ppm(ppp_random ~ x + y)
coef(loglin_rand)
## (Intercept) x y
## -1.230230e+02 -1.440396e-06 1.867609e-05
logquad_rand <- ppm(ppp_random ~ polynom(x, y, 2))
coef(logquad_rand)
## (Intercept) x y I(x^2) I(x * y)
## -1.621871e+04 7.177928e-03 4.559182e-03 8.747276e-11 -1.256704e-09
## I(y^2)
## -3.034431e-10
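Since the three models for each data set are nested, they can also be compared formally. As a minimal sketch, assuming the fitted objects from above, a likelihood ratio test via anova() or the AIC could be used:
## likelihood ratio tests of the nested Poisson models
anova(base_rand, loglin_rand, logquad_rand, test = "Chi")
## the same comparison by AIC (lower is better)
AIC(base_rand); AIC(loglin_rand); AIC(logquad_rand)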
Now that we have fitted several models on two different data sets, we can use these models for prediction.
The value returned by the model-fitting function ppm() is an object of class ppm representing the fitted model, analogous to other model objects in R. Thus, standard operations such as plot(), predict() or coef() can be applied.
We combine the plot() command with the predict() command to visualize the log-linear and the log-quadratic models for the ppp_random data set. We add the point coordinates to enhance the visualization.
par(mar = c(0, 0, 1.5, 1.5),
mfrow = c(1, 2),
cex.main = 2)
plot(predict(loglin_rand), main = "log-linear")
points(ppp_random, pch = 16, cex = 0.2)
plot(predict(logquad_rand), main = "log-quadratic")
points(ppp_random, pch = 16, cex = 0.2)
Now we repeat the procedure, but this time we use the log-linear and the log-quadratic models for the ppp_locations data set. Again, we add the point coordinates to enhance the visualization.
par(mar = c(0, 0, 1.5, 1.5),
mfrow = c(1, 2),
cex.main = 2)
plot(predict(loglin_loc), main = "log-linear")
points(ppp_locations, pch = 16, cex = 0.2)
plot(predict(logquad_loc), main = "log-quadratic")
points(ppp_locations, pch = 16, cex = 0.2)
Exercise: Consider the lansing data set, which includes the locations and botanical classification of trees in the Lansing Woods. Model the intensity of the maple trees by fitting the model \[\lambda_\theta((x, y)) = \exp(\theta_0 + \theta_1x + \theta_2y+\theta_3x y+ \theta_4x^2 + \theta_5y^2 +\theta_6x^2 y+\theta_7xy^2 + \theta_8x^3 + \theta_9y^3).\] Plot the model predictions and report the model coefficients. The plot() function for objects of class ppm is described in more detail in the spatstat documentation (see help(plot.ppm)). Please make sure the selected documentation version fits your local package version!
The data come from an investigation by D.J. Gerrard of a 924 ft x 924 ft (19.6 acre) area in Lansing Woods, Clinton County, Michigan, USA. The data give the locations of 2251 trees and their botanical classification (hickories, maples, red oaks, white oaks, black oaks and miscellaneous trees). The original plot size (924 x 924 feet) has been rescaled to the unit square.
data("lansing")
lansing
## Marked planar point pattern: 2251 points
## Multitype, with levels = blackoak, hickory, maple, misc, redoak, whiteoak
## window: rectangle = [0, 1] x [0, 1] units (one unit = 924 feet)
Subsetting: First, subset the data such that it only includes the maple trees. Do not forget to remove unused botanical classifications (denoted as marks) by using the droplevels() function. Then, plot the data set.
## Your code here...
maple <- lansing[lansing$marks == "maple", ] # keep only the maple trees
maple$marks <- droplevels(maple$marks) # drop unused mark levels
maple <- unmark(maple) # remove the remaining (single-level) marks
plot(maple, pch = 16, cex = 0.5)
Model fitting
## Your code here...
m1 <- NULL
m1 <- ppm(maple ~ polynom(x, y, 3))
Model prediction
## Your code here...
plot(m1, pause = FALSE, se = FALSE)
coef(m1)
## (Intercept) x y I(x^2) I(x * y) I(y^2)
## 6.567611 3.167834 -7.194997 -7.822896 18.774315 6.782554
## I(x^3) I(x^2 * y) I(x * y^2) I(y^3)
## 5.814738 -19.079815 3.469783 -7.614337
After fitting a point process model to a point pattern data set, we should check that the model is a good fit and that each component assumption of the model was appropriate.
An informal, non-probabilistic approach is to examine the model residuals.
\[\text{residual} = \text{observed} - \text{fitted}\]
If the model is a good fit, the residuals should be centered around zero.
The diagnose.ppm() function in the spatstat package provides a convenient way to visualize the residuals by computing a smoothed residual field \(s(u)\):
\[s(u) = \hat\lambda(u)-\lambda^*(u)\;,\] where \(\hat\lambda(u)\) is the non-parametric, kernel estimate of the intensity
\[\hat\lambda(u) = e(u)\sum_{i=1}^{n(x)}\kappa(u-x_i)\;,\] while \(\lambda^*(u)\) is a correspondingly-smoothed version of the parametric estimate of the intensity according to the fitted model.
\[\lambda^*(u) = e(u)\int_W\kappa(u-v)\lambda_\theta(v)dv\]
Here, \(\kappa\) is the smoothing kernel and \(e(u)\) is an edge correction. The difference \(s(u) = \hat\lambda(u)-\lambda^*(u)\) should be approximately zero if the model is true.
For the purpose of demonstration, we apply the diagnose.ppm() function to the four models base_loc, logquad_loc, base_rand and loglin_rand.
par(mar = c(0, 0, 1.5, 1.5),
mfrow = c(1, 2),
cex.main = 1.5)
diagnose.ppm(base_loc, which = "smooth", main = "Base model for locations")
## Model diagnostics (raw residuals)
## Diagnostics available:
## smoothed residual field
## range of smoothed field = [-1.03e-06, 3.248e-06]
diagnose.ppm(logquad_loc, which = "smooth", main = "Log-quadratic model for locations")
## Model diagnostics (raw residuals)
## Diagnostics available:
## smoothed residual field
## range of smoothed field = [-1.796e-07, 2.893e-07]
par(mar = c(0, 0, 1.5, 1.5),
mfrow = c(1, 2),
cex.main = 1.5)
diagnose.ppm(base_rand, which = "smooth", main = "Base model for random points")
## Model diagnostics (raw residuals)
## Diagnostics available:
## smoothed residual field
## range of smoothed field = [-8.443e-08, 9.404e-08]
diagnose.ppm(loglin_rand, which = "smooth", main = "Log-linear model for random points")
## Model diagnostics (raw residuals)
## Diagnostics available:
## smoothed residual field
## range of smoothed field = [-8.357e-08, 7.183e-08]
Exercise: Create a log-quadratic model for the maple data set and visualize the residuals of this and the log-cubic model from the previous exercise.
## Your code here...
m2 <- NULL
# plot
par(mfrow = c(1, 2), mar = c(1, 1, 2, 2))
## Your code here...
## log-quadratic model
m2 <- ppm(maple ~ polynom(x, y, 2))
par(mar = c(0, 0, 1.5, 1.5),
mfrow = c(1, 2),
cex.main = 2)
diagnose.ppm(m2, which = "smooth", main = "Log-quadratic")
## Model diagnostics (raw residuals)
## Diagnostics available:
## smoothed residual field
## range of smoothed field = [-158, 304.7]
diagnose.ppm(m1, which = "smooth", main = "Log-cubic")
## Model diagnostics (raw residuals)
## Diagnostics available:
## smoothed residual field
## range of smoothed field = [-131.1, 142.2]
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.