Berlin is famous worldwide for its nightlife. The data set we will work with stores the geographical location of clubs and bars located in Berlin. The data source is gathered from OpenStreetMap (OSM). The data was downloaded from GEOFABRIK on August 27, 2022, and contains OpenStreetMap data as of August 26, 2022 (see here).
We download the data and read the
gis_osm_pois_free_1.shp
file, which corresponds to
points of interests in Berlin, using the st_read()
function from the sf
package.
## build a temporary folder on disk
temp <- tempfile()
download_url <- "https://userpage.fu-berlin.de/soga/data/raw-data/spatial/"
zipfile <- "berlin-latest-free.shp.zip"
## download the file
download.file(paste0(download_url, zipfile), temp, mode = "wb")
## unzip the file(s)
unzip(temp)
## close file connection
unlink(temp)
library(sf)
berlin_features <- st_read("./gis_osm_pois_free_1.shp", quiet = TRUE)
Applying the str()
function gives us more information
about the structure of the data set.
str(berlin_features)
## Classes 'sf' and 'data.frame': 83123 obs. of 5 variables:
## $ osm_id : chr "16541597" "26735749" "26735753" "26735759" ...
## $ code : int 2907 2301 2006 2301 2301 2701 2307 2031 2724 2907 ...
## $ fclass : chr "camera_surveillance" "restaurant" "telephone" "restaurant" ...
## $ name : chr "Aral" "Aida" NA "Madame Ngo" ...
## $ geometry:sfc_POINT of length 83123; first list element: 'XY' num 13.3 52.5
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA
## ..- attr(*, "names")= chr [1:4] "osm_id" "code" "fclass" "name"
The data set of class sf
and data.frame
contains 83123 records (denoted as observations) represented as
rows. The 5 variables are represented by columns, with the first one
containing the osm_id
number. From the fifth variable
geometry
we know that the data set stores information about
83123 two-dimensional ('XY'
) surface points. The variable
fclass
contains the category related to a point, such as
“restaurant” or “camera_surveillance”.
By applying the unique()
function, we get an overview of
all the different categories represented in the data set.
unique(berlin_features$fclass)
## [1] "camera_surveillance" "restaurant" "telephone"
## [4] "tourist_info" "biergarten" "recycling_glass"
## [7] "memorial" "fire_station" "artwork"
## [10] "bank" "post_office" "toilet"
## [13] "cinema" "library" "police"
## [16] "post_box" "chemist" "motel"
## [19] "kiosk" "arts_centre" "cafe"
## [22] "fast_food" "car_sharing" "viewpoint"
## [25] "pub" "hotel" "supermarket"
## [28] "bakery" "pharmacy" "atm"
## [31] "convenience" "theatre" "hostel"
## [34] "car_rental" "gift_shop" "museum"
## [37] "guesthouse" "shelter" "stationery"
## [40] "bar" "travel_agent" "embassy"
## [43] "butcher" "beverages" "pitch"
## [46] "outdoor_shop" "playground" "kindergarten"
## [49] "attraction" "school" "monument"
## [52] "hunting_stand" "sports_centre" "tower"
## [55] "bench" "recycling_clothes" "optician"
## [58] "fountain" "nightclub" "drinking_water"
## [61] "computer_shop" "furniture_shop" "recycling"
## [64] "community_centre" "bicycle_shop" "doctors"
## [67] "veterinary" "bookshop" "clothes"
## [70] "florist" "doityourself" "laundry"
## [73] "comms_tower" "hairdresser" "university"
## [76] "video_shop" "picnic_site" "car_dealership"
## [79] "beauty_shop" "dentist" "vending_any"
## [82] "vending_machine" "mobile_phone_shop" "track"
## [85] "newsagent" "shoe_shop" "waste_basket"
## [88] "toy_shop" "town_hall" "sports_shop"
## [91] "vending_cigarette" "greengrocer" "water_well"
## [94] "mall" "jeweller" "bicycle_rental"
## [97] "department_store" "garden_centre" "water_tower"
## [100] "vending_parking" "theme_park" "nursing_home"
## [103] "car_wash" "graveyard" "college"
## [106] "recycling_paper" "water_works" "chalet"
## [109] "ruins" "clinic" "zoo"
## [112] "market_place" "general" "battlefield"
## [115] "stadium" "archaeological" "hospital"
## [118] "camp_site" "swimming_pool" "prison"
## [121] "observation_tower" "wayside_shrine" "water_mill"
## [124] "caravan_site" "golf_course" "courthouse"
## [127] "food_court" "dog_park" "wayside_cross"
## [130] "park"
There are 130 different categories.
Since we want to have a closer look at Berlin’s nightlife, we subset
our data to include only the category nightclub
.
berlin_locations <-
berlin_features[
berlin_features$fclass == "nightclub",
c("fclass", "name")
]
Now, we also want to include bars in our data set.
Exercise: Subset the data set
berlin_features
to only include the categoriesnightclub
andbar
using the boolean operatorOR
.
# your code here
berlin_locations <- NULL
berlin_locations <-
berlin_features[
berlin_features$fclass == "nightclub" | berlin_features$fclass == "bar", # "OR" and "|" are equivalent
c("fclass", "name")
]
We rename the column fclass
into loc
and
plot the relative frequency of the categories in our data set by
combining the table()
, the prop.table()
and
the barplot()
functions.
# rename column from fclass to loc
colnames(berlin_locations)[1] <- "loc"
# plot
barplot(prop.table(table(berlin_locations$loc)),
main = "Comparison between bars and nightsclubs in Berlin",
col = c("chocolate2", "khaki1"))
As expected, there are significantly more locations denoted as bar in our data set than locations denoted as nightclub. Let us compare the absolute figures.
Exercise: Determine the absolute numbers of the locations
bar
andnightclub
of our data set.
# your code here
table(berlin_locations$loc)
##
## bar nightclub
## 807 131
We visualize the data using the mapview
package.
library(mapview)
mapview(berlin_locations,
zcol = "loc",
label = berlin_locations$name,
legend = TRUE
)
Finally, we project the berlin_locations
data set to the
European Terrestrial Reference System 1989 (ETRS89/UTM
zone 32N).
Exercise: Transform the coordinate reference system of the data set
berlin_locations
into ERTS89/UTM zone 32N (EPSG 25832). Please visit the section about coordinate reference systems and transformations on the previous page if you want to look it up again.
# your code here
berlin_locations <- NULL
berlin_locations <- st_transform(berlin_locations, 25832)
st_geometry(berlin_locations)
## Geometry set for 938 features
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 780782 ymin: 5813041 xmax: 814411.7 ymax: 5841533
## Projected CRS: ETRS89 / UTM zone 32N
## First 5 geometries:
## POINT (793357 5832558)
## POINT (797981.4 5827060)
## POINT (797927.1 5827157)
## POINT (792808.9 5832719)
## POINT (799189.1 5830723)
To contrast particular statistical methods for spatial point pattern
analysis, we construct two artificial data sets. One data set,
berlin_random
, consists of point data randomly distributed
over the area of Berlin. The other data set,
berlin_regular
, consists of point data fairly regularly
distributed across Berlin.
Before we start with the data generation we have to set the
geographical extend. Hence, we load the the sf
object
containing Berlin’s district borders (berlin_sf
).
load(url("https://userpage.fu-berlin.de/soga/data/r-data/berlin_district.RData"))
plot(berlin_sf, main = "")
We take the bounding box specifications of the berlin_sf
variable to set the geographic extend for the data generation
process.
xmin <- as.vector(st_bbox(berlin_sf)[1])
ymin <- as.vector(st_bbox(berlin_sf)[2])
xmax <- as.vector(st_bbox(berlin_sf)[3])
ymax <- as.vector(st_bbox(berlin_sf)[4])
First, we generate random uniform numbers using the
runif()
function.
set.seed(1111) # set seed for reproducibility
## generate random uniform numbers
x_random <- runif(
n = nrow(berlin_locations) * 0.4,
min = xmin,
max = xmax
)
y_random <- runif(
n = nrow(berlin_locations) * 0.4,
min = ymin,
max = ymax
)
Then we combine the random number vector data in a
data.frame
object and plot it together with the district
border of Berlin.
berlin_random <- data.frame("x" = x_random, "y" = y_random)
plot(
x = berlin_random$x,
y = berlin_random$y,
main = "Random Points",
xlab = "x", ylab = "y", cex = .5,
asp = TRUE
)
plot(berlin_sf, add = TRUE, border = "blue", col = NA)
First, we generate a sequence of numbers using the seq()
function.
x_regular <- seq(
from = xmin, to = xmax,
length.out = sqrt(nrow(berlin_locations)) * 0.35
)
y_regular <- seq(
from = ymin, to = ymax,
length = sqrt(nrow(berlin_locations)) * 0.35
)
Then we create a data frame from all combinations of variables using
the expand.grid()
function.
regular <- expand.grid(x_regular, y_regular)
plot(regular, asp = TRUE)
Finally, we add some noise to the points using the
jitter()
function, combine the data into a
data.frame
object, berlin_regular
, and plot it
together with the district border of Berlin.
x_regular <- jitter(regular[, 1], factor = 1)
y_regular <- jitter(regular[, 2], factor = 1)
berlin_regular <- data.frame("x" = x_regular, "y" = y_regular)
plot(
x = berlin_regular$x,
y = berlin_regular$y,
main = "Regular Points",
xlab = "x", ylab = "y", cex = .5,
asp = TRUE
)
plot(berlin_sf, add = TRUE, border = "blue", col = NA)
spatstat
objectsIn spatstat
a spatial point pattern in two-dimensional
space is stored as an object of class ppp
. Such a point
pattern object contains spatial coordinates of the
points, marks attached to them (if any), a
window in which the points were observed and
optionally, the name of the unit of length for the
spatial coordinates.
A ppp
object is created using the ppp()
function. The argument window
must have a window object of
class owin
. Including information about the units of length
in which the \(x\) and \(y\) coordinates are recorded enables the
spatstat
package to print better reports and annotate the
axes in plots.
library(spatstat)
For this tutorial, we will build three ppp
objects: one
for the berlin_locations
data set, which we denote as
ppp_locations
, one for the berlin_random
data
set (ppp_random
) and one for the
berlin_regular
data set (ppp_regular
).
All three data sets share the same length unit (meter) in
which the \(x\) and \(y\) coordinates are recorded, as well as
the window in which the points were observed (the bounding box of the
Berlin district border berlin_sf
).
We start with assigning a placeholder variable which holds the string representation of the coordinate’s units.
coordinate_units <- "meters"
An object of class owin
(observation window)
represents a region or window in a two-dimensional space. The window may
be a rectangle, a polygon or polygons, with polygonal holes, or an
irregular shape represented by a binary pixel image mask.
Recall that berlin_sf
is of class sf
.
class(berlin_sf)
## [1] "sf" "data.frame"
To set the window to the polygon berlin_sf
, we first
need to coerce berlin_sf
into a Spatial
object, which is then coerced into an
owin' object. This may sound complicated, but by using the
as()`
function in sequence, the objective becomes fairly simple.
class(as(berlin_sf, "Spatial"))
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
class(as.owin(berlin_sf))
## [1] "owin"
Here we go, done! Now we assign the owin
object to a
variable denoted as win
.
win <- as.owin(berlin_sf)
win
## window: polygonal boundary
## enclosing rectangle: [777974.1, 823510.5] x [5808837, 5845580] units
ppp
objectFinally, we construct ppp
objects using the
ppp()
function from our three different data sets
berlin_locations
, berlin_regular
and
berlin_random
. Recall that berlin_locations
is
an sf
object, whereas berlin_regular
and
berlin_random
are data.frame
objects.
class(berlin_locations)
## [1] "sf" "data.frame"
class(berlin_regular)
## [1] "data.frame"
class(berlin_random)
## [1] "data.frame"
For the sf
object, we use the
st_coordinates()
function to extract the \(xy\)-coordinates in form of a
matrix
object. Let us review the first five point
coordinates in berlin_locations
in matrix format.
st_coordinates(berlin_locations)[1:5, ]
## X Y
## [1,] 793357.0 5832558
## [2,] 797981.4 5827060
## [3,] 797927.1 5827157
## [4,] 792808.9 5832719
## [5,] 799189.1 5830723
Now, let us construct the ppp
object.
ppp_locations <- ppp(
x = st_coordinates(berlin_locations)[, 1],
y = st_coordinates(berlin_locations)[, 2],
window = win
)
## Warning: 4 points were rejected as lying outside the specified window
Now, we assign the name of the unit of length for the spatial
coordinates to the ppp
object and plot it.
unitname(ppp_locations) <- coordinate_units
koords = c(5809500, 5850000)
ppp_locations$window$yrange = koords
par(mar = c(0, 0, 1.5, 1.5),
cex.main = 2)
plot(ppp_locations)
## Warning in plot.ppp(ppp_locations): 4 illegal points also plotted
Note that we received a warning that 4 points were rejected as
lying outside the specified window. Do not worry, the rejected
points in a point pattern will be ignored by all other functions except
for plot
. They are still accessible and stored as an
attribute of the point pattern called rejects
.
It is difficult to spot the rejected points in the plot above. Thus,
let us write some code to highlight them. We use the
gIntersects()
function from the rgeos
package.
points_intersect <- lengths(st_intersects(berlin_locations, berlin_sf)) > 0
plot(berlin_sf$geometry)
points(as(berlin_locations, "Spatial"))
plot(berlin_locations[!points_intersect, ]$geometry,
col = "red", cex = 3, pch = 3, add = T)
Now we can spot rejected points easily as they are marked with a red
cross. If we want to get rid of rejected points, we have to apply a
small trick: Applying the as.ppp()
function on the
ppp
object removes rejected points.
ppp_locations <- as.ppp(ppp_locations)
koords = c(5809500, 5850000)
ppp_locations$window$yrange = koords
par(mar = c(0, 0, 1.5, 1.5),
cex.main = 2)
plot(ppp_locations)
No more warnings, great!
We continue constructing ppp
objects for the
berlin_random
data set and the berlin_regular
data set. Since both data sets are data.frame
objects, we
simply index their columns with the \(\$\)-notation.
Exercise: Construct a
ppp
object for theberlin_random
data set and one forthe berlin_regular
data set, assign a unit for the coordinates and remove rejected points.
# your code here...
ppp_random <- NULL
ppp_regular <- NULL
## create ppp object
ppp_random <- ppp(
x = berlin_random$x,
y = berlin_random$y,
window = win
)
## Warning: 184 points were rejected as lying outside the specified window
## assign unit
unitname(ppp_random) <- coordinate_units
## removal of rejected points
ppp_random <- as.ppp(ppp_random)
koords = c(5809500, 5850000)
ppp_random$window$yrange = koords
par(mar = c(0, 0, 1.5, 1.5),
cex.main = 2)
plot(ppp_random)
## create ppp object
ppp_regular <- ppp(
x = berlin_regular$x,
y = berlin_regular$y,
window = win
)
## Warning: 70 points were rejected as lying outside the specified window
## assign unit
unitname(ppp_regular) <- coordinate_units
## removal of rejected points
ppp_regular <- as.ppp(ppp_regular)
koords = c(5809500, 5850000)
ppp_regular$window$yrange = koords
par(mar = c(0, 0, 1.5, 1.5),
cex.main = 2)
plot(ppp_regular)
Finally, we save all three data sets for further usage.
save(
file = "ppp_data_berlin.RData",
ppp_locations, ppp_random, ppp_regular, berlin_locations
)
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.