Berlin is famous worldwide for its nightlife. The data set we will work with stores the geographical location of clubs and bars located in Berlin. The data source is gathered from OpenStreetMap (OSM). The data was downloaded from GEOFABRIK on August 27, 2022, and contains OpenStreetMap data as of August 26, 2022 (see here).
We download the data and read the
gis_osm_pois_free_1.shp file, which corresponds to
points of interests in Berlin, using the st_read()
function from the sf package.
## build a temporary folder on disk
temp <- tempfile()
download_url <- "https://userpage.fu-berlin.de/soga/data/raw-data/spatial/"
zipfile <- "berlin-latest-free.shp.zip"
## download the file
download.file(paste0(download_url, zipfile), temp, mode = "wb")
## unzip the file(s)
unzip(temp)
## close file connection
unlink(temp)
library(sf)
berlin_features <- st_read("./gis_osm_pois_free_1.shp", quiet = TRUE)
Applying the str() function gives us more information
about the structure of the data set.
str(berlin_features)
## Classes 'sf' and 'data.frame': 83123 obs. of 5 variables:
## $ osm_id : chr "16541597" "26735749" "26735753" "26735759" ...
## $ code : int 2907 2301 2006 2301 2301 2701 2307 2031 2724 2907 ...
## $ fclass : chr "camera_surveillance" "restaurant" "telephone" "restaurant" ...
## $ name : chr "Aral" "Aida" NA "Madame Ngo" ...
## $ geometry:sfc_POINT of length 83123; first list element: 'XY' num 13.3 52.5
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA
## ..- attr(*, "names")= chr [1:4] "osm_id" "code" "fclass" "name"
The data set of class sf and data.frame
contains 83123 records (denoted as observations) represented as
rows. The 5 variables are represented by columns, with the first one
containing the osm_id number. From the fifth variable
geometry we know that the data set stores information about
83123 two-dimensional ('XY') surface points. The variable
fclass contains the category related to a point, such as
“restaurant” or “camera_surveillance”.
By applying the unique() function, we get an overview of
all the different categories represented in the data set.
unique(berlin_features$fclass)
## [1] "camera_surveillance" "restaurant" "telephone"
## [4] "tourist_info" "biergarten" "recycling_glass"
## [7] "memorial" "fire_station" "artwork"
## [10] "bank" "post_office" "toilet"
## [13] "cinema" "library" "police"
## [16] "post_box" "chemist" "motel"
## [19] "kiosk" "arts_centre" "cafe"
## [22] "fast_food" "car_sharing" "viewpoint"
## [25] "pub" "hotel" "supermarket"
## [28] "bakery" "pharmacy" "atm"
## [31] "convenience" "theatre" "hostel"
## [34] "car_rental" "gift_shop" "museum"
## [37] "guesthouse" "shelter" "stationery"
## [40] "bar" "travel_agent" "embassy"
## [43] "butcher" "beverages" "pitch"
## [46] "outdoor_shop" "playground" "kindergarten"
## [49] "attraction" "school" "monument"
## [52] "hunting_stand" "sports_centre" "tower"
## [55] "bench" "recycling_clothes" "optician"
## [58] "fountain" "nightclub" "drinking_water"
## [61] "computer_shop" "furniture_shop" "recycling"
## [64] "community_centre" "bicycle_shop" "doctors"
## [67] "veterinary" "bookshop" "clothes"
## [70] "florist" "doityourself" "laundry"
## [73] "comms_tower" "hairdresser" "university"
## [76] "video_shop" "picnic_site" "car_dealership"
## [79] "beauty_shop" "dentist" "vending_any"
## [82] "vending_machine" "mobile_phone_shop" "track"
## [85] "newsagent" "shoe_shop" "waste_basket"
## [88] "toy_shop" "town_hall" "sports_shop"
## [91] "vending_cigarette" "greengrocer" "water_well"
## [94] "mall" "jeweller" "bicycle_rental"
## [97] "department_store" "garden_centre" "water_tower"
## [100] "vending_parking" "theme_park" "nursing_home"
## [103] "car_wash" "graveyard" "college"
## [106] "recycling_paper" "water_works" "chalet"
## [109] "ruins" "clinic" "zoo"
## [112] "market_place" "general" "battlefield"
## [115] "stadium" "archaeological" "hospital"
## [118] "camp_site" "swimming_pool" "prison"
## [121] "observation_tower" "wayside_shrine" "water_mill"
## [124] "caravan_site" "golf_course" "courthouse"
## [127] "food_court" "dog_park" "wayside_cross"
## [130] "park"
There are 130 different categories.
Since we want to have a closer look at Berlin’s nightlife, we subset
our data to include only the category nightclub.
berlin_locations <-
berlin_features[
berlin_features$fclass == "nightclub",
c("fclass", "name")
]
Now, we also want to include bars in our data set.
Exercise: Subset the data set
berlin_featuresto only include the categoriesnightclubandbarusing the boolean operatorOR.
# your code here
berlin_locations <- NULL
berlin_locations <-
berlin_features[
berlin_features$fclass == "nightclub" | berlin_features$fclass == "bar", # "OR" and "|" are equivalent
c("fclass", "name")
]
We rename the column fclass into loc and
plot the relative frequency of the categories in our data set by
combining the table(), the prop.table() and
the barplot() functions.
# rename column from fclass to loc
colnames(berlin_locations)[1] <- "loc"
# plot
barplot(prop.table(table(berlin_locations$loc)),
main = "Comparison between bars and nightsclubs in Berlin",
col = c("chocolate2", "khaki1"))
As expected, there are significantly more locations denoted as bar in our data set than locations denoted as nightclub. Let us compare the absolute figures.
Exercise: Determine the absolute numbers of the locations
barandnightclubof our data set.
# your code here
table(berlin_locations$loc)
##
## bar nightclub
## 807 131
We visualize the data using the mapview package.
library(mapview)
mapview(berlin_locations,
zcol = "loc",
label = berlin_locations$name,
legend = TRUE
)
Finally, we project the berlin_locations data set to the
European Terrestrial Reference System 1989 (ETRS89/UTM
zone 32N).
Exercise: Transform the coordinate reference system of the data set
berlin_locationsinto ERTS89/UTM zone 32N (EPSG 25832). Please visit the section about coordinate reference systems and transformations on the previous page if you want to look it up again.
# your code here
berlin_locations <- NULL
berlin_locations <- st_transform(berlin_locations, 25832)
st_geometry(berlin_locations)
## Geometry set for 938 features
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 780782 ymin: 5813041 xmax: 814411.7 ymax: 5841533
## Projected CRS: ETRS89 / UTM zone 32N
## First 5 geometries:
## POINT (793357 5832558)
## POINT (797981.4 5827060)
## POINT (797927.1 5827157)
## POINT (792808.9 5832719)
## POINT (799189.1 5830723)
To contrast particular statistical methods for spatial point pattern
analysis, we construct two artificial data sets. One data set,
berlin_random, consists of point data randomly distributed
over the area of Berlin. The other data set,
berlin_regular, consists of point data fairly regularly
distributed across Berlin.
Before we start with the data generation we have to set the
geographical extend. Hence, we load the the sf object
containing Berlin’s district borders (berlin_sf).
load(url("https://userpage.fu-berlin.de/soga/data/r-data/berlin_district.RData"))
plot(berlin_sf, main = "")
We take the bounding box specifications of the berlin_sf
variable to set the geographic extend for the data generation
process.
xmin <- as.vector(st_bbox(berlin_sf)[1])
ymin <- as.vector(st_bbox(berlin_sf)[2])
xmax <- as.vector(st_bbox(berlin_sf)[3])
ymax <- as.vector(st_bbox(berlin_sf)[4])
First, we generate random uniform numbers using the
runif() function.
set.seed(1111) # set seed for reproducibility
## generate random uniform numbers
x_random <- runif(
n = nrow(berlin_locations) * 0.4,
min = xmin,
max = xmax
)
y_random <- runif(
n = nrow(berlin_locations) * 0.4,
min = ymin,
max = ymax
)
Then we combine the random number vector data in a
data.frame object and plot it together with the district
border of Berlin.
berlin_random <- data.frame("x" = x_random, "y" = y_random)
plot(
x = berlin_random$x,
y = berlin_random$y,
main = "Random Points",
xlab = "x", ylab = "y", cex = .5,
asp = TRUE
)
plot(berlin_sf, add = TRUE, border = "blue", col = NA)
First, we generate a sequence of numbers using the seq()
function.
x_regular <- seq(
from = xmin, to = xmax,
length.out = sqrt(nrow(berlin_locations)) * 0.35
)
y_regular <- seq(
from = ymin, to = ymax,
length = sqrt(nrow(berlin_locations)) * 0.35
)
Then we create a data frame from all combinations of variables using
the expand.grid() function.
regular <- expand.grid(x_regular, y_regular)
plot(regular, asp = TRUE)
Finally, we add some noise to the points using the
jitter() function, combine the data into a
data.frame object, berlin_regular, and plot it
together with the district border of Berlin.
x_regular <- jitter(regular[, 1], factor = 1)
y_regular <- jitter(regular[, 2], factor = 1)
berlin_regular <- data.frame("x" = x_regular, "y" = y_regular)
plot(
x = berlin_regular$x,
y = berlin_regular$y,
main = "Regular Points",
xlab = "x", ylab = "y", cex = .5,
asp = TRUE
)
plot(berlin_sf, add = TRUE, border = "blue", col = NA)
spatstat objectsIn spatstat a spatial point pattern in two-dimensional
space is stored as an object of class ppp. Such a point
pattern object contains spatial coordinates of the
points, marks attached to them (if any), a
window in which the points were observed and
optionally, the name of the unit of length for the
spatial coordinates.
A ppp object is created using the ppp()
function. The argument window must have a window object of
class owin. Including information about the units of length
in which the \(x\) and \(y\) coordinates are recorded enables the
spatstat package to print better reports and annotate the
axes in plots.
library(spatstat)
For this tutorial, we will build three ppp objects: one
for the berlin_locations data set, which we denote as
ppp_locations, one for the berlin_random data
set (ppp_random) and one for the
berlin_regular data set (ppp_regular).
All three data sets share the same length unit (meter) in
which the \(x\) and \(y\) coordinates are recorded, as well as
the window in which the points were observed (the bounding box of the
Berlin district border berlin_sf).
We start with assigning a placeholder variable which holds the string representation of the coordinate’s units.
coordinate_units <- "meters"
An object of class owin (observation window)
represents a region or window in a two-dimensional space. The window may
be a rectangle, a polygon or polygons, with polygonal holes, or an
irregular shape represented by a binary pixel image mask.
Recall that berlin_sf is of class sf.
class(berlin_sf)
## [1] "sf" "data.frame"
To set the window to the polygon berlin_sf, we first
need to coerce berlin_sf into a Spatial
object, which is then coerced into an
owin' object. This may sound complicated, but by using theas()`
function in sequence, the objective becomes fairly simple.
class(as(berlin_sf, "Spatial"))
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
class(as.owin(berlin_sf))
## [1] "owin"
Here we go, done! Now we assign the owin object to a
variable denoted as win.
win <- as.owin(berlin_sf)
win
## window: polygonal boundary
## enclosing rectangle: [777974.1, 823510.5] x [5808837, 5845580] units
ppp objectFinally, we construct ppp objects using the
ppp() function from our three different data sets
berlin_locations, berlin_regular and
berlin_random. Recall that berlin_locations is
an sf object, whereas berlin_regular and
berlin_random are data.frame objects.
class(berlin_locations)
## [1] "sf" "data.frame"
class(berlin_regular)
## [1] "data.frame"
class(berlin_random)
## [1] "data.frame"
For the sf object, we use the
st_coordinates() function to extract the \(xy\)-coordinates in form of a
matrix object. Let us review the first five point
coordinates in berlin_locations in matrix format.
st_coordinates(berlin_locations)[1:5, ]
## X Y
## [1,] 793357.0 5832558
## [2,] 797981.4 5827060
## [3,] 797927.1 5827157
## [4,] 792808.9 5832719
## [5,] 799189.1 5830723
Now, let us construct the ppp object.
ppp_locations <- ppp(
x = st_coordinates(berlin_locations)[, 1],
y = st_coordinates(berlin_locations)[, 2],
window = win
)
## Warning: 4 points were rejected as lying outside the specified window
Now, we assign the name of the unit of length for the spatial
coordinates to the ppp object and plot it.
unitname(ppp_locations) <- coordinate_units
koords = c(5809500, 5850000)
ppp_locations$window$yrange = koords
par(mar = c(0, 0, 1.5, 1.5),
cex.main = 2)
plot(ppp_locations)
## Warning in plot.ppp(ppp_locations): 4 illegal points also plotted
Note that we received a warning that 4 points were rejected as
lying outside the specified window. Do not worry, the rejected
points in a point pattern will be ignored by all other functions except
for plot. They are still accessible and stored as an
attribute of the point pattern called rejects.
It is difficult to spot the rejected points in the plot above. Thus,
let us write some code to highlight them. We use the
gIntersects() function from the rgeos
package.
points_intersect <- lengths(st_intersects(berlin_locations, berlin_sf)) > 0
plot(berlin_sf$geometry)
points(as(berlin_locations, "Spatial"))
plot(berlin_locations[!points_intersect, ]$geometry,
col = "red", cex = 3, pch = 3, add = T)
Now we can spot rejected points easily as they are marked with a red
cross. If we want to get rid of rejected points, we have to apply a
small trick: Applying the as.ppp() function on the
ppp object removes rejected points.
ppp_locations <- as.ppp(ppp_locations)
koords = c(5809500, 5850000)
ppp_locations$window$yrange = koords
par(mar = c(0, 0, 1.5, 1.5),
cex.main = 2)
plot(ppp_locations)
No more warnings, great!
We continue constructing ppp objects for the
berlin_random data set and the berlin_regular
data set. Since both data sets are data.frame objects, we
simply index their columns with the \(\$\)-notation.
Exercise: Construct a
pppobject for theberlin_randomdata set and one forthe berlin_regulardata set, assign a unit for the coordinates and remove rejected points.
# your code here...
ppp_random <- NULL
ppp_regular <- NULL
## create ppp object
ppp_random <- ppp(
x = berlin_random$x,
y = berlin_random$y,
window = win
)
## Warning: 184 points were rejected as lying outside the specified window
## assign unit
unitname(ppp_random) <- coordinate_units
## removal of rejected points
ppp_random <- as.ppp(ppp_random)
koords = c(5809500, 5850000)
ppp_random$window$yrange = koords
par(mar = c(0, 0, 1.5, 1.5),
cex.main = 2)
plot(ppp_random)
## create ppp object
ppp_regular <- ppp(
x = berlin_regular$x,
y = berlin_regular$y,
window = win
)
## Warning: 70 points were rejected as lying outside the specified window
## assign unit
unitname(ppp_regular) <- coordinate_units
## removal of rejected points
ppp_regular <- as.ppp(ppp_regular)
koords = c(5809500, 5850000)
ppp_regular$window$yrange = koords
par(mar = c(0, 0, 1.5, 1.5),
cex.main = 2)
plot(ppp_regular)
Finally, we save all three data sets for further usage.
save(
file = "ppp_data_berlin.RData",
ppp_locations, ppp_random, ppp_regular, berlin_locations
)
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.
Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.