Berlin is famous worldwide for its nightlife. The data set we will work with stores the geographical location of clubs and bars located in Berlin. The data source is gathered from OpenStreetMap (OSM). The data was downloaded from GEOFABRIK on August 27, 2022, and contains OpenStreetMap data as of August 26, 2022 (see here).

We download the data and read the gis_osm_pois_free_1.shp file, which corresponds to points of interests in Berlin, using the st_read() function from the sf package.

## build a temporary folder on disk
temp <- tempfile()
download_url <- "https://userpage.fu-berlin.de/soga/data/raw-data/spatial/"
zipfile <- "berlin-latest-free.shp.zip"

## download the file
download.file(paste0(download_url, zipfile), temp, mode = "wb")
## unzip the file(s)
unzip(temp)

## close file connection
unlink(temp)
library(sf)
berlin_features <- st_read("./gis_osm_pois_free_1.shp", quiet = TRUE)

Applying the str() function gives us more information about the structure of the data set.

str(berlin_features)
## Classes 'sf' and 'data.frame':   83123 obs. of  5 variables:
##  $ osm_id  : chr  "16541597" "26735749" "26735753" "26735759" ...
##  $ code    : int  2907 2301 2006 2301 2301 2701 2307 2031 2724 2907 ...
##  $ fclass  : chr  "camera_surveillance" "restaurant" "telephone" "restaurant" ...
##  $ name    : chr  "Aral" "Aida" NA "Madame Ngo" ...
##  $ geometry:sfc_POINT of length 83123; first list element:  'XY' num  13.3 52.5
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA
##   ..- attr(*, "names")= chr [1:4] "osm_id" "code" "fclass" "name"

The data set of class sf and data.frame contains 83123 records (denoted as observations) represented as rows. The 5 variables are represented by columns, with the first one containing the osm_id number. From the fifth variable geometry we know that the data set stores information about 83123 two-dimensional ('XY') surface points. The variable fclass contains the category related to a point, such as “restaurant” or “camera_surveillance”.

By applying the unique() function, we get an overview of all the different categories represented in the data set.

unique(berlin_features$fclass)
##   [1] "camera_surveillance" "restaurant"          "telephone"          
##   [4] "tourist_info"        "biergarten"          "recycling_glass"    
##   [7] "memorial"            "fire_station"        "artwork"            
##  [10] "bank"                "post_office"         "toilet"             
##  [13] "cinema"              "library"             "police"             
##  [16] "post_box"            "chemist"             "motel"              
##  [19] "kiosk"               "arts_centre"         "cafe"               
##  [22] "fast_food"           "car_sharing"         "viewpoint"          
##  [25] "pub"                 "hotel"               "supermarket"        
##  [28] "bakery"              "pharmacy"            "atm"                
##  [31] "convenience"         "theatre"             "hostel"             
##  [34] "car_rental"          "gift_shop"           "museum"             
##  [37] "guesthouse"          "shelter"             "stationery"         
##  [40] "bar"                 "travel_agent"        "embassy"            
##  [43] "butcher"             "beverages"           "pitch"              
##  [46] "outdoor_shop"        "playground"          "kindergarten"       
##  [49] "attraction"          "school"              "monument"           
##  [52] "hunting_stand"       "sports_centre"       "tower"              
##  [55] "bench"               "recycling_clothes"   "optician"           
##  [58] "fountain"            "nightclub"           "drinking_water"     
##  [61] "computer_shop"       "furniture_shop"      "recycling"          
##  [64] "community_centre"    "bicycle_shop"        "doctors"            
##  [67] "veterinary"          "bookshop"            "clothes"            
##  [70] "florist"             "doityourself"        "laundry"            
##  [73] "comms_tower"         "hairdresser"         "university"         
##  [76] "video_shop"          "picnic_site"         "car_dealership"     
##  [79] "beauty_shop"         "dentist"             "vending_any"        
##  [82] "vending_machine"     "mobile_phone_shop"   "track"              
##  [85] "newsagent"           "shoe_shop"           "waste_basket"       
##  [88] "toy_shop"            "town_hall"           "sports_shop"        
##  [91] "vending_cigarette"   "greengrocer"         "water_well"         
##  [94] "mall"                "jeweller"            "bicycle_rental"     
##  [97] "department_store"    "garden_centre"       "water_tower"        
## [100] "vending_parking"     "theme_park"          "nursing_home"       
## [103] "car_wash"            "graveyard"           "college"            
## [106] "recycling_paper"     "water_works"         "chalet"             
## [109] "ruins"               "clinic"              "zoo"                
## [112] "market_place"        "general"             "battlefield"        
## [115] "stadium"             "archaeological"      "hospital"           
## [118] "camp_site"           "swimming_pool"       "prison"             
## [121] "observation_tower"   "wayside_shrine"      "water_mill"         
## [124] "caravan_site"        "golf_course"         "courthouse"         
## [127] "food_court"          "dog_park"            "wayside_cross"      
## [130] "park"

There are 130 different categories.

Since we want to have a closer look at Berlin’s nightlife, we subset our data to include only the category nightclub.

berlin_locations <-
  berlin_features[
    berlin_features$fclass == "nightclub",
    c("fclass", "name")
  ]

Now, we also want to include bars in our data set.

Exercise: Subset the data set berlin_features to only include the categories nightclub and bar using the boolean operator OR.

# your code here
berlin_locations <- NULL
Show code
berlin_locations <-
  berlin_features[
    berlin_features$fclass == "nightclub" | berlin_features$fclass == "bar", # "OR" and "|" are equivalent
    c("fclass", "name")
  ]

We rename the column fclass into loc and plot the relative frequency of the categories in our data set by combining the table(), the prop.table() and the barplot() functions.

# rename column from fclass to loc
colnames(berlin_locations)[1] <- "loc"

# plot
barplot(prop.table(table(berlin_locations$loc)), 
        main = "Comparison between bars and nightsclubs in Berlin", 
        col = c("chocolate2", "khaki1"))

As expected, there are significantly more locations denoted as bar in our data set than locations denoted as nightclub. Let us compare the absolute figures.

Exercise: Determine the absolute numbers of the locations bar and nightclub of our data set.

# your code here
Show code
table(berlin_locations$loc)
## 
##       bar nightclub 
##       807       131

We visualize the data using the mapview package.

library(mapview)
mapview(berlin_locations,
  zcol = "loc",
  label = berlin_locations$name,
  legend = TRUE
)

Finally, we project the berlin_locations data set to the European Terrestrial Reference System 1989 (ETRS89/UTM zone 32N).

Exercise: Transform the coordinate reference system of the data set berlin_locations into ERTS89/UTM zone 32N (EPSG 25832). Please visit the section about coordinate reference systems and transformations on the previous page if you want to look it up again.

# your code here
berlin_locations <- NULL
Show code
berlin_locations <- st_transform(berlin_locations, 25832)
st_geometry(berlin_locations)
## Geometry set for 938 features 
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 780782 ymin: 5813041 xmax: 814411.7 ymax: 5841533
## Projected CRS: ETRS89 / UTM zone 32N
## First 5 geometries:
## POINT (793357 5832558)
## POINT (797981.4 5827060)
## POINT (797927.1 5827157)
## POINT (792808.9 5832719)
## POINT (799189.1 5830723)

Regular spaced and random data

To contrast particular statistical methods for spatial point pattern analysis, we construct two artificial data sets. One data set, berlin_random, consists of point data randomly distributed over the area of Berlin. The other data set, berlin_regular, consists of point data fairly regularly distributed across Berlin.

Before we start with the data generation we have to set the geographical extend. Hence, we load the the sf object containing Berlin’s district borders (berlin_sf).

load(url("https://userpage.fu-berlin.de/soga/data/r-data/berlin_district.RData"))

plot(berlin_sf, main = "")

We take the bounding box specifications of the berlin_sf variable to set the geographic extend for the data generation process.

xmin <- as.vector(st_bbox(berlin_sf)[1])
ymin <- as.vector(st_bbox(berlin_sf)[2])
xmax <- as.vector(st_bbox(berlin_sf)[3])
ymax <- as.vector(st_bbox(berlin_sf)[4])

Random points

First, we generate random uniform numbers using the runif() function.

set.seed(1111) # set seed for reproducibility
## generate random uniform numbers
x_random <- runif(
  n = nrow(berlin_locations) * 0.4,
  min = xmin,
  max = xmax
)
y_random <- runif(
  n = nrow(berlin_locations) * 0.4,
  min = ymin,
  max = ymax
)

Then we combine the random number vector data in a data.frame object and plot it together with the district border of Berlin.

berlin_random <- data.frame("x" = x_random, "y" = y_random)

plot(
  x = berlin_random$x,
  y = berlin_random$y,
  main = "Random Points",
  xlab = "x", ylab = "y", cex = .5,
  asp = TRUE
)
plot(berlin_sf, add = TRUE, border = "blue", col = NA)

Regular points

First, we generate a sequence of numbers using the seq() function.

x_regular <- seq(
  from = xmin, to = xmax,
  length.out = sqrt(nrow(berlin_locations)) * 0.35
)
y_regular <- seq(
  from = ymin, to = ymax,
  length = sqrt(nrow(berlin_locations)) * 0.35
)

Then we create a data frame from all combinations of variables using the expand.grid() function.

regular <- expand.grid(x_regular, y_regular)
plot(regular, asp = TRUE)

Finally, we add some noise to the points using the jitter() function, combine the data into a data.frame object, berlin_regular, and plot it together with the district border of Berlin.

x_regular <- jitter(regular[, 1], factor = 1)
y_regular <- jitter(regular[, 2], factor = 1)
berlin_regular <- data.frame("x" = x_regular, "y" = y_regular)

plot(
  x = berlin_regular$x,
  y = berlin_regular$y,
  main = "Regular Points",
  xlab = "x", ylab = "y", cex = .5,
  asp = TRUE
)
plot(berlin_sf, add = TRUE, border = "blue", col = NA)


Creating spatstat objects

In spatstat a spatial point pattern in two-dimensional space is stored as an object of class ppp. Such a point pattern object contains spatial coordinates of the points, marks attached to them (if any), a window in which the points were observed and optionally, the name of the unit of length for the spatial coordinates.

A ppp object is created using the ppp() function. The argument window must have a window object of class owin. Including information about the units of length in which the \(x\) and \(y\) coordinates are recorded enables the spatstat package to print better reports and annotate the axes in plots.

library(spatstat)

For this tutorial, we will build three ppp objects: one for the berlin_locations data set, which we denote as ppp_locations, one for the berlin_random data set (ppp_random) and one for the berlin_regular data set (ppp_regular).

All three data sets share the same length unit (meter) in which the \(x\) and \(y\) coordinates are recorded, as well as the window in which the points were observed (the bounding box of the Berlin district border berlin_sf).

Coordinate units

We start with assigning a placeholder variable which holds the string representation of the coordinate’s units.

coordinate_units <- "meters"
Window

An object of class owin (observation window) represents a region or window in a two-dimensional space. The window may be a rectangle, a polygon or polygons, with polygonal holes, or an irregular shape represented by a binary pixel image mask.

Recall that berlin_sf is of class sf.

class(berlin_sf)
## [1] "sf"         "data.frame"

To set the window to the polygon berlin_sf, we first need to coerce berlin_sf into a Spatial object, which is then coerced into an owin' object. This may sound complicated, but by using theas()` function in sequence, the objective becomes fairly simple.

class(as(berlin_sf, "Spatial"))
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
class(as.owin(berlin_sf))
## [1] "owin"

Here we go, done! Now we assign the owin object to a variable denoted as win.

win <- as.owin(berlin_sf)
win
## window: polygonal boundary
## enclosing rectangle: [777974.1, 823510.5] x [5808837, 5845580] units
The ppp object

Finally, we construct ppp objects using the ppp() function from our three different data sets berlin_locations, berlin_regular and berlin_random. Recall that berlin_locations is an sf object, whereas berlin_regular and berlin_random are data.frame objects.

class(berlin_locations)
## [1] "sf"         "data.frame"
class(berlin_regular)
## [1] "data.frame"
class(berlin_random)
## [1] "data.frame"

For the sf object, we use the st_coordinates() function to extract the \(xy\)-coordinates in form of a matrix object. Let us review the first five point coordinates in berlin_locations in matrix format.

st_coordinates(berlin_locations)[1:5, ]
##             X       Y
## [1,] 793357.0 5832558
## [2,] 797981.4 5827060
## [3,] 797927.1 5827157
## [4,] 792808.9 5832719
## [5,] 799189.1 5830723

Now, let us construct the ppp object.

ppp_locations <- ppp(
  x = st_coordinates(berlin_locations)[, 1],
  y = st_coordinates(berlin_locations)[, 2],
  window = win
)
## Warning: 4 points were rejected as lying outside the specified window

Now, we assign the name of the unit of length for the spatial coordinates to the ppp object and plot it.

unitname(ppp_locations) <- coordinate_units

koords = c(5809500, 5850000)
ppp_locations$window$yrange = koords

par(mar = c(0, 0, 1.5, 1.5),
    cex.main = 2)

plot(ppp_locations)
## Warning in plot.ppp(ppp_locations): 4 illegal points also plotted

Note that we received a warning that 4 points were rejected as lying outside the specified window. Do not worry, the rejected points in a point pattern will be ignored by all other functions except for plot. They are still accessible and stored as an attribute of the point pattern called rejects.

It is difficult to spot the rejected points in the plot above. Thus, let us write some code to highlight them. We use the gIntersects() function from the rgeos package.

points_intersect <- lengths(st_intersects(berlin_locations, berlin_sf)) > 0 

plot(berlin_sf$geometry)
points(as(berlin_locations, "Spatial"))
plot(berlin_locations[!points_intersect, ]$geometry, 
       col = "red", cex = 3, pch = 3, add = T)

Now we can spot rejected points easily as they are marked with a red cross. If we want to get rid of rejected points, we have to apply a small trick: Applying the as.ppp() function on the ppp object removes rejected points.

ppp_locations <- as.ppp(ppp_locations)

koords = c(5809500, 5850000)
ppp_locations$window$yrange = koords

par(mar = c(0, 0, 1.5, 1.5),
    cex.main = 2)

plot(ppp_locations)

No more warnings, great!

We continue constructing ppp objects for the berlin_random data set and the berlin_regular data set. Since both data sets are data.frame objects, we simply index their columns with the \(\$\)-notation.

Exercise: Construct a ppp object for the berlin_random data set and one for the berlin_regular data set, assign a unit for the coordinates and remove rejected points.

# your code here...
ppp_random <- NULL
ppp_regular <- NULL
Show code
## create ppp object
ppp_random <- ppp(
  x = berlin_random$x,
  y = berlin_random$y,
  window = win
)
## Warning: 184 points were rejected as lying outside the specified window
Show code
## assign unit
unitname(ppp_random) <- coordinate_units
Show code
## removal of rejected points
ppp_random <- as.ppp(ppp_random)

koords = c(5809500, 5850000)
ppp_random$window$yrange = koords

par(mar = c(0, 0, 1.5, 1.5),
    cex.main = 2)

plot(ppp_random)
Show code
## create ppp object
ppp_regular <- ppp(
  x = berlin_regular$x,
  y = berlin_regular$y,
  window = win
)
## Warning: 70 points were rejected as lying outside the specified window
Show code
## assign unit
unitname(ppp_regular) <- coordinate_units
Show code
## removal of rejected points
ppp_regular <- as.ppp(ppp_regular)

koords = c(5809500, 5850000)
ppp_regular$window$yrange = koords

par(mar = c(0, 0, 1.5, 1.5),
    cex.main = 2)

plot(ppp_regular)

Finally, we save all three data sets for further usage.

save(
  file = "ppp_data_berlin.RData",
  ppp_locations, ppp_random, ppp_regular, berlin_locations
)

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follow: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.