Berlin is famous worldwide for its nightlife. The data set we will work with stores the geographical locations of clubs and bars in Berlin. The data originates from OpenStreetMap (OSM). It was downloaded from GEOFABRIK on August 15, 2022, and contains OpenStreetMap data as of August 14, 2022 (see here).

We download the data and read the osm_pois_p.shp file, which corresponds to points of interest in Berlin, using the read_file() function from the GeoPandas package.

In [2]:
# First, let's import the needed libraries.
import matplotlib.pyplot as plt
import numpy as np
import random
import pandas as pd
import geopandas as gpd
import folium

The GeoPandas package

GeoPandas provides two main data structures, namely GeoSeries and GeoDataFrame, which correspond to pandas.Series and pandas.DataFrame, respectively. A GeoSeries is a vector where each entry is a set of shapes corresponding to one observation. A GeoDataFrame is a tabular data structure that contains a GeoSeries as a column holding the spatial information, alongside additional information in the other columns. This spatial column is often called geometry and can be accessed through the geometry attribute (gdf.geometry).

The package provides three basic classes of geometric objects:

  • Points / Multi-Points
  • Lines / Multi-Lines
  • Polygons / Multi-Polygons

See the documentation of the GeoPandas package here.
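To make the relationship between the two data structures concrete, here is a minimal sketch that builds a GeoDataFrame from scratch. The coordinates and names are hypothetical example values, not part of the Berlin data set.

```python
import pandas as pd
import geopandas as gpd

# hypothetical example data: two named points with lon/lat coordinates
df = pd.DataFrame({"name": ["A", "B"], "lon": [13.40, 13.38], "lat": [52.52, 52.50]})

# points_from_xy turns the coordinate columns into shapely Points;
# passing them as `geometry` creates the active geometry column
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df["lon"], df["lat"]))

print(gdf.geometry.name)  # prints "geometry" -- the active geometry column
```

The geometry column behaves like any other column, but it additionally supports spatial operations (distances, intersections, re-projection, ...).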

Reading Files to GeoPandas

GeoPandas can read almost any vector-based spatial data format (e.g. ESRI Shapefile, GeoJSON, ...). For this purpose we use geopandas.read_file(). Here, we read a shapefile stored as a .zip file into a GeoDataFrame as follows.

In [3]:
berlin_features = gpd.read_file("C:/Users/mceck/soga/Soga-Py/300/data/osm_pois_p.zip")

berlin_features.head()
Out[3]:
   osm_id    code  fclass               name        geometry
0  16541597  2907  camera_surveillance  Aral        POINT (13.34544 52.54644)
1  26735749  2301  restaurant           Aida        POINT (13.32282 52.50691)
2  26735753  2006  telephone            None        POINT (13.32214 52.50645)
3  26735759  2301  restaurant           Madame Ngo  POINT (13.31808 52.50621)
4  26735763  2301  restaurant           Thanh Long  POINT (13.32078 52.50732)
In [4]:
berlin_features.dtypes
Out[4]:
osm_id        object
code           int64
fclass        object
name          object
geometry    geometry
dtype: object

There are 82491 records (denoted as features), represented as rows, and 5 attributes (denoted as fields), represented as columns.

Looking at the column names, we see that the category of each point is stored in the fclass column. Note that berlin_features provides a geometry column holding the geometric points. Even if the active geometry column were not named geometry, GeoPandas could still access it through the .geometry attribute.

In [5]:
berlin_features.geometry
Out[5]:
0        POINT (13.34544 52.54644)
1        POINT (13.32282 52.50691)
2        POINT (13.32214 52.50645)
3        POINT (13.31808 52.50621)
4        POINT (13.32078 52.50732)
                   ...            
82486    POINT (13.29562 52.43827)
82487    POINT (13.29558 52.43829)
82488    POINT (13.31360 52.47638)
82489    POINT (13.63331 52.52282)
82490    POINT (13.49256 52.54817)
Name: geometry, Length: 82491, dtype: geometry
In [6]:
berlin_features.columns
Out[6]:
Index(['osm_id', 'code', 'fclass', 'name', 'geometry'], dtype='object')

By applying the set() function we get an overview of the different categories represented in the data set.

In [7]:
set(berlin_features["fclass"])  ## get unique values
Out[7]:
{'archaeological',
 'arts_centre',
 'artwork',
 'atm',
 'attraction',
 'bakery',
 'bank',
 'bar',
 'battlefield',
 'beauty_shop',
 'bench',
 'beverages',
 'bicycle_rental',
 'bicycle_shop',
 'biergarten',
 'bookshop',
 'butcher',
 'cafe',
 'camera_surveillance',
 'camp_site',
 'car_dealership',
 'car_rental',
 'car_sharing',
 'car_wash',
 'caravan_site',
 'chalet',
 'chemist',
 'cinema',
 'clinic',
 'clothes',
 'college',
 'comms_tower',
 'community_centre',
 'computer_shop',
 'convenience',
 'courthouse',
 'dentist',
 'department_store',
 'doctors',
 'dog_park',
 'doityourself',
 'drinking_water',
 'embassy',
 'fast_food',
 'fire_station',
 'florist',
 'food_court',
 'fountain',
 'furniture_shop',
 'garden_centre',
 'general',
 'gift_shop',
 'golf_course',
 'graveyard',
 'greengrocer',
 'guesthouse',
 'hairdresser',
 'hospital',
 'hostel',
 'hotel',
 'hunting_stand',
 'jeweller',
 'kindergarten',
 'kiosk',
 'laundry',
 'library',
 'mall',
 'market_place',
 'memorial',
 'mobile_phone_shop',
 'monument',
 'motel',
 'museum',
 'newsagent',
 'nightclub',
 'nursing_home',
 'observation_tower',
 'optician',
 'outdoor_shop',
 'park',
 'pharmacy',
 'picnic_site',
 'pitch',
 'playground',
 'police',
 'post_box',
 'post_office',
 'prison',
 'pub',
 'recycling',
 'recycling_clothes',
 'recycling_glass',
 'recycling_paper',
 'restaurant',
 'ruins',
 'school',
 'shelter',
 'shoe_shop',
 'sports_centre',
 'sports_shop',
 'stadium',
 'stationery',
 'supermarket',
 'swimming_pool',
 'telephone',
 'theatre',
 'theme_park',
 'toilet',
 'tourist_info',
 'tower',
 'town_hall',
 'toy_shop',
 'track',
 'travel_agent',
 'university',
 'vending_any',
 'vending_cigarette',
 'vending_machine',
 'vending_parking',
 'veterinary',
 'video_shop',
 'viewpoint',
 'waste_basket',
 'water_mill',
 'water_tower',
 'water_well',
 'water_works',
 'wayside_cross',
 'wayside_shrine',
 'zoo'}

Now we subset our data set to include only the categories nightclub and bar, using the | (or) operator.

In [8]:
berlin_locations = berlin_features[
    (berlin_features["fclass"] == "nightclub") | (berlin_features["fclass"] == "bar")
]

We plot the relative frequencies of the categories in our data set by combining value_counts(normalize = True) with the pandas plot(kind="bar") method.

In [9]:
plt.figure(figsize=(8, 6))
berlin_locations["fclass"].value_counts(normalize=True).plot(
    kind="bar"
)  ## plot relative frequency
plt.xticks(rotation=0)
Out[9]:
(array([0, 1]), [Text(0, 0, 'bar'), Text(1, 0, 'nightclub')])

As expected, there are considerably more locations denoted as bar in our data set than locations denoted as nightclub. To get the absolute counts we use the value_counts() function again.

In [10]:
berlin_locations["fclass"].value_counts()
Out[10]:
bar          806
nightclub    131
Name: fclass, dtype: int64
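As a side note, value_counts(normalize=True) simply divides each count by the total number of observations. A hypothetical miniature example:

```python
import pandas as pd

# hypothetical example data, not the Berlin data set
s = pd.Series(["bar", "bar", "bar", "nightclub"])

print(s.value_counts())                # bar: 3, nightclub: 1
print(s.value_counts(normalize=True))  # bar: 0.75, nightclub: 0.25
```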

To visualize the data we plot it using the folium package.

In [11]:
berlinmap = folium.Map([52.5464450, 13.34544], zoom_start=10, tiles="cartodbpositron")

group0 = folium.FeatureGroup(name='<span style="color: red;">nightclub</span>')
group1 = folium.FeatureGroup(name='<span style="color: darkblue;">bar</span>')

for _, row in berlin_locations.iterrows():
    if row["fclass"] == "nightclub":
        folium.CircleMarker(
            [row.geometry.y, row.geometry.x],
            popup=row["name"],
            radius=3,
            color="red",
        ).add_to(group0)
    elif row["fclass"] == "bar":
        folium.CircleMarker(
            [row.geometry.y, row.geometry.x],
            popup=row["name"],
            radius=3,
            color="darkblue",
        ).add_to(group1)

group0.add_to(berlinmap)
group1.add_to(berlinmap)
folium.map.LayerControl("topright", collapsed=False).add_to(berlinmap)

berlinmap
Out[11]:
[interactive folium map of bar (dark blue) and nightclub (red) locations in Berlin]

Re-Projecting

Finally, we re-project the berlin_locations data set to the European Terrestrial Reference System 1989 (ETRS89/UTM zone 32N) for further usage. Re-projecting is the process of changing the representation of locations from one coordinate system to another. To do so, we use the GeoDataFrame.to_crs() function and provide the EPSG identifier for ETRS89/UTM zone 32N, which is $25832$ (see here).

In [12]:
berlin_locations = berlin_locations.to_crs(
    "EPSG:25832"
)  # or: berlin_locations.to_crs(epsg=25832)

# get information about crs
berlin_locations.crs
Out[12]:
<Derived Projected CRS: EPSG:25832>
Name: ETRS89 / UTM zone 32N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Europe between 6°E and 12°E: Austria; Belgium; Denmark - onshore and offshore; Germany - onshore and offshore; Norway including - onshore and offshore; Spain - offshore.
- bounds: (6.0, 38.76, 12.0, 84.33)
Coordinate Operation:
- name: UTM zone 32N
- method: Transverse Mercator
Datum: European Terrestrial Reference System 1989 ensemble
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
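Under the hood, GeoPandas delegates the coordinate transformation to the pyproj package (a GeoPandas dependency). As a sketch of what happens to a single coordinate, the following re-projects a hypothetical lon/lat point in central Berlin; note that after the transformation the coordinates are metres, not degrees:

```python
from pyproj import Transformer

# EPSG:4326 (lon/lat in degrees) -> EPSG:25832 (ETRS89 / UTM zone 32N, metres)
transformer = Transformer.from_crs("EPSG:4326", "EPSG:25832", always_xy=True)

# a hypothetical point in central Berlin (lon, lat)
easting, northing = transformer.transform(13.345, 52.546)

# easting/northing are now expressed in metres, suitable for distance calculations
print(round(easting), round(northing))
```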

Regularly spaced and random data

To contrast particular statistical methods for spatial point pattern analysis, we construct two artificial data sets. One data set, berlin_random, consists of points randomly distributed over the area of Berlin. The other data set, berlin_regular, consists of points distributed fairly regularly across the area of Berlin.

Before we start with the data generation we have to set the geographical extent. Hence, we load the district borders (berlin_district) as a GeoPandas object and plot them. You can download the shapefile here.

In [13]:
berlin_district = gpd.read_file(
    "http://userpage.fu-berlin.de/soga/soga-py/300/30800_spatial_point_patterns/berlin_district.zip"
)
In [14]:
berlin_district.plot(figsize=(12, 5), color="red")

plt.show()

We take the bounding box of the berlin_district variable to set the geographic extent for the data generation process.

In [15]:
# find bounds of geodataframe
x_min, y_min, x_max, y_max = berlin_district.total_bounds
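Conceptually, total_bounds returns the envelope (minx, miny, maxx, maxy) of all geometries, which is nothing more than the column-wise minima and maxima of the coordinates. A sketch with hypothetical points:

```python
import numpy as np

# hypothetical (x, y) coordinate pairs
coords = np.array([[1.0, 4.0], [3.0, 2.0], [2.0, 5.0]])

x_min, y_min = coords.min(axis=0)  # column-wise minima
x_max, y_max = coords.max(axis=0)  # column-wise maxima

print(x_min, y_min, x_max, y_max)  # 1.0 2.0 3.0 5.0
```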

Random points

First we generate uniformly distributed random numbers using the np.random.uniform() function from NumPy.

In [16]:
# set seed for reproducibility (np.random.uniform draws from NumPy's generator)
np.random.seed(1111)

# set sample size
n = round(berlin_locations.shape[0] * 0.4)

## generate random uniform numbers
x_random = np.random.uniform(x_min, x_max, n)
y_random = np.random.uniform(y_min, y_max, n)

Then we combine the random number vectors into a GeoSeries object and plot it together with the district border of Berlin.

In [17]:
# convert to GeoSeries
random_points = gpd.GeoSeries(gpd.points_from_xy(x_random, y_random))
# only keep those points within berlin_district
berlin_random = random_points[random_points.within(berlin_district.unary_union)]
In [18]:
fig, ax = plt.subplots(figsize=(10, 5))
berlin_district.plot(ax=ax, edgecolor="darkblue", linewidth=1, color="white")
berlin_random.plot(
    ax=ax,
    edgecolor="black",
    color="lightgrey",
)
plt.title("Random Points")

plt.show()
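The approach above is a form of rejection sampling: draw points uniformly within the bounding box, then keep only those that fall inside the study area. The same idea can be sketched with plain NumPy, using a circle instead of the Berlin polygon as the (hypothetical) study area:

```python
import numpy as np

rng = np.random.default_rng(1111)

# draw uniform points in the unit square (the "bounding box")
x = rng.uniform(0.0, 1.0, 1000)
y = rng.uniform(0.0, 1.0, 1000)

# keep only points inside a circle of radius 0.5 around (0.5, 0.5)
inside = (x - 0.5) ** 2 + (y - 0.5) ** 2 < 0.5 ** 2
x_kept, y_kept = x[inside], y[inside]

# roughly pi/4 (about 78.5%) of the points survive: the area ratio
# of the circle to the enclosing square
print(inside.mean())
```

In the Berlin example, within() plays the role of the circle test, and unary_union merges the district polygons into a single study area.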

Regular points

First we generate evenly spaced coordinate sequences using the np.linspace() function.

In [19]:
n = round(np.sqrt(berlin_locations.shape[0]) * 0.35)
x_regular = np.linspace(x_min, x_max, n)
y_regular = np.linspace(y_min, y_max, n)

Then we create a GeoSeries from all coordinate combinations using np.meshgrid() and the points_from_xy() function.

In [20]:
xx, yy = np.meshgrid(x_regular, y_regular)
xx = xx.flatten()
yy = yy.flatten()

grid_points = gpd.GeoSeries(gpd.points_from_xy(xx, yy))

## cut to size
berlin_regular = grid_points[grid_points.within(berlin_district.unary_union)]
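To see what np.meshgrid() produces, here is a hypothetical miniature example: two short coordinate vectors are expanded into every (x, y) combination, and flattening the grids yields the coordinates of all grid points.

```python
import numpy as np

x = np.array([0, 1])
y = np.array([10, 20, 30])

xx, yy = np.meshgrid(x, y)  # grids of shape (3, 2)

print(xx.flatten())  # [0 1 0 1 0 1]
print(yy.flatten())  # [10 10 20 20 30 30]
```

Paired element-wise, the flattened arrays enumerate all 2 × 3 = 6 grid points.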
In [21]:
fig, ax = plt.subplots(figsize=(10, 5))
berlin_district.plot(ax=ax, edgecolor="darkblue", linewidth=1, color="white")
berlin_regular.plot(
    ax=ax,
    edgecolor="black",
    color="lightgrey",
)
plt.title("Regular Points")

plt.show()

Writing Spatial Data

GeoDataFrames can be exported to many standard formats using the geopandas.GeoDataFrame.to_file() function. Two commonly used formats are ESRI Shapefile and GeoJSON:

In [22]:
## Writing to Shapefile:

berlin_random.to_file("../data/berlin_random.shp")

## Writing to GeoJSON:

berlin_random.to_file("../data/berlin_random.geojson", driver="GeoJSON")

For further usage we also save the berlin_locations and berlin_regular data sets as .geojson files for further processing.

In [23]:
## Writing to GeoJSON:

berlin_locations.to_file("../data/berlin_locations.geojson", driver="GeoJSON")

berlin_regular.to_file("../data/berlin_regular.geojson", driver="GeoJSON")

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.