In R, graphs are typically created interactively. Issuing a plotting
command such as plot(), hist() or boxplot() creates a new graph and will
typically overwrite the previous one. In addition, one can specify
fonts, colors, line styles, axes, reference lines, etc. by setting
graphical parameters. We will walk you through the most important
concepts and commands in the following sections.
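As a first impression of what graphical parameters do, here is a minimal sketch. The data are simulated (an assumption for illustration, not part of the students data set), and the particular colors and styles are arbitrary choices.

```r
# Simulated data, used only for illustration
set.seed(1)
x <- 1:20
y <- cumsum(rnorm(20))

plot(x, y,
  type = "b",            # points joined by lines
  col = "steelblue",     # color of points and lines
  lty = 2,               # dashed line style
  lwd = 2,               # line width
  pch = 17,              # filled triangles as point symbol
  main = "Simulated random walk",
  font.main = 3,         # italic title font
  xlab = "Step",
  ylab = "Value"
)

# Add a horizontal reference line at zero
abline(h = 0, col = "grey40", lty = 3)
```

Each of these arguments is optional; leaving them out gives the default look.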
In this section we explore a data set called students. You may download
the students.csv file or load the data set directly in R using
read.csv():
students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
str(students)
## 'data.frame': 8239 obs. of 16 variables:
## $ stud.id : int 833917 898539 379678 807564 383291 256074 754591 146494 723584 314281 ...
## $ name : chr "Gonzales, Christina" "Lozano, T'Hani" "Williams, Hanh" "Nem, Denzel" ...
## $ gender : chr "Female" "Female" "Female" "Male" ...
## $ age : int 19 19 22 19 21 19 21 21 18 18 ...
## $ height : int 160 172 168 183 175 189 156 167 195 165 ...
## $ weight : num 64.8 73 70.6 79.7 71.4 85.8 65.9 65.7 94.4 66 ...
## $ religion : chr "Muslim" "Other" "Protestant" "Other" ...
## $ nc.score : num 1.91 1.56 1.24 1.37 1.46 1.34 1.11 2.03 1.29 1.19 ...
## $ semester : chr "1st" "2nd" "3rd" "2nd" ...
## $ major : chr "Political Science" "Social Sciences" "Social Sciences" "Environmental Sciences" ...
## $ minor : chr "Social Sciences" "Mathematics and Statistics" "Mathematics and Statistics" "Mathematics and Statistics" ...
## $ score1 : int NA NA 45 NA NA NA NA 58 57 NA ...
## $ score2 : int NA NA 46 NA NA NA NA 62 67 NA ...
## $ online.tutorial: int 0 0 0 0 0 0 0 0 0 0 ...
## $ graduated : int 0 0 0 0 0 0 0 0 0 0 ...
## $ salary : num NA NA NA NA NA NA NA NA NA NA ...
The students data set consists of 8239 rows, each representing a particular student, and 16 columns, each corresponding to a variable/feature of that student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.
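Histograms are created with the hist() function. By setting the
argument freq = FALSE the plot shows probability densities instead of
frequencies. The breaks argument controls the binning: a single number
is treated as a suggested number of bins, so the number of bins
actually drawn may differ.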
hist(students$age)
hist(students$age, freq = FALSE)
hist(students$age, freq = FALSE, breaks = 50)
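To see what these two arguments do, it can help to inspect the object that hist() returns. The sketch below uses simulated ages (an assumption, so that it runs without downloading the data) and shows that the densities are scaled so the bar areas sum to one, and that explicit bin edges are used exactly as given.

```r
# Simulated ages, used only for illustration
set.seed(42)
ages <- round(rnorm(1000, mean = 22, sd = 3))

# Compute the histogram without drawing it
h <- hist(ages, breaks = 50, plot = FALSE)

length(h$counts)                  # number of bins actually used
sum(h$density * diff(h$breaks))   # bar areas (density * width) sum to 1

# A vector of break points is used exactly as given
edges <- seq(min(ages) - 0.5, max(ages) + 0.5, by = 1)
h_exact <- hist(ages, freq = FALSE, breaks = edges)
length(h_exact$counts)            # one bin per integer age
```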
The shape of histograms is strongly affected by the number of bins
used, hence it is sometimes useful to plot a kernel density plot instead. In R we can create a
kernel density plot by using the density()
function and by
plotting the resulting object.
plot(density(students$age))
Note that a density plot is a smoothed histogram. There are many
optional parameters in the density()
function (type
help(density)
into your console for more details). We can
easily change the bandwidth parameter by tweaking the bw
argument.
plot(density(students$age, bw = 0.5))
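The effect of the bandwidth is easiest to judge when several estimates are drawn on top of each other. The following sketch again uses simulated ages (an assumption); the bandwidth values 0.25 and 2 are arbitrary choices for a narrow and a wide kernel.

```r
# Simulated ages, used only for illustration
set.seed(42)
ages <- round(rnorm(1000, mean = 22, sd = 3))

d_default <- density(ages)              # bandwidth chosen by bw.nrd0()
d_narrow  <- density(ages, bw = 0.25)   # follows the data closely, wiggly
d_wide    <- density(ages, bw = 2)      # very smooth, hides detail

d_default$bw                            # the bandwidth that was picked by default

plot(d_narrow, col = "grey50", main = "Effect of the bandwidth")
lines(d_default, col = "black", lwd = 2)
lines(d_wide, col = "red", lty = 2)
legend("topright",
  legend = c("bw = 0.25", "default", "bw = 2"),
  col = c("grey50", "black", "red"),
  lty = c(1, 1, 2),
  lwd = c(1, 2, 1)
)
```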
Barplots are used to plot categorical data. In order to construct a barplot we
apply the barplot()
function. Note that we first apply the
table()
function to count the entries for each particular
category in the data column of interest.
counts <- table(students$religion)
counts
##
## Catholic Muslim Orthodox Other Protestant
## 2797 330 585 2688 1839
barplot(counts)
With minor adjustments we may produce a stacked bar plot with columns
corresponding to the students' semester, colors corresponding to the
religious belief, and a legend. For coloring we use the RColorBrewer
package, so make sure that this package is installed on your machine.
If it is not, type install.packages("RColorBrewer") into your console
before you continue.
counts <- table(students$religion, students$semester)
barplot(counts,
col = RColorBrewer::brewer.pal(length(rownames(counts)), "Set1"),
legend = rownames(counts)
)
By adding the argument beside = TRUE
to the function
call we get a grouped bar plot.
barplot(counts,
col = RColorBrewer::brewer.pal(length(rownames(counts)), "Set1"),
legend = rownames(counts),
beside = TRUE
)
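How barplot() reads a two-way table may be easier to follow on a tiny example. The data frame below is made up for illustration (the names toy, group and term are hypothetical). The rows of the table become the stacked segments and the columns become the bars.

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(
  group = c("A", "A", "B", "B", "B", "C"),
  term  = c("1st", "2nd", "1st", "1st", "2nd", "2nd")
)

counts <- table(toy$group, toy$term)
counts
dim(counts)   # 3 rows (segments) and 2 columns (bars)

# Convert the counts to shares within each column, so every bar has height 1
shares <- prop.table(counts, margin = 2)

# barplot() invisibly returns the x positions of the bar centers
mids <- barplot(shares,
  col = c("grey30", "grey60", "grey90"),
  legend.text = rownames(shares)
)
mids
```

Plotting shares instead of counts is one possible choice when the groups differ strongly in size.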
Box plots are useful for displaying the distribution of data. The
boxplot() function is used to create box plots in R.
boxplot(students$salary)
By using the ~
syntax we split the box plots by groups.
In order to generate a box plot of the variable salary
conditioned on the religious groups, we use the following command:
boxplot(salary ~ religion, data = students)
Again we may tweak the plot by adding some additional arguments (type
help(boxplot)
into your console for further details). For
coloring we use the RColorBrewer
package, hence make sure
that this package is already installed on your machine. If not, type
install.packages("RColorBrewer")
into your console before
you continue.
boxplot(salary ~ religion,
data = students,
notch = TRUE,
col = RColorBrewer::brewer.pal(length(unique(students$religion)), "Set1")
)
As you can see, the additional argument notch = TRUE draws a notch on
each side of the box around the median line. The vertical extent of the
notch indicates roughly a 95 % confidence interval for the median.
Thus, if the notches of two box plots do not overlap, this is strong
evidence that the two medians differ, roughly corresponding to an error
probability of less than 5 %.
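The notch limits can also be computed directly. As far as the documentation of boxplot.stats() describes it, they are the median plus and minus 1.58 times the spread between the hinges, divided by the square root of the sample size. The sketch below checks this on simulated values (an assumption, not the salary column itself).

```r
# Simulated, right-skewed values, used only for illustration
set.seed(1)
x <- rexp(200, rate = 1 / 40000)

s <- boxplot.stats(x)
s$conf                                   # lower and upper end of the notch

# Recompute the notch by hand from the five-number summary
hinge_spread <- s$stats[4] - s$stats[2]  # upper hinge minus lower hinge
manual <- s$stats[3] + c(-1.58, 1.58) * hinge_spread / sqrt(s$n)

all.equal(s$conf, manual)
```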
Exercise: Are there any differences in nc.score between the students' major subjects? Analyze this graphically using appropriate box plots.
### your code here
boxplot(nc.score ~ major,
data = students,
notch = TRUE,
col = RColorBrewer::brewer.pal(length(unique(students$major)), "Set1")
)
In R, line charts and scatter plots are built in the same way, using the
plot() command. They differ only in the data provided and in whether
lines or points are drawn. This behavior is specified by the type
argument. Note that by default plot() draws points.
The type argument can take the following values:
p: points
l: lines
o: overplotted points and lines
b, c: points (empty if "c") joined by lines
s, S: stair steps
h: histogram-like vertical lines
n: does not produce any points or lines
For the sake of simplicity let us plot a simple cosine curve:
x <- seq(from = -2 * pi, to = 2 * pi, length.out = 100)
y <- cos(x)
plot(x, y)
By specifying the type
argument we change the line type
of the plot.
plot(x, y, type = "l")
plot(x, y, type = "h")
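To compare all values of the type argument at a glance, one option is to draw them side by side. This sketch uses par(mfrow = ...) to split the plotting window into a 3 x 3 grid; the coarser grid of 30 points is our choice here so that the individual symbols stay visible.

```r
x <- seq(from = -2 * pi, to = 2 * pi, length.out = 30)
y <- cos(x)

types <- c("p", "l", "o", "b", "c", "s", "S", "h", "n")

# Split the device into 3 x 3 panels and remember the old settings
old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))

for (tp in types) {
  plot(x, y, type = tp, main = paste0("type = \"", tp, "\""))
}

# Restore the previous layout
par(old_par)
```

Note that the panel for type = "n" only shows the axes, which is useful when a plot is built up layer by layer.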
Note that the command plot()
creates a new plot,
overwriting the existing one. In order to add a line graph feature to an
existing plot we call the lines()
command.
plot(x, y)
lines(x, sin(x))
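One thing to keep in mind is that the axis limits are fixed by the first plot() call; lines() does not rescale the plot. The sketch below, in which the third curve and the styling are our own illustrative choices, sets ylim up front so that a taller curve is not cut off, and adds a legend.

```r
x <- seq(from = -2 * pi, to = 2 * pi, length.out = 100)
y3 <- 1.5 * sin(x)   # exceeds the range of cos(x)

# Choose the y limits so that all curves fit
plot(x, cos(x), type = "l", ylim = range(cos(x), y3), ylab = "value")
lines(x, sin(x), col = "blue", lty = 2)
lines(x, y3, col = "red", lty = 3)

legend("topright",
  legend = c("cos(x)", "sin(x)", "1.5 * sin(x)"),
  col = c("black", "blue", "red"),
  lty = c(1, 2, 3)
)
```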
Above, we display just one variable (cos(x)) in our graph. In order to
display two (continuous) variables we generally use a scatter plot. In
R we construct a scatter plot using the plot() command.
Let us return to the students data set and construct a scatter plot. For a less crowded visualization we restrict the data set to the first 100 entries.
students100 <- students[1:100, ]
plot(students100$height, students100$weight)
We can easily select a different point character by tweaking the pch
argument of the plot() function. For example, by setting pch = 19 we
set the points to be filled circles.
plot(students100$height, students100$weight, pch = 19)
Other pch
values correspond to other point types (see
figure below).
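A quick way to look at the available symbols is to draw them yourself. The following sketch plots the symbols for pch = 0 to 25 on a small grid; the layout with seven symbols per row and the orange fill are arbitrary choices.

```r
pch_values <- 0:25

# Arrange the symbols on a grid with 7 symbols per row
cols <- pch_values %% 7
rows <- -(pch_values %/% 7)

plot(cols, rows,
  pch = pch_values,
  cex = 2,
  bg = "orange",       # fill color, only used by symbols 21 to 25
  axes = FALSE,
  xlab = "", ylab = "",
  xlim = c(-0.5, 6.5),
  ylim = c(min(rows) - 0.5, 0.5),
  main = "pch = 0 to 25"
)

# Label each symbol with its pch number
text(cols, rows, labels = pch_values, pos = 1, offset = 1)
```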
A nice feature of R is how simple it is to construct high quality
graphs layer by layer. One such approach is adding points to an
existing plot. This can be achieved using the points() function.
Let us add one red dot, corresponding to the mean of both variables
of interest, to the plot from above. Note that we use the pch, col and
cex arguments to define the visualization properties of the point.
mean.weight <- mean(students100$weight)
mean.height <- mean(students100$height)
plot(students100$height, students100$weight, pch = 19)
points(mean.height, mean.weight, pch = 15, col = "red", cex = 1.5)
Another approach includes adding regression lines. There are several
ways to compute regression lines in R. One of them makes use of the
lm()
function (a function to fit linear models; type
help(lm)
into your console for further information),
another is to make use of the lowess()
function (a function
that uses locally-weighted polynomial regression; type
help(lowess)
into your console for further information). In
R it is very easy to add such modeled regression lines. Note that we add
the col
argument to give each line a different color.
plot(students100$height, students100$weight, pch = 19)
# regression line (y~x)
abline(lm(weight ~ height, data = students100), col = "red")
# lowess line (x,y)
lines(lowess(students100$height, students100$weight), col = "blue")
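To see what abline() does with a fitted model, we can extract the intercept and slope ourselves. The sketch below uses simulated heights and weights (an assumption; the data frame sim and its true slope of 0.9 are made up for illustration) and draws the same regression line twice, once from the model object and once from its coefficients.

```r
# Simulated data, used only for illustration
set.seed(7)
height <- round(rnorm(100, mean = 172, sd = 10))
weight <- 0.9 * height - 85 + rnorm(100, sd = 4)
sim <- data.frame(height, weight)

fit <- lm(weight ~ height, data = sim)
coef(fit)   # intercept and slope of the fitted line

plot(sim$height, sim$weight, pch = 19)

# abline() accepts the model object ...
abline(fit, col = "red", lwd = 2)
# ... or the intercept (a) and slope (b) directly; both draw the same line
abline(a = coef(fit)[["(Intercept)"]], b = coef(fit)[["height"]], lty = 2)

# lowess() returns a list of smoothed x and y values that lines() can draw
smooth <- lowess(sim$height, sim$weight)
lines(smooth, col = "blue")
```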
Scatter plots are extremely useful for exploratory data analysis, and
hence many contributed R packages provide enhanced or specialized
scatter plotting capabilities. In base R there is the pairs() function,
which produces a matrix of scatter plots. Note the special formula
notation using the ~ operator.
pairs(~ weight + height + age + salary, data = students100)
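The formula notation is one of two ways to call pairs(); passing a data frame with the selected columns gives the same panels. The sketch below uses the built-in mtcars data set as a stand-in (an assumption, so that it runs without downloading anything) and adds a smoothed line to each panel via panel.smooth().

```r
# Formula interface: pick the variables by name
pairs(~ mpg + wt + hp, data = mtcars)

# Equivalent: pass a data frame that contains only the columns of interest
vars <- mtcars[, c("mpg", "wt", "hp")]
pairs(vars,
  panel = panel.smooth,   # scatter plot plus a lowess smoother in each panel
  pch = 19,
  col = "grey30"
)

# The correlation matrix summarizes the same pairwise relationships
round(cor(vars), 2)
```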
One further contributed package is the car package. The provided
scatterplotMatrix() function allows us to condition the scatter plot
matrix on a factor and optionally include lowess and linear best fit
lines, box plots, densities or histograms in the principal diagonal, as
well as rug plots in the margins of the cells.
In the plot below we plot the variables weight, height, age and salary
conditioned on the factor variable gender:
library(car)
scatterplotMatrix(~ weight + height + age + salary | gender,
data = students100,
regLine = FALSE,
smooth = FALSE
)
Feel free to explore the different kinds of layout options by typing
??car::scatterplotMatrix into your console.
Citation
The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by mail at soga[at]zedat.fu-berlin.de.
Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.