In R, graphs are typically created interactively. Issuing a new high-level plotting command, such as plot(), hist() or boxplot(), will typically overwrite the previous graph. In addition, one can specify fonts, colors, line styles, axes, reference lines, etc. by setting graphical parameters. We will walk you through the most important concepts and commands in the following sections.
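
As a minimal sketch of how graphical parameters work, the block below passes some of them directly to a plotting call and sets one globally with par(). It uses the built-in cars data set as a stand-in (it is not part of this section's data), and the colors and labels are arbitrary choices.

```r
# Built-in example data (stand-in for illustration only)
data(cars)

# Switch to a 1 x 2 layout so the second plot does not overwrite the first;
# par() returns the previous settings, which we keep for later
old_par <- par(mfrow = c(1, 2))

# Parameters passed to a plotting call affect only that plot
plot(cars$speed, cars$dist,
  main = "Default symbols",
  xlab = "Speed", ylab = "Stopping distance"
)
plot(cars$speed, cars$dist,
  main = "Custom symbols",
  xlab = "Speed", ylab = "Stopping distance",
  pch = 19, col = "steelblue"
)
# Add a dashed horizontal reference line at the mean distance
abline(h = mean(cars$dist), lty = 2, col = "grey40")

# Restore the previous settings
par(old_par)
```

Saving the return value of par() and restoring it afterwards keeps later plots unaffected by the temporary layout change.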

In this section we will explore a data set called students. You may download the students.csv file from the URL used below or load the data set directly into R using read.csv():

students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
str(students)
## 'data.frame':    8239 obs. of  16 variables:
##  $ stud.id        : int  833917 898539 379678 807564 383291 256074 754591 146494 723584 314281 ...
##  $ name           : chr  "Gonzales, Christina" "Lozano, T'Hani" "Williams, Hanh" "Nem, Denzel" ...
##  $ gender         : chr  "Female" "Female" "Female" "Male" ...
##  $ age            : int  19 19 22 19 21 19 21 21 18 18 ...
##  $ height         : int  160 172 168 183 175 189 156 167 195 165 ...
##  $ weight         : num  64.8 73 70.6 79.7 71.4 85.8 65.9 65.7 94.4 66 ...
##  $ religion       : chr  "Muslim" "Other" "Protestant" "Other" ...
##  $ nc.score       : num  1.91 1.56 1.24 1.37 1.46 1.34 1.11 2.03 1.29 1.19 ...
##  $ semester       : chr  "1st" "2nd" "3rd" "2nd" ...
##  $ major          : chr  "Political Science" "Social Sciences" "Social Sciences" "Environmental Sciences" ...
##  $ minor          : chr  "Social Sciences" "Mathematics and Statistics" "Mathematics and Statistics" "Mathematics and Statistics" ...
##  $ score1         : int  NA NA 45 NA NA NA NA 58 57 NA ...
##  $ score2         : int  NA NA 46 NA NA NA NA 62 67 NA ...
##  $ online.tutorial: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ graduated      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ salary         : num  NA NA NA NA NA NA NA NA NA NA ...

The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary.


Histogram

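Histograms are created with the hist() function. By setting the argument freq = FALSE the histogram shows probability densities instead of frequencies. The argument breaks controls the binning. A single number is treated only as a suggestion for the number of bins, because R rounds the break points to "pretty" values.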

hist(students$age)

hist(students$age, freq = FALSE)

hist(students$age, freq = FALSE, breaks = 50)
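
The following sketch illustrates how breaks behaves when given a single number versus a vector of break points, and what freq = FALSE means for the bar heights. The age vector is a simulated, hypothetical stand-in for students$age, so the block runs without downloading the data.

```r
set.seed(1)
# Hypothetical stand-in for students$age: 500 integer ages between 18 and 35
age <- sample(18:35, size = 500, replace = TRUE)

# A single number for breaks is only a suggestion; R picks "pretty" break points
h_suggested <- hist(age, breaks = 50, plot = FALSE)
length(h_suggested$counts) # number of bins actually used

# A vector of break points is used exactly: one bin per year of age
h_exact <- hist(age, breaks = seq(17.5, 35.5, by = 1), plot = FALSE)
length(h_exact$counts) # 18 bins

# On the density scale, bar areas (density times bin width) add up to 1
sum(h_exact$density * diff(h_exact$breaks))
```

Passing explicit break points is one way to get a predictable binning, for example for integer-valued variables.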

The shape of histograms is strongly affected by the number of bins used, hence it is sometimes useful to plot a kernel density plot instead. In R we can create a kernel density plot by using the density() function and by plotting the resulting object.

plot(density(students$age))

Note that a density plot can be thought of as a smoothed histogram. There are many optional parameters in the density() function (type help(density) into your console for more details). For example, we can change the smoothing bandwidth with the bw argument.

plot(density(students$age, bw = 0.5))
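
To see how the bandwidth affects the estimate, one option is to overlay density curves on a histogram drawn on the density scale. The sketch below again uses a simulated, hypothetical stand-in for students$age. Doubling the default bandwidth is an arbitrary choice for illustration.

```r
set.seed(1)
# Hypothetical stand-in for students$age
age <- sample(18:35, size = 500, replace = TRUE)

# Histogram on the density scale so that all layers share the y axis
hist(age,
  freq = FALSE,
  breaks = seq(17.5, 35.5, by = 1),
  main = "Histogram and density estimates"
)

# Density estimate with the default bandwidth
d_default <- density(age)
lines(d_default, col = "blue", lwd = 2)

# Twice the default bandwidth: smoother curve, less detail
d_smooth <- density(age, bw = 2 * d_default$bw)
lines(d_smooth, col = "red", lwd = 2, lty = 2)

# The bandwidth actually used is stored in the result
d_default$bw
```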


Barplot

Barplots are used to plot categorical data. In order to construct a barplot we apply the barplot() function. Note that we first apply the table() function to count the entries for each particular category in the data column of interest.

counts <- table(students$religion)
counts
## 
##   Catholic     Muslim   Orthodox      Other Protestant 
##       2797        330        585       2688       1839
barplot(counts)
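
Relative frequencies are often easier to compare than raw counts. The sketch below re-enters the counts from the table() output above as a named vector (so it runs without the data file), converts them to proportions and sorts the bars. The axis label and the las setting are arbitrary choices.

```r
# Counts copied from the table() output above
counts <- c(
  Catholic = 2797, Muslim = 330, Orthodox = 585,
  Other = 2688, Protestant = 1839
)

# Convert counts to proportions and sort in decreasing order
shares <- sort(counts / sum(counts), decreasing = TRUE)
round(shares, 3)

# las = 2 rotates the axis labels so that long category names fit
barplot(shares, ylab = "Proportion of students", las = 2)
```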

With minor adjustments we may produce a stacked bar plot with columns corresponding to the student's semester, colors corresponding to the religious belief, and a legend. For coloring we use the RColorBrewer package, so make sure that this package is installed on your machine. If not, type install.packages("RColorBrewer") into your console before you continue.

counts <- table(students$religion, students$semester)
barplot(counts,
  col = RColorBrewer::brewer.pal(length(rownames(counts)), "Set1"),
  legend = rownames(counts)
)

By adding the argument beside = TRUE to the function call we get a grouped bar plot.

barplot(counts,
  col = RColorBrewer::brewer.pal(length(rownames(counts)), "Set1"),
  legend = rownames(counts),
  beside = TRUE
)
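
Grouped bar plots can get crowded, so it may help to control the legend position and to label the bars. The sketch below uses a small, hypothetical contingency table and base R color names instead of RColorBrewer so that it runs without extra packages.

```r
# Small hypothetical contingency table (rows: groups, columns: semesters)
counts <- matrix(c(30, 12, 25, 28, 10, 22, 20, 8, 18),
  nrow = 3,
  dimnames = list(c("A", "B", "C"), c("1st", "2nd", "3rd"))
)

# beside = TRUE draws one bar per cell; barplot() returns the bar midpoints
mids <- barplot(counts,
  beside = TRUE,
  col = c("tomato", "steelblue", "gold"),
  legend.text = rownames(counts),
  args.legend = list(x = "topright", bty = "n"),
  ylim = c(0, max(counts) * 1.3) # head room for labels and legend
)

# Label each bar with its count, slightly above the bar top
text(mids, counts, labels = counts, pos = 3, cex = 0.8)
```

With beside = TRUE the returned midpoints form a matrix with the same shape as the input table, which is why they can be paired with the counts directly in text().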


Box plot

Box plots are useful for displaying the distribution of data. The boxplot() function is used to create box plots in R.

boxplot(students$salary)

By using the ~ syntax we split the box plots by groups. In order to generate a box plot of the variable salary conditioned on the religious groups, we use the following command:

boxplot(salary ~ religion, data = students)

Again we may tweak the plot by adding some additional arguments (type help(boxplot) into your console for further details). For coloring we use the RColorBrewer package, hence make sure that this package is already installed on your machine. If not, type install.packages("RColorBrewer") into your console before you continue.

boxplot(salary ~ religion,
  data = students,
  notch = TRUE,
  col = RColorBrewer::brewer.pal(length(unique(students$religion)), "Set1")
)

As you can see, the additional argument notch = TRUE draws a notch on each side of the box around the median line. The vertical extent of the notch indicates roughly a 95 % confidence interval for the median. Thus, if the notches of two box plots do not overlap, this is strong evidence that the medians differ, roughly at the 5 % significance level.
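
To make the notch extent concrete: according to the documentation of boxplot.stats(), the notches extend to the median plus or minus 1.58 times the hinge spread (approximately the interquartile range) divided by the square root of the sample size. The sketch below checks this on simulated data. The salary vector is a hypothetical stand-in, not the students data.

```r
set.seed(42)
# Hypothetical stand-in for a salary variable
salary <- rlnorm(200, meanlog = 10.5, sdlog = 0.3)

# boxplot.stats() returns the five-number summary (stats), the sample
# size (n) and the lower and upper notch limits (conf)
s <- boxplot.stats(salary)

# Recompute the notch limits by hand:
# median +/- 1.58 * (upper hinge - lower hinge) / sqrt(n)
hinge_spread <- s$stats[4] - s$stats[2]
manual <- s$stats[3] + c(-1, 1) * 1.58 * hinge_spread / sqrt(s$n)

all.equal(manual, s$conf)
```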


Exercise: Are there any differences in nc.score between students’ major subjects? Analyze graphically using appropriate box plots.

### your code here
One possible solution:
boxplot(nc.score ~ major,
  data = students,
  notch = TRUE,
  col = RColorBrewer::brewer.pal(length(unique(students$major)), "Set1")
)


Line and scatter plot

In R, line charts and scatter plots are built in the same way, using the plot() command. They differ only in the data provided and in whether lines or points are drawn. This behavior is controlled by the type argument. Note that by default plot() draws points.

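The type argument can take the following values: "p" (points), "l" (lines), "b" (both points and lines), "c" (lines only, with gaps where the points of "b" would be), "o" (points and lines overplotted), "h" (histogram-like vertical lines), "s" and "S" (stair steps), and "n" (no plotting).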

For the sake of simplicity let us plot a simple cosine curve:

x <- seq(from = -2 * pi, to = 2 * pi, length.out = 100)
y <- cos(x)
plot(x, y)

By specifying the type argument we change the line type of the plot.

plot(x, y, type = "l")

plot(x, y, type = "h")
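
For a side-by-side comparison, several values of type can be drawn in one figure. This is a sketch only: the selection of types, the coarser grid of 30 points (chosen so that individual points remain visible) and the 2 x 3 layout are arbitrary choices.

```r
# Coarser grid than above so that points and steps are easy to see
x <- seq(from = -2 * pi, to = 2 * pi, length.out = 30)
y <- cos(x)

types <- c("p", "l", "b", "o", "h", "s")

# Arrange six panels in a 2 x 3 grid and remember the previous settings
old_par <- par(mfrow = c(2, 3))
for (t in types) {
  plot(x, y, type = t, main = paste0("type = \"", t, "\""))
}
# Restore the previous layout
par(old_par)
```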

Note that the command plot() creates a new plot, overwriting the existing one. In order to add a line graph feature to an existing plot we call the lines() command.

plot(x, y)
lines(x, sin(x))
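
When several curves share one plot, different line styles and a legend make them easier to tell apart. The following sketch combines lines(), abline() and legend(). The colors, line types and legend position are arbitrary choices.

```r
x <- seq(from = -2 * pi, to = 2 * pi, length.out = 100)

# First layer: cosine as a solid line; the wider ylim leaves room for the legend
plot(x, cos(x), type = "l", ylim = c(-1.5, 1.5), ylab = "y")

# Second layer: sine as a dashed red line
lines(x, sin(x), col = "red", lty = 2)

# Dotted horizontal reference line at zero
abline(h = 0, col = "grey60", lty = 3)

# The legend repeats the colors and line types used above
legend("topright",
  legend = c("cos(x)", "sin(x)"),
  col = c("black", "red"),
  lty = c(1, 2),
  bty = "n"
)
```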

Above, we displayed a single function (cos(x)) against x. In order to display the relationship between two (continuous) variables we generally use a scatter plot. In R we construct a scatter plot using the plot() command.

Let us return to the students data set and construct a scatter plot. For a less packed visualization we restrict the data set to the first 100 entries.

students100 <- students[1:100, ]
plot(students100$height, students100$weight)

We can easily select a different point type character by tweaking the pch argument of the plot() function. For example by setting pch = 19 we set the points to be filled circles.

plot(students100$height, students100$weight, pch = 19)

Other pch values correspond to other point types (see figure below).
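
An overview of the point types can also be generated directly in R. The sketch below draws the symbols for pch = 0 to 25 in a grid and prints each number beneath its symbol. The grid layout and the fill color are arbitrary choices.

```r
# The 26 standard plotting symbols, arranged in rows of seven
pch_values <- 0:25
grid_x <- pch_values %% 7
grid_y <- 3 - pch_values %/% 7

plot(grid_x, grid_y,
  pch = pch_values,
  cex = 2, col = "black", bg = "orange", # bg fills symbols 21 to 25 only
  xlim = c(-0.5, 6.5), ylim = c(-0.5, 3.5),
  axes = FALSE, xlab = "", ylab = "", main = "pch values"
)

# Print each pch number just below its symbol
text(grid_x, grid_y, labels = pch_values, pos = 1, offset = 1)
```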

A nice feature of R is the simplicity to construct high quality graphs layer by layer. One such approach includes adding points to an existing plot. This can be achieved using the points() function.

Let us add one red dot, corresponding to the mean of both variables of interest, to the plot from above. Note that we use the pch, the col and the cex arguments to define the visualization properties of the point.

mean.weight <- mean(students100$weight)
mean.height <- mean(students100$height)
plot(students100$height, students100$weight, pch = 19)
points(mean.height, mean.weight, pch = 15, col = "red", cex = 1.5)

Another approach includes adding regression lines. There are several ways to compute regression lines in R. One of them makes use of the lm() function (a function to fit linear models; type help(lm) into your console for further information), another is to make use of the lowess() function (a function that uses locally-weighted polynomial regression; type help(lowess) into your console for further information). In R it is very easy to add such modeled regression lines. Note that we add the col argument to give each line a different color.

plot(students100$height, students100$weight, pch = 19)
# regression line (y~x)
abline(lm(weight ~ height, data = students100), col = "red")
# lowess line (x,y)
lines(lowess(students100$height, students100$weight), col = "blue")
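
The fitted model can also be inspected before it is drawn. The sketch below uses simulated, hypothetical height and weight values (not the students data) to show how the coefficients from lm() relate to abline(), and how the smoother span f changes the lowess line.

```r
set.seed(7)
# Hypothetical stand-in for the height and weight columns
height <- round(rnorm(100, mean = 172, sd = 10))
weight <- -85 + 0.92 * height + rnorm(100, sd = 5)

fit <- lm(weight ~ height)
coef(fit) # intercept and slope of the fitted line

plot(height, weight, pch = 19)

# abline() accepts the fitted model directly ...
abline(fit, col = "red")
# ... or an explicit intercept (a) and slope (b); both calls draw the same line
abline(a = coef(fit)[1], b = coef(fit)[2], col = "red", lty = 2)

# The smoother span f (default 2/3) controls how closely lowess follows the data
lines(lowess(height, weight, f = 2 / 3), col = "blue")
lines(lowess(height, weight, f = 0.2), col = "darkgreen")
```

A smaller span uses fewer neighboring points for each local fit, so the resulting line follows local fluctuations more closely.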

Scatter plots are extremely useful for exploratory data analysis and hence many contributed R packages provide enhanced or specialized scatter plotting capabilities. In base R there exists the pairs() function, which produces a matrix of scatter plots. Note the special notation using the ~ operator.

pairs(~ weight + height + age + salary, data = students100)
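
The panels drawn by pairs() can be customized with a panel function. As a sketch, the block below uses panel.smooth() from base R, which adds a smooth curve to each scatter plot. The built-in iris data set serves as a stand-in here (it is not part of this section's data).

```r
# Built-in iris data as a stand-in for illustration
pairs(~ Sepal.Length + Sepal.Width + Petal.Length,
  data = iris,
  panel = panel.smooth, # draws the points plus a smooth curve in each panel
  pch = 19,
  col = "grey40"
)
```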

One further contributed package is the car package. The provided scatterplotMatrix() function allows us to condition the scatterplot matrix on a factor and optionally include lowess and linear best fit lines, box plots, densities or histograms in the principal diagonal, as well as rug plots in the margins of the cells.

In the plot below we plot the variables weight, height, age and salary conditioned on the factor variable gender:

library(car)
scatterplotMatrix(~ weight + height + age + salary | gender,
  data = students100,
  regLine = FALSE,
  smooth = FALSE
)

Feel free to explore the different kinds of layout options by typing ??car::scatterplotMatrix.


Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

Creative Commons License
You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.