In R, graphs are typically created interactively. Issuing a plotting
command such as `plot()`, `hist()`, or `boxplot()` creates a new graph
and typically overwrites the previous one. In addition, one can control
fonts, colors, line styles, axes, reference lines, etc. by setting
graphical parameters. The following sections walk through the most
important concepts and commands.
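As a first taste of graphical parameters, the following minimal sketch sets a color, a line style, a line width, a title font, and a reference line in a single plot. The data (a damped sine wave) and the chosen parameter values are illustrative assumptions and do not come from the data set used later.

```r
# Illustrative data only: a damped sine wave
x_demo <- seq(0, 10, length.out = 200)
y_demo <- exp(-x_demo / 5) * sin(2 * x_demo)

plot(x_demo, y_demo,
  type = "l",          # draw a line instead of points
  col = "steelblue",   # line color
  lty = 2,             # dashed line style
  lwd = 2,             # line width
  font.main = 3,       # italic title font
  main = "Damped sine wave",
  xlab = "x", ylab = "y"
)

# Add a dotted horizontal reference line at y = 0
abline(h = 0, col = "grey50", lty = 3)
```

Type `help(par)` into your console for the full list of graphical parameters.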

In this section we explore a data set called *students*. You may
download the `students.csv` file or load the data set directly in R
using `read.csv()`:

```
students <- read.csv("https://userpage.fu-berlin.de/soga/data/raw-data/students.csv")
str(students)
```

```
## 'data.frame': 8239 obs. of 16 variables:
## $ stud.id : int 833917 898539 379678 807564 383291 256074 754591 146494 723584 314281 ...
## $ name : chr "Gonzales, Christina" "Lozano, T'Hani" "Williams, Hanh" "Nem, Denzel" ...
## $ gender : chr "Female" "Female" "Female" "Male" ...
## $ age : int 19 19 22 19 21 19 21 21 18 18 ...
## $ height : int 160 172 168 183 175 189 156 167 195 165 ...
## $ weight : num 64.8 73 70.6 79.7 71.4 85.8 65.9 65.7 94.4 66 ...
## $ religion : chr "Muslim" "Other" "Protestant" "Other" ...
## $ nc.score : num 1.91 1.56 1.24 1.37 1.46 1.34 1.11 2.03 1.29 1.19 ...
## $ semester : chr "1st" "2nd" "3rd" "2nd" ...
## $ major : chr "Political Science" "Social Sciences" "Social Sciences" "Environmental Sciences" ...
## $ minor : chr "Social Sciences" "Mathematics and Statistics" "Mathematics and Statistics" "Mathematics and Statistics" ...
## $ score1 : int NA NA 45 NA NA NA NA 58 57 NA ...
## $ score2 : int NA NA 46 NA NA NA NA 62 67 NA ...
## $ online.tutorial: int 0 0 0 0 0 0 0 0 0 0 ...
## $ graduated : int 0 0 0 0 0 0 0 0 0 0 ...
## $ salary : num NA NA NA NA NA NA NA NA NA NA ...
```

The *students* data set consists of 8239 rows, each representing a
particular student, and 16 columns, each corresponding to a
variable/feature of that student. These self-explanatory variables are:
*stud.id, name, gender, age, height, weight, religion, nc.score,
semester, major, minor, score1, score2, online.tutorial, graduated,
salary*.

Histograms are created with the `hist()` function. By setting the
argument `freq = FALSE` the plot shows probability densities instead of
frequencies. The `breaks` argument controls the binning. A single number
is treated as a suggested number of bins, which R may adjust to obtain
round break points.

`hist(students$age)`

`hist(students$age, freq = FALSE)`

`hist(students$age, freq = FALSE, breaks = 50)`
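The effect of `freq = FALSE` and `breaks` can be checked on the object that `hist()` returns. The sketch below uses a simulated vector `age_demo` as a hypothetical stand-in for `students$age`, so that it runs without downloading the data.

```r
set.seed(1)
# Hypothetical stand-in for students$age
age_demo <- sample(18:30, size = 500, replace = TRUE)

# hist() draws the plot and invisibly returns the underlying numbers
h_coarse <- hist(age_demo, freq = FALSE, breaks = 5)
h_fine <- hist(age_demo, freq = FALSE, breaks = 50)

# Number of bins actually used (breaks is only a suggestion)
length(h_coarse$counts)
length(h_fine$counts)

# With densities, the bar areas (height * width) add up to 1
area_coarse <- sum(h_coarse$density * diff(h_coarse$breaks))
area_fine <- sum(h_fine$density * diff(h_fine$breaks))
c(area_coarse, area_fine)
```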

The shape of a histogram is strongly affected by the number of bins
used, hence it is sometimes useful to draw a kernel density plot
instead. In R we can create a kernel density plot by calling the
`density()` function and plotting the resulting object.

`plot(density(students$age))`

Note that a density plot can be thought of as a smoothed histogram.
There are many optional parameters in the `density()` function (type
`help(density)` into your console for more details). We can easily
change the bandwidth, and thereby the amount of smoothing, by tweaking
the `bw` argument.

`plot(density(students$age, bw = 0.5))`
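To see how the bandwidth changes the amount of smoothing, one option is to overlay several density curves on a histogram. This is a sketch on simulated data (`age_sim` is a hypothetical stand-in for `students$age`), and the bandwidth values are arbitrary choices for illustration.

```r
set.seed(1)
# Hypothetical stand-in for students$age
age_sim <- sample(18:30, size = 500, replace = TRUE)

d_default <- density(age_sim)          # bandwidth chosen automatically
d_narrow <- density(age_sim, bw = 0.5) # little smoothing, wiggly curve
d_wide <- density(age_sim, bw = 2)     # strong smoothing, flat curve

hist(age_sim, freq = FALSE, breaks = 20,
  main = "Histogram with density curves", xlab = "age")
lines(d_default, lwd = 2)
lines(d_narrow, col = "red")
lines(d_wide, col = "blue")
legend("topright",
  legend = c("default bw", "bw = 0.5", "bw = 2"),
  col = c("black", "red", "blue"), lty = 1
)

# The bandwidth that was used is stored in the returned object
c(default = d_default$bw, narrow = d_narrow$bw, wide = d_wide$bw)
```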

Bar plots are used to plot categorical data. In order to construct a
bar plot we apply the `barplot()` function. Note that we first apply the
`table()` function to count the entries for each category in the data
column of interest.

```
counts <- table(students$religion)
counts
```

```
##
## Catholic Muslim Orthodox Other Protestant
## 2797 330 585 2688 1839
```

`barplot(counts)`
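A common variation is to plot proportions instead of counts and to order the bars by size. The sketch below assumes a small hand-made vector `religion_demo` (not the real data) so that the steps are easy to follow.

```r
# Small illustrative vector, not the real students data
religion_demo <- c(
  "Catholic", "Other", "Catholic", "Muslim",
  "Protestant", "Other", "Catholic", "Orthodox"
)

counts_demo <- table(religion_demo)       # count entries per category
shares_demo <- prop.table(counts_demo)    # convert counts to proportions
shares_sorted <- sort(shares_demo, decreasing = TRUE)

barplot(shares_sorted, ylab = "Proportion", las = 2) # las = 2 rotates labels
```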

With minor adjustments we may produce a stacked bar plot with columns
corresponding to the students' semester, colors corresponding to the
religious belief, and a legend. For coloring we use the `RColorBrewer`
package, so make sure that this package is installed on your machine. If
not, type `install.packages("RColorBrewer")` into your console before
you continue.

```
counts <- table(students$religion, students$semester)
barplot(counts,
col = RColorBrewer::brewer.pal(length(rownames(counts)), "Set1"),
legend = rownames(counts)
)
```

By adding the argument `beside = TRUE` to the function call we get a
grouped bar plot.

```
barplot(counts,
col = RColorBrewer::brewer.pal(length(rownames(counts)), "Set1"),
legend = rownames(counts),
beside = TRUE
)
```

Box plots are useful for displaying the distribution of data. The
`boxplot()` function is used to create box plots in R.

`boxplot(students$salary)`

By using the `~` (formula) syntax we split the box plots by groups. In
order to generate a box plot of the variable `salary` conditioned on the
religious groups, we use the following command:

`boxplot(salary ~ religion, data = students)`

Again we may tweak the plot by adding some additional arguments (type
`help(boxplot)` into your console for further details). For coloring we
again use the `RColorBrewer` package, so make sure that this package is
installed on your machine. If not, type
`install.packages("RColorBrewer")` into your console before you
continue.

```
boxplot(salary ~ religion,
data = students,
notch = TRUE,
col = RColorBrewer::brewer.pal(length(unique(students$religion)), "Set1")
)
```

As you can see, the additional argument `notch = TRUE` draws a
triangular notch on each side of the box at the median line. The
vertical extent of the notch indicates roughly the 95 % confidence
interval for the median. Thus, if the notches of two box plots do not
overlap, this is strong evidence that the medians differ, at an error
probability of roughly 5 %.
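The notch limits can also be inspected numerically with `boxplot.stats()`, which returns them in its `conf` component. The following sketch uses two simulated salary samples (`salary_a` and `salary_b` are hypothetical, not taken from the *students* data) and reproduces the limits by hand as median ± 1.58 · (hinge spread) / √n.

```r
set.seed(42)
# Hypothetical salary samples for two groups
salary_a <- rnorm(80, mean = 40000, sd = 5000)
salary_b <- rnorm(80, mean = 45000, sd = 5000)

stats_a <- boxplot.stats(salary_a)
stats_b <- boxplot.stats(salary_b)

# stats$stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
hinge_spread_a <- diff(stats_a$stats[c(2, 4)])
manual_a <- stats_a$stats[3] + c(-1.58, 1.58) * hinge_spread_a / sqrt(stats_a$n)

stats_a$conf # notch limits as computed by R
manual_a     # the same limits computed by hand

# Do the two notch intervals overlap?
notches_overlap <- stats_a$conf[2] >= stats_b$conf[1] &&
  stats_b$conf[2] >= stats_a$conf[1]
notches_overlap

boxplot(list(A = salary_a, B = salary_b), notch = TRUE)
```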

Exercise: Are there any differences in `nc.score` between students' major subjects? Analyze graphically using appropriate box plots.

`### your code here`

```
boxplot(nc.score ~ major,
data = students,
notch = TRUE,
col = RColorBrewer::brewer.pal(length(unique(students$major)), "Set1")
)
```

In R, line charts and scatter plots are built in the same way, using the
`plot()` command. They differ only in the data provided and in whether
lines or points are drawn. This behavior is specified by the plot type
argument (`type`). Note that by default `plot()` draws points.

The `type` argument can take the following values:

- `p`: points
- `l`: lines
- `o`: overplotted points and lines
- `b`, `c`: points (empty if "c") joined by lines
- `s`, `S`: stair steps
- `h`: histogram-like vertical lines
- `n`: does not produce any points or lines
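To compare the values of `type` listed above side by side, one option is to draw the same data once per type in a grid of panels. This is a minimal sketch. The data, the 3 x 3 layout, and the margin settings are arbitrary choices for illustration.

```r
# A few points on a sine curve, so that points and lines are both visible
x_type <- seq(0, 2 * pi, length.out = 15)
y_type <- sin(x_type)

plot_types <- c("p", "l", "o", "b", "c", "s", "S", "h", "n")

# Arrange the panels in a 3 x 3 grid and remember the previous settings
old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
for (tp in plot_types) {
  plot(x_type, y_type, type = tp, main = paste0('type = "', tp, '"'))
}
par(old_par) # restore the previous layout
```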

For the sake of simplicity let us plot a simple cosine curve:

```
x <- seq(from = -2 * pi, to = 2 * pi, length.out = 100)
y <- cos(x)
plot(x, y)
```

By specifying the `type` argument we change the plot type.

`plot(x, y, type = "l")`

`plot(x, y, type = "h")`

Note that the command `plot()` creates a new plot, overwriting the
existing one. In order to add a line to an existing plot we call the
`lines()` command.

```
plot(x, y)
lines(x, sin(x))
```
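Keep in mind that `lines()` draws into the existing coordinate system and does not rescale the axes, so a curve that exceeds the current axis limits is clipped. The sketch below sets `ylim` in advance and adds a legend. The second curve (`2 * sin(x)`) and the styling are assumptions made for illustration.

```r
x_curve <- seq(from = -2 * pi, to = 2 * pi, length.out = 100)
y_cos <- cos(x_curve)
y_sin2 <- 2 * sin(x_curve) # larger amplitude than the cosine curve

# Choose axis limits that fit both curves before drawing the first one
y_limits <- range(c(y_cos, y_sin2))

plot(x_curve, y_cos, type = "l", ylim = y_limits, xlab = "x", ylab = "f(x)")
lines(x_curve, y_sin2, col = "red", lty = 2)
legend("bottomleft",
  legend = c("cos(x)", "2 * sin(x)"),
  col = c("black", "red"), lty = c(1, 2)
)
```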

Above, we plotted a single function of `x` (`cos(x)`) as our main
graph. In order to display the relationship between two (continuous)
variables we generally use a scatter plot. In R we construct a scatter
plot using the `plot()` command as well.

Let us return to the *students* data set and construct a
scatter plot. For a less packed visualization we restrict the data set
to the first 100 entries.

```
students100 <- students[1:100, ]
plot(students100$height, students100$weight)
```

We can easily select a different point character by tweaking the `pch`
argument of the `plot()` function. For example, by setting `pch = 19`
we set the points to be filled circles.

`plot(students100$height, students100$weight, pch = 19)`

Other `pch` values correspond to other point types (see figure below).
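An overview of the point symbols can also be generated directly in R. The following sketch draws the symbols for `pch = 0` to `pch = 25` on a grid and labels each with its number. The grid layout and the grey fill color are arbitrary choices.

```r
pch_values <- 0:25

# Place the symbols on a grid with 7 columns
col_pos <- pch_values %% 7
row_pos <- -(pch_values %/% 7)

plot(col_pos, row_pos,
  pch = pch_values, cex = 2,
  bg = "grey80", # fill color, only used by symbols 21 to 25
  axes = FALSE, xlab = "", ylab = "",
  xlim = c(-0.5, 6.5), ylim = c(-3.7, 0.5),
  main = "Point symbols for pch = 0 to 25"
)

# Write the pch number below each symbol
text(col_pos, row_pos - 0.35, labels = pch_values, cex = 0.8)
```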

A nice feature of R is the ease with which high-quality graphs can be
constructed layer by layer. One such approach is to add points to an
existing plot. This can be achieved using the `points()` function.

Let us add one red dot, corresponding to the mean of both variables of
interest, to the plot from above. Note that we use the `pch`, `col`, and
`cex` arguments to define the symbol, color, and size of the point.

```
mean.weight <- mean(students100$weight)
mean.height <- mean(students100$height)
plot(students100$height, students100$weight, pch = 19)
points(mean.height, mean.weight, pch = 15, col = "red", cex = 1.5)
```

Another approach is to add regression lines. There are several ways to
compute regression lines in R. One of them makes use of the `lm()`
function (a function to fit linear models; type `help(lm)` into your
console for further information), another makes use of the `lowess()`
function (a function that uses locally weighted polynomial regression;
type `help(lowess)` into your console for further information). In R it
is very easy to add such fitted lines to a plot. Note that we add the
`col` argument to give each line a different color.

```
plot(students100$height, students100$weight, pch = 19)
# regression line (y~x)
abline(lm(weight ~ height, data = students100), col = "red")
# lowess line (x,y)
lines(lowess(students100$height, students100$weight), col = "blue")
```
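The line drawn by `abline()` is fully determined by the two coefficients of the fitted model, and a least-squares line with an intercept always passes through the point of means. The sketch below checks this on simulated data. The vectors `height_sim` and `weight_sim` and the relationship used to generate them are hypothetical and only loosely resemble the *students* data.

```r
set.seed(7)
# Hypothetical data loosely resembling height (cm) and weight (kg)
height_sim <- round(rnorm(100, mean = 172, sd = 10))
weight_sim <- 0.9 * height_sim - 85 + rnorm(100, sd = 5)

fit <- lm(weight_sim ~ height_sim)
b <- coef(fit) # intercept and slope
b

plot(height_sim, weight_sim, pch = 19)
# Same line as abline(fit), built from the coefficients
abline(a = b[["(Intercept)"]], b = b[["height_sim"]], col = "red")
# The point of means lies exactly on the regression line
points(mean(height_sim), mean(weight_sim), pch = 15, col = "blue", cex = 1.5)

fitted_at_mean <- b[["(Intercept)"]] + b[["height_sim"]] * mean(height_sim)
c(fitted_at_mean = fitted_at_mean, mean_weight = mean(weight_sim))
```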

Scatter plots are extremely useful for exploratory data analysis, and
hence many contributed R packages provide enhanced or specialized
scatter plotting capabilities. Base R provides the `pairs()` function,
which produces a matrix of scatter plots. Note the special notation
using the `~` operator.

`pairs(~ weight + height + age + salary, data = students100)`
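The panels of a scatter plot matrix can be customized through the `panel` argument of `pairs()`. As a sketch, the call below uses the built-in `panel.smooth()` function to add a smoothed curve to every panel. The data frame `demo_df` is simulated and serves only as a hypothetical stand-in for `students100`.

```r
set.seed(3)
n_obs <- 100
# Hypothetical stand-in for students100
demo_df <- data.frame(
  height = rnorm(n_obs, mean = 172, sd = 10),
  age = sample(18:30, size = n_obs, replace = TRUE)
)
demo_df$weight <- 0.9 * demo_df$height - 85 + rnorm(n_obs, sd = 5)

# panel.smooth draws the points and adds a lowess curve to each panel
pairs(~ weight + height + age,
  data = demo_df,
  panel = panel.smooth,
  pch = 19, cex = 0.6
)
```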