Data frames are arguably the most important structure for storing data in an ordered, tabular way. They also let you work conveniently with large data sets.
But what are the advantages of data frames compared with other data types and structures?
In short, a data frame stores a set of vectors as the columns of one table, and the columns may have different types. As the following sections show, data frames allow us to summarize, extract, filter, extend and sort data, and to handle missing values.
There are therefore plenty of good reasons to learn about the concept of data frames.
Let’s get started!
Defining a data frame
We can create a data frame by applying the data.frame() function. Note that in most applications we import data into a data.frame object.
data <- data.frame(
name = c("John", "Molly", "Frank", "Peter", "Michelle"),
job = c("Policeman", "Artist", "Banker", NA, "Teacher"),
sex = c("male", "female", "male", "male", "female"),
age = c(45, 32, 58, 18, 22),
stringsAsFactors = FALSE # added to avoid character string variables being interpreted as factors
)
data
## name job sex age
## 1 John Policeman male 45
## 2 Molly Artist female 32
## 3 Frank Banker male 58
## 4 Peter <NA> male 18
## 5 Michelle Teacher female 22
class(data)
## [1] "data.frame"
Note: As you can see, we used the c() function to combine the single data values into vectors. Moreover, we defined four individual vectors (name, job, sex, age) and provided them as arguments to the data.frame() function. The function combines the four vectors into one data frame that holds the provided data.
Important Note: As you can see data frames are used to store a user-defined number of vectors. The types of the given vectors can be different.
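The note above can be illustrated with a minimal sketch. The data frame people and its columns are hypothetical and only serve as an example; sapply() and tryCatch() are base R functions.

```r
# Hypothetical example data: three columns of three different types
people <- data.frame(
  name = c("Ann", "Ben", "Cleo"),
  age = c(31, 27, 45),
  member = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Each column keeps its own type
col_types <- sapply(people, class)
col_types

# Vectors of unequal length that cannot be recycled raise an error;
# tryCatch() turns the error into a short message here
result <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) "error: differing number of rows"
)
result
```

The columns may differ in type, but every column of a data frame has the same number of rows.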
Summarizing a data frame
Typing the name of the stored data frame prints all the data stored in the variable. This can be inconvenient, particularly when working with big data sets. In order to get an overview of our data and a summary of the stored information we can use the str() and summary() functions.
We can review the structure of the data frame and the data types of the values using the str() function.
str(data)
## 'data.frame': 5 obs. of 4 variables:
## $ name: chr "John" "Molly" "Frank" "Peter" ...
## $ job : chr "Policeman" "Artist" "Banker" NA ...
## $ sex : chr "male" "female" "male" "male" ...
## $ age : num 45 32 58 18 22
The statistical summary and nature of the data can be obtained by applying the summary() function.
summary(data)
##      name               job                sex                 age
##  Length:5           Length:5           Length:5           Min.   :18
##  Class :character   Class :character   Class :character   1st Qu.:22
##  Mode  :character   Mode  :character   Mode  :character   Median :32
##                                                           Mean   :35
##                                                           3rd Qu.:45
##                                                           Max.   :58
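For a quick look at a large data frame, a few more base R functions may be handy alongside str() and summary(). The following is a minimal sketch; the data frame measurements is a hypothetical example and not part of the data used in this section.

```r
# Hypothetical example data with 100 rows
measurements <- data.frame(
  id = 1:100,
  value = seq(0.5, 50, by = 0.5)
)

# Print only the first three rows instead of the whole data frame
first_rows <- head(measurements, 3)
first_rows

# Number of rows and columns
size <- dim(measurements)
size

# Column names
names(measurements)
```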
Exercise: You are given the following data containing the sales volume of a store on a specific day (see below). The information is stored in 3 vectors: product_name contains the name of the product, price contains the individual price of the product, and sold_pieces contains the number of sales per product. Store the three vectors as a single variable in the form of a data frame!
product_name <- c(
"ginger", "oil", "hazelnut chocolate spread", "tomatoe", "pepper",
"milk", "orangejuice", "beer", "pumpkin", "onion", "bread", "apple",
"banana", "cookies", "soysauce", "toiletpaper", "toothpaste", "soap",
"chocolate", "pasta", "rice", "cucumber", "cereals", "mixed nuts",
"tortilla chips", "sausages", "cheese", "chewing gums", "tofu", "coke",
"tequila", "cinnamon", "salt", "lemon", "pizza", "shower gel", "battery",
"flavoured yoghurt", "basil", "coconut milk"
)
price <- c(
0.70, 2.69, 3.79, 2.30, 2.99, 0.89, 1.59, 0.99, 4.28, 1.56,
1.99, 2.05, 2.10, 1.79, 3.89, 2.69, 0.79, 0.99, 1.39, 1.19, 1.99, 0.60,
2.79, 3.99, 1.89, 3.99, 1.99, 0.79, 1.99, 1.39, 6.99, 1.49, 0.99, 2.05,
2.99, 0.89, 2.49, 0.79, 2.99, 1.99
)
sold_pieces <- c(
106, 85, 50, 48, 47, 91, 7, 54, 35, 66, 49, 51, 59, 89, 33, 36,
76, 61, 60, 45, 74, 53, 66, 77, 38, 77, 72, 60, 51, 69, 26, 57,
32, 47, 60, 44, 36, 54, 51, 39
)
### your code here
supermarket <- data.frame(product_name, price, sold_pieces)
str(supermarket)
## 'data.frame': 40 obs. of 3 variables:
## $ product_name: chr "ginger" "oil" "hazelnut chocolate spread" "tomatoe" ...
## $ price : num 0.7 2.69 3.79 2.3 2.99 0.89 1.59 0.99 4.28 1.56 ...
## $ sold_pieces : num 106 85 50 48 47 91 7 54 35 66 ...
Extracting data from a data frame
In order to extract one or more specific columns from a data frame we may use the $ operator. Each use of $ returns a single column as a vector.
data$name
## [1] "John" "Molly" "Frank" "Peter" "Michelle"
data.frame(data$name, data$job)
## data.name data.job
## 1 John Policeman
## 2 Molly Artist
## 3 Frank Banker
## 4 Peter <NA>
## 5 Michelle Teacher
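Columns can also be selected by name with square brackets, which keeps the original column names. The sketch below uses a small, re-created copy of the example data (called staff here, a hypothetical name) so that it runs on its own.

```r
# Small stand-alone copy of the example data
staff <- data.frame(
  name = c("John", "Molly"),
  job = c("Policeman", "Artist"),
  age = c(45, 32),
  stringsAsFactors = FALSE
)

# Single brackets with a character vector return a data frame
two_cols <- staff[c("name", "job")]
two_cols

# Single brackets with one name still return a data frame
one_col_df <- staff["age"]

# Double brackets return the column as a vector, like staff$age
one_vec <- staff[["age"]]
one_vec
```

Single brackets are convenient when the result should remain a data frame; double brackets or $ are the usual choice when a plain vector is needed.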
Similar to a matrix object we can slice a data.frame object with row and column indices (data.frame[row, column]):
# Extract first two rows
data[1:2, ]
## name job sex age
## 1 John Policeman male 45
## 2 Molly Artist female 32
# Extract 3rd and 5th row with 2nd and 4th column
data[c(3, 5), c(2, 4)]
## job age
## 3 Banker 58
## 5 Teacher 22
Note: A specific row of a data frame is extracted by [rownumber, ]. We need the comma because the data frame is two-dimensional. By not specifying a second argument we get the whole row.
In addition, we may extract data based on variable values. To do so, we first create a logical vector, such as data$sex == "female".
data$sex == "female"
## [1] FALSE TRUE FALSE FALSE TRUE
Now we feed that logical vector into the square brackets as row values.
data[data$sex == "female", ]
## name job sex age
## 2 Molly Artist female 32
## 5 Michelle Teacher female 22
Note: To extract rows based on their values, use [logical_expression, ].
Alternatively, we may use the subset() function. It yields essentially the same results as logical indexing, but is often more readable.
subset(data,
age >= 20 & sex == "male",
select = c(name, job, age)
)
## name job age
## 1 John Policeman 45
## 3 Frank Banker 58
Here, we select all men aged 20 or older and keep the variables name, job and age.
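One difference between logical indexing and subset() shows up when the condition evaluates to NA. The following sketch uses a hypothetical data frame people with one missing age to illustrate it.

```r
# Hypothetical example data with a missing age
people <- data.frame(
  name = c("John", "Peter", "Lisa"),
  age = c(45, 18, NA),
  stringsAsFactors = FALSE
)

# The condition is TRUE, FALSE, NA
people$age >= 20

# Logical indexing returns an extra row filled with NA for the NA condition
by_index <- people[people$age >= 20, ]
by_index

# subset() treats NA in the condition as FALSE and drops that row
by_subset <- subset(people, age >= 20)
by_subset

# which() keeps only the positions that are TRUE, so it also drops the NA
by_which <- people[which(people$age >= 20), ]
by_which
```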
Exercise: Use the data frame from the previous exercise. Which products cost more than 2.50 EUR? How many such products are there?
### your code here
supermarket_filtered <- supermarket[supermarket$price > 2.50, ]
# products that cost more than 2.50
supermarket_filtered$product_name
# amount of these products
length(supermarket_filtered$product_name)
## [1] "oil" "hazelnut chocolate spread"
## [3] "pepper" "pumpkin"
## [5] "soysauce" "toiletpaper"
## [7] "cereals" "mixed nuts"
## [9] "sausages" "tequila"
## [11] "pizza" "basil"
## [1] 12
Excluding (dropping) variables
There are several ways to drop data from a data.frame object. One possibility is to apply the %in% operator. To do so, we create a logical vector and select columns of the data frame accordingly.
vars <- colnames(data) %in% c("name", "age", "job")
vars
## [1] TRUE TRUE FALSE TRUE
data[vars]
## name job age
## 1 John Policeman 45
## 2 Molly Artist 32
## 3 Frank Banker 58
## 4 Peter <NA> 18
## 5 Michelle Teacher 22
Applying the negation operator ! to the logical vector selects exactly the columns that were not listed.
data[!vars]
## sex
## 1 male
## 2 female
## 3 male
## 4 male
## 5 female
Another possibility is to exclude particular columns by adding the minus sign to the column index.
# exclude 1st and 3rd variable
data[c(-1, -3)]
## job age
## 1 Policeman 45
## 2 Artist 32
## 3 Banker 58
## 4 <NA> 18
## 5 Teacher 22
Alternatively, we can set columns to NULL.
newdata <- data
newdata$job <- newdata$age <- NULL
newdata
## name sex
## 1 John male
## 2 Molly female
## 3 Frank male
## 4 Peter male
## 5 Michelle female
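The minus sign works with numeric column indices only, not with column names. One possible way to drop columns by name is setdiff(), sketched below on a small stand-alone copy of the example data (staff and drop are hypothetical names).

```r
# Small stand-alone copy of the example data
staff <- data.frame(
  name = c("John", "Molly"),
  job = c("Policeman", "Artist"),
  sex = c("male", "female"),
  age = c(45, 32),
  stringsAsFactors = FALSE
)

# Names of the columns to drop
drop <- c("job", "age")

# setdiff() returns all column names that are not in drop
kept <- staff[setdiff(names(staff), drop)]
kept

# Note: staff[-c("job", "age")] fails, because the unary minus
# is not defined for character vectors
```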
Expanding a data frame
A data frame can be expanded by adding columns and rows.
# Add the "height" column
data$height <- c(195, 165, 180, 178, 182)
data
## name job sex age height
## 1 John Policeman male 45 195
## 2 Molly Artist female 32 165
## 3 Frank Banker male 58 180
## 4 Peter <NA> male 18 178
## 5 Michelle Teacher female 22 182
To add rows to an existing data frame, we apply the rbind() function. Note that the added data needs to have the same structure as the existing data frame.
new_data <- data.frame(
name = "Lisa",
job = "Fitness coach",
sex = "female",
age = NA,
height = 166
)
data <- rbind(data, new_data)
data
## name job sex age height
## 1 John Policeman male 45 195
## 2 Molly Artist female 32 165
## 3 Frank Banker male 58 180
## 4 Peter <NA> male 18 178
## 5 Michelle Teacher female 22 182
## 6 Lisa Fitness coach female NA 166
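What "the same structure" means for rbind() can be sketched as follows. The data frames first, second and other are hypothetical examples: the column order may differ, because columns are matched by name, but differing column names lead to an error.

```r
first <- data.frame(name = "John", age = 45, stringsAsFactors = FALSE)

# Same column names, different order: columns are matched by name
second <- data.frame(age = 32, name = "Molly", stringsAsFactors = FALSE)
combined <- rbind(first, second)
combined

# Different column names: rbind() stops with an error,
# which tryCatch() turns into a short message here
other <- data.frame(name = "Lisa", height = 166, stringsAsFactors = FALSE)
mismatch <- tryCatch(
  rbind(first, other),
  error = function(e) "error: column names do not match"
)
mismatch
```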
Exercise: Use the data frame supermarket from the first exercise. Calculate the business volume (price times number of pieces sold) of each product and store it in a new column of your data frame!
### your code here
supermarket["business_volume"] <- supermarket$price * supermarket$sold_pieces
supermarket$business_volume
## [1] 74.20 228.65 189.50 110.40 140.53 80.99 11.13 53.46 149.80 102.96
## [11] 97.51 104.55 123.90 159.31 128.37 96.84 60.04 60.39 83.40 53.55
## [21] 147.26 31.80 184.14 307.23 71.82 307.23 143.28 47.40 101.49 95.91
## [31] 181.74 84.93 31.68 96.35 179.40 39.16 89.64 42.66 152.49 77.61
Sorting data
In order to sort a data frame we use the order() function. By default, sorting is ascending and missing values are placed last. By prefixing a numeric sorting variable with a minus sign we indicate descending order.
# sort by age (ascending)
data[order(data$age), ]
## name job sex age height
## 4 Peter <NA> male 18 178
## 5 Michelle Teacher female 22 182
## 2 Molly Artist female 32 165
## 1 John Policeman male 45 195
## 3 Frank Banker male 58 180
## 6 Lisa Fitness coach female NA 166
# sort by height (descending)
data[order(-data$height), ]
## name job sex age height
## 1 John Policeman male 45 195
## 5 Michelle Teacher female 22 182
## 3 Frank Banker male 58 180
## 4 Peter <NA> male 18 178
## 6 Lisa Fitness coach female NA 166
## 2 Molly Artist female 32 165
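The order() function also accepts several sorting variables, and its decreasing argument covers cases where a minus sign is not applicable, such as character columns. A minimal sketch on a stand-alone copy of the example data (named people here, a hypothetical name):

```r
# Stand-alone copy of part of the example data
people <- data.frame(
  name = c("John", "Molly", "Frank", "Peter", "Michelle"),
  sex = c("male", "female", "male", "male", "female"),
  age = c(45, 32, 58, 18, 22),
  stringsAsFactors = FALSE
)

# Sort by sex (ascending), then by age (descending) within each sex
sorted <- people[order(people$sex, -people$age), ]
sorted

# The minus sign does not work for character columns;
# decreasing = TRUE sorts them in descending order instead
by_name_desc <- people[order(people$name, decreasing = TRUE), ]
by_name_desc
```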
Exercise: Use the data frame from the previous exercise. Extract the 10 products with the highest business volume on this day!
### your code here
supermarket_top10 <- supermarket[order(supermarket$business_volume, decreasing = TRUE), ]
supermarket_top10 <- supermarket_top10[1:10, ]
supermarket_top10
## product_name price sold_pieces business_volume
## 24 mixed nuts 3.99 77 307.23
## 26 sausages 3.99 77 307.23
## 2 oil 2.69 85 228.65
## 3 hazelnut chocolate spread 3.79 50 189.50
## 23 cereals 2.79 66 184.14
## 31 tequila 6.99 26 181.74
## 35 pizza 2.99 60 179.40
## 14 cookies 1.79 89 159.31
## 39 basil 2.99 51 152.49
## 9 pumpkin 4.28 35 149.80
Missing Values
In R, missing values are represented by the symbol NA (not available). Undefined numeric results (e.g. 0/0) are represented by the symbol NaN (not a number).
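The difference between these special values can be checked directly in the console. The sketch below uses only base R; the variable names are chosen for illustration.

```r
# Dividing a non-zero number by zero yields Inf, not NaN
pos_inf <- 1 / 0
pos_inf

# Zero divided by zero is undefined and yields NaN
not_a_number <- 0 / 0
not_a_number

# is.na() is TRUE for both NA and NaN
is.na(c(NA, NaN, 1))

# is.nan() is TRUE for NaN only
is.nan(c(NA, NaN, 1))
```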
We can test for missing values by applying the is.na() function.
is.na(data$job)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE
The function is.na() returns TRUE if a value is missing. Hence, we can use this logical vector to slice the data frame.
data[is.na(data$job), ]
## name job sex age height
## 4 Peter <NA> male 18 178
The command above returns all rows of the data frame data where a missing value occurs in the column job. Using the negation operator (!) returns all rows without missing values in the column job.
data[!is.na(data$job), ]
## name job sex age height
## 1 John Policeman male 45 195
## 2 Molly Artist female 32 165
## 3 Frank Banker male 58 180
## 5 Michelle Teacher female 22 182
## 6 Lisa Fitness coach female NA 166
Note that in the age column there is still a missing value. In order to inspect the complete data.frame object with respect to missing values we can apply the is.na() function to the whole data.frame object.
is.na(data)
## name job sex age height
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE TRUE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE TRUE FALSE
This matrix is hard to read for larger data sets. However, in combination with the rowSums() or colSums() function we get the number of missing values per row or per column.
rowSums(is.na(data))
## [1] 0 0 0 1 0 1
colSums(is.na(data))
## name job sex age height
## 0 1 0 1 0
The function complete.cases() returns a logical vector indicating which cases are complete.
complete.cases(data)
## [1] TRUE TRUE TRUE FALSE TRUE FALSE
By using the resulting logical vector for slicing we may get a clean data set.
data[complete.cases(data), ]
## name job sex age height
## 1 John Policeman male 45 195
## 2 Molly Artist female 32 165
## 3 Frank Banker male 58 180
## 5 Michelle Teacher female 22 182
Another useful function is na.omit(), which drops all rows that contain at least one missing value.
# create new data set without missing data
clean_data <- na.omit(data)
clean_data
## name job sex age height
## 1 John Policeman male 45 195
## 2 Molly Artist female 32 165
## 3 Frank Banker male 58 180
## 5 Michelle Teacher female 22 182
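Removing incomplete rows is not always necessary. Many summary functions in base R, such as mean() and sum(), accept the argument na.rm to ignore missing values in a single computation. A minimal sketch, using a stand-alone copy of the age values from the example data:

```r
# Stand-alone copy of the age column, including the missing value
age <- c(45, 32, 58, 18, 22, NA)

# A single NA makes the result NA
mean_with_na <- mean(age)
mean_with_na

# na.rm = TRUE drops missing values before computing the mean
mean_without_na <- mean(age, na.rm = TRUE)
mean_without_na
## [1] 35
```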