
Data frames are arguably the most important data structure for storing data in an ordered, tabular way. They also make it convenient to work with large data sets.

But what are the advantages of data frames in contrast to other data types and structures?

In short, data frames allow us to:

store associated data in a table-like data structure, in which each column contains the values of one variable and each row contains one set of values from each column.
access, modify, extract and sort the data conveniently.
simplify complex analyses of big data sets.

A data frame has the following characteristics:

The column names should be non-empty.
The row names should be unique.
The columns of a data frame can hold different types of data, for example numeric, character, logical or factor.
Each column should contain the same number of data items.
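The length requirement can be checked directly. The following is a minimal sketch with made-up data (the names df and result are illustrative and not part of the tutorial data set). It shows that data.frame() signals an error when the columns cannot be brought to a common length:

```r
# Hypothetical example data, not part of the tutorial data set
df <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.7))
nrow(df) # every column holds 3 items

# Columns of length 3 and 2 cannot be combined:
# data.frame() signals an error, which is captured as text here
result <- tryCatch(
  data.frame(id = 1:3, value = c(2.5, 3.1)),
  error = function(e) conditionMessage(e)
)
result
```

The error message is captured with tryCatch() only so that the example runs to the end; in interactive work the error would simply be printed.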

As you can see, there are many good reasons to learn about data frames.

Let’s get started!

Defining a data frame

We can create a data frame by applying the data.frame() function. Note that in most applications we import data into a data.frame object.

data <- data.frame(
  name = c("John", "Molly", "Frank", "Peter", "Michelle"),
  job = c("Policeman", "Artist", "Banker", NA, "Teacher"),
  sex = c("male", "female", "male", "male", "female"),
  age = c(45, 32, 58, 18, 22),
  stringsAsFactors = FALSE # keep strings as character vectors (the default since R 4.0.0)
)
data

##       name       job    sex age
## 1     John Policeman   male  45
## 2    Molly    Artist female  32
## 3    Frank    Banker   male  58
## 4    Peter      <NA>   male  18
## 5 Michelle   Teacher female  22

class(data)

## [1] "data.frame"

Note: As you can see, we used the c() function to combine the single data values into vectors. We defined 4 individual vectors (name, job, sex, age) and provided them as arguments to the data.frame() function. The function combines the four vectors into one data frame that holds the provided data.

Important note: As you can see, a data frame stores a user-defined number of vectors of equal length. The types of the given vectors can be different.
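To see which type each column has, one option is to apply class() to every column with sapply(). This is a small sketch on made-up data; the data frame people and the column member are assumptions for illustration only:

```r
# Hypothetical example: three vectors of different types
people <- data.frame(
  name = c("Ann", "Ben"),
  age = c(31, 27),
  member = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# class of each column, returned as a named character vector
col_types <- sapply(people, class)
col_types
```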

Summarizing a data frame

Typing the name of a data frame prints all the data it contains. This can be inconvenient, particularly when working with big data sets. To get an overview of the data and a summary of the stored information, we can use the str() and summary() functions.

We can review the structure of the data frame and data types of the values using the str() function.

str(data)

## 'data.frame':    5 obs. of  4 variables:
##  $ name: chr  "John" "Molly" "Frank" "Peter" ...
##  $ job : chr  "Policeman" "Artist" "Banker" NA ...
##  $ sex : chr  "male" "female" "male" "male" ...
##  $ age : num  45 32 58 18 22

A statistical summary of each column can be obtained by applying the summary() function.

summary(data)

##      name               job                sex                 age    
##  Length:5           Length:5           Length:5           Min.   :18  
##  Class :character   Class :character   Class :character   1st Qu.:22  
##  Mode  :character   Mode  :character   Mode  :character   Median :32  
##                                                           Mean   :35  
##                                                           3rd Qu.:45  
##                                                           Max.   :58
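For larger data sets it can also help to look at the dimensions and only the first or last few rows. The functions dim(), head() and tail() are not covered in the text above; the sketch below uses a made-up data frame big to illustrate them:

```r
# Hypothetical example data with 100 rows
big <- data.frame(id = 1:100, value = (1:100)^2)

dim(big)     # number of rows and number of columns
head(big, 3) # first three rows
tail(big, 3) # last three rows
```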

Exercise: You are given the following data on the sales volume of a store on a specific day (see below). The information is stored in 3 vectors: product_name contains the name of each product, price contains the price per piece and sold_pieces contains the number of pieces sold per product. Store the three vectors in a single variable in the form of a data frame!

product_name <- c(
  "ginger", "oil", "hazelnut chocolate spread", "tomatoe", "pepper",
  "milk", "orangejuice", "beer", "pumpkin", "onion", "bread", "apple",
  "banana", "cookies", "soysauce", "toiletpaper", "toothpaste", "soap",
  "chocolate", "pasta", "rice", "cucumber", "cereals", "mixed nuts",
  "tortilla chips", "sausages", "cheese", "chewing gums", "tofu", "coke",
  "tequila", "cinnamon", "salt", "lemon", "pizza", "shower gel", "battery",
  "flavoured yoghurt", "basil", "coconut milk"
)
price <- c(
  0.70, 2.69, 3.79, 2.30, 2.99, 0.89, 1.59, 0.99, 4.28, 1.56,
  1.99, 2.05, 2.10, 1.79, 3.89, 2.69, 0.79, 0.99, 1.39, 1.19, 1.99, 0.60,
  2.79, 3.99, 1.89, 3.99, 1.99, 0.79, 1.99, 1.39, 6.99, 1.49, 0.99, 2.05,
  2.99, 0.89, 2.49, 0.79, 2.99, 1.99
)
sold_pieces <- c(
  106, 85, 50, 48, 47, 91, 7, 54, 35, 66, 49, 51, 59, 89, 33, 36,
  76, 61, 60, 45, 74, 53, 66, 77, 38, 77, 72, 60, 51, 69, 26, 57,
  32, 47, 60, 44, 36, 54, 51, 39
)

### your code here

Solution:

supermarket <- data.frame(product_name, price, sold_pieces)
str(supermarket)

## 'data.frame':    40 obs. of  3 variables:
##  $ product_name: chr  "ginger" "oil" "hazelnut chocolate spread" "tomatoe" ...
##  $ price       : num  0.7 2.69 3.79 2.3 2.99 0.89 1.59 0.99 4.28 1.56 ...
##  $ sold_pieces : num  106 85 50 48 47 91 7 54 35 66 ...

Extracting data from a data frame

In order to extract one or more specific columns from a data frame we may use the $ operator.

data$name

## [1] "John"     "Molly"    "Frank"    "Peter"    "Michelle"

data.frame(data$name, data$job)

##   data.name  data.job
## 1      John Policeman
## 2     Molly    Artist
## 3     Frank    Banker
## 4     Peter      <NA>
## 5  Michelle   Teacher
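Columns can also be selected by name inside square brackets, which keeps the original column names. The following sketch uses a made-up data frame people (an assumption for illustration) and contrasts single brackets with double brackets:

```r
# Hypothetical example data
people <- data.frame(
  name = c("Ann", "Ben", "Cleo"),
  job = c("Baker", NA, "Pilot"),
  age = c(31, 27, 45),
  stringsAsFactors = FALSE
)

# a character vector of column names returns a data frame
# and keeps the column names unchanged
two_cols <- people[c("name", "job")]
colnames(two_cols)

# double brackets return a single column as a vector, like $
people[["age"]]
```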

Similar to a matrix object we can slice a data.frame object with column and row indices (data.frame[row, column]):

# Extract first two rows
data[1:2, ]

##    name       job    sex age
## 1  John Policeman   male  45
## 2 Molly    Artist female  32

# Extract 3rd and 5th row with 2nd and 4th column
data[c(3, 5), c(2, 4)]

##       job age
## 3  Banker  58
## 5 Teacher  22

Note: A specific row of a data frame is extracted by [rownumber, ]. We need the , because the data frame is two-dimensional. By not specifying a second argument we get the whole row.
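One detail of two-dimensional indexing is worth a short sketch: selecting a single column with [ , column] simplifies the result to a vector unless drop = FALSE is given. The data frame scores below is made up for illustration:

```r
# Hypothetical example data
scores <- data.frame(id = 1:3, score = c(10, 20, 30))

one_row <- scores[2, ]                      # second row, all columns
as_vector <- scores[, "score"]              # single column, simplified to a vector
as_frame <- scores[, "score", drop = FALSE] # single column, kept as a data frame

class(one_row)   # "data.frame"
class(as_vector) # "numeric"
class(as_frame)  # "data.frame"
```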

In addition, we may extract data based on variable values. To do so, we first create a logical vector, such as data$sex == 'female'.

data$sex == "female"

## [1] FALSE  TRUE FALSE FALSE  TRUE

Now we feed that logical vector into the square brackets as row values.

data[data$sex == "female", ]

##       name     job    sex age
## 2    Molly  Artist female  32
## 5 Michelle Teacher female  22

Note: To extract the rows of a data frame that satisfy a condition, use [logical_expression, ].

Alternatively, we may use the subset() function. It yields essentially the same result as logical indexing but is often more readable, because columns can be referred to by name without the data$ prefix.

subset(data,
  age >= 20 & sex == "male",
  select = c(name, job, age)
)

##    name       job age
## 1  John Policeman  45
## 3 Frank    Banker  58

Here, we select all men aged 20 or older and keep the variables name, job and age.
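The two approaches differ when the condition evaluates to NA. A minimal sketch on made-up data (the data frame people is an assumption for illustration): logical indexing returns a row filled with NA for each NA in the condition, whereas subset() treats NA as FALSE and drops the row.

```r
# Hypothetical example data with a missing age
people <- data.frame(
  name = c("Ann", "Ben", "Cleo"),
  age = c(31, NA, 45),
  stringsAsFactors = FALSE
)

by_index <- people[people$age > 30, ]  # condition is TRUE, NA, TRUE
by_subset <- subset(people, age > 30)  # NA is treated as FALSE

nrow(by_index)  # 3, including one row that consists of NA only
nrow(by_subset) # 2
```

Wrapping the condition in which(), as in people[which(people$age > 30), ], is one way to get the behaviour of subset() with plain indexing.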

Exercise: Use the data frame from the previous exercise. Which products cost more than 2.50 EUR? How many such products are there in total?

### your code here

Solution:

supermarket_filtered <- supermarket[supermarket$price > 2.50, ]
# products that cost more than 2.50
supermarket_filtered$product_name
# number of these products
length(supermarket_filtered$product_name)

##  [1] "oil"                       "hazelnut chocolate spread"
##  [3] "pepper"                    "pumpkin"                  
##  [5] "soysauce"                  "toiletpaper"              
##  [7] "cereals"                   "mixed nuts"               
##  [9] "sausages"                  "tequila"                  
## [11] "pizza"                     "basil"

## [1] 12

Excluding (dropping) variables

There are several ways to drop data from a data.frame object. One possibility is to use the %in% operator. To do so, we create a logical vector that marks the columns of interest and select columns of the data frame accordingly.

vars <- colnames(data) %in% c("name", "age", "job")
vars

## [1]  TRUE  TRUE FALSE  TRUE

data[vars]

##       name       job age
## 1     John Policeman  45
## 2    Molly    Artist  32
## 3    Frank    Banker  58
## 4    Peter      <NA>  18
## 5 Michelle   Teacher  22

Applying the negation operator ! inverts the selection, so that the marked columns are dropped instead of kept.

data[!vars]

##      sex
## 1   male
## 2 female
## 3   male
## 4   male
## 5 female

Another possibility is to exclude particular columns by adding the minus sign to the column index.

# exclude 1st and 3rd variable
data[c(-1, -3)]

##         job age
## 1 Policeman  45
## 2    Artist  32
## 3    Banker  58
## 4      <NA>  18
## 5   Teacher  22

Alternatively, we can set columns to NULL.

newdata <- data
newdata$job <- newdata$age <- NULL
newdata

##       name    sex
## 1     John   male
## 2    Molly female
## 3    Frank   male
## 4    Peter   male
## 5 Michelle female
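Columns can also be dropped by name without building a logical vector first. The sketch below uses setdiff() on the column names; the data frame people and the vector drop_cols are made up for illustration:

```r
# Hypothetical example data
people <- data.frame(
  name = c("Ann", "Ben"),
  job = c("Baker", "Pilot"),
  age = c(31, 27),
  stringsAsFactors = FALSE
)

# names of the columns to remove
drop_cols <- c("job", "age")

# keep every column whose name is not listed in drop_cols
kept <- people[setdiff(colnames(people), drop_cols)]
colnames(kept)
```

Unlike assigning NULL, this leaves the original data frame unchanged.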

Expanding a data frame

A data frame can be expanded by adding columns and rows.

# Add the "height" column
data$height <- c(195, 165, 180, 178, 182)
data

##       name       job    sex age height
## 1     John Policeman   male  45    195
## 2    Molly    Artist female  32    165
## 3    Frank    Banker   male  58    180
## 4    Peter      <NA>   male  18    178
## 5 Michelle   Teacher female  22    182

To add rows to an existing data frame, we apply the rbind() function. Note that the added data needs to have the same structure as the existing data frame.

new_data <- data.frame(
  name = "Lisa",
  job = "Fitness coach",
  sex = "female",
  age = NA,
  height = 166
)
data <- rbind(data, new_data)
data

##       name           job    sex age height
## 1     John     Policeman   male  45    195
## 2    Molly        Artist female  32    165
## 3    Frank        Banker   male  58    180
## 4    Peter          <NA>   male  18    178
## 5 Michelle       Teacher female  22    182
## 6     Lisa Fitness coach female  NA    166
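What "the same structure" means in practice can be sketched as follows: rbind() matches columns by name and signals an error if the names differ. The data and the names ok and msg are made up for illustration:

```r
# Hypothetical example data
people <- data.frame(name = "Ann", age = 31, stringsAsFactors = FALSE)

# same column names: the row is appended
ok <- rbind(
  people,
  data.frame(name = "Ben", age = 27, stringsAsFactors = FALSE)
)
nrow(ok)

# different column names ("years" instead of "age"): rbind() signals an error,
# which is captured as text here
msg <- tryCatch(
  rbind(
    people,
    data.frame(name = "Cleo", years = 45, stringsAsFactors = FALSE)
  ),
  error = function(e) conditionMessage(e)
)
msg
```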

Exercise: Use the data frame supermarket from the first exercise. Calculate the business volume (price times number of pieces sold) of each product and store it in a new column of your data frame!

### your code here

Solution:

supermarket["business_volume"] <- supermarket$price * supermarket$sold_pieces
supermarket$business_volume

##  [1]  74.20 228.65 189.50 110.40 140.53  80.99  11.13  53.46 149.80 102.96
## [11]  97.51 104.55 123.90 159.31 128.37  96.84  60.04  60.39  83.40  53.55
## [21] 147.26  31.80 184.14 307.23  71.82 307.23 143.28  47.40 101.49  95.91
## [31] 181.74  84.93  31.68  96.35 179.40  39.16  89.64  42.66 152.49  77.61

Sorting data

In order to sort a data frame we use the order() function, which returns the row indices in sorted order. By default, sorting is ascending and missing values are placed last. Placing a minus sign in front of a numeric sorting variable gives a descending order.

# sort by age (ascending)
data[order(data$age), ]

##       name           job    sex age height
## 4    Peter          <NA>   male  18    178
## 5 Michelle       Teacher female  22    182
## 2    Molly        Artist female  32    165
## 1     John     Policeman   male  45    195
## 3    Frank        Banker   male  58    180
## 6     Lisa Fitness coach female  NA    166

# sort by height (descending)
data[order(-data$height), ]

##       name           job    sex age height
## 1     John     Policeman   male  45    195
## 5 Michelle       Teacher female  22    182
## 3    Frank        Banker   male  58    180
## 4    Peter          <NA>   male  18    178
## 6     Lisa Fitness coach female  NA    166
## 2    Molly        Artist female  32    165
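order() also accepts several sorting variables, and its decreasing argument offers a way to sort character columns in descending order, where a minus sign does not work. A minimal sketch on made-up data (the data frame people is an assumption for illustration):

```r
# Hypothetical example data
people <- data.frame(
  name = c("Ann", "Ben", "Cleo", "Dan"),
  sex = c("female", "male", "female", "male"),
  age = c(31, 27, 38, 45),
  stringsAsFactors = FALSE
)

# sort by sex (ascending), then by age (descending) within each group
by_sex_age <- people[order(people$sex, -people$age), ]
by_sex_age$name # "Cleo" "Ann" "Dan" "Ben"

# sort by a character column in descending order
by_name_desc <- people[order(people$name, decreasing = TRUE), ]
by_name_desc$name # "Dan" "Cleo" "Ben" "Ann"
```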

Exercise: Use the data frame from the previous exercise. Extract the 10 products with the highest business volume on this day!

### your code here

Solution:

supermarket_top10 <- supermarket[order(supermarket$business_volume, decreasing = TRUE), ]
supermarket_top10 <- supermarket_top10[1:10, ]
supermarket_top10

##                 product_name price sold_pieces business_volume
## 24                mixed nuts  3.99          77          307.23
## 26                  sausages  3.99          77          307.23
## 2                        oil  2.69          85          228.65
## 3  hazelnut chocolate spread  3.79          50          189.50
## 23                   cereals  2.79          66          184.14
## 31                   tequila  6.99          26          181.74
## 35                     pizza  2.99          60          179.40
## 14                   cookies  1.79          89          159.31
## 39                     basil  2.99          51          152.49
## 9                    pumpkin  4.28          35          149.80

Missing Values

In R, missing values are represented by the symbol NA (not available). Undefined results of numeric operations (e.g. 0 / 0) are represented by the symbol NaN (not a number); dividing a non-zero number by zero yields Inf or -Inf instead.
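The special values can be told apart with a few one-liners. This is a small sketch using only base R functions; the variable names are illustrative:

```r
pos_inf <- 1 / 0   # Inf: non-zero number divided by zero
not_num <- 0 / 0   # NaN: undefined result

is.nan(not_num)      # TRUE
is.na(not_num)       # TRUE, is.na() also flags NaN
is.nan(NA)           # FALSE, NA is missing but not NaN
is.infinite(pos_inf) # TRUE
```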

We can test for missing values by applying the is.na() function.

is.na(data$job)

## [1] FALSE FALSE FALSE  TRUE FALSE FALSE

The function is.na() returns TRUE if a value is missing. Hence, we can use this logical vector to slice the data frame.

data[is.na(data$job), ]

##    name  job  sex age height
## 4 Peter <NA> male  18    178

The command above returns all rows of the data frame data, where a missing value occurs in the column job. Using the negate operator (!) returns all rows without missing values in the column job.

data[!is.na(data$job), ]

##       name           job    sex age height
## 1     John     Policeman   male  45    195
## 2    Molly        Artist female  32    165
## 3    Frank        Banker   male  58    180
## 5 Michelle       Teacher female  22    182
## 6     Lisa Fitness coach female  NA    166

Note that in the age column there is still a missing value. In order to inspect the complete data.frame object with respect to missing values we can apply the is.na() function to the whole data.frame object.

is.na(data)

##       name   job   sex   age height
## [1,] FALSE FALSE FALSE FALSE  FALSE
## [2,] FALSE FALSE FALSE FALSE  FALSE
## [3,] FALSE FALSE FALSE FALSE  FALSE
## [4,] FALSE  TRUE FALSE FALSE  FALSE
## [5,] FALSE FALSE FALSE FALSE  FALSE
## [6,] FALSE FALSE FALSE  TRUE  FALSE

On its own, this matrix is hard to read, especially for larger data sets. However, in combination with the rowSums() or colSums() function we get a compact count of missing values per row or per column.

rowSums(is.na(data))

## [1] 0 0 0 1 0 1

colSums(is.na(data))

##   name    job    sex    age height 
##      0      1      0      1      0

The function complete.cases() returns a logical vector indicating which cases are complete.

complete.cases(data)

## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE

By using the resulting logical vector for slicing we may get a clean data set.

data[complete.cases(data), ]

##       name       job    sex age height
## 1     John Policeman   male  45    195
## 2    Molly    Artist female  32    165
## 3    Frank    Banker   male  58    180
## 5 Michelle   Teacher female  22    182

Another useful function is na.omit(), which removes all rows that contain at least one missing value.

# create new data set without missing data
clean_data <- na.omit(data)
clean_data

##       name       job    sex age height
## 1     John Policeman   male  45    195
## 2    Molly    Artist female  32    165
## 3    Frank    Banker   male  58    180
## 5 Michelle   Teacher female  22    182
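Sometimes only some columns need to be complete. One possible approach, sketched here on made-up data (the data frame people is an assumption for illustration), is to pass just those columns to complete.cases(). The second part shows the share of missing values per column via colMeans():

```r
# Hypothetical example data with missing values in two columns
people <- data.frame(
  name = c("Ann", "Ben", "Cleo"),
  job = c("Baker", NA, "Pilot"),
  age = c(31, 27, NA),
  stringsAsFactors = FALSE
)

# keep rows that are complete in name and job; a missing age is tolerated
job_known <- people[complete.cases(people[c("name", "job")]), ]
job_known$name # "Ann" "Cleo"

# share of missing values per column
na_share <- colMeans(is.na(people))
na_share
```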

Citation

The E-Learning project SOGA-R was developed at the Department of Earth Sciences by Kai Hartmann, Joachim Krois and Annette Rudolph. You can reach us by e-mail at soga[at]zedat.fu-berlin.de.

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.

Please cite as follows: Hartmann, K., Krois, J., Rudolph, A. (2023): Statistics and Geodata Analysis using R (SOGA-R). Department of Earth Sciences, Freie Universitaet Berlin.