A data frame is a table, or two-dimensional array-like structure, in which each column contains the values of one variable and each row contains one set of values from each column.
We can create a data frame by applying the data.frame() function. Note that in most use cases we import data into a data.frame object.
data <- data.frame(
name = c("John", "Molly", "Frank", "Peter", "Michelle"),
job = c('Policeman', 'Artist', 'Banker', NA, 'Teacher'),
sex = c('male', 'female', 'male', 'male', 'female'),
age = c(45, 32, 58, 18, 22),
stringsAsFactors = FALSE # keeps character columns as strings instead of factors (already the default since R 4.0.0)
)
data
## name job sex age
## 1 John Policeman male 45
## 2 Molly Artist female 32
## 3 Frank Banker male 58
## 4 Peter <NA> male 18
## 5 Michelle Teacher female 22
class(data)
## [1] "data.frame"
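Since data frames are usually imported rather than typed in, the following minimal sketch shows a round trip through a CSV file. The object names people, csv_path and imported are made up for illustration, and the file is a temporary one created on the fly.

```r
# Small example data (a reduced, hypothetical copy of the data above)
people <- data.frame(
  name = c("John", "Molly", "Frank"),
  age = c(45, 32, 58),
  stringsAsFactors = FALSE
)

# Write the data to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(people, csv_path, row.names = FALSE)

# Read the file back; read.csv() returns a data.frame object
imported <- read.csv(csv_path, stringsAsFactors = FALSE)
class(imported)
str(imported)

# Remove the temporary file
unlink(csv_path)
```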
We can review the structure of the data frame by using the str() function.
str(data)
## 'data.frame': 5 obs. of 4 variables:
## $ name: chr "John" "Molly" "Frank" "Peter" ...
## $ job : chr "Policeman" "Artist" "Banker" NA ...
## $ sex : chr "male" "female" "male" "male" ...
## $ age : num 45 32 58 18 22
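Besides str(), a few other base R functions give a quick overview of a data frame. The sketch below uses a hypothetical object people so that it runs on its own.

```r
# Hypothetical example data
people <- data.frame(
  name = c("John", "Molly", "Frank", "Peter", "Michelle"),
  age = c(45, 32, 58, 18, 22),
  stringsAsFactors = FALSE
)

dim(people)      # number of rows and number of columns
nrow(people)     # number of rows
ncol(people)     # number of columns
names(people)    # column names
head(people, 2)  # first two rows
```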
A statistical summary of the data can be obtained by applying the summary() function.
summary(data)
## name job sex age
## Length:5 Length:5 Length:5 Min. :18
## Class :character Class :character Class :character 1st Qu.:22
## Mode :character Mode :character Mode :character Median :32
## Mean :35
## 3rd Qu.:45
## Max. :58
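For character columns summary() only reports length, class and mode. If counts per category are of interest, one option is to convert the column to a factor first or to use table(). A minimal sketch, with people and sex_counts as made-up names:

```r
# Hypothetical example data
people <- data.frame(
  sex = c("male", "female", "male", "male", "female"),
  stringsAsFactors = FALSE
)

# summary() of a factor returns the number of observations per level
sex_counts <- summary(factor(people$sex))
sex_counts

# table() gives the same counts directly from the character column
table(people$sex)
```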
Extract data from a data frame
To extract one specific column from a data frame, we can use the $ operator.
data$name
## [1] "John" "Molly" "Frank" "Peter" "Michelle"
data.frame(data$name, data$job)
## data.name data.job
## 1 John Policeman
## 2 Molly Artist
## 3 Frank Banker
## 4 Peter <NA>
## 5 Michelle Teacher
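Columns can also be extracted by name with square brackets. The sketch below (with the hypothetical object people) illustrates the difference between double brackets, which return a vector, and single brackets, which return a data frame and keep the original column names.

```r
# Hypothetical example data
people <- data.frame(
  name = c("John", "Molly", "Frank"),
  job = c("Policeman", "Artist", "Banker"),
  stringsAsFactors = FALSE
)

name_vec <- people[["name"]]          # character vector, same as people$name
name_df <- people["name"]             # data frame with one column
two_cols <- people[c("name", "job")]  # data frame; columns stay "name" and "job"

class(name_vec)
class(name_df)
names(two_cols)
```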
Similar to a matrix object, we can slice a data.frame object with row and column indices (data.frame[row, column]):
# Extract first two rows
data[1:2,]
## name job sex age
## 1 John Policeman male 45
## 2 Molly Artist female 32
# Extract 3rd and 5th row with 2nd and 4th column
data[c(3,5),c(2,4)]
## job age
## 3 Banker 58
## 5 Teacher 22
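One detail worth knowing when slicing: selecting a single column with [row, column] returns a plain vector unless drop = FALSE is set. Column names can be used in place of positions. A small sketch with the hypothetical object people:

```r
# Hypothetical example data
people <- data.frame(
  name = c("John", "Molly", "Frank", "Peter", "Michelle"),
  job = c("Policeman", "Artist", "Banker", NA, "Teacher"),
  sex = c("male", "female", "male", "male", "female"),
  age = c(45, 32, 58, 18, 22),
  stringsAsFactors = FALSE
)

job_vec <- people[, 2]               # a single column is simplified to a vector
job_df <- people[, 2, drop = FALSE]  # drop = FALSE keeps the data frame

# Same selection as people[c(3, 5), c(2, 4)], but by column name
by_name <- people[c(3, 5), c("job", "age")]
by_name
```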
In addition, we may select data based on variable values. To do so, we first create a logical vector, such as data$sex == 'female'.
data$sex == 'female'
## [1] FALSE TRUE FALSE FALSE TRUE
Now we feed that logical vector into the square brackets as row values.
data[data$sex == 'female',]
## name job sex age
## 2 Molly Artist female 32
## 5 Michelle Teacher female 22
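A caveat with logical indexing is that a comparison involving NA yields NA, and an NA in the row index produces a row filled with NA. The sketch below shows the effect and two common ways around it, which() and %in%. The object names are hypothetical.

```r
# Hypothetical example data with one missing job
people <- data.frame(
  name = c("John", "Molly", "Frank", "Peter", "Michelle"),
  job = c("Policeman", "Artist", "Banker", NA, "Teacher"),
  stringsAsFactors = FALSE
)

is_artist <- people$job == "Artist"  # the comparison with NA gives NA
is_artist

with_na_row <- people[is_artist, ]            # Molly plus one all-NA row
clean_which <- people[which(is_artist), ]     # which() ignores NA
clean_in <- people[people$job %in% "Artist", ]  # %in% never returns NA

nrow(with_na_row)
nrow(clean_which)
```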
Alternatively, we may use the subset() function. It yields essentially the same result as logical indexing but is often more readable.
subset(data,
age >= 20 & sex == 'male',
select = c(name, job, age))
## name job age
## 1 John Policeman 45
## 3 Frank Banker 58
Here we selected all men aged 20 or older and kept the variables name, job and age.
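For comparison, the same selection can be written with square brackets. This is only a sketch with hypothetical object names; which() is used so that rows where the condition evaluates to NA are left out, which mirrors how subset() treats missing values in the condition.

```r
# Hypothetical example data
people <- data.frame(
  name = c("John", "Molly", "Frank", "Peter", "Michelle"),
  job = c("Policeman", "Artist", "Banker", NA, "Teacher"),
  sex = c("male", "female", "male", "male", "female"),
  age = c(45, 32, 58, 18, 22),
  stringsAsFactors = FALSE
)

via_subset <- subset(people,
                     age >= 20 & sex == "male",
                     select = c(name, job, age))

# Bracket version: build the logical vector, then pick rows and columns
keep <- people$age >= 20 & people$sex == "male"
via_brackets <- people[which(keep), c("name", "job", "age")]

via_brackets
```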
Excluding (dropping) variables
There are several ways to drop data from a data.frame object. One possibility is to use the %in% operator: we create a logical vector that marks the columns of interest and select columns of the data frame accordingly.
vars <- colnames(data) %in% c("name", "age", "job")
vars
## [1] TRUE TRUE FALSE TRUE
data[vars]
## name job age
## 1 John Policeman 45
## 2 Molly Artist 32
## 3 Frank Banker 58
## 4 Peter <NA> 18
## 5 Michelle Teacher 22
Applying the negation operator ! inverts the selection, so that only the columns not listed are kept.
data[!vars]
## sex
## 1 male
## 2 female
## 3 male
## 4 male
## 5 female
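A compact variant of the same idea is to compute the set of column names to keep with setdiff(). The names people, drop_cols and kept below are made up for this sketch.

```r
# Hypothetical example data
people <- data.frame(
  name = c("John", "Molly", "Frank"),
  job = c("Policeman", "Artist", "Banker"),
  sex = c("male", "female", "male"),
  age = c(45, 32, 58),
  stringsAsFactors = FALSE
)

# Columns we want to drop
drop_cols <- c("sex")

# setdiff() returns all column names that are not in drop_cols
kept <- people[setdiff(names(people), drop_cols)]
names(kept)
```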
Another possibility is to exclude particular columns by adding the minus sign to the column index.
# exclude 1st and 3rd variable
data[c(-1,-3)]
## job age
## 1 Policeman 45
## 2 Artist 32
## 3 Banker 58
## 4 <NA> 18
## 5 Teacher 22
Alternatively, we can set columns to NULL:
newdata <- data
newdata$job <- newdata$age <- NULL
newdata
## name sex
## 1 John male
## 2 Molly female
## 3 Frank male
## 4 Peter male
## 5 Michelle female
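Several columns can also be removed by name in one step by assigning list(NULL) to them. A minimal sketch with the hypothetical objects people and trimmed:

```r
# Hypothetical example data
people <- data.frame(
  name = c("John", "Molly", "Frank"),
  job = c("Policeman", "Artist", "Banker"),
  sex = c("male", "female", "male"),
  age = c(45, 32, 58),
  stringsAsFactors = FALSE
)

# Work on a copy so the original stays intact
trimmed <- people
trimmed[c("job", "age")] <- list(NULL)  # removes both columns
names(trimmed)
```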
Expand a data frame
A data frame can be expanded by adding columns and rows.
# Add the "height" column
data$height <- c(195, 165, 180, 178, 182)
data
## name job sex age height
## 1 John Policeman male 45 195
## 2 Molly Artist female 32 165
## 3 Frank Banker male 58 180
## 4 Peter <NA> male 18 178
## 5 Michelle Teacher female 22 182
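New columns are often derived from existing ones. The sketch below adds a height in metres and a logical flag. The column names height_m and tall, as well as the threshold of 180, are arbitrary choices for illustration.

```r
# Hypothetical example data
people <- data.frame(
  name = c("John", "Molly", "Frank", "Peter", "Michelle"),
  height = c(195, 165, 180, 178, 182),
  stringsAsFactors = FALSE
)

# Derive a column with the $ operator
people$height_m <- people$height / 100

# transform() can refer to existing columns by name
people <- transform(people, tall = height > 180)
people
```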
To add rows to an existing data frame, we apply the rbind() function. Note that the added data must have the same structure, that is, the same column names, as the existing data frame.
new.data <- data.frame(name = 'Lisa',
job = 'Fitness coach',
sex = 'female',
age = NA,
height = 166)
data <- rbind(data, new.data)
data
## name job sex age height
## 1 John Policeman male 45 195
## 2 Molly Artist female 32 165
## 3 Frank Banker male 58 180
## 4 Peter <NA> male 18 178
## 5 Michelle Teacher female 22 182
## 6 Lisa Fitness coach female NA 166
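To illustrate the requirement on the structure, the following sketch tries to bind a row whose column name differs and captures the resulting error with tryCatch(). All object names, and the column name years, are hypothetical.

```r
# Hypothetical example data
people <- data.frame(name = c("John", "Molly"),
                     age = c(45, 32),
                     stringsAsFactors = FALSE)

# The second column is called "years" instead of "age"
bad_row <- data.frame(name = "Lisa", years = 30, stringsAsFactors = FALSE)

# rbind() stops with an error; here we keep the message instead of failing
result <- tryCatch(rbind(people, bad_row),
                   error = function(e) conditionMessage(e))
result

# After renaming the column the structures match and rbind() succeeds
names(bad_row)[names(bad_row) == "years"] <- "age"
combined <- rbind(people, bad_row)
combined
```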
Sorting data
To sort a data frame, we use the order() function. By default, sorting is ascending and missing values are placed last. Prefixing a numeric sorting variable with a minus sign yields descending order.
# sort by age (ascending)
data[order(data$age),]
## name job sex age height
## 4 Peter <NA> male 18 178
## 5 Michelle Teacher female 22 182
## 2 Molly Artist female 32 165
## 1 John Policeman male 45 195
## 3 Frank Banker male 58 180
## 6 Lisa Fitness coach female NA 166
# sort by height (descending)
data[order(-data$height),]
## name job sex age height
## 1 John Policeman male 45 195
## 5 Michelle Teacher female 22 182
## 3 Frank Banker male 58 180
## 4 Peter <NA> male 18 178
## 6 Lisa Fitness coach female NA 166
## 2 Molly Artist female 32 165
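order() accepts several sorting variables, and the minus sign only works for numeric columns. The sketch below, with hypothetical object names, sorts by sex and then by descending age, and uses the decreasing argument to sort a character column in descending order.

```r
# Hypothetical example data
people <- data.frame(
  name = c("John", "Molly", "Frank", "Peter", "Michelle"),
  sex = c("male", "female", "male", "male", "female"),
  age = c(45, 32, 58, 18, 22),
  stringsAsFactors = FALSE
)

# Sort by sex (ascending), then by age (descending) within each sex
by_sex_then_age <- people[order(people$sex, -people$age), ]
by_sex_then_age

# Character columns cannot be negated; use decreasing = TRUE instead
by_name_desc <- people[order(people$name, decreasing = TRUE), ]
by_name_desc
```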
Missing values
In R, missing values are represented by the symbol NA (not available). Undefined numeric results (e.g., dividing zero by zero) are represented by the symbol NaN (not a number).
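A quick check in the console shows how these special values behave. Note that only 0 / 0 yields NaN, whereas dividing a non-zero number by zero yields Inf. The variable names are made up for this sketch.

```r
zero_by_zero <- 0 / 0  # NaN: the result is undefined
one_by_zero <- 1 / 0   # Inf: not missing, but infinite

zero_by_zero
one_by_zero

is.nan(zero_by_zero)  # TRUE
is.na(zero_by_zero)   # TRUE, because is.na() also flags NaN
is.nan(NA)            # FALSE, because NA is missing but not NaN
```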
We can test for missing values by applying the is.na() function.
is.na(data$job)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE
The function is.na() returns TRUE if a value is missing. Hence, we can use this logical vector to slice the data frame.
data[is.na(data$job),]
## name job sex age height
## 4 Peter <NA> male 18 178
The command above returns all rows of the data frame data where a missing value occurs in the column job. Using the negation operator (!) returns all rows without missing values in the column job.
data[!is.na(data$job),]
## name job sex age height
## 1 John Policeman male 45 195
## 2 Molly Artist female 32 165
## 3 Frank Banker male 58 180
## 5 Michelle Teacher female 22 182
## 6 Lisa Fitness coach female NA 166
Note that there is still a missing value in the age column. To inspect the whole data.frame object for missing values, we can apply the is.na() function to the entire object.
is.na(data)
## name job sex age height
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE TRUE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE TRUE FALSE
On its own, this matrix is not very useful. In combination with the rowSums() or colSums() function, however, we get a compact count of missing values per row or per column.
rowSums(is.na(data))
## [1] 0 0 0 1 0 1
colSums(is.na(data))
## name job sex age height
## 0 1 0 1 0
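Along the same lines, a few related one-liners may be handy: the total number of missing values, the share of missing values per column, and a quick yes/no check for a single column. The sketch uses hypothetical object names.

```r
# Hypothetical example data with two missing values
people <- data.frame(
  job = c("Policeman", "Artist", "Banker", NA, "Teacher", "Fitness coach"),
  age = c(45, 32, 58, 18, 22, NA),
  stringsAsFactors = FALSE
)

total_missing <- sum(is.na(people))   # number of missing cells overall
na_share <- colMeans(is.na(people))   # proportion of missing values per column
age_has_na <- anyNA(people$age)       # TRUE if the column has any NA

total_missing
na_share
age_has_na
```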
The function complete.cases() returns a logical vector indicating which cases are complete.
complete.cases(data)
## [1] TRUE TRUE TRUE FALSE TRUE FALSE
By using the resulting logical vector for slicing we may get a clean data set.
data[complete.cases(data),]
## name job sex age height
## 1 John Policeman male 45 195
## 2 Molly Artist female 32 165
## 3 Frank Banker male 58 180
## 5 Michelle Teacher female 22 182
Another useful function is na.omit(), which removes all rows that contain at least one missing value.
# create new dataset without missing data
clean.data <- na.omit(data)
clean.data
## name job sex age height
## 1 John Policeman male 45 195
## 2 Molly Artist female 32 165
## 3 Frank Banker male 58 180
## 5 Michelle Teacher female 22 182
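Dropping entire rows is not always necessary. Many summary functions accept the argument na.rm = TRUE, and complete.cases() can be restricted to the columns that matter. A minimal sketch with hypothetical object names:

```r
# Hypothetical example data with two missing values
people <- data.frame(
  name = c("John", "Molly", "Frank", "Peter", "Michelle", "Lisa"),
  job = c("Policeman", "Artist", "Banker", NA, "Teacher", "Fitness coach"),
  age = c(45, 32, 58, 18, 22, NA),
  stringsAsFactors = FALSE
)

mean_with_na <- mean(people$age)            # NA, because one age is missing
mean_age <- mean(people$age, na.rm = TRUE)  # missing value is ignored

# Keep every row that has a job, regardless of missing values elsewhere
has_job <- people[complete.cases(people["job"]), ]

mean_with_na
mean_age
has_job
```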