A data frame is a table, or two-dimensional array-like structure, in which each column holds the values of one variable and each row holds one observation, that is, one value from each column. Columns may differ in type (for example numeric or character), but all columns must have the same length.

We can create a data frame by calling the data.frame() function. Note that in most use cases we do not type the values in by hand but import them from a file into a data.frame object.

data <- data.frame(
   name = c("John", "Molly", "Frank", "Peter", "Michelle"),
   job = c('Policeman', 'Artist', 'Banker', NA, 'Teacher'),
   sex = c('male', 'female', 'male', 'male', 'female'),
   age = c(45, 32, 58, 18, 22), 
   stringsAsFactors = FALSE # keep character strings as character vectors instead of factors (the default since R 4.0.0)
   )
data
##       name       job    sex age
## 1     John Policeman   male  45
## 2    Molly    Artist female  32
## 3    Frank    Banker   male  58
## 4    Peter      <NA>   male  18
## 5 Michelle   Teacher female  22
class(data)
## [1] "data.frame"
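The import case mentioned above can be sketched as follows. This is a minimal illustration only: the CSV content is made up and passed inline through the text argument of read.csv() so that the example runs without a file on disk. The names csv_text and imported are hypothetical.

```r
# Hypothetical CSV content, supplied inline so the example is self-contained
csv_text <- "name,job,age
John,Policeman,45
Molly,Artist,32
Peter,,18"

# read.csv() returns a data.frame; na.strings = "" turns empty fields into NA
imported <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)

class(imported)
str(imported)
```

With a real file, the text argument would be replaced by the file path as the first argument.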

We can review the structure of the data frame by using the str() function.

str(data)
## 'data.frame':    5 obs. of  4 variables:
##  $ name: chr  "John" "Molly" "Frank" "Peter" ...
##  $ job : chr  "Policeman" "Artist" "Banker" NA ...
##  $ sex : chr  "male" "female" "male" "male" ...
##  $ age : num  45 32 58 18 22

A statistical summary of each column, together with its type, can be obtained by applying the summary() function.

summary(data)
##      name               job                sex                 age    
##  Length:5           Length:5           Length:5           Min.   :18  
##  Class :character   Class :character   Class :character   1st Qu.:22  
##  Mode  :character   Mode  :character   Mode  :character   Median :32  
##                                                           Mean   :35  
##                                                           3rd Qu.:45  
##                                                           Max.   :58

Extract data from a data frame

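To extract one specific column from a data frame we can use the $ operator, which returns the column as a vector. Several extracted columns can be recombined into a new data frame with data.frame(); note that the column names are then derived from the expressions (data.name, data.job).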

data$name
## [1] "John"     "Molly"    "Frank"    "Peter"    "Michelle"
data.frame(data$name, data$job)
##   data.name  data.job
## 1      John Policeman
## 2     Molly    Artist
## 3     Frank    Banker
## 4     Peter      <NA>
## 5  Michelle   Teacher
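Besides $, base R offers two closely related ways to pick columns. The sketch below uses a small stand-in data frame (demo, a hypothetical name) to show [[ ]], which also accepts a column name stored in a variable, and single-bracket indexing with a character vector, which keeps the original column names.

```r
# Small stand-in data frame (illustrative values only)
demo <- data.frame(
  name = c("John", "Molly"),
  job  = c("Policeman", "Artist"),
  age  = c(45, 32),
  stringsAsFactors = FALSE
)

v1 <- demo$name          # column as a vector
v2 <- demo[["name"]]     # same vector, selected by a string

col <- "job"             # the column name can live in a variable
v3  <- demo[[col]]

# Single brackets with a character vector return a data frame
# and keep the original column names
sub <- demo[c("name", "job")]
names(sub)
```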

Similar to a matrix object, we can slice a data.frame object with row and column indices (data[row, column]):

# Extract first two rows
data[1:2,]
##    name       job    sex age
## 1  John Policeman   male  45
## 2 Molly    Artist female  32
# Extract 3rd and 5th row with 2nd and 4th column
data[c(3,5),c(2,4)]
##       job age
## 3  Banker  58
## 5 Teacher  22
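One detail of bracket slicing is worth a quick check: when a single column is selected with the [row, column] form, R simplifies the result to a vector unless drop = FALSE is given. A minimal sketch with a hypothetical stand-in data frame demo:

```r
# Small stand-in data frame (illustrative values only)
demo <- data.frame(
  name = c("John", "Molly", "Frank"),
  age  = c(45, 32, 58),
  stringsAsFactors = FALSE
)

x <- demo[1:2, "name"]                # simplified to a character vector
y <- demo[1:2, "name", drop = FALSE]  # stays a one-column data frame

class(x)
class(y)
```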

In addition, we may select rows based on variable values. To do so, we first create a logical vector, such as data$sex == 'female'.

data$sex == 'female'
## [1] FALSE  TRUE FALSE FALSE  TRUE

Now we pass that logical vector to the square brackets as the row index.

data[data$sex == 'female',]
##       name     job    sex age
## 2    Molly  Artist female  32
## 5 Michelle Teacher female  22

Alternatively, we may use the subset() function. It yields essentially the same result as logical indexing but is arguably more readable.

subset(data, 
       age >= 20 & sex == 'male',
       select = c(name, job, age))
##    name       job age
## 1  John Policeman  45
## 3 Frank    Banker  58

Here we select all men aged 20 or older and keep the variables name, job and age.
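The two approaches differ in one respect that matters once missing values appear: subset() drops rows for which the condition evaluates to NA, whereas plain logical indexing returns a row filled with NA for them. The sketch below illustrates this on a hypothetical stand-in data frame demo; wrapping the condition in which() is one common way to make logical indexing behave like subset().

```r
# Small stand-in data frame with one missing age (illustrative values only)
demo <- data.frame(
  name = c("John", "Molly", "Frank", "Lisa"),
  sex  = c("male", "female", "male", "female"),
  age  = c(45, 32, 58, NA),
  stringsAsFactors = FALSE
)

# subset(): rows where the condition is NA are dropped
s <- subset(demo, age >= 20, select = c(name, age))

# Logical indexing: an NA in the condition yields a row filled with NA
i <- demo[demo$age >= 20, c("name", "age")]

# which() keeps only the positions where the condition is TRUE
w <- demo[which(demo$age >= 20), c("name", "age")]

nrow(s)  # 3
nrow(i)  # 4
nrow(w)  # 3
```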

Excluding (dropping) variables

There are several ways to drop columns from a data.frame object. One possibility is to use the %in% operator: we create a logical vector that marks the columns of interest and select the columns of the data frame accordingly.

vars <- colnames(data) %in% c("name", "age", "job")
vars
## [1]  TRUE  TRUE FALSE  TRUE
data[vars]
##       name       job age
## 1     John Policeman  45
## 2    Molly    Artist  32
## 3    Frank    Banker  58
## 4    Peter      <NA>  18
## 5 Michelle   Teacher  22

Applying the negation operator ! inverts the selection, so that the marked columns are dropped instead of kept.

data[!vars]
##      sex
## 1   male
## 2 female
## 3   male
## 4   male
## 5 female

Another possibility is to exclude particular columns by prefixing their column indices with a minus sign.

# exclude 1st and 3rd variable
data[c(-1,-3)]
##         job age
## 1 Policeman  45
## 2    Artist  32
## 3    Banker  58
## 4      <NA>  18
## 5   Teacher  22
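Negative indices only work with column positions, not with column names. To drop columns by name without building a logical vector, one option is setdiff() on the column names. A minimal sketch, with demo and drop_cols as hypothetical names:

```r
# Small stand-in data frame (illustrative values only)
demo <- data.frame(
  name = c("John", "Molly"),
  sex  = c("male", "female"),
  age  = c(45, 32),
  stringsAsFactors = FALSE
)

drop_cols <- c("name", "sex")   # columns to exclude

# Keep every column whose name is not listed in drop_cols
kept <- demo[setdiff(names(demo), drop_cols)]
names(kept)
```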

Alternatively, we can set columns to NULL.

newdata <- data
# assigning NULL removes a column; the assignments can be chained
newdata$job <- newdata$age <- NULL
newdata
##       name    sex
## 1     John   male
## 2    Molly female
## 3    Frank   male
## 4    Peter   male
## 5 Michelle female

Expand a data frame

A data frame can be expanded by adding columns and rows.

# Add the "height" column
data$height <- c(195, 165, 180, 178, 182)
data
##       name       job    sex age height
## 1     John Policeman   male  45    195
## 2    Molly    Artist female  32    165
## 3    Frank    Banker   male  58    180
## 4    Peter      <NA>   male  18    178
## 5 Michelle   Teacher female  22    182
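New columns do not have to be typed in; they can also be computed from existing ones or attached with cbind(). The following sketch assumes a hypothetical stand-in data frame demo, and the column names height_m and weight as well as the weight values are made up for illustration.

```r
# Small stand-in data frame (illustrative values only)
demo <- data.frame(
  name   = c("John", "Molly"),
  height = c(195, 165),
  stringsAsFactors = FALSE
)

# Derived column: height in metres, computed from the existing column
demo$height_m <- demo$height / 100

# cbind() appends one or more columns; the new vector must have nrow(demo) values
demo <- cbind(demo, weight = c(90, 60))

names(demo)
```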

To add rows to an existing data frame, we apply the rbind() function. Note that the added data must have the same structure as the existing data frame, that is, the same column names with compatible types.

new.data <- data.frame(name = 'Lisa',
                       job = 'Fitness coach',
                       sex = 'female',
                       age = NA,
                       height = 166)
data <- rbind(data, new.data)
data
##       name           job    sex age height
## 1     John     Policeman   male  45    195
## 2    Molly        Artist female  32    165
## 3    Frank        Banker   male  58    180
## 4    Peter          <NA>   male  18    178
## 5 Michelle       Teacher female  22    182
## 6     Lisa Fitness coach female  NA    166
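What "same structure" means for rbind() can be checked with a small experiment: columns are matched by name, so their order may differ, but a differing column name raises an error. The sketch below uses hypothetical data frames base_df, good and bad, and captures the error message with tryCatch() so that the script keeps running.

```r
base_df <- data.frame(name = "John", age = 45, stringsAsFactors = FALSE)

# Same column names in a different order: rbind() matches them by name
good <- data.frame(age = 32, name = "Molly", stringsAsFactors = FALSE)
combined <- rbind(base_df, good)
combined

# A differing column name ("years" instead of "age") makes rbind() fail
bad <- data.frame(name = "Frank", years = 58, stringsAsFactors = FALSE)
result <- tryCatch(
  rbind(base_df, bad),
  error = function(e) conditionMessage(e)  # keep the message instead of stopping
)
result
```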

Sorting data

To sort a data frame we use the order() function, which returns the row indices in sorted order. By default, sorting is ascending and missing values are placed last. Prefixing a numeric sorting variable with a minus sign gives descending order.

# sort by age (ascending)
data[order(data$age),]
##       name           job    sex age height
## 4    Peter          <NA>   male  18    178
## 5 Michelle       Teacher female  22    182
## 2    Molly        Artist female  32    165
## 1     John     Policeman   male  45    195
## 3    Frank        Banker   male  58    180
## 6     Lisa Fitness coach female  NA    166
# sort by height (descending)
data[order(-data$height),]
##       name           job    sex age height
## 1     John     Policeman   male  45    195
## 5 Michelle       Teacher female  22    182
## 3    Frank        Banker   male  58    180
## 4    Peter          <NA>   male  18    178
## 6     Lisa Fitness coach female  NA    166
## 2    Molly        Artist female  32    165
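order() also accepts several sorting variables, which are used in turn to break ties. Because the minus sign only works for numeric variables, a character column is sorted in descending order with the decreasing argument instead. A minimal sketch on a hypothetical stand-in data frame demo:

```r
# Small stand-in data frame (illustrative values only)
demo <- data.frame(
  name = c("John", "Molly", "Frank", "Michelle"),
  sex  = c("male", "female", "male", "female"),
  age  = c(45, 32, 58, 22),
  stringsAsFactors = FALSE
)

# Sort by sex (ascending), then by age (descending) within each sex
by_sex_age <- demo[order(demo$sex, -demo$age), ]
by_sex_age$name
## [1] "Molly"    "Michelle" "Frank"    "John"

# Character column in descending order: use decreasing = TRUE
by_name_desc <- demo[order(demo$name, decreasing = TRUE), ]
by_name_desc$name
## [1] "Molly"    "Michelle" "John"     "Frank"
```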

Missing values

In R, missing values are represented by the symbol NA (not available). Undefined numeric results (e.g., 0/0) are represented by the symbol NaN (not a number), whereas dividing a non-zero number by zero yields Inf or -Inf.
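A quick check in the console shows how R distinguishes these special values. The variable names below are arbitrary.

```r
inf_val <- 1 / 0    # Inf: division of a non-zero number by zero
nan_val <- 0 / 0    # NaN: the result is undefined

is.na(NA)           # TRUE
is.na(nan_val)      # TRUE: is.na() also flags NaN
is.nan(NA)          # FALSE: NA is missing, not "not a number"
is.nan(nan_val)     # TRUE

# is.finite() is FALSE for Inf, NA and NaN alike
finite_flags <- is.finite(c(1, inf_val, NA, nan_val))
finite_flags
## [1]  TRUE FALSE FALSE FALSE
```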

We can test for missing values by applying the is.na() function.

is.na(data$job)
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE

The function is.na() returns TRUE if a value is missing. Hence, we can use this logical vector to slice the data frame.

data[is.na(data$job),]
##    name  job  sex age height
## 4 Peter <NA> male  18    178

The command above returns all rows of the data frame data in which the column job contains a missing value. Applying the negation operator (!) returns all rows without missing values in the column job.

data[!is.na(data$job),]
##       name           job    sex age height
## 1     John     Policeman   male  45    195
## 2    Molly        Artist female  32    165
## 3    Frank        Banker   male  58    180
## 5 Michelle       Teacher female  22    182
## 6     Lisa Fitness coach female  NA    166

Note that the age column still contains a missing value. To inspect the data.frame object for missing values, we can apply the is.na() function to the whole data.frame object.

is.na(data)
##       name   job   sex   age height
## [1,] FALSE FALSE FALSE FALSE  FALSE
## [2,] FALSE FALSE FALSE FALSE  FALSE
## [3,] FALSE FALSE FALSE FALSE  FALSE
## [4,] FALSE  TRUE FALSE FALSE  FALSE
## [5,] FALSE FALSE FALSE FALSE  FALSE
## [6,] FALSE FALSE FALSE  TRUE  FALSE

On its own, this matrix is not very informative. However, combined with the rowSums() or colSums() function, it gives a compact count of the missing values per row or per column.

rowSums(is.na(data))
## [1] 0 0 0 1 0 1
colSums(is.na(data))
##   name    job    sex    age height 
##      0      1      0      1      0
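A few related helpers may be useful at this point. The sketch below, on a hypothetical stand-in data frame demo, uses anyNA() for a quick yes/no check, colMeans() for the share of missing values per column, and the na.rm argument that many summary functions such as mean() provide for skipping missing values.

```r
# Small stand-in data frame with missing values (illustrative values only)
demo <- data.frame(
  name = c("Peter", "Lisa", "Molly"),
  job  = c(NA, "Fitness coach", "Artist"),
  age  = c(18, NA, 32),
  stringsAsFactors = FALSE
)

has_na <- anyNA(demo)              # TRUE if at least one value is missing
na_share <- colMeans(is.na(demo))  # proportion of missing values per column

mean(demo$age)                     # NA: one missing value makes the mean NA
age_mean <- mean(demo$age, na.rm = TRUE)  # missing values are skipped
age_mean
## [1] 25
```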

The function complete.cases() returns a logical vector indicating which cases are complete.

complete.cases(data)
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE

By using the resulting logical vector for slicing we may get a clean data set.

data[complete.cases(data),]
##       name       job    sex age height
## 1     John Policeman   male  45    195
## 2    Molly    Artist female  32    165
## 3    Frank    Banker   male  58    180
## 5 Michelle   Teacher female  22    182

Another useful function is na.omit(), which returns the data frame without any rows that contain missing values.

# create new dataset without missing data
clean.data <- na.omit(data) 
clean.data
##       name       job    sex age height
## 1     John Policeman   male  45    195
## 2    Molly    Artist female  32    165
## 3    Frank    Banker   male  58    180
## 5 Michelle   Teacher female  22    182
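Both functions treat a row as incomplete if any column is missing. When only some columns matter, complete.cases() can be applied to just those columns. The sketch below uses a hypothetical stand-in data frame demo and also shows that na.omit() records the dropped rows in the na.action attribute of its result.

```r
# Small stand-in data frame with missing values (illustrative values only)
demo <- data.frame(
  name = c("Peter", "Lisa", "Molly"),
  job  = c(NA, "Fitness coach", "Artist"),
  age  = c(18, NA, 32),
  stringsAsFactors = FALSE
)

# Require completeness only in name and job; a missing age is tolerated
keep <- complete.cases(demo[c("name", "job")])
partial <- demo[keep, ]
partial$name
## [1] "Lisa"  "Molly"

# na.omit() drops every row with any NA and remembers which rows were removed
cleaned <- na.omit(demo)
dropped <- attr(cleaned, "na.action")
length(dropped)
## [1] 2
```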