Data science is about the extraction of knowledge from data. Data, a specific form of information, shows different levels of abstraction and structure (structured, semi-structured or unstructured).
A very common data structure is an array. Different domains use other names for this type of data more or less synonymously, such as matrix in mathematics, table in databases, spreadsheet, and data frame, which is a fundamental R object class (data.frame).
Data of this type is built up of observations and corresponding variables, often called features.
id | name | age |
---|---|---|
1 | John | 26 |
2 | Alice | 20 |
3 | Mike | 21 |
4 | Anne | 25 |
In this example the observations (together called the sample) correspond to a number of individuals. Each observed person is characterized by a set of variables (called features): an identification number (id), a name and an age. In our example it is very easy to get an overall impression of the data just by looking at the table. We immediately realize that there are 4 persons in our sample, two women and two men. We also see immediately that the youngest person is Alice, who is 20 years old, and that the oldest is John, who is 26 years old. Perfect!
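The same observations can be reproduced in R. The following is a minimal sketch that rebuilds the small table as a data frame; the object name persons is chosen here for illustration only and is not part of the original material.

```r
# Rebuild the example table as a data frame (the name 'persons' is illustrative)
persons <- data.frame(
  id   = 1:4,
  name = c("John", "Alice", "Mike", "Anne"),
  age  = c(26, 20, 21, 25)
)

# Number of observations (rows) in the sample
n_persons <- nrow(persons)

# Name and age of the youngest and of the oldest person
youngest <- persons[which.min(persons$age), c("name", "age")]
oldest   <- persons[which.max(persons$age), c("name", "age")]

print(n_persons)
print(youngest)
print(oldest)
```

With only four rows the code confirms what the eye already sees; the same calls work unchanged on much larger data frames.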
However, real-world applications often come with a lot of data. Hundreds, thousands, millions, even billions of observations, combined with thousands of variables, may make up a data set. For humans it is impossible to draw any conclusion just by looking at data sets of that size. Therefore, we reduce the data to a manageable size by constructing tables, drawing graphs, or calculating summary measures such as averages. These kinds of statistical methods are called descriptive statistics (Mann 2012).
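To illustrate the idea, the sketch below uses simulated data, not a real data set: it draws 100,000 hypothetical ages and reduces them to a frequency table and a few summary measures. The vector name ages, the age range of 18 to 30 and the seed are assumptions made for this example.

```r
# Simulated data for illustration only; not taken from any real data set
set.seed(1)
ages <- sample(18:30, size = 100000, replace = TRUE)

# A frequency table condenses 100,000 values into one count per distinct age
age_counts <- table(ages)

# Summary measures reduce the data further to a handful of numbers
age_mean   <- mean(ages)
age_median <- median(ages)
age_range  <- range(ages)

print(age_counts)
print(c(mean = age_mean, median = age_median))
print(age_range)
```

Nobody could scan 100,000 raw values, but the table and the three summary measures fit on a few lines.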
During this section we will explore a data set called students. You may download the students.csv file here. First, we load the data set, assign it a proper name and get an impression of its structure and size by calling the str() function on it.
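# stringsAsFactors = TRUE reproduces the Factor columns shown below;
# since R 4.0.0, read.csv() reads strings as character vectors by default
students <- read.csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv", stringsAsFactors = TRUE)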
str(students)
## 'data.frame': 8239 obs. of 16 variables:
## $ stud.id : int 833917 898539 379678 807564 383291 256074 754591 146494 723584 314281 ...
## $ name : Factor w/ 8174 levels "Aarvold, Cindi",..: 2480 4196 7858 5109 5770 5592 1258 162 7221 5240 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 1 1 2 1 2 1 1 2 1 ...
## $ age : int 19 19 22 19 21 19 21 21 18 18 ...
## $ height : int 160 172 168 183 175 189 156 167 195 165 ...
## $ weight : num 64.8 73 70.6 79.7 71.4 85.8 65.9 65.7 94.4 66 ...
## $ religion : Factor w/ 5 levels "Catholic","Muslim",..: 2 4 5 4 1 1 5 4 4 3 ...
## $ nc.score : num 1.91 1.56 1.24 1.37 1.46 1.34 1.11 2.03 1.29 1.19 ...
## $ semester : Factor w/ 7 levels ">6th","1st","2nd",..: 2 3 4 3 2 3 3 4 4 3 ...
## $ major : Factor w/ 6 levels "Biology","Economics and Finance",..: 5 6 6 3 3 5 5 5 2 3 ...
## $ minor : Factor w/ 6 levels "Biology","Economics and Finance",..: 6 4 4 4 4 4 6 2 3 4 ...
## $ score1 : int NA NA 45 NA NA NA NA 58 57 NA ...
## $ score2 : int NA NA 46 NA NA NA NA 62 67 NA ...
## $ online.tutorial: int 0 0 0 0 0 0 0 0 0 0 ...
## $ graduated : int 0 0 0 0 0 0 0 0 0 0 ...
## $ salary : num NA NA NA NA NA NA NA NA NA NA ...
The students data set consists of 8239 rows, each of them representing a particular student, and 16 columns, each of them corresponding to a variable/feature related to that particular student. These self-explanatory variables are: stud.id, name, gender, age, height, weight, religion, nc.score, semester, major, minor, score1, score2, online.tutorial, graduated, salary. Next to the variable names, the str() function lists the class of each variable. All objects in R have a class, for example numeric (num), integer (int), which is a special type (sub-class) of the numeric class, or factor (the terms category and enumerated type are also used for factors), among others.
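The classes mentioned above can be inspected directly with class(). The sketch below uses made-up values for illustration, not values read from the data set, and relies on base R functions only.

```r
# Example values (illustrative, not taken from the students data set)
x_num <- 64.8                                   # a double-precision number
x_int <- 19L                                    # the L suffix creates an integer
x_fac <- factor(c("Female", "Male", "Female"))  # a categorical variable

class(x_num)       # "numeric"
class(x_int)       # "integer"
is.numeric(x_int)  # TRUE: integers count as numeric
class(x_fac)       # "factor"
levels(x_fac)      # "Female" "Male"
```

Applied to a whole data frame, sapply(students, class) returns the class of every column at once.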
In the next sections we will explore the descriptive statistics of the students data set in more depth.