Lesson 9 Supplement: Compare dplyr and base R Subsetting
Overview
We will discuss/review
- The data sets
- Row subsetting
- Column subsetting
- Relational subsetting
This lesson will not compare merging two data sets using base R, because the dplyr version is superior in almost every way.
Data Sets
In this class, we have used a few different data sets. The table below gives the names of the data sets and their sources.
Data Set | Source |
---|---|
heroes_tbldf | In class: code below |
framingham_tbldf | LocalControl with modifications shown below |
gapminder_tbldf | dslabs with modifications shown below |
The heroes_tbldf Data
# tibble(), the pipe, and the dplyr verbs used below come from the tidyverse
library(tidyverse)
heroes_tbldf <- tibble(
  subject_ID = factor(c("008", "016", "115", "027", "001")),
  name = c("Wonder Woman", "Green Lantern", "Spider-Man", "Batman", "Superman"),
  alias = c(
    "Diana Prince", "Alan Scott", "Peter Parker", "Bruce Wayne",
    "Clark Kent / Kal-El"
  ),
  city = c(
    "Gateway City", "Capitol City", "New York City", "Gotham", "Metropolis"
  ),
  male = c(FALSE, TRUE, TRUE, TRUE, TRUE),
  heightCM = c(183.5, 182.9, 177.8, 188.0, 190.5),
  weightKg = c(74.8, 91.2, 75.7, 95.3, 106.6),
  firstRun = c(1941L, 1940L, 1962L, 1939L, 1938L)
)
Framingham Data
For the examples in this comparison document, we will use the first 20 records of the Framingham data only. Notice that the only difference between framingham_dataframe and framingham_tbldf is that we forced the _tbldf version to be a tibble (modern data frame).
# install.packages("LocalControl")
library(LocalControl)
# Base R data frame
framingham_dataframe <-
  framingham %>%
  mutate(sex = ifelse(female == 1, yes = "Female", no = "Male")) %>%
  mutate(curSmoke = cursmoke == 1) %>%
  mutate(highBP = outcome == 1) %>%
  mutate(death = outcome == 2) %>%
  select(-female, -cursmoke, -outcome) %>%
  slice(1:20)
# Tibble
framingham_tbldf <- as_tibble(framingham_dataframe)
Gapminder Data
# install.packages("dslabs")
library(dslabs)
# Base R data frame
gapminder_dataframe <-
  gapminder %>%
  mutate(gdpPerCap = gdp / population)
# Tibble
gapminder_tbldf <- as_tibble(gapminder_dataframe)
Row Subsetting
base R
To select the first row of a data frame using base R subsetting rules, use the [ function. This follows row, column syntax.
framingham_dataframe[1, ]
totchol age bmi BPVar heartrte glucose time_outcome cigpday sex curSmoke
1 195 39 26.97 -17 80 77 24 0 Male FALSE
highBP death
1 FALSE FALSE
This returns a 1-row data frame.
dplyr
To select the first row of a tibble using dplyr subsetting rules, use [ or the slice() function. The slice() function uses row position only.
framingham_tbldf %>%
  slice(1)
# A tibble: 1 × 12
totchol age bmi BPVar heartrte glucose time_outcome cigpday sex curSmoke
<int> <int> <dbl> <dbl> <int> <int> <dbl> <int> <chr> <lgl>
1 195 39 27.0 -17 80 77 24 0 Male FALSE
# ℹ 2 more variables: highBP <lgl>, death <lgl>
framingham_tbldf[1, ]
# A tibble: 1 × 12
totchol age bmi BPVar heartrte glucose time_outcome cigpday sex curSmoke
<int> <int> <dbl> <dbl> <int> <int> <dbl> <int> <chr> <lgl>
1 195 39 27.0 -17 80 77 24 0 Male FALSE
# ℹ 2 more variables: highBP <lgl>, death <lgl>
These both return the same 1-row tibble.
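The same pattern extends to several rows at once. The following is a minimal sketch on a small made-up data set; toy_df and toy_tbldf are hypothetical names used only for illustration and are not part of the lesson data.

```r
library(dplyr)

# Hypothetical toy data, not one of the lesson data sets
toy_df <- data.frame(id = 1:5, score = c(10, 20, 30, 40, 50))
toy_tbldf <- as_tibble(toy_df)

# Base R: rows 2 through 4, all columns
base_rows <- toy_df[2:4, ]

# dplyr: the same rows, selected by position
dplyr_rows <- toy_tbldf %>%
  slice(2:4)

# Negative positions drop rows in both approaches
base_dropped <- toy_df[-1, ]
dplyr_dropped <- toy_tbldf %>%
  slice(-1)
```

In both approaches, a vector of positions selects those rows and a negative position removes them.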
Column Subsetting
base R
To select the first column of a data frame using base R subsetting rules, again use the [ function. This follows row, column syntax, so move the 1 to the other side of the comma.
framingham_dataframe[, 1]
[1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243
This returns an atomic vector of length 20.
Because data frames (and therefore tibbles) are also lists, we can use the list subsetting rules for columns as well. To subset by column position, we can use the [ or the [[ functions (for single-level or double-level subsetting). Note that [[ selects a single column, so it takes a single scalar position rather than a vector of positions.
framingham_dataframe[1]
totchol
1 195
2 250
3 245
4 285
5 205
6 313
7 254
8 247
9 291
10 195
11 195
12 190
13 215
14 294
15 247
16 295
17 226
18 175
19 180
20 243
This returns a 1-column data frame.
framingham_dataframe[[1]]
[1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243
This returns an atomic vector of length 20 with the contents of the first column.
Note that these functions also work when we supply the name of a column as a character string:
framingham_dataframe[, "totchol"] # atomic vector
[1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243
framingham_dataframe["totchol"] # 1-column data frame
totchol
1 195
2 250
3 245
4 285
5 205
6 313
7 254
8 247
9 291
10 195
11 195
12 190
13 215
14 294
15 247
16 295
17 226
18 175
19 180
20 243
framingham_dataframe[["totchol"]] # atomic vector
[1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243
Infuriatingly, the [ function changes behaviour if you request two or more columns using [row, c(col1, col2, ...)] syntax:
framingham_dataframe[, c("totchol", "death")] # now a data frame
totchol death
1 195 FALSE
2 250 FALSE
3 245 FALSE
4 285 FALSE
5 205 FALSE
6 313 FALSE
7 254 FALSE
8 247 FALSE
9 291 TRUE
10 195 FALSE
11 195 FALSE
12 190 FALSE
13 215 FALSE
14 294 TRUE
15 247 FALSE
16 295 FALSE
17 226 FALSE
18 175 FALSE
19 180 FALSE
20 243 FALSE
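The change in return type comes from the drop argument of the [ function, which defaults to TRUE for data frames and simplifies a single-column result to a vector. The sketch below, on a hypothetical toy data frame (toy_df is not part of the lesson data), shows how drop = FALSE keeps a data frame regardless of how many columns are requested.

```r
# Hypothetical toy data, not one of the lesson data sets
toy_df <- data.frame(id = 1:3, score = c(10, 20, 30))

# Default: a single column is simplified ("dropped") to an atomic vector
as_vector <- toy_df[, "score"]

# drop = FALSE keeps the data frame structure, whatever the column count
as_df <- toy_df[, "score", drop = FALSE]

is.data.frame(as_vector)
is.data.frame(as_df)
```

Setting drop = FALSE is a common defensive habit in base R code where the number of selected columns is not known in advance.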
One of the main differences between [ and [[ is that the single-subset operator can take in multiple column names, while [[ cannot. For example:
framingham_dataframe[[c("totchol", "death")]]
Error in .subset2(x, i, exact = exact): subscript out of bounds
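The "subscript out of bounds" message is easier to read once we know that [[ treats a vector of length two or more as a recursive index: it applies the first element, then applies the second element to that result. In the call above, that most likely means looking for an element named "death" inside the totchol vector, which does not exist. A minimal sketch on a hypothetical nested list (nested_ls is an illustrative name only):

```r
# Hypothetical nested list, used only for illustration
nested_ls <- list(
  outer = list(inner = 42, other = "a")
)

# A length-2 index walks down one level per element
recursive <- nested_ls[[c("outer", "inner")]]

# Equivalent chained form
chained <- nested_ls[["outer"]][["inner"]]

identical(recursive, chained)
```

So [[ does accept a vector, but it never means "several columns at once".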
Finally, we can use the $ function to extract the contents of a single column by name as a symbol (not using quotes):
framingham_dataframe$totchol
[1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243
This returns an atomic vector of length 20 with the contents of the first column.
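One caveat with $ on a base R data frame is partial matching: an unambiguous prefix of a column name is silently accepted, while [[ requires the exact name by default. The sketch below uses a hypothetical toy data frame (toy_df is not part of the lesson data). Tibbles, by contrast, are documented as not doing partial matching with $.

```r
# Hypothetical toy data, not one of the lesson data sets
toy_df <- data.frame(id = 1:3, score = c(10, 20, 30))

# $ partially matches "sc" to the "score" column
partial <- toy_df$sc

# [[ matches names exactly by default, so the same prefix finds nothing
exact <- toy_df[["sc"]]

partial
is.null(exact)
```

A misspelled or truncated column name can therefore go unnoticed with $ on a data frame.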
dplyr
To select the first column of a tibble using dplyr subsetting rules, use [ or the select() function. Here, select() uses column position only.
framingham_tbldf %>%
  select(1)
# A tibble: 20 × 1
totchol
<int>
1 195
2 250
3 245
4 285
5 205
6 313
7 254
8 247
9 291
10 195
11 195
12 190
13 215
14 294
15 247
16 295
17 226
18 175
19 180
20 243
framingham_tbldf[, 1]
# A tibble: 20 × 1
totchol
<int>
1 195
2 250
3 245
4 285
5 205
6 313
7 254
8 247
9 291
10 195
11 195
12 190
13 215
14 294
15 247
16 295
17 226
18 175
19 180
20 243
framingham_tbldf[1]
# A tibble: 20 × 1
totchol
<int>
1 195
2 250
3 245
4 285
5 205
6 313
7 254
8 247
9 291
10 195
11 195
12 190
13 215
14 294
15 247
16 295
17 226
18 175
19 180
20 243
These all return the same 1-column tibble.
To access a column by name, we have two options, depending on whether we want the column itself or the contents of the column.
Column Itself
We can use the select() function in addition to the [ function to extract one or more columns. However, for the select() function, we are not required to supply the name in quotes, but rather we can use column names as a symbol object.
framingham_tbldf %>%
  select(totchol)
# A tibble: 20 × 1
totchol
<int>
1 195
2 250
3 245
4 285
5 205
6 313
7 254
8 247
9 291
10 195
11 195
12 190
13 215
14 294
15 247
16 295
17 226
18 175
19 180
20 243
framingham_tbldf[, "totchol"]
# A tibble: 20 × 1
totchol
<int>
1 195
2 250
3 245
4 285
5 205
6 313
7 254
8 247
9 291
10 195
11 195
12 190
13 215
14 294
15 247
16 295
17 226
18 175
19 180
20 243
framingham_tbldf["totchol"]
# A tibble: 20 × 1
totchol
<int>
1 195
2 250
3 245
4 285
5 205
6 313
7 254
8 247
9 291
10 195
11 195
12 190
13 215
14 294
15 247
16 295
17 226
18 175
19 180
20 243
As with subsetting by position, these all return the same 1-column tibble. What’s more, unlike with a base R data frame, the number of columns selected does not change the behaviour of the function. Tibbles stay tibbles.
framingham_tbldf %>%
  select(totchol, death)
# A tibble: 20 × 2
totchol death
<int> <lgl>
1 195 FALSE
2 250 FALSE
3 245 FALSE
4 285 FALSE
5 205 FALSE
6 313 FALSE
7 254 FALSE
8 247 FALSE
9 291 TRUE
10 195 FALSE
11 195 FALSE
12 190 FALSE
13 215 FALSE
14 294 TRUE
15 247 FALSE
16 295 FALSE
17 226 FALSE
18 175 FALSE
19 180 FALSE
20 243 FALSE
framingham_tbldf[, c("totchol", "death")]
# A tibble: 20 × 2
totchol death
<int> <lgl>
1 195 FALSE
2 250 FALSE
3 245 FALSE
4 285 FALSE
5 205 FALSE
6 313 FALSE
7 254 FALSE
8 247 FALSE
9 291 TRUE
10 195 FALSE
11 195 FALSE
12 190 FALSE
13 215 FALSE
14 294 TRUE
15 247 FALSE
16 295 FALSE
17 226 FALSE
18 175 FALSE
19 180 FALSE
20 243 FALSE
framingham_tbldf[c("totchol", "death")]
# A tibble: 20 × 2
totchol death
<int> <lgl>
1 195 FALSE
2 250 FALSE
3 245 FALSE
4 285 FALSE
5 205 FALSE
6 313 FALSE
7 254 FALSE
8 247 FALSE
9 291 TRUE
10 195 FALSE
11 195 FALSE
12 190 FALSE
13 215 FALSE
14 294 TRUE
15 247 FALSE
16 295 FALSE
17 226 FALSE
18 175 FALSE
19 180 FALSE
20 243 FALSE
Column Contents
The pull() function works similarly to the $ and [[ functions: it takes in the name of a single column as a symbol object or a character string, and it returns the contents of the column as a vector.
framingham_tbldf %>%
  pull(totchol)
[1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243
framingham_tbldf %>%
  pull("totchol")
[1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243
framingham_tbldf$totchol
[1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243
framingham_tbldf[[1]]
[1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243
framingham_tbldf[["totchol"]]
[1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243
These all return the same atomic vector of length 20 with the contents of the first column.
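Like [[, the pull() function also accepts a column position. As a minimal sketch on a hypothetical toy tibble (toy_tbldf is not part of the lesson data), a positive integer counts columns from the left and a negative integer counts from the right:

```r
library(dplyr)

# Hypothetical toy data, not one of the lesson data sets
toy_tbldf <- tibble(id = 1:3, score = c(10, 20, 30))

# By name, as a symbol
by_name <- toy_tbldf %>%
  pull(score)

# By position, counting from the left
by_position <- toy_tbldf %>%
  pull(2)

# By position, counting from the right (-1 is the last column)
from_right <- toy_tbldf %>%
  pull(-1)
```

All three calls extract the score column as a plain numeric vector.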
Relational Subsetting
Finally, we can subset the rows of a data set by the relationship between the values of a column and a specified value, or by the relationship between one column and another. For the remainder of this lesson, we will use the “tibble” versions of the data only.
Example 1: Framingham Young Women
For example, we want to find all the females in the Framingham data set younger than the 75th percentile for the ages of these recorded females. (Remember that this lesson uses a subset of the Framingham data, not the full data set. Your results will look different.)
Using base
# Subset the women
framinghamWomen_tbldf <-
  framingham_tbldf[framingham_tbldf$sex == "Female", ]
# Find the age threshold for P75
womenP75_num <- quantile(framinghamWomen_tbldf$age, 0.75)
framinghamWomenYoung_tbldf <-
  framinghamWomen_tbldf[framinghamWomen_tbldf$age < womenP75_num, ]
# Remove intermediary pieces from memory
rm(framinghamWomen_tbldf, womenP75_num)
# Inspect
framinghamWomenYoung_tbldf
# A tibble: 9 × 12
totchol age bmi BPVar heartrte glucose time_outcome cigpday sex
<int> <int> <dbl> <dbl> <int> <int> <dbl> <int> <chr>
1 250 46 28.7 1.5 95 76 24 0 Female
2 285 46 23.1 9 85 85 11.7 23 Female
3 313 45 21.7 -19 79 78 23.8 20 Female
4 254 50 22.9 2.5 75 76 24 0 Female
5 247 43 27.6 13.5 72 61 24 0 Female
6 291 46 23.4 -6 80 89 3.95 20 Female
7 195 38 23.2 5.5 75 78 5.91 5 Female
8 190 42 21.6 -15.5 72 85 16.2 30 Female
9 243 43 26.9 -1.75 68 78 24 10 Female
# ℹ 3 more variables: curSmoke <lgl>, highBP <lgl>, death <lgl>
Using dplyr
framingham_tbldf %>%
  filter(sex == "Female") %>%
  filter(age < quantile(age, 0.75))
# A tibble: 9 × 12
totchol age bmi BPVar heartrte glucose time_outcome cigpday sex
<int> <int> <dbl> <dbl> <int> <int> <dbl> <int> <chr>
1 250 46 28.7 1.5 95 76 24 0 Female
2 285 46 23.1 9 85 85 11.7 23 Female
3 313 45 21.7 -19 79 78 23.8 20 Female
4 254 50 22.9 2.5 75 76 24 0 Female
5 247 43 27.6 13.5 72 61 24 0 Female
6 291 46 23.4 -6 80 89 3.95 20 Female
7 195 38 23.2 5.5 75 78 5.91 5 Female
8 190 42 21.6 -15.5 72 85 16.2 30 Female
9 243 43 26.9 -1.75 68 78 24 10 Female
# ℹ 3 more variables: curSmoke <lgl>, highBP <lgl>, death <lgl>
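The two approaches agree here, but they can differ when the column being compared contains missing values (the 20-row subset used in this lesson may not contain any). A comparison against NA yields NA; base R's [ keeps such a row as an all-NA row, while filter() keeps only the rows where the condition is TRUE. A minimal sketch on a hypothetical toy data frame (toy_df is not part of the lesson data):

```r
library(dplyr)

# Hypothetical toy data with one missing age, not one of the lesson data sets
toy_df <- data.frame(id = 1:4, age = c(35L, NA, 52L, 41L))

# Base R: the NA comparison produces an extra row filled with NA
base_result <- toy_df[toy_df$age < 45, ]

# Base R: which() keeps only the positions where the condition is TRUE
base_clean <- toy_df[which(toy_df$age < 45), ]

# dplyr: filter() drops rows where the condition is NA or FALSE
dplyr_result <- toy_df %>%
  filter(age < 45)

nrow(base_result)
nrow(base_clean)
nrow(dplyr_result)
```

When translating old base R code, wrapping the logical condition in which() is one way to reproduce the filter() behaviour.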
Example 2: Countries with Highest Infant Mortality
As another example, we want to find the 25 countries in the Gapminder data set which had the largest infant mortality rate in 2010.
Using base
# Save a subset of the Gapminder data only for the year 2010
gapminder2010_tbldf <-
  gapminder_tbldf[gapminder_tbldf[["year"]] == 2010, ]
# Find the order of the rows by worst infant mortality to best
worstInfMort_idx <- order(
  gapminder2010_tbldf[["infant_mortality"]],
  decreasing = TRUE
)
# Subset the 2010 data by the first 25 indices in the "worst" vector
gapminderInfMort2010_tbldf <-
  gapminder2010_tbldf[worstInfMort_idx[1:25], ]
# Inspect
gapminderInfMort2010_tbldf
# A tibble: 25 × 10
country year infant_mortality life_expectancy fertility population gdp
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Angola 2010 110. 57.6 6.22 21219954 2.61e10
2 Sierra L… 2010 107 55 4.94 5775902 1.57e 9
3 Central … 2010 102. 47.9 4.63 4444973 1.05e 9
4 Chad 2010 93.6 55.8 6.6 11896380 3.37e 9
5 Haiti 2010 85.5 32.2 3.35 9999617 3.70e 9
6 Congo, D… 2010 84.8 58.4 6.25 65938712 6.96e 9
7 Mali 2010 82.9 59.2 6.84 15167286 4.20e 9
8 Nigeria 2010 81.5 61.2 6.02 159424742 8.56e10
9 Equatori… 2010 78.9 58.6 5.14 728710 5.98e 9
10 Cote d'I… 2010 76.9 56.6 4.91 20131707 1.16e10
# ℹ 15 more rows
# ℹ 3 more variables: continent <fct>, region <fct>, gdpPerCap <dbl>
Using dplyr
gapminder_tbldf %>%
  filter(year == 2010) %>%
  arrange(desc(infant_mortality)) %>%
  slice(1:25)
# A tibble: 25 × 10
country year infant_mortality life_expectancy fertility population gdp
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Angola 2010 110. 57.6 6.22 21219954 2.61e10
2 Sierra L… 2010 107 55 4.94 5775902 1.57e 9
3 Central … 2010 102. 47.9 4.63 4444973 1.05e 9
4 Chad 2010 93.6 55.8 6.6 11896380 3.37e 9
5 Haiti 2010 85.5 32.2 3.35 9999617 3.70e 9
6 Congo, D… 2010 84.8 58.4 6.25 65938712 6.96e 9
7 Mali 2010 82.9 59.2 6.84 15167286 4.20e 9
8 Nigeria 2010 81.5 61.2 6.02 159424742 8.56e10
9 Equatori… 2010 78.9 58.6 5.14 728710 5.98e 9
10 Cote d'I… 2010 76.9 56.6 4.91 20131707 1.16e10
# ℹ 15 more rows
# ℹ 3 more variables: continent <fct>, region <fct>, gdpPerCap <dbl>
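Recent versions of dplyr (1.0.0 and later) also provide slice_max(), which combines the arrange-then-slice steps. This is an alternative to the pipeline above, not the approach used in this lesson. A minimal sketch on a hypothetical toy tibble (toy_tbldf and its values are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data, not the Gapminder data
toy_tbldf <- tibble(
  country = c("A", "B", "C", "D"),
  infant_mortality = c(12.5, 80.1, 30.0, 45.3)
)

# The lesson's pattern: sort descending, then keep the top rows
pipeline <- toy_tbldf %>%
  arrange(desc(infant_mortality)) %>%
  slice(1:2)

# One-step alternative; with_ties = FALSE returns exactly n rows
shortcut <- toy_tbldf %>%
  slice_max(infant_mortality, n = 2, with_ties = FALSE)
```

Note that slice_max() keeps tied rows by default, so it can return more than n rows unless with_ties = FALSE is set.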
Concluding Remarks
I usually assign this lesson only if we have quite a few students who have learned some version of R in the past. Many of my students “learned R” by copying and pasting their professor’s code, Googling the errors, and making changes until the problems went away. This isn’t learning. This is beating your code into submission. It’s not healthy for long-term research success.
If you learned this way, I apologize, but you will most likely have to “unlearn” what you learned first, then re-learn R as a modern computing language for data science and statistical learning. This lesson is to give you a comparison to some of the variants of data management in R that you may have learned in other classes. Hopefully, this helps you compare the two schools of thought. Some of you will still prefer your “old way” of doing things, and I suppose that I can’t fault you for wanting to stay in your comfort zone. It’s comfortable there, after all. But, you have a chance to learn some more sophisticated data management techniques, and I suggest you take it.