Lesson 9 Supplement: Compare dplyr and base R Subsetting

Author

Gabriel Odom

Published

October 2, 2019

Overview

We will discuss/review

The data sets
Row subsetting
Column subsetting
Relational subsetting

This lesson will not compare merging two data sets using basic R, because the dplyr version is superior in almost every way.

Data Sets

In this class, we have used a few different data sets. The table below gives the names of the data sets and their sources.

Data Set	Source
`heroes_tbldf`	In class: code below
`framingham_tbldf`	`LocalControl` with modifications shown below
`gapminder_tbldf`	`dslabs` with modifications shown below

Tip: What’s in a Name?

Up to this point in the semester, I have used the object name ending _df exclusively to mean a “tibble” (or modern data frame). In this lesson, I will use _tbldf to mean “tibble” instead of _df, and I will mark data in the traditional “data frame” class with _dataframe. After this lesson is over, I will go back to using _df to mean “tibble”, and we will probably never speak of “old” data frames again.

The `heroes_tbldf` Data

heroes_tbldf <- tibble(
  subject_ID = factor(c("008", "016", "115", "027", "001")),
  name = c("Wonder Woman", "Green Lantern", "Spider-Man", "Batman", "Superman"),
  alias = c(
    "Diana Prince", "Alan Scott", "Peter Parker", "Bruce Wayne",
    "Clark Kent / Kal-El"
  ),
  city = c(
    "Gateway City", "Capitol City", "New York City", "Gotham", "Metropolis"
  ),
  male = c(FALSE, TRUE, TRUE, TRUE, TRUE),
  heightCM = c(183.5, 182.9, 177.8, 188.0, 190.5),
  weightKg = c(74.8, 91.2, 75.7, 95.3, 106.6),
  firstRun = c(1941L, 1940L, 1962L, 1939L, 1938L)
)

Framingham Data

For the examples in this comparison document, we will use the first 20 records of the Framingham data only. Notice that the only difference between framingham_df and framingham_tbldf is that we forced the _tbldf version to be a tibble (modern data frame).

# install.packages("LocalControl")
library(LocalControl)

# Base R data frame
framingham_dataframe <- 
  framingham %>% 
  mutate(sex = ifelse(female == 1, yes = "Female", no = "Male")) %>% 
  mutate(curSmoke = cursmoke == 1) %>% 
  mutate(highBP = outcome == 1) %>% 
  mutate(death = outcome == 2) %>% 
  select(-female, -cursmoke, -outcome) %>% 
  slice(1:20)

# Tibble
framingham_tbldf <- as_tibble(framingham_dataframe)

Gapminder Data

# install.packages("dslabs")
library(dslabs)

# Base R data frame
gapminder_dataframe <-
  gapminder %>% 
  mutate(gdpPerCap = gdp / population)

# Tibble
gapminder_tbldf <- as_tibble(gapminder_dataframe)

Row Subsetting

base

To select the first row of a tibble using basic R subsetting rules, use the [ function. This follows row, column syntax.

framingham_dataframe[1, ]

  totchol age   bmi BPVar heartrte glucose time_outcome cigpday  sex curSmoke
1     195  39 26.97   -17       80      77           24       0 Male    FALSE
  highBP death
1  FALSE FALSE

This returns a 1-row data frame.

dplyr

To select the first row of a tibble using dplyr subsetting rules, use [ or the slice() function. This uses row position only.

framingham_tbldf %>% 
  slice(1)

# A tibble: 1 × 12
  totchol   age   bmi BPVar heartrte glucose time_outcome cigpday sex   curSmoke
    <int> <int> <dbl> <dbl>    <int>   <int>        <dbl>   <int> <chr> <lgl>   
1     195    39  27.0   -17       80      77           24       0 Male  FALSE   
# ℹ 2 more variables: highBP <lgl>, death <lgl>

framingham_tbldf[1, ]

# A tibble: 1 × 12
  totchol   age   bmi BPVar heartrte glucose time_outcome cigpday sex   curSmoke
    <int> <int> <dbl> <dbl>    <int>   <int>        <dbl>   <int> <chr> <lgl>   
1     195    39  27.0   -17       80      77           24       0 Male  FALSE   
# ℹ 2 more variables: highBP <lgl>, death <lgl>

These both return the same 1-row tibble.

Column Subsetting

base R

To select the first column of a tibble using basic R subsetting rules, again use the [ function. This follows row, column syntax, so move the 1 on the other side of the comma.

framingham_dataframe[, 1]

 [1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243

This returns an atomic vector of length 20.

Because tibbles are also lists, we can use the list subsetting rules for columns as well. To subset by column position, we can use the [ or the [[ functions (for single-level or double-level subsetting). Note that [[ can only take in a single scalar argument, rather than a vector of positions.

framingham_dataframe[1]

This returns a 1-column data frame.

framingham_dataframe[[1]]

 [1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243

This returns an atomic vector of length 20 with the contents of the first column.

Note that these functions also work when we supply the name of a column as a character string:

framingham_dataframe[, "totchol"] # atomic vector

 [1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243

framingham_dataframe["totchol"]   # 1-column data frame

framingham_dataframe[["totchol"]] # atomic vector

 [1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243

Infuriatingly, the [ function changes behaviour if you request two or more columns using [row, c(col1, col2, ...)] syntax:

framingham_dataframe[, c("totchol", "death")] # now a data frame

   totchol death
1      195 FALSE
2      250 FALSE
3      245 FALSE
4      285 FALSE
5      205 FALSE
6      313 FALSE
7      254 FALSE
8      247 FALSE
9      291  TRUE
10     195 FALSE
11     195 FALSE
12     190 FALSE
13     215 FALSE
14     294  TRUE
15     247 FALSE
16     295 FALSE
17     226 FALSE
18     175 FALSE
19     180 FALSE
20     243 FALSE

One of the main differences between [ and [[ is that the single-subset operator can take in multiple column names, while [[ cannot. For example:

framingham_dataframe[[c("totchol", "death")]]

Error in .subset2(x, i, exact = exact): subscript out of bounds

Finally, we can use the $ function to extract the contents of a single column by name as a symbol (not using quotes):

framingham_dataframe$totchol

 [1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243

This returns an atomic vector of length 20 with the contents of the first column.

dplyr

To select the first column of a tibble using dplyr subsetting rules, use [ or the select() function. This uses column position only.

framingham_tbldf %>% 
  select(1)

# A tibble: 20 × 1
   totchol
     <int>
 1     195
 2     250
 3     245
 4     285
 5     205
 6     313
 7     254
 8     247
 9     291
10     195
11     195
12     190
13     215
14     294
15     247
16     295
17     226
18     175
19     180
20     243

framingham_tbldf[, 1]

# A tibble: 20 × 1
   totchol
     <int>
 1     195
 2     250
 3     245
 4     285
 5     205
 6     313
 7     254
 8     247
 9     291
10     195
11     195
12     190
13     215
14     294
15     247
16     295
17     226
18     175
19     180
20     243

framingham_tbldf[1]

# A tibble: 20 × 1
   totchol
     <int>
 1     195
 2     250
 3     245
 4     285
 5     205
 6     313
 7     254
 8     247
 9     291
10     195
11     195
12     190
13     215
14     294
15     247
16     295
17     226
18     175
19     180
20     243

These all return the same 1-column tibble.

To access a column by name, we have two options: do we want the contents of the column or do we want the column itself.

Column Itself

We can use the select() function in addition to the [ function to extract one or more columns. However, for the select() function, we are not required to supply the name in quotes, but rather we can use column names as a symbol object.

framingham_tbldf %>% 
  select(totchol)

# A tibble: 20 × 1
   totchol
     <int>
 1     195
 2     250
 3     245
 4     285
 5     205
 6     313
 7     254
 8     247
 9     291
10     195
11     195
12     190
13     215
14     294
15     247
16     295
17     226
18     175
19     180
20     243

framingham_tbldf[, "totchol"]

# A tibble: 20 × 1
   totchol
     <int>
 1     195
 2     250
 3     245
 4     285
 5     205
 6     313
 7     254
 8     247
 9     291
10     195
11     195
12     190
13     215
14     294
15     247
16     295
17     226
18     175
19     180
20     243

framingham_tbldf["totchol"]

# A tibble: 20 × 1
   totchol
     <int>
 1     195
 2     250
 3     245
 4     285
 5     205
 6     313
 7     254
 8     247
 9     291
10     195
11     195
12     190
13     215
14     294
15     247
16     295
17     226
18     175
19     180
20     243

As with subsetting by position, these all return the same 1-column tibble. What’s more, is that—unlike a base R data frame—selecting more than one column does not change the behaviour of the function. Tibbles stay tibbles.

framingham_tbldf %>% 
  select(totchol, death)

# A tibble: 20 × 2
   totchol death
     <int> <lgl>
 1     195 FALSE
 2     250 FALSE
 3     245 FALSE
 4     285 FALSE
 5     205 FALSE
 6     313 FALSE
 7     254 FALSE
 8     247 FALSE
 9     291 TRUE 
10     195 FALSE
11     195 FALSE
12     190 FALSE
13     215 FALSE
14     294 TRUE 
15     247 FALSE
16     295 FALSE
17     226 FALSE
18     175 FALSE
19     180 FALSE
20     243 FALSE

framingham_tbldf[, c("totchol", "death")]

# A tibble: 20 × 2
   totchol death
     <int> <lgl>
 1     195 FALSE
 2     250 FALSE
 3     245 FALSE
 4     285 FALSE
 5     205 FALSE
 6     313 FALSE
 7     254 FALSE
 8     247 FALSE
 9     291 TRUE 
10     195 FALSE
11     195 FALSE
12     190 FALSE
13     215 FALSE
14     294 TRUE 
15     247 FALSE
16     295 FALSE
17     226 FALSE
18     175 FALSE
19     180 FALSE
20     243 FALSE

framingham_tbldf[c("totchol", "death")]

# A tibble: 20 × 2
   totchol death
     <int> <lgl>
 1     195 FALSE
 2     250 FALSE
 3     245 FALSE
 4     285 FALSE
 5     205 FALSE
 6     313 FALSE
 7     254 FALSE
 8     247 FALSE
 9     291 TRUE 
10     195 FALSE
11     195 FALSE
12     190 FALSE
13     215 FALSE
14     294 TRUE 
15     247 FALSE
16     295 FALSE
17     226 FALSE
18     175 FALSE
19     180 FALSE
20     243 FALSE

Column Contents

The pull() function works similarly to the $ and [[ functions: it takes in the name of a single column as a symbol object or a character string, and it returns the contents of the column as a vector.

framingham_tbldf %>% 
  pull(totchol)

 [1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243

framingham_tbldf %>% 
  pull("totchol")

 [1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243

framingham_tbldf$totchol

 [1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243

framingham_tbldf[[1]]

 [1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243

framingham_tbldf[["totchol"]]

 [1] 195 250 245 285 205 313 254 247 291 195 195 190 215 294 247 295 226 175 180
[20] 243

These all return the same atomic vector of length 20 with the contents of the first column.

Relational Subsetting

Finally, we can subset the rows of a data set by the relationships between the values a column and a specified value or by the relationships between one column and another. For the remainder of this lesson, we will use the “tibble” versions of the data only.

Example 1: Framingham Young Women

For example, we want to find all the females in the Framingham data set younger than the 75th percentile for the ages of these recorded females. (Remember that this lesson uses a subset of the Framingham data, not the full data set. Your results will look different.)

Using base

# Subset the women
framinghamWomen_tbldf <- 
  framingham_tbldf[framingham_tbldf$sex == "Female", ]

# Find the age threshold for P75
womenP75_num <- quantile(framinghamWomen_tbldf$age, 0.75)
framinghamWomenYoung_tbldf <- 
  framinghamWomen_tbldf[framinghamWomen_tbldf$age < womenP75_num, ]

# Remove intermediary pieces from memory
rm(framinghamWomen_tbldf, womenP75_num)

# Inspect
framinghamWomenYoung_tbldf

# A tibble: 9 × 12
  totchol   age   bmi  BPVar heartrte glucose time_outcome cigpday sex   
    <int> <int> <dbl>  <dbl>    <int>   <int>        <dbl>   <int> <chr> 
1     250    46  28.7   1.5        95      76        24          0 Female
2     285    46  23.1   9          85      85        11.7       23 Female
3     313    45  21.7 -19          79      78        23.8       20 Female
4     254    50  22.9   2.5        75      76        24          0 Female
5     247    43  27.6  13.5        72      61        24          0 Female
6     291    46  23.4  -6          80      89         3.95      20 Female
7     195    38  23.2   5.5        75      78         5.91       5 Female
8     190    42  21.6 -15.5        72      85        16.2       30 Female
9     243    43  26.9  -1.75       68      78        24         10 Female
# ℹ 3 more variables: curSmoke <lgl>, highBP <lgl>, death <lgl>

Using dplyr

framingham_tbldf %>%
  filter(sex == "Female") %>%
  filter(age < quantile(age, 0.75))

# A tibble: 9 × 12
  totchol   age   bmi  BPVar heartrte glucose time_outcome cigpday sex   
    <int> <int> <dbl>  <dbl>    <int>   <int>        <dbl>   <int> <chr> 
1     250    46  28.7   1.5        95      76        24          0 Female
2     285    46  23.1   9          85      85        11.7       23 Female
3     313    45  21.7 -19          79      78        23.8       20 Female
4     254    50  22.9   2.5        75      76        24          0 Female
5     247    43  27.6  13.5        72      61        24          0 Female
6     291    46  23.4  -6          80      89         3.95      20 Female
7     195    38  23.2   5.5        75      78         5.91       5 Female
8     190    42  21.6 -15.5        72      85        16.2       30 Female
9     243    43  26.9  -1.75       68      78        24         10 Female
# ℹ 3 more variables: curSmoke <lgl>, highBP <lgl>, death <lgl>

Example 2: Countries with Highest Infant Mortality

As another example, we want to find the 25 countries in the Gapminder data set which had the largest infant mortality rate in 2010.

Using base

# Save a subset of the Gapminder data only for the year 2010
gapminder2010_tbldf <- 
  gapminder_tbldf[gapminder_tbldf[["year"]] == 2010, ]

# Find the order of the rows by worst infant mortality to best
worstInfMort_idx <- order(
  gapminder2010_tbldf[["infant_mortality"]],
  decreasing = TRUE
)

# Subset the 2010 data by the first 25 indices in the "worst" vector
gapminderInfMort2010_tbldf <-
  gapminder2010_tbldf[worstInfMort_idx[1:25], ]

# Inspect
gapminderInfMort2010_tbldf

# A tibble: 25 × 10
   country    year infant_mortality life_expectancy fertility population     gdp
   <fct>     <int>            <dbl>           <dbl>     <dbl>      <dbl>   <dbl>
 1 Angola     2010            110.             57.6      6.22   21219954 2.61e10
 2 Sierra L…  2010            107              55        4.94    5775902 1.57e 9
 3 Central …  2010            102.             47.9      4.63    4444973 1.05e 9
 4 Chad       2010             93.6            55.8      6.6    11896380 3.37e 9
 5 Haiti      2010             85.5            32.2      3.35    9999617 3.70e 9
 6 Congo, D…  2010             84.8            58.4      6.25   65938712 6.96e 9
 7 Mali       2010             82.9            59.2      6.84   15167286 4.20e 9
 8 Nigeria    2010             81.5            61.2      6.02  159424742 8.56e10
 9 Equatori…  2010             78.9            58.6      5.14     728710 5.98e 9
10 Cote d'I…  2010             76.9            56.6      4.91   20131707 1.16e10
# ℹ 15 more rows
# ℹ 3 more variables: continent <fct>, region <fct>, gdpPerCap <dbl>

Using dplyr

gapminder_tbldf %>% 
  filter(year == 2010) %>% 
  arrange(desc(infant_mortality)) %>% 
  slice(1:25)

# A tibble: 25 × 10
   country    year infant_mortality life_expectancy fertility population     gdp
   <fct>     <int>            <dbl>           <dbl>     <dbl>      <dbl>   <dbl>
 1 Angola     2010            110.             57.6      6.22   21219954 2.61e10
 2 Sierra L…  2010            107              55        4.94    5775902 1.57e 9
 3 Central …  2010            102.             47.9      4.63    4444973 1.05e 9
 4 Chad       2010             93.6            55.8      6.6    11896380 3.37e 9
 5 Haiti      2010             85.5            32.2      3.35    9999617 3.70e 9
 6 Congo, D…  2010             84.8            58.4      6.25   65938712 6.96e 9
 7 Mali       2010             82.9            59.2      6.84   15167286 4.20e 9
 8 Nigeria    2010             81.5            61.2      6.02  159424742 8.56e10
 9 Equatori…  2010             78.9            58.6      5.14     728710 5.98e 9
10 Cote d'I…  2010             76.9            56.6      4.91   20131707 1.16e10
# ℹ 15 more rows
# ℹ 3 more variables: continent <fct>, region <fct>, gdpPerCap <dbl>

Concluding Remarks

This lesson I usually only assign if we have quite a few students who had learned some version of R in the past. Many of my students “learned R” by copying and pasting their professor’s code, Googling the errors, and making changes until the problems went away. This isn’t learning. This is beating your code into submission. It’s not healthy for long-term research success.

If you learned this way, I apologize, but you will most likely have to “unlearn” what you learned first, then re-learn R as a modern computing language for data science and statistical learning. This lesson is to give you a comparison to some of the variants of data management in R that you may have learned in other classes. Hopefully, this helps you compare the two schools of thought. Some of you will still prefer your “old way” of doing things, and I suppose that I can’t fault you for wanting to stay in your comfort zone. It’s comfortable there, after all. But, you have a chance to learn some more sophisticated data management techniques, and I suggest you take it.

Overview

Data Sets

The heroes_tbldf Data

Framingham Data

Gapminder Data

Row Subsetting

base

dplyr

Column Subsetting

base R

dplyr

Column Itself

Column Contents

Relational Subsetting

Example 1: Framingham Young Women

Using base

Using dplyr

Example 2: Countries with Highest Infant Mortality

Using base

Using dplyr

Concluding Remarks

The `heroes_tbldf` Data