39 dplyr basics

The dplyr package offers extensive functionality for data manipulation.

dplyr’s functions are named after verbs (like filter, select, etc.) and are commonly used with the pipe operator to build pipelines. The package offers a large number of functions in total, often with multiple versions of the same “verb”. It has undergone many major changes since its introduction, so always make sure to consult the latest documentation.

Core operations include:

filter(): identify cases
select(): identify columns
mutate(): create new variables as a function of existing ones
arrange(): reorder the rows/cases of a dataset
summarize(): apply functions on columns
group_by(): identify columns to group rows/cases by. All of the above can be applied on a grouped data frame

dplyr operates on data.frames as well as the tidyverse’s data.frame replacement, known as tibble.

Note that dplyr masks some builtin functions when loaded.

library(readr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

As an example, we read in the Heart Failure Clinical Records dataset from the UCI Machine Learning Repository using readr.

For this example, we are showing how you can specify column types when reading in data using the col_types argument of readr’s read_csv() function. In practice, you may want to let readr infer column types automatically and fix any types, as needed, afterwards.

dat <- read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv",
    col_types = cols(
        age = col_double(),
        anaemia = col_factor(),
        creatinine_phosphokinase = col_double(),
        diabetes = col_factor(),
        ejection_fraction = col_double(),
        high_blood_pressure = col_factor(),
        platelets = col_double(),
        serum_creatinine = col_double(),
        serum_sodium = col_double(),
        sex = col_factor(),
        smoking = col_factor(),
        time = col_double(),
        DEATH_EVENT = col_factor()
    )
)

dplyr and the tidyverse in general operate on data.frames and tibbles, which can be used as a drop-in replacement for data.frames, and heavily promote the use of the pipe operator |>.

39.1 Filter

Single condition

dat |> 
    filter(age > 60)

# A tibble: 137 × 13
     age anaemia creatinine_phosphokinase diabetes ejection_fraction
   <dbl> <fct>                      <dbl> <fct>                <dbl>
 1    75 0                            582 0                       20
 2    65 0                            146 0                       20
 3    65 1                            160 1                       20
 4    90 1                             47 0                       40
 5    75 1                            246 0                       15
 6    65 0                            157 0                       65
 7    80 1                            123 0                       35
 8    75 1                             81 0                       38
 9    62 0                            231 0                       25
10    82 1                            379 0                       50
# ℹ 127 more rows
# ℹ 8 more variables: high_blood_pressure <fct>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, sex <fct>, smoking <fct>,
#   time <dbl>, DEATH_EVENT <fct>

Multiple conditions

dat |> 
    filter(age > 60, anaemia == "1")

# A tibble: 61 × 13
     age anaemia creatinine_phosphokinase diabetes ejection_fraction
   <dbl> <fct>                      <dbl> <fct>                <dbl>
 1    65 1                            160 1                       20
 2    90 1                             47 0                       40
 3    75 1                            246 0                       15
 4    80 1                            123 0                       35
 5    75 1                             81 0                       38
 6    82 1                            379 0                       50
 7    87 1                            149 0                       38
 8    70 1                            125 0                       25
 9    65 1                             52 0                       25
10    65 1                            128 1                       30
# ℹ 51 more rows
# ℹ 8 more variables: high_blood_pressure <fct>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, sex <fct>, smoking <fct>,
#   time <dbl>, DEATH_EVENT <fct>

39.2 Select

Columns can be specified by name, index, or pattern-matching.

39.2.1 By column name

dat |> 
    select("age", "diabetes", "ejection_fraction")

# A tibble: 299 × 3
     age diabetes ejection_fraction
   <dbl> <fct>                <dbl>
 1    75 0                       20
 2    55 0                       38
 3    65 0                       20
 4    50 0                       20
 5    65 1                       20
 6    90 0                       40
 7    75 0                       15
 8    60 1                       60
 9    65 0                       65
10    80 0                       35
# ℹ 289 more rows

39.2.2 By integer column index

dat |> 
    select(c(7, 9))

# A tibble: 299 × 2
   platelets serum_sodium
       <dbl>        <dbl>
 1   265000           130
 2   263358.          136
 3   162000           129
 4   210000           137
 5   327000           116
 6   204000           132
 7   127000           137
 8   454000           131
 9   263358.          138
10   388000           133
# ℹ 289 more rows

dat |> 
    select(7:9)

# A tibble: 299 × 3
   platelets serum_creatinine serum_sodium
       <dbl>            <dbl>        <dbl>
 1   265000               1.9          130
 2   263358.              1.1          136
 3   162000               1.3          129
 4   210000               1.9          137
 5   327000               2.7          116
 6   204000               2.1          132
 7   127000               1.2          137
 8   454000               1.1          131
 9   263358.              1.5          138
10   388000               9.4          133
# ℹ 289 more rows

39.2.3 By character name range

For example, select all columns between platelets and serum_sodium in the order they appear in the data.frame/tibble:

dat |> 
    select(platelets:serum_sodium)

# A tibble: 299 × 3
   platelets serum_creatinine serum_sodium
       <dbl>            <dbl>        <dbl>
 1   265000               1.9          130
 2   263358.              1.1          136
 3   162000               1.3          129
 4   210000               1.9          137
 5   327000               2.7          116
 6   204000               2.1          132
 7   127000               1.2          137
 8   454000               1.1          131
 9   263358.              1.5          138
10   388000               9.4          133
# ℹ 289 more rows

39.2.4 Pattern-matching

dat |> 
    select(starts_with("serum_"))

# A tibble: 299 × 2
   serum_creatinine serum_sodium
              <dbl>        <dbl>
 1              1.9          130
 2              1.1          136
 3              1.3          129
 4              1.9          137
 5              2.7          116
 6              2.1          132
 7              1.2          137
 8              1.1          131
 9              1.5          138
10              9.4          133
# ℹ 289 more rows

dat |> 
    select(ends_with("_fraction"))

# A tibble: 299 × 1
   ejection_fraction
               <dbl>
 1                20
 2                38
 3                20
 4                20
 5                20
 6                40
 7                15
 8                60
 9                65
10                35
# ℹ 289 more rows

39.3 Count

Count the number of smokers over age 60 by sex:

dat |> 
    filter(smoking == "1", age > 60) |> 
    count(sex)

# A tibble: 2 × 2
  sex       n
  <fct> <int>
1 1        43
2 0         2

39.4 Summarize

Operate on a single column:

dat |> 
    summarize(mean(age))

# A tibble: 1 × 1
  `mean(age)`
        <dbl>
1        60.8

Operate on multiple columns using across():

dat |> 
    summarize(across(c(age, ejection_fraction), mean))

# A tibble: 1 × 2
    age ejection_fraction
  <dbl>             <dbl>
1  60.8              38.1

39.4.1 Grouped operation on single column

with single grouping variable:

dat |> 
    group_by(sex) |> 
    summarize(mean(age))

# A tibble: 2 × 2
  sex   `mean(age)`
  <fct>       <dbl>
1 1            61.4
2 0            59.8

or using the .by argument of summarize():

dat |> 
    summarize(mean(age), .by = sex)

# A tibble: 2 × 2
  sex   `mean(age)`
  <fct>       <dbl>
1 1            61.4
2 0            59.8

with multiple grouping variables:

dat |> 
    group_by(sex, diabetes) |>
    summarize(mean(age))

`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.

# A tibble: 4 × 3
# Groups:   sex [2]
  sex   diabetes `mean(age)`
  <fct> <fct>          <dbl>
1 1     0               62.0
2 1     1               60.4
3 0     0               61.6
4 0     1               58.2

dat |> 
    summarize(mean(age), .by = c(sex, diabetes))

# A tibble: 4 × 3
  sex   diabetes `mean(age)`
  <fct> <fct>          <dbl>
1 1     0               62.0
2 0     1               58.2
3 1     1               60.4
4 0     0               61.6

39.4.2 Grouped operation on multiple columns

Group by single variable:

dat |> 
    group_by(sex) |> 
    summarize(across(c(age, serum_sodium), mean))

# A tibble: 2 × 3
  sex     age serum_sodium
  <fct> <dbl>        <dbl>
1 1      61.4         137.
2 0      59.8         137.

or:

dat |> 
    summarize(across(c(age, serum_sodium), mean), .by = sex)

# A tibble: 2 × 3
  sex     age serum_sodium
  <fct> <dbl>        <dbl>
1 1      61.4         137.
2 0      59.8         137.

Group by multiple variables:

dat |> 
    group_by(smoking, anaemia) |> 
    summarize(across(c(age, serum_sodium), mean))

`summarise()` has grouped output by 'smoking'. You can override using the
`.groups` argument.

# A tibble: 4 × 4
# Groups:   smoking [2]
  smoking anaemia   age serum_sodium
  <fct>   <fct>   <dbl>        <dbl>
1 0       0        59.7         136.
2 0       1        61.8         137.
3 1       0        60.4         136.
4 1       1        62.6         137

or:

dat |> 
    summarize(across(c(age, serum_sodium), mean), .by = c(smoking, anaemia))

# A tibble: 4 × 4
  smoking anaemia   age serum_sodium
  <fct>   <fct>   <dbl>        <dbl>
1 0       0        59.7         136.
2 1       0        60.4         136.
3 0       1        61.8         137.
4 1       1        62.6         137

39.5 Arrange

arrange() sorts a data.frame/tibble, i.e. reorders the rows/cases of a dataset.

dat |> 
    select(age, sex, serum_sodium) |> 
    arrange(age)

# A tibble: 299 × 3
     age sex   serum_sodium
   <dbl> <fct>        <dbl>
 1    40 1              136
 2    40 0              140
 3    40 0              141
 4    40 1              137
 5    40 1              136
 6    40 1              142
 7    40 1              132
 8    41 1              140
 9    42 0              136
10    42 1              140
# ℹ 289 more rows

39.6 Slice

Return the row with the highest ejection_fraction value:

dat |>
  slice_max(ejection_fraction)

# A tibble: 1 × 13
    age anaemia creatinine_phosphokinase diabetes ejection_fraction
  <dbl> <fct>                      <dbl> <fct>                <dbl>
1    45 0                            582 0                       80
# ℹ 8 more variables: high_blood_pressure <fct>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, sex <fct>, smoking <fct>,
#   time <dbl>, DEATH_EVENT <fct>

Return the row with the minimum age:

dat |>
  slice_min(age)

# A tibble: 7 × 13
    age anaemia creatinine_phosphokinase diabetes ejection_fraction
  <dbl> <fct>                      <dbl> <fct>                <dbl>
1    40 0                            478 1                       30
2    40 0                            244 0                       45
3    40 1                            101 0                       40
4    40 1                            129 0                       35
5    40 0                             90 0                       35
6    40 0                            624 0                       35
7    40 0                            582 1                       35
# ℹ 8 more variables: high_blood_pressure <fct>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, sex <fct>, smoking <fct>,
#   time <dbl>, DEATH_EVENT <fct>

slice_max() and slice_min() return all rows with the maximum/minimum value by default. To return only a single row, use the with_ties = FALSE argument.

dat |>
  slice_min(age, with_ties = FALSE)

# A tibble: 1 × 13
    age anaemia creatinine_phosphokinase diabetes ejection_fraction
  <dbl> <fct>                      <dbl> <fct>                <dbl>
1    40 0                            478 1                       30
# ℹ 8 more variables: high_blood_pressure <fct>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, sex <fct>, smoking <fct>,
#   time <dbl>, DEATH_EVENT <fct>

39.7 Relocate

Move the sex column after the age column:

dat <- dat |>
    relocate(sex, .after = age)

names(dat)

 [1] "age"                      "sex"                     
 [3] "anaemia"                  "creatinine_phosphokinase"
 [5] "diabetes"                 "ejection_fraction"       
 [7] "high_blood_pressure"      "platelets"               
 [9] "serum_creatinine"         "serum_sodium"            
[11] "smoking"                  "time"                    
[13] "DEATH_EVENT"

Make the diabetes column the last column:

dat <- dat |>
    relocate(diabetes, .after = last_col())

names(dat)

 [1] "age"                      "sex"                     
 [3] "anaemia"                  "creatinine_phosphokinase"
 [5] "ejection_fraction"        "high_blood_pressure"     
 [7] "platelets"                "serum_creatinine"        
 [9] "serum_sodium"             "smoking"                 
[11] "time"                     "DEATH_EVENT"             
[13] "diabetes"

Note: above we use names() to show the column names of the resulting data.frame/tibble and verify the changes, since only the first few columns are printed by default.

39.8 Mutate

Add a new column named Age_days:

dat <- dat |> 
    mutate(Age_days = age * 365)

dat["Age_days"]

# A tibble: 299 × 1
   Age_days
      <dbl>
 1    27375
 2    20075
 3    23725
 4    18250
 5    23725
 6    32850
 7    27375
 8    21900
 9    23725
10    29200
# ℹ 289 more rows

39.8.1 Mutate by group

dat <- dat |> 
    mutate(demeaned_sodium_bysex = serum_sodium - mean(serum_sodium), .by = sex)

Note that ungroup() is required to return the data.frame/tibble to its original ungrouped state. If you don’t do this, all subsequent operations will be performed on the same grouping.

Inspect the new column:

dat |> 
    select(sex, serum_sodium, demeaned_sodium_bysex)

# A tibble: 299 × 3
   sex   serum_sodium demeaned_sodium_bysex
   <fct>        <dbl>                 <dbl>
 1 1              130                -6.54 
 2 1              136                -0.536
 3 1              129                -7.54 
 4 1              137                 0.464
 5 0              116               -20.8  
 6 1              132                -4.54 
 7 1              137                 0.464
 8 1              131                -5.54 
 9 0              138                 1.21 
10 1              133                -3.54 
# ℹ 289 more rows

39.9 Rename

Rename the DEATH_EVENT to Mortality:

dat <- dat |> 
    rename(Mortality = DEATH_EVENT)

names(dat)

 [1] "age"                      "sex"                     
 [3] "anaemia"                  "creatinine_phosphokinase"
 [5] "ejection_fraction"        "high_blood_pressure"     
 [7] "platelets"                "serum_creatinine"        
 [9] "serum_sodium"             "smoking"                 
[11] "time"                     "Mortality"               
[13] "diabetes"                 "Age_days"                
[15] "demeaned_sodium_bysex"

39.10 Specifying multiple columns

dplyr includes a number of ways to identify multiple variables. The latest version of dplyr suggests using across() within dplyr functions that allow specifying columns, as we’ve used above.

This replaces separate functions previously used for each of filter/select/mutate/summarize/arrange that had independent functions ending with *_all(), *_at(), *_each(), *_if().

Other than specifying multiple columns by name, they can also be specified by index, by range, or by pattern-matching.

39.10.1 By integer column index

dat |> 
    summarize(across(c(7, 9), mean))

# A tibble: 1 × 2
  platelets serum_sodium
      <dbl>        <dbl>
1   263358.         137.

dat |> 
    summarize(across(7:9, mean))

# A tibble: 1 × 3
  platelets serum_creatinine serum_sodium
      <dbl>            <dbl>        <dbl>
1   263358.             1.39         137.

39.10.2 By character name range

For example, select all columns between platelets and serum_sodium in the order they appear in the data.frame/tibble:

dat |> 
    summarize(across(platelets:serum_sodium, mean))

# A tibble: 1 × 3
  platelets serum_creatinine serum_sodium
      <dbl>            <dbl>        <dbl>
1   263358.             1.39         137.

39.10.3 Pattern-matching

dat |> 
    summarize(across(starts_with("serum_"), mean))

# A tibble: 1 × 2
  serum_creatinine serum_sodium
             <dbl>        <dbl>
1             1.39         137.

dat |> 
    summarize(across(ends_with("fraction"), mean))

# A tibble: 1 × 1
  ejection_fraction
              <dbl>
1              38.1

39.10.4 Using predicate function wrapped in `where()`

A predicate function is a function that returns a logical value, i.e. either TRUE or FALSE.

For example, you can use is.* functions to identify columns based on their type, e.g. numeric or factor.

dat |> 
    summarize(across(where(is.numeric), mean))

# A tibble: 1 × 9
    age creatinine_phosphokinase ejection_fraction platelets serum_creatinine
  <dbl>                    <dbl>             <dbl>     <dbl>            <dbl>
1  60.8                     582.              38.1   263358.             1.39
# ℹ 4 more variables: serum_sodium <dbl>, time <dbl>, Age_days <dbl>,
#   demeaned_sodium_bysex <dbl>

To use a summary function that returns more than one row per group, use reframe() instead of summarize(). For example, to get the range (min and max) of all numeric columns:

dat |> 
    reframe(across(where(is.numeric), range))

# A tibble: 2 × 9
    age creatinine_phosphokinase ejection_fraction platelets serum_creatinine
  <dbl>                    <dbl>             <dbl>     <dbl>            <dbl>
1    40                       23                14     25100              0.5
2    95                     7861                80    850000              9.4
# ℹ 4 more variables: serum_sodium <dbl>, time <dbl>, Age_days <dbl>,
#   demeaned_sodium_bysex <dbl>

39.10.5 Using anonymous predicate function with `where()`

It is often useful to be able to get the number of unique values per column, e.g. in order to identify which columns may need to be converted to factors. dplyr includes the function n_distinct() which counts the number of unique values of a vector.

The following example shows dplyr’s formula shorthand syntax to build an anonymous function using n_distinct() in order to select columns with fewer than 10 unique values:

dat |> 
    select(where(~ n_distinct(.x) < 10))

# A tibble: 299 × 6
   sex   anaemia high_blood_pressure smoking Mortality diabetes
   <fct> <fct>   <fct>               <fct>   <fct>     <fct>   
 1 1     0       1                   0       1         0       
 2 1     0       0                   0       1         0       
 3 1     0       0                   1       1         0       
 4 1     1       0                   0       1         0       
 5 0     1       0                   0       1         1       
 6 1     1       1                   1       1         0       
 7 1     1       0                   0       1         0       
 8 1     1       0                   1       1         1       
 9 0     0       0                   0       1         0       
10 1     1       1                   1       1         0       
# ℹ 289 more rows

which is equivalent to using a standard R anonymous function:

dat |> 
    select(where(\(x) n_distinct(x) < 10))

# A tibble: 299 × 6
   sex   anaemia high_blood_pressure smoking Mortality diabetes
   <fct> <fct>   <fct>               <fct>   <fct>     <fct>   
 1 1     0       1                   0       1         0       
 2 1     0       0                   0       1         0       
 3 1     0       0                   1       1         0       
 4 1     1       0                   0       1         0       
 5 0     1       0                   0       1         1       
 6 1     1       1                   1       1         0       
 7 1     1       0                   0       1         0       
 8 1     1       0                   1       1         1       
 9 0     0       0                   0       1         0       
10 1     1       1                   1       1         0       
# ℹ 289 more rows

n_distinct(x) is equivalent to length(unique(x)):

dat |> 
    select(where(\(x) length(unique(x)) < 10))

# A tibble: 299 × 6
   sex   anaemia high_blood_pressure smoking Mortality diabetes
   <fct> <fct>   <fct>               <fct>   <fct>     <fct>   
 1 1     0       1                   0       1         0       
 2 1     0       0                   0       1         0       
 3 1     0       0                   1       1         0       
 4 1     1       0                   0       1         0       
 5 0     1       0                   0       1         1       
 6 1     1       1                   1       1         0       
 7 1     1       0                   0       1         0       
 8 1     1       0                   1       1         1       
 9 0     0       0                   0       1         0       
10 1     1       1                   1       1         0       
# ℹ 289 more rows

To mutate those columns to factors, we combine across() and where() with the formula notation seen above within mutate():

dat <- dat |> 
    mutate(across(where(~ n_distinct(.) < 10), as.factor))

str(dat)

tibble [299 × 15] (S3: tbl_df/tbl/data.frame)
 $ age                     : num [1:299] 75 55 65 50 65 90 75 60 65 80 ...
 $ sex                     : Factor w/ 2 levels "1","0": 1 1 1 1 2 1 1 1 2 1 ...
 $ anaemia                 : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 1 2 ...
 $ creatinine_phosphokinase: num [1:299] 582 7861 146 111 160 ...
 $ ejection_fraction       : num [1:299] 20 38 20 20 20 40 15 60 65 35 ...
 $ high_blood_pressure     : Factor w/ 2 levels "1","0": 1 2 2 2 2 1 2 2 2 1 ...
 $ platelets               : num [1:299] 265000 263358 162000 210000 327000 ...
 $ serum_creatinine        : num [1:299] 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
 $ serum_sodium            : num [1:299] 130 136 129 137 116 132 137 131 138 133 ...
 $ smoking                 : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
 $ time                    : num [1:299] 4 6 7 7 8 8 10 10 10 10 ...
 $ Mortality               : Factor w/ 2 levels "1","0": 1 1 1 1 1 1 1 1 1 1 ...
 $ diabetes                : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
 $ Age_days                : num [1:299] 27375 20075 23725 18250 23725 ...
 $ demeaned_sodium_bysex   : num [1:299] -6.536 -0.536 -7.536 0.464 -20.79 ...

39.11 Combining multiple dplyr operations

For example, filter for smokers over age 60 and show mean platelets by sex:

dat |> 
    filter(smoking == "1", age > 60) |> 
    summarize(Mean_platelets_by_sex = mean(platelets), .by = sex)

# A tibble: 2 × 2
  sex   Mean_platelets_by_sex
  <fct>                 <dbl>
1 1                   248792.
2 0                   469000

39.12 dplyr backends

Using dplyr on data.frames or tibbles does not offer any performance benefits over performing the equivalent operations in base R (and may in fact be slightly slower).

A number of alternative backends, at different stages of maturity, are available that offer performance benefits for large datasets.

39.13 Resources

39.14 See also

data.table

39.1 Filter

39.2 Select

39.2.1 By column name

39.2.2 By integer column index

39.2.3 By character name range

39.2.4 Pattern-matching

39.3 Count

39.4 Summarize

39.4.1 Grouped operation on single column

39.4.2 Grouped operation on multiple columns

39.5 Arrange

39.6 Slice

39.7 Relocate

39.8 Mutate

39.8.1 Mutate by group

39.9 Rename

39.10 Specifying multiple columns

39.10.1 By integer column index

39.10.2 By character name range

39.10.3 Pattern-matching

39.10.4 Using predicate function wrapped in where()

39.10.5 Using anonymous predicate function with where()

39.11 Combining multiple dplyr operations

39.12 dplyr backends

39.13 Resources

39.14 See also

39.10.4 Using predicate function wrapped in `where()`

39.10.5 Using anonymous predicate function with `where()`