Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
The dplyr package offers extensive functionality for data manipulation.
dplyr’s functions are named after verbs (like filter, select, etc.) and are commonly used with the pipe operator to build pipelines. The package offers a large number of functions in total, often with multiple versions of the same “verb”. It has undergone many major changes since its introduction, so always make sure to consult the latest documentation.
Core operations include:
filter(): identify casesselect(): identify columnsmutate(): create new variables as a function of existing onesarrange(): reorder the rows/cases of a datasetsummarize(): apply functions on columnsgroup_by(): identify columns to group rows/cases by. All of the above can be applied on a grouped data framedplyr operates on data.frames as well as the tidyverse’s data.frame replacement, known as tibble.
Note that dplyr masks some builtin functions when loaded.
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
As an example, we read in the Heart Failure Clinical Records dataset from the UCI Machine Learning Repository using readr.
For this example, we are showing how you can specify column types when reading in data using the col_types argument of readr’s read_csv() function. In practice, you may want to let readr infer column types automatically and fix any types, as needed, afterwards.
dat <- read_csv(
"https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv",
col_types = cols(
age = col_double(),
anaemia = col_factor(),
creatinine_phosphokinase = col_double(),
diabetes = col_factor(),
ejection_fraction = col_double(),
high_blood_pressure = col_factor(),
platelets = col_double(),
serum_creatinine = col_double(),
serum_sodium = col_double(),
sex = col_factor(),
smoking = col_factor(),
time = col_double(),
DEATH_EVENT = col_factor()
)
)Single condition
dat |>
filter(age > 60)# A tibble: 137 × 13
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <fct> <dbl> <fct> <dbl>
1 75 0 582 0 20
2 65 0 146 0 20
3 65 1 160 1 20
4 90 1 47 0 40
5 75 1 246 0 15
6 65 0 157 0 65
7 80 1 123 0 35
8 75 1 81 0 38
9 62 0 231 0 25
10 82 1 379 0 50
# ℹ 127 more rows
# ℹ 8 more variables: high_blood_pressure <fct>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <fct>, smoking <fct>,
# time <dbl>, DEATH_EVENT <fct>
Multiple conditions
dat |>
filter(age > 60, anaemia == "1")# A tibble: 61 × 13
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <fct> <dbl> <fct> <dbl>
1 65 1 160 1 20
2 90 1 47 0 40
3 75 1 246 0 15
4 80 1 123 0 35
5 75 1 81 0 38
6 82 1 379 0 50
7 87 1 149 0 38
8 70 1 125 0 25
9 65 1 52 0 25
10 65 1 128 1 30
# ℹ 51 more rows
# ℹ 8 more variables: high_blood_pressure <fct>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <fct>, smoking <fct>,
# time <dbl>, DEATH_EVENT <fct>
Columns can be specified by name, index, or pattern-matching.
dat |>
select("age", "diabetes", "ejection_fraction")# A tibble: 299 × 3
age diabetes ejection_fraction
<dbl> <fct> <dbl>
1 75 0 20
2 55 0 38
3 65 0 20
4 50 0 20
5 65 1 20
6 90 0 40
7 75 0 15
8 60 1 60
9 65 0 65
10 80 0 35
# ℹ 289 more rows
# A tibble: 299 × 2
platelets serum_sodium
<dbl> <dbl>
1 265000 130
2 263358. 136
3 162000 129
4 210000 137
5 327000 116
6 204000 132
7 127000 137
8 454000 131
9 263358. 138
10 388000 133
# ℹ 289 more rows
dat |>
select(7:9)# A tibble: 299 × 3
platelets serum_creatinine serum_sodium
<dbl> <dbl> <dbl>
1 265000 1.9 130
2 263358. 1.1 136
3 162000 1.3 129
4 210000 1.9 137
5 327000 2.7 116
6 204000 2.1 132
7 127000 1.2 137
8 454000 1.1 131
9 263358. 1.5 138
10 388000 9.4 133
# ℹ 289 more rows
For example, select all columns between platelets and serum_sodium in the order they appear in the data.frame/tibble:
dat |>
select(platelets:serum_sodium)# A tibble: 299 × 3
platelets serum_creatinine serum_sodium
<dbl> <dbl> <dbl>
1 265000 1.9 130
2 263358. 1.1 136
3 162000 1.3 129
4 210000 1.9 137
5 327000 2.7 116
6 204000 2.1 132
7 127000 1.2 137
8 454000 1.1 131
9 263358. 1.5 138
10 388000 9.4 133
# ℹ 289 more rows
dat |>
select(starts_with("serum_"))# A tibble: 299 × 2
serum_creatinine serum_sodium
<dbl> <dbl>
1 1.9 130
2 1.1 136
3 1.3 129
4 1.9 137
5 2.7 116
6 2.1 132
7 1.2 137
8 1.1 131
9 1.5 138
10 9.4 133
# ℹ 289 more rows
Count the number of smokers over age 60 by sex:
Operate on a single column:
Operate on multiple columns using across():
# A tibble: 1 × 2
age ejection_fraction
<dbl> <dbl>
1 60.8 38.1
with single grouping variable:
# A tibble: 2 × 2
sex `mean(age)`
<fct> <dbl>
1 1 61.4
2 0 59.8
or using the .by argument of summarize():
# A tibble: 2 × 2
sex `mean(age)`
<fct> <dbl>
1 1 61.4
2 0 59.8
with multiple grouping variables:
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
# A tibble: 4 × 3
# Groups: sex [2]
sex diabetes `mean(age)`
<fct> <fct> <dbl>
1 1 0 62.0
2 1 1 60.4
3 0 0 61.6
4 0 1 58.2
or
Group by single variable:
# A tibble: 2 × 3
sex age serum_sodium
<fct> <dbl> <dbl>
1 1 61.4 137.
2 0 59.8 137.
or:
# A tibble: 2 × 3
sex age serum_sodium
<fct> <dbl> <dbl>
1 1 61.4 137.
2 0 59.8 137.
Group by multiple variables:
`summarise()` has grouped output by 'smoking'. You can override using the
`.groups` argument.
# A tibble: 4 × 4
# Groups: smoking [2]
smoking anaemia age serum_sodium
<fct> <fct> <dbl> <dbl>
1 0 0 59.7 136.
2 0 1 61.8 137.
3 1 0 60.4 136.
4 1 1 62.6 137
or:
arrange() sorts a data.frame/tibble, i.e. reorders the rows/cases of a dataset.
Return the row with the highest ejection_fraction value:
dat |>
slice_max(ejection_fraction)# A tibble: 1 × 13
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <fct> <dbl> <fct> <dbl>
1 45 0 582 0 80
# ℹ 8 more variables: high_blood_pressure <fct>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <fct>, smoking <fct>,
# time <dbl>, DEATH_EVENT <fct>
Return the row with the minimum age:
dat |>
slice_min(age)# A tibble: 7 × 13
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <fct> <dbl> <fct> <dbl>
1 40 0 478 1 30
2 40 0 244 0 45
3 40 1 101 0 40
4 40 1 129 0 35
5 40 0 90 0 35
6 40 0 624 0 35
7 40 0 582 1 35
# ℹ 8 more variables: high_blood_pressure <fct>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <fct>, smoking <fct>,
# time <dbl>, DEATH_EVENT <fct>
slice_max() and slice_min() return all rows with the maximum/minimum value by default. To return only a single row, use the with_ties = FALSE argument.
dat |>
slice_min(age, with_ties = FALSE)# A tibble: 1 × 13
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <fct> <dbl> <fct> <dbl>
1 40 0 478 1 30
# ℹ 8 more variables: high_blood_pressure <fct>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <fct>, smoking <fct>,
# time <dbl>, DEATH_EVENT <fct>
Move the sex column after the age column:
[1] "age" "sex"
[3] "anaemia" "creatinine_phosphokinase"
[5] "diabetes" "ejection_fraction"
[7] "high_blood_pressure" "platelets"
[9] "serum_creatinine" "serum_sodium"
[11] "smoking" "time"
[13] "DEATH_EVENT"
Make the diabetes column the last column:
[1] "age" "sex"
[3] "anaemia" "creatinine_phosphokinase"
[5] "ejection_fraction" "high_blood_pressure"
[7] "platelets" "serum_creatinine"
[9] "serum_sodium" "smoking"
[11] "time" "DEATH_EVENT"
[13] "diabetes"
Note: above we use names() to show the column names of the resulting data.frame/tibble and verify the changes, since only the first few columns are printed by default.
Add a new column named Age_days:
dat <- dat |>
mutate(Age_days = age * 365)dat["Age_days"]# A tibble: 299 × 1
Age_days
<dbl>
1 27375
2 20075
3 23725
4 18250
5 23725
6 32850
7 27375
8 21900
9 23725
10 29200
# ℹ 289 more rows
Note that ungroup() is required to return the data.frame/tibble to its original ungrouped state. If you don’t do this, all subsequent operations will be performed on the same grouping.
Inspect the new column:
dat |>
select(sex, serum_sodium, demeaned_sodium_bysex)# A tibble: 299 × 3
sex serum_sodium demeaned_sodium_bysex
<fct> <dbl> <dbl>
1 1 130 -6.54
2 1 136 -0.536
3 1 129 -7.54
4 1 137 0.464
5 0 116 -20.8
6 1 132 -4.54
7 1 137 0.464
8 1 131 -5.54
9 0 138 1.21
10 1 133 -3.54
# ℹ 289 more rows
Rename the DEATH_EVENT to Mortality:
[1] "age" "sex"
[3] "anaemia" "creatinine_phosphokinase"
[5] "ejection_fraction" "high_blood_pressure"
[7] "platelets" "serum_creatinine"
[9] "serum_sodium" "smoking"
[11] "time" "Mortality"
[13] "diabetes" "Age_days"
[15] "demeaned_sodium_bysex"
dplyr includes a number of ways to identify multiple variables. The latest version of dplyr suggests using across() within dplyr functions that allow specifying columns, as we’ve used above.
This replaces separate functions previously used for each of filter/select/mutate/summarize/arrange that had independent functions ending with *_all(), *_at(), *_each(), *_if().
Other than specifying multiple columns by name, they can also be specified by index, by range, or by pattern-matching.
# A tibble: 1 × 2
platelets serum_sodium
<dbl> <dbl>
1 263358. 137.
For example, select all columns between platelets and serum_sodium in the order they appear in the data.frame/tibble:
dat |>
summarize(across(starts_with("serum_"), mean))# A tibble: 1 × 2
serum_creatinine serum_sodium
<dbl> <dbl>
1 1.39 137.
where()
A predicate function is a function that returns a logical value, i.e. either TRUE or FALSE.
For example, you can use is.* functions to identify columns based on their type, e.g. numeric or factor.
# A tibble: 1 × 9
age creatinine_phosphokinase ejection_fraction platelets serum_creatinine
<dbl> <dbl> <dbl> <dbl> <dbl>
1 60.8 582. 38.1 263358. 1.39
# ℹ 4 more variables: serum_sodium <dbl>, time <dbl>, Age_days <dbl>,
# demeaned_sodium_bysex <dbl>
To use a summary function that returns more than one row per group, use reframe() instead of summarize(). For example, to get the range (min and max) of all numeric columns:
# A tibble: 2 × 9
age creatinine_phosphokinase ejection_fraction platelets serum_creatinine
<dbl> <dbl> <dbl> <dbl> <dbl>
1 40 23 14 25100 0.5
2 95 7861 80 850000 9.4
# ℹ 4 more variables: serum_sodium <dbl>, time <dbl>, Age_days <dbl>,
# demeaned_sodium_bysex <dbl>
where()
It is often useful to be able to get the number of unique values per column, e.g. in order to identify which columns may need to be converted to factors. dplyr includes the function n_distinct() which counts the number of unique values of a vector.
The following example shows dplyr’s formula shorthand syntax to build an anonymous function using n_distinct() in order to select columns with fewer than 10 unique values:
dat |>
select(where(~ n_distinct(.x) < 10))# A tibble: 299 × 6
sex anaemia high_blood_pressure smoking Mortality diabetes
<fct> <fct> <fct> <fct> <fct> <fct>
1 1 0 1 0 1 0
2 1 0 0 0 1 0
3 1 0 0 1 1 0
4 1 1 0 0 1 0
5 0 1 0 0 1 1
6 1 1 1 1 1 0
7 1 1 0 0 1 0
8 1 1 0 1 1 1
9 0 0 0 0 1 0
10 1 1 1 1 1 0
# ℹ 289 more rows
which is equivalent to using a standard R anonymous function:
dat |>
select(where(\(x) n_distinct(x) < 10))# A tibble: 299 × 6
sex anaemia high_blood_pressure smoking Mortality diabetes
<fct> <fct> <fct> <fct> <fct> <fct>
1 1 0 1 0 1 0
2 1 0 0 0 1 0
3 1 0 0 1 1 0
4 1 1 0 0 1 0
5 0 1 0 0 1 1
6 1 1 1 1 1 0
7 1 1 0 0 1 0
8 1 1 0 1 1 1
9 0 0 0 0 1 0
10 1 1 1 1 1 0
# ℹ 289 more rows
n_distinct(x) is equivalent to length(unique(x)):
# A tibble: 299 × 6
sex anaemia high_blood_pressure smoking Mortality diabetes
<fct> <fct> <fct> <fct> <fct> <fct>
1 1 0 1 0 1 0
2 1 0 0 0 1 0
3 1 0 0 1 1 0
4 1 1 0 0 1 0
5 0 1 0 0 1 1
6 1 1 1 1 1 0
7 1 1 0 0 1 0
8 1 1 0 1 1 1
9 0 0 0 0 1 0
10 1 1 1 1 1 0
# ℹ 289 more rows
To mutate those columns to factors, we combine across() and where() with the formula notation seen above within mutate():
dat <- dat |>
mutate(across(where(~ n_distinct(.) < 10), as.factor))str(dat)tibble [299 × 15] (S3: tbl_df/tbl/data.frame)
$ age : num [1:299] 75 55 65 50 65 90 75 60 65 80 ...
$ sex : Factor w/ 2 levels "1","0": 1 1 1 1 2 1 1 1 2 1 ...
$ anaemia : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 1 2 ...
$ creatinine_phosphokinase: num [1:299] 582 7861 146 111 160 ...
$ ejection_fraction : num [1:299] 20 38 20 20 20 40 15 60 65 35 ...
$ high_blood_pressure : Factor w/ 2 levels "1","0": 1 2 2 2 2 1 2 2 2 1 ...
$ platelets : num [1:299] 265000 263358 162000 210000 327000 ...
$ serum_creatinine : num [1:299] 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
$ serum_sodium : num [1:299] 130 136 129 137 116 132 137 131 138 133 ...
$ smoking : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
$ time : num [1:299] 4 6 7 7 8 8 10 10 10 10 ...
$ Mortality : Factor w/ 2 levels "1","0": 1 1 1 1 1 1 1 1 1 1 ...
$ diabetes : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
$ Age_days : num [1:299] 27375 20075 23725 18250 23725 ...
$ demeaned_sodium_bysex : num [1:299] -6.536 -0.536 -7.536 0.464 -20.79 ...
For example, filter for smokers over age 60 and show mean platelets by sex:
Using dplyr on data.frames or tibbles does not offer any performance benefits over performing the equivalent operations in base R (and may in fact be slightly slower).
A number of alternative backends, at different stages of maturity, are available that offer performance benefits for large datasets.