The duckplyr package offers extensive functionality for data manipulation. duckplyr’s functions are named after verbs (like filter, select, etc.) and are commonly used with the pipe operator to build pipelines. The package offers a large number of functions in total, often with multiple versions of the same “verb”. It has undergone many major changes since its introduction, so always make sure to consult the latest documentation.
Core operations include:
duckplyr operates on data.frames as well as the tidyverse’s data.frame replacement, known as tibble .
Note that duckplyr masks some builtin functions when loaded.
Loading required package: dplyr
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
The duckplyr package is configured to fall back to dplyr when it encounters an
incompatibility. Fallback events can be collected and uploaded for analysis to
guide future development. By default, data will be collected but no data will
be uploaded.
ℹ Automatic fallback uploading is not controlled and therefore disabled, see
`?duckplyr::fallback()`.
✔ Number of reports ready for upload: 5.
→ Review with `duckplyr::fallback_review()`, upload with
`duckplyr::fallback_upload()`.
ℹ Configure automatic uploading with `duckplyr::fallback_config()`.
✔ Overwriting dplyr methods with duckplyr methods.
ℹ Turn off with `duckplyr::methods_restore()`.
As an example, we read in the Heart Failure Clinical Records dataset from the UCI Machine Learning Repository using readr .
To read from https or S3 URLs, we need to install and load the httpfs extension in DuckDB first:
dat <- read_csv_duckdb (
"https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv"
) |>
as_duckdb_tibble ( )
dat
# A duckplyr data frame: 13 variables
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <dbl> <dbl> <dbl> <dbl>
1 75 0 582 0 20
2 55 0 7861 0 38
3 65 0 146 0 20
4 50 1 111 0 20
5 65 1 160 1 20
6 90 1 47 0 40
7 75 1 246 0 15
8 60 1 315 1 60
9 65 0 157 0 65
10 80 1 123 0 35
# ℹ more rows
# ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
# time <dbl>, DEATH_EVENT <dbl>
Filter
Single condition
# A duckplyr data frame: 13 variables
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <dbl> <dbl> <dbl> <dbl>
1 75 0 582 0 20
2 65 0 146 0 20
3 65 1 160 1 20
4 90 1 47 0 40
5 75 1 246 0 15
6 65 0 157 0 65
7 80 1 123 0 35
8 75 1 81 0 38
9 62 0 231 0 25
10 82 1 379 0 50
# ℹ more rows
# ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
# time <dbl>, DEATH_EVENT <dbl>
Multiple conditions
dat |>
filter ( age > 60 , anaemia == "1" )
# A duckplyr data frame: 13 variables
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <dbl> <dbl> <dbl> <dbl>
1 65 1 160 1 20
2 90 1 47 0 40
3 75 1 246 0 15
4 80 1 123 0 35
5 75 1 81 0 38
6 82 1 379 0 50
7 87 1 149 0 38
8 70 1 125 0 25
9 65 1 52 0 25
10 65 1 128 1 30
# ℹ more rows
# ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
# time <dbl>, DEATH_EVENT <dbl>
Select
Columns can be specified by name, index, or pattern-matching.
By column name
dat |>
select ( "age" , "diabetes" , "ejection_fraction" )
# A duckplyr data frame: 3 variables
age diabetes ejection_fraction
<dbl> <dbl> <dbl>
1 75 0 20
2 55 0 38
3 65 0 20
4 50 0 20
5 65 1 20
6 90 0 40
7 75 0 15
8 60 1 60
9 65 0 65
10 80 0 35
# ℹ more rows
By integer column index
# A duckplyr data frame: 2 variables
platelets serum_sodium
<dbl> <dbl>
1 265000 130
2 263358. 136
3 162000 129
4 210000 137
5 327000 116
6 204000 132
7 127000 137
8 454000 131
9 263358. 138
10 388000 133
# ℹ more rows
# A duckplyr data frame: 3 variables
platelets serum_creatinine serum_sodium
<dbl> <dbl> <dbl>
1 265000 1.9 130
2 263358. 1.1 136
3 162000 1.3 129
4 210000 1.9 137
5 327000 2.7 116
6 204000 2.1 132
7 127000 1.2 137
8 454000 1.1 131
9 263358. 1.5 138
10 388000 9.4 133
# ℹ more rows
By character name range
For example, select all columns between platelets and serum_sodium in the order they appear in the data.frame/tibble:
dat |>
select ( platelets : serum_sodium )
# A duckplyr data frame: 3 variables
platelets serum_creatinine serum_sodium
<dbl> <dbl> <dbl>
1 265000 1.9 130
2 263358. 1.1 136
3 162000 1.3 129
4 210000 1.9 137
5 327000 2.7 116
6 204000 2.1 132
7 127000 1.2 137
8 454000 1.1 131
9 263358. 1.5 138
10 388000 9.4 133
# ℹ more rows
Pattern-matching
# A duckplyr data frame: 2 variables
serum_creatinine serum_sodium
<dbl> <dbl>
1 1.9 130
2 1.1 136
3 1.3 129
4 1.9 137
5 2.7 116
6 2.1 132
7 1.2 137
8 1.1 131
9 1.5 138
10 9.4 133
# ℹ more rows
# A duckplyr data frame: 1 variable
ejection_fraction
<dbl>
1 20
2 38
3 20
4 20
5 20
6 40
7 15
8 60
9 65
10 35
# ℹ more rows
Count
Count the number of smokers over age 60 by sex:
# A duckplyr data frame: 2 variables
sex n
<dbl> <int>
1 0 2
2 1 43
Summarize
Operate on a single column:
# A duckplyr data frame: 1 variable
`mean(age)`
<dbl>
1 60.8
Operate on multiple columns using across() :
# A duckplyr data frame: 2 variables
age ejection_fraction
<dbl> <dbl>
1 60.8 38.1
Grouped operation on single column
with single grouping variable:
# A duckplyr data frame: 2 variables
sex `mean(age)`
<dbl> <dbl>
1 0 59.8
2 1 61.4
with multiple grouping variables:
# A duckplyr data frame: 3 variables
sex diabetes `mean(age)`
<dbl> <dbl> <dbl>
1 1 1 60.4
2 0 0 61.6
3 1 0 62.0
4 0 1 58.2
Grouped operation on multiple columns
Group by single variable:
# A duckplyr data frame: 3 variables
sex age serum_sodium
<dbl> <dbl> <dbl>
1 1 61.4 137.
2 0 59.8 137.
Group by multiple variables:
# A duckplyr data frame: 4 variables
smoking anaemia age serum_sodium
<dbl> <dbl> <dbl> <dbl>
1 1 0 60.4 136.
2 1 1 62.6 137
3 0 0 59.7 136.
4 0 1 61.8 137.
Arrange
arrange() sorts a data.frame/tibble, i.e. reorders the rows/cases of a dataset.
# A duckplyr data frame: 3 variables
age sex serum_sodium
<dbl> <dbl> <dbl>
1 40 1 132
2 40 1 136
3 40 1 136
4 40 1 137
5 40 0 141
6 40 1 142
7 40 0 140
8 41 1 140
9 42 1 140
10 42 1 140
# ℹ more rows
Slice
Return the row with the highest ejection_fraction value:
# A duckplyr data frame: 13 variables
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <dbl> <dbl> <dbl> <dbl>
1 45 0 582 0 80
# ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
# time <dbl>, DEATH_EVENT <dbl>
Return the row with the minimum age:
# A duckplyr data frame: 13 variables
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <dbl> <dbl> <dbl> <dbl>
1 40 0 478 1 30
2 40 0 244 0 45
3 40 1 101 0 40
4 40 1 129 0 35
5 40 0 90 0 35
6 40 0 624 0 35
7 40 0 582 1 35
# ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
# time <dbl>, DEATH_EVENT <dbl>
slice_max() and slice_min() return all rows with the maximum/minimum value by default. To return only a single row, use the with_ties = FALSE argument.
# A duckplyr data frame: 13 variables
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <dbl> <dbl> <dbl> <dbl>
1 40 0 478 1 30
# ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
# time <dbl>, DEATH_EVENT <dbl>
Relocate
Move the sex column after the age column:
dat <- dat |>
relocate ( sex , .after = age )
dat
# A duckplyr data frame: 13 variables
age sex anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 75 1 0 582 0 20
2 55 1 0 7861 0 38
3 65 1 0 146 0 20
4 50 1 1 111 0 20
5 65 0 1 160 1 20
6 90 1 1 47 0 40
7 75 1 1 246 0 15
8 60 1 1 315 1 60
9 65 0 0 157 0 65
10 80 1 1 123 0 35
# ℹ more rows
# ℹ 7 more variables: high_blood_pressure <dbl>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, smoking <dbl>, time <dbl>,
# DEATH_EVENT <dbl>
Make the diabetes column the last column:
[1] "age" "sex"
[3] "anaemia" "creatinine_phosphokinase"
[5] "ejection_fraction" "high_blood_pressure"
[7] "platelets" "serum_creatinine"
[9] "serum_sodium" "smoking"
[11] "time" "DEATH_EVENT"
[13] "diabetes"
Note: above we use names() to show the column names of the resulting data.frame/tibble and verify the changes, since only the first few columns are printed by default.
Mutate
Add a new column named Age_days:
dat <- dat |>
mutate ( Age_days = age * 365 )
# A duckplyr data frame: 1 variable
Age_days
<dbl>
1 27375
2 20075
3 23725
4 18250
5 23725
6 32850
7 27375
8 21900
9 23725
10 29200
# ℹ more rows
Mutate by group
dat <- dat |>
mutate ( demeaned_sodium_bysex = serum_sodium - mean ( serum_sodium ) , .by = "sex" )
Note that ungroup() is required to return the data.frame/tibble to its original ungrouped state. If you don’t do this, all subsequent operations will be performed on the same grouping.
Inspect the new column:
dat |>
select ( sex , serum_sodium , demeaned_sodium_bysex )
# A duckplyr data frame: 3 variables
sex serum_sodium demeaned_sodium_bysex
<dbl> <dbl> <dbl>
1 1 130 -6.54
2 1 136 -0.536
3 1 129 -7.54
4 1 137 0.464
5 0 116 -20.8
6 1 132 -4.54
7 1 137 0.464
8 1 131 -5.54
9 0 138 1.21
10 1 133 -3.54
# ℹ more rows
Rename
Rename the DEATH_EVENT to Mortality:
dat <- dat |>
rename ( Mortality = DEATH_EVENT )
names ( dat )
[1] "age" "sex"
[3] "anaemia" "creatinine_phosphokinase"
[5] "ejection_fraction" "high_blood_pressure"
[7] "platelets" "serum_creatinine"
[9] "serum_sodium" "smoking"
[11] "time" "Mortality"
[13] "diabetes" "Age_days"
[15] "demeaned_sodium_bysex"
Specifying multiple columns
duckplyr includes a number of ways to identify multiple variables. The latest version of duckplyr suggests using across() within duckplyr functions that allow specifying columns, as we’ve used above. This replaces separate functions previously used for each of filter/select/mutate/summarize/arrange that had independent functions ending with *_all(), *_at(), *_each(), *_if(). Other than specifying multiple columns by name, they can also be specified by index, by range, or by pattern-matching.
By integer column index
# A duckplyr data frame: 2 variables
platelets serum_sodium
<dbl> <dbl>
1 263358. 137.
# A duckplyr data frame: 3 variables
platelets serum_creatinine serum_sodium
<dbl> <dbl> <dbl>
1 263358. 1.39 137.
By character name range
For example, select all columns between platelets and serum_sodium in the order they appear in the data.frame/tibble:
# A duckplyr data frame: 3 variables
platelets serum_creatinine serum_sodium
<dbl> <dbl> <dbl>
1 263358. 1.39 137.
Pattern-matching
# A duckplyr data frame: 2 variables
serum_creatinine serum_sodium
<dbl> <dbl>
1 1.39 137.
# A duckplyr data frame: 1 variable
ejection_fraction
<dbl>
1 38.1
Using predicate function wrapped in where()
A predicate function is a function that returns a logical value, i.e. either TRUE or FALSE.
For example, you can use is.* functions to identify columns based on their type, e.g. numeric or factor.
# A duckplyr data frame: 15 variables
age sex anaemia creatinine_phosphokinase ejection_fraction
<dbl> <dbl> <dbl> <dbl> <dbl>
1 60.8 0.649 0.431 582. 38.1
# ℹ 10 more variables: high_blood_pressure <dbl>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, smoking <dbl>, time <dbl>,
# Mortality <dbl>, diabetes <dbl>, Age_days <dbl>,
# demeaned_sodium_bysex <dbl>
To use a summary function that returns more than one row per group, use reframe() instead of summarize() . For example, to get the range (min and max) of all numeric columns:
# A tibble: 2 × 15
age sex anaemia creatinine_phosphokinase ejection_fraction
<dbl> <dbl> <dbl> <dbl> <dbl>
1 40 0 0 23 14
2 95 1 1 7861 80
# ℹ 10 more variables: high_blood_pressure <dbl>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, smoking <dbl>, time <dbl>,
# Mortality <dbl>, diabetes <dbl>, Age_days <dbl>,
# demeaned_sodium_bysex <dbl>
Using anonymous predicate function with where()
It is often useful to be able to get the number of unique values per column, e.g. in order to identify which columns may need to be converted to factors. duckplyr includes the function n_distinct() which counts the number of unique values of a vector. The following example shows duckplyr’s formula shorthand syntax to build an anonymous function using n_distinct() in order to select columns with fewer than 10 unique values:
# A duckplyr data frame: 6 variables
sex anaemia high_blood_pressure smoking Mortality diabetes
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 1 0
2 1 0 0 0 1 0
3 1 0 0 1 1 0
4 1 1 0 0 1 0
5 0 1 0 0 1 1
6 1 1 1 1 1 0
7 1 1 0 0 1 0
8 1 1 0 1 1 1
9 0 0 0 0 1 0
10 1 1 1 1 1 0
# ℹ more rows
which is equivalent to using a standard R anonymous function:
# A duckplyr data frame: 6 variables
sex anaemia high_blood_pressure smoking Mortality diabetes
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 1 0
2 1 0 0 0 1 0
3 1 0 0 1 1 0
4 1 1 0 0 1 0
5 0 1 0 0 1 1
6 1 1 1 1 1 0
7 1 1 0 0 1 0
8 1 1 0 1 1 1
9 0 0 0 0 1 0
10 1 1 1 1 1 0
# ℹ more rows
n_distinct(x) is equivalent to length(unique(x)):
# A duckplyr data frame: 6 variables
sex anaemia high_blood_pressure smoking Mortality diabetes
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 1 0
2 1 0 0 0 1 0
3 1 0 0 1 1 0
4 1 1 0 0 1 0
5 0 1 0 0 1 1
6 1 1 1 1 1 0
7 1 1 0 0 1 0
8 1 1 0 1 1 1
9 0 0 0 0 1 0
10 1 1 1 1 1 0
# ℹ more rows
To mutate those columns to factors, we combine across() and where() with the formula notation seen above within mutate() :
dckplyr_ [299 × 15] (S3: duckplyr_df/tbl_df/tbl/data.frame)
$ age : num [1:299] 75 55 65 50 65 90 75 60 65 80 ...
$ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
$ anaemia : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 1 2 ...
$ creatinine_phosphokinase: num [1:299] 582 7861 146 111 160 ...
$ ejection_fraction : num [1:299] 20 38 20 20 20 40 15 60 65 35 ...
$ high_blood_pressure : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
$ platelets : num [1:299] 265000 263358 162000 210000 327000 ...
$ serum_creatinine : num [1:299] 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
$ serum_sodium : num [1:299] 130 136 129 137 116 132 137 131 138 133 ...
$ smoking : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
$ time : num [1:299] 4 6 7 7 8 8 10 10 10 10 ...
$ Mortality : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ diabetes : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
$ Age_days : num [1:299] 27375 20075 23725 18250 23725 ...
$ demeaned_sodium_bysex : num [1:299] -6.536 -0.536 -7.536 0.464 -20.79 ...
Combining multiple duckplyr operations
For example, filter for smokers over age 60 and show mean platelets by sex:
dat |>
filter ( smoking == "1" , age > 60 ) |>
summarize ( Mean_platelets_by_sex = mean ( platelets ) , .by = sex )
# A duckplyr data frame: 2 variables
sex Mean_platelets_by_sex
<fct> <dbl>
1 1 248792.
2 0 469000