43 duckplyr basics

The duckplyr package offers extensive functionality for data manipulation.

duckplyr’s functions are named after verbs (like filter, select, etc.) and are commonly used with the pipe operator to build pipelines. The package offers a large number of functions in total, often with multiple versions of the same “verb”. It has undergone many major changes since its introduction, so always make sure to consult the latest documentation.

Core operations include:

filter(): identify cases
select(): identify columns
mutate(): create new variables as a function of existing ones
arrange(): reorder the rows/cases of a dataset
summarize(): apply functions on columns

duckplyr operates on data.frames as well as the tidyverse’s data.frame replacement, known as tibble.

Note that duckplyr masks some builtin functions when loaded.

library(duckplyr)

Loading required package: dplyr


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

The duckplyr package is configured to fall back to dplyr when it encounters an
incompatibility. Fallback events can be collected and uploaded for analysis to
guide future development. By default, data will be collected but no data will
be uploaded.
ℹ Automatic fallback uploading is not controlled and therefore disabled, see
  `?duckplyr::fallback()`.
✔ Number of reports ready for upload: 5.
→ Review with `duckplyr::fallback_review()`, upload with
  `duckplyr::fallback_upload()`.
ℹ Configure automatic uploading with `duckplyr::fallback_config()`.

✔ Overwriting dplyr methods with duckplyr methods.
ℹ Turn off with `duckplyr::methods_restore()`.

As an example, we read in the Heart Failure Clinical Records dataset from the UCI Machine Learning Repository using readr.

To read from https or S3 URLs, we need to install and load the httpfs extension in DuckDB first:

db_exec("INSTALL httpfs")
db_exec("LOAD httpfs")

dat <- read_csv_duckdb(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv"
) |> 
as_duckdb_tibble()
dat

# A duckplyr data frame: 13 variables
     age anaemia creatinine_phosphokinase diabetes ejection_fraction
   <dbl>   <dbl>                    <dbl>    <dbl>             <dbl>
 1    75       0                      582        0                20
 2    55       0                     7861        0                38
 3    65       0                      146        0                20
 4    50       1                      111        0                20
 5    65       1                      160        1                20
 6    90       1                       47        0                40
 7    75       1                      246        0                15
 8    60       1                      315        1                60
 9    65       0                      157        0                65
10    80       1                      123        0                35
# ℹ more rows
# ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
#   time <dbl>, DEATH_EVENT <dbl>

43.1 Filter

Single condition

dat |> 
    filter(age > 60)

# A duckplyr data frame: 13 variables
     age anaemia creatinine_phosphokinase diabetes ejection_fraction
   <dbl>   <dbl>                    <dbl>    <dbl>             <dbl>
 1    75       0                      582        0                20
 2    65       0                      146        0                20
 3    65       1                      160        1                20
 4    90       1                       47        0                40
 5    75       1                      246        0                15
 6    65       0                      157        0                65
 7    80       1                      123        0                35
 8    75       1                       81        0                38
 9    62       0                      231        0                25
10    82       1                      379        0                50
# ℹ more rows
# ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
#   time <dbl>, DEATH_EVENT <dbl>

Multiple conditions

dat |> 
    filter(age > 60, anaemia == "1")

# A duckplyr data frame: 13 variables
     age anaemia creatinine_phosphokinase diabetes ejection_fraction
   <dbl>   <dbl>                    <dbl>    <dbl>             <dbl>
 1    65       1                      160        1                20
 2    90       1                       47        0                40
 3    75       1                      246        0                15
 4    80       1                      123        0                35
 5    75       1                       81        0                38
 6    82       1                      379        0                50
 7    87       1                      149        0                38
 8    70       1                      125        0                25
 9    65       1                       52        0                25
10    65       1                      128        1                30
# ℹ more rows
# ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
#   time <dbl>, DEATH_EVENT <dbl>

43.2 Select

Columns can be specified by name, index, or pattern-matching.

43.2.1 By column name

dat |> 
    select("age", "diabetes", "ejection_fraction")

# A duckplyr data frame: 3 variables
     age diabetes ejection_fraction
   <dbl>    <dbl>             <dbl>
 1    75        0                20
 2    55        0                38
 3    65        0                20
 4    50        0                20
 5    65        1                20
 6    90        0                40
 7    75        0                15
 8    60        1                60
 9    65        0                65
10    80        0                35
# ℹ more rows

43.2.2 By integer column index

dat |> 
    select(c(7, 9))

# A duckplyr data frame: 2 variables
   platelets serum_sodium
       <dbl>        <dbl>
 1   265000           130
 2   263358.          136
 3   162000           129
 4   210000           137
 5   327000           116
 6   204000           132
 7   127000           137
 8   454000           131
 9   263358.          138
10   388000           133
# ℹ more rows

dat |> 
    select(7:9)

# A duckplyr data frame: 3 variables
   platelets serum_creatinine serum_sodium
       <dbl>            <dbl>        <dbl>
 1   265000               1.9          130
 2   263358.              1.1          136
 3   162000               1.3          129
 4   210000               1.9          137
 5   327000               2.7          116
 6   204000               2.1          132
 7   127000               1.2          137
 8   454000               1.1          131
 9   263358.              1.5          138
10   388000               9.4          133
# ℹ more rows

43.2.3 By character name range

For example, select all columns between platelets and serum_sodium in the order they appear in the data.frame/tibble:

dat |> 
    select(platelets:serum_sodium)

# A duckplyr data frame: 3 variables
   platelets serum_creatinine serum_sodium
       <dbl>            <dbl>        <dbl>
 1   265000               1.9          130
 2   263358.              1.1          136
 3   162000               1.3          129
 4   210000               1.9          137
 5   327000               2.7          116
 6   204000               2.1          132
 7   127000               1.2          137
 8   454000               1.1          131
 9   263358.              1.5          138
10   388000               9.4          133
# ℹ more rows

43.2.4 Pattern-matching

dat |> 
    select(starts_with("serum_"))

# A duckplyr data frame: 2 variables
   serum_creatinine serum_sodium
              <dbl>        <dbl>
 1              1.9          130
 2              1.1          136
 3              1.3          129
 4              1.9          137
 5              2.7          116
 6              2.1          132
 7              1.2          137
 8              1.1          131
 9              1.5          138
10              9.4          133
# ℹ more rows

dat |> 
    select(ends_with("_fraction"))

# A duckplyr data frame: 1 variable
   ejection_fraction
               <dbl>
 1                20
 2                38
 3                20
 4                20
 5                20
 6                40
 7                15
 8                60
 9                65
10                35
# ℹ more rows

43.3 Count

Count the number of smokers over age 60 by sex:

dat |> 
    filter(smoking == "1", age > 60) |> 
    count(sex)

# A duckplyr data frame: 2 variables
    sex     n
  <dbl> <int>
1     0     2
2     1    43

43.4 Summarize

Operate on a single column:

dat |> 
    summarize(mean(age))

# A duckplyr data frame: 1 variable
  `mean(age)`
        <dbl>
1        60.8

Operate on multiple columns using across():

dat |> 
    summarize(across(c(age, ejection_fraction), mean))

# A duckplyr data frame: 2 variables
    age ejection_fraction
  <dbl>             <dbl>
1  60.8              38.1

43.4.1 Grouped operation on single column

with single grouping variable:

dat |> 
    summarize(mean(age), .by = sex)

# A duckplyr data frame: 2 variables
    sex `mean(age)`
  <dbl>       <dbl>
1     0        59.8
2     1        61.4

with multiple grouping variables:

dat |> 
    summarize(mean(age), .by = c(sex, diabetes))

# A duckplyr data frame: 3 variables
    sex diabetes `mean(age)`
  <dbl>    <dbl>       <dbl>
1     1        1        60.4
2     0        0        61.6
3     1        0        62.0
4     0        1        58.2

group_by() followed by summarize() can also be used but results in a tibble, not a duckplyr data frame. The .by argument of summarize() is preferred.

43.4.2 Grouped operation on multiple columns

Group by single variable:

dat |> 
    summarize(across(c(age, serum_sodium), mean), .by = sex)

# A duckplyr data frame: 3 variables
    sex   age serum_sodium
  <dbl> <dbl>        <dbl>
1     1  61.4         137.
2     0  59.8         137.

Group by multiple variables:

dat |> 
    summarize(across(c(age, serum_sodium), mean), .by = c(smoking, anaemia))

# A duckplyr data frame: 4 variables
  smoking anaemia   age serum_sodium
    <dbl>   <dbl> <dbl>        <dbl>
1       1       0  60.4         136.
2       1       1  62.6         137 
3       0       0  59.7         136.
4       0       1  61.8         137.

43.5 Arrange

arrange() sorts a data.frame/tibble, i.e. reorders the rows/cases of a dataset.

dat |> 
    select(age, sex, serum_sodium) |> 
    arrange(age)

# A duckplyr data frame: 3 variables
     age   sex serum_sodium
   <dbl> <dbl>        <dbl>
 1    40     1          132
 2    40     1          136
 3    40     1          136
 4    40     1          137
 5    40     0          141
 6    40     1          142
 7    40     0          140
 8    41     1          140
 9    42     1          140
10    42     1          140
# ℹ more rows

43.6 Slice

Return the row with the highest ejection_fraction value:

dat |>
  slice_max(ejection_fraction)

# A duckplyr data frame: 13 variables
    age anaemia creatinine_phosphokinase diabetes ejection_fraction
  <dbl>   <dbl>                    <dbl>    <dbl>             <dbl>
1    45       0                      582        0                80
# ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
#   time <dbl>, DEATH_EVENT <dbl>

Return the row with the minimum age:

dat |>
  slice_min(age)

# A duckplyr data frame: 13 variables
    age anaemia creatinine_phosphokinase diabetes ejection_fraction
  <dbl>   <dbl>                    <dbl>    <dbl>             <dbl>
1    40       0                      478        1                30
2    40       0                      244        0                45
3    40       1                      101        0                40
4    40       1                      129        0                35
5    40       0                       90        0                35
6    40       0                      624        0                35
7    40       0                      582        1                35
# ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
#   time <dbl>, DEATH_EVENT <dbl>

slice_max() and slice_min() return all rows with the maximum/minimum value by default. To return only a single row, use the with_ties = FALSE argument.

dat |>
  slice_min(age, with_ties = FALSE)

# A duckplyr data frame: 13 variables
    age anaemia creatinine_phosphokinase diabetes ejection_fraction
  <dbl>   <dbl>                    <dbl>    <dbl>             <dbl>
1    40       0                      478        1                30
# ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
#   time <dbl>, DEATH_EVENT <dbl>

43.7 Relocate

Move the sex column after the age column:

dat <- dat |>
    relocate(sex, .after = age)
dat

# A duckplyr data frame: 13 variables
     age   sex anaemia creatinine_phosphokinase diabetes ejection_fraction
   <dbl> <dbl>   <dbl>                    <dbl>    <dbl>             <dbl>
 1    75     1       0                      582        0                20
 2    55     1       0                     7861        0                38
 3    65     1       0                      146        0                20
 4    50     1       1                      111        0                20
 5    65     0       1                      160        1                20
 6    90     1       1                       47        0                40
 7    75     1       1                      246        0                15
 8    60     1       1                      315        1                60
 9    65     0       0                      157        0                65
10    80     1       1                      123        0                35
# ℹ more rows
# ℹ 7 more variables: high_blood_pressure <dbl>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, smoking <dbl>, time <dbl>,
#   DEATH_EVENT <dbl>

Make the diabetes column the last column:

dat <- dat |>
    relocate(diabetes, .after = last_col())
names(dat)

 [1] "age"                      "sex"                     
 [3] "anaemia"                  "creatinine_phosphokinase"
 [5] "ejection_fraction"        "high_blood_pressure"     
 [7] "platelets"                "serum_creatinine"        
 [9] "serum_sodium"             "smoking"                 
[11] "time"                     "DEATH_EVENT"             
[13] "diabetes"

Note: above we use names() to show the column names of the resulting data.frame/tibble and verify the changes, since only the first few columns are printed by default.

43.8 Mutate

Add a new column named Age_days:

dat <- dat |> 
    mutate(Age_days = age * 365)

dat["Age_days"]

# A duckplyr data frame: 1 variable
   Age_days
      <dbl>
 1    27375
 2    20075
 3    23725
 4    18250
 5    23725
 6    32850
 7    27375
 8    21900
 9    23725
10    29200
# ℹ more rows

43.8.1 Mutate by group

dat <- dat |>
    mutate(demeaned_sodium_bysex = serum_sodium - mean(serum_sodium), .by = "sex")

Note that ungroup() is required to return the data.frame/tibble to its original ungrouped state. If you don’t do this, all subsequent operations will be performed on the same grouping.

Inspect the new column:

dat |> 
    select(sex, serum_sodium, demeaned_sodium_bysex)

# A duckplyr data frame: 3 variables
     sex serum_sodium demeaned_sodium_bysex
   <dbl>        <dbl>                 <dbl>
 1     1          130                -6.54 
 2     1          136                -0.536
 3     1          129                -7.54 
 4     1          137                 0.464
 5     0          116               -20.8  
 6     1          132                -4.54 
 7     1          137                 0.464
 8     1          131                -5.54 
 9     0          138                 1.21 
10     1          133                -3.54 
# ℹ more rows

43.9 Rename

Rename the DEATH_EVENT to Mortality:

dat <- dat |> 
    rename(Mortality = DEATH_EVENT)

names(dat)

 [1] "age"                      "sex"                     
 [3] "anaemia"                  "creatinine_phosphokinase"
 [5] "ejection_fraction"        "high_blood_pressure"     
 [7] "platelets"                "serum_creatinine"        
 [9] "serum_sodium"             "smoking"                 
[11] "time"                     "Mortality"               
[13] "diabetes"                 "Age_days"                
[15] "demeaned_sodium_bysex"

43.10 Specifying multiple columns

duckplyr includes a number of ways to identify multiple variables. The latest version of duckplyr suggests using across() within duckplyr functions that allow specifying columns, as we’ve used above.

This replaces separate functions previously used for each of filter/select/mutate/summarize/arrange that had independent functions ending with *_all(), *_at(), *_each(), *_if().

Other than specifying multiple columns by name, they can also be specified by index, by range, or by pattern-matching.

43.10.1 By integer column index

dat |> 
    summarize(across(c(7, 9), mean))

# A duckplyr data frame: 2 variables
  platelets serum_sodium
      <dbl>        <dbl>
1   263358.         137.

dat |> 
    summarize(across(7:9, mean))

# A duckplyr data frame: 3 variables
  platelets serum_creatinine serum_sodium
      <dbl>            <dbl>        <dbl>
1   263358.             1.39         137.

43.10.2 By character name range

For example, select all columns between platelets and serum_sodium in the order they appear in the data.frame/tibble:

dat |> 
    summarize(across(platelets:serum_sodium, mean))

# A duckplyr data frame: 3 variables
  platelets serum_creatinine serum_sodium
      <dbl>            <dbl>        <dbl>
1   263358.             1.39         137.

43.10.3 Pattern-matching

dat |> 
    summarize(across(starts_with("serum_"), mean))

# A duckplyr data frame: 2 variables
  serum_creatinine serum_sodium
             <dbl>        <dbl>
1             1.39         137.

dat |> 
    summarize(across(ends_with("fraction"), mean))

# A duckplyr data frame: 1 variable
  ejection_fraction
              <dbl>
1              38.1

43.10.4 Using predicate function wrapped in `where()`

A predicate function is a function that returns a logical value, i.e. either TRUE or FALSE.

For example, you can use is.* functions to identify columns based on their type, e.g. numeric or factor.

dat |> 
    summarize(across(where(is.numeric), mean))

# A duckplyr data frame: 15 variables
    age   sex anaemia creatinine_phosphokinase ejection_fraction
  <dbl> <dbl>   <dbl>                    <dbl>             <dbl>
1  60.8 0.649   0.431                     582.              38.1
# ℹ 10 more variables: high_blood_pressure <dbl>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, smoking <dbl>, time <dbl>,
#   Mortality <dbl>, diabetes <dbl>, Age_days <dbl>,
#   demeaned_sodium_bysex <dbl>

To use a summary function that returns more than one row per group, use reframe() instead of summarize(). For example, to get the range (min and max) of all numeric columns:

dat |> 
    reframe(across(where(is.numeric), range))

# A tibble: 2 × 15
    age   sex anaemia creatinine_phosphokinase ejection_fraction
  <dbl> <dbl>   <dbl>                    <dbl>             <dbl>
1    40     0       0                       23                14
2    95     1       1                     7861                80
# ℹ 10 more variables: high_blood_pressure <dbl>, platelets <dbl>,
#   serum_creatinine <dbl>, serum_sodium <dbl>, smoking <dbl>, time <dbl>,
#   Mortality <dbl>, diabetes <dbl>, Age_days <dbl>,
#   demeaned_sodium_bysex <dbl>

43.10.5 Using anonymous predicate function with `where()`

It is often useful to be able to get the number of unique values per column, e.g. in order to identify which columns may need to be converted to factors. duckplyr includes the function n_distinct() which counts the number of unique values of a vector.

The following example shows duckplyr’s formula shorthand syntax to build an anonymous function using n_distinct() in order to select columns with fewer than 10 unique values:

dat |> 
    select(where(~ n_distinct(.x) < 10))

# A duckplyr data frame: 6 variables
     sex anaemia high_blood_pressure smoking Mortality diabetes
   <dbl>   <dbl>               <dbl>   <dbl>     <dbl>    <dbl>
 1     1       0                   1       0         1        0
 2     1       0                   0       0         1        0
 3     1       0                   0       1         1        0
 4     1       1                   0       0         1        0
 5     0       1                   0       0         1        1
 6     1       1                   1       1         1        0
 7     1       1                   0       0         1        0
 8     1       1                   0       1         1        1
 9     0       0                   0       0         1        0
10     1       1                   1       1         1        0
# ℹ more rows

which is equivalent to using a standard R anonymous function:

dat |> 
    select(where(\(x) n_distinct(x) < 10))

# A duckplyr data frame: 6 variables
     sex anaemia high_blood_pressure smoking Mortality diabetes
   <dbl>   <dbl>               <dbl>   <dbl>     <dbl>    <dbl>
 1     1       0                   1       0         1        0
 2     1       0                   0       0         1        0
 3     1       0                   0       1         1        0
 4     1       1                   0       0         1        0
 5     0       1                   0       0         1        1
 6     1       1                   1       1         1        0
 7     1       1                   0       0         1        0
 8     1       1                   0       1         1        1
 9     0       0                   0       0         1        0
10     1       1                   1       1         1        0
# ℹ more rows

n_distinct(x) is equivalent to length(unique(x)):

dat |> 
    select(where(\(x) length(unique(x)) < 10))

# A duckplyr data frame: 6 variables
     sex anaemia high_blood_pressure smoking Mortality diabetes
   <dbl>   <dbl>               <dbl>   <dbl>     <dbl>    <dbl>
 1     1       0                   1       0         1        0
 2     1       0                   0       0         1        0
 3     1       0                   0       1         1        0
 4     1       1                   0       0         1        0
 5     0       1                   0       0         1        1
 6     1       1                   1       1         1        0
 7     1       1                   0       0         1        0
 8     1       1                   0       1         1        1
 9     0       0                   0       0         1        0
10     1       1                   1       1         1        0
# ℹ more rows

To mutate those columns to factors, we combine across() and where() with the formula notation seen above within mutate():

dat <- dat |> 
    mutate(across(where(~ n_distinct(.) < 10), as.factor))

str(dat)

dckplyr_ [299 × 15] (S3: duckplyr_df/tbl_df/tbl/data.frame)
 $ age                     : num [1:299] 75 55 65 50 65 90 75 60 65 80 ...
 $ sex                     : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
 $ anaemia                 : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 1 2 ...
 $ creatinine_phosphokinase: num [1:299] 582 7861 146 111 160 ...
 $ ejection_fraction       : num [1:299] 20 38 20 20 20 40 15 60 65 35 ...
 $ high_blood_pressure     : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
 $ platelets               : num [1:299] 265000 263358 162000 210000 327000 ...
 $ serum_creatinine        : num [1:299] 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
 $ serum_sodium            : num [1:299] 130 136 129 137 116 132 137 131 138 133 ...
 $ smoking                 : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
 $ time                    : num [1:299] 4 6 7 7 8 8 10 10 10 10 ...
 $ Mortality               : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ diabetes                : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
 $ Age_days                : num [1:299] 27375 20075 23725 18250 23725 ...
 $ demeaned_sodium_bysex   : num [1:299] -6.536 -0.536 -7.536 0.464 -20.79 ...

43.11 Combining multiple duckplyr operations

For example, filter for smokers over age 60 and show mean platelets by sex:

dat |> 
    filter(smoking == "1", age > 60) |> 
    summarize(Mean_platelets_by_sex = mean(platelets), .by = sex)

# A duckplyr data frame: 2 variables
  sex   Mean_platelets_by_sex
  <fct>                 <dbl>
1 1                   248792.
2 0                   469000

43 duckplyr basics

43.1 Filter

43.2 Select

43.2.1 By column name

43.2.2 By integer column index

43.2.3 By character name range

43.2.4 Pattern-matching

43.3 Count

43.4 Summarize

43.4.1 Grouped operation on single column

43.4.2 Grouped operation on multiple columns

43.5 Arrange

43.6 Slice

43.7 Relocate

43.8 Mutate

43.8.1 Mutate by group

43.9 Rename

43.10 Specifying multiple columns

43.10.1 By integer column index

43.10.2 By character name range

43.10.3 Pattern-matching

43.10.4 Using predicate function wrapped in `where()`

43.10.5 Using anonymous predicate function with `where()`

43.11 Combining multiple duckplyr operations

43.12 Resources

43.13 See also

43.1 Filter

43.2 Select

43.2.1 By column name

43.2.2 By integer column index

43.2.3 By character name range

43.2.4 Pattern-matching

43.3 Count

43.4 Summarize

43.4.1 Grouped operation on single column

43.4.2 Grouped operation on multiple columns

43.5 Arrange

43.6 Slice

43.7 Relocate

43.8 Mutate

43.8.1 Mutate by group

43.9 Rename

43.10 Specifying multiple columns

43.10.1 By integer column index

43.10.2 By character name range

43.10.3 Pattern-matching

43.10.4 Using predicate function wrapped in where()

43.10.5 Using anonymous predicate function with where()

43.11 Combining multiple duckplyr operations

43.12 Resources

43.13 See also

43.10.4 Using predicate function wrapped in `where()`

43.10.5 Using anonymous predicate function with `where()`