46 Polars basics

The Polars R package exports two environments that host most of the API: pl and cs. Use the $ operator to access the functions and objects they contain. pl provides the main Polars functionality, including functions to read data into DataFrames and to build the expressions used to filter, select, and transform their columns. cs provides column selectors, which select columns by name pattern, data type, and so on.

library(polars)

46.1 Read a CSV file

Use the read_csv() function within the pl environment to read a CSV file into a Polars DataFrame:

dat <- pl$read_csv(
  "../docs/data/heart_failure_clinical_records_dataset.csv",
  infer_schema_length = 1000
)
dat

shape: (299, 13)
┌──────┬─────────┬──────────────────────────┬──────────┬───┬─────┬─────────┬──────┬─────────────┐
│ age  ┆ anaemia ┆ creatinine_phosphokinase ┆ diabetes ┆ … ┆ sex ┆ smoking ┆ time ┆ DEATH_EVENT │
│ ---  ┆ ---     ┆ ---                      ┆ ---      ┆   ┆ --- ┆ ---     ┆ ---  ┆ ---         │
│ f64  ┆ i64     ┆ i64                      ┆ i64      ┆   ┆ i64 ┆ i64     ┆ i64  ┆ i64         │
╞══════╪═════════╪══════════════════════════╪══════════╪═══╪═════╪═════════╪══════╪═════════════╡
│ 75.0 ┆ 0       ┆ 582                      ┆ 0        ┆ … ┆ 1   ┆ 0       ┆ 4    ┆ 1           │
│ 55.0 ┆ 0       ┆ 7861                     ┆ 0        ┆ … ┆ 1   ┆ 0       ┆ 6    ┆ 1           │
│ 65.0 ┆ 0       ┆ 146                      ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 7    ┆ 1           │
│ 50.0 ┆ 1       ┆ 111                      ┆ 0        ┆ … ┆ 1   ┆ 0       ┆ 7    ┆ 1           │
│ 65.0 ┆ 1       ┆ 160                      ┆ 1        ┆ … ┆ 0   ┆ 0       ┆ 8    ┆ 1           │
│ …    ┆ …       ┆ …                        ┆ …        ┆ … ┆ …   ┆ …       ┆ …    ┆ …           │
│ 62.0 ┆ 0       ┆ 61                       ┆ 1        ┆ … ┆ 1   ┆ 1       ┆ 270  ┆ 0           │
│ 55.0 ┆ 0       ┆ 1820                     ┆ 0        ┆ … ┆ 0   ┆ 0       ┆ 271  ┆ 0           │
│ 45.0 ┆ 0       ┆ 2060                     ┆ 1        ┆ … ┆ 0   ┆ 0       ┆ 278  ┆ 0           │
│ 45.0 ┆ 0       ┆ 2413                     ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 280  ┆ 0           │
│ 50.0 ┆ 0       ┆ 196                      ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 285  ┆ 0           │
└──────┴─────────┴──────────────────────────┴──────────┴───┴─────┴─────────┴──────┴─────────────┘

Alternatively, use scan_csv() for lazy loading, followed by collect() to load the data into memory:

dat <- pl$scan_csv(
  "../docs/data/heart_failure_clinical_records_dataset.csv",
  infer_schema_length = 1000
)$collect()

Lazy loading with scan_csv() is more memory-efficient for large datasets: because computation is deferred, Polars can optimize the query, for example by filtering rows and selecting columns before the data are loaded into memory.
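
As a minimal sketch of this deferred workflow, the example below writes a small made-up CSV to a temporary file (toy values, not the heart failure dataset), builds a lazy query that filters and selects, and only then calls collect(). The file contents and the names tmp, lazy_query, and older are illustrative assumptions.

```r
library(polars)

# Write a small illustrative CSV to a temporary file (toy data)
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(age = c(45, 62, 71, 58), sex = c(1, 0, 1, 1), time = c(10, 20, 30, 40)),
  tmp,
  row.names = FALSE
)

# Build the query lazily: no data is loaded at this point
lazy_query <- pl$scan_csv(tmp)$
  filter(pl$col("age") > 60)$
  select("age", "sex")

# collect() runs the query and returns a DataFrame
older <- lazy_query$collect()
older
```

Only the rows and columns that the query needs have to be materialized, which is where the memory savings come from on large files.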

46.2 Convert types

Let’s convert columns with fewer than 5 unique values to factors (the Categorical type in Polars).

First, get n_unique for each column:

dat$select(pl$all()$n_unique())

shape: (1, 13)
┌─────┬─────────┬──────────────────────────┬──────────┬───┬─────┬─────────┬──────┬─────────────┐
│ age ┆ anaemia ┆ creatinine_phosphokinase ┆ diabetes ┆ … ┆ sex ┆ smoking ┆ time ┆ DEATH_EVENT │
│ --- ┆ ---     ┆ ---                      ┆ ---      ┆   ┆ --- ┆ ---     ┆ ---  ┆ ---         │
│ u32 ┆ u32     ┆ u32                      ┆ u32      ┆   ┆ u32 ┆ u32     ┆ u32  ┆ u32         │
╞═════╪═════════╪══════════════════════════╪══════════╪═══╪═════╪═════════╪══════╪═════════════╡
│ 47  ┆ 2       ┆ 208                      ┆ 2        ┆ … ┆ 2   ┆ 2       ┆ 148  ┆ 2           │
└─────┴─────────┴──────────────────────────┴──────────┴───┴─────┴─────────┴──────┴─────────────┘

We can view the results transposed for easier reading:

dat$select(pl$all()$n_unique())$transpose(include_header = TRUE)

shape: (13, 2)
┌──────────────────────────┬──────────┐
│ column                   ┆ column_0 │
│ ---                      ┆ ---      │
│ str                      ┆ u32      │
╞══════════════════════════╪══════════╡
│ age                      ┆ 47       │
│ anaemia                  ┆ 2        │
│ creatinine_phosphokinase ┆ 208      │
│ diabetes                 ┆ 2        │
│ ejection_fraction        ┆ 17       │
│ …                        ┆ …        │
│ serum_sodium             ┆ 27       │
│ sex                      ┆ 2        │
│ smoking                  ┆ 2        │
│ time                     ┆ 148      │
│ DEATH_EVENT              ┆ 2        │
└──────────────────────────┴──────────┘

In either case, the printed output is truncated.
Instead, we can extract the names of the columns with fewer than 5 unique values:

to_factor <- names(dat)[dat$select(
  pl$all()$n_unique() < 5
)$transpose()$to_series()$to_r_vector()]
to_factor

[1] "anaemia"             "diabetes"            "high_blood_pressure"
[4] "sex"                 "smoking"             "DEATH_EVENT"

Convert the identified columns to factors using cast() within a loop. Each integer column is first cast to String and then to Categorical:

for (col in to_factor) {
  dat <- dat$with_columns(
    pl$col(col)$cast(pl$String)$cast(pl$Categorical())
  )
}
dat

shape: (299, 13)
┌──────┬─────────┬──────────────────────────┬──────────┬───┬─────┬─────────┬──────┬─────────────┐
│ age  ┆ anaemia ┆ creatinine_phosphokinase ┆ diabetes ┆ … ┆ sex ┆ smoking ┆ time ┆ DEATH_EVENT │
│ ---  ┆ ---     ┆ ---                      ┆ ---      ┆   ┆ --- ┆ ---     ┆ ---  ┆ ---         │
│ f64  ┆ cat     ┆ i64                      ┆ cat      ┆   ┆ cat ┆ cat     ┆ i64  ┆ cat         │
╞══════╪═════════╪══════════════════════════╪══════════╪═══╪═════╪═════════╪══════╪═════════════╡
│ 75.0 ┆ 0       ┆ 582                      ┆ 0        ┆ … ┆ 1   ┆ 0       ┆ 4    ┆ 1           │
│ 55.0 ┆ 0       ┆ 7861                     ┆ 0        ┆ … ┆ 1   ┆ 0       ┆ 6    ┆ 1           │
│ 65.0 ┆ 0       ┆ 146                      ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 7    ┆ 1           │
│ 50.0 ┆ 1       ┆ 111                      ┆ 0        ┆ … ┆ 1   ┆ 0       ┆ 7    ┆ 1           │
│ 65.0 ┆ 1       ┆ 160                      ┆ 1        ┆ … ┆ 0   ┆ 0       ┆ 8    ┆ 1           │
│ …    ┆ …       ┆ …                        ┆ …        ┆ … ┆ …   ┆ …       ┆ …    ┆ …           │
│ 62.0 ┆ 0       ┆ 61                       ┆ 1        ┆ … ┆ 1   ┆ 1       ┆ 270  ┆ 0           │
│ 55.0 ┆ 0       ┆ 1820                     ┆ 0        ┆ … ┆ 0   ┆ 0       ┆ 271  ┆ 0           │
│ 45.0 ┆ 0       ┆ 2060                     ┆ 1        ┆ … ┆ 0   ┆ 0       ┆ 278  ┆ 0           │
│ 45.0 ┆ 0       ┆ 2413                     ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 280  ┆ 0           │
│ 50.0 ┆ 0       ┆ 196                      ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 285  ┆ 0           │
└──────┴─────────┴──────────────────────────┴──────────┴───┴─────┴─────────┴──────┴─────────────┘
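
To see how the Categorical type relates to R factors, the following minimal sketch applies the same double cast to a toy DataFrame (made-up values) and converts the result to a base R data frame. It assumes that as.data.frame() maps Categorical columns to factors; the names toy and toy_r are illustrative.

```r
library(polars)

# Toy data for illustration only
toy <- pl$DataFrame(sex = c(1L, 0L, 1L), age = c(75, 55, 65))

# Same pattern as above: integer -> String -> Categorical
toy <- toy$with_columns(
  pl$col("sex")$cast(pl$String)$cast(pl$Categorical())
)

# Convert to a base R data.frame and inspect the column classes
toy_r <- as.data.frame(toy)
sapply(toy_r, class)
```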

46.3 Filter

To filter rows in a Polars DataFrame, use the filter() method along with column expressions from the pl environment.

Single condition:

dat$filter(pl$col("age") > 60)

shape: (137, 13)
┌──────┬─────────┬──────────────────────────┬──────────┬───┬─────┬─────────┬──────┬─────────────┐
│ age  ┆ anaemia ┆ creatinine_phosphokinase ┆ diabetes ┆ … ┆ sex ┆ smoking ┆ time ┆ DEATH_EVENT │
│ ---  ┆ ---     ┆ ---                      ┆ ---      ┆   ┆ --- ┆ ---     ┆ ---  ┆ ---         │
│ f64  ┆ cat     ┆ i64                      ┆ cat      ┆   ┆ cat ┆ cat     ┆ i64  ┆ cat         │
╞══════╪═════════╪══════════════════════════╪══════════╪═══╪═════╪═════════╪══════╪═════════════╡
│ 75.0 ┆ 0       ┆ 582                      ┆ 0        ┆ … ┆ 1   ┆ 0       ┆ 4    ┆ 1           │
│ 65.0 ┆ 0       ┆ 146                      ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 7    ┆ 1           │
│ 65.0 ┆ 1       ┆ 160                      ┆ 1        ┆ … ┆ 0   ┆ 0       ┆ 8    ┆ 1           │
│ 90.0 ┆ 1       ┆ 47                       ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 8    ┆ 1           │
│ 75.0 ┆ 1       ┆ 246                      ┆ 0        ┆ … ┆ 1   ┆ 0       ┆ 10   ┆ 1           │
│ …    ┆ …       ┆ …                        ┆ …        ┆ … ┆ …   ┆ …       ┆ …    ┆ …           │
│ 65.0 ┆ 0       ┆ 1688                     ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 250  ┆ 0           │
│ 65.0 ┆ 0       ┆ 892                      ┆ 1        ┆ … ┆ 0   ┆ 0       ┆ 256  ┆ 0           │
│ 90.0 ┆ 1       ┆ 337                      ┆ 0        ┆ … ┆ 0   ┆ 0       ┆ 256  ┆ 0           │
│ 63.0 ┆ 1       ┆ 103                      ┆ 1        ┆ … ┆ 1   ┆ 1       ┆ 270  ┆ 0           │
│ 62.0 ┆ 0       ┆ 61                       ┆ 1        ┆ … ┆ 1   ┆ 1       ┆ 270  ┆ 0           │
└──────┴─────────┴──────────────────────────┴──────────┴───┴─────┴─────────┴──────┴─────────────┘

Multiple conditions:

dat$filter(
  (pl$col("age") > 60) & 
  (pl$col("anaemia") == "1")
)

shape: (61, 13)
┌──────┬─────────┬──────────────────────────┬──────────┬───┬─────┬─────────┬──────┬─────────────┐
│ age  ┆ anaemia ┆ creatinine_phosphokinase ┆ diabetes ┆ … ┆ sex ┆ smoking ┆ time ┆ DEATH_EVENT │
│ ---  ┆ ---     ┆ ---                      ┆ ---      ┆   ┆ --- ┆ ---     ┆ ---  ┆ ---         │
│ f64  ┆ cat     ┆ i64                      ┆ cat      ┆   ┆ cat ┆ cat     ┆ i64  ┆ cat         │
╞══════╪═════════╪══════════════════════════╪══════════╪═══╪═════╪═════════╪══════╪═════════════╡
│ 65.0 ┆ 1       ┆ 160                      ┆ 1        ┆ … ┆ 0   ┆ 0       ┆ 8    ┆ 1           │
│ 90.0 ┆ 1       ┆ 47                       ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 8    ┆ 1           │
│ 75.0 ┆ 1       ┆ 246                      ┆ 0        ┆ … ┆ 1   ┆ 0       ┆ 10   ┆ 1           │
│ 80.0 ┆ 1       ┆ 123                      ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 10   ┆ 1           │
│ 75.0 ┆ 1       ┆ 81                       ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 10   ┆ 1           │
│ …    ┆ …       ┆ …                        ┆ …        ┆ … ┆ …   ┆ …       ┆ …    ┆ …           │
│ 62.0 ┆ 1       ┆ 655                      ┆ 0        ┆ … ┆ 0   ┆ 0       ┆ 233  ┆ 0           │
│ 65.0 ┆ 1       ┆ 258                      ┆ 1        ┆ … ┆ 1   ┆ 0       ┆ 235  ┆ 1           │
│ 68.0 ┆ 1       ┆ 157                      ┆ 1        ┆ … ┆ 0   ┆ 0       ┆ 237  ┆ 0           │
│ 90.0 ┆ 1       ┆ 337                      ┆ 0        ┆ … ┆ 0   ┆ 0       ┆ 256  ┆ 0           │
│ 63.0 ┆ 1       ┆ 103                      ┆ 1        ┆ … ┆ 1   ┆ 1       ┆ 270  ┆ 0           │
└──────┴─────────┴──────────────────────────┴──────────┴───┴─────┴─────────┴──────┴─────────────┘
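
Conditions can also be combined with OR. The sketch below uses a toy DataFrame (made-up values) to contrast AND, written as separate arguments to filter(), with OR, written with the | operator. The names toy, both, and either are illustrative.

```r
library(polars)

# Toy data for illustration only
toy <- pl$DataFrame(
  age = c(45, 62, 71, 58),
  anaemia = c("0", "1", "0", "1")
)

# AND: conditions passed as separate arguments must all hold
both <- toy$filter(pl$col("age") > 60, pl$col("anaemia") == "1")

# OR: combine conditions with |, keeping each one in parentheses for readability
either <- toy$filter(
  (pl$col("age") > 60) |
  (pl$col("anaemia") == "1")
)

both
either
```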

46.4 Select

46.4.1 By column names

dat$select("age", "diabetes", "ejection_fraction")

shape: (299, 3)
┌──────┬──────────┬───────────────────┐
│ age  ┆ diabetes ┆ ejection_fraction │
│ ---  ┆ ---      ┆ ---               │
│ f64  ┆ cat      ┆ i64               │
╞══════╪══════════╪═══════════════════╡
│ 75.0 ┆ 0        ┆ 20                │
│ 55.0 ┆ 0        ┆ 38                │
│ 65.0 ┆ 0        ┆ 20                │
│ 50.0 ┆ 0        ┆ 20                │
│ 65.0 ┆ 1        ┆ 20                │
│ …    ┆ …        ┆ …                 │
│ 62.0 ┆ 1        ┆ 38                │
│ 55.0 ┆ 0        ┆ 38                │
│ 45.0 ┆ 1        ┆ 60                │
│ 45.0 ┆ 0        ┆ 38                │
│ 50.0 ┆ 0        ┆ 45                │
└──────┴──────────┴───────────────────┘

46.4.2 By column index

Note that pl$nth() uses zero-based indices, so 7 and 9 refer to the eighth and tenth columns:

dat$select(pl$nth(c(7, 9)))

shape: (299, 2)
┌──────────────────┬─────┐
│ serum_creatinine ┆ sex │
│ ---              ┆ --- │
│ f64              ┆ cat │
╞══════════════════╪═════╡
│ 1.9              ┆ 1   │
│ 1.1              ┆ 1   │
│ 1.3              ┆ 1   │
│ 1.9              ┆ 1   │
│ 2.7              ┆ 0   │
│ …                ┆ …   │
│ 1.1              ┆ 1   │
│ 1.2              ┆ 0   │
│ 0.8              ┆ 0   │
│ 1.4              ┆ 1   │
│ 1.6              ┆ 1   │
└──────────────────┴─────┘

A contiguous range of indices works the same way:

dat$select(pl$nth(7:9))

shape: (299, 3)
┌──────────────────┬──────────────┬─────┐
│ serum_creatinine ┆ serum_sodium ┆ sex │
│ ---              ┆ ---          ┆ --- │
│ f64              ┆ i64          ┆ cat │
╞══════════════════╪══════════════╪═════╡
│ 1.9              ┆ 130          ┆ 1   │
│ 1.1              ┆ 136          ┆ 1   │
│ 1.3              ┆ 129          ┆ 1   │
│ 1.9              ┆ 137          ┆ 1   │
│ 2.7              ┆ 116          ┆ 0   │
│ …                ┆ …            ┆ …   │
│ 1.1              ┆ 143          ┆ 1   │
│ 1.2              ┆ 139          ┆ 0   │
│ 0.8              ┆ 138          ┆ 0   │
│ 1.4              ┆ 140          ┆ 1   │
│ 1.6              ┆ 136          ┆ 1   │
└──────────────────┴──────────────┴─────┘

46.4.3 By name pattern matching

dat$select(cs$starts_with("serum_"))

shape: (299, 2)
┌──────────────────┬──────────────┐
│ serum_creatinine ┆ serum_sodium │
│ ---              ┆ ---          │
│ f64              ┆ i64          │
╞══════════════════╪══════════════╡
│ 1.9              ┆ 130          │
│ 1.1              ┆ 136          │
│ 1.3              ┆ 129          │
│ 1.9              ┆ 137          │
│ 2.7              ┆ 116          │
│ …                ┆ …            │
│ 1.1              ┆ 143          │
│ 1.2              ┆ 139          │
│ 0.8              ┆ 138          │
│ 1.4              ┆ 140          │
│ 1.6              ┆ 136          │
└──────────────────┴──────────────┘

dat$select(cs$ends_with("_fraction"))

shape: (299, 1)
┌───────────────────┐
│ ejection_fraction │
│ ---               │
│ i64               │
╞═══════════════════╡
│ 20                │
│ 38                │
│ 20                │
│ 20                │
│ 20                │
│ …                 │
│ 38                │
│ 38                │
│ 60                │
│ 38                │
│ 45                │
└───────────────────┘

46.4.4 By data type

For example, to select all numeric columns:

dat$select(cs$numeric())

shape: (299, 7)
┌──────┬───────────────────┬──────────────────┬───────────┬──────────────────┬──────────────┬──────┐
│ age  ┆ creatinine_phosph ┆ ejection_fractio ┆ platelets ┆ serum_creatinine ┆ serum_sodium ┆ time │
│ ---  ┆ okinase           ┆ n                ┆ ---       ┆ ---              ┆ ---          ┆ ---  │
│ f64  ┆ ---               ┆ ---              ┆ f64       ┆ f64              ┆ i64          ┆ i64  │
│      ┆ i64               ┆ i64              ┆           ┆                  ┆              ┆      │
╞══════╪═══════════════════╪══════════════════╪═══════════╪══════════════════╪══════════════╪══════╡
│ 75.0 ┆ 582               ┆ 20               ┆ 265000.0  ┆ 1.9              ┆ 130          ┆ 4    │
│ 55.0 ┆ 7861              ┆ 38               ┆ 263358.03 ┆ 1.1              ┆ 136          ┆ 6    │
│ 65.0 ┆ 146               ┆ 20               ┆ 162000.0  ┆ 1.3              ┆ 129          ┆ 7    │
│ 50.0 ┆ 111               ┆ 20               ┆ 210000.0  ┆ 1.9              ┆ 137          ┆ 7    │
│ 65.0 ┆ 160               ┆ 20               ┆ 327000.0  ┆ 2.7              ┆ 116          ┆ 8    │
│ …    ┆ …                 ┆ …                ┆ …         ┆ …                ┆ …            ┆ …    │
│ 62.0 ┆ 61                ┆ 38               ┆ 155000.0  ┆ 1.1              ┆ 143          ┆ 270  │
│ 55.0 ┆ 1820              ┆ 38               ┆ 270000.0  ┆ 1.2              ┆ 139          ┆ 271  │
│ 45.0 ┆ 2060              ┆ 60               ┆ 742000.0  ┆ 0.8              ┆ 138          ┆ 278  │
│ 45.0 ┆ 2413              ┆ 38               ┆ 140000.0  ┆ 1.4              ┆ 140          ┆ 280  │
│ 50.0 ┆ 196               ┆ 45               ┆ 395000.0  ┆ 1.6              ┆ 136          ┆ 285  │
└──────┴───────────────────┴──────────────────┴───────────┴──────────────────┴──────────────┴──────┘
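
Other type-based selectors follow the same pattern. The sketch below assumes that the cs environment also provides cs$float(), cs$integer(), and cs$string(), mirroring the selectors of the same names in Python Polars; check the package reference for the selectors available in your installed version. The toy DataFrame uses made-up values.

```r
library(polars)

# Toy data for illustration only: one double, one integer, one character column
toy <- pl$DataFrame(
  age = c(75, 55),
  time = c(4L, 6L),
  id = c("a", "b")
)

# Each selector keeps only the columns of the matching type
floats <- toy$select(cs$float())
integers <- toy$select(cs$integer())
strings <- toy$select(cs$string())

names(floats)
names(integers)
names(strings)
```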

46.5 Count

Count the number of smokers over age 60 by sex. Here, we filter on two conditions, group by sex, and then aggregate using len() to count the number of rows in each group:

dat$
  filter(pl$col("smoking") == "1", pl$col("age") > 60)$
  group_by("sex")$
  agg(
    N = pl$len()
  )

shape: (2, 2)
┌─────┬─────┐
│ sex ┆ N   │
│ --- ┆ --- │
│ cat ┆ u32 │
╞═════╪═════╡
│ 1   ┆ 43  │
│ 0   ┆ 2   │
└─────┴─────┘

Another way is to use value_counts() after filtering. This returns a single struct column that holds each value together with its count, rather than a separate count column:

dat$filter(
  (pl$col("smoking") == "1") & 
  (pl$col("age") > 60)
)$select(pl$col("sex")$value_counts())

shape: (2, 1)
┌───────────┐
│ sex       │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {"1",43}  │
│ {"0",2}   │
└───────────┘
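
If separate columns are needed, the struct can be expanded. The following is a minimal sketch on toy data (made-up values), assuming the DataFrame unnest() method accepts the name of the struct column; the names toy and counts are illustrative.

```r
library(polars)

# Toy data for illustration only
toy <- pl$DataFrame(sex = c("1", "1", "0", "1"))

# value_counts() yields one struct column; unnest() expands its fields into columns
counts <- toy$select(pl$col("sex")$value_counts())$unnest("sex")
counts
```

The result has one column for the value and one for its count, which is closer to the group_by() and agg() output shown earlier.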

46.6 Summarize

To get the mean age, use the mean() method on the age column:

dat$select(
  pl$col("age")$mean()
)

shape: (1, 1)
┌───────────┐
│ age       │
│ ---       │
│ f64       │
╞═══════════╡
│ 60.833893 │
└───────────┘

It is good practice to give the resulting column a meaningful name using alias():

dat$select(
  pl$col("age")$mean()$alias("Mean_Age")
)

shape: (1, 1)
┌───────────┐
│ Mean_Age  │
│ ---       │
│ f64       │
╞═══════════╡
│ 60.833893 │
└───────────┘

Multiple columns can be summarized in a single select() call:

dat$select(
  pl$col("age")$mean()$alias("Mean_Age"),
  pl$col("ejection_fraction")$mean()$alias("Mean_Ejection_Fraction")
)

shape: (1, 2)
┌───────────┬────────────────────────┐
│ Mean_Age  ┆ Mean_Ejection_Fraction │
│ ---       ┆ ---                    │
│ f64       ┆ f64                    │
╞═══════════╪════════════════════════╡
│ 60.833893 ┆ 38.083612              │
└───────────┴────────────────────────┘
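
The same pattern extends to other summary statistics. As a small sketch on toy data (made-up ages), several statistics of one column can be computed in a single select() call; the column names Min_Age, Median_Age, and Max_Age are illustrative.

```r
library(polars)

# Toy data for illustration only
toy <- pl$DataFrame(age = c(40, 50, 60, 95))

# Each expression reduces the age column to a single value
age_summary <- toy$select(
  pl$col("age")$min()$alias("Min_Age"),
  pl$col("age")$median()$alias("Median_Age"),
  pl$col("age")$max()$alias("Max_Age")
)
age_summary
```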

46.6.1 Grouped summarize

dat$group_by("sex")$agg(
  pl$col("age")$mean()$alias("Mean_Age")
)

shape: (2, 2)
┌─────┬───────────┐
│ sex ┆ Mean_Age  │
│ --- ┆ ---       │
│ cat ┆ f64       │
╞═════╪═══════════╡
│ 1   ┆ 61.4055   │
│ 0   ┆ 59.777781 │
└─────┴───────────┘

You can name arguments in group_by() to give the grouping variable a different name in the output:

dat$group_by(Sex = "sex")$agg(
  pl$col("age")$mean()$alias("Mean_Age")
)

shape: (2, 2)
┌─────┬───────────┐
│ Sex ┆ Mean_Age  │
│ --- ┆ ---       │
│ cat ┆ f64       │
╞═════╪═══════════╡
│ 1   ┆ 61.4055   │
│ 0   ┆ 59.777781 │
└─────┴───────────┘

Group by multiple variables:

dat$group_by(Smoking = "smoking", Anaemia = "anaemia")$agg(
  pl$col("age")$mean()$alias("Mean_Age"),
  pl$col("serum_sodium")$mean()$alias("Mean_Serum_Sodium")
)

shape: (4, 4)
┌─────────┬─────────┬───────────┬───────────────────┐
│ Smoking ┆ Anaemia ┆ Mean_Age  ┆ Mean_Serum_Sodium │
│ ---     ┆ ---     ┆ ---       ┆ ---               │
│ cat     ┆ cat     ┆ f64       ┆ f64               │
╞═════════╪═════════╪═══════════╪═══════════════════╡
│ 1       ┆ 0       ┆ 60.354839 ┆ 136.467742        │
│ 0       ┆ 0       ┆ 59.675926 ┆ 136.462963        │
│ 1       ┆ 1       ┆ 62.617647 ┆ 137.0             │
│ 0       ┆ 1       ┆ 61.824568 ┆ 136.778947        │
└─────────┴─────────┴───────────┴───────────────────┘
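
The order of the groups in the output of group_by() is not guaranteed by default. One way to get a reproducible order, sketched here on toy data (made-up values), is to sort the aggregated result by the grouping column; the names toy and by_sex are illustrative.

```r
library(polars)

# Toy data for illustration only
toy <- pl$DataFrame(
  sex = c("1", "0", "1", "0"),
  age = c(60, 50, 70, 40)
)

# Aggregate by group, then sort by the grouping column for a stable row order
by_sex <- toy$group_by("sex")$agg(
  pl$col("age")$mean()$alias("Mean_Age")
)$sort("sex")
by_sex
```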

46.7 Sort

Select age, sex, and serum_sodium columns and sort by age in ascending order:

dat$select(
  pl$col("age", "sex", "serum_sodium")
)$sort("age")

shape: (299, 3)
┌──────┬─────┬──────────────┐
│ age  ┆ sex ┆ serum_sodium │
│ ---  ┆ --- ┆ ---          │
│ f64  ┆ cat ┆ i64          │
╞══════╪═════╪══════════════╡
│ 40.0 ┆ 1   ┆ 136          │
│ 40.0 ┆ 0   ┆ 140          │
│ 40.0 ┆ 0   ┆ 141          │
│ 40.0 ┆ 1   ┆ 137          │
│ 40.0 ┆ 1   ┆ 136          │
│ …    ┆ …   ┆ …            │
│ 90.0 ┆ 1   ┆ 134          │
│ 90.0 ┆ 0   ┆ 144          │
│ 94.0 ┆ 1   ┆ 134          │
│ 95.0 ┆ 0   ┆ 138          │
│ 95.0 ┆ 1   ┆ 132          │
└──────┴─────┴──────────────┘
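
To sort in descending order instead, sort() takes a descending argument. A minimal sketch on toy data (made-up values); the names toy and oldest_first are illustrative.

```r
library(polars)

# Toy data for illustration only
toy <- pl$DataFrame(
  age = c(45, 62, 71, 58),
  serum_sodium = c(136L, 140L, 134L, 138L)
)

# Sort by age from highest to lowest
oldest_first <- toy$sort("age", descending = TRUE)
oldest_first
```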

46.8 Slice

The top_k() and bottom_k() methods return the k rows with the largest or smallest values of a specified column.

Return the row with the highest ejection_fraction value:

dat$top_k(1, by = "ejection_fraction")

shape: (1, 13)
┌──────┬─────────┬──────────────────────────┬──────────┬───┬─────┬─────────┬──────┬─────────────┐
│ age  ┆ anaemia ┆ creatinine_phosphokinase ┆ diabetes ┆ … ┆ sex ┆ smoking ┆ time ┆ DEATH_EVENT │
│ ---  ┆ ---     ┆ ---                      ┆ ---      ┆   ┆ --- ┆ ---     ┆ ---  ┆ ---         │
│ f64  ┆ cat     ┆ i64                      ┆ cat      ┆   ┆ cat ┆ cat     ┆ i64  ┆ cat         │
╞══════╪═════════╪══════════════════════════╪══════════╪═══╪═════╪═════════╪══════╪═════════════╡
│ 45.0 ┆ 0       ┆ 582                      ┆ 0        ┆ … ┆ 0   ┆ 0       ┆ 63   ┆ 0           │
└──────┴─────────┴──────────────────────────┴──────────┴───┴─────┴─────────┴──────┴─────────────┘

Return the 10 rows with the lowest age values:

dat$bottom_k(10, by = "age")

shape: (10, 13)
┌──────┬─────────┬──────────────────────────┬──────────┬───┬─────┬─────────┬──────┬─────────────┐
│ age  ┆ anaemia ┆ creatinine_phosphokinase ┆ diabetes ┆ … ┆ sex ┆ smoking ┆ time ┆ DEATH_EVENT │
│ ---  ┆ ---     ┆ ---                      ┆ ---      ┆   ┆ --- ┆ ---     ┆ ---  ┆ ---         │
│ f64  ┆ cat     ┆ i64                      ┆ cat      ┆   ┆ cat ┆ cat     ┆ i64  ┆ cat         │
╞══════╪═════════╪══════════════════════════╪══════════╪═══╪═════╪═════════╪══════╪═════════════╡
│ 40.0 ┆ 0       ┆ 478                      ┆ 1        ┆ … ┆ 1   ┆ 0       ┆ 148  ┆ 0           │
│ 40.0 ┆ 0       ┆ 582                      ┆ 1        ┆ … ┆ 1   ┆ 0       ┆ 244  ┆ 0           │
│ 40.0 ┆ 1       ┆ 101                      ┆ 0        ┆ … ┆ 0   ┆ 0       ┆ 187  ┆ 0           │
│ 40.0 ┆ 1       ┆ 129                      ┆ 0        ┆ … ┆ 1   ┆ 0       ┆ 209  ┆ 0           │
│ 40.0 ┆ 0       ┆ 90                       ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 212  ┆ 0           │
│ 40.0 ┆ 0       ┆ 624                      ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 214  ┆ 0           │
│ 40.0 ┆ 0       ┆ 244                      ┆ 0        ┆ … ┆ 0   ┆ 0       ┆ 174  ┆ 0           │
│ 41.0 ┆ 0       ┆ 148                      ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 68   ┆ 0           │
│ 42.0 ┆ 0       ┆ 64                       ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 250  ┆ 0           │
│ 42.0 ┆ 1       ┆ 250                      ┆ 1        ┆ … ┆ 0   ┆ 0       ┆ 65   ┆ 1           │
└──────┴─────────┴──────────────────────────┴──────────┴───┴─────┴─────────┴──────┴─────────────┘

By contrast, if you want to return all rows with the minimum age, you would use filter():

dat$filter(pl$col("age") == pl$col("age")$min())

shape: (7, 13)
┌──────┬─────────┬──────────────────────────┬──────────┬───┬─────┬─────────┬──────┬─────────────┐
│ age  ┆ anaemia ┆ creatinine_phosphokinase ┆ diabetes ┆ … ┆ sex ┆ smoking ┆ time ┆ DEATH_EVENT │
│ ---  ┆ ---     ┆ ---                      ┆ ---      ┆   ┆ --- ┆ ---     ┆ ---  ┆ ---         │
│ f64  ┆ cat     ┆ i64                      ┆ cat      ┆   ┆ cat ┆ cat     ┆ i64  ┆ cat         │
╞══════╪═════════╪══════════════════════════╪══════════╪═══╪═════╪═════════╪══════╪═════════════╡
│ 40.0 ┆ 0       ┆ 478                      ┆ 1        ┆ … ┆ 1   ┆ 0       ┆ 148  ┆ 0           │
│ 40.0 ┆ 0       ┆ 244                      ┆ 0        ┆ … ┆ 0   ┆ 0       ┆ 174  ┆ 0           │
│ 40.0 ┆ 1       ┆ 101                      ┆ 0        ┆ … ┆ 0   ┆ 0       ┆ 187  ┆ 0           │
│ 40.0 ┆ 1       ┆ 129                      ┆ 0        ┆ … ┆ 1   ┆ 0       ┆ 209  ┆ 0           │
│ 40.0 ┆ 0       ┆ 90                       ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 212  ┆ 0           │
│ 40.0 ┆ 0       ┆ 624                      ┆ 0        ┆ … ┆ 1   ┆ 1       ┆ 214  ┆ 0           │
│ 40.0 ┆ 0       ┆ 582                      ┆ 1        ┆ … ┆ 1   ┆ 0       ┆ 244  ┆ 0           │
└──────┴─────────┴──────────────────────────┴──────────┴───┴─────┴─────────┴──────┴─────────────┘
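
The difference between the two approaches is easiest to see when there are ties. In this small sketch on toy data (made-up ages), bottom_k() returns exactly k rows, while filter() returns every row that shares the minimum; the names toy, one_row, and all_min are illustrative.

```r
library(polars)

# Toy data for illustration only: two rows are tied at the minimum age
toy <- pl$DataFrame(
  age = c(40, 40, 41, 50),
  id = c("a", "b", "c", "d")
)

# bottom_k() returns exactly k rows, even when more rows are tied
one_row <- toy$bottom_k(1, by = "age")

# filter() returns every row whose age equals the minimum
all_min <- toy$filter(pl$col("age") == pl$col("age")$min())

one_row
all_min
```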

46.9 Relocate

To change the order of columns, use the select() method.
For example, to move the sex column after the age column:

dat$select("age", "sex", pl$all()$exclude("age", "sex"))

shape: (299, 13)
┌──────┬─────┬─────────┬─────────────────────────┬───┬──────────────┬─────────┬──────┬─────────────┐
│ age  ┆ sex ┆ anaemia ┆ creatinine_phosphokinas ┆ … ┆ serum_sodium ┆ smoking ┆ time ┆ DEATH_EVENT │
│ ---  ┆ --- ┆ ---     ┆ e                       ┆   ┆ ---          ┆ ---     ┆ ---  ┆ ---         │
│ f64  ┆ cat ┆ cat     ┆ ---                     ┆   ┆ i64          ┆ cat     ┆ i64  ┆ cat         │
│      ┆     ┆         ┆ i64                     ┆   ┆              ┆         ┆      ┆             │
╞══════╪═════╪═════════╪═════════════════════════╪═══╪══════════════╪═════════╪══════╪═════════════╡
│ 75.0 ┆ 1   ┆ 0       ┆ 582                     ┆ … ┆ 130          ┆ 0       ┆ 4    ┆ 1           │
│ 55.0 ┆ 1   ┆ 0       ┆ 7861                    ┆ … ┆ 136          ┆ 0       ┆ 6    ┆ 1           │
│ 65.0 ┆ 1   ┆ 0       ┆ 146                     ┆ … ┆ 129          ┆ 1       ┆ 7    ┆ 1           │
│ 50.0 ┆ 1   ┆ 1       ┆ 111                     ┆ … ┆ 137          ┆ 0       ┆ 7    ┆ 1           │
│ 65.0 ┆ 0   ┆ 1       ┆ 160                     ┆ … ┆ 116          ┆ 0       ┆ 8    ┆ 1           │
│ …    ┆ …   ┆ …       ┆ …                       ┆ … ┆ …            ┆ …       ┆ …    ┆ …           │
│ 62.0 ┆ 1   ┆ 0       ┆ 61                      ┆ … ┆ 143          ┆ 1       ┆ 270  ┆ 0           │
│ 55.0 ┆ 0   ┆ 0       ┆ 1820                    ┆ … ┆ 139          ┆ 0       ┆ 271  ┆ 0           │
│ 45.0 ┆ 0   ┆ 0       ┆ 2060                    ┆ … ┆ 138          ┆ 0       ┆ 278  ┆ 0           │
│ 45.0 ┆ 1   ┆ 0       ┆ 2413                    ┆ … ┆ 140          ┆ 1       ┆ 280  ┆ 0           │
│ 50.0 ┆ 1   ┆ 0       ┆ 196                     ┆ … ┆ 136          ┆ 1       ┆ 285  ┆ 0           │
└──────┴─────┴─────────┴─────────────────────────┴───┴──────────────┴─────────┴──────┴─────────────┘

46.10 Append new columns

To create a new column, use the with_columns() method. For example, to create a new column with age in days:

dat <- dat$with_columns(
  (pl$col("age") * 365)$alias("Age_days")
)
dat["Age_days"]

shape: (299, 1)
┌──────────┐
│ Age_days │
│ ---      │
│ f64      │
╞══════════╡
│ 27375.0  │
│ 20075.0  │
│ 23725.0  │
│ 18250.0  │
│ 23725.0  │
│ …        │
│ 22630.0  │
│ 20075.0  │
│ 16425.0  │
│ 16425.0  │
│ 18250.0  │
└──────────┘

46.10.1 By group

To create a new column using a grouped operation, use the over() method. For example, to create a column with demeaned serum_sodium by sex:

dat <- dat$with_columns(
  demeaned_sodium_bysex = (pl$col("serum_sodium") -
    pl$col("serum_sodium")$mean()$over("sex"))
)
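
To make the effect of over() concrete, here is a minimal sketch on toy data (made-up values) where the group means are easy to verify by hand: the mean serum_sodium is 135 for sex "0" and 136 for sex "1". The name toy is illustrative.

```r
library(polars)

# Toy data for illustration only
toy <- pl$DataFrame(
  sex = c("0", "0", "1", "1"),
  serum_sodium = c(130, 140, 135, 137)
)

# Subtract the within-group mean from each value; the row order is preserved
toy <- toy$with_columns(
  demeaned_sodium_bysex = (pl$col("serum_sodium") -
    pl$col("serum_sodium")$mean()$over("sex"))
)

# The new column holds -5, 5, -1, 1
toy
```

Unlike group_by() followed by agg(), over() keeps one row per original observation, so the result can be appended as a column.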

46.11 Rename columns

Use the rename() method to rename columns. For example, to rename the DEATH_EVENT column to Mortality:

dat <- dat$rename(
  DEATH_EVENT = "Mortality"
)

Note that the syntax is old_name = "new_name". This is the opposite of dplyr’s rename() function, which uses new_name = old_name.
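
A quick way to confirm the direction of the mapping is to rename a column in a toy DataFrame (made-up values) and inspect the resulting names. The names toy and renamed are illustrative, and the dplyr call is shown only as a comment for comparison.

```r
library(polars)

# Toy data for illustration only
toy <- pl$DataFrame(DEATH_EVENT = c(1L, 0L), age = c(75, 55))

# Polars: old_name = "new_name"
renamed <- toy$rename(DEATH_EVENT = "Mortality")
names(renamed)

# For comparison, the dplyr equivalent on a data.frame df would be:
# dplyr::rename(df, Mortality = DEATH_EVENT)
```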

46.12 Resources

R Polars Documentation

46.13 See also

data.table
dplyr