48  Joins

Scenario: You have received two tables with clinical data. Each table contains a column with a unique identifier (ID) plus a number of variables which are unique to each table. You want to merge them into one big table so that for each ID you have all available variables. You want to make sure that the same ID number (e.g. 108) corresponds to the same case in both datasets, but not all IDs needs to be present in both datasets.

Using the same synthetic data as in Chapter 27, we start by creating two polars DataFrames:

a <- pl$DataFrame(
  PID = 101:109,
  Hospital = c(
    "UCSF", "HUP", "Stanford",
    "Stanford", "UCSF", "HUP",
    "HUP", "Stanford", "UCSF"
  ),
  Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
  Sex = as_polars_series(factor(c(1, 1, 0, 1, 0, 0, 1, 0, 0)))
)
a
shape: (9, 4)
┌─────┬──────────┬──────┬─────┐
│ PID ┆ Hospital ┆ Age  ┆ Sex │
│ --- ┆ ---      ┆ ---  ┆ --- │
│ i32 ┆ str      ┆ f64  ┆ cat │
╞═════╪══════════╪══════╪═════╡
│ 101 ┆ UCSF     ┆ 22.0 ┆ 1   │
│ 102 ┆ HUP      ┆ 34.0 ┆ 1   │
│ 103 ┆ Stanford ┆ 41.0 ┆ 0   │
│ 104 ┆ Stanford ┆ 19.0 ┆ 1   │
│ 105 ┆ UCSF     ┆ 53.0 ┆ 0   │
│ 106 ┆ HUP      ┆ 21.0 ┆ 0   │
│ 107 ┆ HUP      ┆ 63.0 ┆ 1   │
│ 108 ┆ Stanford ┆ 22.0 ┆ 0   │
│ 109 ┆ UCSF     ┆ 19.0 ┆ 0   │
└─────┴──────────┴──────┴─────┘
b <- pl$DataFrame(
  PID = 106:112,
  V1 = c(153, 89, 112, 228,  91, 190, 101),
  Department = c("Neurology", "Radiology",
                "Emergency", "Cardiology",
                "Surgery", "Neurology", "Psychiatry")
)
b
shape: (7, 3)
┌─────┬───────┬────────────┐
│ PID ┆ V1    ┆ Department │
│ --- ┆ ---   ┆ ---        │
│ i32 ┆ f64   ┆ str        │
╞═════╪═══════╪════════════╡
│ 106 ┆ 153.0 ┆ Neurology  │
│ 107 ┆ 89.0  ┆ Radiology  │
│ 108 ┆ 112.0 ┆ Emergency  │
│ 109 ┆ 228.0 ┆ Cardiology │
│ 110 ┆ 91.0  ┆ Surgery    │
│ 111 ┆ 190.0 ┆ Neurology  │
│ 112 ┆ 101.0 ┆ Psychiatry │
└─────┴───────┴────────────┘

48.1 Inner join

using the join() method of a polars DataFrame:

ab_inner <- a$join(b, on = "PID", how = "inner")
ab_inner
shape: (4, 6)
┌─────┬──────────┬──────┬─────┬───────┬────────────┐
│ PID ┆ Hospital ┆ Age  ┆ Sex ┆ V1    ┆ Department │
│ --- ┆ ---      ┆ ---  ┆ --- ┆ ---   ┆ ---        │
│ i32 ┆ str      ┆ f64  ┆ cat ┆ f64   ┆ str        │
╞═════╪══════════╪══════╪═════╪═══════╪════════════╡
│ 106 ┆ HUP      ┆ 21.0 ┆ 0   ┆ 153.0 ┆ Neurology  │
│ 107 ┆ HUP      ┆ 63.0 ┆ 1   ┆ 89.0  ┆ Radiology  │
│ 108 ┆ Stanford ┆ 22.0 ┆ 0   ┆ 112.0 ┆ Emergency  │
│ 109 ┆ UCSF     ┆ 19.0 ┆ 0   ┆ 228.0 ┆ Cardiology │
└─────┴──────────┴──────┴─────┴───────┴────────────┘

Note that base R merge() also works on polars DataFrames, but returns a data.frame:

merge(a, b, by = "PID")
  PID Hospital Age Sex  V1 Department
1 106      HUP  21   0 153  Neurology
2 107      HUP  63   1  89  Radiology
3 108 Stanford  22   0 112  Emergency
4 109     UCSF  19   0 228 Cardiology

48.2 Outer join

An outer join can be performed by specifying how = "full" and coalesce = TRUE in the join() method:

ab_outer <- a$join(b, on = "PID", how = "full", coalesce = TRUE)
ab_outer
shape: (12, 6)
┌─────┬──────────┬──────┬──────┬───────┬────────────┐
│ PID ┆ Hospital ┆ Age  ┆ Sex  ┆ V1    ┆ Department │
│ --- ┆ ---      ┆ ---  ┆ ---  ┆ ---   ┆ ---        │
│ i32 ┆ str      ┆ f64  ┆ cat  ┆ f64   ┆ str        │
╞═════╪══════════╪══════╪══════╪═══════╪════════════╡
│ 101 ┆ UCSF     ┆ 22.0 ┆ 1    ┆ null  ┆ null       │
│ 102 ┆ HUP      ┆ 34.0 ┆ 1    ┆ null  ┆ null       │
│ 103 ┆ Stanford ┆ 41.0 ┆ 0    ┆ null  ┆ null       │
│ 104 ┆ Stanford ┆ 19.0 ┆ 1    ┆ null  ┆ null       │
│ 105 ┆ UCSF     ┆ 53.0 ┆ 0    ┆ null  ┆ null       │
│ …   ┆ …        ┆ …    ┆ …    ┆ …     ┆ …          │
│ 108 ┆ Stanford ┆ 22.0 ┆ 0    ┆ 112.0 ┆ Emergency  │
│ 109 ┆ UCSF     ┆ 19.0 ┆ 0    ┆ 228.0 ┆ Cardiology │
│ 110 ┆ null     ┆ null ┆ null ┆ 91.0  ┆ Surgery    │
│ 111 ┆ null     ┆ null ┆ null ┆ 190.0 ┆ Neurology  │
│ 112 ┆ null     ┆ null ┆ null ┆ 101.0 ┆ Psychiatry │
└─────┴──────────┴──────┴──────┴───────┴────────────┘

Without coalesce = TRUE, a PID_right column would be created:

ab_outer_no_coalesce <- a$join(b, on = "PID", how = "full", coalesce = FALSE)
ab_outer_no_coalesce
shape: (12, 7)
┌──────┬──────────┬──────┬──────┬───────────┬───────┬────────────┐
│ PID  ┆ Hospital ┆ Age  ┆ Sex  ┆ PID_right ┆ V1    ┆ Department │
│ ---  ┆ ---      ┆ ---  ┆ ---  ┆ ---       ┆ ---   ┆ ---        │
│ i32  ┆ str      ┆ f64  ┆ cat  ┆ i32       ┆ f64   ┆ str        │
╞══════╪══════════╪══════╪══════╪═══════════╪═══════╪════════════╡
│ 101  ┆ UCSF     ┆ 22.0 ┆ 1    ┆ null      ┆ null  ┆ null       │
│ 102  ┆ HUP      ┆ 34.0 ┆ 1    ┆ null      ┆ null  ┆ null       │
│ 103  ┆ Stanford ┆ 41.0 ┆ 0    ┆ null      ┆ null  ┆ null       │
│ 104  ┆ Stanford ┆ 19.0 ┆ 1    ┆ null      ┆ null  ┆ null       │
│ 105  ┆ UCSF     ┆ 53.0 ┆ 0    ┆ null      ┆ null  ┆ null       │
│ …    ┆ …        ┆ …    ┆ …    ┆ …         ┆ …     ┆ …          │
│ 108  ┆ Stanford ┆ 22.0 ┆ 0    ┆ 108       ┆ 112.0 ┆ Emergency  │
│ 109  ┆ UCSF     ┆ 19.0 ┆ 0    ┆ 109       ┆ 228.0 ┆ Cardiology │
│ null ┆ null     ┆ null ┆ null ┆ 110       ┆ 91.0  ┆ Surgery    │
│ null ┆ null     ┆ null ┆ null ┆ 111       ┆ 190.0 ┆ Neurology  │
│ null ┆ null     ┆ null ┆ null ┆ 112       ┆ 101.0 ┆ Psychiatry │
└──────┴──────────┴──────┴──────┴───────────┴───────┴────────────┘

Note that the resulting DataFrame contains all cases present in either input datasets and missing values are filled in where data is not available using null.

48.3 Left outer join

You can perform a left outer join by specifying how = "left" in the join() method:

ab_left <- a$join(b, on = "PID", how = "left")
ab_left
shape: (9, 6)
┌─────┬──────────┬──────┬─────┬───────┬────────────┐
│ PID ┆ Hospital ┆ Age  ┆ Sex ┆ V1    ┆ Department │
│ --- ┆ ---      ┆ ---  ┆ --- ┆ ---   ┆ ---        │
│ i32 ┆ str      ┆ f64  ┆ cat ┆ f64   ┆ str        │
╞═════╪══════════╪══════╪═════╪═══════╪════════════╡
│ 101 ┆ UCSF     ┆ 22.0 ┆ 1   ┆ null  ┆ null       │
│ 102 ┆ HUP      ┆ 34.0 ┆ 1   ┆ null  ┆ null       │
│ 103 ┆ Stanford ┆ 41.0 ┆ 0   ┆ null  ┆ null       │
│ 104 ┆ Stanford ┆ 19.0 ┆ 1   ┆ null  ┆ null       │
│ 105 ┆ UCSF     ┆ 53.0 ┆ 0   ┆ null  ┆ null       │
│ 106 ┆ HUP      ┆ 21.0 ┆ 0   ┆ 153.0 ┆ Neurology  │
│ 107 ┆ HUP      ┆ 63.0 ┆ 1   ┆ 89.0  ┆ Radiology  │
│ 108 ┆ Stanford ┆ 22.0 ┆ 0   ┆ 112.0 ┆ Emergency  │
│ 109 ┆ UCSF     ┆ 19.0 ┆ 0   ┆ 228.0 ┆ Cardiology │
└─────┴──────────┴──────┴─────┴───────┴────────────┘

Note that the resulting DataFrame contains all cases present in the left input dataset (i.e. the one defined first in the arguments) only.

48.4 Right outer join

You can perform a right outer join by specifying how = "right" in the join() method:

ab_right <- a$join(b, on = "PID", how = "right")
ab_right
shape: (7, 6)
┌──────────┬──────┬──────┬─────┬───────┬────────────┐
│ Hospital ┆ Age  ┆ Sex  ┆ PID ┆ V1    ┆ Department │
│ ---      ┆ ---  ┆ ---  ┆ --- ┆ ---   ┆ ---        │
│ str      ┆ f64  ┆ cat  ┆ i32 ┆ f64   ┆ str        │
╞══════════╪══════╪══════╪═════╪═══════╪════════════╡
│ HUP      ┆ 21.0 ┆ 0    ┆ 106 ┆ 153.0 ┆ Neurology  │
│ HUP      ┆ 63.0 ┆ 1    ┆ 107 ┆ 89.0  ┆ Radiology  │
│ Stanford ┆ 22.0 ┆ 0    ┆ 108 ┆ 112.0 ┆ Emergency  │
│ UCSF     ┆ 19.0 ┆ 0    ┆ 109 ┆ 228.0 ┆ Cardiology │
│ null     ┆ null ┆ null ┆ 110 ┆ 91.0  ┆ Surgery    │
│ null     ┆ null ┆ null ┆ 111 ┆ 190.0 ┆ Neurology  │
│ null     ┆ null ┆ null ┆ 112 ┆ 101.0 ┆ Psychiatry │
└──────────┴──────┴──────┴─────┴───────┴────────────┘

48.5 Specifying columns

If the ID columns in the two input DataFrames have different names, you can specify them using the left_on and right_on arguments:

a <- pl$DataFrame(
  PID = 101:109,
  Hospital = c(
    "UCSF", "HUP", "Stanford",
    "Stanford", "UCSF", "HUP",
    "HUP", "Stanford", "UCSF"
  ),
  Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
  Sex = factor(c(1, 1, 0, 1, 0, 0, 1, 0, 0))
)
a
shape: (9, 4)
┌─────┬──────────┬──────┬─────┐
│ PID ┆ Hospital ┆ Age  ┆ Sex │
│ --- ┆ ---      ┆ ---  ┆ --- │
│ i32 ┆ str      ┆ f64  ┆ cat │
╞═════╪══════════╪══════╪═════╡
│ 101 ┆ UCSF     ┆ 22.0 ┆ 1   │
│ 102 ┆ HUP      ┆ 34.0 ┆ 1   │
│ 103 ┆ Stanford ┆ 41.0 ┆ 0   │
│ 104 ┆ Stanford ┆ 19.0 ┆ 1   │
│ 105 ┆ UCSF     ┆ 53.0 ┆ 0   │
│ 106 ┆ HUP      ┆ 21.0 ┆ 0   │
│ 107 ┆ HUP      ┆ 63.0 ┆ 1   │
│ 108 ┆ Stanford ┆ 22.0 ┆ 0   │
│ 109 ┆ UCSF     ┆ 19.0 ┆ 0   │
└─────┴──────────┴──────┴─────┘
b <- pl$DataFrame(
  PatientID = 106:112,
  V1 = c(153, 89, 112, 228, 91, 190, 101),
  Department = c(
    "Neurology", "Radiology",
    "Emergency", "Cardiology",
    "Surgery", "Neurology", "Psychiatry"
  )
)
b
shape: (7, 3)
┌───────────┬───────┬────────────┐
│ PatientID ┆ V1    ┆ Department │
│ ---       ┆ ---   ┆ ---        │
│ i32       ┆ f64   ┆ str        │
╞═══════════╪═══════╪════════════╡
│ 106       ┆ 153.0 ┆ Neurology  │
│ 107       ┆ 89.0  ┆ Radiology  │
│ 108       ┆ 112.0 ┆ Emergency  │
│ 109       ┆ 228.0 ┆ Cardiology │
│ 110       ┆ 91.0  ┆ Surgery    │
│ 111       ┆ 190.0 ┆ Neurology  │
│ 112       ┆ 101.0 ┆ Psychiatry │
└───────────┴───────┴────────────┘
ab_inner <- a$join(b, left_on = "PID", right_on = "PatientID", how = "inner")
ab_inner
shape: (4, 6)
┌─────┬──────────┬──────┬─────┬───────┬────────────┐
│ PID ┆ Hospital ┆ Age  ┆ Sex ┆ V1    ┆ Department │
│ --- ┆ ---      ┆ ---  ┆ --- ┆ ---   ┆ ---        │
│ i32 ┆ str      ┆ f64  ┆ cat ┆ f64   ┆ str        │
╞═════╪══════════╪══════╪═════╪═══════╪════════════╡
│ 106 ┆ HUP      ┆ 21.0 ┆ 0   ┆ 153.0 ┆ Neurology  │
│ 107 ┆ HUP      ┆ 63.0 ┆ 1   ┆ 89.0  ┆ Radiology  │
│ 108 ┆ Stanford ┆ 22.0 ┆ 0   ┆ 112.0 ┆ Emergency  │
│ 109 ┆ UCSF     ┆ 19.0 ┆ 0   ┆ 228.0 ┆ Cardiology │
└─────┴──────────┴──────┴─────┴───────┴────────────┘

48.6 See also

© 2025 E.D. Gennatas