48 Joins
Scenario: You have received two tables with clinical data. Each table contains a column with a unique identifier (ID) plus a number of variables which are unique to each table. You want to merge them into one big table so that for each ID you have all available variables. You want to make sure that the same ID number (e.g. 108) corresponds to the same case in both datasets, but not all IDs needs to be present in both datasets.
Using the same synthetic data as in Chapter 27, we start by creating two polars DataFrames:
a <- pl$DataFrame(
PID = 101:109,
Hospital = c(
"UCSF", "HUP", "Stanford",
"Stanford", "UCSF", "HUP",
"HUP", "Stanford", "UCSF"
),
Age = c(22, 34, 41, 19, 53, 21, 63, 22, 19),
Sex = as_polars_series(factor(c(1, 1, 0, 1, 0, 0, 1, 0, 0)))
)
ashape: (9, 4)
┌─────┬──────────┬──────┬─────┐
│ PID ┆ Hospital ┆ Age ┆ Sex │
│ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ f64 ┆ cat │
╞═════╪══════════╪══════╪═════╡
│ 101 ┆ UCSF ┆ 22.0 ┆ 1 │
│ 102 ┆ HUP ┆ 34.0 ┆ 1 │
│ 103 ┆ Stanford ┆ 41.0 ┆ 0 │
│ 104 ┆ Stanford ┆ 19.0 ┆ 1 │
│ 105 ┆ UCSF ┆ 53.0 ┆ 0 │
│ 106 ┆ HUP ┆ 21.0 ┆ 0 │
│ 107 ┆ HUP ┆ 63.0 ┆ 1 │
│ 108 ┆ Stanford ┆ 22.0 ┆ 0 │
│ 109 ┆ UCSF ┆ 19.0 ┆ 0 │
└─────┴──────────┴──────┴─────┘
shape: (7, 3)
┌─────┬───────┬────────────┐
│ PID ┆ V1 ┆ Department │
│ --- ┆ --- ┆ --- │
│ i32 ┆ f64 ┆ str │
╞═════╪═══════╪════════════╡
│ 106 ┆ 153.0 ┆ Neurology │
│ 107 ┆ 89.0 ┆ Radiology │
│ 108 ┆ 112.0 ┆ Emergency │
│ 109 ┆ 228.0 ┆ Cardiology │
│ 110 ┆ 91.0 ┆ Surgery │
│ 111 ┆ 190.0 ┆ Neurology │
│ 112 ┆ 101.0 ┆ Psychiatry │
└─────┴───────┴────────────┘
48.1 Inner join
using the join() method of a polars DataFrame:
ab_inner <- a$join(b, on = "PID", how = "inner")
ab_innershape: (4, 6)
┌─────┬──────────┬──────┬─────┬───────┬────────────┐
│ PID ┆ Hospital ┆ Age ┆ Sex ┆ V1 ┆ Department │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ f64 ┆ cat ┆ f64 ┆ str │
╞═════╪══════════╪══════╪═════╪═══════╪════════════╡
│ 106 ┆ HUP ┆ 21.0 ┆ 0 ┆ 153.0 ┆ Neurology │
│ 107 ┆ HUP ┆ 63.0 ┆ 1 ┆ 89.0 ┆ Radiology │
│ 108 ┆ Stanford ┆ 22.0 ┆ 0 ┆ 112.0 ┆ Emergency │
│ 109 ┆ UCSF ┆ 19.0 ┆ 0 ┆ 228.0 ┆ Cardiology │
└─────┴──────────┴──────┴─────┴───────┴────────────┘
Note that base R merge() also works on polars DataFrames, but returns a data.frame:
merge(a, b, by = "PID") PID Hospital Age Sex V1 Department
1 106 HUP 21 0 153 Neurology
2 107 HUP 63 1 89 Radiology
3 108 Stanford 22 0 112 Emergency
4 109 UCSF 19 0 228 Cardiology
48.2 Outer join
An outer join can be performed by specifying how = "full" and coalesce = TRUE in the join() method:
ab_outer <- a$join(b, on = "PID", how = "full", coalesce = TRUE)
ab_outershape: (12, 6)
┌─────┬──────────┬──────┬──────┬───────┬────────────┐
│ PID ┆ Hospital ┆ Age ┆ Sex ┆ V1 ┆ Department │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ f64 ┆ cat ┆ f64 ┆ str │
╞═════╪══════════╪══════╪══════╪═══════╪════════════╡
│ 101 ┆ UCSF ┆ 22.0 ┆ 1 ┆ null ┆ null │
│ 102 ┆ HUP ┆ 34.0 ┆ 1 ┆ null ┆ null │
│ 103 ┆ Stanford ┆ 41.0 ┆ 0 ┆ null ┆ null │
│ 104 ┆ Stanford ┆ 19.0 ┆ 1 ┆ null ┆ null │
│ 105 ┆ UCSF ┆ 53.0 ┆ 0 ┆ null ┆ null │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 108 ┆ Stanford ┆ 22.0 ┆ 0 ┆ 112.0 ┆ Emergency │
│ 109 ┆ UCSF ┆ 19.0 ┆ 0 ┆ 228.0 ┆ Cardiology │
│ 110 ┆ null ┆ null ┆ null ┆ 91.0 ┆ Surgery │
│ 111 ┆ null ┆ null ┆ null ┆ 190.0 ┆ Neurology │
│ 112 ┆ null ┆ null ┆ null ┆ 101.0 ┆ Psychiatry │
└─────┴──────────┴──────┴──────┴───────┴────────────┘
Without coalesce = TRUE, a PID_right column would be created:
ab_outer_no_coalesce <- a$join(b, on = "PID", how = "full", coalesce = FALSE)
ab_outer_no_coalesceshape: (12, 7)
┌──────┬──────────┬──────┬──────┬───────────┬───────┬────────────┐
│ PID ┆ Hospital ┆ Age ┆ Sex ┆ PID_right ┆ V1 ┆ Department │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ f64 ┆ cat ┆ i32 ┆ f64 ┆ str │
╞══════╪══════════╪══════╪══════╪═══════════╪═══════╪════════════╡
│ 101 ┆ UCSF ┆ 22.0 ┆ 1 ┆ null ┆ null ┆ null │
│ 102 ┆ HUP ┆ 34.0 ┆ 1 ┆ null ┆ null ┆ null │
│ 103 ┆ Stanford ┆ 41.0 ┆ 0 ┆ null ┆ null ┆ null │
│ 104 ┆ Stanford ┆ 19.0 ┆ 1 ┆ null ┆ null ┆ null │
│ 105 ┆ UCSF ┆ 53.0 ┆ 0 ┆ null ┆ null ┆ null │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 108 ┆ Stanford ┆ 22.0 ┆ 0 ┆ 108 ┆ 112.0 ┆ Emergency │
│ 109 ┆ UCSF ┆ 19.0 ┆ 0 ┆ 109 ┆ 228.0 ┆ Cardiology │
│ null ┆ null ┆ null ┆ null ┆ 110 ┆ 91.0 ┆ Surgery │
│ null ┆ null ┆ null ┆ null ┆ 111 ┆ 190.0 ┆ Neurology │
│ null ┆ null ┆ null ┆ null ┆ 112 ┆ 101.0 ┆ Psychiatry │
└──────┴──────────┴──────┴──────┴───────────┴───────┴────────────┘
Note that the resulting DataFrame contains all cases present in either input datasets and missing values are filled in where data is not available using null.
48.3 Left outer join
You can perform a left outer join by specifying how = "left" in the join() method:
ab_left <- a$join(b, on = "PID", how = "left")
ab_leftshape: (9, 6)
┌─────┬──────────┬──────┬─────┬───────┬────────────┐
│ PID ┆ Hospital ┆ Age ┆ Sex ┆ V1 ┆ Department │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ f64 ┆ cat ┆ f64 ┆ str │
╞═════╪══════════╪══════╪═════╪═══════╪════════════╡
│ 101 ┆ UCSF ┆ 22.0 ┆ 1 ┆ null ┆ null │
│ 102 ┆ HUP ┆ 34.0 ┆ 1 ┆ null ┆ null │
│ 103 ┆ Stanford ┆ 41.0 ┆ 0 ┆ null ┆ null │
│ 104 ┆ Stanford ┆ 19.0 ┆ 1 ┆ null ┆ null │
│ 105 ┆ UCSF ┆ 53.0 ┆ 0 ┆ null ┆ null │
│ 106 ┆ HUP ┆ 21.0 ┆ 0 ┆ 153.0 ┆ Neurology │
│ 107 ┆ HUP ┆ 63.0 ┆ 1 ┆ 89.0 ┆ Radiology │
│ 108 ┆ Stanford ┆ 22.0 ┆ 0 ┆ 112.0 ┆ Emergency │
│ 109 ┆ UCSF ┆ 19.0 ┆ 0 ┆ 228.0 ┆ Cardiology │
└─────┴──────────┴──────┴─────┴───────┴────────────┘
Note that the resulting DataFrame contains all cases present in the left input dataset (i.e. the one defined first in the arguments) only.
48.4 Right outer join
You can perform a right outer join by specifying how = "right" in the join() method:
ab_right <- a$join(b, on = "PID", how = "right")
ab_rightshape: (7, 6)
┌──────────┬──────┬──────┬─────┬───────┬────────────┐
│ Hospital ┆ Age ┆ Sex ┆ PID ┆ V1 ┆ Department │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ cat ┆ i32 ┆ f64 ┆ str │
╞══════════╪══════╪══════╪═════╪═══════╪════════════╡
│ HUP ┆ 21.0 ┆ 0 ┆ 106 ┆ 153.0 ┆ Neurology │
│ HUP ┆ 63.0 ┆ 1 ┆ 107 ┆ 89.0 ┆ Radiology │
│ Stanford ┆ 22.0 ┆ 0 ┆ 108 ┆ 112.0 ┆ Emergency │
│ UCSF ┆ 19.0 ┆ 0 ┆ 109 ┆ 228.0 ┆ Cardiology │
│ null ┆ null ┆ null ┆ 110 ┆ 91.0 ┆ Surgery │
│ null ┆ null ┆ null ┆ 111 ┆ 190.0 ┆ Neurology │
│ null ┆ null ┆ null ┆ 112 ┆ 101.0 ┆ Psychiatry │
└──────────┴──────┴──────┴─────┴───────┴────────────┘
48.5 Specifying columns
If the ID columns in the two input DataFrames have different names, you can specify them using the left_on and right_on arguments:
shape: (9, 4)
┌─────┬──────────┬──────┬─────┐
│ PID ┆ Hospital ┆ Age ┆ Sex │
│ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ f64 ┆ cat │
╞═════╪══════════╪══════╪═════╡
│ 101 ┆ UCSF ┆ 22.0 ┆ 1 │
│ 102 ┆ HUP ┆ 34.0 ┆ 1 │
│ 103 ┆ Stanford ┆ 41.0 ┆ 0 │
│ 104 ┆ Stanford ┆ 19.0 ┆ 1 │
│ 105 ┆ UCSF ┆ 53.0 ┆ 0 │
│ 106 ┆ HUP ┆ 21.0 ┆ 0 │
│ 107 ┆ HUP ┆ 63.0 ┆ 1 │
│ 108 ┆ Stanford ┆ 22.0 ┆ 0 │
│ 109 ┆ UCSF ┆ 19.0 ┆ 0 │
└─────┴──────────┴──────┴─────┘
shape: (7, 3)
┌───────────┬───────┬────────────┐
│ PatientID ┆ V1 ┆ Department │
│ --- ┆ --- ┆ --- │
│ i32 ┆ f64 ┆ str │
╞═══════════╪═══════╪════════════╡
│ 106 ┆ 153.0 ┆ Neurology │
│ 107 ┆ 89.0 ┆ Radiology │
│ 108 ┆ 112.0 ┆ Emergency │
│ 109 ┆ 228.0 ┆ Cardiology │
│ 110 ┆ 91.0 ┆ Surgery │
│ 111 ┆ 190.0 ┆ Neurology │
│ 112 ┆ 101.0 ┆ Psychiatry │
└───────────┴───────┴────────────┘
ab_inner <- a$join(b, left_on = "PID", right_on = "PatientID", how = "inner")
ab_innershape: (4, 6)
┌─────┬──────────┬──────┬─────┬───────┬────────────┐
│ PID ┆ Hospital ┆ Age ┆ Sex ┆ V1 ┆ Department │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ f64 ┆ cat ┆ f64 ┆ str │
╞═════╪══════════╪══════╪═════╪═══════╪════════════╡
│ 106 ┆ HUP ┆ 21.0 ┆ 0 ┆ 153.0 ┆ Neurology │
│ 107 ┆ HUP ┆ 63.0 ┆ 1 ┆ 89.0 ┆ Radiology │
│ 108 ┆ Stanford ┆ 22.0 ┆ 0 ┆ 112.0 ┆ Emergency │
│ 109 ┆ UCSF ┆ 19.0 ┆ 0 ┆ 228.0 ┆ Cardiology │
└─────┴──────────┴──────┴─────┴───────┴────────────┘