library(data.table)
options(datatable.print.class = TRUE)36 Reshaping
36.1 Long to wide using dcast()
36.1.1 Example 1: key-value pairs
Using the same example seen in Chapter 26:
dat_long <- data.frame(
Account_ID = c(8001, 8002, 8003, 8004, 8001, 8002, 8003, 8004,
8001, 8002, 8003, 8004, 8001, 8002, 8003, 8004),
Age = c(67.8017038366664, 42.9198507293701, 46.2301756642422,
39.665983196671, 67.8017038366664, 42.9198507293701,
46.2301756642422, 39.665983196671, 67.8017038366664,
42.9198507293701, 46.2301756642422, 39.665983196671,
67.8017038366664, 42.9198507293701, 46.2301756642422,
39.665983196671),
Admission = c("ED", "Planned", "Planned", "ED", "ED", "Planned",
"Planned", "ED", "ED", "Planned", "Planned", "ED", "ED", "Planned",
"Planned", "ED"),
Lab_key = c("RBC", "RBC", "RBC", "RBC", "WBC", "WBC", "WBC", "WBC",
"Hematocrit", "Hematocrit", "Hematocrit", "Hematocrit",
"Hemoglobin", "Hemoglobin", "Hemoglobin", "Hemoglobin"),
Lab_value = c(4.63449321082268, 3.34968550627897, 4.27037213597765,
4.93897736897793, 8374.22887757195, 7612.37380499927,
8759.27855519425, 6972.28096216548, 36.272693147236,
40.5716317809522, 39.9888624177955, 39.8786884058422,
12.6188444991545, 12.1739747363806, 15.1293426442183,
14.8885696185238)
)
dat_long_dt <- as.data.table(dat_long)
dat_long_dt Account_ID Age Admission Lab_key Lab_value
<num> <num> <char> <char> <num>
1: 8001 67.80170 ED RBC 4.634493
2: 8002 42.91985 Planned RBC 3.349686
3: 8003 46.23018 Planned RBC 4.270372
4: 8004 39.66598 ED RBC 4.938977
5: 8001 67.80170 ED WBC 8374.228878
6: 8002 42.91985 Planned WBC 7612.373805
7: 8003 46.23018 Planned WBC 8759.278555
8: 8004 39.66598 ED WBC 6972.280962
9: 8001 67.80170 ED Hematocrit 36.272693
10: 8002 42.91985 Planned Hematocrit 40.571632
11: 8003 46.23018 Planned Hematocrit 39.988862
12: 8004 39.66598 ED Hematocrit 39.878688
13: 8001 67.80170 ED Hemoglobin 12.618844
14: 8002 42.91985 Planned Hemoglobin 12.173975
15: 8003 46.23018 Planned Hemoglobin 15.129343
16: 8004 39.66598 ED Hemoglobin 14.888570
data.table’s long to wide procedure is defined with a convenient formula notation:
dat_long2wide_dt <- dcast(dat_long_dt,
Account_ID + Age + Admission ~ Lab_key,
value.var = "Lab_value")
dat_long2wide_dtKey: <Account_ID, Age, Admission>
Account_ID Age Admission Hematocrit Hemoglobin RBC WBC
<num> <num> <char> <num> <num> <num> <num>
1: 8001 67.80170 ED 36.27269 12.61884 4.634493 8374.229
2: 8002 42.91985 Planned 40.57163 12.17397 3.349686 7612.374
3: 8003 46.23018 Planned 39.98886 15.12934 4.270372 8759.279
4: 8004 39.66598 ED 39.87869 14.88857 4.938977 6972.281
Instead of listing all variables you can use ..., which corresponds to all variables not otherwise mentioned in the formula or in value.var:
dcast(dat_long_dt,
... ~ Lab_key,
value.var = "Lab_value")Key: <Account_ID, Age, Admission>
Account_ID Age Admission Hematocrit Hemoglobin RBC WBC
<num> <num> <char> <num> <num> <num> <num>
1: 8001 67.80170 ED 36.27269 12.61884 4.634493 8374.229
2: 8002 42.91985 Planned 40.57163 12.17397 3.349686 7612.374
3: 8003 46.23018 Planned 39.98886 15.12934 4.270372 8759.279
4: 8004 39.66598 ED 39.87869 14.88857 4.938977 6972.281
36.1.2 Example 2: Timepoints
Another simple synthetic dataset:
ID Timepoint Score
<int> <fctr> <int>
1: 1 Timepoint_A 11
2: 2 Timepoint_A 12
3: 3 Timepoint_A 13
4: 4 Timepoint_A 14
5: 1 Timepoint_B 21
6: 2 Timepoint_B 22
7: 3 Timepoint_B 23
8: 4 Timepoint_B 24
9: 1 Timepoint_C 51
10: 2 Timepoint_C 52
11: 3 Timepoint_C 53
12: 4 Timepoint_C 54
Using dcast() again to convert to wide format:
dt_long ID Timepoint Score
<int> <fctr> <int>
1: 1 Timepoint_A 11
2: 2 Timepoint_A 12
3: 3 Timepoint_A 13
4: 4 Timepoint_A 14
5: 1 Timepoint_B 21
6: 2 Timepoint_B 22
7: 3 Timepoint_B 23
8: 4 Timepoint_B 24
9: 1 Timepoint_C 51
10: 2 Timepoint_C 52
11: 3 Timepoint_C 53
12: 4 Timepoint_C 54
dcast(dt_long, ID ~ Timepoint,
value.var = "Score")Key: <ID>
ID Timepoint_A Timepoint_B Timepoint_C
<int> <int> <int> <int>
1: 1 11 21 51
2: 2 12 22 52
3: 3 13 23 53
4: 4 14 24 54
36.1.3 dcast() + aggregate
If your ID ~ Timepoint combination does not define a unique row in your input dataset, you need to specify an aggregate function.
For example, suppose you have four subjects with IDs “A”, “B”, “C”, “D” who had a couple variables measured 3 times in the AM and 3 times in the PM.
ID Timepoint Var1 Var2
<char> <char> <num> <num>
1: A AM 9.181079 20.17476
2: A AM NA 22.59615
3: A AM 9.781764 19.57141
4: A PM 10.039389 21.41549
5: A PM 9.782060 19.18831
6: A PM 9.813699 19.42839
7: B AM 9.277730 NA
8: B AM NA 19.76434
9: B AM 9.167315 21.06945
10: B PM 9.456382 20.02977
11: B PM 11.946675 21.60579
12: B PM 10.672535 19.76325
13: C AM NA 22.48464
14: C AM 8.887012 18.18991
15: C AM 10.803414 NA
16: C PM 10.257296 20.39847
17: C PM 8.609107 21.58825
18: C PM 9.556412 18.78391
19: D AM NA 19.01404
20: D AM 9.121972 19.58881
21: D AM 10.018011 20.06586
22: D PM 9.628977 NA
23: D PM 10.387528 NA
24: D PM 9.006255 20.79668
ID Timepoint Var1 Var2
To convert the above data.table to wide format and get mean AM and PM values using the fun.aggregate argument:
Key: <ID>
ID Var1_AM Var1_PM Var2_AM Var2_PM
<char> <num> <num> <num> <num>
1: A 9.481421 9.878383 20.78077 20.01073
2: B 9.222522 10.691864 20.41690 20.46627
3: C 9.845213 9.474272 20.33727 20.25688
4: D 9.569991 9.674253 19.55624 20.79668
You can apply multiple aggregating functions by passing a list to fun.aggregate:
Key: <ID>
ID Var1_mean_AM Var1_mean_PM Var2_mean_AM Var2_mean_PM Var1_max_AM
<char> <num> <num> <num> <num> <num>
1: A 9.481421 9.878383 20.78077 20.01073 9.781764
2: B 9.222522 10.691864 20.41690 20.46627 9.277730
3: C 9.845213 9.474272 20.33727 20.25688 10.803414
4: D 9.569991 9.674253 19.55624 20.79668 10.018011
Var1_max_PM Var2_max_AM Var2_max_PM Var1_min_AM Var1_min_PM Var2_min_AM
<num> <num> <num> <num> <num> <num>
1: 10.03939 22.59615 21.41549 9.181079 9.782060 19.57141
2: 11.94668 21.06945 21.60579 9.167315 9.456382 19.76434
3: 10.25730 22.48464 21.58825 8.887012 8.609107 18.18991
4: 10.38753 20.06586 20.79668 9.121972 9.006255 19.01404
Var2_min_PM
<num>
1: 19.18831
2: 19.76325
3: 18.78391
4: 20.79668
Note how na.rm = TRUE was successfully applied to all aggregating functions
36.2 Wide to long: melt()
36.2.1 Example 1: Timepoints
dt_wide <- data.table(
ID = 1:4,
Timepoint_A = 11:14,
Timepoint_B = 21:24,
Timepoint_C = 51:54)
dt_wide ID Timepoint_A Timepoint_B Timepoint_C
<int> <int> <int> <int>
1: 1 11 21 51
2: 2 12 22 52
3: 3 13 23 53
4: 4 14 24 54
Use melt() to convert from wide to long format:
ID Timepoint Score
<int> <fctr> <int>
1: 1 Timepoint_A 11
2: 2 Timepoint_A 12
3: 3 Timepoint_A 13
4: 4 Timepoint_A 14
5: 1 Timepoint_B 21
6: 2 Timepoint_B 22
7: 3 Timepoint_B 23
8: 4 Timepoint_B 24
9: 1 Timepoint_C 51
10: 2 Timepoint_C 52
11: 3 Timepoint_C 53
12: 4 Timepoint_C 54
36.2.2 Example 2: Key-value pairs
Using the same synthetic data as in Chapter 26:
set.seed(2022)
dt_wide <- data.table(
Account_ID = c(8001, 8002, 8003, 8004),
Age = rnorm(4, mean = 57, sd = 12),
RBC = rnorm(4, mean = 4.8, sd = 0.5),
WBC = rnorm(4, mean = 7250, sd = 1500),
Hematocrit = rnorm(4, mean = 40.2, sd = 4),
Hemoglobin = rnorm(4, mean = 13.6, sd = 1.5),
Admission = sample(c("ED", "Planned"), size = 4, replace = TRUE)
)
dt_wide Account_ID Age RBC WBC Hematocrit Hemoglobin Admission
<num> <num> <num> <num> <num> <num> <char>
1: 8001 67.80170 4.634493 8374.229 36.27269 12.61884 ED
2: 8002 42.91985 3.349686 7612.374 40.57163 12.17397 Planned
3: 8003 46.23018 4.270372 8759.279 39.98886 15.12934 Planned
4: 8004 39.66598 4.938977 6972.281 39.87869 14.88857 Planned
Convert from wide to long format using melt() again:
Account_ID Age Admission Lab_key Lab_value
<num> <num> <char> <fctr> <num>
1: 8001 67.80170 ED RBC 4.634493
2: 8002 42.91985 Planned RBC 3.349686
3: 8003 46.23018 Planned RBC 4.270372
4: 8004 39.66598 Planned RBC 4.938977
5: 8001 67.80170 ED WBC 8374.228878
6: 8002 42.91985 Planned WBC 7612.373805
7: 8003 46.23018 Planned WBC 8759.278555
8: 8004 39.66598 Planned WBC 6972.280962
9: 8001 67.80170 ED Hematocrit 36.272693
10: 8002 42.91985 Planned Hematocrit 40.571632
11: 8003 46.23018 Planned Hematocrit 39.988862
12: 8004 39.66598 Planned Hematocrit 39.878688
13: 8001 67.80170 ED Hemoglobin 12.618844
14: 8002 42.91985 Planned Hemoglobin 12.173975
15: 8003 46.23018 Planned Hemoglobin 15.129343
16: 8004 39.66598 Planned Hemoglobin 14.888570
or using column names:
Account_ID Age Admission Lab_key Lab_value
<num> <num> <char> <fctr> <num>
1: 8001 67.80170 ED RBC 4.634493
2: 8002 42.91985 Planned RBC 3.349686
3: 8003 46.23018 Planned RBC 4.270372
4: 8004 39.66598 Planned RBC 4.938977
5: 8001 67.80170 ED WBC 8374.228878
6: 8002 42.91985 Planned WBC 7612.373805
7: 8003 46.23018 Planned WBC 8759.278555
8: 8004 39.66598 Planned WBC 6972.280962
9: 8001 67.80170 ED Hematocrit 36.272693
10: 8002 42.91985 Planned Hematocrit 40.571632
11: 8003 46.23018 Planned Hematocrit 39.988862
12: 8004 39.66598 Planned Hematocrit 39.878688
13: 8001 67.80170 ED Hemoglobin 12.618844
14: 8002 42.91985 Planned Hemoglobin 12.173975
15: 8003 46.23018 Planned Hemoglobin 15.129343
16: 8004 39.66598 Planned Hemoglobin 14.888570
If desired, you can set the ID column as the key, which will sort the data.table by its values:
setorder(dt_long, "Account_ID")
dt_long Account_ID Age Admission Lab_key Lab_value
<num> <num> <char> <fctr> <num>
1: 8001 67.80170 ED RBC 4.634493
2: 8001 67.80170 ED WBC 8374.228878
3: 8001 67.80170 ED Hematocrit 36.272693
4: 8001 67.80170 ED Hemoglobin 12.618844
5: 8002 42.91985 Planned RBC 3.349686
6: 8002 42.91985 Planned WBC 7612.373805
7: 8002 42.91985 Planned Hematocrit 40.571632
8: 8002 42.91985 Planned Hemoglobin 12.173975
9: 8003 46.23018 Planned RBC 4.270372
10: 8003 46.23018 Planned WBC 8759.278555
11: 8003 46.23018 Planned Hematocrit 39.988862
12: 8003 46.23018 Planned Hemoglobin 15.129343
13: 8004 39.66598 Planned RBC 4.938977
14: 8004 39.66598 Planned WBC 6972.280962
15: 8004 39.66598 Planned Hematocrit 39.878688
16: 8004 39.66598 Planned Hemoglobin 14.888570