R includes a number of commands to apply functions on splits of your data. aggregate() is a powerful tool to perform such “group-by” operations. The function accepts either:
a formula as the first argument and a data.frame passed to the data argument, or
an R object (vector, data.frame, list) as the first argument and a list of one or more factors passed to the by argument.
We shall see how to perform each operation below with each approach. The formula interface might be easier to work with interactively on the console. Note that while you can programmatically create a formula, it is easier to use vector inputs when calling aggregate() programmatically.
For this example, we shall use the penguins data from the palmerpenguins package (structure as printed by str(penguins)):
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
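To illustrate the earlier remark about calling aggregate() programmatically, here is a minimal sketch. It assumes a small made-up data.frame named toy (hypothetical, not the penguins data) and column names held in the hypothetical variables value_cols and group_cols. With the object interface the character vectors index the data.frame directly, whereas the formula has to be pasted together from strings first:

```r
# Hypothetical toy data standing in for penguins (values are made up)
toy <- data.frame(
  species = factor(c("A", "A", "B", "B")),
  bill    = c(1, 3, 5, 7),
  flipper = c(10, 20, 30, 50)
)

# Column names held in variables, as they would be inside a function or loop
value_cols <- c("bill", "flipper")
group_cols <- "species"

# Object interface: index the data.frame with the character vectors directly
by_objects <- aggregate(toy[value_cols], by = toy[group_cols], FUN = mean)

# Formula interface: the formula has to be assembled from strings first
lhs <- sprintf("cbind(%s)", paste(value_cols, collapse = ", "))
rhs <- paste(group_cols, collapse = " + ")
fml <- as.formula(paste(lhs, "~", rhs))
by_formula <- aggregate(fml, data = toy, FUN = mean)

by_objects
by_formula
```

Both calls should return one row per species with the group means of bill and flipper; the object version simply needs less string handling.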
Single variable by single grouping
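Note that the formula method defaults to na.action = na.omit, which drops every row with a missing value in any of the variables used before the function is applied.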
Using the formula interface:
aggregate(bill_length_mm ~ species,
          data = penguins,
          mean, na.rm = TRUE)
species bill_length_mm
1 Adelie 38.79139
2 Chinstrap 48.83382
3 Gentoo 47.50488
Using R objects directly:
aggregate(penguins$bill_length_mm,
          by = list(penguins$species),
          mean, na.rm = TRUE)
Group.1 x
1 Adelie 38.79139
2 Chinstrap 48.83382
3 Gentoo 47.50488
Note that, unlike the formula notation, if your input is an unnamed vector, the output columns do not carry the variable names and get the default names Group.1 and x instead.
If instead of passing a vector, you pass a data.frame or list with one or more named elements, the output includes the names:
aggregate(penguins["bill_length_mm"],
          by = penguins["species"],
          mean, na.rm = TRUE)
species bill_length_mm
1 Adelie 38.79139
2 Chinstrap 48.83382
3 Gentoo 47.50488
Creating a list instead of indexing the given data.frame also allows you to set custom names (note in the output below that the non-syntactic name `Bill length` is converted to Bill.length):
aggregate(list(`Bill length` = penguins$bill_length_mm),
          by = list(Species = penguins$species),
          mean, na.rm = TRUE)
Species Bill.length
1 Adelie 38.79139
2 Chinstrap 48.83382
3 Gentoo 47.50488
Multiple variables by single grouping
Formula notation:
aggregate(cbind(bill_length_mm, flipper_length_mm) ~ species,
          data = penguins,
          mean)
species bill_length_mm flipper_length_mm
1 Adelie 38.79139 189.9536
2 Chinstrap 48.83382 195.8235
3 Gentoo 47.50488 217.1870
Objects:
aggregate(penguins[, c("bill_length_mm", "flipper_length_mm")],
          by = list(Species = penguins$species),
          mean, na.rm = TRUE)
Species bill_length_mm flipper_length_mm
1 Adelie 38.79139 189.9536
2 Chinstrap 48.83382 195.8235
3 Gentoo 47.50488 217.1870
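The two outputs above agree because, in the penguins data shown, the two variables are missing in the same rows. In general the interfaces can differ: the formula method's na.omit removes a whole row if any variable on it is missing, while na.rm = TRUE in the object interface is applied to each column separately. The following is a minimal sketch on a made-up data.frame named toy (hypothetical values, chosen so that only x has a missing value):

```r
# Hypothetical toy data (values are made up): x has one NA, y has none
toy <- data.frame(
  g = factor(c("a", "a", "b", "b")),
  x = c(1, NA, 3, 5),
  y = c(10, 20, 30, 40)
)

# Formula interface: na.omit removes row 2 entirely, so y = 20 is dropped too
by_formula <- aggregate(cbind(x, y) ~ g, data = toy, FUN = mean)

# Object interface: na.rm = TRUE is handled by mean() for each column separately
by_objects <- aggregate(toy[c("x", "y")], by = toy["g"],
                        FUN = mean, na.rm = TRUE)

# Formula interface keeping all rows and leaving NA handling to mean()
by_pass <- aggregate(cbind(x, y) ~ g, data = toy, FUN = mean,
                     na.rm = TRUE, na.action = na.pass)

by_formula$y  # group "a" is based on the single complete row
by_objects$y  # group "a" is based on both rows
by_pass$y     # matches the object interface
```

If per-column handling of missing values is what you want from the formula interface, passing na.action = na.pass together with na.rm = TRUE is one way to get it.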
Single variable by multiple groups
Formula notation:
aggregate(bill_length_mm ~ species + island, data = penguins, mean)
species island bill_length_mm
1 Adelie Biscoe 38.97500
2 Gentoo Biscoe 47.50488
3 Adelie Dream 38.50179
4 Chinstrap Dream 48.83382
5 Adelie Torgersen 38.95098
Objects:
aggregate(penguins["bill_length_mm"],
          by = list(Species = penguins$species,
                    Island = penguins$island),
          mean, na.rm = TRUE)
Species Island bill_length_mm
1 Adelie Biscoe 38.97500
2 Gentoo Biscoe 47.50488
3 Adelie Dream 38.50179
4 Chinstrap Dream 48.83382
5 Adelie Torgersen 38.95098
Multiple variables by multiple groupings
Formula notation:
aggregate(cbind(bill_length_mm, flipper_length_mm) ~ species + island,
          data = penguins, mean)
species island bill_length_mm flipper_length_mm
1 Adelie Biscoe 38.97500 188.7955
2 Gentoo Biscoe 47.50488 217.1870
3 Adelie Dream 38.50179 189.7321
4 Chinstrap Dream 48.83382 195.8235
5 Adelie Torgersen 38.95098 191.1961
Objects:
aggregate(penguins[, c("bill_length_mm", "flipper_length_mm")],
          by = list(Species = penguins$species,
                    Island = penguins$island),
          mean, na.rm = TRUE)
Species Island bill_length_mm flipper_length_mm
1 Adelie Biscoe 38.97500 188.7955
2 Gentoo Biscoe 47.50488 217.1870
3 Adelie Dream 38.50179 189.7321
4 Chinstrap Dream 48.83382 195.8235
5 Adelie Torgersen 38.95098 191.1961
Using with()
R’s with() allows you to use expressions of the form with(data, expression). data can be a data.frame, list, or environment, and within the expression you can refer to any elements of data directly by their name.
For example, with(df, expression) means you can use the data.frame’s column names directly within the expression without the need to use df[["column_name"]] or df$column_name:
with(penguins,
     aggregate(list(`Bill length` = bill_length_mm),
               by = list(Species = species),
               mean, na.rm = TRUE))
Species Bill.length
1 Adelie 38.79139
2 Chinstrap 48.83382
3 Gentoo 47.50488
See also
tapply() for an alternative method of applying a function on subsets of a single variable (probably faster).
For large datasets, it is recommended to use data.table for fast group-by data summarization.
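As a small illustration of the tapply() alternative mentioned above, the sketch below computes a group mean both ways. It uses a made-up data.frame named toy (hypothetical values, not the penguins data). The main practical difference is the shape of the result: tapply() returns a named array, while aggregate() returns a data.frame.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(
  species = factor(c("A", "A", "B", "B")),
  bill    = c(1, NA, 5, 7)
)

# tapply() returns an array named by the levels of the grouping factor
bill_means <- tapply(toy$bill, toy$species, mean, na.rm = TRUE)

# The same summary with aggregate(), which returns a data.frame
bill_df <- aggregate(toy["bill"], by = toy["species"],
                     FUN = mean, na.rm = TRUE)

bill_means
bill_df
```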