51  Introduction to ggplot2

ggplot2, created by Hadley Wickham (Wickham 2011), follows the Grammar of Graphics approach of Leland Wilkinson (Wilkinson 2012). It therefore features a very different syntax from base R graphics functions seen in Chapter 49. It is based on the grid graphics package, which, perhaps confusingly, is included in base R.

The general idea is to start by defining the data and then add or modify graphical elements in a stepwise manner, which allows one to build complex, layered visualizations. A simplified interface to ggplot graphics is provided by the qplot() function of ggplot2, which has been deprecated since ggplot2 3.4.0. This chapter focuses on the basics of the ggplot() function, which is more flexible and powerful.
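
The stepwise approach can be sketched as follows. This is a minimal illustration using a small made-up data frame, df_demo (a hypothetical stand-in, not the dataset used later in this chapter): each + adds a layer or a setting to the plot object, and nothing is drawn until the object is printed.

```r
library(ggplot2)

# Hypothetical toy data, used only for this illustration
set.seed(1)
df_demo <- data.frame(x = 1:20, y = 1:20 + rnorm(20))

# Step 1: define the data and the aesthetic mappings; no geometry yet
p_demo <- ggplot(df_demo, aes(x, y))

# Step 2: add a layer of points
p_demo <- p_demo + geom_point()

# Step 3: add a second layer, a linear fit, on top of the points
p_demo <- p_demo + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Step 4: adjust the overall appearance
p_demo <- p_demo + theme_minimal()

# Printing the object draws the plot
p_demo
```

Because the plot is an ordinary R object, intermediate versions can be stored under different names and extended separately.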

Do not try to combine base R graphics and ggplot2 graphics in the same plot. They are built on different graphics systems (the graphics and grid packages, respectively) and are not compatible.

51.1 Setup

51.1.1 Packages

Load ggplot2, along with other tidyverse packages used for data preparation:

library(ggplot2)
library(dplyr)
library(tidyr)
library(tibble)
# You can alternatively load all tidyverse packages with:
# library(tidyverse)

51.1.2 Synthetic Data

library(data.table)
set.seed(2022)
dt <- data.table(
    PID = sample(8001:9000, size = 100),
    Age = rnorm(100, mean = 33, sd = 8),
    Weight = rnorm(100, mean = 70, sd = 9),
    SysBP = rnorm(100, mean = 110, sd = 6),
    DiaBP = rnorm(100, mean = 80, sd = 6),
    Sex = factor(sample(c("Female", "Male"), size = 100, replace = TRUE))
)
dt[, SysBP := SysBP + 0.5 * Age]
dt[Sex == "Male", Weight := Weight + rnorm(.N, mean = 16, sd = 1.5)]
dt[Sex == "Male", Age := Age + rnorm(.N, mean = 6, sd = 1.8)]
dt <- as_tibble(dt)

Define a color palette, palette_, and a semi-transparent version of the same palette at 2/3 opacity, palette_a, for use in plots:

palette_ <- c("#43A4AC", "#FA9860")
palette_a <- adjustcolor(palette_, 0.666)
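
To see what adjustcolor() does here, the following sketch inspects the alpha channel of the resulting colors. The second argument is alpha.f, a factor that multiplies the alpha (opacity) channel, so smaller values give more transparent colors.

```r
# Same palette as above, repeated so that this snippet runs on its own
palette_ <- c("#43A4AC", "#FA9860")
palette_a <- adjustcolor(palette_, alpha.f = 0.666)

# adjustcolor() returns hex strings with an appended alpha component
palette_a

# Red, green, blue and alpha values on a 0-255 scale;
# the alpha row is roughly two thirds of 255
col2rgb(palette_a, alpha = TRUE)
```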

Confusingly, ggplot2 uses the aes() function, short for aesthetics, to map variables in the data to visual properties of the plot, such as x and y position, color, and fill.
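
A common source of confusion is the difference between mapping an aesthetic to a variable inside aes() and setting it to a fixed value outside aes(). The sketch below uses a small made-up data frame, df_demo (hypothetical, for illustration only), to show both.

```r
library(ggplot2)

# Hypothetical toy data, used only for this illustration
df_demo <- data.frame(
    x = c(1, 2, 3, 4),
    y = c(2, 4, 3, 5),
    group = factor(c("a", "a", "b", "b"))
)

# Mapping: color depends on the 'group' column; a legend is drawn
p_mapped <- ggplot(df_demo, aes(x, y, color = group)) +
    geom_point(size = 3)
p_mapped

# Setting: every point gets the same fixed color; no legend is drawn
p_fixed <- ggplot(df_demo, aes(x, y)) +
    geom_point(color = "#43A4AC", size = 3)
p_fixed
```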

51.2 Grammar of Graphics

The general syntax of a ggplot command is as follows:

ggplot(data, aes(x, y)) + geom_*() + theme_*() + scale_*()

Where:

  • data is the data frame or tibble containing the data to be plotted.
  • aes(x, y) defines which variables are mapped to the x and y axes, as well as any other mappings, such as color, shape, or size.
  • geom_*() specifies the type of geometric object to be used for the plot, such as points, lines, bars, etc.
  • theme_*() allows customization of the plot’s appearance, such as background, grid lines, text size, etc.
  • scale_*() allows customization of the scales used for the axes and other aesthetics, such as color scales, axis limits, etc.
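
As a concrete instance of this template, the sketch below fills in each component using a small made-up data frame (df_demo and its column names are hypothetical and only serve this example).

```r
library(ggplot2)

# Hypothetical toy data, used only for this illustration
set.seed(1)
df_demo <- data.frame(
    height = rnorm(40, mean = 170, sd = 8),
    group = factor(rep(c("A", "B"), each = 20))
)
df_demo$weight <- 0.9 * df_demo$height - 85 + rnorm(40, sd = 5)

p_demo <- ggplot(df_demo, aes(x = height, y = weight, color = group)) + # data and mappings
    geom_point() +                                                # geometric object
    theme_bw() +                                                  # overall appearance
    scale_color_manual(values = c(A = "#43A4AC", B = "#FA9860"))  # color scale
p_demo
```

The order of the geom, theme, and scale components after ggplot() does not matter here; they are collected into the plot object and applied when the plot is drawn.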

51.3 Box plot

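ggplot2 draws one box per level of a categorical variable on the x-axis. To compare SysBP and DiaBP side by side, we therefore need to convert our dataset from wide to long format, so that the measure names are in one column and their values in another.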

dt_long <- dt |> pivot_longer(
    cols = c("SysBP", "DiaBP"),
    names_to = "Measure",
    values_to = "Value"
)
dt_long
# A tibble: 200 × 6
     PID   Age Weight Sex    Measure Value
   <int> <dbl>  <dbl> <fct>  <chr>   <dbl>
 1  8228  28.5   82.2 Male   SysBP   122. 
 2  8228  28.5   82.2 Male   DiaBP    79.1
 3  8435  50.1   72.6 Female SysBP   136. 
 4  8435  50.1   72.6 Female DiaBP    84.3
 5  8718  31.0   73.0 Female SysBP   124. 
 6  8718  31.0   73.0 Female DiaBP    78.1
 7  8823  30.0   77.5 Male   SysBP   126. 
 8  8823  30.0   77.5 Male   DiaBP    80.9
 9  8843  40.5   86.7 Male   SysBP   133. 
10  8843  40.5   86.7 Male   DiaBP    68.6
# ℹ 190 more rows
p <- ggplot(dt_long, aes(Measure, Value)) + 
    geom_boxplot()
p

We can specify color and fill to change the color of the boxplot border and fill, respectively.

p <- ggplot(dt_long, aes(Measure, Value)) + 
    geom_boxplot(color = palette_[1:2], fill = palette_a[1:2])
p
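
Passing a vector of colors directly to geom_boxplot() relies on the colors lining up with the order of the boxes. An alternative sketch is to map color and fill to the grouping variable and define the colors with manual scales. The data frame df_long below is a small hypothetical stand-in for dt_long so that the snippet runs on its own.

```r
library(ggplot2)

# Hypothetical stand-in for dt_long
set.seed(1)
df_long <- data.frame(
    Measure = rep(c("DiaBP", "SysBP"), each = 50),
    Value = c(rnorm(50, mean = 80, sd = 6), rnorm(50, mean = 125, sd = 7))
)

# Named vector: colors are matched to the levels of Measure by name
pal <- c(DiaBP = "#43A4AC", SysBP = "#FA9860")

p_box <- ggplot(df_long, aes(Measure, Value, color = Measure, fill = Measure)) +
    geom_boxplot(alpha = 0.666) +          # alpha makes the box fill semi-transparent
    scale_color_manual(values = pal) +     # border colors
    scale_fill_manual(values = pal) +      # fill colors
    theme(legend.position = "none")        # the legend would repeat the x-axis, so hide it
p_box
```

With a named vector of colors, each color stays attached to its group even if the order of the groups on the axis changes.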

51.3.1 Grouped boxplot

p <- ggplot(dt, aes(x = Sex, y = Age)) + 
    geom_boxplot(colour = palette_[2:1], fill = palette_a[2:1])
p

51.4 Histogram

p <- ggplot(dt, aes(Age)) + 
    geom_histogram()
p

We can specify the number of bins to use with the bins argument and the border and fill colors with color and fill, respectively, as above. xlab() can be used to define the x-axis label.

p <- ggplot(dt, aes(Age)) + 
    geom_histogram(bins = 24, color = palette_[1], fill = palette_a[1]) +
    xlab("Age (years)")
p
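
Instead of the number of bins, the width of each bin can be given with the binwidth argument, in the units of the x variable. The sketch below uses a hypothetical data frame of ages so that it runs on its own.

```r
library(ggplot2)

# Hypothetical ages, used only for this illustration
set.seed(1)
df_demo <- data.frame(Age = rnorm(100, mean = 36, sd = 8))

# Each bar spans 2 years; boundary = 0 aligns the bin edges to multiples of 2
p_hist <- ggplot(df_demo, aes(Age)) +
    geom_histogram(binwidth = 2, boundary = 0,
        color = "#43A4AC", fill = adjustcolor("#43A4AC", 0.666)) +
    xlab("Age (years)")
p_hist
```

Only one of bins and binwidth is needed; a fixed binwidth is often easier to interpret when the variable has natural units, as with years of age.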

51.4.1 Grouped Histogram

p <- ggplot(dt, aes(x = Age, fill = Sex)) + 
    geom_histogram(bins = 24, position = "identity")
p

scale_fill_manual() can be used to define the fill colors of the bars:

p <- ggplot(dt, aes(x = Age, fill = Sex)) + 
    geom_histogram(bins = 24, position = "identity") +
    scale_fill_manual(values = palette_a[2:1])
p

position = "identity" displays overlapping bars. Alternatively, position = "dodge" displays each group's bars side by side. Finally, position = "stack" is the (unfortunate) default and results in vertically stacked bars, which can be useful for showing totals but is otherwise confusing.

p <- ggplot(dt, aes(x = Age, fill = Sex)) + 
    geom_histogram(bins = 24, position = "dodge") +
    scale_fill_manual(values = palette_a[2:1])
p

p <- ggplot(dt, aes(x = Age, fill = Sex)) + 
    geom_histogram(bins = 24, position = "stack") +
    scale_fill_manual(values = palette_a[2:1])
p

51.5 Density plot

p <- ggplot(dt, aes(x = Age)) + 
    geom_density(color = palette_[1], fill = palette_a[1])
p

51.5.1 Grouped density plot

p <- ggplot(dt, aes(x = Age, fill = Sex)) + 
    geom_density(color = NA) +
    scale_fill_manual(values = palette_a[2:1])
p

51.6 Barplot

schools <- data.frame(UCSF = 4, Stanford = 7, Penn = 12)

ggplot2 requires an explicit column in the data that defines the categorical x-axis:

schools_df <- data.frame(
    University = factor(colnames(schools),
        levels = c("UCSF", "Stanford", "Penn")),
    N_schools = as.numeric(schools[1, ])
)
ggplot(schools_df, aes(University, N_schools)) +
  geom_bar(
    stat = "identity",
    color = palette_a[1],
    fill = palette_a[1]
  )
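
geom_col() is a shorthand for geom_bar(stat = "identity"), and the wide-to-long conversion can also be done with pivot_longer() from tidyr. A minimal sketch of the same barplot under those two substitutions (the names schools_long and p_bar are introduced here for illustration):

```r
library(ggplot2)
library(tidyr)

schools <- data.frame(UCSF = 4, Stanford = 7, Penn = 12)

# One row per university, with the count in its own column
schools_long <- pivot_longer(
    schools,
    cols = everything(),
    names_to = "University",
    values_to = "N_schools"
)
# Keep the original column order on the x-axis instead of alphabetical order
schools_long$University <- factor(schools_long$University,
    levels = colnames(schools))

p_bar <- ggplot(schools_long, aes(University, N_schools)) +
    geom_col(fill = "#43A4AC")
p_bar
```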

51.7 Scatterplot

p <- ggplot(dt, aes(Age, SysBP)) +
    geom_point(color = palette_a[1])
p

51.7.1 Grouped Scatterplot

p <- ggplot(dt, aes(Age, SysBP, col = Sex)) +
    geom_point() +
    scale_color_manual(values = palette_[2:1])
p

51.8 Themes

ggplot2 includes several built-in themes that can be applied to plots to change their overall appearance. The default theme is theme_grey(), but other popular themes include theme_minimal(), theme_bw(), and theme_classic().

p <- ggplot(dt, aes(Age, SysBP)) +
  geom_point() +
  theme_minimal() # or theme_bw(), theme_classic(), etc.
p
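
Built-in themes accept a base_size argument that scales all text, and individual elements can be adjusted further with theme(). A brief sketch, using a hypothetical toy data frame:

```r
library(ggplot2)

# Hypothetical toy data, used only for this illustration
set.seed(1)
df_demo <- data.frame(
    Age = rnorm(50, mean = 36, sd = 8),
    Sex = factor(rep(c("Female", "Male"), each = 25))
)
df_demo$SysBP <- 110 + 0.5 * df_demo$Age + rnorm(50, sd = 6)

p_theme <- ggplot(df_demo, aes(Age, SysBP, color = Sex)) +
    geom_point() +
    theme_minimal(base_size = 14) +           # larger text throughout
    theme(
        legend.position = "bottom",           # move the legend below the plot
        panel.grid.minor = element_blank()    # remove minor grid lines
    )
p_theme
```

Note that theme() is placed after theme_minimal(): a complete theme added later would overwrite the individual adjustments.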

51.9 Faceting

Faceting allows you to create multiple plots based on the values of one or more categorical variables. This is useful for comparing distributions or relationships across different groups.

p <- ggplot(dt, aes(Age, SysBP)) +
  geom_point() +
  facet_wrap(~Sex)
p
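
facet_wrap() lays out the panels of a single variable, while facet_grid() arranges panels in rows and columns defined by two variables. The sketch below uses a hypothetical toy data frame with an added age-group column (AgeGroup is not part of this chapter's dataset).

```r
library(ggplot2)

# Hypothetical toy data, used only for this illustration
set.seed(1)
df_demo <- data.frame(
    Age = runif(80, min = 20, max = 60),
    Sex = factor(rep(c("Female", "Male"), times = 40))
)
df_demo$SysBP <- 110 + 0.5 * df_demo$Age + rnorm(80, sd = 6)

# Split age into two groups at 40 years
df_demo$AgeGroup <- cut(df_demo$Age, breaks = c(20, 40, 60),
    labels = c("20-40", "40-60"), include.lowest = TRUE)

# Rows are defined by Sex, columns by AgeGroup;
# scales = "free_x" lets each column use its own x-axis range
p_facet <- ggplot(df_demo, aes(Age, SysBP)) +
    geom_point() +
    facet_grid(rows = vars(Sex), cols = vars(AgeGroup), scales = "free_x")
p_facet
```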

51.10 Adding labels

p <- ggplot(dt, aes(Age, SysBP)) +
    geom_point() +
    labs(
        title = "Systolic BP vs Age",
        x = "Age (years)",
        y = "Systolic BP (mmHg)"
    )
p
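
labs() also accepts subtitle and caption, and can set the legend title of any mapped aesthetic. A brief sketch with hypothetical toy data:

```r
library(ggplot2)

# Hypothetical toy data, used only for this illustration
set.seed(1)
df_demo <- data.frame(
    Age = rnorm(50, mean = 36, sd = 8),
    Sex = factor(rep(c("Female", "Male"), each = 25))
)
df_demo$SysBP <- 110 + 0.5 * df_demo$Age + rnorm(50, sd = 6)

p_labs <- ggplot(df_demo, aes(Age, SysBP, color = Sex)) +
    geom_point() +
    labs(
        title = "Systolic BP vs Age",
        subtitle = "Synthetic data",
        caption = "Simulated values, for illustration only",
        x = "Age (years)",
        y = "Systolic BP (mmHg)",
        color = "Sex"   # legend title for the color aesthetic
    )
p_labs
```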

51.11 Save plot to file

We’ll use the grouped boxplot example from above to show how to save a plot to file with ggsave(), using PDF output as an example.

p <- ggplot(dt, aes(x = Sex, y = Age)) + 
    geom_boxplot(colour = palette_[2:1], fill = palette_a[2:1])
ggsave("Age_by_Sex_ggplot.pdf", p,
    width = 5.5, height = 5.5, scale = 1, units = "in")
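
ggsave() infers the output format from the file extension. The sketch below writes a PDF to the session's temporary directory, so that the working directory is left untouched, and checks that the file was created; the data and file name are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data and plot, used only for this illustration
df_demo <- data.frame(x = 1:10, y = (1:10)^2)
p_save <- ggplot(df_demo, aes(x, y)) + geom_line()

# Write to a temporary location; replace with your own path as needed
out_file <- file.path(tempdir(), "demo_plot.pdf")
ggsave(out_file, plot = p_save, width = 5.5, height = 5.5, units = "in")

# Confirm that the file exists
file.exists(out_file)
```

Passing the plot explicitly is a safe habit: if the plot argument is omitted, ggsave() saves the last plot that was displayed, which may not be the one intended. For raster formats such as PNG, the dpi argument controls the resolution.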

51.12 See also

51.13 Resources

© 2025 E.D. Gennatas