Distributions

with {ggplot2} and {plotnine}

Python

How do SWAC schools compare to other schools for cost? A question like this may best be answered by considering the shape of data, not just by finding one number. Histograms, boxplots, and density curves show us whether scores cluster, stretch, or skew. They allow us to see whether a number is typical, on the high or low end, or far outside the norm. From chapter 7 through chapter 9, Wilke’s Fundamentals of Data Visualization discusses important considerations for showing these distributions and making clear their message. Here, we’ll go further and implement them in code.

These code explanations will be prepared using the scorecard and schools data sets from {scorecard-db}, drawn from data originally collated by the U.S. Department of Education, along with the universities data set from {swac}.

Prepare datasets in R and Python

R
Python

library(tidyverse)
library(swac)

if (!file.exists("scorecard.rds")) {
  download.file("https://github.com/gadenbuie/scorecard-db/raw/refs/heads/main/data/tidy/scorecard.rds", "scorecard.rds")
}

if (!file.exists("school.rds")) {
  download.file("https://github.com/gadenbuie/scorecard-db/raw/refs/heads/main/data/tidy/school.rds", "school.rds")
}

scorecard <- readRDS("scorecard.rds")
school <- readRDS("school.rds") |> 
  mutate(control = case_match(
    control,
    "Public" ~ "Public",
    "Nonprofit" ~ "Private (nonprofit)",
    "For-profit" ~ "Private (for profit)"
  ))

higher_ed <- school |> 
  left_join(scorecard)

higher_ed_18 <- higher_ed |> 
  filter(academic_year == "2018-19")

write_csv(higher_ed, "data/higher_ed.csv")
write_csv(universities, "data/universities.csv")

from plotnine import (
  ggplot, aes, labs,
  geom_histogram, geom_density, geom_boxplot
) 
import plotnine as p9
import pandas as pd

universities = pd.read_csv("data/universities.csv")
higher_ed = pd.read_csv("data/higher_ed.csv")
higher_ed_18 = higher_ed[higher_ed["academic_year"] == "2018-19"]

Histograms

A simple histogram made with geom_histogram() shows the most common price points students face in a particular academic year. By default, {ggplot2} will show data with 30 bins; for this data {plotnine} shows 91 bins:

R
Python

cost_hist <- 
  higher_ed_18 |> 
  ggplot(
    mapping = aes(
      x = cost_avg)) +
  geom_histogram()

cost_hist

cost_hist = (
  higher_ed_18 >>
  ggplot(
    mapping = aes(
      x = "cost_avg"))
  + geom_histogram()
)

cost_hist.show()

This histogram shows a blocky curve rising sharply after zero on the X-axis and falling sharply a little before 50,000. The X-axis shows the average cost endured by a university student, and the Y-axis shows the number of schools in the data set that hit that average cost. Each bar shows “binned” data, which means it pools together similar numbers. For example, if values ranged from zero to a hundred, a histogram with 20 bars would group together values 0 to 4, 5 to 9, and so on.
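To make binning concrete, here is a minimal sketch in plain Python of the zero-to-a-hundred example. The values are hypothetical, and the loop only approximates what a histogram geom does internally; it is not the code {ggplot2} or {plotnine} actually runs.

```python
# Hypothetical values from 0 to 99, standing in for any numeric column.
values = list(range(100))

n_bins = 20
low, high = 0, 100
width = (high - low) / n_bins  # each bin spans 5 units

# Count how many values fall in each bin.
counts = [0] * n_bins
for v in values:
    # A value sitting exactly on the top edge would join the last bin.
    index = min(int((v - low) // width), n_bins - 1)
    counts[index] += 1

# Show the first few bins and their counts.
for i in range(3):
    print(f"{low + i * width:.0f} up to {low + (i + 1) * width:.0f}: {counts[i]} values")
```

Each bar's height in a histogram is one of these counts.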

A histogram’s highest point lets us see the most common bin of values—in this case, the most common cost_avg looks to be around $15,000—and the width of the curve shows how likely numbers will cluster around a common value. Sometimes a histogram has a really wide curve, suggesting a lot of variance in the data; in this case, our curve looks tall and skinny, which tells us that universities tend to cluster around that common average cost.
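Reading the tallest bar off a chart can also be done in code. The sketch below uses invented, randomly generated costs (not the scorecard data) and an assumed bin width of 5,000 to find the most common bin and the share of values that land in it.

```python
import random
from collections import Counter

random.seed(0)
# Hypothetical costs, invented for illustration (not the scorecard data).
costs = [random.gauss(17_500, 4_000) for _ in range(2_000)]

binwidth = 5_000
# Label each value by the lower edge of its bin, then tally the labels.
tally = Counter(int(c // binwidth) * binwidth for c in costs)

modal_start, modal_count = tally.most_common(1)[0]
share = modal_count / len(costs)
print(f"Most common bin: {modal_start} to {modal_start + binwidth}")
print(f"Share of values in that bin: {share:.0%}")
```

A large share in one or two bins corresponds to a tall, skinny curve; a small share spread across many bins corresponds to a wide one.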

Note

Barely perceptible in this histogram—but shown by the X-axis—is the long tail to the right. Beyond the $60,000 mark, we get just a few tiny blips above $105,000. Between those blips and the larger curve are empty spaces. Within that empty space, the number of schools in each bin is too small to make a mark on the vertical scale. It’s probably zero, but we’d be wise to check the data to be sure.
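Checking a stretch that looks empty takes only a couple of lines. This sketch uses a short list of invented values in place of the real cost_avg column, and the cutoffs of 60,000 and 105,000 are simply read off the description above.

```python
# Hypothetical cost values, invented to mimic a long right tail.
cost_avg = [9_500, 14_200, 15_800, 31_000, 58_000, 106_500, 111_000]

# Count values in the stretch that looks empty on the chart.
in_gap = [c for c in cost_avg if 60_000 <= c <= 105_000]
# Count values in the far tail.
beyond = [c for c in cost_avg if c > 105_000]

print(len(in_gap), "values in the gap;", len(beyond), "values beyond it")
```

With the real data, the same check could be run against the cost_avg column of higher_ed_18, for example with the between() method that {pandas} provides for a column.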

Setting bin number or width

Defaults can be changed by manually setting the number of bins or their width. Choosing something suited to the data guarantees a particular output and helps us better understand what it might show.

The bins argument sets the number of bars used to divide up the data. Because every bin is the same width and together they span the full range of the data, some bins might have zero schools in them.

R
Python

cost_hist <- 
  higher_ed_18 |> 
  ggplot(
    mapping = aes(
      x = cost_avg)) +
  geom_histogram(
    bins = 50
  )

cost_hist

cost_hist = (
  higher_ed_18 >>
  ggplot(
    mapping = aes(
      x = "cost_avg"))
  + geom_histogram(
    bins = 50
  )
)

cost_hist.show()

Instead of setting the number of bins, we can set the range of each with binwidth. This is helpful for breaking down numbers into human-friendly ranges. For instance, college costs might be easiest to understand in bins of $5,000:

R
Python

cost_hist <- 
  higher_ed_18 |> 
  ggplot(
    mapping = aes(
      x = cost_avg)) +
  geom_histogram(
    binwidth = 5000
  )

cost_hist

cost_hist = (
  higher_ed_18 >>
  ggplot(
    mapping = aes(
      x = "cost_avg"))
  + geom_histogram(
    binwidth = 5000
  )
)

cost_hist.show()
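The two arguments are two views of the same arithmetic. As a rough sketch, assuming a hypothetical range of 0 to 110,000, the helper functions below (names invented for illustration) convert between a bin count and a bin width. The plotting libraries may pad or align bins differently, so their exact counts can differ slightly.

```python
import math

def bins_needed(low, high, binwidth):
    """Number of equal-width bins needed to cover the span from low to high."""
    return math.ceil((high - low) / binwidth)

def width_for(low, high, bins):
    """Width of each bin when the span is split into `bins` equal pieces."""
    return (high - low) / bins

# Hypothetical range of costs, chosen only for illustration.
low, high = 0, 110_000
print(bins_needed(low, high, 5_000))  # bins implied by binwidth = 5000
print(width_for(low, high, 50))       # width implied by bins = 50
```

Setting bins fixes the count and lets the width fall where it may; setting binwidth fixes the width and lets the count follow from the range.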

Finishing with polish

As always, we keep three points in mind when preparing visualizations:

Accuracy
Message
Beauty

Our visualization and data don’t get into any funny business, so we’ve probably already hit the mark for accuracy. (But see the note below!) We could nevertheless do more to clarify a message, and we haven’t yet made choices to enhance its beauty.

Adding a title and tweaking labels will improve the message, but that’s not enough to fix problems with values on the X-axis. With a binwidth = 5000, the first bin range should cover 0 to 4999, but it’s shown as centered on 0. We can fix this by changing the center of bins to a value that’s half of our binwidth.
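To see why the center matters, this sketch works out which bin a value falls into for a given binwidth and center. The function name is made up for illustration, and the logic is a simplified reading of how the center argument positions bins, not the library's own code.

```python
import math

def bin_range(value, binwidth, center):
    """Return the (lower, upper) edges of the bin holding `value`,
    assuming one of the bins is centered on `center`."""
    boundary = center - binwidth / 2  # an edge that every bin is aligned to
    k = math.floor((value - boundary) / binwidth)
    lower = boundary + k * binwidth
    return lower, lower + binwidth

# A hypothetical school with an average cost of $1,000:
print(bin_range(1_000, 5_000, center=0))      # this bin straddles zero
print(bin_range(1_000, 5_000, center=2_500))  # this bin starts at zero
```

With the center at half the binwidth, bin edges fall on round multiples of 5,000, so values below zero get a bin of their own instead of sharing one with small positive values.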

Additionally, while each bar shows a nice, understandable $5,000 chunk of cost, it’s indistinguishable from its neighbors. Setting a color will add borders making each bar’s offset countable from the labels on the X-axis; at the same time, we’ll set fill to a money-themed green, with transparency allowing horizontal grid lines to show through for context.

R
Python

Code defining abbrev_dollar()

abbrev_dollar <- function(n) {
  case_when(
    n >= 10^6 ~ round(n/(10^6), 1) |> paste0("M"),
    n >= 10^3 ~ round(n/(10^3), 1) |> paste0("K"),
    .default = as.character(n)
  ) |> 
    {\(x) paste0("$", x)}()
}

library(scales)

cost_hist <- 
  higher_ed_18 |> 
  ggplot(
    mapping = aes(
      x = cost_avg)) +
  geom_histogram(
    center = 2500,
    binwidth = 5000,
    color = "black",
    fill = "forestgreen", 
    alpha = 0.4) +
  labs(
    title = "Average cost of U.S. higher education, AY 2018-2019",
    subtitle = "The most common price point was in the range of $15,000 to 20,000.",
    x = "average cost",
    y = "schools") +
  theme_linedraw() +
  scale_y_continuous(
    expand = expansion(mult = c(0,0.01)),
    labels = label_comma()) +
  scale_x_continuous(labels = abbrev_dollar) #+
  # theme(plot.margin = unit(c(0.17, 0.37, 0.17, 0.17), "cm"))

cost_hist

Code defining abbrev_dollar()

def abbrev_dollar(number):
  result = []
  for n in number:
    if n >= 10**6:
      result.append(f"${n / (10**6):.0f}M")
    elif n >= 10**3:
      result.append(f"${n / (10**3):.0f}K")
    else:
      result.append(f"${n:.0f}")
  return result

import mizani.labels as ml

cost_hist = (
  higher_ed_18 >>
  ggplot(
    mapping = aes(
      x = "cost_avg"))
  + geom_histogram(
    center = 2500,
    binwidth = 5000,
    color = "black",
    fill = "forestgreen", 
    alpha = 0.4) 
  + labs(
    title = "Average cost of U.S. higher education, AY 2018-2019",
    subtitle = "The most common price point was in the range of $15,000 to 20,000.",
    x = "average cost",
    y = "schools")
  + p9.theme_linedraw()
  + p9.scale_y_continuous(
    labels=ml.comma_format(),
    expand = [0,0,0.05,0]
  ) 
  + p9.scale_x_continuous(labels = abbrev_dollar)
)

cost_hist.show()

By explicitly defining the center of each bin, we can now see more clearly that some schools have an average cost below $0. These negative values are probably unexpected, but they’re understandable considering the financial assistance that is sometimes available for students. Most importantly, the values are in the data, so the chart is more accurate when they’re clearly shown.

Strip plots

If we want something more than a histogram, we might first consider direct representation of data with points using geom_point(). This instinct is good, but it has limitations:

R
Python

cost_point <- 
  higher_ed_18 |> 
  ggplot(
    mapping = aes(
      x = reorder(control, cost_avg, na.rm = TRUE),
      y = cost_avg)) +
  geom_point(
    mapping = aes(color = control),
    show.legend = FALSE)

cost_point

import numpy as np

cost_points = (
  higher_ed_18 >>
  ggplot(
    mapping = aes(
      x = "reorder(control, cost_avg, np.mean)",
      y = "cost_avg"))
  + p9.geom_point(
    mapping = aes(color = "control"),
    show_legend = False)
)

cost_points.show()

The overlap of points makes it difficult to see how many there are. We can get a better representation of the distribution by adding jitter with geom_jitter() and transparency with an alpha value.

R
Python

cost_point <- 
  higher_ed_18 |> 
  ggplot(
    mapping = aes(
      x = reorder(control, cost_avg, na.rm = TRUE),
      y = cost_avg)) +
  geom_jitter(
    mapping = aes(color = control),
    alpha = 0.3,
    show.legend = FALSE)

cost_point

cost_points = (
  higher_ed_18 >>
  ggplot(
    mapping = aes(
      x = "reorder(control, cost_avg, np.mean)",
      y = "cost_avg"))
  + p9.geom_jitter(
    mapping = aes(color = "control"),
    alpha = 0.3,
    show_legend = False)
)

cost_points.show()

With a strip plot that uses jitter, the distribution of values is conveyed in the density of points.
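Jitter itself is a simple idea: each point is nudged by a small random amount so that overlapping points separate. The sketch below shows horizontal jitter only, with invented costs and an assumed maximum shift of 0.4; the geoms choose their own defaults and can jitter vertically as well.

```python
import random

random.seed(1)

def jitter(position, width=0.4):
    """Shift a position by a random amount within +/- width (assumed value)."""
    return position + random.uniform(-width, width)

# Three hypothetical schools in the same category (slot 1 on the X-axis)
# with nearly identical costs would overlap without jitter.
costs = [15_000, 15_050, 15_100]
points = [(jitter(1), cost) for cost in costs]

for x, y in points:
    print(f"x = {x:.2f}, y = {y}")
```

Because the shift is random, the plot changes slightly on each run unless a seed is set.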

Sina plots

Points represented with geom_jitter() will fall randomly, which makes things look a little inconsistent. If we instead stack the points up from the center line and outward, we get a cleaner shape that shows the distribution of values in the width of the curves. This kind of plot is called a sina plot.

R
Python

In R, we use geom_sina() from {ggforce} to make simple sina plots.

library(ggforce)
cost_sina <- 
  higher_ed_18 |> 
  ggplot(
    mapping = aes(
      x = reorder(control, cost_avg, na.rm = TRUE),
      y = cost_avg)) +
  geom_sina(
    mapping = aes(fill = control),
    shape = 21,
    show.legend = FALSE)

cost_sina

1: By defining the shape as something between 21 and 25, we’re choosing points that take a color for the border and fill for the inside. If we instead use color without defining a shape, we’ll struggle to see individual points.

In Python, {plotnine} includes a native geom_sina() for sina plots.

cost_sina = (
  higher_ed_18 >>
  ggplot(
    mapping = aes(
      x = "reorder(control, cost_avg, np.mean)",
      y = "cost_avg"))
  + p9.geom_sina(
    mapping = aes(fill = "control"),
    show_legend = False)
)

cost_sina.show()

Violin plots

If the points themselves aren’t as important as their general shape, a violin plot made with geom_violin() conveys the same information.

R
Python

cost_violin <- 
  higher_ed_18 |> 
  ggplot(
    mapping = aes(
      x = reorder(control, cost_avg, na.rm = TRUE),
      y = cost_avg)) +
  geom_violin(
    mapping = aes(fill = control),
    show.legend = FALSE) +
  labs(
    title = "Average cost of U.S. higher education, AY 2018-2019",
    subtitle = "More than three quarters of public schools had costs below $15,000.",
    x = NULL,
    y = "average cost") +
  scale_y_continuous(labels = abbrev_dollar) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank())

cost_violin

cost_violin = (
  higher_ed_18 >>
  ggplot(
    mapping = aes(
      x = "reorder(control, cost_avg, np.mean)",
      y = "cost_avg"))
  + p9.geom_violin(
    mapping = aes(fill = "control"),
    show_legend = False)
)

cost_violin.show()

Boxplots

Another alternative is a boxplot, which conveys many statistics in a visual form. All of these statistics are calculated automatically with geom_boxplot().

R
Python

cost_box <- 
  higher_ed_18 |> 
  ggplot(
    mapping = aes(
      x = reorder(control, cost_avg, na.rm = TRUE),
      y = cost_avg)) +
  geom_boxplot(
    mapping = aes(fill = control),
    show.legend = FALSE) +
  labs(
    title = "Average cost of U.S. higher education, AY 2018-2019",
    subtitle = "More than three quarters of public schools had costs below $15,000.",
    x = NULL,
    y = "average cost") +
  scale_y_continuous(labels = abbrev_dollar) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank())

cost_box

cost_box = (
  higher_ed_18 >>
  ggplot(
    mapping = aes(
      x = "reorder(control, cost_avg, np.mean)",
      y = "cost_avg"))
  + geom_boxplot(
    mapping = aes(
      fill = "control"),
    show_legend = False
    ) 
  + labs(
    title = "Average cost of U.S. higher education, AY 2018-2019",
    subtitle = "More than three quarters of public schools had costs below $15,000.",
    x = "",
    y = "average cost")
  + p9.scale_y_continuous(labels = abbrev_dollar)
  + p9.theme_minimal()
  + p9.theme(panel_grid_major_x = p9.element_blank())
)

cost_box.show()

Reading a boxplot requires a little background knowledge. While boxplots satisfy many of the same objectives as density curves, they more clearly indicate certain statistics like median, range, interquartile range, and outliers.

These statistics allow us to better understand the boxplots in the figure above, which shows the stark difference in costs between public and private colleges and universities. In each boxplot, the central box indicates the middle 50% of values. For public schools, the box is wholly below $15,000, so more than 75% of those schools cost less than that amount. Meanwhile, for private not-for-profit colleges and universities, the central box is wholly above $15,000, so more than 75% of those schools exceed that threshold.
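The statistics behind a boxplot can be computed by hand. This sketch uses invented costs and the common 1.5 × IQR rule for flagging outliers. The quartile method here is an assumption and may differ slightly from the one geom_boxplot() uses.

```python
import statistics

# Hypothetical costs for one group of schools, invented for illustration.
costs = [4_000, 7_500, 8_200, 9_000, 9_800, 10_500, 11_200, 12_000, 13_500, 29_000]

# Quartiles: the box runs from q1 to q3, with a line at the median.
q1, median, q3 = statistics.quantiles(costs, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range, the height of the box

# The common 1.5 * IQR rule: values beyond these fences count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [c for c in costs if c < lower_fence or c > upper_fence]

print(f"median = {median}, box = {q1} to {q3}, outliers = {outliers}")
```

Because three quarters of the values sit at or below q3, a box that ends below $15,000 means at least 75% of that group costs less than that amount.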

Density curves

Boxplots and histograms are helpful for showing counts and other summary statistics, but they may not be the best way to answer our original question—“how do SWAC schools compare to other schools for cost?” Density curves serve the moment when the general shape of data is more important than specific numbers. They’re visualized with geom_density().

R
Python

cost_dens <- 
  higher_ed_18 |> 
  ggplot(mapping = aes(x = cost_avg)) +
  geom_density()

cost_dens

cost_dens = (
  higher_ed_18 >>
  ggplot(mapping = aes(x = "cost_avg"))
  + geom_density() 
)

cost_dens.show()

In its default form, a density plot is a rather unsatisfying squiggle. Values on the Y-axis show a number for “density”—a unit which changes with every chart shape to make the area under the curve equivalent to 1—but even these unintuitive values could at least be shown in decimals. Moreover, when following the principle of proportional ink, the area under the curve really should be filled in. Other decisions we made for the histogram work well for a density curve, too.
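The density values make more sense once we see where they come from. As a simplified sketch, the code below builds a Gaussian kernel density estimate by hand from invented costs and a hand-picked bandwidth of 4,000 (the geoms choose their own bandwidth), then confirms that the area under the curve is about 1.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates density as an average of Gaussian bumps."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm

    return density

# Hypothetical costs and bandwidth, invented for illustration.
costs = [9_000, 12_000, 15_000, 16_000, 18_000, 24_000, 40_000]
density = gaussian_kde(costs, bandwidth=4_000)

# Approximate the area under the curve with the trapezoid rule.
step = 100
xs = range(-20_000, 70_001, step)
ys = [density(x) for x in xs]
area = sum((a + b) / 2 * step for a, b in zip(ys, ys[1:]))

print(f"peak density = {max(ys):.6f}, area = {area:.4f}")
```

Because the X-axis spans tens of thousands of dollars, the curve must stay very low for its area to equal 1, which is why the Y-axis values are so small.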

Tip

To get something more meaningful on the Y-axis, showing numbers of schools as with the histogram above, try adding stat = "bin" inside geom_density().

R
Python

cost_dens <- 
  higher_ed_18 |> 
  ggplot(
    mapping = aes(
      x = cost_avg)) +
  geom_density(
    color = "forestgreen",
    fill = "forestgreen", 
    alpha = 0.4) +
  labs(
    title = "Average cost of U.S. higher education, AY 2018-2019",
    subtitle = "The most common price point was in the range of $15,000 to 20,000.",
    x = "average cost") +
  theme_minimal() +
  scale_y_continuous(labels = label_number()) +
  scale_x_continuous(labels = abbrev_dollar)

cost_dens

1: From {scales}, label_number() converts values from scientific notation to decimal notation, which might be more user friendly.

cost_dens = (
  higher_ed_18 >>
  ggplot(
    mapping = aes(
      x = "cost_avg"))
  + geom_density(
    color = "forestgreen",
    fill = "forestgreen", 
    alpha = 0.4) 
  + labs(
    title = "Average cost of U.S. higher education, AY 2018-2019",
    subtitle = "The most common price point was in the range of $15,000 to 20,000.",
    x = "average cost")
  + p9.theme_minimal()
  + p9.scale_x_continuous(labels = abbrev_dollar)
)

cost_dens.show()

Mapping subgroups

Mapping a column to fill (instead of defining a color like “forestgreen”) shows a breakdown of distributions for subgroups of data. Here, for instance, we might overlay the distribution of costs for HBCUs against other schools in the data. When overlaying density curves like this, it’s helpful to add transparency to the curves with alpha.

R
Python

cost_dens_sub <- 
  higher_ed_18 |> 
  ggplot() +
  geom_density(
    mapping = aes(
      x = cost_avg,
      fill = is_hbcu,
      color = is_hbcu),
    alpha = 0.3)

cost_dens_sub

cost_dens_sub = (
  higher_ed_18 >>
  ggplot()
  + geom_density(
    mapping = aes(
      x = "cost_avg",
      fill = "is_hbcu",
      color = "is_hbcu"),
    alpha = 0.3) 
)

cost_dens_sub.show()

We know from the data that there are fewer HBCUs than there are other schools, but their curve is far taller than the others. This discrepancy makes sense when we consider what the “density” value means, but it can still lead to confusion when multiple curves are overlaid on each other. To simplify things, we might instead go for a scaled value, where each curve is stretched to a common scale of 0 to 1. The easiest way to do this is to map y to ..scaled.. (newer releases of {ggplot2} spell this after_stat(scaled)).
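Scaling just divides each curve by its own peak. The sketch below illustrates this with two invented groups and a shared bandwidth of 2,000, which is an assumption; the geoms pick a bandwidth for each group. The small, tightly clustered group has the taller raw curve, but after scaling both curves peak at 1.

```python
import math

def density_curve(data, bandwidth, xs):
    """Gaussian kernel density estimate evaluated at each x in xs."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm
        for x in xs
    ]

def scale_to_peak(ys):
    """Divide a curve by its maximum so that it peaks at 1."""
    top = max(ys)
    return [y / top for y in ys]

# Hypothetical groups: a small, tightly clustered one and a larger, spread-out one.
small_group = [14_000, 15_000, 16_000]
large_group = [5_000, 10_000, 15_000, 20_000, 25_000, 30_000, 35_000, 40_000]

xs = range(0, 50_001, 500)
small = density_curve(small_group, 2_000, xs)
large = density_curve(large_group, 2_000, xs)

print(max(small) > max(large))                               # raw peaks differ
print(max(scale_to_peak(small)), max(scale_to_peak(large)))  # scaled peaks match
```

Scaled curves are easier to compare by shape, but they no longer say anything about how many schools are in each group.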

R
Python

cost_dens_sub <- 
  higher_ed_18 |> 
  ggplot() +
  geom_density(
    mapping = aes(
      x = cost_avg,
      y = ..scaled..,
      fill = is_hbcu,
      color = is_hbcu),
    alpha = 0.3)

cost_dens_sub

cost_dens_sub = (
  higher_ed_18 >>
  ggplot()
  + geom_density(
    mapping = aes(
      x = "cost_avg",
      y = "..scaled..",
      fill = "is_hbcu",
      color = "is_hbcu"),
    alpha = 0.3) 
)

cost_dens_sub.show()

This figure can be polished by changing defaults:

R
Python

cost_dens_sub <- 
  cost_dens_sub + 
  labs(
    title = "Average cost of U.S. higher education, AY 2018-2019",
    subtitle = "HBCU costs fall within a narrower band, but their range is typical of all schools.",
    fill = "designation", color = "designation",
    x = "average cost") +
  theme_minimal() +
  scale_y_continuous(labels = label_number()) +
  scale_x_continuous(labels = abbrev_dollar) +
  scale_fill_manual(
    values = c(
      "TRUE" = "purple",
      "FALSE" = "skyblue"),
    labels = c(
      "TRUE" = "HBCU",
      "FALSE" = "Other")) +
  scale_color_manual(
    values = c(
      "TRUE" = "purple",
      "FALSE" = "skyblue"),
    labels = c(
      "TRUE" = "HBCU",
      "FALSE" = "Other"))

cost_dens_sub

1: Using a named vector inside the labels argument of a color scale or fill scale is the easiest way to change what’s displayed in the legend. Unfortunately, named vectors won’t target NA values, but it’s also possible to use an unnamed vector or a function like this: labels = \(x) if_else(is.na(x), "unknown", x).
2: Make sure to use the same labels for both color and fill, or you’ll end up with two legends.

cost_dens_sub = (
  cost_dens_sub
  + labs(
    title = "Average cost of U.S. higher education, AY 2018-2019",
    subtitle = "HBCU costs fall within a narrower band, but their range is typical of all schools.",
    fill = "designation", color = "designation",
    x = "average cost")
  + p9.theme_minimal()
  + p9.scale_x_continuous(labels = abbrev_dollar)
  + p9.scale_fill_manual(
    values = {
      True: "purple",
      False: "skyblue"},
    labels = {
      True: "HBCU",
      False: "Other",
      np.nan: "unknown"})
  + p9.scale_color_manual(
    values = {
      True: "purple",
      False: "skyblue"},
    labels = {
      True: "HBCU",
      False: "Other",
      np.nan: "unknown"})
)

cost_dens_sub.show()

1: Using a dictionary inside the labels argument of a color scale or fill scale is the easiest way to change what’s displayed in the legend.
2: Make sure to use the same labels for both color and fill, or you’ll end up with two legends.

Combining multiple plots

The best way to answer our original question might be to produce multiple plots. Three plots can show costs of SWAC schools set beside other HBCUs, juxtaposed with public and private schools, and compared to all data nationally.

We can combine multiple subplots using an intuitive interface common to both R (using {ggplot2} with {patchwork}) and Python (using {plotnine}). Here, a slash / stacks plots vertically,¹ and & is used in place of + to apply changes universally to every subplot.

R
Python

Start by preparing each figure on its own.

For all three charts, we’ll want a cleaner data set from the start. This will make it so that we don’t have to think too hard about redefining values in the legends.

combo_ready <- higher_ed_18 |> 
  mutate(
    designation = if_else(is_hbcu, "HBCU", "not HBCU"),
    conference = if_else(id %in% universities$id, "SWAC", "other"),
    funding = control |> 
      tolower() |> 
      str_remove_all(" [(]nonprofit[)]") |> 
      factor(levels = c("public", "private", "private (for profit)"))
  )

We’ll define one universal title on the first figure and individual subtitles for each subfigure.

cost_dens_all <- 
  combo_ready |> 
  filter(!is.na(is_hbcu)) |> 
  ggplot(mapping = aes(
    x = cost_avg,
    y = ..scaled..)) +
  geom_density(
    mapping = aes(
      fill = designation,
      color = designation),
    alpha = 0.5) +
  geom_density(
    mapping = aes(linetype = "all schools"),
    color = "red",
    key_glyph = "path",
    show.legend = c(
      linetype = TRUE,
      color = FALSE,
      fill = FALSE)) +
  labs(
    title = "Average cost of US higher education, AY 2018-2019",
    subtitle = "HBCU costs fall within a narrower band, but the range is typical of all schools.") +
  scale_fill_manual(values = c(
    "HBCU" = "purple",
    "not HBCU" = "skyblue")) +
  scale_color_manual(values = c(
    "HBCU" = "purple",
    "not HBCU" = "skyblue"))

cost_dens_all

1: Here we’re adding a second density curve, showing the overall shape of all of the data. This makes it easier to see how HBCU designation changes shape from the overall dataset. (In this case, the overall data is very close to the data for schools that are not known to be HBCUs.) By not setting a fill aesthetic, we’re avoiding the difficulty that comes from interpreting more than two overlaid colors.
2: Although usually we map an aesthetic to a particular column, it’s also possible to set a value here manually, thereby forcing something to appear in the legend.
3: Setting key_glyph here changes the shape used in the legend.
4: We’ll typically set show.legend to TRUE or FALSE, but it can accept a named vector for each of the aesthetics shown in the legend.

cost_dens_hbcu <- 
  combo_ready |> 
  filter(is_hbcu) |> 
  ggplot(
    mapping = aes(
      x = cost_avg,
      y = ..scaled..)) +
  geom_density(
    mapping = aes(
      fill = funding,
      color = funding),
    alpha = 0.4) +
  geom_density(
    mapping = aes(linetype = "all HBCUs"),
    color = "purple",
    key_glyph = "path",
    show.legend = c(linetype = TRUE, color = FALSE, fill = FALSE)) +
  labs(
    subtitle = "Private HBCUs cost noticeably more than public HBCUs.") +
  scale_fill_manual(values = c(
    "public" = "deeppink",
    "private" = "darkorange")) +
  scale_color_manual(values = c(
    "public" = "deeppink",
    "private" = "darkorange"))

cost_dens_hbcu

cost_dens_swac <- 
  combo_ready |> 
  filter(is_hbcu) |> 
  ggplot(
    mapping = aes(
      x = cost_avg,
      y = ..scaled..)) +
  geom_density(
    mapping = aes(
      color = conference,
      fill = conference),
    alpha = 0.4) +
  geom_density(
    mapping = aes(linetype = "all HBCUs"),
    color = "purple",
    key_glyph = "path",
    show.legend = c(linetype = TRUE, color = FALSE, fill = FALSE)) +
  labs(
    subtitle = "Costs at SWAC schools are typical for all HBCUs.") +
  scale_fill_manual(values = c(
    "SWAC" = "blue",
    "other" = "limegreen")) +
  scale_color_manual(values = c(
    "SWAC" = "blue",
    "other" = "limegreen"))

cost_dens_swac

library(patchwork)

cost_dens_all / cost_dens_hbcu / cost_dens_swac +
  plot_layout(axes = "collect") &
  labs(
    linetype = NULL,
    y = NULL,
    x = "average cost") &
  scale_linetype_manual(
    values = "dashed",
    guide  = guide_legend(order = 1)) &
  scale_x_continuous(
    labels = abbrev_dollar,
    limits = range(combo_ready$cost_avg, na.rm = TRUE)) &
  theme_minimal() &
  theme(
    legend.justification = "left",
    axis.text.y = element_blank())

1: plot_layout() is used to change specific behavior of {patchwork}. Here, axes = "collect" will simplify the display of labels to avoid repetition.
2: Setting common limits will make sure the X-axis for each chart lines up.

Start by preparing each figure on its own.

For all three charts, we’ll want a cleaner data set from the start. This will make it so that we don’t have to think too hard about redefining values in the legends.

combo_ready = higher_ed_18.copy()

combo_ready["designation"] = combo_ready["is_hbcu"].apply(
    lambda x: "HBCU" if x is True else ("not HBCU" if x is False else np.nan)
)

combo_ready["conference"] = np.where(
  combo_ready["id"].isin(universities["id"]),
  "SWAC",
  "other"
)

combo_ready["funding"] = pd.Categorical(
  combo_ready["control"].replace({
    "Public": "public",
    "Private (nonprofit)": "private",
    "Private (for profit)": "private"
  }),
  categories=["public", "private"],
  ordered=True
)

We’ll define one universal title on the first figure and individual subtitles for each subfigure.

# first
cost_dens_all = (
  combo_ready[combo_ready["designation"].notna()] >>
  ggplot(mapping = aes(
    x = "cost_avg",
    y = "..scaled.."))
  + geom_density(
    mapping = aes(
      fill = "designation",
      color = "designation"),
    alpha = 0.5)
  + geom_density(
    mapping = aes(linetype = ["all schools"]),
    color = "red",
    # key_glyph = "path",
    show_legend = {
      "linetype": True,
      "color": False,
      "fill": False})
  + labs(
    title = "Average cost of US higher education, AY 2018-2019",
    subtitle = "HBCU costs fall within a narrower band, but the range is typical of all schools.",
    x = "") 
  + p9.scale_fill_manual(values = {
    "HBCU": "purple",
    "not HBCU": "skyblue"})
  + p9.scale_color_manual(values = {
    "HBCU": "purple",
    "not HBCU": "skyblue"})
  + p9.guides(linetype = p9.guide_legend(
    override_aes = {"shape": "o"}
  ))
)

cost_dens_all.show()

1: Here we’re adding a second density curve, showing the overall shape of all of the data. This makes it easier to see how HBCU designation changes shape from the overall dataset. (In this case, the overall data is very close to the data for schools that are not known to be HBCUs.) By not setting a fill aesthetic, we’re avoiding the difficulty that comes from interpreting more than two overlaid colors.
2: Although usually we map an aesthetic to a particular column, it’s also possible to set a value here manually, thereby forcing something to appear in the legend.
3: Unfortunately, {plotnine} doesn’t seem to allow the key_glyph to be set easily to change the shape in the legend.
4: We’ll typically set show_legend to True or False, but it can accept a dictionary defining each aesthetic in the legend.

# second
cost_dens_hbcu = ( 
  combo_ready[combo_ready["designation"] == "HBCU"] >>
  ggplot(
    mapping = aes(
      x = "cost_avg",
      y = "..scaled.."))
  + geom_density(
    mapping = aes(
      fill = "funding",
      color = "funding"),
    alpha = 0.4)
  + geom_density(
    mapping = aes(linetype = ["all HBCUs"]),
    color = "purple",
    # key_glyph = "path",
    show_legend = {
      "linetype": True, 
      "color": False, 
      "fill": False})
  + labs(
    subtitle = "Private HBCUs cost noticeably more than public HBCUs.",
    x = "")
  + p9.scale_fill_manual(values = {
    "public": "deeppink",
    "private": "darkorange"})
  + p9.scale_color_manual(values = {
    "public": "deeppink",
    "private": "darkorange"})
)

cost_dens_hbcu.show()

# third
cost_dens_swac = ( 
  combo_ready[combo_ready["designation"] == "HBCU"] >>
  ggplot(
    mapping = aes(
      x = "cost_avg",
      y = "..scaled.."))
  + geom_density(
    mapping = aes(
      color = "conference",
      fill = "conference"),
    alpha = 0.4)
  + geom_density(
    mapping = aes(linetype = ["all HBCUs"]),
    color = "purple",
    # key_glyph = "path",
    show_legend = {
      "linetype": True, 
      "color": False, 
      "fill": False})
  + labs(
    subtitle = "Costs at SWAC schools are typical for all HBCUs.",
    x = "average cost")
  + p9.scale_fill_manual(values = {
    "SWAC": "blue",
    "other": "limegreen"})
  + p9.scale_color_manual(values = {
    "SWAC": "blue",
    "other": "limegreen"})
)

cost_dens_swac.show()

full_combo = (
  cost_dens_all / cost_dens_hbcu / cost_dens_swac
  # + p9.plot_layout(axes = "collect")
  & labs(
    linetype = "",
    y = "")
  & p9.scale_linetype_manual(values = "dashed")
  & p9.scale_x_continuous(
    labels = abbrev_dollar,
    limits = [np.nanmin(combo_ready["cost_avg"]), np.nanmax(combo_ready["cost_avg"])]
    ) 
  & p9.guides(
    linetype = p9.guide_legend(order = 1), 
    color = p9.guide_legend(order = 2),
    fill = p9.guide_legend(order = 2))
  & p9.theme_minimal()
  & p9.theme(
    legend_justification = "left",
    axis_text_y = p9.element_blank(),
    figure_size = [6.5, 7.5])
)

full_combo.save("full_combo-p9.png", dpi=300)

1: {plotnine} doesn’t yet support collecting axis titles with plot_layout(), so we’ll manually set them in each subfigure.
2: Setting common limits will make sure the X-axis for each chart lines up.
3: Manually setting the order of legends lets our linetype legend clarify which subgrouping is shown in the subfigure. The first subfigure, for instance, shows all schools by designation, and the second and third show all HBCUs by funding or by conference.
4: Composite {plotnine} figures don’t yet have a .show() method for use in Quarto, but we can manually save the output to a file and display it in Markdown with ![](full_combo-p9.png)

Footnotes

As / is used to stack plots vertically, | can be used to set them side by side. Parentheses are used to group plots for compositing and placement.↩︎