Maps

with {sf} and {ggplot2}

Note

Coming soon: This page is currently being updated with standardized datasets and parallel Python examples.

Where does the SWAC story unfold? Chapter 15 of Wilke’s Fundamentals of Data Visualization lays out the essentials of map-making; here we put them into practice with {sf} and {ggplot2}. We’ll begin by shading each state by residents per institution with the SWAC dataset. Then we’ll add enrollment figures from the Department of Education and population counts from the U.S. Census to show how geography and demographics intertwine.

{sf} and shape files

The letters of {sf} stand for “simple features,” but we’ll need another kind of “sf” for everything to work: shape files. These describe the geophysical location and shape of things, from natural places like rivers, lakes, and mountains to artificial constraints like countries, states, and cities. Many different things can be mapped, so the process will vary a bit depending on the shape files available.

The useful tigris package provides easy shape files for many locations in the United States. We’ll start any U.S. mapping by loading {sf} and {tigris}. Since we’ll also be loading data with {readr}, adjusting data with {dplyr}, modifying strings with {stringr}, and displaying maps with {ggplot2}, we’ll use {tidyverse} to load this collection of packages at once.

Once {tigris} is loaded, we can use it to download shape files for American states. Taking a glimpse() of the result lets us see the columns and types:

library(sf)
library(tigris)
library(tidyverse)

us_shapes <- states(progress_bar = FALSE)

glimpse(us_shapes)

Rows: 56
Columns: 16
$ REGION   <chr> "3", "3", "2", "2", "3", "1", "4", "1", "3", "1", "1", "3", "…
$ DIVISION <chr> "5", "5", "3", "4", "5", "1", "8", "1", "5", "1", "1", "5", "…
$ STATEFP  <chr> "54", "12", "17", "27", "24", "44", "16", "33", "37", "50", "…
$ STATENS  <chr> "01779805", "00294478", "01779784", "00662849", "01714934", "…
$ GEOID    <chr> "54", "12", "17", "27", "24", "44", "16", "33", "37", "50", "…
$ GEOIDFQ  <chr> "0400000US54", "0400000US12", "0400000US17", "0400000US27", "…
$ STUSPS   <chr> "WV", "FL", "IL", "MN", "MD", "RI", "ID", "NH", "NC", "VT", "…
$ NAME     <chr> "West Virginia", "Florida", "Illinois", "Minnesota", "Marylan…
$ LSAD     <chr> "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "…
$ MTFCC    <chr> "G4000", "G4000", "G4000", "G4000", "G4000", "G4000", "G4000"…
$ FUNCSTAT <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "…
$ ALAND    <dbl> 62266513826, 138965379385, 143778206717, 206244791203, 251512…
$ AWATER   <dbl> 488918898, 45968913048, 6216848695, 18937236061, 6979843236, …
$ INTPTLAT <chr> "+38.6472854", "+28.3989775", "+40.1028754", "+46.3159573", "…
$ INTPTLON <chr> "-080.6183274", "-082.5143005", "-089.1526108", "-094.1996043…
$ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-77.75438 3..., MULTIPOLYGON (((…

The resulting data frame has 56 rows for the 50 states and some additional regions like the District of Columbia and American territories. The final column is called “geometry”, providing an ominous-looking MULTIPOLYGON data type. This column is special. Each cell contains information representing the lines, curves, and wrinkles of a state’s borders.

Viewing the class of the object shows that it’s special, too:

class(us_shapes)

[1] "sf"         "data.frame"

It’s technically a data frame, but it’s an {sf} object first. That special status is necessary for putting things on the map. Later on, we’ll be doing things that might jeopardize that {sf} status, but there are simple tricks to keeping it.

Creating a simple map

Objects with class of {sf} are easy for us to visualize.

us_shapes |> 
  ggplot() +
  geom_sf()

This map takes a long time to render because it’s so big and so detailed. And we might not even want to see it all. Alaska’s Aleutian Islands wrap around past longitude 180º and reappear in the northeast. Southwest of them are the U.S. Marshall Islands in the Pacific Ocean. To simplify things, we’ll limit our consideration to the southeastern United States. At the same time, it’s maybe helpful to reiterate that we can use any column to fill states with color in ways we’re used to doing with {ggplot2}.

region3_states <- 
  us_shapes |> 
  filter(REGION == 3) |> 
  select(state = NAME, 
         abbrev = STUSPS,
         division = DIVISION)

states_by_division <- 
  region3_states |> 
  ggplot() +
  geom_sf(aes(fill = division)) + 
  theme_minimal()

states_by_division

Projecting useful data

Now that we’re comfortable making simple maps, let’s use that new power toward something interesting. What, for example, can we learn about this region’s institutions of higher education? Where are they, and how well served are each state’s residents, comparatively?

Prepping data

The U.S. Department of Education’s College Scorecard is a good place to start when studying colleges and universities, including the students who attend them and the challenges they face. This is a pretty large data set, so only 20 columns and 6 rows are shown here:

higher_ed <- read_csv("data/higher_ed.csv")

higher_ed |> 
  select(1:20) |> 
  head()

There’s a lot of useful information here. For now, we’re mostly interested in the number of institutions per state. We can simplify things down to two columns using count(), leaving just one row per state.

higher_ed_summary <- higher_ed |> 
  count(state, sort = TRUE) |> 
  rename(abbrev = state,
         institutions = n)

higher_ed_summary

Focusing on region 3, it’s not surprising that bigger states like Texas have more institutions of higher education, but we might be surprised to see Florida rank higher.

Joining data sets

To limit this data to states in region 3 and prepare it for mapping, grab the relevant state names and abbreviations from region3_states. An inner_join() will combine it with higher_ed, limiting which states are included and adding the geometry data. As long as the {sf} object comes first in these joins, that special status will be retained:

higher_ed_region3 <- 
  region3_states |> 
  inner_join(higher_ed_summary,
             by = "abbrev")

higher_ed_region3 |> 
  ggplot(aes(fill = institutions)) +
  geom_sf(color = "white")

This hasn’t cleared up our confusion about Texas and Florida. Is it possible that Florida needs more colleges and universities because more people live there?

Adding more data

We can learn more if we standardize by population, which can be downloaded from the U.S. Census. Once again, an inner_join() lets us combine this data with mapping data to create an easy visualization:

if(!file.exists("data/NST-EST2022-POP.xlsx")) {
  download.file(url = "https://www2.census.gov/programs-surveys/popest/tables/2020-2022/state/totals/NST-EST2022-POP.xlsx",
                destfile = "data/NST-EST2022-POP.xlsx")
}

state_populations <- 
  "data/NST-EST2022-POP.xlsx" |> 
  readxl::read_excel(skip = 3) |>
  rename(state_name = `...1`) |>
  mutate(state = str_remove_all(state_name, "[.]")) |> 
  select(state, population = `2022`)

state_populations_region3 <- 
  region3_states |> 
  inner_join(state_populations,
             by = "state")

state_populations_region3 |> 
  ggplot(aes(fill = population)) +
  geom_sf(color = "white") +
  scale_fill_continuous(
    labels = scales::label_comma()
    )

1: Adjust the labels option in any scale to change the numbers shown. In particular, the scales package includes some helpful functions for adjusting numeric labels in axes and legends. The label_comma() function adds a comma divider every three digits to simplify reading. Other functions beginning with label_ are worth trying out to suit particular use cases.

Using another inner_join() will add higher_ed_summary so we can calculate the number of colleges and universities per capita, based on the 2022 census. Adding label_number() to the color scale makes sure we see decimals instead of scientific notation. These numbers are still a little confusing, but we’ll clarify them later.

higher_ed_combined <- 
  left_join(state_populations_region3,
            higher_ed_summary,
            by = "abbrev") |> 
  mutate(percap = institutions / population)

schools_per_state_plot <- 
  higher_ed_combined |> 
  ggplot(aes(fill = percap)) +
  geom_sf(color = "white") +
  scale_fill_continuous(
    labels = scales::label_number())

schools_per_state_plot

When adjusted for population, Texas falls to the bottom of the list for the number of colleges and universities per capita.

Converting coordinate systems

We can go further still with the data we have. To compare the actual number of institutions against these state-wide numbers per capita, wouldn’t it be nice to show the specific placement of each college and university?

One of the nice things about {ggplot2} is that layers can be added with a plus sign. We can add any number of layers to this map as long as it’s prepared using the same coordinate system. GPS coordinates, for example, are typically reported as longitude and latitude. We can see these lines of longitude and latitude, along with those coordinates, in the grids and axes of the maps we’ve made so far.

But looks can be deceiving. Despite the presence of these GPS coordinates, our maps have actually made use of some other coordinate system behind the scenes. If we’re going to add more layers, we’ll need to make sure every layer uses the same coordinate system. Converting these coordinate systems may seem daunting, but the approach is similar for every data set.

To add a point for every college or university in region 3, we’ll need to prepare them as shape files and then make sure they’re projected into the same system as the base map. Our starting higher_ed data set already includes columns for LONGITUDE and LATITUDE. We’ll once again limit things to states region 3 and use an inner_join() to combine our data sets. Next, we can drop any rows that lack location data, simplify things to three columns, and make the conversion. The st_as_sf() function converts our GPS coordinates into something that can be understood by the {sf} and {ggplot2} packages. Within that function, crs = 4326 indicates that we’re moving from the coordinate reference system (or “CRS”) used with longitude and latitude.

higher_ed_smaller <- 
  higher_ed |> 
  rename(abbrev = state) |> 
  drop_na(longitude) |> 
  select(abbrev, 
         name, 
         longitude, 
         latitude)

college_dots <- 
  region3_states |> 
  data.frame() |> # this step forcibly drops the {sf} status
  select(state, abbrev) |> 
  inner_join(higher_ed_smaller,
             by = "abbrev") |> 
  st_as_sf(
    coords = c("longitude", "latitude"), 
    crs = 4326)

In the next step, st_crs() allows us to compare coordinate reference systems between college_dots and the higher_ed_combined data set we plan to project it onto.

st_crs(college_dots)[[1]]

[1] "EPSG:4326"

st_crs(higher_ed_combined)[[1]]

[1] "NAD83"

They don’t match. But they need to if we’re going to overlay them on top of each other. Fortunately, st_transform() makes the necessary adjustments:

college_dots <- 
  college_dots |> 
  st_transform(crs = st_crs(us_shapes))

Adding layers

Now that we’ve converted everything to a common CRS, we can use this object for a new layer added to our existing plot. Set the data field inside an additional geom_sf() layer, and adjust the style.

states_and_schools <- 
  schools_per_state_plot +
  geom_sf(data = college_dots,
          fill = "blue",
          color = "white",
          shape = 21,
          alpha = 0.4)

states_and_schools

Final touches

We’ve come a long way and have a lot to be proud of, but things can be polished further. Before calling a map “done,” think once again about the three things that are crucial to any data visualization. First reconsider accuracy, and do everything possible to ensure a visualization avoids introducing distortions. Second, consider the message and whether it can be communicated more clearly if some part of were changed. Finally, consider attractiveness, and look toward changes from the default that might lend a map some extra beauty.

Accuracy: Map projection

One area of accuracy to consider with maps comes from the projection we use. Notice, for example, that all of our lines of latitude are straight, running from left to right: 25ºN, 30ºN, and so on. This is common when using data from the tigris package, since it supplies shape files using a CRS with the code 4269.

Note

Whether we like CRS 4269, dislike it, or even understand what it really means isn’t important for us to get this far. What’s crucial is that all of our data gets charted using the same CRS. Using a consistent CRS lets everything line up as expected.

Even though we’re putting our map on a flat surface, the world isn’t flat. Distortion from the projection we choose can risk the accuracy of our final map. To limit distortion, these lines might better be shown with a curve.

The Albers projection is a good option for maps in North America, since it preserves area without changing shape too much.¹ For the area covering the United States, we’ll use an Albers projection with the CRS code 5070, which we can specify using the coord_sf() function in the last line here:

states_and_schools <- 
  states_and_schools +
  coord_sf(crs = st_crs(5070))

states_and_schools

This version of the map limits distortion of size. When polishing a map, always at least consider the distortion from the coordinate system being used—and consider whether the map might be improved by choosing a different projection. This consideration becomes even more important with maps showing larger areas.

Message: Labels and scales

From our explorations, we know that Texas has the second-highest number of colleges and universities in this region, but it also has the lowest number of colleges and universities per capita. Unfortunately, the numbers in our color scale are hard to interpret.

We want to show the number of institutions per person, but the values might be clearer if displayed as fractions, in terms of persons per institution. Inverting these numbers will get us that value.

schools_per_capita <- 
  higher_ed_combined |> 
  ggplot(aes(fill = 1/percap)) +
  geom_sf(color = "white") +
  geom_sf(data = college_dots,
          fill = "blue",
          color = "white",
          shape = 21,
          alpha = 0.4) +
  coord_sf(crs = st_crs(5070))

1: Changing fill from percap to 1/percap will invert it.

Setting trans = "reverse" will ensure the color scale goes the right direction, and adding a prefix to our labels lets us display them as fractions. The cut_short_scale() function prints numbers like “100K” instead of “100,000”:

schools_per_capita <- 
  schools_per_capita  +
  labs(fill = "institutions\nper resident") +
  scale_fill_continuous(
    trans = "reverse",
    labels = scales::label_number(
      prefix = "1/",
      scale_cut = scales::cut_short_scale())
    )

schools_per_capita

1: Reverse the scale’s direction by adding trans = "reverse".
2: Options inside the scales package’s function label_number() let us further adjust how things are shown. For instance, we want to show these inverted values as simple fractions, and the prefix option lets us add text before the number shown. Similarly, we can simplify how the scale is “cut” by combining the scale_cut option with cut_short_scale() to show, for example, “1000” as “1K” and “1000000” as “1M.”

This scale makes it clear that West Virginia has about one college or university for every 100,000 residents, while Texas has one for every 300,000.

Adding a title and subtitle helps clarify the map’s takeaway message. A caption at the bottom sources the data to give credit and to communicate the efforts we’ve gone to to make sure things are accurate.

schools_per_capita <- 
  schools_per_capita +
  labs(title = "Region 3 Colleges and Universities",
       subtitle = "Texas has 94 colleges or universities, the second highest number in the region,\nbut each serves more than 300,000 residents.",
       caption = "Data: U.S. Department of Education, U.S. Census")

schools_per_capita

Attractiveness: Theme and color

Things look good so far, and we might stop here. Before we do, we can adjust the theme and colors to something that might be more pleasing to the eye.

For this map, we’re addressing data on a state level, so lines of latitude and longitude aren’t necessary. What’s more, their inclusion north of Oklahoma and west of Texas mistakenly suggests that the region is an island. Using theme_void() removes the grid lines and the background, placing the map on a clean white space.

The default continuous color palette in {ggplot2} only scales values in terms of luminance, showing every value as a lighter or darker shade of blue. Other color palettes simultaneously scale values in terms of hue, sliding not only from darker to lighter but also around part of a color wheel. Adding this second dimension of change can help viewers perceive subtler variations in the middle values of a range. A Viridis color palette, set with scale_viridis_c(), offers good balance when shifting luminance and hue.

schools_per_capita +
  scale_fill_viridis_c(
    trans = "reverse",
    labels = scales::label_number(
      prefix = "1/",
      scale_cut = scales::cut_short_scale())) +
  theme_void()

This final version hits each requirement. It shows data accurately, it conveys a clear message, and it is arguably more attractive.

Footnotes

To limit distortion, choose a CRS made for the area being mapped. An Albers projection isn’t the only good option; it’s just one of a few types of “conic” projections that are well suited for visualizing this part of the world. It’s also the choice of the U.S. Census and the U.S. Geological Survey, so nobody should reasonably fault us for choosing it to show maps of this area. More options and areas for CRS codes with the Albers projection can be found on spatialreference.org.↩︎