Critique and Create Project 1: Amounts

Introduction

These projects ask you to critique a series of visualizations and then to apply best practices as you create your own. Follow along below, but ultimately do the work using the corresponding .qmd file on Posit Cloud.

This week’s data

The spills2019 data set is sourced from the U.S. Coast Guard’s National Response Center. Take a look at it to understand the kind of data included. Only 100 rows are shown here, but the full set we’re using has over 30,000.

Critique

As an example, I’ll show two visualizations of amounts that can be drawn from this data set. Pay attention to the techniques (the code) used to create the visualizations, but also be willing to judge the merits of both. In other words, consider how each visualization was made, and why each visualization is well made or not well made.

For this first project, the polished visualization does show you the code used to create it. In addition to this code, consult the visualizing amounts methods page.

Rough figure

Code

rough_draft <- 
  spills2019 |> 
  drop_na(DESC_REMEDIAL_ACTION) |> 
  count(DESC_REMEDIAL_ACTION, sort = TRUE) |> 
  slice_head(n = 20)

rough_draft |>
  ggplot(aes(
    y = reorder(DESC_REMEDIAL_ACTION, n),
    x = n
  )) +
  geom_col()

I used a few techniques to get this first figure working:

1. Dropped any rows that had NA in the DESC_REMEDIAL_ACTION column.
2. Used count() to tabulate the number of times each value was used in that column, and sorted the result.
3. Used slice_head() to limit the resulting table to 20 rows.
4. Used ggplot() and geom_col() to create a bar chart.
5. Set the X-axis to the column called n, made in step 2.
6. Sorted the Y-axis by the values of n using reorder().
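
To see what the data-preparation steps above do, it can help to run them on a tiny table. This is only a sketch: toy_spills is a hypothetical stand-in for spills2019 with made-up values, and it keeps two rows instead of 20.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for spills2019, using the same column name
toy_spills <- tibble(
  DESC_REMEDIAL_ACTION = c("NONE", "BOOMS APPLIED", NA, "NONE",
                           "BOOMS APPLIED", "NONE", "REPAIRS MADE")
)

toy_counts <-
  toy_spills |>
  drop_na(DESC_REMEDIAL_ACTION) |>              # remove the row with NA
  count(DESC_REMEDIAL_ACTION, sort = TRUE) |>   # one row per value, largest n first
  slice_head(n = 2)                             # keep only the top two rows

toy_counts

# reorder() returns a factor whose levels are sorted by n (smallest first).
# ggplot() draws the first level at the bottom of the Y-axis,
# so the largest bar ends up on top.
toy_levels <- levels(reorder(toy_counts$DESC_REMEDIAL_ACTION, toy_counts$n))
toy_levels
```

Printing toy_counts after each step (by deleting the later lines of the pipe) is a quick way to check what each function contributes.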

Compare this chart to the improved versions in the next two sections.

Half-improved figure

Code

rough_draft |> 
  ggplot(aes(y = reorder(DESC_REMEDIAL_ACTION, n),
             x = n)) +
  geom_col(aes(fill = DESC_REMEDIAL_ACTION=="NONE"),
           show.legend = FALSE) +
  theme_minimal() +
  labs(x = "Incidents",
       y = "Remedial action",
       title = "Many environmental spills recorded in 2019 had no remedial action.") +
  scale_x_continuous(expand = c(0,0), 
                     labels = scales::label_comma()) +
  theme(panel.grid.major.y = element_blank(),
        plot.title.position = "plot")

This chart starts from the same data, which is a little messy. What does it improve? What more is it doing, and what less does it show? Where does it still fall short? (No need to answer these!)
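
One technique in the code above may be unfamiliar: mapping fill to a logical test. The sketch below uses a hypothetical toy_counts table with made-up numbers. The test action == "NONE" is either TRUE or FALSE for each bar, so the chart gets exactly two fill colors, and only the bar that passes the test is highlighted.

```r
library(dplyr)
library(ggplot2)

# Hypothetical counts; the column names here are illustrative only
toy_counts <- tibble(
  action = c("NONE", "BOOMS APPLIED", "REPAIRS MADE"),
  n = c(30, 20, 10)
)

toy_plot <-
  toy_counts |>
  ggplot(aes(y = reorder(action, n), x = n)) +
  # fill is TRUE for the "NONE" bar and FALSE for every other bar
  geom_col(aes(fill = action == "NONE"),
           show.legend = FALSE)

toy_plot

# layer_data() shows the values ggplot computed for each bar,
# including the fill color assigned to it
toy_fills <- layer_data(toy_plot)$fill
```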

Polished figure

Code

polished_draft <- 
  spills2019 |> 
  drop_na(DESC_REMEDIAL_ACTION) |> 
  
  # count the total number of incidents
  mutate(total_num = n()) |> 
  
  # clean up some messy records
  mutate(DESC_REMEDIAL_ACTION = DESC_REMEDIAL_ACTION |> 
           str_remove_all("[.]$") |> 
           str_replace_all("NOTIFICATIONS", "NOTIFICATION")) |> 
  
  # split up remedial actions when multiple actions are recorded
  mutate(DESC_REMEDIAL_ACTION = strsplit(DESC_REMEDIAL_ACTION, ", ")) |> 
  unnest_longer(DESC_REMEDIAL_ACTION) |> 
  
  # change remedial actions from uppercase to mixed case
  mutate(DESC_REMEDIAL_ACTION = DESC_REMEDIAL_ACTION |> 
           str_to_sentence()) |> 
  
  # instead of count(), use group_by() and summarize() to keep total_num
  group_by(total_num, DESC_REMEDIAL_ACTION) |> 
  summarize(number = n()) |> 
  ungroup() |> 
  
  # put rows in order by number
  arrange(desc(number)) |> 
  
  # take the top 20
  slice_head(n = 20)

polished_draft |> 
  ggplot(aes(y = reorder(DESC_REMEDIAL_ACTION, number),
             x = number/total_num)) +
  geom_col(aes(fill = DESC_REMEDIAL_ACTION=="None"),
           show.legend = FALSE) +
  theme_minimal() +
  labs(x = "Incidents",
       y = "Remedial action",
       title = "Many environmental spills recorded in 2019 had no remedial action.") +
  scale_x_continuous(expand = c(0,0), 
                     labels = scales::label_percent()) +
  scale_fill_manual(values = c("black", "red")) +
  theme(panel.grid.major.y = element_blank(),
        plot.title.position = "plot")

Critique the visualization

This last version of the figure takes a different approach. Hopefully it’s a stronger visualization in the end.
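
The cleaning and splitting steps in the polished code are easier to follow on a few rows. The sketch below runs the same functions on a hypothetical toy_spills table with made-up records. One small variation, labeled in the comments, is using .groups = "drop" inside summarize() in place of a separate ungroup().

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical records imitating the messiness in spills2019
toy_spills <- tibble(
  DESC_REMEDIAL_ACTION = c("NONE.", "BOOMS APPLIED, NOTIFICATIONS",
                           "NOTIFICATION", "NONE")
)

toy_polished <-
  toy_spills |>
  # 4 incidents, counted before any record is split into several rows
  mutate(total_num = n()) |>
  # remove a trailing period and make the plural match the singular
  mutate(DESC_REMEDIAL_ACTION = DESC_REMEDIAL_ACTION |>
           str_remove_all("[.]$") |>
           str_replace_all("NOTIFICATIONS", "NOTIFICATION")) |>
  # one row per action: the second record becomes two rows (5 rows total)
  mutate(DESC_REMEDIAL_ACTION = strsplit(DESC_REMEDIAL_ACTION, ", ")) |>
  unnest_longer(DESC_REMEDIAL_ACTION) |>
  mutate(DESC_REMEDIAL_ACTION = str_to_sentence(DESC_REMEDIAL_ACTION)) |>
  # grouping by total_num as well keeps that column in the result
  group_by(total_num, DESC_REMEDIAL_ACTION) |>
  summarize(number = n(), .groups = "drop") |>   # variation on ungroup()
  arrange(desc(number))

toy_polished
```

Because one incident can list several actions, the values of number add up to more than total_num here (5 actions across 4 incidents), so the shares number/total_num can sum to more than 100%.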

You respond

Wilke chapters 22, 23, and 29 lay out some important practices for visualizations. Consider them as you answer these three questions:

In what noticeable ways has the polished version improved upon the first rough draft?
Which of these changes do you think made the biggest difference?
Is there anything you see that could still be improved?

Rough table

In addition to an attractive figure, it is sometimes helpful to show the numbers in a table. Both the rough table and the polished table will start from the polished data set prepared above, since that data is already a little cleaner.

Code

polished_draft |> 
  gt()

total_num	DESC_REMEDIAL_ACTION	number
25265	None	2081
25265	Investigation underway	1457
25265	Absorbents applied	1231
25265	Clean up underway	1211
25265	Booms applied	1033
25265	Contractor has been hired	942
25265	Made notification	638
25265	Clean up crew on-site	546
25265	Notification	505
25265	Making notification	489
25265	Cleanup completed	482
25265	Dissipate naturally	387
25265	Investigation is underway	304
25265	Material contained	289
25265	Shutdown system	251
25265	Clean up crew enroute	244
25265	Vac truck used	227
25265	Isolated area	196
25265	Secured operations	132
25265	Repairs made	128

This rough table doesn’t provide much more detail than the visualization, and it unhelpfully includes one column of repeating values.

Polished table

Code

polished_draft |> 
  slice_head(n = 10) |> 
  mutate(percent = number / total_num) |> 
  select(-total_num) |> 
  gt() |> 
  fmt_percent(columns = percent) |> 
  fmt_number(columns = number, decimals = 0) |> 
  tab_spanner(label = "Incidents", columns = c("number", "percent")) |> 
  cols_label(matches("percent") ~ "%",
             matches("number") ~ "n",
             contains("DESC_R") ~ "Remedial action") |> 
  opt_stylize(style = 6, color = "cyan") |> 
  tab_header(title = "Responding to environmental hazards",
             subtitle = "USCG National Response Center, 2019")

Responding to environmental hazards
USCG National Response Center, 2019
	Incidents
Remedial action	n	%
None	2,081	8.24%
Investigation underway	1,457	5.77%
Absorbents applied	1,231	4.87%
Clean up underway	1,211	4.79%
Booms applied	1,033	4.09%
Contractor has been hired	942	3.73%
Made notification	638	2.53%
Clean up crew on-site	546	2.16%
Notification	505	2.00%
Making notification	489	1.94%

Critique the table

The second table took more effort to prepare. At least some of that effort was worth it.
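
A stripped-down version of the polished table can make the individual gt steps easier to see. This is a sketch on a hypothetical toy_table_data table with made-up numbers. It labels the columns with the name = "label" form of cols_label(), a simpler alternative to the matches() and contains() helpers used above, and it leaves out the styling and header steps.

```r
library(dplyr)
library(gt)

# Hypothetical counts, shaped like polished_draft
toy_table_data <- tibble(
  total_num = c(200, 200),
  DESC_REMEDIAL_ACTION = c("None", "Booms applied"),
  number = c(50, 30)
)

toy_prepared <-
  toy_table_data |>
  mutate(percent = number / total_num) |>   # a proportion between 0 and 1
  select(-total_num)                        # drop the repeating column

toy_table <-
  toy_prepared |>
  gt() |>
  # fmt_percent() multiplies by 100 and adds the percent sign for display
  fmt_percent(columns = percent, decimals = 1) |>
  # group the two numeric columns under one shared heading
  tab_spanner(label = "Incidents", columns = c(number, percent)) |>
  cols_label(DESC_REMEDIAL_ACTION = "Remedial action",
             number = "n",
             percent = "%")

toy_table
```

Note that the formatting functions change only how values are displayed. The percent column in toy_prepared still holds plain proportions such as 0.25.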

You respond

Wilke chapter 22 discusses some principles of table design. Consider it as you answer the first of these two questions:

Compare the end results. Which parts of the second table show an improvement from the first?
Consider the code used to generate the second table. Which steps are unclear or need explanation? (You’re welcome to tinker with the code to see how slight changes can make a difference, but please begin by considering the example provided for you.)

Create

Recreating a visualization

Taking inspiration from the code for the Rough figure section above, modify spills2019 until it has 20 rows and the first few rows look like this:

Then use that table to recreate the following visualization:

You code

Write code to recreate this table and figure.

Improving a visualization

Do something to the above visualization to improve it, or create a new visualization of some amounts from this data set. Your final product doesn’t have to be as fully polished as the Polished figure shown above, but do consider the relative strengths and weaknesses of the visualization you’ve just recreated. Write your code here:

You code

Write code to improve upon this figure.

Creating a table

Finally, take inspiration from the Polished table section above to create a table of numbers here. The final result can be quite simple.

You code

Write code to create this table.