Environmental Data Analysis and Visualization

Now You See It: Basics of Visualization

Warm-up activity

  • Create a project and file system for this week’s lecture

  • Download the middlesexEColi.csv dataset from Canvas to the appropriate place in your file system and read it into R as a tibble

    • This is data from E. coli tests conducted on Middlesex County surface water sources between 2018 and 2022.
  • Open a script, and using ggplot2, create a histogram of the results of these tests. The horizontal (x) axis is colony forming units per 100mL sample (cfu/100mL)

  • If there’s time, try adding the geom_vline function with the argument xintercept=235 to draw a vertical line at 235, and color it red.

    • The value of 235 cfu/100mL is the State of Massachusetts accepted level for E. coli
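
One way the warm-up could come together is sketched below. The file path and the column name `cfu` are assumptions, since the dataset's actual column names are not given here. Simulated stand-in values are used so the block runs without the file.

```r
library(tidyverse)

# In the activity, the data would be read from the downloaded file, e.g.:
# eColi <- read_csv("data/middlesexEColi.csv")  # path is an assumption

# Stand-in data so this sketch runs on its own.
# `cfu` is a hypothetical column name; the values are simulated, not real results.
set.seed(1)
eColi <- tibble(cfu = rlnorm(200, meanlog = 4, sdlog = 1))

# Histogram of test results with a red vertical line at 235 cfu/100 mL
eColiPlot <- ggplot(eColi, aes(x = cfu)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = 235, color = "red") +
  labs(x = "cfu/100 mL", y = "Number of tests")

eColiPlot
```

Because the reference line is vertical, it is placed with `xintercept`. A horizontal line would use `geom_hline` with `yintercept` instead.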

Sensor of the day

The Mystic River Water Quality Buoy!

Sensor of the day

Mystic River Buoy Readings
National Water Quality Monitoring Council

Introducing ggplot2

# Load the tidyverse (includes ggplot2)
library(tidyverse)

# Create a scatterplot with ggplot2, colored by species
myPlot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point()

# View the plot
myPlot

Introducing ggplot2

A layered grammar of graphics

A framework (per Wilkinson et al. 2005; Wickham 2010) used to describe the components of a data visualization in terms of a set of layered objects
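
As a rough illustration of what "layered" means in practice, the sketch below builds a plot one layer at a time using the built-in iris data. The object names are illustrative only.

```r
library(ggplot2)

# Data and aesthetic mappings only: this draws empty axes, with no layers yet
basePlot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species))

# First layer: points drawn from the mappings above
pointPlot <- basePlot + geom_point()

# Second layer: rug marks along the axes, stacked on top of the points
layeredPlot <- pointPlot + geom_rug()

layeredPlot
```

Each `+` adds a layer to the same plot object, so intermediate versions can be saved and reused.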

Aesthetic mapping

https://wilkelab.org

Additional components

Component    Description                                        Example
Statistics   Statistical transformations or summaries of data   Mean, log transformation, smoothing spline
Facets       Divisions in data used for multi-plotting          Side-by-side plot, 2 x 2 plot
Coordinates  Space used for plotting values                     Cartesian 2D space, polar coordinate space
Themes       Non-data ink                                       Font size, shading of background grid, location of tick marks
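
A minimal sketch of how these four components might appear in ggplot2 code, using the built-in iris data. The particular choices (a linear fit, faceting by species) are illustrative, not prescribed by the framework.

```r
library(ggplot2)

componentPlot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # statistic: fitted line
  facet_wrap(~ Species) +                                    # facets: one panel per species
  coord_cartesian() +                                        # coordinates: Cartesian 2D space
  theme_bw(base_size = 14)                                   # theme: non-data ink

componentPlot
```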

Why visualize?

Visualization helps us to identify patterns and structures in data that are not evident from tables or numerical summaries.

Faith, J. Tyler. 2018. “Paleodietary Change and Its Implications for Aridity Indices Derived from δ18O of Herbivore Tooth Enamel.” https://doi.org/10.1016/j.palaeo.2017.11.045.

Why visualize?

Marine Reservoir Correction data from calib.org

Why visualize?

Mean x: 9

Mean y: 7.5

Pearson correlation coefficient (r): 0.816

Coefficient of determination (R²): 0.67
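
These summary values match those of Anscombe's quartet, which base R provides as the `anscombe` data frame. Assuming that is the dataset pictured, the sketch below computes the statistics for each of the four sets and then plots them. The object names are illustrative.

```r
library(tidyverse)

# Reshape base R's `anscombe` (columns x1..x4, y1..y4) into one row per point
quartet <- bind_rows(lapply(1:4, function(i) {
  tibble(
    set = paste("Set", i),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))

# Summary statistics per set
quartetStats <- quartet %>%
  group_by(set) %>%
  summarize(
    meanX = mean(x),
    meanY = mean(y),
    r = cor(x, y),
    rSquared = cor(x, y)^2
  )
print(quartetStats)

# One panel per set, to compare the shapes behind the statistics
quartetPlot <- ggplot(quartet, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set)

quartetPlot
```

The four sets share nearly identical summary statistics, yet the panels look quite different from one another.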

Why visualize?

Avenue, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons

Why visualize?

Matejka and Fitzmaurice, 2017. “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing.” https://doi.org/10.1145/3025453.3025912.

Also see: the datasauRus package

How to make a good visualization

  • Choose the right chart for the data

  • Maximize the data-to-ink ratio

  • Make deliberate design decisions

What kind of visualization?

The type of visualization that should be used depends on the kind of information being conveyed.

What kind of visualization?

Wilke, Claus. 2019. Fundamentals of Data Visualization

Data-ink

Data-ink (per Tufte 1983) refers to ink (or pixels) that, if erased, would reduce the information being presented.

Ideally, the ratio of data-ink to total ink should be close to 1.
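
Following Tufte's definition, the ratio can be written as:

```latex
\[
\text{data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
\]
```

A value of 1 would mean that every mark on the graphic carries data. Lower values mean more of the ink could be erased without losing information.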

Activity: To erase or not to erase?

https://simplexct.com/data-ink-ratio

Chartjunk

The most egregious use of non-data ink is often referred to as chartjunk.

Su, Yu-Sung. 2008. “It’s Easy to Produce Chartjunk Using Microsoft®Excel 2007 but Hard to Make Good Graphs.” https://doi.org/10.1016/j.csda.2008.03.007.

Data-ink: Can it go too far?

Can chartjunk be useful?

Being Deliberate About Design

How does this choice help someone understand the data?

  • Contrast

  • Clarity

  • Highlighting

  • Messaging

Contrast and clarity: Data-ink and themes

ggplot(gmAsia2007, aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  scale_x_log10() +
  theme_classic()
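
The `gmAsia2007` object is not defined in these slides. Its column names match the `gapminder` dataset, so one plausible construction is sketched here. It assumes the `gapminder` package is installed and that the object holds Asian countries in 2007.

```r
library(tidyverse)
library(gapminder)  # assumption: the source of the data

# Keep only Asian countries in 2007 (the assumed meaning of the object name)
gmAsia2007 <- gapminder %>%
  filter(continent == "Asia", year == 2007)

gmAsia2007
```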

Themes and data-ink

While each graphical element can be modified individually, themes provide a way to modify the overall look of the “non-data ink”
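
To make the contrast concrete, the sketch below changes a few non-data elements one at a time with `theme()`, then applies a complete theme in a single call. It uses the built-in iris data so that it runs on its own, and the object names are illustrative.

```r
library(ggplot2)

basePlot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()

# Option 1: change individual non-data elements one at a time
elementPlot <- basePlot +
  theme(
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.line = element_line(color = "black")
  )

# Option 2: apply a complete theme that sets many elements at once
themedPlot <- basePlot + theme_classic()

themedPlot
```

The two approaches can be combined: a `theme()` call added after a complete theme overrides only the elements it names.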

Themes and data-ink

ggplot(gmAsia2007, aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  scale_x_log10()

Themes and data-ink

ggplot(gmAsia2007, aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  scale_x_log10() +
  theme_bw()

Themes and data-ink

ggplot(gmAsia2007, aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  scale_x_log10() +
  theme_classic()

Accessibility

Some design choices affect some audiences more than others

  • Color palette

  • Text and symbol sizes

Wilke, Claus. 2019. Fundamentals of Data Visualization
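
As a hedged illustration of both bullet points, the sketch below uses the viridis color scale that ships with ggplot2, maps species to shape as well as color, and increases the symbol and text sizes. The specific sizes are arbitrary choices, not recommendations from the source.

```r
library(ggplot2)

accessiblePlot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length,
                                   color = Species, shape = Species)) +
  geom_point(size = 3) +              # larger symbols
  scale_color_viridis_d(end = 0.9) +  # viridis palette; end = 0.9 trims the palest yellow
  theme_classic(base_size = 16)       # larger text throughout

accessiblePlot
```

Mapping the same variable to both color and shape means the groups remain distinguishable even when the colors are hard to tell apart.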

Being deliberately misleading

https://eagereyes.org/blog/2013/banking-45-degrees

Being deliberately misleading

https://eagereyes.org/blog/2013/baselines

Being deliberately misleading

https://infolific.com/technology/internet/seo-lie-factor/

The big picture

  • Visualization is foremost about making data more understandable

  • Guidelines like maximizing data-ink and being deliberate about design help us make decisions that will facilitate this goal

  • The grammar of graphics helps us make these decisions explicitly by connecting elements of the data to elements of the plot

  • The ggplot2 package lets us put that grammar to work inside the data environment we’re creating in R

Coursekeeping

  • Visualization critiques begin next week

    • Check the list on Canvas to see when your critique is due
  • Coding assignment #1 is due on Thursday

Next week

  • Introducing data analysis

  • Finding a statistic for assessing your data

  • Visualizing stats