Environmental Data Analysis and Visualization

Now You See It: Basics of Visualization

Warm-up activity

  • Create a project and file system for this week’s lecture

  • Download the middlesexEColi.csv dataset from Canvas to the appropriate place in your file system and read it into R

    • This is data from E. coli tests conducted on Middlesex County surface water sources between 2018 and 2022.
  • Create a histogram of the results of these tests. The horizontal (x) axis should show colony-forming units per 100 mL sample (cfu/100mL).

  • After you’ve plotted the histogram, you can use the abline function with the argument v=235 to draw a vertical line at 235, and color it red.

    • 235 cfu/100mL is the accepted level for E. coli in the State of Massachusetts
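
The histogram and reference line described in this activity can be sketched in base R as follows. This is only a minimal sketch: the values are simulated stand-ins so that the code runs without the course file, and the column name `result` and the file path in the comment are assumptions, not the actual structure of middlesexEColi.csv.

```r
# Simulated stand-in values so the sketch runs anywhere.
# These are NOT the real Middlesex County test results.
set.seed(1)
ecoli <- data.frame(result = rlnorm(200, meanlog = 4.5, sdlog = 1))

# With the real file you would read it in instead, for example:
# ecoli <- read.csv("data/middlesexEColi.csv")  # path and column name are assumptions

# Histogram of test results, with the x axis labelled in cfu/100mL
hist(ecoli$result,
     xlab = "E. coli (cfu/100mL)",
     main = "E. coli test results")

# Add a vertical red line at the 235 cfu/100mL level
abline(v = 235, col = "red")
```

Note that abline draws on top of an existing plot, so it has to be called after hist.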

Sensor of the day

The Mystic River Water Quality Buoy!

Mystic River Buoy Readings
National Water Quality Monitoring Council

Why visualize?

Visualization helps us to identify patterns and structures in data that are not evident from tables or numerical summaries.

Faith, J. Tyler. 2018. “Paleodietary Change and Its Implications for Aridity Indices Derived from δ18O of Herbivore Tooth Enamel.” https://doi.org/10.1016/j.palaeo.2017.11.045.

Marine Reservoir Correction data from calib.org

Mean x: 9

Mean y: 7.5

Pearson correlation coefficient (r): 0.816

Coefficient of determination (R²): 0.67
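
These summary values match the well-known Anscombe's quartet, which ships with base R as the `anscombe` data frame. Assuming that is the dataset these statistics refer to, the sketch below computes the same four statistics for each of the four x-y pairs and then plots them, which shows how different the data look despite nearly identical summaries.

```r
# anscombe is a built-in data frame with columns x1..x4 and y1..y4
quartetStats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x),
    mean_y = mean(y),
    r = cor(x, y),
    r_squared = cor(x, y)^2)
})
colnames(quartetStats) <- paste0("set", 1:4)

# One column per dataset, one row per statistic
round(quartetStats, 3)

# Plot the four datasets in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)  # restore the previous plotting layout
```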

Avenue, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons

Matejka and Fitzmaurice, 2017. “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing.” https://doi.org/10.1145/3025453.3025912.

Also see: the datasauRus R package

How to make a good visualization

  • Choose the right chart for the data

  • Maximize the data-to-ink ratio

  • Make deliberate design decisions

What kind of visualization?

The type of visualization that should be used depends on the kind of information being conveyed.

Wilke, Claus. 2019. Fundamentals of Data Visualization

Data-ink

Data-ink (per Tufte 1983) refers to ink (or pixels) that, if erased, would reduce the information being presented.

Ideally, the ratio of data-ink to total ink should be close to 1.
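
As a rough illustration of raising the data-ink ratio, the sketch below draws the same scatterplot twice with ggplot2 (the plotting package used later in this lecture): once with the default theme and once with the background fill and minor grid lines removed. The built-in iris dataset and the object names are choices made for this example only.

```r
library(ggplot2)

basePlot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()

# Default theme: grey panel background plus major and minor grid lines
basePlot

# Less non-data ink: no background fill and no minor grid lines
leanPlot <- basePlot +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())
leanPlot
```

Whether the remaining major grid lines count as data-ink is a judgment call: they carry no data themselves but can help a reader estimate values.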

Activity: To erase or not to erase?

https://simplexct.com/data-ink-ratio

Chartjunk

The most egregious use of non-data ink is often referred to as chartjunk.

Su, Yu-Sung. 2008. "It's Easy to Produce Chartjunk Using Microsoft® Excel 2007 but Hard to Make Good Graphs." https://doi.org/10.1016/j.csda.2008.03.007.

Data-ink: Can it go too far?

Can chartjunk be useful?

Being Deliberate About Design

How does this choice help someone understand the data?

  • Contrast

  • Clarity

  • Highlighting

  • Messaging

Accessibility

Some design choices affect some audiences more than others

  • Color palette

  • Text and symbol sizes

Wilke, Claus. 2019. Fundamentals of Data Visualization
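
Both accessibility choices listed above can be adjusted directly in ggplot2. The sketch below is one possible approach, not the only one: it uses the Okabe-Ito palette that ships with base R (version 4.0 or later), maps species to point shape as well as color so that groups remain distinguishable without color, and enlarges the points and text. The specific sizes are arbitrary values chosen for this example.

```r
library(ggplot2)

# Three colors from base R's Okabe-Ito palette, skipping black.
# unname() makes ggplot2 assign them by position rather than by color name.
okabeIto <- unname(palette.colors(4, palette = "Okabe-Ito")[2:4])

accessiblePlot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length,
                                   color = Species, shape = Species)) +
  geom_point(size = 3) +                    # larger symbols
  scale_color_manual(values = okabeIto) +   # colorblind-friendly palette
  theme_grey(base_size = 16)                # larger text throughout
accessiblePlot
```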

A layered grammar of graphics

A framework (per Wilkinson et al. 2005; Wickham 2010) used to describe the components of a data visualization in terms of a set of layered objects

Aesthetic mapping

https://wilkelab.org

Additional components

Component   | Description                                      | Example
Statistics  | Statistical transformations or summaries of data | Mean, log transformation, smoothing spline
Facets      | Divisions in data used for multi-plotting        | Side-by-side plot, 2 x 2 plot
Coordinates | Space used for plotting values                   | Cartesian 2D space, polar coordinate space
Themes      | Non-data ink                                     | Font size, shading of background grid, location of tick marks

Introducing ggplot2

# Load the tidyverse (includes ggplot2)
library(tidyverse)

# Create a scatterplot: sepal length on x, petal length on y
myPlot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()

# View plot
myPlot

# Load the tidyverse (includes ggplot2)
library(tidyverse)

# Create the same scatterplot, now also mapping species to point color
myPlot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point()

# View plot
myPlot
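
The additional components of the layered grammar (statistics, facets, coordinates, themes) can be added to a plot like this one in the same way, one `+` at a time. The sketch below is only an illustration; the particular choices (a linear fit, one panel per species, the y-axis limits, and the minimal theme) are assumptions made for this example.

```r
library(ggplot2)

layeredPlot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point() +
  # Statistics: overlay a fitted straight line for each panel
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # Facets: one panel per species
  facet_wrap(~ Species) +
  # Coordinates: Cartesian space, zoomed to a fixed y range
  coord_cartesian(ylim = c(0, 8)) +
  # Themes: control non-data ink such as background and text size
  theme_minimal(base_size = 14)

layeredPlot
```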

Activity: My first plot, part 2

  • Use the as_tibble function to convert the faithful dataset from a data frame to a tibble

  • Using ggplot2, plot the faithful dataset as a scatterplot, with waiting time as the x variable
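
One possible way to approach this activity is sketched below. The activity only specifies waiting time as the x variable; putting eruption duration (the `eruptions` column) on the y axis is an assumption, as are the object names.

```r
library(tibble)   # provides as_tibble; also loaded by library(tidyverse)
library(ggplot2)

# Convert the built-in faithful data frame to a tibble
faithfulTbl <- as_tibble(faithful)

# Scatterplot with waiting time on x; eruptions on y is an assumption
faithfulPlot <- ggplot(faithfulTbl, aes(x = waiting, y = eruptions)) +
  geom_point()

faithfulPlot
```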