Create a project and file system for this week’s lecture
Download the middlesexEColi.csv dataset from Canvas to the appropriate place in your file system and read it into R
Create a histogram of the test results. The horizontal (x) axis should show colony-forming units per 100 mL sample (cfu/100mL).
After you’ve plotted the histogram, you can use the `abline` function with the argument `v=235` to draw a vertical line at 235, and color it red.
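
One possible way to approach the histogram exercise is sketched below. Because the CSV's location and column names are not given here, the sketch uses simulated placeholder values in a hypothetical column called `cfu`; with the real file you would start from something like `read.csv()` on your own path and substitute the actual column name.

```r
# Placeholder data: simulated values standing in for the E. coli results.
# These are NOT the real measurements; `cfu` is a hypothetical column name.
set.seed(1)
ecoli <- data.frame(cfu = round(rlnorm(100, meanlog = 4.5, sdlog = 1)))

# Histogram with cfu/100mL on the horizontal (x) axis
hist(ecoli$cfu,
     main = "E. coli test results (simulated placeholder data)",
     xlab = "cfu/100mL")

# Add a vertical red line at 235 on top of the existing histogram
abline(v = 235, col = "red")
```

`abline()` draws on the plot that is currently open, so it has to be called after `hist()`.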
The Mystic River Water Quality Buoy!
Visualization helps us to identify patterns and structures in data that are not evident from tables or numerical summaries.
Marine Reservoir Correction data from calib.org
Mean x: 9
Mean y: 7.5
Pearson correlation coefficient (r): 0.816
Coefficient of determination (R²): 0.67
Also see: datasaurus package
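
The summary values above match those of Anscombe's quartet, which ships with base R as the `anscombe` data frame. Assuming that is the dataset being summarised, a short sketch like the following shows that four visibly different datasets can share nearly identical summaries:

```r
# Assumption: the statistics above describe Anscombe's quartet, available in
# base R as `anscombe` (columns x1..x4 and y1..y4).
quartet_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(set       = i,
             mean_x    = mean(x),
             mean_y    = mean(y),
             r         = cor(x, y),     # Pearson correlation coefficient
             r_squared = cor(x, y)^2)   # coefficient of determination
}))

print(round(quartet_stats, 3))

# Plot the four sets side by side to see how different they look
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = "x", ylab = "y", main = paste("Set", i))
}
par(old_par)
```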
Choose the right chart for the data
Maximize the data-to-ink ratio
Make deliberate design decisions
The type of visualization that should be used depends on the kind of information being conveyed.
Data-ink (per Tufte 1983) refers to ink (or pixels) that, if erased, would reduce the information being presented.
Ideally, the ratio between data and total ink should be close to 1.
The most egregious use of non-data ink is often referred to as chartjunk.
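
As a rough illustration of the data-ink idea, the sketch below draws the same scatterplot twice with ggplot2: once with the default theme and once with some non-data ink removed. The built-in `mtcars` dataset and the specific theme settings are illustrative choices, not a prescribed recipe.

```r
library(ggplot2)

# Default theme: grey panel background plus major and minor grid lines
default_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# Same data with less non-data ink: no grey panel, no minor grid lines
lean_plot <- default_plot +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())

print(default_plot)
print(lean_plot)
```

Both plots carry the same information; only the amount of ink that could be erased without losing data differs.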
How does this choice help someone understand the data?
Contrast
Clarity
Highlighting
Messaging
Some design choices affect some audiences more than others
Color palette
Text and symbol sizes
A framework (per Wilkinson et al. 2005; Wickham 2010) used to describe the components of a data visualization in terms of a set of layered objects
Component | Description | Example |
---|---|---|
Statistics | Statistical transformations or summaries of data | mean, log transformation, smoothing spline |
Facets | Divisions in data used for multi-plotting | side-by-side plot, 2 x 2 plot |
Coordinates | Space used for plotting values | Cartesian 2D space, polar coordinate space |
Themes | Non-data ink | Font size, shading of background grid, location of tick marks |
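
A minimal sketch of how these components can map onto ggplot2 code follows. The `mpg` dataset (bundled with ggplot2) and the particular layers are illustrative assumptions; a fitted linear trend stands in for the statistical summary.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +                 # data and mappings
  geom_point() +                                            # points for each car
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # statistics: linear fit
  facet_wrap(~ drv) +                                       # facets: one panel per drive type
  coord_cartesian() +                                       # coordinates: Cartesian 2D space
  theme_minimal(base_size = 12)                             # themes: non-data ink

print(p)
```

Each `+` adds one layer or component, so any line after the first can be swapped out or removed without rewriting the rest of the plot.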
Use the `as_tibble` function to convert the `faithful` dataset from a data frame to a tibble
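
One possible approach, assuming the tibble package is installed (the name `faithful_tbl` is just an illustrative choice):

```r
library(tibble)

# `faithful` is a built-in data frame with columns `eruptions` and `waiting`
class(faithful)

# Convert it to a tibble
faithful_tbl <- as_tibble(faithful)
class(faithful_tbl)

# Printing a tibble shows only the first rows, along with column types
faithful_tbl
```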
Using ggplot2, plot the `faithful` dataset as a scatterplot, with waiting time as the x variable
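
A sketch of one way to do this is below. It assumes eruption duration (`eruptions`, the only other column in `faithful`) goes on the y axis; the axis labels are illustrative.

```r
library(ggplot2)

# Scatterplot of the built-in faithful data: waiting time on x,
# eruption duration on y (assumed, as it is the only other column)
faithful_plot <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point() +
  labs(x = "Waiting time to next eruption (min)",
       y = "Eruption duration (min)")

print(faithful_plot)
```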