Create a project and file system for this week’s lecture
Download the middlesexEColi.csv dataset from Canvas to the appropriate place in your file system and read it into R as a tibble
Open a script, and using ggplot2, create a histogram of the results of these tests. The horizontal (x) axis is colony forming units per 100mL sample (cfu/100mL)
If there’s time, try adding the geom_vline function with the argument xintercept=235 to draw a vertical line at 235, and color it red.
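One possible shape for this exercise is sketched below. So that it runs on its own, it uses simulated stand-in values instead of the real file; in practice the tibble would come from reading middlesexEColi.csv (for example with readr::read_csv()). The column name `cfu` and the bin count are assumptions, not taken from the dataset.

```r
library(ggplot2)
library(tibble)

# Stand-in data so the sketch is self-contained. Replace this with the tibble
# read from middlesexEColi.csv. The column name `cfu` is hypothetical.
set.seed(1)
ecoli <- tibble(cfu = rlnorm(200, meanlog = 4, sdlog = 1))

p <- ggplot(ecoli, aes(x = cfu)) +
  geom_histogram(bins = 30) +                      # distribution of test results
  geom_vline(xintercept = 235, color = "red") +    # vertical reference line at 235
  labs(
    x = "Colony forming units per 100 mL (cfu/100mL)",
    y = "Number of samples"
  )

p
```

A vertical line is positioned by its x value, which is why the layer takes `xintercept`; a horizontal line would take `yintercept` instead.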
The Mystic River Water Quality Buoy!
A framework (per Wilkinson et al. 2005; Wickham 2010) used to describe the components of a data visualization in terms of a set of layered objects
Component | Description | Example |
---|---|---|
Statistics | Statistical transformations or summaries of data | mean, log transformation, smoothing spline |
Facets | Divisions in data used for multi-plotting | side-by-side plot, 2 x 2 plot |
Coordinates | Space used for plotting values | Cartesian 2D space, polar coordinate space |
Themes | Non-data ink | Font size, shading of background grid, location of tick marks |
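As a rough illustration of how the components in the table can appear as layers in ggplot2, the sketch below builds one plot from the `mpg` dataset that ships with ggplot2. The dataset, variables, and specific layer choices are assumptions made for the example only.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x) +  # Statistics: fitted linear summary
  facet_wrap(~ drv) +                            # Facets: one panel per drive type
  coord_cartesian() +                            # Coordinates: Cartesian 2D space
  theme_minimal(base_size = 12)                  # Themes: non-data ink

p
```

Each `+` adds one component, so any single line can be swapped out (for example `coord_polar()` for `coord_cartesian()`) without touching the others.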
Visualization helps us to identify patterns and structures in data that are not evident from tables or numerical summaries.
Marine Reservoir Correction data from calib.org
Mean x: 9
Mean y: 7.5
Pearson correlation coefficient (r): 0.816
Coefficient of determination (R²): 0.67
Also see: datasaurus package
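The summary values above match Anscombe's quartet, which ships with base R as `anscombe`. Assuming that is the dataset the slide refers to, this sketch computes the same statistics for each of the four x/y pairs so they can be compared side by side.

```r
# Summary statistics for each of the four x/y pairs in the built-in anscombe data
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(
    mean_x    = mean(x),
    mean_y    = mean(y),
    r         = cor(x, y),      # Pearson correlation coefficient
    r_squared = cor(x, y)^2     # coefficient of determination
  )
})
colnames(stats) <- paste0("set", 1:4)

round(stats, 2)
```

Plotting each pair (for instance with a faceted scatterplot) is what reveals how different the four sets are despite their near-identical summaries.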
Choose the right chart for the data
Maximize the data-to-ink ratio
Make deliberate design decisions
The type of visualization that should be used depends on the kind of information being conveyed.
Data-ink (per Tufte 1983) refers to ink (or pixels) that, if erased, would reduce the information being presented.
Ideally, the ratio between data and total ink should be close to 1.
The most egregious use of non-data ink is often referred to as chartjunk.
How does this choice help someone understand the data?
Contrast
Clarity
Highlighting
Messaging
While each graphical element can be modified individually, themes provide a way to modify the overall look of the “non-data ink”
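A minimal sketch of the difference, using the `mpg` dataset bundled with ggplot2 (an assumption for illustration): a complete theme changes the overall look in one call, and `theme()` then adjusts a single element on top of it.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# A complete theme restyles all of the non-data ink at once
p_themed <- base_plot + theme_minimal()

# An individual element can still be modified afterwards
p_tuned <- p_themed +
  theme(panel.grid.minor = element_blank())  # drop minor grid lines

p_tuned
```

Order matters here: a complete theme added after `theme()` would overwrite the individual tweak.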
Some design choices affect some audiences more than others
Color palette
Text and symbol sizes
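One reading of these two points is accessibility: some palettes are hard to distinguish for readers with color vision deficiency, and small text or symbols are hard to read on a projector. Under that assumption, the sketch below swaps in a viridis palette and enlarges text and points; the dataset and the specific sizes are illustrative choices only.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 2.5) +        # larger symbols
  scale_color_viridis_d() +       # discrete viridis palette
  theme_grey(base_size = 16)      # larger base text size

p
```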
Visualization is foremost about making data more understandable
Guidelines like maximizing data-ink and being deliberate about design help us make decisions that will facilitate this goal
The grammar of graphics helps us to make these decisions in an explicit way by connecting elements
The ggplot2 package provides a way for us to put that grammar to work inside of the data environment we’re creating in R
Visualization critiques begin next week
Coding assignment #1 is due on Thursday
Introducing data analysis
Finding a statistic for assessing your data
Visualizing stats