Environmental Data Analysis and Visualization

Wrangling in the Tidyverse

A brief reminder

This course deals with the representation of data as a means of communicating about important issues. Our ability to consider a wide and topical range of data representations depends on respect for each other and a commitment to our shared motivation to learn.

Visualization Critique

nytimes.com

Visualization Critique

chartcipher.com

Visualization Critique

visualcapitalist.com

Visualization Critique

visualizingpalestine.org

Next week’s critiques

Mark
Jenny
Mia

Data

Data are individual pieces of information.

Datasets

Data in the field

Messy data

The process of data entry is often designed and conducted with limited consideration given to the process of data analysis.

https://openscapes.org/blog/2020-10-12-tidy-data/

Tidy data principles

Each variable forms a column

Each observation forms a row

Each cell contains a single value

After Wickham, H. 2014. Tidy data. Journal of Statistical Software 59(10).

Vectorization

Querying

Consistency

Ease of identification and re-use

Data wrangling: the big three

Column headers are values, not variable names.
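
A minimal sketch of this first case, assuming a small made-up table of site counts (the names counts_wide, site, and count are hypothetical, not from the course data). The year column headers are really values of a year variable, so pivot_longer() is one way to move them into a column:

```r
library(dplyr)
library(tidyr)

# Hypothetical field counts: the headers 2021 and 2022 are values, not variables
counts_wide <- tibble(
  site   = c("A", "B"),
  `2021` = c(12, 7),
  `2022` = c(15, 9)
)

# Move the year headers into a "year" column and the cells into "count"
counts_long <- counts_wide %>%
  pivot_longer(cols = c(`2021`, `2022`),
               names_to = "year",
               values_to = "count")

counts_long
```

Each row of counts_long is now one site-year observation, which matches the tidy data principles above.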

Data wrangling: the big three

Multiple variables are stored in a single column.
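
A minimal sketch of this second case, assuming a made-up id column that packs an island and a year into one string (samples, id, and mass_g are hypothetical names). separate() from tidyr is one way to split it; it is the counterpart of unite():

```r
library(dplyr)
library(tidyr)

# Hypothetical samples: "id" holds two variables, island and year
samples <- tibble(
  id     = c("Biscoe_2007", "Dream_2008"),
  mass_g = c(3750, 4100)
)

# Split on the underscore; convert = TRUE turns the year strings into integers
samples_tidy <- samples %>%
  separate(id, into = c("island", "year"), sep = "_", convert = TRUE)

samples_tidy
```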

Data wrangling: the big three

Variables are stored in both rows and columns.
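
A minimal sketch of this third case, assuming a made-up weather table (weather, station, element, d1, and d2 are hypothetical names). The day columns hold values of a day variable, while the element rows hold the names of two variables, so one possible fix lengthens the table first and then widens it:

```r
library(dplyr)
library(tidyr)

# Hypothetical readings: days are spread across columns, tmax/tmin across rows
weather <- tibble(
  station = c("S1", "S1"),
  element = c("tmax", "tmin"),
  d1      = c(25.1, 14.2),
  d2      = c(27.3, 15.0)
)

weather_tidy <- weather %>%
  # Step 1: day columns become values of a "day" variable
  pivot_longer(cols = c(d1, d2), names_to = "day", values_to = "value") %>%
  # Step 2: tmax and tmin become their own columns
  pivot_wider(names_from = element, values_from = value)

weather_tidy
```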

Pipes in R

The pipe operator %>% allows us to combine multiple functions into an ordered set of transformations.

Pipes in R

nycflights

# A tibble: 32,735 × 16
    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>  
 1  2013     6    30      940        15     1216        -4 VX      N626VA 
 2  2013     5     7     1657        -3     2104        10 DL      N3760C 
 3  2013    12     8      859        -1     1238        11 DL      N712TW 
 4  2013     5    14     1841        -4     2122       -34 DL      N914DL 
 5  2013     7    21     1102        -3     1230        -8 9E      N823AY 
 6  2013     1     1     1817        -3     2008         3 AA      N3AXAA 
 7  2013    12     9     1259        14     1617        22 WN      N218WN 
 8  2013     8    13     1920        85     2032        71 B6      N284JB 
 9  2013     9    26      725       -10     1027        -8 AA      N3FSAA 
10  2013     4    30     1323        62     1549        60 EV      N12163 
# ℹ 32,725 more rows
# ℹ 7 more variables: flight <int>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>

Pipes in R

nycflights %>%
  filter(month==1)

# A tibble: 2,610 × 16
    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>  
 1  2013     1     1     1817        -3     2008         3 AA      N3AXAA 
 2  2013     1    23     2024        37     2141        29 EV      N17115 
 3  2013     1    15     1626        -3     1941        10 B6      N594JB 
 4  2013     1    17      626        -4      846         3 US      N554UW 
 5  2013     1     8      902        -3     1006       -17 B6      N281JB 
 6  2013     1    15     1947       167     2241       171 AA      N5EGAA 
 7  2013     1     1     1454        -4     1554       -21 EV      N11544 
 8  2013     1    30     1306        -9     1430        -1 EV      N13969 
 9  2013     1     4     1942        -3     2249       -40 B6      N637JB 
10  2013     1     8     1859        -6     2158       -27 AA      N322AA 
# ℹ 2,600 more rows
# ℹ 7 more variables: flight <int>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>

Data pipelines

Pipelines start with raw data (e.g., a tibble) as the input.
The output of each step is passed as the first argument to the next function, so the data argument does not need to be written out.
The result of a pipeline is a dataset that can be fed into visualizations and analysis.
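
A minimal sketch of the first-argument rule, assuming a tiny made-up table (flights_demo is a hypothetical stand-in for nycflights). The piped call and the ordinary nested call should give the same result:

```r
library(dplyr)

# Hypothetical stand-in for a flights table
flights_demo <- tibble(
  month     = c(1, 1, 6),
  dep_delay = c(-3, 37, 15)
)

# Ordinary call: the data is written out as the first argument
nested <- filter(flights_demo, month == 1)

# Piped call: %>% supplies flights_demo as the first argument
piped <- flights_demo %>%
  filter(month == 1)

# Compare the two results
same_result <- isTRUE(all.equal(nested, piped))
same_result
```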

Data pipelines

nycflights %>%
  mutate(speed = distance / air_time * 60) %>%
  select(year:day,dep_time,carrier,flight,speed) %>%
  filter(month==1) %>%
  unite(col="Date",c(month,day,year),sep="/")

# A tibble: 2,610 × 5
   Date      dep_time carrier flight speed
   <chr>        <int> <chr>    <int> <dbl>
 1 1/1/2013      1817 AA         353  319.
 2 1/23/2013     2024 EV        4412  319.
 3 1/15/2013     1626 B6         369  414 
 4 1/17/2013      626 US        1433  311.
 5 1/8/2013       902 B6          56  340.
 6 1/15/2013     1947 AA         575  396.
 7 1/1/2013      1454 EV        4390  363.
 8 1/30/2013     1306 EV        4120  277.
 9 1/4/2013      1942 B6         645  466.
10 1/8/2013      1859 AA          21  441.
# ℹ 2,600 more rows

Activity: Going with the directional movement of a fluid

Working in pairs or small groups, use the different tidyverse functions to create a pipeline that wrangles raw data into the desired form.

Datasets will come from Canvas and the modeldata package.

Use the cheatsheets on Canvas for function descriptions.

Example 1

Body mass and flipper length for penguins from Biscoe and Dream islands with a body mass over 400 grams.

Example 2

Price, number of baths, and square meters for three-bedroom homes in the Roseville, Orangevale, and Citrus Heights neighborhoods of Sacramento.

Example 3

ID number, diameter in mm, and number of rings for the top 100 abalone by ring count.

Example 3

rawData<-read_csv("data/abalone.csv")

rawData %>%
  mutate(diameter_mm=diameter * 200) %>%
  select(...1,diameter_mm,rings) %>%
  slice_max(order_by=rings,n=100)

# A tibble: 136 × 3
    ...1 diameter_mm rings
   <dbl>       <dbl> <dbl>
 1   481         117    29
 2  2109         107    27
 3  2210          93    27
 4   295          99    26
 5  2202          98    25
 6  3150         108    24
 7  3281         108    24
 8   314          94    23
 9   315          97    23
10   502         104    23
# ℹ 126 more rows
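
The output above has 136 rows rather than 100 because slice_max() keeps tied values by default (with_ties = TRUE). A minimal sketch of the difference, assuming a small made-up table (scores is a hypothetical name):

```r
library(dplyr)

# Hypothetical ring counts with a three-way tie for second place
scores <- tibble(
  id    = 1:5,
  rings = c(10, 9, 9, 9, 8)
)

# Default behaviour: every row tied with the 2nd-largest value is kept
top_with_ties <- scores %>%
  slice_max(order_by = rings, n = 2)

# with_ties = FALSE returns exactly n rows
top_exact <- scores %>%
  slice_max(order_by = rings, n = 2, with_ties = FALSE)

nrow(top_with_ties)
nrow(top_exact)
```

Whether to keep ties is a judgment call: dropping them gives exactly n rows, but which of the tied rows survive depends on their order in the data.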

Can I use pipes with ggplot2?

YES!

rawData %>%
  mutate(diameter_mm=diameter * 200) %>%
  select(...1,diameter_mm,rings) %>%
  slice_max(order_by=rings,n=100) %>%
  ggplot(aes(x=diameter_mm,y=rings)) +
  geom_point() +
  labs(x="Diameter (mm)",y="Rings")

Can I use pipes with Base R?

YES!

rawData %>%
  mutate(wholeWeight=weight.whole * 200) %>%
  mutate(shuckedWeight=weight.shucked * 200) %>%
  select(wholeWeight,shuckedWeight) %>%
  colSums()

  wholeWeight shuckedWeight 
     692331.2      300215.6

Can I use pipes with Base R?

YES!*

rawData %>%
  mutate(wholeWeight=weight.whole * 200) %>%
  mutate(shuckedWeight=weight.shucked * 200) %>%
  select(wholeWeight,shuckedWeight) %>%
  sum(wholeWeight)

Error: object 'wholeWeight' not found

Can I use pipes with Base R?

YES!*

rawData %>%
  mutate(wholeWeight=weight.whole * 200) %>%
  mutate(shuckedWeight=weight.shucked * 200) %>%
  select(wholeWeight,shuckedWeight) %>%
  {sum(.$wholeWeight)}

[1] 692331.2

*Certain rules apply when the piped data is not the function's first argument, or when referencing column names.
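
A minimal sketch of these rules, assuming a small made-up table (weights, whole, and shucked are hypothetical names). It shows three common options: pull() to extract a column as a vector, braces with the . placeholder, and . to put the data somewhere other than the first argument:

```r
library(dplyr)

# Hypothetical weights table
weights <- tibble(
  whole   = c(1, 2, 3),
  shucked = c(0.4, 1.1, 1.4)
)

# Option 1: pull() extracts a column as a vector, which sum() accepts directly
total_pull <- weights %>%
  pull(whole) %>%
  sum()

# Option 2: braces stop the automatic first-argument insertion,
# so the data is only used where the . placeholder appears
total_brace <- weights %>%
  {sum(.$whole)}

# Option 3: . places the data in a named argument that is not the first one
model <- weights %>%
  lm(shucked ~ whole, data = .)

total_pull
total_brace
```

The pull() version avoids the placeholder entirely, which some readers may find easier to follow than the brace form.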

Can I use pipes for modeling data?

YES!

rawData %>%
  mutate(diameter_mm=diameter * 200) %>%
  select(...1,diameter_mm,rings) %>%
  lm(rings ~ diameter_mm, data=.) %>%
  summary()


Call:
lm(formula = rings ~ diameter_mm, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1868 -1.6932 -0.7200  0.9066 15.9999 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.318574   0.172737   13.42   <2e-16 ***
diameter_mm 0.093350   0.002057   45.37   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.639 on 4175 degrees of freedom
Multiple R-squared:  0.3302,    Adjusted R-squared:  0.3301 
F-statistic:  2059 on 1 and 4175 DF,  p-value: < 2.2e-16

Coursekeeping

Coding exercise 2 is now available on Canvas, due November 6!

the-decoder.com

Next week

Visualizing more relationships
Fine controls with ggplot2
Time is weird