Data wrangling includes operations aimed at manipulating data into a clearer or more useful form. This may mean restructuring data into forms that are usable as arguments in analytical functions, transforming values to facilitate visualization, creating polished data for use by the public, or simply clarifying the look and structure of the data for your own benefit. The kinds of operations that might be undertaken depend on what the data look like in their raw (received) state. However, some common operations have been built into the packages that make up the tidyverse, and we’ll explore a few of them here.
1.0.1 Introducing dplyr
The dplyr (dee-ply-er) package contains a set of tools designed principally for activities associated with data wrangling. These tools follow a sort of grammar, whereby the name of the function relates directly to what it does to the data (e.g., filter “filters”, select “selects”, etc.). This is followed by arguments that follow a general structure:
The first argument in a dplyr function is always a table (e.g., a tibble or data frame)
Additional arguments describe operations on columns in that table (with column names expressed without $ or quotes)
The result is always a table
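As a sketch of this pattern, here are two calls using columns from the tree inventory data we work with below (Age, City, and SpCode are real columns in that dataset; the specific criteria are just illustrations):

```r
library(dplyr)

# Each verb takes the table as its first argument and returns a table.
# Column names (Age, City, SpCode) are written bare, without $ or quotes.
filter(treeData, Age > 50)       # rows where Age exceeds 50
select(treeData, City, SpCode)   # only the City and SpCode columns
```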
There are a few reasons why you might want to use these functions. The first is code clarity: most of the operations we’ve done so far use operators specific to Base R (e.g., $), whereas these functions have more intuitive names and follow the same patterns across their usage. Another is that the dplyr tools have been designed specifically with data wrangling in mind, and so contain a number of options and integrations that allow for smoother transitions between data states and, ultimately, fewer lines of code.
To illustrate how these work, we’ll start by going through some of the functions used to create subsets of data.
1.0.2 Subsetting
Oftentimes, the data we receive may have more variables, observations, or values than we would like. Let’s look at the tree inventory dataset:
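A minimal sketch of loading and printing the inventory, assuming it has been read into a tibble named treeData (the file name here is a placeholder; substitute the path to your copy):

```r
library(readr)

# Placeholder file name -- replace with the path to your tree inventory file
treeData <- read_csv("tree_inventory.csv")
treeData
```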
Here, we have over 14,000 observations across 41 different variables. There are both practical and analytical reasons you might want to subset this dataset. For example, if you regularly refer to only a few columns, it can get frustrating to keep scrolling across to them or calling up the full list of column names. Or maybe you have been hired to work on a question about trees in a specific city, and the values from the others are superfluous to your needs. Let’s look at a few functions that can help us address these situations by subsetting the data into smaller pieces.
1.0.3 Filter
The filter function subsets the data according to a given set of criteria. So, carrying on from our last example, let’s say you were doing an analysis of street trees in Modesto, California:
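A minimal version of that call might look like this (in this dataset, the City values include the state abbreviation):

```r
library(dplyr)

# Keep only the observations recorded in Modesto
modestoTrees <- filter(treeData, City == "Modesto, CA")
```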
Quite clear, yes? The first argument is a table (the treeData tibble), and the second is criteria referring to the City column. The result is another table, but this one only containing observations from Modesto.
By itself, this works precisely the same as subsetting using the $ operator and square brackets:
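In Base R, an equivalent subset could be written with logical indexing inside square brackets:

```r
# Rows where City is "Modesto, CA", all columns
modestoTrees <- treeData[treeData$City == "Modesto, CA", ]
```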
So why bother with filter? This function enables us to make more complex subsets with less code, and the code that it produces is ultimately more readable. We can get a feel for this by asking for additional information. For example, let’s say we wanted to look at data from Modesto and Sacramento. With filter, it would look something like this:
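One way to express this with filter is the %in% operator, which matches against a vector of values:

```r
library(dplyr)

# Keep observations from either city
filter(treeData, City %in% c("Modesto, CA", "Sacramento, CA"))
```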
OK, let’s try something a little more complicated: what if we wanted to filter to all the maple trees in Modesto? Here, we can combine our dplyr code with a helper function from elsewhere in the tidyverse: the str_detect function from the stringr package. This will detect whether a text string we provide matches anywhere in another text string.
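A sketch of both versions, assuming maple trees can be identified by the word “maple” appearing in the CommonName column:

```r
library(dplyr)
library(stringr)

# tidyverse version: criteria separated by commas are combined with logical AND
filter(treeData, City == "Modesto, CA", str_detect(CommonName, "maple"))

# Base R version, for comparison (grepl plays the role of str_detect)
treeData[treeData$City == "Modesto, CA" &
           grepl("maple", treeData$CommonName), ]
```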
Compare those last two examples in terms of their readability. The code in the tidyverse version more or less follows a pattern consistent with what you are asking it to do (‘filter the data to instances where the city is Modesto and the common name includes “maple”’). The Base R version is a little less straightforward. It uses redundant references to the name of the dataset (treeData), multiple types of brackets, commas separating both arguments and table dimensions, and sometimes reverses the pattern of inputs (data, argument vs argument, data).
Try it yourself!
Try writing the following Base R sub-setting routines using the filter function:
It should be made clear, though, that anything you can do in the tidyverse can be done in Base R, and if you ultimately decide you are more comfortable with that syntax then there is no harm in using it. However, as you work on more complex analyses with messier data, the methods used in Base R will depend more and more on coding conventions that were developed without a data science workflow in mind. This creates opportunities for errors to creep in, such as misplaced commas and brackets, incorrect orders of arguments, etc. The tidyverse grammar offers a more straightforward path through the data science workflow. We’ll illustrate these concepts with a few additional functions below.
1.0.4 Select
The select function allows us to select particular columns to include in a smaller dataset. For example, let’s say that we are interested only in some of the measurements on the trees and less concerned with their location. We can use select by first giving the data table, followed by the names of the columns:
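Assuming we want the database ID plus the size measurements, the call might look like this (these are actual column names from the dataset):

```r
library(dplyr)

# Backticks are needed for column names containing spaces or parentheses
select(treeData, DbaseID, `DBH (cm)`, `TreeHt (m)`, `CrnHt (m)`)
```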
Note that here we are using backticks to indicate non-conforming column names (i.e., column names with whitespace or other symbols that are meaningful in R). Again, we could do this with square brackets by giving it the column names:
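A Base R sketch of the same subset, using a character vector of column names:

```r
# All rows, only the named columns
treeData[, c("DbaseID", "DBH (cm)", "TreeHt (m)", "CrnHt (m)")]
```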
Since this kind of subsetting requires us to give column names as character values, we don’t need the backticks. But here our argument is a vector of column names inside of square brackets.
The power of select lies in its ability to take additional arguments and interface with helper functions. For example, we can use the starts_with function to request only the columns containing DBH values:
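With starts_with, the request might look like this (by default the match is case-insensitive, so this picks up both the overall `DBH (cm)` column and any lower-case stem columns):

```r
library(dplyr)

# All columns whose names begin with "DBH" (or "dbh", "Dbh", ...)
select(treeData, starts_with("DBH"))
```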
We can add additional arguments to this call. For example, we can use ignore.case to not include the first column with DBH (the overall DBH measurement) where it is in all capitals and only keep the individual stem measurements that use lower case:
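Setting ignore.case = FALSE makes the match case-sensitive, so only the lower-case stem columns survive (this assumes the individual stem measurements use names beginning with a lower-case “dbh”):

```r
library(dplyr)

# Case-sensitive match: drops "DBH (cm)", keeps the lower-case stem columns
select(treeData, starts_with("dbh", ignore.case = FALSE))
```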
1.0.5 Slice
The slice function subsets rows by their position in the table (e.g., slice(treeData, 1:10) returns the first ten rows). It is actually one of a few slice-based functions that serve different purposes. For example, slice_sample allows us to take a random subset of rows from our data:
slice_sample(treeData,n=100)
# A tibble: 100 × 41
DbaseID Region City Source TreeID Zone `Park/Street` SpCode ScientificName
<dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 11298 GulfCo Charl… CHSMa… 28835 Q10 Street CELA Celtis laevig…
2 3205 SWDsrt Glend… GDLMa… 1703 -1 Park MOAL Morus alba
3 7846 NoEast Queen… JFKMa… 39761 409 Street ULAM Ulmus america…
4 8311 TpIntW Boise… BOIMa… 15538 5A Street CASP Catalpa speci…
5 1555 InlEmp Clare… CLMma… 15728 4 -1 GIBI Ginkgo biloba
6 17041 SacVal Sacra… SMFMa… 31003 East… Street CEDE Cedrus deodara
7 8820 InterW Albuq… ABQMa… 463 Flig… Park PRCE Prunus cerasi…
8 11745 Piedmt Charl… CLTMa… 43273 G6 Street LA6 Lagerstroemia…
9 17261 SacVal Sacra… SMFMa… 80811 Cent… Street GIBI Ginkgo biloba
10 13761 PacfNW Longv… Longv… 10401 -1 Street ACSA2 Acer saccharum
# ℹ 90 more rows
# ℹ 32 more variables: CommonName <chr>, TreeType <chr>, address <chr>,
# street <chr>, side <chr>, cell <dbl>, OnStreet <chr>, FromStreet <chr>,
# ToStreet <chr>, Age <dbl>, `DBH (cm)` <dbl>, `TreeHt (m)` <dbl>,
# CrnBase <dbl>, `CrnHt (m)` <dbl>, `CdiaPar (m)` <dbl>,
# `CDiaPerp (m)` <dbl>, `AvgCdia (m)` <dbl>, `Leaf (m2)` <dbl>,
# Setback <dbl>, TreeOr <dbl>, CarShade <dbl>, LandUse <dbl>, Shape <dbl>, …
Or we can get the top fifty trees by height using slice_max (note that slice_max keeps ties by default, which is why the result below contains 55 rows rather than 50):
slice_max(treeData,`TreeHt (m)`,n=50)
# A tibble: 55 × 41
DbaseID Region City Source TreeID Zone `Park/Street` SpCode ScientificName
<dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 2665 NoCalC Berke… JBKMa… 29784 1 Park SESE Sequoia sempe…
2 13635 PacfNW Longv… Longv… 8919 -1 Park POTR2 Populus balsa…
3 2617 NoCalC Berke… JBKMa… 28535 2 Park EUGL Eucalyptus gl…
4 2636 NoCalC Berke… JBKMa… 28918 2 Park EUGL Eucalyptus gl…
5 11741 Piedmt Charl… CLTMa… 43077 G6 Street QURU Quercus rubra
6 2614 NoCalC Berke… JBKMa… 28524 2 Park EUGL Eucalyptus gl…
7 2649 NoCalC Berke… JBKMa… 29288 2 Park SESE Sequoia sempe…
8 2840 NoCalC Berke… JBKMa… 70598 1 Park EUGL Eucalyptus gl…
9 12252 Piedmt Charl… CLTMa… 125399 F6 Street QUAL Quercus alba
10 13658 PacfNW Longv… Longv… 9192 -1 Park POTR2 Populus balsa…
# ℹ 45 more rows
# ℹ 32 more variables: CommonName <chr>, TreeType <chr>, address <chr>,
# street <chr>, side <chr>, cell <dbl>, OnStreet <chr>, FromStreet <chr>,
# ToStreet <chr>, Age <dbl>, `DBH (cm)` <dbl>, `TreeHt (m)` <dbl>,
# CrnBase <dbl>, `CrnHt (m)` <dbl>, `CdiaPar (m)` <dbl>,
# `CDiaPerp (m)` <dbl>, `AvgCdia (m)` <dbl>, `Leaf (m2)` <dbl>,
# Setback <dbl>, TreeOr <dbl>, CarShade <dbl>, LandUse <dbl>, Shape <dbl>, …
1.0.6 Distinct
A last function worth mentioning for subsetting is the distinct function. As you might imagine, this pulls all unique values from a given column in a dataset.
distinct(treeData,City)
# A tibble: 17 × 1
City
<chr>
1 Modesto, CA
2 Santa Monica, CA
3 Claremont, CA
4 Berkeley, CA
5 Glendale, AZ
6 Fort Collins, CO
7 Minneapolis, MN
8 Indianapolis, IN
9 Queens, NY
10 Boise, ID
11 Albuquerque, NM
12 Honolulu, HI
13 Charleston, SC
14 Charlotte, NC
15 Orlando, FL
16 Longview, WA
17 Sacramento, CA
This can be used with multiple columns to get the unique combinations of variables present in the data:
distinct(treeData,City,CommonName)
# A tibble: 357 × 2
City CommonName
<chr> <chr>
1 Modesto, CA Silver maple
2 Modesto, CA European white birch
3 Modesto, CA Chinese hackberry
4 Modesto, CA Camphor tree
5 Modesto, CA Raywood ash
6 Modesto, CA Hesse ash
7 Modesto, CA Moraine ash
8 Modesto, CA Marshall ash
9 Modesto, CA Modesto ash
10 Modesto, CA Ginkgo
# ℹ 347 more rows
Try it yourself!
You can quickly get a complex subset by combining some of these functions. Using the tree data and the functions above, create the following datasets:
The database ID and all columns measured in meters for all trees in the Quercus genus
The database ID, Common Name, and DBH of the top 200 trees by leaf area in California
The database ID, scientific name, street name, and crown height for trees located on an “Elm Street”
Stretch activity
How might you visualize some of the datasets you just created? Add some visualizations of these datasets to your Quarto document.