Data wrangling includes operations aimed at manipulating data into a clearer or more useful form. This may mean restructuring data into forms that are usable as arguments in analytical functions, transforming values to facilitate visualizations, creating polished data for use by the public, or simply clarifying the look and structure of the data for your own benefit. The kinds of operations that might be undertaken depend on what the data looks like in its raw (received) state. However, there are some common operations that have been built into the packages that make up the tidyverse, and we’ll explore some here.
Introducing dplyr
The dplyr (dee-ply-er) package contains a set of tools designed principally for activities associated with data wrangling. These tools follow a sort of grammar, whereby the name of the function relates directly to what it does to the data (e.g., filter “filters”, select “selects”, etc.). The function name is followed by arguments that follow a general structure (sketched just after this list):
The first argument in a dplyr function is always a table (e.g., tibble, data frame)
Further arguments describe operations on columns in that table (with column names that are expressed without $ or quotes)
The result is always a table
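For example, a minimal sketch of this pattern (myData and x are hypothetical stand-ins for a tibble and one of its columns):

filter(myData, x > 0)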
There are a few reasons why you might want to use these functions. The first is code clarity: most of the operations we’ve done so far use operators specific to Base R (e.g., $), whereas these functions have more intuitive names and follow the same patterns across their usage (helping with code readability). Another is that the dplyr tools have been designed specifically with data wrangling in mind, and so contain a number of options and integrations that allow for smoother transitions between data states and, ultimately, fewer lines of code.
To illustrate how these work, we’ll start by going through some of the functions used to create subsets of data.
Subsetting
Oftentimes, the data we receive may have more variables, observations, or values than we would like. Let’s look at the tree inventory dataset:
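Assuming the inventory has been read in as a tibble named treeData (the name we’ll use throughout), printing it gives a quick summary of its dimensions and columns:

treeData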
Here, we have over 14,000 observations across 41 different variables. There are both practical and analytic reasons you might want to subset this dataset. For example, if you regularly refer to only a few columns, it can get frustrating to keep scrolling across the table or calling up a list of column names. Or maybe you have been hired to work on a question about trees in one specific city, and the values from the others are superfluous to your needs. Let’s look at a few functions that can help us address these situations by subsetting the data into smaller pieces.
Filter
The filter function subsets the data according to a given set of criteria. So, carrying on from our last example, let’s say you were doing an analysis of street trees in Modesto, California:
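A sketch of that call (note that the City values in this dataset include the state abbreviation, e.g., “Modesto, CA”):

filter(treeData, City == "Modesto, CA")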
Quite clear, yes? The first argument is a table (the treeData tibble), and the second is a criterion referring to the City column. The result is another table, but this one contains only observations from Modesto.
By itself, this works precisely the same as selecting using the $ operator and square brackets:
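# Base R equivalent of the filter call above
treeData[treeData$City == "Modesto, CA", ]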
So why bother with filter? Why not just stick with square brackets? This function enables us to make more complex subsets with less code, and the code that it produces is ultimately more readable. We can get a feel for this by asking for additional information. For example, let’s say we wanted to look at data from Modesto and Sacramento. With filter, it would look something like this:
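filter(treeData, City == "Modesto, CA" | City == "Sacramento, CA")

And the same subset with square brackets:

treeData[treeData$City == "Modesto, CA" | treeData$City == "Sacramento, CA", ]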
This doesn’t seem so different; however, when you’re working with square brackets, you need to remember to include the comma at the end of the statement to indicate that you are subsetting row-wise. Woe betide those who forget that precious comma.
OK, let’s try something a little more complicated: what if we wanted to filter to all the maple trees in Modesto? Here, we can combine our dplyr code with a helper function from elsewhere in the tidyverse: the str_detect function from the stringr package. This will detect whether a text string we provide matches anywhere in another text string.
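A sketch of the tidyverse version (CommonName holds names like “Silver maple”, so a lower-case “maple” pattern will match):

filter(treeData, City == "Modesto, CA", str_detect(CommonName, "maple"))

And a Base R equivalent, using grepl for the pattern matching:

treeData[treeData$City == "Modesto, CA" & grepl("maple", treeData$CommonName), ]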
Compare those last two examples in terms of their readability. The code in the tidyverse version more or less follows a pattern consistent with what you are asking it to do (‘filter the data to instances where the city is Modesto and the common name includes “maple”’). The Base R version is a little less straightforward. It uses redundant references to the name of the dataset (treeData), multiple types of brackets, commas separating both arguments and table dimensions, and sometimes reverses the pattern of inputs (data, argument vs argument, data).
Try it yourself!
Try writing the following Base R subsetting routines using the filter function:
It should be made clear, though, that anything you want to do in the tidyverse can be done in Base R, and if you ultimately decide you are more comfortable with the Base R syntax then there is no harm in using it. However, as you work on more complex analyses with messier data, the Base R methods become more and more dependent on coding conventions that were developed without a data science workflow in mind. This creates opportunities for errors to creep in, such as misplaced commas and brackets, incorrect ordering of arguments, etc. The tidyverse grammar offers a more straightforward path through the data science workflow. We’ll illustrate these concepts with a few additional functions below.
Select
The select function allows us to select particular columns to include in a smaller dataset. For example, let’s say that we are interested only in some of the measurements on the trees and less concerned with their location. We can use select by first giving the data table, followed by the names of the columns:
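For instance, keeping just a few of the size measurements (these column names appear in the printout further below):

select(treeData, `DBH (cm)`, `TreeHt (m)`, `CrnHt (m)`)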
Note that here we are using backticks to indicate non-conforming column names (i.e., column names with whitespace or other symbols that carry meaning in R). Again, we could do this with square brackets by giving it a character vector of column names:
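treeData[, c("DBH (cm)", "TreeHt (m)", "CrnHt (m)")]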
Since this kind of subsetting requires us to give column names as character values, we don’t need the backticks. But here our argument is a vector of column names inside of square brackets.
The power of select lies in its ability to take additional arguments and interface with helper functions. For example, we can use the starts_with function to request only the columns containing DBH values:
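# by default, starts_with ignores case (ignore.case = TRUE), so this matches
# every column whose name begins with "dbh", upper- or lower-case
select(treeData, starts_with("DBH"))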
The starts_with function actually comes from the tidyselect package, which is also loaded as part of the tidyverse. We can add further arguments to this call to refine it. For example, we can set ignore.case = FALSE to exclude the overall DBH measurement (whose column name is in capitals) and keep only the individual stem measurements, which use lower case:
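# keep only columns whose names start with a lower-case "dbh"
# (the individual stem measurements described above)
select(treeData, starts_with("dbh", ignore.case = FALSE))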
Finally, when selecting columns, you also have the opportunity to rename them as you’d prefer, entering the new name in quotes followed by = and the original column (e.g., “LeafArea” = `Leaf (m2)`):
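# rename Leaf (m2) to LeafArea while selecting it
select(treeData, "LeafArea" = `Leaf (m2)`)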
Slice
The slice function is actually one of a few different slice-based functions that serve different purposes. For example, slice_sample allows us to take a random subset of rows from our data:
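# n = 10 is an arbitrary choice here; slice_sample draws rows at random
slice_sample(treeData, n = 10)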
Or we can get the top fifty trees by height using slice_max:
# slice_max keeps ties by default (with_ties = TRUE), which is why the
# result below has 55 rows rather than exactly 50
slice_max(treeData, `TreeHt (m)`, n = 50)
# A tibble: 55 × 41
DbaseID Region City Source TreeID Zone `Park/Street` SpCode ScientificName
<dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 2665 NoCalC Berke… JBKMa… 29784 1 Park SESE Sequoia sempe…
2 13635 PacfNW Longv… Longv… 8919 -1 Park POTR2 Populus balsa…
3 2617 NoCalC Berke… JBKMa… 28535 2 Park EUGL Eucalyptus gl…
4 2636 NoCalC Berke… JBKMa… 28918 2 Park EUGL Eucalyptus gl…
5 11741 Piedmt Charl… CLTMa… 43077 G6 Street QURU Quercus rubra
6 2614 NoCalC Berke… JBKMa… 28524 2 Park EUGL Eucalyptus gl…
7 2649 NoCalC Berke… JBKMa… 29288 2 Park SESE Sequoia sempe…
8 2840 NoCalC Berke… JBKMa… 70598 1 Park EUGL Eucalyptus gl…
9 12252 Piedmt Charl… CLTMa… 125399 F6 Street QUAL Quercus alba
10 13658 PacfNW Longv… Longv… 9192 -1 Park POTR2 Populus balsa…
# ℹ 45 more rows
# ℹ 32 more variables: CommonName <chr>, TreeType <chr>, address <chr>,
# street <chr>, side <chr>, cell <dbl>, OnStreet <chr>, FromStreet <chr>,
# ToStreet <chr>, Age <dbl>, `DBH (cm)` <dbl>, `TreeHt (m)` <dbl>,
# CrnBase <dbl>, `CrnHt (m)` <dbl>, `CdiaPar (m)` <dbl>,
# `CDiaPerp (m)` <dbl>, `AvgCdia (m)` <dbl>, `Leaf (m2)` <dbl>,
# Setback <dbl>, TreeOr <dbl>, CarShade <dbl>, LandUse <dbl>, Shape <dbl>, …
Distinct
A last function worth mentioning for subsetting is the distinct function. As you might imagine, this pulls all unique values from a given column in a dataset.
distinct(treeData, City)
# A tibble: 17 × 1
City
<chr>
1 Modesto, CA
2 Santa Monica, CA
3 Claremont, CA
4 Berkeley, CA
5 Glendale, AZ
6 Fort Collins, CO
7 Minneapolis, MN
8 Indianapolis, IN
9 Queens, NY
10 Boise, ID
11 Albuquerque, NM
12 Honolulu, HI
13 Charleston, SC
14 Charlotte, NC
15 Orlando, FL
16 Longview, WA
17 Sacramento, CA
This can be used with multiple columns to get the unique combinations of variables present in the data:
distinct(treeData, City, CommonName)
# A tibble: 357 × 2
City CommonName
<chr> <chr>
1 Modesto, CA Silver maple
2 Modesto, CA European white birch
3 Modesto, CA Chinese hackberry
4 Modesto, CA Camphor tree
5 Modesto, CA Raywood ash
6 Modesto, CA Hesse ash
7 Modesto, CA Moraine ash
8 Modesto, CA Marshall ash
9 Modesto, CA Modesto ash
10 Modesto, CA Ginkgo
# ℹ 347 more rows
Try it yourself!
You can quickly get a complex subset by combining some of these functions. Using the tree data and the functions above, create the following datasets:
The database ID and all columns measured in meters for all trees in the Quercus genus
The database ID, Common Name, and DBH of the top 200 trees by leaf area in California
The database ID, scientific name, street name, and crown height for trees located on an “Elm Street”
Stretch activity
How might you visualize some of the datasets you just created? Add some visualizations of these datasets to your Quarto document.