1  Getting started with wrangling

Data wrangling includes operations aimed at manipulating data into a clearer or more useful form. This may be for restructuring data into forms that are usable as arguments in analytical functions, transforming values to facilitate visualizations, creating polished data for use by the public, or even just to clarify the look and structure of the data for your own benefit. The kinds of operations that might be undertaken depend on what the data looks like in its raw (received) state. However, there are some common operations that have been built into the packages that make up the tidyverse, and we’ll explore some here.

Introducing dplyr

The dplyr (dee-ply-er) package contains a set of tools designed principally for activities associated with data wrangling. These tools follow a sort of grammar, whereby the name of the function relates directly to what it does to the data (e.g., filter “filters”, select “selects”, etc.). The function name is followed by arguments that follow a general structure:

  • The first argument in a dyplr function is always a table (e.g., tibble, data frame)

  • Further arguments describe operations on columns in that table (with column names that are expressed without $ or quotes)

  • The result is always a table

There’s a few reasons why you might want to use these functions. The first is code clarity: most of the operations we’ve done so far use operators specific to Base R (e.g., $), whereas these have more intuitive names and follow the same patterns across their usage (helping with code readability). Another is that the dplyr tools have been designed specifically with data wrangling in mind, and so contain a number of options and integration that allow for smoother transitions between data states and, ultimately, fewer lines of code.

To illustrate how these work, we’ll start by going through some of the functions used to create subsets of data.

Subsetting

Oftentimes, the data we receive may have more variables, observations, or values than we would like. Let’s look at the tree inventory dataset:

treeData<-read_csv("data/TS3_Raw_tree_data.csv")
treeData
# A tibble: 14,487 × 41
   DbaseID Region City   Source TreeID Zone  `Park/Street` SpCode ScientificName
     <dbl> <chr>  <chr>  <chr>   <dbl> <chr> <chr>         <chr>  <chr>         
 1       1 InlVal Modes… Motow…      1 Nurs… Nursery       ACSA1  Acer sacchari…
 2       2 InlVal Modes… Motow…      2 Nurs… Nursery       BEPE   Betula pendula
 3       3 InlVal Modes… Motow…      3 Nurs… Nursery       CESI4  Celtis sinens…
 4       4 InlVal Modes… Motow…      4 Nurs… Nursery       CICA   Cinnamomum ca…
 5       5 InlVal Modes… Motow…      5 Nurs… Nursery       FRAN_R Fraxinus angu…
 6       6 InlVal Modes… Motow…      6 Nurs… Nursery       FREX_H Fraxinus exce…
 7       7 InlVal Modes… Motow…      7 Nurs… Nursery       FRHO   Fraxinus holo…
 8       8 InlVal Modes… Motow…      8 Nurs… Nursery       FRPE_M Fraxinus penn…
 9       9 InlVal Modes… Motow…      9 Nurs… Nursery       FRVE_G Fraxinus velu…
10      10 InlVal Modes… Motow…     10 Nurs… Nursery       GIBI   Ginkgo biloba 
# ℹ 14,477 more rows
# ℹ 32 more variables: CommonName <chr>, TreeType <chr>, address <chr>,
#   street <chr>, side <chr>, cell <dbl>, OnStreet <chr>, FromStreet <chr>,
#   ToStreet <chr>, Age <dbl>, `DBH (cm)` <dbl>, `TreeHt (m)` <dbl>,
#   CrnBase <dbl>, `CrnHt (m)` <dbl>, `CdiaPar (m)` <dbl>,
#   `CDiaPerp (m)` <dbl>, `AvgCdia (m)` <dbl>, `Leaf (m2)` <dbl>,
#   Setback <dbl>, TreeOr <dbl>, CarShade <dbl>, LandUse <dbl>, Shape <dbl>, …

Here, we have over 14,000 observations across 41 different variables. There are both practical and analytic reasons you might want to subset this dataset. For example, if you have to regularly refer to a few columns, it may get frustrating to have to keep scrolling across them or calling for a list of column names. Or maybe you are being hired to work on a question about trees in a specific city, and the values from the others are superfluous to your needs. Let’s look at a few functions that can help us address these by subsetting the data into smaller packages.

Filter

The filter function subsets the data according to a given set of criteria. So, carrying on from our last example, let’s say you were doing an analysis of street trees in Modesto, California:

filter(treeData,City=="Modesto, CA")
# A tibble: 634 × 41
   DbaseID Region City   Source TreeID Zone  `Park/Street` SpCode ScientificName
     <dbl> <chr>  <chr>  <chr>   <dbl> <chr> <chr>         <chr>  <chr>         
 1       1 InlVal Modes… Motow…      1 Nurs… Nursery       ACSA1  Acer sacchari…
 2       2 InlVal Modes… Motow…      2 Nurs… Nursery       BEPE   Betula pendula
 3       3 InlVal Modes… Motow…      3 Nurs… Nursery       CESI4  Celtis sinens…
 4       4 InlVal Modes… Motow…      4 Nurs… Nursery       CICA   Cinnamomum ca…
 5       5 InlVal Modes… Motow…      5 Nurs… Nursery       FRAN_R Fraxinus angu…
 6       6 InlVal Modes… Motow…      6 Nurs… Nursery       FREX_H Fraxinus exce…
 7       7 InlVal Modes… Motow…      7 Nurs… Nursery       FRHO   Fraxinus holo…
 8       8 InlVal Modes… Motow…      8 Nurs… Nursery       FRPE_M Fraxinus penn…
 9       9 InlVal Modes… Motow…      9 Nurs… Nursery       FRVE_G Fraxinus velu…
10      10 InlVal Modes… Motow…     10 Nurs… Nursery       GIBI   Ginkgo biloba 
# ℹ 624 more rows
# ℹ 32 more variables: CommonName <chr>, TreeType <chr>, address <chr>,
#   street <chr>, side <chr>, cell <dbl>, OnStreet <chr>, FromStreet <chr>,
#   ToStreet <chr>, Age <dbl>, `DBH (cm)` <dbl>, `TreeHt (m)` <dbl>,
#   CrnBase <dbl>, `CrnHt (m)` <dbl>, `CdiaPar (m)` <dbl>,
#   `CDiaPerp (m)` <dbl>, `AvgCdia (m)` <dbl>, `Leaf (m2)` <dbl>,
#   Setback <dbl>, TreeOr <dbl>, CarShade <dbl>, LandUse <dbl>, Shape <dbl>, …

Quite clear, yes? The first argument is a table (the treeData tibble), and the second is criteria referring to the City column. The result is another table, but this one only containing observations from Modesto.

By itself, this works precisely the same as selecting using the $ operator and square brackets:

treeData[treeData$City=="Modesto, CA",]
# A tibble: 634 × 41
   DbaseID Region City   Source TreeID Zone  `Park/Street` SpCode ScientificName
     <dbl> <chr>  <chr>  <chr>   <dbl> <chr> <chr>         <chr>  <chr>         
 1       1 InlVal Modes… Motow…      1 Nurs… Nursery       ACSA1  Acer sacchari…
 2       2 InlVal Modes… Motow…      2 Nurs… Nursery       BEPE   Betula pendula
 3       3 InlVal Modes… Motow…      3 Nurs… Nursery       CESI4  Celtis sinens…
 4       4 InlVal Modes… Motow…      4 Nurs… Nursery       CICA   Cinnamomum ca…
 5       5 InlVal Modes… Motow…      5 Nurs… Nursery       FRAN_R Fraxinus angu…
 6       6 InlVal Modes… Motow…      6 Nurs… Nursery       FREX_H Fraxinus exce…
 7       7 InlVal Modes… Motow…      7 Nurs… Nursery       FRHO   Fraxinus holo…
 8       8 InlVal Modes… Motow…      8 Nurs… Nursery       FRPE_M Fraxinus penn…
 9       9 InlVal Modes… Motow…      9 Nurs… Nursery       FRVE_G Fraxinus velu…
10      10 InlVal Modes… Motow…     10 Nurs… Nursery       GIBI   Ginkgo biloba 
# ℹ 624 more rows
# ℹ 32 more variables: CommonName <chr>, TreeType <chr>, address <chr>,
#   street <chr>, side <chr>, cell <dbl>, OnStreet <chr>, FromStreet <chr>,
#   ToStreet <chr>, Age <dbl>, `DBH (cm)` <dbl>, `TreeHt (m)` <dbl>,
#   CrnBase <dbl>, `CrnHt (m)` <dbl>, `CdiaPar (m)` <dbl>,
#   `CDiaPerp (m)` <dbl>, `AvgCdia (m)` <dbl>, `Leaf (m2)` <dbl>,
#   Setback <dbl>, TreeOr <dbl>, CarShade <dbl>, LandUse <dbl>, Shape <dbl>, …

So why bother with filter? Why not just stick with square brackets? This function enables us to make more complex subsets with less code, and the code that it produces is ultimately more readable. We can get a feel for this by asking for additional information. For example, let’s say we wanted to look at data from Modesto and Sacramento. With filter, it would look something like this:

filter(treeData,City %in% c("Modesto, CA","Sacramento, CA"))
# A tibble: 1,635 × 41
   DbaseID Region City   Source TreeID Zone  `Park/Street` SpCode ScientificName
     <dbl> <chr>  <chr>  <chr>   <dbl> <chr> <chr>         <chr>  <chr>         
 1       1 InlVal Modes… Motow…      1 Nurs… Nursery       ACSA1  Acer sacchari…
 2       2 InlVal Modes… Motow…      2 Nurs… Nursery       BEPE   Betula pendula
 3       3 InlVal Modes… Motow…      3 Nurs… Nursery       CESI4  Celtis sinens…
 4       4 InlVal Modes… Motow…      4 Nurs… Nursery       CICA   Cinnamomum ca…
 5       5 InlVal Modes… Motow…      5 Nurs… Nursery       FRAN_R Fraxinus angu…
 6       6 InlVal Modes… Motow…      6 Nurs… Nursery       FREX_H Fraxinus exce…
 7       7 InlVal Modes… Motow…      7 Nurs… Nursery       FRHO   Fraxinus holo…
 8       8 InlVal Modes… Motow…      8 Nurs… Nursery       FRPE_M Fraxinus penn…
 9       9 InlVal Modes… Motow…      9 Nurs… Nursery       FRVE_G Fraxinus velu…
10      10 InlVal Modes… Motow…     10 Nurs… Nursery       GIBI   Ginkgo biloba 
# ℹ 1,625 more rows
# ℹ 32 more variables: CommonName <chr>, TreeType <chr>, address <chr>,
#   street <chr>, side <chr>, cell <dbl>, OnStreet <chr>, FromStreet <chr>,
#   ToStreet <chr>, Age <dbl>, `DBH (cm)` <dbl>, `TreeHt (m)` <dbl>,
#   CrnBase <dbl>, `CrnHt (m)` <dbl>, `CdiaPar (m)` <dbl>,
#   `CDiaPerp (m)` <dbl>, `AvgCdia (m)` <dbl>, `Leaf (m2)` <dbl>,
#   Setback <dbl>, TreeOr <dbl>, CarShade <dbl>, LandUse <dbl>, Shape <dbl>, …

We’ve learned how to do this similarly in Base R:

treeData[treeData$City %in% c("Modesto, CA", "Sacramento, CA"),]
# A tibble: 1,635 × 41
   DbaseID Region City   Source TreeID Zone  `Park/Street` SpCode ScientificName
     <dbl> <chr>  <chr>  <chr>   <dbl> <chr> <chr>         <chr>  <chr>         
 1       1 InlVal Modes… Motow…      1 Nurs… Nursery       ACSA1  Acer sacchari…
 2       2 InlVal Modes… Motow…      2 Nurs… Nursery       BEPE   Betula pendula
 3       3 InlVal Modes… Motow…      3 Nurs… Nursery       CESI4  Celtis sinens…
 4       4 InlVal Modes… Motow…      4 Nurs… Nursery       CICA   Cinnamomum ca…
 5       5 InlVal Modes… Motow…      5 Nurs… Nursery       FRAN_R Fraxinus angu…
 6       6 InlVal Modes… Motow…      6 Nurs… Nursery       FREX_H Fraxinus exce…
 7       7 InlVal Modes… Motow…      7 Nurs… Nursery       FRHO   Fraxinus holo…
 8       8 InlVal Modes… Motow…      8 Nurs… Nursery       FRPE_M Fraxinus penn…
 9       9 InlVal Modes… Motow…      9 Nurs… Nursery       FRVE_G Fraxinus velu…
10      10 InlVal Modes… Motow…     10 Nurs… Nursery       GIBI   Ginkgo biloba 
# ℹ 1,625 more rows
# ℹ 32 more variables: CommonName <chr>, TreeType <chr>, address <chr>,
#   street <chr>, side <chr>, cell <dbl>, OnStreet <chr>, FromStreet <chr>,
#   ToStreet <chr>, Age <dbl>, `DBH (cm)` <dbl>, `TreeHt (m)` <dbl>,
#   CrnBase <dbl>, `CrnHt (m)` <dbl>, `CdiaPar (m)` <dbl>,
#   `CDiaPerp (m)` <dbl>, `AvgCdia (m)` <dbl>, `Leaf (m2)` <dbl>,
#   Setback <dbl>, TreeOr <dbl>, CarShade <dbl>, LandUse <dbl>, Shape <dbl>, …

This doesn’t seem so different; however, when you’re working with square brackets, you need to remember to include the comma at the end of the statement to indicate that you are subsetting row-wise. Woe be it to those who forget that precious comma.

OK, let’s try something a little more complicated: what if we wanted to filter to all the maple trees in Modesto? Here, we can combine our dplyr code with a helper function from elsewhere in the tidyverse: the str_detect function from the stringr package. This will detect whether a text string we provide matches anywhere in another text string.

modestoMaples<-filter(treeData,City=="Modesto, CA" & str_detect(CommonName,"maple"))
modestoMaples
# A tibble: 29 × 41
   DbaseID Region City   Source TreeID Zone  `Park/Street` SpCode ScientificName
     <dbl> <chr>  <chr>  <chr>   <dbl> <chr> <chr>         <chr>  <chr>         
 1       1 InlVal Modes… Motow…      1 Nurs… Nursery       ACSA1  Acer sacchari…
 2      43 InlVal Modes… Motow…   2961 -1    -1            ACSA1  Acer sacchari…
 3      96 InlVal Modes… Motow…   9733 -1    -1            ACSA1  Acer sacchari…
 4     306 InlVal Modes… Motow…  31877 -1    -1            ACSA1  Acer sacchari…
 5     309 InlVal Modes… Motow…  32147 -1    -1            ACSA1  Acer sacchari…
 6     310 InlVal Modes… Motow…  32193 -1    -1            ACSA1  Acer sacchari…
 7     314 InlVal Modes… Motow…  32333 -1    -1            ACSA1  Acer sacchari…
 8     318 InlVal Modes… Motow…  33269 -1    -1            ACSA1  Acer sacchari…
 9     324 InlVal Modes… Motow…  34001 -1    -1            ACSA1  Acer sacchari…
10     329 InlVal Modes… Motow…  35120 -1    -1            ACSA1  Acer sacchari…
# ℹ 19 more rows
# ℹ 32 more variables: CommonName <chr>, TreeType <chr>, address <chr>,
#   street <chr>, side <chr>, cell <dbl>, OnStreet <chr>, FromStreet <chr>,
#   ToStreet <chr>, Age <dbl>, `DBH (cm)` <dbl>, `TreeHt (m)` <dbl>,
#   CrnBase <dbl>, `CrnHt (m)` <dbl>, `CdiaPar (m)` <dbl>,
#   `CDiaPerp (m)` <dbl>, `AvgCdia (m)` <dbl>, `Leaf (m2)` <dbl>,
#   Setback <dbl>, TreeOr <dbl>, CarShade <dbl>, LandUse <dbl>, Shape <dbl>, …

Doing this in Base R involves using the grepl function.

modestoMaples<-treeData[treeData$City=="Modesto, CA" & grepl("maple",treeData$CommonName),]
modestoMaples
# A tibble: 29 × 41
   DbaseID Region City   Source TreeID Zone  `Park/Street` SpCode ScientificName
     <dbl> <chr>  <chr>  <chr>   <dbl> <chr> <chr>         <chr>  <chr>         
 1       1 InlVal Modes… Motow…      1 Nurs… Nursery       ACSA1  Acer sacchari…
 2      43 InlVal Modes… Motow…   2961 -1    -1            ACSA1  Acer sacchari…
 3      96 InlVal Modes… Motow…   9733 -1    -1            ACSA1  Acer sacchari…
 4     306 InlVal Modes… Motow…  31877 -1    -1            ACSA1  Acer sacchari…
 5     309 InlVal Modes… Motow…  32147 -1    -1            ACSA1  Acer sacchari…
 6     310 InlVal Modes… Motow…  32193 -1    -1            ACSA1  Acer sacchari…
 7     314 InlVal Modes… Motow…  32333 -1    -1            ACSA1  Acer sacchari…
 8     318 InlVal Modes… Motow…  33269 -1    -1            ACSA1  Acer sacchari…
 9     324 InlVal Modes… Motow…  34001 -1    -1            ACSA1  Acer sacchari…
10     329 InlVal Modes… Motow…  35120 -1    -1            ACSA1  Acer sacchari…
# ℹ 19 more rows
# ℹ 32 more variables: CommonName <chr>, TreeType <chr>, address <chr>,
#   street <chr>, side <chr>, cell <dbl>, OnStreet <chr>, FromStreet <chr>,
#   ToStreet <chr>, Age <dbl>, `DBH (cm)` <dbl>, `TreeHt (m)` <dbl>,
#   CrnBase <dbl>, `CrnHt (m)` <dbl>, `CdiaPar (m)` <dbl>,
#   `CDiaPerp (m)` <dbl>, `AvgCdia (m)` <dbl>, `Leaf (m2)` <dbl>,
#   Setback <dbl>, TreeOr <dbl>, CarShade <dbl>, LandUse <dbl>, Shape <dbl>, …

Compare those last two examples in terms of their readability. The code in the tidyverse version more or less follows a pattern consistent with what you are asking it to do (‘filter the data to instances where the city is Modesto and the common name includes “maple”’). The Base R version is a little less straightforward. It uses redundant references to the name of the dataset (treeData), multiple types of brackets, commas separating both arguments and table dimensions, and sometimes reverses the pattern of inputs (data, argument vs argument, data).

Try it yourself!

Try writing the following Base R sub-setting routines using the filter function:

treeData[treeData$CrnBase>5,]

treeData[treeData$`Park/Street`!="Park",]

treeData[treeData$ScientificName=="Pinus radiata" | treeData$ScientificName=="Sequoia sempervirens",]

It should be made clear, though, that anything you want to do in tidyverse can be done in Base R, and if you ultimately decide you are more comfortable with the Base R syntax then there is no harm using it. However, as you work on more complex analyses, and work with messier data, the methods used in Base R will be more and more dependent on coding conventions that were developed without a data science workflow in mind. This creates opportunities for errors to creep in, such as misplaced commas and brackets, incorrect orders of arguments, etc. The tidyverse grammar offers more straightforward path through the data science. We’ll illustrate these concepts with a few additional functions below.

Select

The select function allows us to select particular columns to include in a smaller dataset. For example, let’s say that we are interested only in some of the measurements on the trees and less concerned with their location. We can use select by first giving the data table, followed by the names of the columns:

treeMetrics<-select(treeData,DbaseID,ScientificName,`DBH (cm)`,`TreeHt (m)`,`Leaf (m2)`)
treeMetrics
# A tibble: 14,487 × 5
   DbaseID ScientificName                    `DBH (cm)` `TreeHt (m)` `Leaf (m2)`
     <dbl> <chr>                                  <dbl>        <dbl>       <dbl>
 1       1 Acer saccharinum                         2.5          2           2.5
 2       2 Betula pendula                           2.5          1.5         1.9
 3       3 Celtis sinensis                          2.5          1.8         2.2
 4       4 Cinnamomum camphora                      2.5          2           2  
 5       5 Fraxinus angustifolia 'Raywood'          2.5          2           2.2
 6       6 Fraxinus excelsior 'Hessei'              2.5          2           2.2
 7       7 Fraxinus holotricha                      2.5          2           2.2
 8       8 Fraxinus pennsylvanica 'Marshall'        2.5          2           2.2
 9       9 Fraxinus velutina 'Modesto'              2.5          2           2.1
10      10 Ginkgo biloba                            2.5          1.6         1.3
# ℹ 14,477 more rows

Note that here we are using the backticks to indicate non-conforming column names (i.e., column names with whitespace or other meaningful symbols in R). Again, we could do this with square brackets by giving it the column indices:

treeMetrics<-treeData[,c("DbaseID","ScientificName","DBH (cm)","TreeHt (m)","Leaf (m2)")]
treeMetrics
# A tibble: 14,487 × 5
   DbaseID ScientificName                    `DBH (cm)` `TreeHt (m)` `Leaf (m2)`
     <dbl> <chr>                                  <dbl>        <dbl>       <dbl>
 1       1 Acer saccharinum                         2.5          2           2.5
 2       2 Betula pendula                           2.5          1.5         1.9
 3       3 Celtis sinensis                          2.5          1.8         2.2
 4       4 Cinnamomum camphora                      2.5          2           2  
 5       5 Fraxinus angustifolia 'Raywood'          2.5          2           2.2
 6       6 Fraxinus excelsior 'Hessei'              2.5          2           2.2
 7       7 Fraxinus holotricha                      2.5          2           2.2
 8       8 Fraxinus pennsylvanica 'Marshall'        2.5          2           2.2
 9       9 Fraxinus velutina 'Modesto'              2.5          2           2.1
10      10 Ginkgo biloba                            2.5          1.6         1.3
# ℹ 14,477 more rows

Since this kind of subsetting requires us to give column names as character values, we don’t need the backticks. But here our argument is a vector of column names inside of square brackets.

The power of select lies in its ability to take additional arguments and interface with helper functions. For example, we can use the starts_with function to request only the columns containing DBH values:

allDBH<-select(treeData,DbaseID,starts_with("dbh"))
allDBH
# A tibble: 14,487 × 10
   DbaseID `DBH (cm)`  dbh1  dbh2  dbh3  dbh4  dbh5  dbh6  dbh7  dbh8
     <dbl>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1       1        2.5   2.5    -1    -1    -1    -1    -1    -1    -1
 2       2        2.5   2.5    -1    -1    -1    -1    -1    -1    -1
 3       3        2.5   2.5    -1    -1    -1    -1    -1    -1    -1
 4       4        2.5   2.5    -1    -1    -1    -1    -1    -1    -1
 5       5        2.5   2.5    -1    -1    -1    -1    -1    -1    -1
 6       6        2.5   2.5    -1    -1    -1    -1    -1    -1    -1
 7       7        2.5   2.5    -1    -1    -1    -1    -1    -1    -1
 8       8        2.5   2.5    -1    -1    -1    -1    -1    -1    -1
 9       9        2.5   2.5    -1    -1    -1    -1    -1    -1    -1
10      10        2.5   2.5    -1    -1    -1    -1    -1    -1    -1
# ℹ 14,477 more rows

The starts_with function actually comes from the tidyselect package, which is also loaded as part of tidyverse. We can also add additional arguments to this call to refine it. For example, we can use ignore.case to not include the first column with DBH (the overall DBH measurement) where it is in all capitals and only keep the individual stem measurements that use lower case:

allDBH<-select(treeData,DbaseID,starts_with("dbh",ignore.case = FALSE))
allDBH
# A tibble: 14,487 × 9
   DbaseID  dbh1  dbh2  dbh3  dbh4  dbh5  dbh6  dbh7  dbh8
     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1       1   2.5    -1    -1    -1    -1    -1    -1    -1
 2       2   2.5    -1    -1    -1    -1    -1    -1    -1
 3       3   2.5    -1    -1    -1    -1    -1    -1    -1
 4       4   2.5    -1    -1    -1    -1    -1    -1    -1
 5       5   2.5    -1    -1    -1    -1    -1    -1    -1
 6       6   2.5    -1    -1    -1    -1    -1    -1    -1
 7       7   2.5    -1    -1    -1    -1    -1    -1    -1
 8       8   2.5    -1    -1    -1    -1    -1    -1    -1
 9       9   2.5    -1    -1    -1    -1    -1    -1    -1
10      10   2.5    -1    -1    -1    -1    -1    -1    -1
# ℹ 14,477 more rows

There are several of these helper functions that work in this case:

#columns ending with the word "Name"
allNames<-select(treeData,ends_with("Name",ignore.case = TRUE))
allNames
# A tibble: 14,487 × 2
   ScientificName                    CommonName          
   <chr>                             <chr>               
 1 Acer saccharinum                  Silver maple        
 2 Betula pendula                    European white birch
 3 Celtis sinensis                   Chinese hackberry   
 4 Cinnamomum camphora               Camphor tree        
 5 Fraxinus angustifolia 'Raywood'   Raywood ash         
 6 Fraxinus excelsior 'Hessei'       Hesse ash           
 7 Fraxinus holotricha               Moraine ash         
 8 Fraxinus pennsylvanica 'Marshall' Marshall ash        
 9 Fraxinus velutina 'Modesto'       Modesto ash         
10 Ginkgo biloba                     Ginkgo              
# ℹ 14,477 more rows
#columns containing the word "Street"
allStreet<-select(treeData,DbaseID,contains("Street",ignore.case = FALSE))
allStreet
# A tibble: 14,487 × 5
   DbaseID `Park/Street` OnStreet FromStreet ToStreet
     <dbl> <chr>         <chr>    <chr>      <chr>   
 1       1 Nursery       -1       -1         -1      
 2       2 Nursery       -1       -1         -1      
 3       3 Nursery       -1       -1         -1      
 4       4 Nursery       -1       -1         -1      
 5       5 Nursery       -1       -1         -1      
 6       6 Nursery       -1       -1         -1      
 7       7 Nursery       -1       -1         -1      
 8       8 Nursery       -1       -1         -1      
 9       9 Nursery       -1       -1         -1      
10      10 Nursery       -1       -1         -1      
# ℹ 14,477 more rows

Finally, when selecting columns, you also have an opportunity to rename them as you’d prefer, assigning them by entering the new name in quotes (e.g., “LeafArea” and the :

treeMetrics<-select(treeData,"ID"=DbaseID,"Name"=ScientificName,"DBH"=`DBH (cm)`,"Height"=`TreeHt (m)`,"LeafArea"=`Leaf (m2)`)
treeMetrics
# A tibble: 14,487 × 5
      ID Name                                DBH Height LeafArea
   <dbl> <chr>                             <dbl>  <dbl>    <dbl>
 1     1 Acer saccharinum                    2.5    2        2.5
 2     2 Betula pendula                      2.5    1.5      1.9
 3     3 Celtis sinensis                     2.5    1.8      2.2
 4     4 Cinnamomum camphora                 2.5    2        2  
 5     5 Fraxinus angustifolia 'Raywood'     2.5    2        2.2
 6     6 Fraxinus excelsior 'Hessei'         2.5    2        2.2
 7     7 Fraxinus holotricha                 2.5    2        2.2
 8     8 Fraxinus pennsylvanica 'Marshall'   2.5    2        2.2
 9     9 Fraxinus velutina 'Modesto'         2.5    2        2.1
10    10 Ginkgo biloba                       2.5    1.6      1.3
# ℹ 14,477 more rows

If you only want to rename some columns without selecting a subset of them, you can use the rename function:

treeMetrics<-rename(treeMetrics,"DBH_cm"=DBH,"Height_m"=Height,"LeafArea_m2"=LeafArea)
treeMetrics
# A tibble: 14,487 × 5
      ID Name                              DBH_cm Height_m LeafArea_m2
   <dbl> <chr>                              <dbl>    <dbl>       <dbl>
 1     1 Acer saccharinum                     2.5      2           2.5
 2     2 Betula pendula                       2.5      1.5         1.9
 3     3 Celtis sinensis                      2.5      1.8         2.2
 4     4 Cinnamomum camphora                  2.5      2           2  
 5     5 Fraxinus angustifolia 'Raywood'      2.5      2           2.2
 6     6 Fraxinus excelsior 'Hessei'          2.5      2           2.2
 7     7 Fraxinus holotricha                  2.5      2           2.2
 8     8 Fraxinus pennsylvanica 'Marshall'    2.5      2           2.2
 9     9 Fraxinus velutina 'Modesto'          2.5      2           2.1
10    10 Ginkgo biloba                        2.5      1.6         1.3
# ℹ 14,477 more rows

Slice

The slice function is used to select a subset of rows using their indices (i.e., row numbers). For example, to get the first 100 rows, you could use:

slice(treeData,1:100)
# A tibble: 100 × 41
   DbaseID Region City   Source TreeID Zone  `Park/Street` SpCode ScientificName
     <dbl> <chr>  <chr>  <chr>   <dbl> <chr> <chr>         <chr>  <chr>         
 1       1 InlVal Modes… Motow…      1 Nurs… Nursery       ACSA1  Acer sacchari…
 2       2 InlVal Modes… Motow…      2 Nurs… Nursery       BEPE   Betula pendula
 3       3 InlVal Modes… Motow…      3 Nurs… Nursery       CESI4  Celtis sinens…
 4       4 InlVal Modes… Motow…      4 Nurs… Nursery       CICA   Cinnamomum ca…
 5       5 InlVal Modes… Motow…      5 Nurs… Nursery       FRAN_R Fraxinus angu…
 6       6 InlVal Modes… Motow…      6 Nurs… Nursery       FREX_H Fraxinus exce…
 7       7 InlVal Modes… Motow…      7 Nurs… Nursery       FRHO   Fraxinus holo…
 8       8 InlVal Modes… Motow…      8 Nurs… Nursery       FRPE_M Fraxinus penn…
 9       9 InlVal Modes… Motow…      9 Nurs… Nursery       FRVE_G Fraxinus velu…
10      10 InlVal Modes… Motow…     10 Nurs… Nursery       GIBI   Ginkgo biloba 
# ℹ 90 more rows
# ℹ 32 more variables: CommonName <chr>, TreeType <chr>, address <chr>,
#   street <chr>, side <chr>, cell <dbl>, OnStreet <chr>, FromStreet <chr>,
#   ToStreet <chr>, Age <dbl>, `DBH (cm)` <dbl>, `TreeHt (m)` <dbl>,
#   CrnBase <dbl>, `CrnHt (m)` <dbl>, `CdiaPar (m)` <dbl>,
#   `CDiaPerp (m)` <dbl>, `AvgCdia (m)` <dbl>, `Leaf (m2)` <dbl>,
#   Setback <dbl>, TreeOr <dbl>, CarShade <dbl>, LandUse <dbl>, Shape <dbl>, …

The slice function is actually one of a few different slice-based functions that serve different purposes. For example, the slice_sample allows us to take a random subset of rows from our data:

slice_sample(treeData,n=100)
# A tibble: 100 × 41
   DbaseID Region City   Source TreeID Zone  `Park/Street` SpCode ScientificName
     <dbl> <chr>  <chr>  <chr>   <dbl> <chr> <chr>         <chr>  <chr>         
 1    7493 NoEast Queen… JFKMa… 318442 411   Street        PLAC   Platanus x ac…
 2    9368 InterW Albuq… ABQMa…  13784 Grav… Park          GLTR   Gleditsia tri…
 3   13671 PacfNW Longv… Longv…   9295 -1    Park          FRLA   Fraxinus lati…
 4    1687 InlEmp Clare… CLMma…  18917 5     -1            SCMO   Schinus molle 
 5     996 SoCalC Santa… SMAMa… 100285 -1    -1            MEEX   Metrosideros …
 6   17691 SacVal Sacra… SMFMa…  44247 Sout… Street        QUAG1  Quercus agrif…
 7    8533 TpIntW Boise… BOIMa…  30983 7A    Street        FRAM   Fraxinus amer…
 8     594 InlVal Modes… Motow…  78936 -1    -1            FRPE_M Fraxinus penn…
 9    4359 NMtnPr Fort … FNLMa…    642 -1    Street        ULPU   Ulmus pumila  
10    5794 LoMidW India… INDMa…    325 E06   -1            FRAM   Fraxinus amer…
# ℹ 90 more rows
# ℹ 32 more variables: CommonName <chr>, TreeType <chr>, address <chr>,
#   street <chr>, side <chr>, cell <dbl>, OnStreet <chr>, FromStreet <chr>,
#   ToStreet <chr>, Age <dbl>, `DBH (cm)` <dbl>, `TreeHt (m)` <dbl>,
#   CrnBase <dbl>, `CrnHt (m)` <dbl>, `CdiaPar (m)` <dbl>,
#   `CDiaPerp (m)` <dbl>, `AvgCdia (m)` <dbl>, `Leaf (m2)` <dbl>,
#   Setback <dbl>, TreeOr <dbl>, CarShade <dbl>, LandUse <dbl>, Shape <dbl>, …

Or we can get the top fifty trees by height using slice_max:

slice_max(treeData,`TreeHt (m)`,n=50)
# A tibble: 55 × 41
   DbaseID Region City   Source TreeID Zone  `Park/Street` SpCode ScientificName
     <dbl> <chr>  <chr>  <chr>   <dbl> <chr> <chr>         <chr>  <chr>         
 1    2665 NoCalC Berke… JBKMa…  29784 1     Park          SESE   Sequoia sempe…
 2   13635 PacfNW Longv… Longv…   8919 -1    Park          POTR2  Populus balsa…
 3    2617 NoCalC Berke… JBKMa…  28535 2     Park          EUGL   Eucalyptus gl…
 4    2636 NoCalC Berke… JBKMa…  28918 2     Park          EUGL   Eucalyptus gl…
 5   11741 Piedmt Charl… CLTMa…  43077 G6    Street        QURU   Quercus rubra 
 6    2614 NoCalC Berke… JBKMa…  28524 2     Park          EUGL   Eucalyptus gl…
 7    2649 NoCalC Berke… JBKMa…  29288 2     Park          SESE   Sequoia sempe…
 8    2840 NoCalC Berke… JBKMa…  70598 1     Park          EUGL   Eucalyptus gl…
 9   12252 Piedmt Charl… CLTMa… 125399 F6    Street        QUAL   Quercus alba  
10   13658 PacfNW Longv… Longv…   9192 -1    Park          POTR2  Populus balsa…
# ℹ 45 more rows
# ℹ 32 more variables: CommonName <chr>, TreeType <chr>, address <chr>,
#   street <chr>, side <chr>, cell <dbl>, OnStreet <chr>, FromStreet <chr>,
#   ToStreet <chr>, Age <dbl>, `DBH (cm)` <dbl>, `TreeHt (m)` <dbl>,
#   CrnBase <dbl>, `CrnHt (m)` <dbl>, `CdiaPar (m)` <dbl>,
#   `CDiaPerp (m)` <dbl>, `AvgCdia (m)` <dbl>, `Leaf (m2)` <dbl>,
#   Setback <dbl>, TreeOr <dbl>, CarShade <dbl>, LandUse <dbl>, Shape <dbl>, …

Distinct

A last function worth mentioning for subsetting is the distinct function. As you might imagine, this pulls all unique values from a given column in a dataset.

distinct(treeData,City)
# A tibble: 17 × 1
   City            
   <chr>           
 1 Modesto, CA     
 2 Santa Monica, CA
 3 Claremont, CA   
 4 Berkeley, CA    
 5 Glendale, AZ    
 6 Fort Collins, CO
 7 Minneapolis, MN 
 8 Indianapolis, IN
 9 Queens, NY      
10 Boise, ID       
11 Albuquerque, NM 
12 Honolulu, HI    
13 Charleston, SC  
14 Charlotte, NC   
15 Orlando, FL     
16 Longview, WA    
17 Sacramento, CA  

This can be used with multiple columns to get the unique combinations of variables present in the data:

distinct(treeData,City,CommonName)
# A tibble: 357 × 2
   City        CommonName          
   <chr>       <chr>               
 1 Modesto, CA Silver maple        
 2 Modesto, CA European white birch
 3 Modesto, CA Chinese hackberry   
 4 Modesto, CA Camphor tree        
 5 Modesto, CA Raywood ash         
 6 Modesto, CA Hesse ash           
 7 Modesto, CA Moraine ash         
 8 Modesto, CA Marshall ash        
 9 Modesto, CA Modesto ash         
10 Modesto, CA Ginkgo              
# ℹ 347 more rows
Try it yourself!

You can quickly get a complex subset by combining some of these functions. Using the tree data and the functions above, create the following datasets:

  • The database ID and all columns measured in meters for all trees in the Quercus genus

  • The database ID, Common Name, and DBH of the top 200 trees by leaf area in California

  • The database ID, scientific name, street name, and crown height for trees on located on an “Elm Street”

Stretch activity

How might you visualize some the datasets you just created? Add some visualizations of these datasets to your Quarto document.