1  A motivating example

Let’s say that the city of Charlotte, NC wants to protect critical habitats in its urban forests. A recent study has shown that the endangered Carolina northern flying squirrel has a preference for Willow Oak trees above a height of 25 meters. The city parks department works with local volunteers to identify these trees; however, taking accurate height measurements can be difficult and time-consuming. They would like to find an easier way to estimate height, so they turn to you, the trusty data scientist, to try and answer this question.

The question we want to answer is: can we use the relationship between tree height and some other measurement(s) to estimate, or model, tree height? So first, we would want to know whether a relationship exists between measurements.

First, let’s get some tree data into R:

library(tidyverse) 

treeData<-read_csv("data/TS3_Raw_tree_data.csv") 

treeDataNC<-treeData %>%
  filter(City=="Charlotte, NC") %>%
  filter(CommonName=="Willow oak")

treeDataNC
# A tibble: 49 × 41
   DbaseID Region City   Source TreeID Zone  `Park/Street` SpCode ScientificName
     <dbl> <chr>  <chr>  <chr>   <dbl> <chr> <chr>         <chr>  <chr>         
 1   11435 Piedmt Charl… CLTMa…  10225 F6    Street        QUPH   Quercus phell…
 2   11439 Piedmt Charl… CLTMa…  11053 F6    Street        QUPH   Quercus phell…
 3   11447 Piedmt Charl… CLTMa…  12222 F6    Street        QUPH   Quercus phell…
 4   11462 Piedmt Charl… CLTMa…  13521 F6    Street        QUPH   Quercus phell…
 5   11469 Piedmt Charl… CLTMa…  14232 F6    Street        QUPH   Quercus phell…
 6   11503 Piedmt Charl… CLTMa…  19141 F5    Street        QUPH   Quercus phell…
 7   11514 Piedmt Charl… CLTMa…  20011 F5    Street        QUPH   Quercus phell…
 8   11539 Piedmt Charl… CLTMa…  23048 F6    Street        QUPH   Quercus phell…
 9   11543 Piedmt Charl… CLTMa…  23565 E8    Street        QUPH   Quercus phell…
10   11570 Piedmt Charl… CLTMa…  27540 F8    Street        QUPH   Quercus phell…
# ℹ 39 more rows
# ℹ 32 more variables: CommonName <chr>, TreeType <chr>, address <chr>,
#   street <chr>, side <chr>, cell <dbl>, OnStreet <chr>, FromStreet <chr>,
#   ToStreet <chr>, Age <dbl>, `DBH (cm)` <dbl>, `TreeHt (m)` <dbl>,
#   CrnBase <dbl>, `CrnHt (m)` <dbl>, `CdiaPar (m)` <dbl>,
#   `CDiaPerp (m)` <dbl>, `AvgCdia (m)` <dbl>, `Leaf (m2)` <dbl>,
#   Setback <dbl>, TreeOr <dbl>, CarShade <dbl>, LandUse <dbl>, Shape <dbl>, …

Let’s start with crown base height (CrnBase in the data): this is the average distance from the ground to the bottom of the tree’s crown.

What we’d like to know is whether a linear relationship exists between crown base height and tree height. By linear, we mean that each unit increase in one variable (crown base height) coincides with a roughly constant increase in the variable of interest (tree height). Let’s see what the relationship between our variables looks like visually:

ggplot(treeDataNC,aes(x=CrnBase,y=`TreeHt (m)`)) +
  geom_point() +
  labs(x="Crown Base (m)",y="Tree Height (m)") +
  theme_light()

Notice that CrnBase is not surrounded by backticks, while TreeHt (m) is. The reason is that CrnBase is a valid R name with no non-standard characters like whitespace or parentheses, so backticks aren’t needed to refer to it in the aesthetic mapping; TreeHt (m), on the other hand, must be wrapped in backticks.
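
For example, both of the lines below refer to columns in the same tibble, but only the non-syntactic name needs backticks:

#a syntactic column name can be used as-is
treeDataNC$CrnBase

#a name containing spaces or parentheses must be wrapped in backticks
treeDataNC$`TreeHt (m)`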

OK, so what do we see here? Well, tree height does seem to increase with crown base height, but the relationship doesn’t look very strong. We can assess its strength with a correlation test:

#check pearson
cor.test(treeDataNC$CrnBase,treeDataNC$`TreeHt (m)`,method="pearson")

    Pearson's product-moment correlation

data:  treeDataNC$CrnBase and treeDataNC$`TreeHt (m)`
t = 5.3424, df = 47, p-value = 2.62e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4031591 0.7638432
sample estimates:
      cor 
0.6146721 

The cor value is 0.61, which suggests a positive relationship, but not an especially strong one. The p-value is very low, though, so we can reject the null hypothesis that there is no relationship. Of course, we didn’t check whether the data are normally distributed, which the Pearson test assumes. We can use shapiro.test to do that:

#check normality
shapiro.test(treeDataNC$`TreeHt (m)`)

    Shapiro-Wilk normality test

data:  treeDataNC$`TreeHt (m)`
W = 0.96854, p-value = 0.2117
shapiro.test(treeDataNC$CrnBase)

    Shapiro-Wilk normality test

data:  treeDataNC$CrnBase
W = 0.84429, p-value = 1.293e-05

Remembering back to Week 6, a p-value greater than our cut-off threshold (0.05) tells us that we can’t reject the null hypothesis that the data are normally distributed. This is true for tree height, but not so for the crown base height. Therefore, we should probably run our correlation test again, but this time using the Spearman method:

#check spearman
cor.test(treeDataNC$CrnBase,treeDataNC$`TreeHt (m)`,method="spearman")

    Spearman's rank correlation rho

data:  treeDataNC$CrnBase and treeDataNC$`TreeHt (m)`
S = 6156.7, p-value = 5.37e-08
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.6858807 

The result is pretty similar: we can reject the idea that there is no relationship, and the data indicate a positive one. The rho value is a little higher than the Pearson estimate, but not by much.

1.0.1 Building a model: single predictor regression

Given the existence of a relationship, we could ask: how well can we predict tree height from crown base height? A linear model can help us estimate this. If you think back to our scatter plot, what a linear model does is draw a straight line through the data that minimizes the (vertical, squared) distances between the line and all of the points.
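
Here’s one way to draw that line over our earlier scatter plot. (The code for this figure isn’t shown in the original chunk; this sketch simply adds geom_smooth(method="lm") to the plot we made above, much like the DBH plot later in this section.)

ggplot(treeDataNC,aes(x=CrnBase,y=`TreeHt (m)`)) +
  geom_point() +
  #add a fitted linear regression line (a reconstruction of the figure shown here)
  geom_smooth(method="lm") +
  labs(x="Crown Base (m)",y="Tree Height (m)") +
  theme_light()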

`geom_smooth()` using formula = 'y ~ x'

If we find that the line fits the data well, which would be indicated by how closely the data fall along it, then we can use the line to provide us with predictions about data we have not observed. For example, looking at the plot above, the linear model would predict that a tree with a crown base height of 10 meters would have a total height of about 31 meters. Of course, the value of this prediction depends on how well the model fits the data.

To create our model, we’ll use the lm, or linear model, function:

treeModel<-lm(`TreeHt (m)`~ CrnBase,data=treeDataNC)

What have we done here? This code asks R to conduct a linear regression on these two variables and save the result as a linear model object called treeModel. At a minimum, lm needs an argument that comes in the form of a formula:

y~x

where y is the response variable (tree height) and x is a predictor (crown base height). If we just want to use column names like we have here, we also have to supply a data argument, which in this case is the treeDataNC tibble. We could get the same result without this argument by referring to the columns using the $ operator:

treeMod_cb<-lm(treeDataNC$`TreeHt (m)`~ treeDataNC$CrnBase)

So now our model is stored as treeMod_cb. We can get a quick look at the results by using the summary function:

summary(treeMod_cb)

Call:
lm(formula = treeDataNC$`TreeHt (m)` ~ treeDataNC$CrnBase)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.7496  -4.3383   0.9595   5.5553  11.1581 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         12.8348     1.7161   7.479 1.54e-09 ***
treeDataNC$CrnBase   1.8014     0.3372   5.342 2.62e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.591 on 47 degrees of freedom
Multiple R-squared:  0.3778,    Adjusted R-squared:  0.3646 
F-statistic: 28.54 on 1 and 47 DF,  p-value: 2.62e-06

There are a few different pieces of information here; let’s go through the most relevant:

  • Residuals: Residuals are how far the observed data deviate from the line produced by the model. They are summarized here by their minimum, quartiles, and maximum (a five-number summary).

  • Coefficients: These indicate how much the response variable changes for each one-unit increase in a predictor; the intercept is the predicted response when the predictor is zero.

  • R-squared (R2): These are measures of the proportion of variance in the response (dependent) variable accounted for by the model. A value of 1 would indicate a perfect match between predictor and response.

    • Multiple R-squared is the raw R-squared; when more than one predictor is used, it reflects the variance explained by all of the predictors together.

    • Adjusted R-squared adjusts for the number of predictors, favoring models with fewer predictors. Under most circumstances, this is the preferred value to report.

  • F-statistic and p-value: This tests whether the model as a whole explains significantly more variance than an intercept-only model (one with no predictors).
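
We can also use the fitted model to check the prediction we eyeballed from the plot earlier. One way to do this (a sketch using the predict function on treeModel, the version fit with the data argument, so that it knows the column names):

#predicted tree height for a crown base height of 10 m
predict(treeModel,newdata=tibble(CrnBase=10))

From the coefficients above, this works out to roughly 12.8 + 1.8 × 10 ≈ 30.8 meters, in line with the ~31 meters we read off the plot.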

An adjusted R-squared value of 0.36 is not great. Let’s see if we can get a better match with diameter at breast height, or DBH:

ggplot(treeDataNC,aes(x=`DBH (cm)`,y=`TreeHt (m)`)) +
  geom_point() +
  geom_smooth(method="lm",se=FALSE) +
  labs(x="Diameter at Breast Height (cm)",y="Tree Height (m)") +
  theme_light()

Here you might notice that we added an argument, method="lm", to the geom_smooth function. This tells geom_smooth to draw the line from a linear model rather than from its default loess smoother. We also included the argument se=FALSE, which removes the standard-error ribbon around the line. This lets us see the fitted line by itself, just as it would come from the kind of lm formula we used above.

From this plot, it looks like there is a pretty good relationship between these two variables. However, we still want to assess the data using a correlation:

#check normality
shapiro.test(treeDataNC$`TreeHt (m)`)

    Shapiro-Wilk normality test

data:  treeDataNC$`TreeHt (m)`
W = 0.96854, p-value = 0.2117
shapiro.test(treeDataNC$`DBH (cm)`)

    Shapiro-Wilk normality test

data:  treeDataNC$`DBH (cm)`
W = 0.9492, p-value = 0.03429
#check spearman
cor.test(treeDataNC$`DBH (cm)`,treeDataNC$`TreeHt (m)`,method="spearman")
Warning in cor.test.default(treeDataNC$`DBH (cm)`, treeDataNC$`TreeHt (m)`, :
Cannot compute exact p-value with ties

    Spearman's rank correlation rho

data:  treeDataNC$`DBH (cm)` and treeDataNC$`TreeHt (m)`
S = 2571.2, p-value = 5.943e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
     rho 
0.868814 

The Spearman test again lets us reject the null hypothesis of no relationship (no surprise), and the rho value indicates a stronger positive relationship than we saw with crown base height (again, not surprising).

Now let’s see how this performs as a model:

treeMod_dbh<-lm(`TreeHt (m)`~`DBH (cm)`,data=treeDataNC)
summary(treeMod_dbh)

Call:
lm(formula = `TreeHt (m)` ~ `DBH (cm)`, data = treeDataNC)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.0051 -2.0312 -0.0645  2.0496 13.3311 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.24141    1.20525   6.838 1.44e-08 ***
`DBH (cm)`   0.19576    0.01667  11.740 1.41e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.214 on 47 degrees of freedom
Multiple R-squared:  0.7457,    Adjusted R-squared:  0.7403 
F-statistic: 137.8 on 1 and 47 DF,  p-value: 1.411e-15

This gives us a much improved adjusted R-squared (0.74 versus 0.36 for the crown base model), suggesting that more of the variability in tree height is explained by this model.
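
If you’d rather pull these values out of R directly instead of reading them off the printed output, the object returned by summary() stores them; for example (a quick sketch using its adj.r.squared element):

#extract and compare the adjusted R-squared values of the two models
summary(treeMod_cb)$adj.r.squared
summary(treeMod_dbh)$adj.r.squared

These return roughly 0.36 and 0.74, the same values reported in the two summaries above.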