2  Looking for Patterns

library(modeldata)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

When we start looking for patterns in the data, we’

2.0.1 Univariate patterns

2.0.1.1 Counts

--bar charts
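As a minimal sketch of the raw-data case: when each row is one observation, geom_bar() does the counting itself, so only an x aesthetic is needed. Using the Sacramento housing data from modeldata and its type column here is an assumption made for illustration.

```r
library(modeldata)  # provides the Sacramento housing data
library(ggplot2)

# Each row of Sacramento is one home sale; geom_bar() counts
# the rows that fall into each level of type
type_bars <- ggplot(data = Sacramento, aes(x = type)) +
  geom_bar()
type_bars
```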

What if the counts are already in the data?

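# HairEyeColor is a contingency table; as_tibble() turns it into one row
# per Hair/Eye/Sex combination, with the count in column n
hairEye <- as_tibble(HairEyeColor)
# geom_col() uses the y values as given instead of counting rows
ggplot(data = hairEye, aes(x = Hair, y = n, fill = Eye)) +
  geom_col()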

2.0.1.2 Ranks

2.0.1.3 Distributions

-Center

--Mean

--Median

-Spread: Spread refers to how

--Range

--Standard Deviation
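The measures of center and spread listed so far can all be computed in one summarise() call. This is only a sketch: the price column of the Sacramento data (from modeldata, loaded at the top of this chapter) is an assumed example, and the name price_summary is made up for illustration.

```r
library(modeldata)
library(dplyr)

# One row of summary statistics for sale price
price_summary <- Sacramento %>%
  summarise(
    mean   = mean(price),    # center: arithmetic average
    median = median(price),  # center: middle value
    min    = min(price),     # range: lowest value
    max    = max(price),     # range: highest value
    sd     = sd(price)       # spread: standard deviation
  )
price_summary
```

Comparing the mean with the median is a quick check on shape: in skewed data the two tend to drift apart.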

-Shape

--Normal

--Bimodal, multimodal

---Density plots
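A density plot can be sketched with geom_density(), which draws a smoothed estimate of the distribution; one peak suggests a single mode, and several peaks suggest a bimodal or multimodal shape. The Sacramento price column is again an assumed example, not a dataset this section prescribes.

```r
library(modeldata)
library(ggplot2)

# Smoothed (kernel density) estimate of the distribution of sale prices
price_density <- ggplot(data = Sacramento, aes(x = price)) +
  geom_density()
price_density
```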

TRY IT YOURSELF: Look at X dataset

2.0.2 Bivariate

2.0.2.1 Categorical and categorical

2.0.2.2

# Hair on the x-axis and Eye as the fill show two categorical variables at once
hairEye <- as_tibble(HairEyeColor)
ggplot(data = hairEye, aes(x = Hair, y = n, fill = Eye)) +
  geom_col()

2.0.2.3 Categorical and numerical

Like histograms, boxplots show the center, spread, and shape of distributions, but they are most useful when comparing more than one group.

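Median: the central line of the boxplot is the median of the group.

Interquartile range: The “box” part of the boxplot is an indication of the interquartile range. This is the range between the first and third quartiles, which contains the middle 50% of the values.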

Have a look at the Sacramento dataset:

Sacramento
# A tibble: 932 × 9
   city           zip     beds baths  sqft type        price latitude longitude
   <fct>          <fct>  <int> <dbl> <int> <fct>       <int>    <dbl>     <dbl>
 1 SACRAMENTO     z95838     2     1   836 Residential 59222     38.6     -121.
 2 SACRAMENTO     z95823     3     1  1167 Residential 68212     38.5     -121.
 3 SACRAMENTO     z95815     2     1   796 Residential 68880     38.6     -121.
 4 SACRAMENTO     z95815     2     1   852 Residential 69307     38.6     -121.
 5 SACRAMENTO     z95824     2     1   797 Residential 81900     38.5     -121.
 6 SACRAMENTO     z95841     3     1  1122 Condo       89921     38.7     -121.
 7 SACRAMENTO     z95842     3     2  1104 Residential 90895     38.7     -121.
 8 SACRAMENTO     z95820     3     1  1177 Residential 91002     38.5     -121.
 9 RANCHO_CORDOVA z95670     2     2   941 Condo       94905     38.6     -121.
10 RIO_LINDA      z95673     3     2  1146 Residential 98937     38.7     -121.
# ℹ 922 more rows
# One boxplot of sale price for each type of home
ggplot(data = Sacramento, aes(x = type, y = price)) +
  geom_boxplot()
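The quantities that a boxplot draws can also be computed directly, which helps when reading the plot. The sketch below assumes the default quantile() method; type_summary and its column names are made up for illustration.

```r
library(modeldata)
library(dplyr)

# Median and quartiles of price within each type of home
type_summary <- Sacramento %>%
  group_by(type) %>%
  summarise(
    q1     = quantile(price, 0.25, names = FALSE),  # bottom edge of the box
    median = median(price),                         # central line
    q3     = quantile(price, 0.75, names = FALSE),  # top edge of the box
    iqr    = IQR(price)                             # height of the box
  )
type_summary
```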

2.0.2.4 Two numerical

-Scatterplots

Form: Is

Direction: If increasing values on one axis correspond with increasing values on the other, then the direction is positive; if increasing values on one axis correspond with decreasing values on the other, then the direction is negative.

Strength:
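A scatterplot of two numerical columns can be sketched with geom_point(). The choice of sqft and price from the Sacramento data is an assumption for illustration. The Pearson correlation from cor() is one common numerical summary of direction (its sign) and of the strength of a linear relationship (its size); it says little about curved forms.

```r
library(modeldata)
library(ggplot2)

# Each point is one home: living area on x, sale price on y
sqft_price <- ggplot(data = Sacramento, aes(x = sqft, y = price)) +
  geom_point()
sqft_price

# Pearson correlation, always between -1 and 1
sqft_price_cor <- cor(Sacramento$sqft, Sacramento$price, use = "complete.obs")
sqft_price_cor
```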

Try it yourself!

2.0.3 Multivariate

With more than two variables,

Does my data show signs of clustering?

Does a pattern exist for some categories but not others?
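One way to explore these questions is to map a third variable to color and to split the plot into one panel per category. This is a sketch under assumptions: the variables (sqft, price, and type from the Sacramento data) and the alpha value are chosen only for illustration.

```r
library(modeldata)
library(ggplot2)

# Color and facet by type so that each category's pattern can be compared
by_type <- ggplot(data = Sacramento, aes(x = sqft, y = price, color = type)) +
  geom_point(alpha = 0.5) +  # semi-transparent points reduce overplotting
  facet_wrap(~ type)         # one panel per type of home
by_type
```

Clusters would appear as separate clouds of points within a panel, and a relationship that holds in one panel but not in another suggests that the pattern depends on the category.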