library(tidyverse)
Week 8 Lab: Cleaning, Tidying, and Wrangling Data
Introduction
Datasets are often assembled in ways that facilitate data collection, with limited consideration of how they will be analyzed. This is especially true when data are reused from published sources and the new use falls outside the scope of what the data authors intended. There is a saying, largely anecdotal but probably accurate, that data scientists spend 80% of their time cleaning and organizing data.
The combined processes of data cleaning, organizing, and transforming are often subsumed under the term data wrangling. The definitions of the term “wrangle” are:
have a long and complicated dispute
round up, herd, or take charge of (livestock)
Both of these definitions are strangely appropriate. Working with data can often feel like a discussion, sometimes a heated one, in which you are trying to convince the data to yield to a particular form that allows easier integration into a data workflow. It can also sometimes feel as though you are tending to a herd of animals (cats come to mind), trying to create order from chaos.
We have already done some of this in previous lectures. For example, we have seen how to add transformed columns to a table using the $ and <- operators; we have seen how to subset cases using square brackets [] and the subset function; and we have used exploratory data analysis to identify suspect and outlier values. However, because wrangling makes up so much of what we do as data scientists, specialized tools have been developed to deal with it, and these are available to us through the tidyverse family of packages. In this lab, you will learn to:
subset, reorder, and transform data using tools from the dplyr package
create tidy datasets using the pivot functions from the tidyr package
use helper functions from packages like tidyselect and stringr
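As a quick refresher on the base R operations mentioned above, and a first look at how dplyr expresses the same steps, here is a minimal sketch. It assumes the built-in mtcars dataset purely for illustration; the kpl column (kilometres per litre) and the cutoff of 10 are made-up choices, not part of the lab data.

```r
library(dplyr)  # installed and loaded as part of the tidyverse

# Base R: add a transformed column with $ and <-
cars <- mtcars
cars$kpl <- cars$mpg * 0.425144  # miles per US gallon -> km per litre

# Base R: subset cases with square brackets, or with subset()
efficient_brackets <- cars[cars$kpl > 10, ]
efficient_subset   <- subset(cars, kpl > 10)

# dplyr: the same two steps, chained with the pipe
efficient_dplyr <- mtcars %>%
  mutate(kpl = mpg * 0.425144) %>%  # add the transformed column
  filter(kpl > 10)                  # keep only rows that meet the condition
```

All three results should contain the same cars. The dplyr version names each step with a verb (mutate, filter), which tends to be easier to read as the chain of operations grows.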
To begin, I recommend creating a file system, opening a new Quarto document, and giving each of the pages in this lab its own heading (see the table of contents). And don’t forget to load the tidyverse package!