Week 6 Lab

Introduction

Datasets are often assembled in ways that facilitate data collection, but with limited consideration of how they will be analyzed. This is especially true when data is being re-used from published sources and the new use is outside the scope of what the data authors intended. This means that when data reaches us, it is often structured in ways that are poorly suited to analysis. There is a saying, largely anecdotal but probably accurate, that data scientists spend 80% of their time cleaning and organizing data.

The combined processes of data cleaning, organizing, and transforming are often subsumed under the term data wrangling. The definitions of the term wrangle are:

  • have a long and complicated dispute

  • round up, herd, or take charge of (livestock)

Both of these definitions are strangely appropriate. Working with data can often feel like a discussion, sometimes a heated one, in which you are trying to convince the data to yield to a particular form that better suits your analysis. It can also sometimes feel as though you are tending to a herd (cats come to mind), trying to create order from chaos.

We have already done some of this in previous lectures. For example, we have seen how we can add transformed columns to a table using the $ and <- operators, we have seen how we can subset cases using square brackets [], and we have used exploratory data analysis to identify suspect and outlier values. However, because wrangling makes up so much of what we do, specialized tools have been developed to deal with it, and these are available through the tidyverse family of packages. In this lab, you will learn to:

  • subset, reorder, and transform data using tools from the dplyr package

  • create tidy datasets using the pivot functions from the tidyr package
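As a quick reminder of the base R tools mentioned above, here is a minimal sketch of adding a transformed column with $ and <-, subsetting cases with square brackets, and glancing at a summary to spot a suspect value. The plots table and its column names are hypothetical, invented purely for illustration; they are not part of the lab data.

```r
# Hypothetical toy data, invented for illustration only
plots <- data.frame(
  site = c("A", "B", "C", "D"),
  height_cm = c(120, 95, 2500, 110)
)

# Add a transformed column using $ and <-
plots$height_m <- plots$height_cm / 100

# Subset cases (rows) using square brackets:
# keep only the rows where height exceeds 1 m
tall <- plots[plots$height_m > 1, ]

# A numeric summary is one simple way to notice suspect values;
# here the 2500 cm entry sits far from the rest
summary(plots$height_cm)
```

This works, but the repeated plots$ prefixes get tedious as the steps pile up, which is part of the motivation for the specialized tools covered in this lab.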

To begin, I recommend creating a file system, opening a new Quarto document, and giving each of the pages in this lab its own heading. And don’t forget to load the tidyverse package!

library(tidyverse)
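To give a feel for where the lab is headed, here is a rough preview of the two learning goals: subsetting, transforming, and reordering with dplyr, and reshaping with a tidyr pivot function. Treat it as a sketch rather than the way the lab itself will proceed. The counts table, its column names, and its values are hypothetical and made up for this example. The block loads dplyr and tidyr directly so that it runs on its own; both are also attached by library(tidyverse).

```r
library(dplyr)
library(tidyr)

# Hypothetical wide-format data, invented for illustration only:
# one row per site, one column per year
counts <- tibble(
  site  = c("A", "B", "C"),
  y2020 = c(12, 7, 30),
  y2021 = c(15, 9, 28)
)

# dplyr: subset rows, add a transformed column, then reorder
busy <- counts %>%
  filter(y2020 > 10) %>%            # keep sites with more than 10 in 2020
  mutate(change = y2021 - y2020) %>% # year-to-year difference
  arrange(desc(change))              # largest increase first

# tidyr: pivot from wide to long, giving one row per site-year
long <- counts %>%
  pivot_longer(
    cols = c(y2020, y2021),
    names_to = "year",
    values_to = "count"
  )
```

Each step takes a table and returns a table, so the steps can be chained with the pipe and read top to bottom in the order they are applied.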