3  Visualizing distributions

Histograms

Suppose that, instead of categories, we want to look at the distribution of a set of numerical values. We’ve already explored these with histograms via the base R hist function; in ggplot2, the equivalent is the geom_histogram function:

ggplot(data = gm2007, aes(x = lifeExp)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You may have noted that we dropped the mapping= here. As long as the aes call comes right after data, it’s not needed: mapping is the second argument of ggplot, so R matches the aes call to it by position. You may have also noted that running this command prints a message about bin size. This is not an error: ggplot2 is telling you that it defaulted to 30 bins. We can set the bin size ourselves with the binwidth argument, but for now don’t worry about it.
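For readers who want to see it now, here is a minimal sketch of setting the bin size explicitly. It assumes that gm2007 is the gapminder data restricted to the year 2007 and rebuilds it from the gapminder package so the snippet runs on its own. The 5-year bin width is an arbitrary choice for illustration, not a recommendation.

```r
library(ggplot2)
library(gapminder)  # assumption: gm2007 is derived from this package's data

# Assumption: gm2007 holds only the 2007 rows of gapminder
gm2007 <- subset(gapminder, year == 2007)

# binwidth = 5 groups life expectancy into 5-year-wide bins,
# so ggplot2 no longer needs to fall back to its default of 30 bins
p <- ggplot(data = gm2007, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5)
p
```

With binwidth supplied, the message about bins = 30 should no longer appear. Trying a few different widths is a quick way to see how much the shape of a histogram depends on this choice.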

Boxplots

Oftentimes, we may be interested in showing distributions across multiple categories. We can view multiple distributions using a box plot:

ggplot(data = gm2007, aes(x = continent, y = gdpPercap)) +
  geom_boxplot()

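The box plot (or box-and-whisker plot) summarizes each distribution around its interquartile range: the box spans the 25th to the 75th percentile, and the line inside it marks the median. By default, each whisker extends to the most extreme value that lies within 1.5 times the interquartile range of the box, and any points beyond that are drawn individually as outliers. Box plots are useful when you want to compare distributions across multiple categories. For example: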

ggplot(data = penguins, aes(x = species, y = flipper_length_mm)) +
  geom_boxplot()
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Note the warning here (it is not an error, and the plot is still drawn): two rows have no flipper length recorded, so their values are NA, and R is letting you know that it dropped them from the plot.
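To see where the two dropped rows come from, and which numbers each box is built on, the same summary can be computed by hand. This is a minimal sketch that assumes penguins comes from the palmerpenguins package; the names n_missing and flipper_quartiles are made up for this illustration.

```r
library(palmerpenguins)  # assumption: `penguins` comes from this package

# Count the rows with no flipper length recorded
n_missing <- sum(is.na(penguins$flipper_length_mm))
n_missing

# 25th, 50th and 75th percentiles of flipper length for each species,
# skipping the missing values (na.rm = TRUE is passed on to quantile)
flipper_quartiles <- tapply(
  penguins$flipper_length_mm,
  penguins$species,
  quantile,
  probs = c(0.25, 0.5, 0.75),
  na.rm = TRUE
)
flipper_quartiles
```

The count should match the number of rows reported in the warning, and the three values for each species should correspond to the bottom, middle line, and top of that species' box.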

Try it yourself!

Hopefully now you’re seeing how these ggplot2 functions work. Try the following:

  • Make a histogram of penguin bill lengths

  • Make a boxplot of life expectancy by continental grouping