Chapter 1: Data visualization using ggplot2

Learn how to plot different types of graphs using the ggplot2 package
ggplot2
data visualization
Author
Published

December 2, 2021

Modified

April 15, 2023

1 Before we begin

This series of chapters are focused on people who have a basic understanding of R programming. You are expected to know how the basic syntaxes of R language work like how to assign values, how to load your data, the different operations of R, and so on. If you just started out on your R journey then this chapter and the rest will become a roadblock for you. But fear not, there are some amazing websites that teach you the very basics of R, and that too for free! For an interactive way of learning, I recommend DataCamp. It is an online platform for learning programming languages. They have both paid and free classes. Luckily for us, they are providing the introductory classes on R programming for free. This would be a great way to start your R journey. After you completed the course on data camp you can come back to this blog and you will find it very easy to comprehend and learn the various chapters that are available here. Coming from my own experience, it would be really helpful if you have your own data to work with. Rather than religiously following the steps in these chapters, I would recommend you have a goal in your mind before diving into the chapters. The goal should be to try incorporating the things you learned here into your own data. That would be the best way to learn anything in R. Hope you have a great time learning!

2 Introduction to ggplot2 package

In this chapter we will be plotting different types of graphs using a package called {ggplot2} in R. The ggplot2 package is based on ‘grammar of graphics plot’ which provides a systematic way of doing data visualizations in R. With a few lines of code you can plot a simple graph and by adding more layers onto it you can create complex yet elegant data visualizations.

A ggplot2 graph is made up of three components.

  1. Data: Data of your choice that you want to visually summarise.
  2. Geometry or geoms: Geometry dictates the type of graph that you want to plot and this information is conveyed to ggplot2 through the geom() command code. For e.g. using the geom_boxplot() command, you can plot a box plot with your data. Likewise, there are many types of geometry that you can plot using the ggplot2 package.
  3. Aesthetic mappings: Aesthetics define the different kinds of information that you want to include in the plot. One fo the most important aesthetic is in choosing which data values to plot on the x-axis and the y-axis. Another example is changing the colour of the data points, which can be used to differentiate two different categories in the data. The use of aesthetics depends on the geometry that you are using. We use the command aes() for adding different types of aesthetics to the plot. We will learn more about aes() in Chapter 2. For now, we will only see what kind of plots can be made using the {ggplot2} package. We will learn how to tweak them in Chapter 2.

This tutorial is primarily focused on students who are beginners in R programming and wants to quickly plot their data without much of a hassle. So without further ado let’s plot some graphs!

3 Setting up the prerequisites

First, we need to install the {ggplot2} package in R as it does not come in the standard distribution of R.

  • To install packages in R we use the command install.packages() and to load packages we use the command library(). Therefore to install and load {ggplot2} package we use the following lines of command.
install.packages("ggplot2")
library(ggplot2)

All right we have the {ggplot2} package loaded, now we just need some data to plot. Most R programming tutorials use the iris dataset as an example. But this tutorial won’t be like most tutorials. So let me introduce you to some lovely penguins from Palmer Station in Antarctica!

For this tutorial, we will be installing the {palmerpenguins} package which showcases body measurements taken from three different species of penguins from Antarctica. This package was made possible by the efforts of Dr. Allison Horst. The penguin data was collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER.

  • Install the {palmerpenguins} package and load it in R.
install.packages("palmerpenguins")
library(palmerpenguins)

Now there are two datasets in this package. We will be using the penguins dataset which is a simplified version of the raw data present in the package.

  • Use the command head() to display the first few values of penguins dataset to see how it looks like
library(palmerpenguins)
head(penguins)

We can see that are 8 columns in the dataset representing different values. Now let us try plotting some graphs with this data.

3.1 Bar graph

So we will try to plot a simple bar graph first. Bar graphs are used to represent categorical data where the height of the rectangular bar represents the value for that category. We will plot a bargraph representing frequency data for all three species of penguins.

  • We will be using the geom_bar() command to plot the bar graph. Let us also use the command theme_bw() for a nice looking theme.
library(ggplot2)
library(palmerpenguins)

ggplot(data = penguins, aes(x = species, fill = species)) + 
  xlab("Species") + ylab("Frequency") + 
  ggtitle("Frequency of individuals for each species") + 
  geom_bar() + theme_bw()

3.2 Histogram

Histograms are similar to bar graphs visually and are used to represent continuous data. Histograms splits data into ‘bins’ and tells us the number of observations for each of the bin.

  • We can plot a histogram using the command geom_histogram() and change the bin width using the binwidth = argument.
library(ggplot2)
library(palmerpenguins)

ggplot(data = penguins, aes(x = body_mass_g, fill = species)) + 
  xlab("Body Mass (g)") + ylab("Frequency") + 
  ggtitle("Frequency of individuals for respective body mass") + 
  geom_histogram(binwidth = 1000) + theme_bw()
Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

The warning message indicates that for two rows in the dataset, they have NA values or that they did not have any values present. This is true for real-life cases, as during data collection sometimes you will be unable to collect data due to various reasons. So this is perfectly fine.

3.3 Line graph

Line graph simply joins together data points to show overall distribution.

  • Use the command geom_line() for plotting a line graph.
library(ggplot2)
library(palmerpenguins)

ggplot(data = penguins, aes(x = bill_length_mm, 
                            y = bill_depth_mm, colour = species)) + 
  xlab("Bill length (mm)") + ylab("Bill depth (mm)") + 
  ggtitle("Bill length vs Bill depth") + geom_line() + theme_bw()

3.4 Scatter plot

The scatter plot simply denotes the data points in the dataset.

  • Use the command geom_point() to plot a scatter plot.
library(ggplot2)
library(palmerpenguins)

ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm, 
                            shape = species, colour = species)) + 
  xlab("Body mass (g)") + ylab("Flipper length (mm)") + 
  ggtitle("Body mass vs Filpper length") + geom_point(size = 2) + theme_bw()

3.5 Density Plot

Density plots are similar to histograms but show it shows the overall distribution of the data in a finer way. This way we will get a bell-shaped curve if our data follows a normal distribution.

  • Use the command geom_density() to a density plot.
library(ggplot2)
library(palmerpenguins)

ggplot(data = penguins, aes(x = body_mass_g, fill = species)) + 
  xlab("Body Mass (g)") + ylab("Density") + ggtitle("Body mass distribution") + 
  geom_density() + theme_bw()

Since we plotted for all three species the graph looks clustered. Let us try plotting the same graph for only gentoo penguins. We will use the dplyr package to filter() data for gentoo penguins alone. The dplyr package comes in-built with R so just load the dplyr package using the command library().

library(ggplot2)
library(palmerpenguins)
library(dplyr)

penguins_gentoo <- penguins %>% filter(species == "Gentoo")

ggplot(data = penguins_gentoo, aes(x = body_mass_g)) + 
  xlab("Body Mass of Gentoo penguins (g)") + ylab("Density") + 
  ggtitle("Body mass distribution of Gentoo penguins") + 
  geom_density(fill = "red") + theme_bw()

3.6 Dot-plot

Dot-plot is similar to a density plot but it shows discretely each data point in the distribution.

  • Use the command geom_dotplot() to plot a dot-plot.
library(ggplot2)
library(palmerpenguins)

ggplot(data = penguins, aes(x = species, y = body_mass_g, fill = species)) + 
  xlab("Species") + ylab("Body mass (g)") + 
  ggtitle("Body mass in three diferent species of penguins") + 
  geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 100) + theme_bw()

3.7 Rug-plot

Rug-plot is a simple way to visualize the distribution of data along the axis lines. It is often used in conjunction with other graphical representations.

  • Use the command geom_rug() to plot a rug-plot.
library(ggplot2)
library(palmerpenguins)
library(dplyr)

penguins_gentoo <- penguins %>% filter(species == "Gentoo")

ggplot(data = penguins_gentoo, aes(x = body_mass_g, y = flipper_length_mm)) + 
  xlab("Body Mass of Gentoo penguins (g)") + ylab("Density") + 
  ggtitle("Body mass distribution of Gentoo penguins") + 
  geom_point(colour = "darkred") + geom_rug() + theme_bw()

3.8 Box plot

Box-plot is one of the better ways of showing data via quartiles. You can learn more about box plots here.

  • Use the command geom_boxplot() to plot a box-plot.
library(ggplot2)
library(palmerpenguins)

ggplot(data = penguins, aes(x = species, y = body_mass_g, colour = species)) + 
  xlab("Species") + ylab("Body mass (g)") + 
  ggtitle("Body mass in three diferent species of penguins") + geom_boxplot() + 
  theme_bw()

3.9 Violin plot

Violin plot can be considered as the best of both a box-plot and a density plot. It shows the quartile values, like in a box-plot and also shows the distribution of the data, like in a density plot.

  • Use the command geom_violin() in conjunction with geom_boxplot() to plot a violin plot.
library(ggplot2)
library(palmerpenguins)

ggplot(data = penguins, aes(x = species, y = body_mass_g, fill = species)) + 
  xlab("Species") + ylab("Body mass (g)") + 
  ggtitle("Body mass in three diferent species of penguins") + 
  geom_violin(aes(colour = species), trim = TRUE) + geom_boxplot(width = 0.2) +
  theme_bw()

4 Saving your ggplot2 graphs

  • Use the command ggsave() to save the graph locally. In the code below, ‘my_graph’ is the ggplot element containing your graph. The plot will be saved in your working directory.
library(ggplot2)
library(palmerpenguins)

my_graph <- ggplot(data = penguins, aes(x = species, y = body_mass_g,
                                    fill = species)) + 
  xlab("Species") + ylab("Body mass (g)") + 
  ggtitle("Body mass in three diferent species of penguins") + 
  geom_violin(aes(colour = species), trim = TRUE) + 
  geom_boxplot(width = 0.2) +
  theme_bw()

#to save the plot
ggsave(my_graph, filename = "your_graph_name.png", width = 20, height = 20,
       units = "cm")

5 Summary

I hope this tutorial helped you to get familiarize with the {ggplot2} commands. The best way to learn R is by actually doing it yourself. So try to recreate the examples given in this tutorial and then try to apply what you learned using the different datasets available in R. In chapter 2, we will learn how to customize the plots by tweaking the aesthetic mappings.

6 References

  1. H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. Read more about ggplot2 here. You can also look at the cheat sheet for all the syntax used in ggplot2. Also check this out.

  2. Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.

Reuse

Citation

BibTeX citation:
@online{johnson2021,
  author = {Johnson, Jewel},
  title = {Chapter 1: {Data} Visualization Using Ggplot2},
  date = {2021-12-02},
  url = {https://sciquest.netlify.app//tutorials/data_viz/ggplot_1.html},
  langid = {en}
}
For attribution, please cite this work as:
Johnson, Jewel. 2021. “Chapter 1: Data Visualization Using Ggplot2.” December 2, 2021. https://sciquest.netlify.app//tutorials/data_viz/ggplot_1.html.