A ggplot2 tutorial for beginners
This article is originally published at https://www.sharpsightlabs.com
This blog post is a fairly comprehensive ggplot2 tutorial for beginners.
If you’re new to R and ggplot, this ggplot2 tutorial will cover a few things:
If you’re new to ggplot, I recommend that you read the whole tutorial. But if you want to skip to a particular section, click on the appropriate link in the list above. The link will send you directly to the appropriate section in the tutorial.
What is ggplot2?
First, let’s start with the basics. Here, we’re going to cover what ggplot2 is, and how it fits into the larger data science ecosystem for the R programming language?
ggplot2 is a toolkit for data visualization in R
You can use it to create simple data visualizations scatter plots, bar charts, and line charts:
But you can also use it to create fairly advanced and complicated data visualizations, like detailed maps:
Ultimately, ggplot2 can create very simple data visualizations, and it can create very complicated data visualizations. It’s both powerful and flexible.
ggplot2 is part of the Tidyverse data science toolkit
Although ggplot2 focuses on data visualization, it is part of a larger family of R packages for doing data science in R.
This set of data science packages is called the tidyverse. The tidyverse packages cover the full range of the data science workflow, so there are packages for importing data, data manipulation and cleaning, data visualization, and modeling.
In particular, the tidyverse includes:
- readr for importing data
- dplyr for data manipulation
- ggplot2 for data visualization
- stringr for string manipulation
- lubridate for date manipulation
- tidyr for putting data into a “tidy” format
- … and others
The full list of packages in the tidyverse can be found elsewhere.
What’s important to understand is that the tidyverse provides a coherent set of tools for doing data science in the R programming language, and ggplot2 is one part of that broader toolkit.
Importantly, the packages from the tidyverse share a common philosophy concerning how data science should be performed. This philosophy manifests in the how the syntax is structured and how they operate.
Let’s quickly cover some of the important design features of the tidyverse, and how these relate to ggplot2.
The ggplot2 operates on dataframes
The ggplot2 package operates on R dataframes. This is because (for the most part) the tidyverse packages focus on dataframes, in one way or another.
In fact, the name “tidyverse” comes from the concept of a “tidy” dataframe. A so-called “tidy” dataframe is a dataset where every variable has its own column, every observation has its own row, and every value has its own cell in the dataframe grid.
Some of the packages – like the
tidyr package – work to reshape data into this tidy format.
Other packages – like
stringr – primarily operate on the variables within a “tidy” dataframe.
And some packages “do stuff” with dataframes. For example,
ggplot2 visualizes the data that’s in a tidy dataframe. ggplot expects the input data to be in a dataframe. It doesn’t work with other data structures, for the most part.
Several other packages – like
dplyr – also require the input data to be in a “tidy” dataframe.
The tidyverse is highly modular
All of the functions in the tidyverse packages are highly modular. That means that for the most part, all of the functions are designed to do one thing, and one thing only.
For example, in
ggplot() function initiates plotting. That’s essentially the only thing that it does. There’s a separate function that you use to draw bars (for a bar chart). Another function for drawing points for a scatterplot. And there are still other functions for formatting the elements of your plot.
In ggplot2 and the rest of the tidyverse, almost every little operation that you want to perform has a separate function.
This might seem odd, but once you see it in action, it seems like a great way to structure things.
All of these little functions in ggplot2 and the tidyverse are like little Lego building blocks that you can snap together.
In terms of workflow, this means that you can write your code iteratively.
Just trust me on this. It’s great.
The tidyverse and ggplot are easy to use
One of the primary advantages of the tidyverse is that it is relatively easy to use.
Part of this comes from the design of the syntax.
For starters, almost everything is named in a way that’s clear and easy to understand. If you want to “filter” out some of the rows of your data, there is a function called
dplyr. Or if you want to “select” specific variables from a dataset,
dplyr also has a function called
Other functions have little prefixes that make them easy to work with. For example, essentially all of the functions from the
stringr package use the prefix
str_. So if you’re using RStudio, you can type in
str_ and then hit the
Moreover, the names of those
stringr functions are well named. So if you need to “replace” characters in a string, you can use
str_replace(). If you want to convert all of the characters in a string to lower case, you can use
str_to_lower(). The names of the functions begin with
str_, and they are otherwise named in a way that makes them easy to remember.
The fact that the functions are clearly named is actually a really big deal.
Because things are clearly named, functions are much easier to remember. This makes it a lot easier to write code.
It also makes it easier to read code. Reading code for the tidyverse is often like reading psudocode.
This is one of the reasons that I recommend that new R users learn the tidyverse. It’s also one reason that I recommend that many data science beginners learn R, instead of a different data science language.
ggplot2 has a highly structured syntax
Part of the reason that ggplot itself is powerful and easy to use is that it has a highly structured syntax.
This is because it is based on a theoretical framework called The Grammar of Graphics.
I won’t explain the Grammar of Graphics here, but understand that it enables a data scientist to think about data visualization in a highly structured way. Moreover, because the syntax of ggplot2 is based on the Grammar of Graphics, it makes it possible to create data visualizations with a relatively concise syntax, whether they are simple visualizations or and complex visualizations .
Having said that, let’s take a look at the syntax of ggplot2 to understand how it works.
The syntax of ggplot2
Now that we’ve talked about what ggplot2 is and how it fits into the tidyverse, lets move on to the heart of this ggplot2 tutorial. Let’s talk about the syntax of ggplot2.
The great thing about the syntax of ggplot2 is that its highly systematic. The systematic nature of ggplot is one of its best features. The structured nature of ggplot2 makes it very powerful, once you understand it.
On the other hand though, the syntax can be a little confusing to beginners. Once you understand how the system works, it makes a lot of sense, but you might need to do some work to understand it first.
That being said, let’s take a careful look at the syntax.
The basics of ggplot syntax
There are four main parts of a basic ggplot2 visualization: the
ggplot() function, the
data parameter, the
aes() function, and the geom.
Let’s talk about each of these separately.
The ggplot function
ggplot() function is the core function of ggplot2. It initiates plotting.
Essentially, any time you want to create a data visualization with ggplot2, you’re going to use this function. Almost everything else in the ggplot2 system is built “on top of” this function.
The data parameter
Inside of the the
ggplot() function, the first parameter is the
data parameter essentially specifies the data that you want to visualize. More specifically, it specifies the
data.frame object that contains the data that you want to visualize. The ggplot2 system works almost exclusively with
data.frame objects. So when you provide an argument to the data parameter, it will always be a
data.frame object of some type (i.e., a a traditional
data.frame or a
Geoms (AKA, geometric objects)
“Geoms” are the geometric objects of a data visualization. They are the things that get drawn in a data visualization.
This is often confusing to beginners, so let me give you 3 simple examples.
Lines, points, and bars are all types of “geoms.”
Lines, points, and bars are all geometric objects that you can draw in a data visualization. There are many other types of geoms as well like boxes for a box plot, polygons, etc.
So here’s an example. Let’s say you want to make a line chart.
The “geom” that you need to draw to create a line chart like this is a “line geom.” You can draw line geoms with the
This is critical: the type of geom or geoms that you use determine the type of data visualization that gets created.
geoms have attributes like color and size
There’s something important that you need to know about geoms. Geoms have attributes.
Think about it. Anything that you draw has attributes like its position in the coordinate system, color, size, shape, etc.
And remember, geoms are the visual things that we draw in a plot. Therefore, any geom that you draw has attributes.
For example, point geoms have attributes like color, size, x-position, and y-position. We call these aesthetic attributes. Aesthetic attributes are essentially the visual details about the color, size, and position of your geometric objects.
This is important, because it relates to the final part of the basic ggplot2 syntax.
The aes function
I just introduced you to geometric objects, which are the things that we draw in a data visualization. And I just noted that those geometric objects have attributes like color, size, and shape.
Also, a little earlier in this ggplot tutorial, we talked about the
data parameter. The data parameter specifies the data that you will plot.
So there’s a dataset that you will plot, and then there’s the visual output itself, which is determined by your geom specification.
But how do you connect them?
Remember: data visualizations are essentially visual representations of an underlying dataset. For the data visualization process to work properly, there needs to be a connection between the data (the dataframe) and the visual objects that we draw (the geoms).
You need a way to “connect” the dataset to the geoms that get drawn.
The aes function creates mappings from data to geoms
Said a little more precisely, we need a mapping from the underlying data to visual objects that get drawn (the geoms).
How do we do this? How do we connect the dataset to the visual objects in the chart?
We do this with the
aes() function enables you to create a set of “mappings” from your dataset to the geoms in your data visualization. More precisely, the
aes() function allows you to map the variables in your data frame to the aesthetic attributes of the geometric objects of your plot.
Recall from earlier in the tutorial, we talked about these two things: dataframes and geoms. The dataframe is specified by the
data parameter and the geom is specified by the geom that you choose (e.g., geom_line, geom_bar, etc). The
aes() function is what enables you to connect these two things.
Let’s talk a little more specifically about what this function does.
Remember that all geoms have aesthetic attributes. For example, point geoms have attributes like color, size, shape, x-position, and y-position.
When you use the
aes() function, you are really connecting variables in your dataframe to the aesthetic attributes of your geoms.
A quick example of the aes function
Here’s an example. Let’s say that you want to plot line geoms. Essentially, you want to create a line chart.
Line geoms have aesthetic attributes like their position on the x axis, position on the y axis, and color. By using the
aes() function, we can connect the variables in the dataframe to those aesthetic attributes, which will cause the line to vary on the basis of the underlying data.
So imagine you have a dataset called
dummy_data, and it has two variables,
var2. You want to put
var1 on the x axis and
var2 on the y axis. To create this variable mapping, you can use the
ggplot(data = dummy_data, aes(x = var1, y = var2) + geom_line()
Take a look at the code and then look at the image. Inside of the
aes() function, we have the code
x = var1 and
y = var2. Here,
x refers to the
x position aesthetic. Similarly,
y refers to the
y position aesthetic. These are aesthetic attributes of the points on the line that we’re drawing. And ultimately, by using the
aes() function this way, we’re connecting the parts of the line to the underlying data in the dataset,
Keep in mind that ggplot2 geoms have lots of aesthetic attributes that you can manipulate: x-position, y-position, color, size, shape, and more. Also, keep in mind that different geoms (lines, points, bars, etc) have different aesthetic attributes that you can manipulate.
Some aesthetics are relatively universal (like x-position) but others are specific to specific geoms.
Regardless, to get the full power out of the ggplot2 system, you need to have a firm understanding of how to create variable mappings using the
Examples: how to use ggplot2
Ok. Now that I’ve explained the syntax of ggplot2, let’s look at some examples.
Whenever you’re learning a new programming language, I strongly recommend that you study and practice very simple examples until you really understand how they work. To rapidly master a programming language, you really need to understand basic tools, techniques, and concepts first.
With that in mind, I’m going to show you how to make some basic plots with ggplot2.
Keep in mind that there are other tutorials on this website that explain these techniques in greater detail. However, the simple examples in this ggplot tutorial will give you a quick introduction to these plots and how they work.
The data and packages that we’ll be using
In these examples, we’ll be working with a few packages and datasets.
We’ll primarily be working with the ggplot2 package and using data from the ggplot2 package. Additionally, we’re going to use some other tools from the tidyverse. With that in mind, you need to make sure that you have these packages installed and loaded.
To install the packages in RStudio, you can go to
Tools > Install Packages in the menu bar. Once you’re there, a window will open up and you can type the name of the packages into the text box. Then click “Install.” Make sure to install
Once you have the packages installed, you’ll need them loaded in RStudio. To load them, you’ll need to use the
library() function like this:
Technically, you don’t need to load ggplot2 here, because ggplot2 will be automatically loaded when you load the tidyverse package. But since this is a ggplot2 tutorial, I’m making it explicit.
In any case, you’ve loaded these packages by running the code, you should be ready to go.
How to make a scatterplot with ggplot2
First, we’ll make a scatterplot.
ggplot(data = txhousing, aes(x = listings, y = sales)) + geom_point()
So what are we doing here? Let’s break it down.
ggplot() function indicates that we’re going to plot something. Really, the only thing that the
ggplot() function does is initiate plotting. All of the “heavy lifting” is done by the other parts of the syntax.
Immediately inside of the
ggplot() function, you can see the
data = parameter. Using the
data parameter, we’ve indicated that we’re going to plot data from the
txhousing dataset by using the code
data = txhousing.
On the second line of code, we’ve used the
geom_point() function to indicate that we’re going to plot point geoms. Essentially, we’re using this to plot points.
Finally, take a look at the
aes() function inside of
ggplot(). As I mentioned earlier in this ggplot tutorial, the
aes() function enables us to connect our dataset to our geometric objects. So what specifically did we do here? The exact code is
aes(x = listings, y = sales). This code maps the
listings variable to the x axis and the
sales variable to the y axis.
Now, check out the output of the code:
Just as we’ve specified with the
aes() function, you can see that we’ve mapped the
listings variable to the x axis and the
sales variable to the y axis.
And because we’ve used
geom_point(), ggplot has drawn points. In the plot, every point essentially represents a different row of data. For each point, the x axis position corresponds to the value of
listings, and the y axis position corresponds to the value of
Keep in mind that this is a relatively simple example of how to make a scatterplot. For a little more detail, see our other tutorials for more information about how to make scatterplots in ggplot2.
How to make a bar chart with ggplot2
For the next example in our ggplot2 tutorial, let’s take a look at how to create a bar chart with ggplot.
First, here’s the code. You can paste this into RStudio and run it.
ggplot(data = midwest, aes(x = state)) + geom_bar()
Once again, let’s break this down.
If you’ve been following the syntax explanations through this ggplot2 tutorial, this code should mostly make sense.
ggplot() function initiates plotting. Then immediately inside the
ggplot() function, the code
data = midwest indicates that we’ll be plotting data from the
On the second line of code, the
geom_bar() function indicates that we’ll be drawing bars. Essentially, this indicates that we’re going to make a bar chart.
Then, take a look at the
aes() function. As always, the
aes() function tells ggplot which variables to plot on the chart. In this particular case, the code
aes(x = state) puts the
state variable on the x axis of the chart.
Notice though that we haven’t mapped any variable to the y axis. By default, if you use
geom_bar() and you don’t map any variable to the y axis using the
aes() function, ggplot will count the records.
So in this case, the length of the bar corresponds to the count of the number of records for the category on the x axis.
Create a bar chart with stat = ‘identity’
There’s also another way to make a bar chart. It’s possible to map a variable to the y axis too, so the length of the bar correspond to the value of the y axis variable (instead of the count).
To show you an example of this, I’m going to create a new dataset that calculates the total population by state. In order to create this summarised dataset, we’ll use the
group_by() and the
summarise() functions from dplyr.
Having said that, in order to really understand this, you’ll need to understand dplyr and the “pipe” syntax. Explaining dplyr is beyond the scope of this blog post (since this is a ggplot2 tutorial), so check out our dplyr tutorial for more explanation of how this works.
midwest_populations <- midwest %>% group_by(state) %>% summarise(total_population = sum(poptotal))
Ultimately, this code produces a summarised dataset that contains two variables:
Let’s print it out so you can see it:
Again, there are two variables: the state, and then the total population of that state.
The important detail here is that there is one observation for every state. This is different from the original
midwest dataset, where there was one record for every county, and therefore multiple records for every state.
This is relevant, because now we can map the
state variable to the x axis and the
total_population variable to the y axis.
Let’s take a look:
ggplot(midwest_populations, aes(x = state, y = total_population)) + geom_bar(stat = 'identity')
An explanation of stat = ‘identity’ in geom_bar
Let’s break this down.
midwest_populations, has only two variables,
total_population. Inside the
aes() function, we’ve mapped
state to the x axis and
total_population to the y axis. Notice that this is different from our previous example, where we only mapped
state to the x axis.
Furthermore, take a look inside of the call to
geom_bar(). Inside of
geom_bar(), there’s a piece of syntax that says
stat = 'identity'. This syntax essentially says that the length of the bar should correspond to the value of the variable on the y axis. Remember, by default,
geom_bar() wants to count the records and make the length of the bar correspond to that count.
When we use the the code
geom_bar(stat = 'identity') we’re really overriding that default behavior, and making the length of the bar correspond to the variable mapped to the y axis. Keep in mind that this only really works if you have a variable mapped to the y axis. So you need to use the
aes() function in concert with the syntax
stat = 'identity'.
These are two very simple examples of bar charts. If you want more details about how to create bar charts in ggplot2, check out our previous tutorial on how to use
How to make a line chart with ggplot2
Now, let’s finally make a line chart.
Again, if you’ve been following along with this ggplot2 tutorial, the syntax for the line chart should make sense.
To make this line chart with ggplot2, we’re going to use a dataset of the stock price of Tesla (the car company). I previously gathered and cleaned that dataset, so it’s largely ready to go.
# IMPORT DATA INTO R tsla_stock_metrics <- read_csv("https://www.sharpsightlabs.com/datasets/TSLA_start-to-2018-10-26_CLEAN.csv")
Very quickly, let’s examine the data by printing it out.
As you can see, there are several variables here. We’re mainly going to be interested in the
date variable and the
ggplot(data = tsla_stock_metrics, aes(x = date, y = open_price)) + geom_line()
Again, if you’ve been following along so far in this ggplot2 tutorial, this should mostly make sense.
We’re setting the dataset with the code
data = tsla_stock_metrics. Then inside the
aes() function, we’re mapping
date to the x axis and
open_price to the y axis. Finally, we’re using
geom_line() to indicate that we want ggplot to draw lines.
Notice as well how similar this is to our previous examples. Just like in the previous examples in this ggplot2 tutorial, we’re simply designating a dataframe, mapping variables to the x and y axes, and specifying a geom. In the simplest cases, that’s all there is to making a data visualization with ggplot2.
Other charts you can make with ggplot2
The three examples in this ggplot2 tutorial are three of the charts that you’ll probably use most often … the line chart, bar chart, and scatterplot.
Having said that, there are many other charts you can make with ggplot2.
You can make histograms:
Or you can make density plots:
ggplot2 also makes it easy to make much more complicated data visualizations, like geospatial maps:
There’s also a lot that you can do to format a chart.
So even though this ggplot2 tutorial gives you the basics, there’s still more to learn.
For more data science tutorials, sign up for our email list
If you want to master ggplot2 and other data science tools, sign up for our email list.
Here at Sharp Sight, we teach data science.
Every week, we publish articles and free tutorials about data science.
If you sign up for our email list, you’ll get these tutorials delivered right to your inbox.
You’ll learn about:
- machine learning in R
- … and more.
Want to learn data science in R? Sign up now.
Please visit source website for post related comments.