Statistics Sunday: Highlighting a Subset of Data in ggplot2
This article was originally published at http://www.deeplytrivial.com/
library(tidyverse)
books <- read_csv("2017_books.csv", col_names = TRUE)
## Warning: Duplicated column names deduplicated: 'Author' => 'Author_1' 
One analysis I conducted with this dataset was to look at the correlation between book length (number of pages) and read time (number of days it took to read the book). We can also generate a scatterplot to visualize this relationship.
cor.test(books$Pages, books$Read_Time)
## Pearson's product-moment correlation
## data: books$Pages and books$Read_Time
## t = 3.1396, df = 51, p-value = 0.002812
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1482981 0.6067498
## sample estimates:
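The printed t statistic and degrees of freedom are enough to recover the correlation estimate by hand. The sketch below is not part of the original analysis. It assumes the standard relationship t = r * sqrt(df) / sqrt(1 - r^2) for a Pearson test and the Fisher z interval that cor.test documents for its confidence interval. The variable names (t_stat, r, ci) are chosen for illustration.

```r
# Values copied from the printed cor.test output
t_stat <- 3.1396
df     <- 51

# Rearranging t = r * sqrt(df) / sqrt(1 - r^2) gives r = t / sqrt(t^2 + df)
r <- t_stat / sqrt(t_stat^2 + df)

# Fisher z interval: transform r, add the normal margin, transform back
n  <- df + 2                 # a Pearson test has df = n - 2
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(r, 3)
round(ci, 3)
```

This gives an r of roughly 0.40, and the recomputed interval agrees with the printed 95 percent confidence interval to three decimal places.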
scatter <- ggplot(books, aes(Pages, Read_Time)) +
  geom_point(size = 3) +
  labs(title = "Relationship Between Reading Time and Page Length") +
  ylab("Read Time (in days)") +
  xlab("Number of Pages")
There is a significant positive correlation here, meaning the longer books take more days to read. It's a moderate correlation, and there are certainly other variables that may explain why a book took longer to read. For instance, nonfiction books may take longer. Books read in October or November (while I was gearing up for and participating in NaNoWriMo, respectively) may also take longer, since I had less spare time to read.
I can conduct regressions and other analyses to examine which variables impact read time, but one of the most important parts of sharing results is creating good data visualizations. How can I show the impact these other variables have on read time in an understandable and visually appealing way?
The gghighlight package will let me draw attention to different parts of the plot. For example, I can ask gghighlight to draw attention to books that took longer than a certain amount of time to read, and I can even ask it to label those books.
library(gghighlight)
scatter + gghighlight(Read_Time > 14) +
  geom_label(aes(label = Title),
             hjust = 1,
             vjust = 1,
             fill = "blue",
             color = "white",
             alpha = 0.5)
Here, the gghighlight function identifies the subset (books that took more than 2 weeks to read) and labels those books with the Title variable. Three of the four books with long read times are nonfiction, and one was read for a course I took, so reading followed a set schedule. But the fourth is a fiction book, which took over 20 days to read.
Let's see how month impacts reading time, by highlighting books read in November. To do that, I'll need to alter my dataset somewhat. The dataset contains a starting date and a finish date, which were read in as character strings. I need to convert those to dates and pull out the month variable to create my indicator.
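library(lubridate)
# Parse the character start dates (month-day-year order) into Date objects
books$Started <- mdy(books$Started)
# Extract the month number (1-12) from each start date
books$Start_Month <- month(books$Started)
# Indicator: 1 if the book was started in November, 0 otherwise
books$Month <- ifelse(books$Start_Month == 11, 1, 0)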
# scatter was built before Month was added to books, so refer to books$Month directly
scatter + gghighlight(books$Month == 1) +
  geom_label(aes(label = Title),
             hjust = 1,
             vjust = 1,
             fill = "blue",
             color = "white",
             alpha = 0.5)
The book with the longest read time was, in fact, read during November, when I was spending most of my time writing.
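The same kind of highlighting can be approximated with plain ggplot2 layers, which may help clarify what the plots above are doing. This is a minimal sketch and not how gghighlight is implemented. The helper name highlight_manually and the toy_books data frame are hypothetical, and the toy values are made up for illustration rather than taken from 2017_books.csv.

```r
library(ggplot2)

# Toy data, made up for illustration (not rows from 2017_books.csv);
# column names mirror the ones used in the post.
toy_books <- data.frame(
  Title     = c("Book A", "Book B", "Book C", "Book D"),
  Pages     = c(180, 320, 410, 650),
  Read_Time = c(3, 6, 21, 16)
)

# Hypothetical helper: draw every point in grey, then redraw and label
# only the rows where Read_Time exceeds min_days.
highlight_manually <- function(data, min_days) {
  subset_rows <- data[data$Read_Time > min_days, ]
  ggplot(data, aes(Pages, Read_Time)) +
    geom_point(size = 3, color = "grey80") +
    geom_point(data = subset_rows, size = 3) +
    geom_label(data = subset_rows, aes(label = Title),
               hjust = 1, vjust = 1,
               fill = "blue", color = "white", alpha = 0.5) +
    xlab("Number of Pages") +
    ylab("Read Time (in days)")
}

p <- highlight_manually(toy_books, min_days = 14)
```

Printing p draws the plot. The manual version needs a separate filtered data frame for each highlighted layer, which is the bookkeeping that gghighlight handles with a single predicate.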