# Graphs with ggplot2: Part I

## Objectives

- Learn to apply aesthetic mappings in ggplot2
- Produce a variety of univariate graphs including histograms, kernel density plots, box plots, and violin plots
- Compare groups in graph space
- Create multivariate graphs including scatter plots and line graphs
- Explore techniques to combat overplotting

## Overview

Multiple packages are available to produce graphs in R. Graphing
functionality is even made available through base R. We will make use of
some base R graphics in this course, but we will not focus on this
topic. The **grid** package provides a low-level graphics
system that other packages have used to expand the graphing capabilities
of R. These packages include **lattice** and
**ggplot2**. In this course, we will explore ggplot2, which
is based on the *Grammar of
Graphics* by Leland Wilkinson. This philosophy suggests breaking
graphs up into semantic components such as scales and layers. In fact,
the package name means “Grammar of Graphics Plots 2”. Practically, this
translates to mapping constants, such as specific colors, sizes, or
symbols, or variables, such as your data values, to graphical
parameters, such as position along the *x*-axis or
*y*-axis, size or shape of point symbols, width of lines, and
color of point symbols or lines. This is the concept of
**aesthetic mappings**.

ggplot2 is my preferred method for making graphs. I have found it to be very intuitive, powerful, and adaptable to a wide variety of needs. Once a graph is produced, you can easily export it as a vector graphic for editing outside of R. In this first section on ggplot2 we will focus on the basics of the package and how to define aesthetic mappings. We will also explore a variety of different types of graphs. In the second section, we will explore methods to improve and refine graphs.

Before working through these examples, you will need to install the
ggplot2 package. Also, you will need to load in some example data.
First, we will work with the *high_plains_data.csv* file. The
elevation (“elev”), temperature (“temp”), and precipitation (“precip2”)
data were extracted from raster grids provided by the PRISM Climate
Group at Oregon State University. The elevation data are provided in
meters, the temperature data represent 30-year annual normal mean
temperature in Celsius, and the precipitation data represent 30-year
annual normal precipitation in millimeters. The original raster grids
have a resolution of 4-by-4 kilometers and can be obtained from the PRISM Climate Group. I also summarized
percent forest by county (“per_for”) from the 2011 National Land Cover
Database (NLCD). NLCD data can be obtained from the Multi-Resolution Land
Characteristics Consortium (MRLC). The link at the bottom of the
page provides the example data and R Markdown file used to generate this
module.

```
library(ggplot2)
# Read the example data; adjust the path to match your own folder
hp_data <- read.csv("D:/mydata/ggplot2_p1/high_plains_data.csv", sep=",", header = TRUE, stringsAsFactors=TRUE)
```

A cheat sheet is available for ggplot2. It can be accessed in RStudio under Help -> Cheat Sheets.

## Univariate Graphs

**Univariate graphs** involve visualizing a single
variable and include **histograms**, **kernel density
plots**, and **box plots**.

### Histograms

A **histogram** plots the count or number of data points
in each defined data range or bin. The *x*-axis represents the
data values and the *y*-axis represents the count. My goal with
the first graph is to inspect the distribution of percent forest cover
by county. Since this is the first example of a ggplot2 graph, I will
explain the syntax in detail. All ggplot2 graphs start with the
*ggplot()* function. Within this function I specify the data
frame being referenced and the aesthetic mappings using *aes()*.
Here, I only need to define the *x* value. Next, I must define
the geometry desired. *geom_histogram()* is used to create
histograms. The *binwidth* argument specifies the range of values
to aggregate into each bin for counting. I used 5, so each bin will
contain a range of 5%. Note that multiple lines of code are associated
using *+*. This is a key component of how ggplot2 works. Also,
note that field names do not need to be placed in quotes.

```
ggplot(hp_data, aes(x=per_for))+
geom_histogram(binwidth=5)
```

This graph is the same as the one created above. However, I have
changed the *binwidth* to 2. So, each bin is now narrower or
aggregates a smaller range of values, and more detail in the
distribution is visualized. There is not necessarily a correct
*binwidth*, as this is case specific and partially depends on
whether you are interested in visualizing a general trend or more local
information.

```
ggplot(hp_data, aes(x=per_for))+
geom_histogram(binwidth=2)
```

At this point, you might be thinking that these graphs don’t look great. Remember that we are focusing on exploring the basics of ggplot2 and aesthetic mappings in this section. You will learn to make your graphs pretty in the next module.

Here is another example of a **histogram** representing
the distribution of mean annual temperature by county. Take some time to
make sure you understand the aesthetic mappings.

```
ggplot(hp_data, aes(x=temp))+
geom_histogram(binwidth=2)
```
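As a recap of the syntax used so far, the sketch below contrasts mapping a variable to an aesthetic (inside *aes()*) with setting a constant (outside *aes()*). The data frame `demo`, its columns, and the color "steelblue" are made up for this illustration and are not part of the course data.

```r
library(ggplot2)

# Hypothetical data invented for this illustration only
set.seed(42)
demo <- data.frame(
  value = c(rnorm(50, mean = 10, sd = 2), rnorm(50, mean = 20, sd = 2)),
  group = rep(c("a", "b"), each = 50)
)

# Variable mapped to an aesthetic: placed inside aes(), so fill varies by group
p_mapped <- ggplot(demo, aes(x = value, fill = group)) +
  geom_histogram(binwidth = 2)

# Constant set for an aesthetic: placed outside aes(), so every bar is the same color
p_constant <- ggplot(demo, aes(x = value)) +
  geom_histogram(binwidth = 2, fill = "steelblue")

p_mapped
p_constant
```

Only the mapped version produces a legend, since a legend is needed only when an aesthetic varies with the data.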

### Kernel Density Plots

Another means to represent the distribution of a single variable is a
**kernel density plot**, in which a kernel density function
is used to represent a generalized or smoothed version of the
distribution of a variable. The syntax is very similar to that for the
histograms created above. Temperature is assigned to the *x*
variable. I have added a *..density..* argument in the
*aes()* function to indicate that I want the *y*-axis to
show density. I am now using *geom_density()* as opposed to
*geom_histogram()*. Lastly, I have provided a *fill*
argument to make the density area green.

```
ggplot(hp_data, aes(x=temp, ..density..))+
geom_density(fill="Green")
```

In the next graph I have changed *..density..* to
*..count..* to alter the *y*-axis mapping. I also changed
the *fill* to red, the *color* to gray, and the
*size* to 3. *color* for a kernel density plot defines the
color of the outline while *size* defines the thickness of the
outline.

```
ggplot(hp_data, aes(x=temp, ..count..))+
geom_density(fill="red", color="gray", size=3)
```

In this example, I have set *fill* to *NA* so that
there is no fill color applied.

```
ggplot(hp_data, aes(x=temp, ..count..))+
geom_density(fill=NA, color="red", size=1.5)
```

Similar to the *binwidth* argument for a histogram, the
*adjust* argument can be used to indicate the generalization of
the kernel density function. Smaller values will result in a more
detailed distribution that shows more local patterns, as demonstrated in
the next graph. In the following graph a larger value is used, which
results in a very generalized pattern. Similar to histogram binning,
there isn’t necessarily a correct value, as this depends on the purpose
of the graph and the patterns being explored.

```
ggplot(hp_data, aes(x=temp, ..density..))+
geom_density(fill="Green", adjust=.25)
```

```
ggplot(hp_data, aes(x=temp, ..density..))+
geom_density(fill="Green", adjust=5)
```
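The *..density..* and *..count..* notation used above still runs, but recent ggplot2 releases (3.3.0 and later, to my knowledge) offer *after_stat()* as the preferred way to refer to a computed statistic, and newer versions warn about the older notation. Here is a minimal sketch of the equivalent syntax. The `demo` data frame is simulated and is only a stand-in for the temperature data.

```r
library(ggplot2)

# Simulated stand-in for the temperature column (hypothetical values)
set.seed(1)
demo <- data.frame(temp = rnorm(200, mean = 10, sd = 3))

# Equivalent of aes(x=temp, ..density..)
p_density <- ggplot(demo, aes(x = temp, y = after_stat(density))) +
  geom_density(fill = "green")

# Equivalent of aes(x=temp, ..count..)
p_count <- ggplot(demo, aes(x = temp, y = after_stat(count))) +
  geom_density(fill = NA, color = "red")

p_density
p_count
```

The count version is simply the density multiplied by the number of observations, so the two curves have the same shape and differ only in the *y*-axis scale.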

### Box Plots

Data distribution can also be summarized using **box
plots**. A box plot shows the **minimum**,
**1st quartile**, **median**, **3rd
quartile**, **maximum**, and **interquartile
range** (**IQR**). Note
that the **mean** is not generally included.

I am using *geom_boxplot()* to show the distribution of mean
elevation by county. Note that any low values that fall more than 1.5
times the **IQR** below the **1st quartile** and any high
values that fall more than 1.5 times the **IQR** above the **3rd
quartile** are plotted as points. If you would like to not show
these points, then you can set the *outlier.shape* parameter
equal to *NA*, as demonstrated in the second example. However,
this may be a misrepresentation of your data as it implies that these
more extreme values don’t exist. If you do choose to not plot the
**outliers**, I would recommend at least noting that this
was done so that viewers are aware.

```
ggplot(hp_data, aes(y=elev))+
geom_boxplot()
```

```
ggplot(hp_data, aes(y=elev))+
geom_boxplot(outlier.shape=NA)
```
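To make the 1.5 times IQR rule concrete, the sketch below works through the arithmetic in base R. The `elev` vector is a made-up set of values, not the county data, and I am assuming the default quantile definition used by base R's *quantile()* function.

```r
# Hypothetical elevation values invented for this illustration
elev <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Quartiles and interquartile range
q1 <- quantile(elev, 0.25, names = FALSE)
q3 <- quantile(elev, 0.75, names = FALSE)
iqr <- q3 - q1

# Values beyond these fences are drawn as individual points
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Identify the values that would be plotted as outliers
outliers <- elev[elev < lower_fence | elev > upper_fence]
outliers
```

Running a check like this before setting *outlier.shape=NA* makes it easy to report how many values were hidden.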

## Comparing Categories

It is common to compare the distribution of a continuous variable for
different categories in the same graph space. In the following example,
I first subset out the counties for three states using the
*filter()* function from dplyr. I then generate a kernel density
plot in which temperature is mapped to the *x*-axis, density to
the *y*-axis, and the state name to the *fill* color. The
result is three separate density curves. In order to better visualize
the distributions in overlapping areas, I have also applied a 50%
transparency using the *alpha* argument. ggplot2 automatically
generates a legend to explain the different colors. We will talk more
about legends and legend design in the next section.

```
library(dplyr)
# Keep only the counties in the three states of interest
hp_UT_ND_KS <- hp_data %>% dplyr::filter(STATE_NAME == "North Dakota" | STATE_NAME == "Utah" | STATE_NAME == "Kansas")
ggplot(hp_UT_ND_KS, aes(x=temp, ..density.., fill=STATE_NAME))+
geom_density(alpha=0.5)
```

The next example provides the same comparison but for precipitation.

```
ggplot(hp_UT_ND_KS, aes(x=precip2, ..density.., fill=STATE_NAME))+
geom_density(alpha=0.5)
```
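The chained `|` comparisons used to subset the three states can be written more compactly with the `%in%` operator. The sketch below shows that both forms return the same rows. The `demo` data frame and the `keep` vector are hypothetical and only stand in for the county data.

```r
library(dplyr)

# Hypothetical stand-in for the county data
demo <- data.frame(
  STATE_NAME = c("Utah", "Kansas", "Texas", "North Dakota", "Utah"),
  temp = c(9.5, 12.1, 18.3, 4.8, 10.2)
)

# States to keep
keep <- c("North Dakota", "Utah", "Kansas")

# Compact form using %in%
sub_in <- demo %>% filter(STATE_NAME %in% keep)

# Original form using chained comparisons
sub_or <- demo %>%
  filter(STATE_NAME == "North Dakota" | STATE_NAME == "Utah" | STATE_NAME == "Kansas")

# Both approaches select the same rows
identical(sub_in, sub_or)
```

The `%in%` form is easier to extend, since adding a state only requires adding a name to `keep`.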

Similar comparisons could be made using box plots. In this example, I
am comparing the mean elevation by county for all states in the data
set. Note that I did not provide a mapping to the *x*-axis in the
prior box plot examples. Now, I am mapping a categorical variable to
this axis.

```
ggplot(hp_data, aes(x=STATE_NAME, y=elev))+
geom_boxplot()
```

You can map the same variable to more than one graphical parameter.
In this example, I am mapping the state name to the *x*-axis and
to the *fill* color. A legend is automatically generated.

```
ggplot(hp_data, aes(x=STATE_NAME, y=elev, fill=STATE_NAME))+
geom_boxplot()
```

This is a similar example, but for precipitation as opposed to elevation.

```
ggplot(hp_data, aes(x=STATE_NAME, y= precip2, fill=STATE_NAME))+
geom_boxplot()
```

Similar to kernel density plots, the *fill* argument relates
to the fill color while the *color* argument applies to the
outline color. This is demonstrated in the code block below by changing
the *fill* argument to *color*.

```
ggplot(hp_data, aes(x=STATE_NAME, y= precip2, color=STATE_NAME))+
geom_boxplot()
```

It is possible to define a desired order for categorical variables.
In this example, I have reordered the states based on the median mean
county elevation. This was accomplished using *fct_reorder()*
from the forcats package.

`library(forcats)`

```
ggplot(hp_data, aes(x=fct_reorder(STATE_NAME, elev, .fun= median, .desc=TRUE), y=elev, fill=STATE_NAME))+
geom_boxplot()
```

An alternative to a box plot is a **violin plot**, which
is basically a mirrored kernel density surface turned on its side.

```
ggplot(hp_data, aes(x=STATE_NAME, y= precip2, fill=STATE_NAME))+
geom_violin()
```

Since box plots and violin plots can share the same axes assignments,
they can be plotted in the same space. Here, I am specifying two
*geoms*. The last *geom* provided will plot above prior
*geoms*.

```
ggplot(hp_data, aes(x=STATE_NAME, y= precip2, fill=STATE_NAME))+
geom_violin()+
geom_boxplot()
```

In order to make the box plots fit inside of the violin plots, I
decrease the *width* to 0.1. I also make the box plots gray so
that they stand out against the violin plots.

```
ggplot(hp_data, aes(x=STATE_NAME, y= precip2, fill=STATE_NAME))+
geom_violin()+
geom_boxplot(width=0.1, fill="gray")
```

This example is similar to the prior. However, I have made the violin
plots gray and the box plot colors different for each state. Note that
mappings set in the *geom* arguments will supersede those set in
*ggplot()*.

```
ggplot(hp_data, aes(x=STATE_NAME, y= precip2, fill=STATE_NAME))+
geom_violin(fill="gray")+
geom_boxplot(width=0.1)
```

Lastly, here is an example with just three states.

`library(dplyr)`

```
# Keep only the counties in the three states of interest
hp_UT_ND_KS <- hp_data %>% dplyr::filter(STATE_NAME == "North Dakota" | STATE_NAME == "Utah" | STATE_NAME == "Kansas")
ggplot(hp_UT_ND_KS, aes(x=STATE_NAME, y=temp, fill=STATE_NAME))+
geom_violin()+
geom_boxplot(width=.15, fill="gray")
```

## Additional Univariate Graphs

There are some alternatives for comparing a continuous variable
between categories graphically. In this example *geom_point()* is
being used to map the mean elevation for six randomly chosen
counties.

```
# Randomly select six counties without replacement
hp_sub <- hp_data %>% sample_n(6, replace=FALSE)
ggplot(hp_sub)+
geom_point(aes(x=NAME_1, y=elev), size=2)
```

Bar graphs can be created using *geom_bar()*.
*stat=“identity”* indicates that the height of the bar should
represent the magnitude of the variable mapped to the
*y*-axis.

```
# Randomly select six counties without replacement
hp_sub <- hp_data %>% sample_n(6, replace=FALSE)
ggplot(hp_sub, aes(x=NAME_1, y=elev))+
geom_bar(stat="identity")
```
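As a side note, ggplot2 also provides *geom_col()*, which draws bars whose heights come directly from the *y* variable, so it behaves like *geom_bar(stat="identity")*. The sketch below compares the two. The `demo` data frame and its county names are invented for this illustration.

```r
library(ggplot2)

# Hypothetical counties and elevations invented for this illustration
demo <- data.frame(
  county = c("A", "B", "C"),
  elev = c(1200, 850, 1430)
)

# Bar heights taken from the y variable using stat="identity"
p_bar <- ggplot(demo, aes(x = county, y = elev)) +
  geom_bar(stat = "identity")

# Same result using geom_col(), which uses the y values by default
p_col <- ggplot(demo, aes(x = county, y = elev)) +
  geom_col()

p_bar
p_col
```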

## Multivariate Graphs

**Multivariate** graphs involve visualizing the
relationship between two or more variables, and they include
**scatter plots** and **line graphs**. We will
now explore these graph types.

### Scatter Plots

**Scatter plots** in ggplot2 are created using
*geom_point()*, and continuous variables should be mapped to both
the *x*-axis and *y*-axis. In this first example, I map
elevation to the *x*-axis and temperature to the *y*-axis.
You can see evidence of an inverse or indirect relationship between
temperature and elevation: higher elevation tends to result in lower
temperature.

```
ggplot(hp_data, aes(x=elev, y=temp))+
geom_point()
```

You can change the point symbol used by specifying a *shape*
argument. You can also change the *size* of the point symbol.
Some point symbols will have both a *fill* and *color*
aesthetic while others will only have a single color that can be
defined. Quick-R
provides a summary of available point symbols. In this example, I pick a
new symbol, increase the size, and define outline and fill colors.

```
# Copy of the data sorted by percent forest (not used by the plot below)
hp_data_ord <- hp_data[order(hp_data$per_for),]
ggplot(hp_data, aes(x=elev, y=temp))+
geom_point(shape=23, color="blue", fill="red", size=2)
```

You can also add trend lines to scatter plots using
*geom_smooth()*. Here, I have added a **linear**
trend (*method=lm*).

```
# Copy of the data sorted by percent forest (not used by the plot below)
hp_data_ord <- hp_data[order(hp_data$per_for),]
ggplot(hp_data, aes(x=elev, y=temp))+
geom_point()+
geom_smooth(method=lm)
```

You can change the line symbol using the *lty* argument and
the width using *lwd*. Available line styles are also summarized
on Quick-R.

```
# Copy of the data sorted by percent forest (not used by the plot below)
hp_data_ord <- hp_data[order(hp_data$per_for),]
ggplot(hp_data, aes(x=elev, y=temp))+
geom_point()+
geom_smooth(method=lm, lty=2, lwd=3)
```

You may have noticed that a gray outline is placed around the trend.
This is a **95% confidence interval** and is added by
default. To not show this, simply add *se=FALSE*.

```
# Copy of the data sorted by percent forest (not used by the plot below)
hp_data_ord <- hp_data[order(hp_data$per_for),]
ggplot(hp_data, aes(x=elev, y=temp))+
geom_point()+
geom_smooth(method=lm, se=FALSE)
```

Instead of plotting a linear trend, you can also use a
**loess** function (*method=loess*). This fits a
locally weighted regression within a moving window, similar in spirit to a moving average, and would be more appropriate
when a linear relationship does not exist between the two variables
being graphed.

```
# Copy of the data sorted by percent forest (not used by the plot below)
hp_data_ord <- hp_data[order(hp_data$per_for),]
ggplot(hp_data, aes(x=elev, y=temp))+
geom_point()+
geom_smooth(method=loess)
```
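The linear trend drawn by *geom_smooth(method=lm)* comes from an ordinary linear regression, so you can fit the same model yourself with *lm()* to inspect the slope and intercept. The sketch below does this with simulated data. The `demo` data frame and the values used to simulate it are assumptions made for illustration and do not reflect the course data.

```r
library(ggplot2)

# Simulated elevation and temperature values (hypothetical)
set.seed(10)
demo <- data.frame(elev = seq(200, 3000, length.out = 50))
demo$temp <- 15 - 0.006 * demo$elev + rnorm(50, sd = 1)

# Fit the same kind of model that geom_smooth(method = "lm") draws
fit <- lm(temp ~ elev, data = demo)

# Intercept and slope; a negative slope indicates an inverse relationship
coef(fit)

# Scatter plot with the fitted linear trend and its confidence band
p_trend <- ggplot(demo, aes(x = elev, y = temp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
p_trend
```

Fitting the model directly is useful when you want to report the slope or other model statistics alongside the graph.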

If you would like to show more than two variables on a scatter plot,
then you can use additional aesthetic mappings and graphical parameters.
In this example, I have mapped percent forest cover (a continuous
variable) to the point *size* and the state to the point
*color* (a categorical variable). I would argue that this is not
necessarily effective; it is simply an example of how you can apply or
use additional aesthetic mappings.

```
ggplot(hp_data, aes(x=elev, y=temp, size=per_for, color=STATE_NAME))+
geom_point()
```

You can also label data points by adding *geom_text()* and
providing a text variable or text constant using *label*.

`library(dplyr)`

```
# Randomly select ten counties to label
hp_sub <- hp_data %>% sample_n(10, replace=FALSE)
ggplot(hp_sub, aes(x=elev, y=temp))+
geom_point()+
geom_text(aes(label=NAME_1))
```

### Line Graphs

*Line graphs* connect adjacent data points with a line as
opposed to representing each data value as a separate point. They are
often used to represent trends, such as trends over time in a
**time series graph**. Since the *x*-axis and
*y*-axis can be the same for scatter plots and line graphs, they
can be combined in one plot using multiple *geoms*.

We will now read in a new data set: **maple_leaf.csv**.
This is spectral reflectance data for a maple leaf from the United
States Geological Survey (USGS) Spectral Library. The original data are
available from the USGS Spectral Library.

`leaf <- read.csv("D:/mydata/ggplot2_p1/maple_leaf.csv", header=TRUE, sep=",")`

Line graphs are created using *geom_line()*. In this example,
I am mapping wavelength (in micrometers) to the *x*-axis and
spectral reflectance (as a proportion) to the *y*-axis. This
generates a **spectral reflectance curve** where peaks
represent reflectance and troughs represent absorption. This should be
familiar if you have studied remote sensing.

```
ggplot(leaf, aes(x=wav, y=reflec))+
geom_line()
```

As another example, we will now read in data from the AVIRIS
hyperspectral sensor showing spectral reflectance at four different
locations or pixels. Using the *str()* function, you can see that
the center wavelength of each band has been provided (in nanometers
(nm)) and that data for each location is in separate columns (A through
D). Spectral reflectance values have been multiplied by 100 to store
them using an integer as opposed to a double data type.

```
aviris <- read.csv("D:/mydata/ggplot2_p1/aviris_spectral_data.csv", header=TRUE, sep=",")
str(aviris)
'data.frame': 207 obs. of 5 variables:
$ Center: num 366 376 385 395 405 ...
$ A : int 2299 2404 2576 2679 2719 2932 3048 3070 3104 3103 ...
$ B : int 2983 2795 2793 2958 3016 3289 3444 3491 3540 3570 ...
$ C : int 1381 1787 1896 1957 2005 2174 2267 2253 2237 2209 ...
$ D : int 1865 1954 2051 2150 2200 2341 2454 2483 2463 2457 ...
```

In this first example, I plot spectral reflectance at Site A.
Wavelength was assigned to the *x*-axis and spectral reflectance
to the *y*-axis. Within the *aes()* function I divide the
values in “A” by 100 to return spectral reflectance as a percent.

```
ggplot(aviris, aes(x=Center, y=A/100))+
geom_line()
```

In this example, I plot all four locations as separate lines using
different colors. You might be wondering why I simply didn’t assign the
sample location as a categorical variable to the *color*
aesthetic. This is because each location is stored as a separate column
as opposed to storing the sample as a categorical variable in a single
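column. So, these data are in a different “shape” than the examples
above and must be treated differently. Because the color names are placed inside *aes()*, they act as legend labels rather than literal colors, so the lines are drawn using the default palette. Note again that you can add
multiple *geoms* to the same plot.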

```
ggplot(aviris, aes(x=Center, y=A/100, color="red"))+
geom_line()+
geom_line(aes(color="yellow", y=B/100))+
geom_line(aes(color="blue", y=C/100))+
geom_line(aes(color="green", y=D/100))
```
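An alternative to adding one *geom_line()* per column is to reshape the data so that the sample location is stored as a categorical variable in a single column, which can then be mapped to *color*. The sketch below uses *pivot_longer()* from the tidyr package, which is not otherwise used in this module. The `demo` data frame is a small stand-in built from the first few values shown in the *str()* output, and the column names `site` and `reflec` are my own choices.

```r
library(ggplot2)
library(tidyr)

# Small stand-in for the AVIRIS table (first four rows from the str() output)
demo <- data.frame(
  Center = c(366, 376, 385, 395),
  A = c(2299, 2404, 2576, 2679),
  B = c(2983, 2795, 2793, 2958),
  C = c(1381, 1787, 1896, 1957),
  D = c(1865, 1954, 2051, 2150)
)

# Reshape from one column per site to one row per wavelength-site pair
demo_long <- pivot_longer(
  demo,
  cols = c(A, B, C, D),
  names_to = "site",
  values_to = "reflec"
)

# Map the new categorical column to color; one line is drawn per site
p_sites <- ggplot(demo_long, aes(x = Center, y = reflec / 100, color = site)) +
  geom_line()
p_sites
```

With the data in this long shape, the legend labels come from the site names, and adding a fifth site would not require any change to the plotting code.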

## Overplotting

We will now discuss means to deal with **overplotting**.
Overplotting occurs when you have a large number of data points that
cannot be easily distinguished in the graph space due to crowding and
overlap.

### Hexagonal Binning

As an example, the graph below contains 15,000 points, which were
generated using the *rnorm()* function. There are too many data
points in the plot space to see a clear pattern.

```
# Generate two sets of 15,000 normally distributed random values
d1 <- rnorm(15000, mean=0, sd=1)
d2 <- rnorm(15000, mean=0, sd=1)
d3 <- data.frame(d1, d2)
ggplot(d3, aes(x=d1, y=d2))+
geom_point()
```

One method to deal with this issue is to use **hexagonal
binning**, where the two-dimensional graph space is divided into
hexagons of equal size and the number of data points in each hexagon is
counted. You can change the *bins* argument, which sets the number of
bins along each axis, to obtain larger or
smaller hexagons. Creating hexagonal bins with ggplot2 requires the
**hexbin** package, so this package may need to be
installed to execute the example.

```
ggplot(d3, aes(x=d1, y=d2))+
geom_hex(bins=10)
```

### Rectangular Binning

As opposed to using hexagonal bins, you can obtain rectangular or
square bins with *geom_bin2d()*. Here, each bin measures 1 by 1
units.

```
ggplot(d3, aes(x=d1, y=d2))+
geom_bin2d(binwidth=c(1,1))
```

### 2D Kernel Density

Another option is using a **two-dimensional kernel-density
surface**. In the example, I am using *stat_density2d()*
to generate the kernel density estimate. I am then plotting contours
using *geom_density2d()*.

```
ggplot(d3, aes(x=d1, y=d2))+
stat_density2d(aes(fill=..level..), alpha=.5, size=2, bins=10, geom="polygon")+
geom_density2d(color="black")
```

### Jitter

You can also add random noise to data points to provide more
separation. This can be accomplished using *geom_jitter()*. In
the first graph, the points are being plotted at the exact specified
coordinates. In the second graph, random noise has been added in the
*x* and *y* directions.

```
# Build a 5-by-5 grid of points
a1 <- rep(c(1, 2, 3, 4, 5), 5)
l1 <- rep(c(1), 5)
l2 <- rep(c(2), 5)
l3 <- rep(c(3), 5)
l4 <- rep(c(4), 5)
l5 <- rep(c(5), 5)
a2 <- c(l1, l2, l3, l4, l5)
a3 <- data.frame(a1, a2)
ggplot(a3, aes(x=a2, y=a1))+
geom_point(size=3)
```

```
ggplot(a3, aes(x=a2, y=a1))+
geom_jitter(size=3)
```
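By default, *geom_jitter()* chooses the amount of noise for you. If you want more control, the *width* and *height* arguments set the maximum displacement in each direction, and calling *set.seed()* beforehand makes the random offsets repeatable. The sketch below illustrates this. The `grid_pts` data frame, the seed, and the value of 0.2 are arbitrary choices for illustration.

```r
library(ggplot2)

# A 5-by-5 grid of points, similar to the example above (hypothetical name)
grid_pts <- expand.grid(a1 = 1:5, a2 = 1:5)

# Fix the random seed so the jitter is the same each time the code is run
set.seed(123)

# Displace each point by at most 0.2 units in the x and y directions
p_jit <- ggplot(grid_pts, aes(x = a2, y = a1)) +
  geom_jitter(size = 3, width = 0.2, height = 0.2)
p_jit
```

Keeping the jitter small relative to the spacing between values preserves the overall pattern while still separating overlapping points.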

## A Note on Graphical Primitives

All graph elements in ggplot2 are created using graphic primitives to plot line segments, paths, polygons, etc. This is demonstrated in the following graph where I have defined coordinates to represent a polygon, a rectangle, and a line segment to plot in the graph space. These primitives can be used to generate more complex graph features.

```
library(ggplot2)
# Vertex coordinates for the polygon
poly <- data.frame(y=c(4,2,1,2), x=c(2,4,1,1))
ggplot(poly)+
geom_polygon(aes(x=x, y=y), colour="gray", fill="red", size=2)+
geom_rect(xmin=1, xmax=3, ymin=3, ymax=4, fill=NA, color="Yellow", size=2)+
geom_segment(aes(x=2, xend=4, y=2, yend=3), size=2, color="blue")
```

## A Note on Coordinate Systems

We have been making use of **Cartesian coordinate
systems** throughout this section. However, other systems are
available. Below is an example of a graph created using polar
coordinates. This is a **Coxcomb Plot** where the angle
represents a category and the size represents a quantity. You can also
make maps using map coordinates and the **ggmap**
package.

```
library(ggplot2)
x1 <- c("A", "B", "C", "D", "E")
y1 <- c(14, 28, 31, 17, 14)
data1 <- data.frame(x1, y1)
# Build the bar graph first, then switch to polar coordinates
pcord <- ggplot(data1, aes(x=x1, y=y1))+
geom_bar(stat="identity")
pcord + coord_polar()
```

## Concluding Remarks

I hope this introduction to ggplot2 has provided you with an adequate understanding of the package’s terminology, syntax, and aesthetic mappings. The graphs generated here would need some work before they could be considered final products for a presentation, report, write-up, or publication. In the next section, we will explore a variety of techniques to make ggplot2 graphs pretty including adding text, editing axis scales and labels, changing fonts and colors, using themes, customizing legends, adding geometric features, and combining multiple graphs to a single layout. We will also explore how to export graphs as raster and vector graphics.