---
title: "Exploratory data analysis"
author: ""
---
## Aims of the worksheet
Exploratory data analysis is the process of looking through a dataset to see what can be learned from it. At a minimum this entails looking at the way the data is structured and organized to figure out what kinds of information it has, as well as to identify potential gaps and errors in the data. An exploratory data analysis will also compute summary statistics on the data and make plots of relationships between the variables. Depending on the state of the data, it may also have to reshape or clean up the data. And finally, an exploratory data analysis may make use of more advanced techniques, such as creating statistical models of the data, clustering the data, reducing dimensionality through principal components analysis, or the like. While there are a number of common techniques, the actual approach taken will depend greatly on the actual dataset that you are working with. Over time you will develop a checklist of techniques that you use. But a good rule of thumb is to start with the simplest techniques, then build up in complexity as it proves itself to be necessary.
You will find these chapters or books on exploratory data analysis to be helpful: Arnold and Tilton's *Humanities Data in R*, ch. 3--5; Kaplan's *Data Computing*, ch. 14; and Roger Peng's *Exploratory Data Analysis with R*.
The aim of this worksheet is to introduce a few techniques for exploratory data analysis on the Methodists dataset, then introduce a new dataset on which you will practice. But don't forget the techniques taught in the worksheets on plotting (especially scatter plots and histograms) and in data manipulation.
## Loading and cleaning data
We will begin by loading the Methodist data.
```{r, message=FALSE}
library(tidyverse)
library(ggraph)
library(historydata)
data("methodists")
methodists
```
Let's also write a function that figures out the decade from the year, and add the decade as a column to the data frame.
```{r}
decade <- function(x) {
trunc(x / 10) * 10
}
# To show that it works
decade(1795:1805)
methodists <- methodists %>%
mutate(decade = decade(year))
```
And we will create an alternative version of the dataset where the type of members is its own column.
```{r}
methodists_tidy <- methodists %>%
gather(type_of_member, number_of_members, starts_with("members_"))
```
## Some techniques of exploratory data analysis
### Summary statistics
Using dplyr's `count()` function, along with summary statistics using `group_by()` and `summarize()` along with `mean()`, `median()` and the like can give you a quick sense of what is contained in the dataset.
For instance, we can get a quick overview of how many meetings are in each conference.
```{r}
methodists %>%
count(year, conference) %>%
filter(!is.na(conference)) %>%
ggplot(aes(x = year, y = n, color = conference)) +
geom_line()
```
Or we can extend that to get a sense of the size of each conference:
```{r}
methodists %>%
group_by(year, conference) %>%
summarize(members_total = sum(members_total)) %>%
filter(!is.na(conference)) %>%
ggplot(aes(x = year, y = members_total, color = conference)) +
geom_line() +
theme(legend.position = "bottom")
```
The `summary()` function can also be quite useful. It can be used on data frames or vectors (as well as other objects).
```{r}
summary(methodists)
summary(methodists$members_total)
```
### Measures of density and distribution
Summary statistics like the mean, median, and quartile are useful for getting a sense of a variable's central tendencies, but they are less useful for getting a sense of the distribution of values in a variable. One very useful technique is to see how the values are distributed.^[Note that the term [distribution](https://en.wikipedia.org/wiki/Probability_distribution) has a technical meaning in statistics, though we are using it more colloquially here.]
Here we use the `geom_density()` geometry in ggplot2 to show the density distribution of values. Here we are figuring out the density distribution for the `members_total` value. The value on the y-axis may not be intuitively obvious. It is the proportion of the values in the dataset that have the value on the x-axis. So, in the 1770s about 0.0015 (0.15%) of churches had a membership of about 250. The total area under the density curve will be 1.
```{r}
ggplot(methodists, aes(x = members_total, fill = as.factor(decade))) +
geom_density() +
facet_wrap(~decade) +
ggtitle("Distribution of the total size of congregations by decade") +
scale_x_continuous(breaks = seq(0, 1500, 250), limits = c(0, 1500)) +
guides(fill = FALSE) +
ggtitle("Density plots comparing sizes of circuits by decade")
```
Another way to get a sense of the distribution of values is to use a boxplot. A boxplot show the distances between the quartiles (the box), values extending beyond the quartiles (the whiskers), plus outliers (the dots). (See `?geom_boxplot` for the full details.) Here we compare the distribution in the number of white and black members by decade.
```{r}
methodists_tidy %>%
filter(year >= 1802,
type_of_member %in% c("members_black", "members_white")) %>%
ggplot(aes(x = type_of_member, y = number_of_members)) +
geom_boxplot(outlier.color = "gray") +
ylim(0, 1000) + facet_wrap(~ decade, ncol = 2) +
ggtitle("Boxplots comparing white and black membership by decade")
```
Yet another way to get a sense of the distribution of a single variable is to use a histogram, which puts values into bins, then counts how many values fall into each of those bins. In this example, we count how many circuits fall into each bin, with each bin having a width of 100 members.
```{r}
methodists %>%
filter(year == 1825) %>%
ggplot(aes(x = members_total)) +
geom_histogram(binwidth = 50) +
xlim(0, 2000) +
ggtitle("Histogram of the number of members in a circuit in 1825")
```
### Scatter plots and smoothing
The techniques above for showing the distribution work for a single variable. Often we want to know the relationship between two variables. A scatter plot is often the easiest way to get a sense of what is in two variables. Here we create a scatterplot of the relationship between white and black members by decade. It can be difficult to understand the relationship between the two variables, especially in a case like this where the points are overplotted and the values are widely dispersed. By using `geom_smooth()` we can fit a modeling function to the data. (See `?geom_smooth` and especially the `method =` parameter for the kinds of models you can use.) These models try to capture the relationship between the $x$ variable and the $y$ variable (the colored line), along with the uncertainty (the gray band around the line). In this case the line show the typical relationship between the number of white and black members.
```{r}
methodists %>%
filter(decade >= 1790, decade <= 1820) %>%
ggplot(aes(members_white, members_black)) +
facet_wrap(~ decade) +
geom_point(alpha = 0.1, shape = 1) +
coord_cartesian(xlim = c(0, 1000), ylim = c(0, 500)) +
geom_smooth()
```
### Clustering
Clustering is the process of taking observations which have many variables and grouping them together. Similar observations will be in the same cluster. One mode of clustering which is included in base R is [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering).
We are going to use a subset of the Methodists data from 1829, so that the data is small enough to plot easily.
```{r}
library(stringr)
set.seed(822)
methodists_for_clustering <- methodists %>%
filter(year == 1829) %>%
select(conference, district, meeting, starts_with("members_"),
-members_general) %>%
mutate(label = str_sub(meeting, 1, 12)) %>%
sample_n(50)
```
Next we compute the distance between each observation (`dist()`) and then after figuring out how far away they are from one another, then figure out the hierarchical clusters (`hclust()`).
```{r}
# Use only the actual data columns
methodists_clust <- methodists_for_clustering %>%
select(starts_with("members_")) %>%
dist() %>%
hclust()
methodists_clust$labels <- methodists_for_clustering$label
```
We can now see how the groups are clustered in a "dendrogram."
```{r, message=FALSE}
# Using the ggraph package
ggraph(methodists_clust, "dendrogram") +
geom_edge_elbow() +
geom_node_text(aes(label = label), angle = 90, hjust = 1) +
ylim(-1000, 2000)
# Cf. base R plot of same data
plot(methodists_clust, hang = -1)
```
We have to decide where to cut the dendrogram using the `cutree()` function. We can specify the number of groups we want with the `k = ` argument or the height at which to cut the tree with the `h = ` argument. Then we can put our groups back into the data frame.
```{r}
methodists_for_clustering <- methodists_for_clustering %>%
mutate(cluster = cutree(methodists_clust, k = 4))
```
Finally, we can compute some summary statistics and plot the clusters to see if they make sense.
```{r}
methodists_for_clustering %>%
group_by(cluster) %>%
summarize(n = n(),
mean_total = mean(members_total, na.rm = TRUE),
mean_black = mean(members_black, na.rm = TRUE),
mean_white = mean(members_white, na.rm = TRUE)) %>%
mutate(percent_black = mean_black / (mean_white + mean_black)) %>%
arrange(cluster)
ggplot(methodists_for_clustering,
aes(x = members_white, y = members_black, color = as.factor(cluster))) +
geom_point()
```
Another way that we can do clustering is with the k-means algorithm. See the [vignette on that topic](https://cran.rstudio.com/web/packages/broom/vignettes/kmeans.html) in the `broom` package for an explanation of how to make a plot using k-menas.
```{r}
methodists_for_kmeans <- methodists_for_clustering %>%
select(starts_with("members_"))
methodists_kmeans <- kmeans(methodists_for_kmeans, centers = 4)
methodists_kmeans
```
## Practice exploratory data analysis: A New Nation Votes
For practice with exploratory data analysis, we will use the *A New Nation Votes* dataset. This is a dataset which contains election returns in the early American republic, from 1787 to 1825. You should see the [project webpage](http://elections.lib.tufts.edu/) and read about the dataset to find out what is in it and how it was collected. You may also wish to browse the online version of the database. Then you should then download the data. You may use either the [full dataset](http://dl.tufts.edu/election_datasets) provided by Tufts, which includes every kind of election. Or you can use [this version](https://github.com/mapping-elections/elections-data) which has only the county-level data for Congressional elections.
Here is some code you can use to download the Tufts version of the data and load it in as a data frame.
```{r, eval = FALSE}
library(readr)
url <- "http://dl.tufts.edu/file_assets/generic/tufts:MS115.003.001.00001/0"
if (!file.exists("all-votes.tsv")) {
download.file(url, "nnv-all-votes.zip")
unzip("nnv-all-votes.zip", files = "all-votes.tsv")
}
nnv <- read_tsv("all-votes.tsv")
```
In a separate Rmd file, perform an exploratory analysis of the data. You should attempt at least the first three of these techniques: (1) aggregating the data through counting or summarizing; (2) plotting the distribution of certain variables; (3) scatterplots (or other plots) of multiple variables in relationship with one another; (4) clustering.
Some questions to think about: What kinds of elections were there? How many of each kind of election? How many candidates and how often do they appear? Which parties? What does each row in the dataset represent? Which years are in the dataset, and how many elections are there in each year? Which states are represented? How do the data change over time? Can you represent the change in a county? How about an individual's political fortunes? Is it more useful to represent the data as counts of votes or as percentages?