---
title: "Data manipulation with dplyr"
author: ""
---
## Aims of this worksheet
One of the key reasons to use R is to be able to manipulate data with ease. After completing this worksheet you will be able to work with the most commonly used data manipulation verbs provided by the [dplyr](https://cran.r-project.org/package=dplyr) and [tidyr](https://cran.r-project.org/package=tidyr) packages.
We will begin by loading the necessary packages and data. We will use the `methodists` dataset from the [historydata](https://cran.r-project.org/package=historydata) package. This dataset contains membership figures for Methodist meetings (which were organized into districts, which were in turn organized into conferences) for the early nineteenth century.
```{r, message=FALSE}
library(tidyverse)
library(historydata)
data(methodists)
methodists
```
## Selecting columns (`select()`)
The first data manipulation verb that we are going to use is `select()`. This function lets us pass the names of the columns that we want to keep.
```{r}
methodists %>%
select(year, meeting, members_total)
```
Notice that we have not actually changed the data stored in `methodists` until we assign the changed data to a variable.
Read the documentation for this function, `?select`.
(@) Select the columns for `year`, `meeting`, as well as all columns that begin with the word `members_`.
```{r}
```
(@) Remove the column `url`.
```{r}
```
## Filtering rows (`filter()`)
The `select()` function lets us pick certain columns. The `filter()` function lets select certain rows based on logical conditions. For example, here we get the only the meetings where the total number of members is at greater than 1,000.
```{r}
methodists %>%
filter(members_total > 1000)
```
(@) Get just the rows from New York in 1800.
```{r}
```
(@) Which Methodist meetings had only black members?
```{r}
```
(@) Bonus: some of these Methodist meetings were "missions." Load the `stringr` package and look at the documentation for `?str_detect`. Can you find which rows represent missions?
```{r}
```
## Creating new columns (`mutate()`)
Very often one will want to create a new column based on other columns in the data. For instance, in our Methodist data, there is a column called `year`, but that column represents the year that the minutes were reported. The membership figures are actually for the previous year. Here we create a new column called `year_recorded`, where each value is one less than in `year`.
```{r}
methodists %>%
mutate(year_recorded = year - 1) %>%
select(year, year_recorded, meeting)
```
Notice that we chained the data manipulation functions using the pipe (`%>%`). This lets us create a pipeline were we can do many different manipulations in a row.
(@) Create two new columns, one with the percentage of white members, and one with the percentage of black members.
```{r}
```
## Sorting columns (`arrange()`)
Often we want to sort a data frame by one of its columns. This can be done with the verb `arrange()`. By default `arrange()` will sort from least to greatest; we can use the function `desc()` to sort from greatest to least. In this example, we sort the data frame to get the meetings with the highest number of white members.
```{r}
methodists %>%
arrange(desc(members_white))
```
(@) Which meetings had the highest number of black members? Select only the necessary columns so that the results print in a meaningful way.
```{r}
```
(@) Which meetings had the high percentage of black members without being entirely black?
```{r}
```
## Split-apply-combine (`group_by()`)
Notice that in the example above the `arrange()` function sorted the entire data frame. So when we looked for the circuits with the largest number of members, we got rows from 1825, then 1830, then 1829, then 1830, and so on. What if we wanted to get the biggest circuit from each year?
We can solve this kind of problem with what Hadley Wickham calls the "split-apply-combine" pattern of data analysis. Think of it this way. First we can *split* the big data frame into separate data frames, one for each year. Then we can *apply* our logic to get the results we want; in this case, that means sorting the data frame. We might also want to get just the top one row with the biggest number of members. Then we can *combine* those split apart data frames into a new data frame.
Take a simple example using the `top_n()` function, which returns the top `n` (in this case, top 1) results for a particular column. After selecting a few columns, we get the row in the data frame which has the highest value for `members_black`.
```{r}
methodists %>%
select(year, meeting, members_total, members_black) %>%
top_n(1, members_black)
```
We can change how that code works by using the `group_by()` function. Now we get the one row for each unique year in the dataset.
```{r}
methodists %>%
select(year, meeting, members_total, members_black) %>%
group_by(year) %>%
top_n(1, members_black)
```
We get the same results more concisely and reliably, though the steps of "split-apply-combine" are perhaps somewhat less easy to see.
(@) For each year, which was the biggest circuit?
```{r}
```
(@) For each year, which church had the biggest percentage of black members without being entirely black?
```{r}
```
(@) For the year 1825, what was the biggest meeting in each conference? In each district?
```{r}
```
(@) For each year, what was the biggest church in the Baltimore conference?
```{r}
```
## Summarizing or aggregating data (`summarize()`)
In the examples using `top_n()` we performed a very simple kind of data summary, where we took the single row with the biggest value in a given column. This essentially boiled many rows of a data frame down into a single row. We would like to be able to summarize or aggregate a data frame in other ways as well. For instance, we often want to take the sum or the mean of a given column. We can do this using the `summarize()` function in conjunction with the `group_by()` function.
In this example, we group by the year the minutes were taken. Then we find the total number of white members for each year.
```{r}
methodists %>%
group_by(year) %>%
summarize(total_members_white = sum(members_white, na.rm = TRUE))
```
Notice that we get one row in the recombined data frame for each group in the original data frame. The value in the new column is the result of a function (in this case, `sum()`) applied to the columns in each of the split apart data frames.
There is also a special case where we might want to know how many rows were in each of the split apart (or grouped) data frames. We can use the special `n()` function to get that count. (This is such a common thing to do that dplyr provides the special function `count()` to do this in an abbreviated way. You can look up that function's documentation to see how it works.)
```{r}
methodists %>%
group_by(year) %>%
summarize(total_meetings = n())
```
(@) How many meetings were there in each conference in each year since 1802?
```{r}
```
(@) What is the average number of white, black, Indian and total members for each year since 1786?
```{r}
```
(@) Make a plot of the average percentage of black members over time.
```{r}
```
(@) Make a plot of the average and median number of members over time.
```{r}
```
(@) What was the average number of members in each conference for each year? Can you also make a plot of this?
```{r}
```
(@) What was the average percentage of black members in each conference for each year? Can you also make a plot of this?
```{r}
```