3 Dates

Computational historians, unsurprisingly, often have to deal with dates. Even basic operations with dates can be difficult, however. For example, since months and years have unequal numbers of days, even figuring out the duration of time between two dates can be tricky. But R includes a special type of object, Date, which makes the task much easier. As long as you are dealing with Gregorian dates, you can do almost anything that you might need to do with dates using R’s built-in functions and the lubridate package.[17]

We should be precise in our definition of what a date is. A date must be specified to an exact year, month, and day. If you are dealing only with years, then a simple numeric or integer vector is sufficient. If you are dealing with dates and times, then you will need to specify the time and possibly the time zone as well. This chapter will not get into dates and times, but if you understand how to work with date objects, those principles are easily extended to date-time objects.[18]

In this chapter we will use the lubridate package alongside our customary tidyverse and historydata packages.

library(tidyverse)
library(historydata)
library(lubridate)

3.1 Years

If you are dealing only with dates in the form of years, then a numeric column in your data with the year information is sufficient. For example, the dijon_prices dataset in historydata contains price series for various commodities in Dijon, France, from 1568 to 1630.

dijon_prices
#> # A tibble: 1,110 × 6
#>     commodity      measure  year price     citation citation_date
#>         <chr>        <chr> <dbl> <dbl>       <fctr>        <fctr>
#> 1  best wheat quarteranche  1568 11.67 B 205, f.95v      17/12/68
#> 2  good wheat quarteranche  1568 10.00 B 205, f.95v      17/12/68
#> 3 mixed grain quarteranche  1568  8.33 B 205, f.95v      17/12/68
#> 4         rye quarteranche  1568  6.67 B 205, f.95v      17/12/68
#> 5      barley     boisseau  1568  4.17 B 205, f.95v      17/12/68
#> 6        oats     boisseau  1568  3.00 B 205, f.95v      17/12/68
#> # ... with 1,104 more rows

Using the year column, we can filter the data frame to a particular year as we are accustomed to doing in dplyr, or we can use the year column as a variable in a ggplot2 plot.

dijon_prices %>% 
  filter(year == 1600)
#> # A tibble: 18 × 6
#>     commodity      measure  year price     citation citation_date
#>         <chr>        <chr> <dbl> <dbl>       <fctr>        <fctr>
#> 1  best wheat quarteranche  1600  20.0 B238, f.161v      28/11/00
#> 2  good wheat quarteranche  1600  17.5 B238, f.161v      28/11/00
#> 3 mixed grain quarteranche  1600  14.0 B238, f.161v      28/11/00
#> 4         rye quarteranche  1600  10.0 B238, f.161v      28/11/00
#> 5      barley     boisseau  1600  10.0 B238, f.161v      28/11/00
#> 6        oats     boisseau  1600  10.0 B238, f.161v      28/11/00
#> # ... with 12 more rows
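As a minimal sketch of the second use, the year column can be mapped directly to the x-axis of a plot. The choice of the "best wheat" commodity, of points rather than lines, and of the axis labels is only for illustration.

```r
library(dplyr)
library(ggplot2)
library(historydata)

# Keep one commodity so that the prices are comparable across years.
wheat <- dijon_prices %>%
  filter(commodity == "best wheat")

# Because year is numeric, ggplot2 treats it as a continuous axis.
wheat_plot <- ggplot(wheat, aes(x = year, y = price)) +
  geom_point() +
  labs(x = "Year", y = "Price", title = "Price of best wheat in Dijon")

wheat_plot
```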

Often, though, we have a year embedded in some other kind of variable, and we would like to extract that year so that we can manipulate it directly. For example, we might have document IDs or filenames that contain a year, or a set of sentences that include dates.

doc_ids <- c("NY1850", "CA1851", "CA1850", "WA1855", "NV1861")
files <- c("sotu-1968.txt", "sotu-1969.txt", "sotu-1970.txt", "sotu-1971.txt")
sentences <- c("George Washington became president in 1789.",
               "John Adams became president in 1797.",
               "In 1801, Thomas Jefferson became president.",
               "James Madison was inaugurated in 1809.")

In each of these cases, we can describe what we need to do. We need to extract a four-character sequence of digits, which will be our year, and we need to convert that sequence of characters into an integer that we can treat as a number. You will often find yourself writing a function that looks like this.

extract_year <- function(x) {
  # Anything other than a character vector is probably a mistake
  stopifnot(is.character(x))
  # Pull out the first run of four digits from each element
  year_char <- stringr::str_extract(x, "\\d{4}")
  year_int <- as.integer(year_char)
  year_int
}

This function first checks that the input vector x is a character vector; if it’s not, then the input is probably a mistake. Then it uses the str_extract() function from the stringr package to find the first sequence of four digits. (That is the meaning of the regular expression "\\d{4}".) Then it turns the resulting character vector into an integer and returns it. Note that the pattern will also match the first four digits of a longer number, so check that your identifiers do not contain other digit sequences.

We can test this function on our sample data.

extract_year(doc_ids)
#> [1] 1850 1851 1850 1855 1861
extract_year(files)
#> [1] 1968 1969 1970 1971
extract_year(sentences)
#> [1] 1789 1797 1801 1809

Because this function is vectorized, we could use it in a dplyr expression in order to create a new column of years from an existing column.
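A minimal sketch of that use follows. The docs data frame and its doc_id column are hypothetical, built from the sample document IDs above, and the helper is repeated in compact form so that the example runs on its own.

```r
library(dplyr)
library(stringr)

# Compact version of the helper defined earlier in this section.
extract_year <- function(x) {
  stopifnot(is.character(x))
  as.integer(str_extract(x, "\\d{4}"))
}

# Hypothetical data frame of document IDs, for illustration only.
docs <- tibble(doc_id = c("NY1850", "CA1851", "CA1850", "WA1855", "NV1861"))

# mutate() passes the whole column to extract_year() at once.
docs_with_year <- docs %>%
  mutate(year = extract_year(doc_id))

docs_with_year
```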

3.2 R’s date objects

R includes a Date class for representing dates and doing calculations with them. You can turn a text representation of a date into a Date object using the as.Date() function. This function takes a format = argument that lets you specify the order of the elements of the date, but the default is to accept dates in the form YYYY-MM-DD. Let’s create two date objects.

fort_sumter <- as.Date("1861-04-12")
appomattox <- as.Date("1865-04-09")
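When your dates are not in that default form, the format = argument describes their layout using the conversion codes documented at ?strptime. A brief sketch, with the day/month/year strings invented for illustration:

```r
# The same two dates written day/month/year, as a source might give them.
fort_sumter_dmy <- as.Date("12/04/1861", format = "%d/%m/%Y")
appomattox_dmy <- as.Date("09/04/1865", format = "%d/%m/%Y")

fort_sumter_dmy
appomattox_dmy

# Both should equal the Date values created from the ISO-style strings.
fort_sumter_dmy == as.Date("1861-04-12")
appomattox_dmy == as.Date("1865-04-09")
```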

Now that we have created our date objects, we can use comparison operators to figure out whether dates come before or after one another.

fort_sumter <= appomattox
#> [1] TRUE
fort_sumter >= appomattox
#> [1] FALSE
fort_sumter == appomattox
#> [1] FALSE

We can also calculate the difference in time using the - operator. Note that the returned object prints out the length in days, but it is actually an object of class difftime. You can measure a time difference in different units, such as weeks, by using the difftime() function.

appomattox - fort_sumter
#> Time difference of 1458 days
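The following sketch shows the class of the result and the same difference expressed in weeks. The variable names war_length and war_weeks are introduced here only for illustration.

```r
fort_sumter <- as.Date("1861-04-12")
appomattox <- as.Date("1865-04-09")

# Subtraction returns a difftime object, not a plain number.
war_length <- appomattox - fort_sumter
class(war_length)

# Ask for the same difference in weeks instead of days.
war_weeks <- difftime(appomattox, fort_sumter, units = "weeks")
war_weeks

# Convert to a bare number when you need to do arithmetic with it.
as.numeric(war_length, units = "days")
```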

Another useful operation with dates from base R is creating a sequence between two dates at some regular interval.

seq(from = as.Date("1860-01-01"), to = as.Date("1861-01-01"), by = "month")
#>  [1] "1860-01-01" "1860-02-01" "1860-03-01" "1860-04-01" "1860-05-01"
#>  [6] "1860-06-01" "1860-07-01" "1860-08-01" "1860-09-01" "1860-10-01"
#> [11] "1860-11-01" "1860-12-01" "1861-01-01"

3.3 Parsing dates with lubridate

While there are other things that you can do with dates in base R, you are almost always better off using the lubridate package. That package provides many additional functions and some additional classes for dealing with dates in a sensible way.

The lubridate package provides a series of functions in the form mdy(), ymd(), and dmy(), where those letters correspond to the position of the month, day, and year in a date. As long as your dates are in a reasonably consistent format, lubridate should be able to parse them. For example, lubridate can parse these different ways of writing the same dates.

mdy(c("September 17, 1862", "July 21, 1861", "July 1, 1863"))
#> [1] "1862-09-17" "1861-07-21" "1863-07-01"
dmy(c("17 September 1862", "21 July 1861", "1 July 1863"))
#> [1] "1862-09-17" "1861-07-21" "1863-07-01"
mdy(c("9/17/1862", "07/21/1861", "07/01/1863"))
#> [1] "1862-09-17" "1861-07-21" "1863-07-01"
ymd(c("1862-09-17", "1861-07-21", "1863-07-01"))
#> [1] "1862-09-17" "1861-07-21" "1863-07-01"

You often don’t have a choice about the formats of dates in data that you receive, so it is useful that lubridate can parse most reasonably consistent formats into Date objects.
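Data that you receive may also contain entries that are not dates at all. As a hedged sketch, one way to find them is to parse quietly and then look for missing values. The raw_dates vector, including the "n.d." entry for an undated source, is invented for illustration.

```r
library(lubridate)

# Hypothetical transcribed dates; "n.d." marks an undated source.
raw_dates <- c("September 17, 1862", "July 21, 1861", "n.d.")

# quiet = TRUE suppresses the warning about entries that fail to parse.
parsed <- mdy(raw_dates, quiet = TRUE)
parsed

# Entries that could not be parsed come back as NA, so list them for review.
raw_dates[is.na(parsed)]
```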

3.4 Other operations on dates

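Once you have a vector of dates, you might need to extract some component of the date, such as the year or the day of the week. The lubridate package provides functions such as year(), month(), and day() to pull out those pieces, and base R’s weekdays() function gives the name of the day of the week.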

gettysburg <- mdy("July 1, 1863", "July 2, 1863", "July 3 1863")
year(gettysburg)
#> [1] 1863 1863 1863
month(gettysburg)
#> [1] 7 7 7
day(gettysburg)
#> [1] 1 2 3
weekdays(gettysburg)
#> [1] "Wednesday" "Thursday"  "Friday"
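lubridate also has its own function for the day of the week, wday(). The sketch below sets week_start explicitly so that the numbering does not depend on any local option. The day names returned with label = TRUE depend on your locale.

```r
library(lubridate)

gettysburg <- mdy(c("July 1, 1863", "July 2, 1863", "July 3, 1863"))

# With week_start = 7, days are numbered 1 to 7 starting from Sunday.
day_numbers <- wday(gettysburg, week_start = 7)
day_numbers

# label = TRUE returns an ordered factor of day names instead of numbers.
wday(gettysburg, label = TRUE, abbr = FALSE, week_start = 7)
```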

Sometimes you have dates that are specific down to the day, but you are interested in aggregating them by year, month, or week. For example, you might have the dates of newspaper issues, but want to know how many papers were published in a year or a month. For that you can use lubridate’s round_date(), floor_date(), and ceiling_date() functions.

floor_date(gettysburg, unit = "year")
#> [1] "1863-01-01" "1863-01-01" "1863-01-01"
floor_date(gettysburg, unit = "month")
#> [1] "1863-07-01" "1863-07-01" "1863-07-01"
floor_date(gettysburg, unit = "week")
#> [1] "1863-06-28" "1863-06-28" "1863-06-28"

Note that floor_date() will give you the date at the start of the week, month, or year, while round_date() will give you the nearest start of a week, month, or year.

floor_date(gettysburg, unit = "week")
#> [1] "1863-06-28" "1863-06-28" "1863-06-28"
round_date(gettysburg, unit = "week")
#> [1] "1863-06-28" "1863-07-05" "1863-07-05"
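To return to the newspaper example, here is a minimal sketch of counting issues per month. The issues data frame and its issue_date column are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical issue dates for a newspaper, invented for illustration.
issues <- tibble(
  issue_date = ymd(c("1863-06-27", "1863-07-01", "1863-07-02",
                     "1863-07-03", "1863-08-15"))
)

# Collapse each date to the first of its month, then count rows per month.
issues_per_month <- issues %>%
  mutate(month = floor_date(issue_date, unit = "month")) %>%
  count(month)

issues_per_month
```

Keeping the month as a Date, rather than as a number from 1 to 12, means that the same month in different years stays distinct.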

The lubridate package also contains classes and functions for intervals and periods, which you can read about in its vignette.
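As a brief and non-exhaustive sketch of those classes, an interval records a span between two specific dates, while a period expresses a span in calendar units. The variable names war and war_period are introduced here for illustration.

```r
library(lubridate)

fort_sumter <- ymd("1861-04-12")
appomattox <- ymd("1865-04-09")

# An interval is anchored to a specific start and end date.
war <- interval(fort_sumter, appomattox)

# A period expresses the same span in calendar units such as years and months.
war_period <- as.period(war)
war_period

# Periods can be added to dates, moving forward by calendar years.
fort_sumter + years(4)
```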

3.5 Creating data with dates

When you are creating your own data, such as when you transcribe a source, you should write dates in a standardized way. The standard way of writing a date (called ISO 8601) is to include a four-digit year, a two-digit month, and a two-digit day, each separated by hyphens: YYYY-MM-DD. This way of writing dates has several virtues. One of them is that even when the dates are treated as text, they sort correctly in chronological order. Another is that by default many R functions expect dates to be in that format.
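The sorting behavior is easy to check. In this sketch the same three dates are written both ways. The vector names are illustrative.

```r
# The same three dates as ISO 8601 text and in month/day/year form.
iso_dates <- c("1862-09-17", "1861-07-21", "1863-07-01")
us_dates <- c("9/17/1862", "07/21/1861", "07/01/1863")

# Sorted as plain text, the ISO strings come out in chronological order.
sort(iso_dates)

# The month/day/year strings sort by their leading characters instead,
# which does not put them in chronological order.
sort(us_dates)

# ISO strings are also what as.Date() expects by default.
as.Date(sort(iso_dates))
```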

For example, in the toy data file webster-speeches.csv, the dates are written in the form 1800-07-04. When we load the file with the read_csv() function from readr, that column is automatically parsed into a Date column.

read_csv("data/webster-speeches.csv")
#> # A tibble: 8 × 3
#>           author       date                   speech
#>            <chr>     <date>                    <chr>
#> 1 Daniel Webster 1800-07-04 Oration at Hanover, N.H.
#> 2 Daniel Webster 1818-03-10   Dartmouth College Case
#> 3 Daniel Webster 1820-12-22         Plymouth Oration
#> 4 Daniel Webster 1825-06-17     Bunker Hill Monument
#> 5 Daniel Webster 1826-08-02      Adams and Jefferson
#> 6 Daniel Webster 1830-01-26    Second Reply to Hayne
#> # ... with 2 more rows
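If a file you did not create uses another format, readr can still parse the column when you describe the format yourself. The sketch below assumes readr 2.0 or later, where I() marks a string as literal data. The CSV text is a hypothetical month/day/year rewriting of two rows from the file above.

```r
library(readr)

# Hypothetical CSV text with month/day/year dates; I() marks it as literal data.
csv_text <- I("author,date,speech
Daniel Webster,07/04/1800,\"Oration at Hanover, N.H.\"
Daniel Webster,03/10/1818,Dartmouth College Case
")

# Tell readr how the date column is written so that it is parsed as a Date.
speeches <- read_csv(
  csv_text,
  col_types = cols(
    author = col_character(),
    date = col_date(format = "%m/%d/%Y"),
    speech = col_character()
  )
)

speeches
```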

  17. Garrett Grolemund, Vitalie Spinu, and Hadley Wickham, Lubridate: Make Dealing with Dates a Little Easier, 2016, https://CRAN.R-project.org/package=lubridate.

  18. See R’s documentation at ?POSIXt or look at the lubridate documentation for time classes.