3 Dates

Computational historians, unsurprisingly, often have to deal with dates. Even basic operations with dates can be difficult, however. For example, since months and years have unequal numbers of days, even figuring out the duration of time between two dates can be tricky. But R includes a special type of object, Date, which makes the task much easier. As long as you are dealing with Gregorian dates, you can do almost anything that you might need to do with dates using R’s built-in functions and the lubridate package.[1]

We should be precise in our definition of what a date is. A date must be specified to an exact year, month, and day. If you are dealing only with years, then a simple numeric or integer vector is sufficient. If you are dealing with dates and times, then you will need to specify the time and possibly the time zone as well. This chapter will not get into dates and times, but if you understand how to work with date objects, those principles are easily extended to date and time objects.[2]

In this chapter we will use the lubridate package alongside our customary tidyverse and historydata packages.

3.1 Years

If you are dealing only with dates in the form of years, then a numeric column in your data with the year information is sufficient. For example, the dijon_prices dataset in historydata contains price series for various commodities in Dijon, France, from 1568 to 1630.

Using the year column, we can filter the data frame to a particular year as we are accustomed to doing in dplyr, or we can use the year column as a variable in a ggplot2 plot.
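As a rough sketch of both uses, the code below filters dijon_prices to a single year and then plots how many observations the dataset has for each year. Only the year column is taken from the text above; the choice of 1600 and the object name prices_1600 are illustrative assumptions.

```r
library(dplyr)
library(ggplot2)
library(historydata)

# Keep only the rows for one year (1600 is an arbitrary choice)
prices_1600 <- dijon_prices %>%
  filter(year == 1600)

# Use the year column as a plot variable: number of observations per year
dijon_prices %>%
  count(year) %>%
  ggplot(aes(x = year, y = n)) +
  geom_col()
```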

Often, though, we have a year embedded in some other kind of variable, and we would like to extract that year so that we can manipulate it directly. For example, we might have document IDs or file names that contain a year, or a set of sentences that include dates.

In each of these cases, we can describe what we need to do. We need to extract a four-character sequence of digits, which will be our year, and we need to convert that sequence of characters into an integer which we can treat as a number. You will often find yourself writing a function that looks like this.
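What follows is a minimal sketch of such a function, not a canonical implementation. The name extract_year is a hypothetical choice; the regular expression and the stringr call are the ones described in this section.

```r
library(stringr)

# extract_year() is an illustrative name, not a function from a package
extract_year <- function(x) {
  # A non-character input is probably a mistake, so stop early
  stopifnot(is.character(x))
  # Find the first run of four digits in each element
  year_text <- str_extract(x, "\\d{4}")
  # Convert the matched text to an integer (NA where nothing matched)
  as.integer(year_text)
}
```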

This function first checks that the input vector x is a character vector; if it’s not, then the input is probably a mistake. Then it uses the str_extract() function from the stringr package to find the first sequence of four digits. (That is the meaning of the regular expression "\\d{4}".) Then it turns the resulting character vector into an integer and returns it.

We can test this function on our sample data.

Because this function is vectorized, we could use it in a dplyr expression in order to create a new column of years from an existing column.
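A sketch of that usage might look like the following. The file names are invented for illustration, and the year-extracting helper is restated in condensed form (under the hypothetical name extract_year) only so that the block runs on its own.

```r
library(dplyr)
library(stringr)

# Condensed, hypothetical helper: first four-digit run, as an integer
extract_year <- function(x) as.integer(str_extract(x, "\\d{4}"))

# Made-up file names that each contain a year
documents <- tibble(
  file = c("letters-1776-07.txt", "sermon_1834.txt", "diary-1861-04-12.txt")
)

# Because the helper is vectorized, it works on a whole column at once
documents_with_year <- documents %>%
  mutate(year = extract_year(file))

documents_with_year
```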

3.2 R’s date objects

R includes a Date class for representing dates and doing calculations with them. You can turn a text representation of a date into a Date object using the as.Date() function. This function takes a format = argument that lets you specify the order of the elements of the date, but the default is to accept dates in the form YYYY-MM-DD. Let’s create two date objects.
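For instance, the two dates below are illustrative choices (they are not taken from the original chapter). The first relies on the default YYYY-MM-DD form, and the second uses format = to describe a month/day/year string.

```r
# Default format: YYYY-MM-DD
start_date <- as.Date("1863-07-01")

# A different layout needs an explicit format string
end_date <- as.Date("11/19/1863", format = "%m/%d/%Y")

class(start_date)
end_date
```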

Now that we have created our date objects we can use comparison functions to figure out if dates come before or after one another.
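A small sketch of such comparisons, using two illustrative dates (assumed here, and recreated so the block stands alone):

```r
start_date <- as.Date("1863-07-01")
end_date <- as.Date("1863-11-19")

# The usual comparison operators work on Date objects
is_before <- start_date < end_date
is_same <- start_date == end_date

is_before
is_same
```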

We can also calculate the difference in time using the - function. Note that the returned object prints out the length in days, but it is actually an object of class difftime. You can get a measurement of a time difference in different units, such as weeks, by using the difftime() function.
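As an illustration, again with two assumed example dates, subtraction gives a difftime measured in days, and difftime() lets you ask for another unit:

```r
start_date <- as.Date("1863-07-01")
end_date <- as.Date("1863-11-19")

# Subtracting two Date objects returns a difftime object in days
elapsed <- end_date - start_date
class(elapsed)

# Ask for the same difference in weeks instead
elapsed_weeks <- difftime(end_date, start_date, units = "weeks")
elapsed_weeks
```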

Another useful operation with dates from base R is creating a sequence between two dates at some regular interval.
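One way to sketch this is with seq(), which has a method for Date objects. The endpoints below are illustrative assumptions.

```r
# The first of each month between two dates
monthly <- seq(as.Date("1863-07-01"), as.Date("1863-11-19"), by = "month")
monthly

# by = also accepts "day", "week", "year", or strings such as "2 weeks"
fortnightly <- seq(as.Date("1863-07-01"), as.Date("1863-08-01"), by = "2 weeks")
fortnightly
```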

3.3 Parsing dates with lubridate

While there are other things that you can do with dates in base R, you are almost always better off using the lubridate package. That package provides many additional functions and some additional classes for dealing with dates in a sensible way.

The lubridate package provides a series of functions in the form mdy(), ymd(), and dmy(), where those letters correspond to the position of the month, day, and year in a date. As long as your dates are in a reasonably consistent format, lubridate should be able to parse them. For example, lubridate can parse these different ways of writing the same dates.
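Here is a brief sketch with one date written four ways. The strings are invented examples, and parsing month names assumes an English locale.

```r
library(lubridate)

# Pick the function whose letters match the order of the date's parts
parsed <- c(
  mdy("July 4, 1776"),
  mdy("07/04/1776"),
  dmy("4 July 1776"),
  ymd("1776-07-04")
)

parsed
```

Each call should return the same Date value, since only the order of the components differs.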

You often don’t have a choice about the formats of dates in data that you receive.

3.4 Other operations on dates

Once you have a vector of dates, you might need to extract some component of the date, such as the year or the day of the week. The lubridate package provides functions to pull out those pieces.
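For example, assuming a small vector of illustrative dates, the accessor functions can be used like this:

```r
library(lubridate)

dates <- ymd(c("1776-07-04", "1863-11-19"))

# Numeric components
years <- year(dates)
months <- month(dates)
days <- day(dates)

# Day of the week as a labeled factor rather than a number
weekdays_labeled <- wday(dates, label = TRUE)

years
weekdays_labeled
```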

Sometimes you have dates that are specific down to the day, but you are interested in aggregating them by year, month, or week. For example, you might have the dates of newspaper issues, but want to know how many papers were published in a year or a month. For that you can use lubridate’s round_date(), floor_date(), and ceiling_date() functions.

Note that floor_date() will give you the date at the start of the week or month or year, while round_date() will give you the nearest start of the week or month or year.
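The difference can be sketched with a date late in a month, followed by a count of issues per month. The issue dates are hypothetical.

```r
library(lubridate)

issue_date <- ymd("1863-11-19")

floored <- floor_date(issue_date, "month")    # start of the same month
ceilinged <- ceiling_date(issue_date, "month") # start of the next month
rounded <- round_date(issue_date, "month")     # whichever start is nearer

# Aggregating: count hypothetical newspaper issues by month
issues <- ymd(c("1863-11-05", "1863-11-19", "1863-12-03"))
issues_per_month <- table(floor_date(issues, "month"))
issues_per_month
```

Because 19 November is closer to 1 December than to 1 November, round_date() moves it forward while floor_date() moves it back.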

The lubridate package also contains classes and functions for intervals and periods, which you can read about in its vignette.

3.5 Creating data with dates

When you are creating your own data, such as when you transcribe a source, you should write dates in a standardized way. The standard way of writing a date (called ISO 8601) is to include a four-digit year, a two-digit month, and a two-digit day, each separated by hyphens: YYYY-MM-DD. This way of writing dates has several virtues. One of them is that even when the dates are treated as text, they sort correctly in chronological order. Another is that by default many R functions expect dates to be in that format.
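The sorting point can be checked with a quick sketch. Both vectors are made-up examples holding the same three dates, once in ISO 8601 form and once as month/day/year text.

```r
iso_text <- c("1863-11-19", "1900-07-04", "1776-12-25")
mdy_text <- c("11/19/1863", "07/04/1900", "12/25/1776")

# Sorted as plain text, the ISO strings are also in chronological order
sorted_iso <- sort(iso_text)

# The month/day/year strings sort by month first, which scrambles the years
sorted_mdy <- sort(mdy_text)

sorted_iso
sorted_mdy
```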

For example, in the toy data file webster-speeches.csv, the dates are written in the form 1800-07-04. When we load the file with the read_csv() function from readr, that date column is automatically parsed into a date format.
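Since the file itself is not reproduced here, the sketch below uses an inline stand-in. The column names and rows are hypothetical, not the contents of webster-speeches.csv, and passing literal text with I() assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical stand-in for a CSV file with an ISO 8601 date column
csv_text <- "title,date\nExample speech,1800-07-04\nAnother example speech,1802-01-15\n"

speeches <- read_csv(I(csv_text), show_col_types = FALSE)

# The date column is guessed as a Date, with no format string needed
class(speeches$date)
```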

  1. Vitalie Spinu, Garrett Grolemund, and Hadley Wickham, Lubridate: Make Dealing with Dates a Little Easier, 2018, https://CRAN.R-project.org/package=lubridate.

  2. See R’s documentation at ?POSIXt or look at the lubridate documentation for time classes.