4.1 Introduction and setup
Geocoding is the process of moving back and forth from place names to geographic coordinates. Suppose, for example, that we had the name of a city:
"Fairfax, VA". “Forward” geocoding takes the place name as input and returns a guess for the latitude and longitude for that city. “Reverse geocoding” would take a latitude and longitude as input and return a guess for the city’s name.
Because geocoding requires a vast number of place names, it is usually done by sending queries a web service and getting back results. There are a number of such services that have corresponding packages in R, but unfortunately most of the services require payment.
For this chapter, we will use the OpenCage Geocoder, and the opencage R package. OpenCage offer 2,500 queries free per day, in addition to their paid service. To use this service, register for their free plan. Once you have done so, you will get an API key, which looks like a long string of random characters. This is like a password which uniquely identifies you to the service so it can count your requests. You will want to keep this API key private like you would a password.
Once you have the API key, you can install the opencage package by running
install.packages("opencage") in your console. Once you have done that, we are going put the API key into an environment variable so that the package knows about it. Then, we will check that the package can find it. If your API key is returned, then you are all set to begin.
4.2 Geocoding a single place name at a time
To understand how geocoding works, we are going to try it on a single place name for a city. This would also work for other kinds of place names, including addresses.
We can use the
opencage_forward() to go from a place name to a location. We want to do a few things to help the geocoder get good results. First, we will be as specific in the place name as possible. One other thing that we can do is use the
countrycode = "US" argument to say that place name is in the United States, thus excluding the rest of the world from consideration. You can submit the two letter code for any country instead. Or, if you know that your place names are in a specific region, you can provide a bounding box. For example, if you were geocoding addresses in Washington, DC, you might figure out the bounding box around the city to make sure that only addresses from that city are included. Finally, the API will give us results in a variety of formats, depending on what we need. So here we are using
no_annotations = TRUE to simplify the results a big.
The output from the API that we have saved into the
city_geocoded variable are a list, and it includes information about how many queries we have left in addition to the results. The results themselves are a data frame. Here we are looking just at some key columns from the results.
city_geocoded$results %>% select(query, formatted, confidence, components._type, components._category, geometry.lat, geometry.lng) #> # A tibble: 4 x 7 #> query formatted confidence components._type components._cat… geometry.lat #> <chr> <chr> <chr> <chr> <chr> <dbl> #> 1 Fair… Fairfax,… 7 county place 38.8 #> 2 Fair… Fairfax … 2 county place 38.8 #> 3 Fair… Fairfax,… 9 road road 37.2 #> 4 Fair… The Fair… 9 nursing_home social 38.7 #> # … with 1 more variable: geometry.lng <dbl>
Somewhat surprisingly, we get four results back when we thought we were going to get one. But this is less surprising when we think about the fact that the place name “Fairfax” is ambiguous. We get back coordinates for Fairfax City, Fairfax County, a road called “Fairfax” as well as a nursing home called “The Fairfax.” We could filter these results using the category or some other criteria to pick the one that we want. Or we could also instruct the API to give us just its best guess back.
The key results that we get back are the two columns that contain latitude and longitude, which we can use for mapping.
4.3 Geocoding place names in a batch
Most of the time, we will have a set of place names that we want to geocode all at once.
Let’s create a sample data set. Even though the Paulist missions data in the historydata package are already geocoded, let’s try to do the geocoding ourselves. We are going to create a new dataset,
missions, with just the name of the church and the city and state. We are only going to use the first ten rows of data. And we are going to add a column with the place name by combining the city and state. The structure of this data would be similar for many kinds of historical data. In this case we are going to try to geocode at the level of the city rather than the building.
library(historydata) missions <- paulist_missions %>% slice(1:10) %>% select(church, city, state) %>% mutate(place = str_c(city, ", ", state)) missions #> # A tibble: 10 x 4 #> church city state place #> <chr> <chr> <chr> <chr> #> 1 St. Joseph's Church New York NY New York, NY #> 2 St. Michael's Church Loretto PA Loretto, PA #> 3 St. Mary's Church Hollidaysburg PA Hollidaysburg, PA #> 4 Church of St. John Evangelist Johnstown PA Johnstown, PA #> 5 St. Peter's Church New York NY New York, NY #> 6 St. Patrick's Cathedral New York NY New York, NY #> # … with 4 more rows
Right away we should notice that there three instances of missions to New York city. So we don’t need to geocode that city multiple times. It will be faster, as well as easier to save and correct by hand if necessary, if we create a new data frame with just the distinct places.
places <- missions %>% distinct(city, state, place) places #> # A tibble: 8 x 3 #> city state place #> <chr> <chr> <chr> #> 1 New York NY New York, NY #> 2 Loretto PA Loretto, PA #> 3 Hollidaysburg PA Hollidaysburg, PA #> 4 Johnstown PA Johnstown, PA #> 5 Erie PA Erie, PA #> 6 Cussewago PA Cussewago, PA #> # … with 2 more rows
Now we are going to geocode all of the place names at once. The code below might seem somewhat convoluted. It runs the geocoding function on each of the place names, then joins the results back into a data frame. Note that we have used the
limit = 1 argument to make sure we only get back a single set of coordinates for each place name, rather than multiple.^]We could also get a number of results back from each place, and then settle on some criterion for evaluating them to get down to a single one per place name, but that would be more complicated than necessary in this case.]
We can take a look at the results to see whether they make sense. From these results, we might have some concern about whether “New York, NY” got the city rather than some other kind of geographic unit.
geocoded %>% select(query, formatted, confidence, components._type, geometry.lat, geometry.lng) #> # A tibble: 8 x 6 #> query formatted confidence components._type geometry.lat geometry.lng #> <chr> <chr> <chr> <chr> <dbl> <dbl> #> 1 New Yor… NY, United Sta… 2 state_district 40.7 -74.0 #> 2 Loretto… Loretto, PA, U… 7 city 40.5 -78.6 #> 3 Hollida… Hollidaysburg,… 7 city 40.4 -78.4 #> 4 Johnsto… Johnstown, PA,… 7 city 40.3 -78.9 #> 5 Erie, PA Erie, PA, Unit… 7 city 42.1 -80.1 #> 6 Cussewa… Cussewago Town… 6 city 41.8 -80.2 #> # … with 2 more rows
But the real proof is when we try to map these geocoded points. Here we can map a quick map in leaflet, hover over the markers to see what place name is associated with them, and then zoom into the map to check whether the locations are correct. It appears from this map that all of our locations are spot on.
If the geocoded coordinates were not accurate, there are a few steps we could try. We could try to update our place names to make them more precise. Or, we could write out a CSV file containing the geocoded coordinates, edit the ones that were mistaken by hand, and then read the CSV file back into R.
There is one more step that we need to take. We have the place names geocoded, but we want to associate those back to the dataset we started with. We can do that by doing a join from the
missions data frame to a new data frame (called
coordinates here) which has just the place name and the latitude and longitude.
The resulting table gives us our original dataset, plus the latitudes and longitudes.
missions_geocoded #> # A tibble: 10 x 6 #> church city state place lat lng #> <chr> <chr> <chr> <chr> <dbl> <dbl> #> 1 St. Joseph's Church New York NY New York, NY 40.7 -74.0 #> 2 St. Michael's Church Loretto PA Loretto, PA 40.5 -78.6 #> 3 St. Mary's Church Hollidaysburg PA Hollidaysburg, … 40.4 -78.4 #> 4 Church of St. John Evangelist Johnstown PA Johnstown, PA 40.3 -78.9 #> 5 St. Peter's Church New York NY New York, NY 40.7 -74.0 #> 6 St. Patrick's Cathedral New York NY New York, NY 40.7 -74.0 #> # … with 4 more rows
4.4 More details
If opencage does not suit your needs, you can try the Google Maps geocoder, which can be accessed through the ggmap package. The Google API will require a credit card in order to get an API key, but Google also gives away free initial funds which may be sufficient for your purposes.