During graduate school, I started working with spatial data and making maps using ArcGIS. It was the primary tool used by my Geography professors, it was installed on all the computers in the lab, and hey, point and click, who doesn’t love point and click? But very quickly I became frustrated with the GUI and ArcGIS was really slow for seemingly simple operations.
For one project, I wanted to count the number of arrests made in certain neighborhoods. I had the XY point coordinates of about 50,000 arrests and polygons of each Census tract in New York City (about 2,200). So I slog my way through the spatial join menu, select my layers, and then…wait. I don’t remember how long it took, but I knew there had to be a better way.
This was around the time the sf
package was getting off the ground and I dove into spatial processing in R. I soon discovered there was a much better way: I stopped using ArcGIS and did all my mapping in R.
Anyhow, that was a long-winded introduction to this short post about how to perform a point-in-polygon operation using sf
.
For this example, I’m going to use data from the New York City 2015 Tree Census. Volunteers and Parks Department staff counted all 650,000+ trees on the streets of New York and fortunately for us, they put the results on the NYC Open Data portal.
The first step is to grab the tree data via the NYC Open Data API and convert it into an sf
object. (If don’t already have an app token set up for NYC Open Data, the first first step is to create an account and key.)
library(tidyverse)
library(sf)
library(RSocrata) # to download data from the city api
library(keyring) # for secure app token access
library(units)
library(tmap)
# set up url for api call
base <- "https://data.cityofnewyork.us/resource/"
resource <- "nwxe-4ae8" # tree count 2015 data set
vars <- c("x_sp", "y_sp") # get x and y coordinates only
call <- paste0(base, resource, ".json", "?$select=", paste(vars, collapse = ", "))
# download data -- this can take a little while!
# my api access token is saved via keyring package
tree <- read.socrata(call, app_token = key_get("NYC Open Data")) %>%
as_tibble()
# convert tree data frame into sf object
tree_sf <- tree %>%
mutate_at(vars(x_sp, y_sp), as.numeric) %>% # coordinates must be numeric
st_as_sf(
coords = c("x_sp", "y_sp"),
agr = "constant",
crs = 2263, # nad83 / new york long island projection
stringsAsFactors = FALSE,
remove = TRUE
)
Next, we’ll pull down Census tract geometry for New York City, also from the NYC Open Data website. We have to convert this to the same projection as the point data to do the spaital join in the next step.
geo_url <- "https://data.cityofnewyork.us/api/geospatial/fxpq-c8ku?method=export&format=GeoJSON"
# read in geojson of tract geometry and calculate area of each tract in sq mi
tract_sf <- read_sf(geo_url) %>%
st_transform(2263) %>% # convert to same projection as above
select(boro_name, boro_ct2010) %>%
mutate(area = set_units(st_area(.), mi^2)) %>%
print()
## Simple feature collection with 2166 features and 3 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## epsg (SRID): 2263
## proj4string: +proj=lcc +lat_1=41.03333333333333 +lat_2=40.66666666666666 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000.0000000001 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs
## # A tibble: 2,166 x 4
## boro_name boro_ct2010 area geometry
## <chr> <chr> [mi^2] <MULTIPOLYGON [US_survey_foot]>
## 1 Staten Isl… 5000900 0.08956… (((962269.1 173705.5, 962288.7 173671.6…
## 2 Manhattan 1009800 0.06836… (((994133.5 214848.9, 994005.9 214620.8…
## 3 Manhattan 1010000 0.06675… (((993108.3 216013.1, 992982.2 215784.3…
## 4 Manhattan 1010200 0.06675… (((992216.5 216507.7, 992091 216280.4, …
## 5 Manhattan 1010400 0.06688… (((991325.9 217001.7, 991199.2 216773.6…
## 6 Manhattan 1011300 0.06782… (((988650.3 214286.4, 988517.8 214044.1…
## 7 Manhattan 1011402 0.03814… (((994013.2 217645.3, 993886.5 217417.3…
## 8 Manhattan 1013000 0.06880… (((994920.1 221386.3, 994791.9 221155, …
## 9 Manhattan 1014000 0.06908… (((996728.3 222545.7, 996600.4 222314.1…
## 10 Manhattan 1014801 0.02005… (((996994.4 223026.1, 996856.4 222777, …
## # … with 2,156 more rows
Now with the point object (trees) and polygon object (Census tracts), we perform a spatial join. Using st_join()
, we’ll find which of the 2,166 polygons in tract_sf
that each of the 683,788 points in tree_sf
is within. The result of this function will be an sf
data frame with each row of tree_sf
appended with the columns from tract_sf
of the appropriate tract. After that, we’ll count up the number of trees in each tract, join the count back into the original tract data frame and calculate the tree density as trees per square mile.
# find points within polygons
tree_in_tract <- st_join(tree_sf, tract_sf, join = st_within)
# count trees per census tract
tree_tract_count <- count(as_tibble(tree_in_tract), boro_ct2010) %>%
print()
## # A tibble: 2,153 x 2
## boro_ct2010 n
## <chr> <int>
## 1 1000201 70
## 2 1000202 217
## 3 1000600 187
## 4 1000700 144
## 5 1000800 293
## 6 1000900 83
## 7 1001001 24
## 8 1001002 56
## 9 1001200 111
## 10 1001300 97
## # … with 2,143 more rows
# join tree count with tract df, calc tree density
tract_tree_sf <- left_join(tract_sf, tree_tract_count) %>%
mutate(tree_sq_mi = as.numeric(n / area)) %>%
print()
## Simple feature collection with 2166 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## epsg (SRID): 2263
## proj4string: +proj=lcc +lat_1=41.03333333333333 +lat_2=40.66666666666666 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000.0000000001 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs
## # A tibble: 2,166 x 6
## boro_name boro_ct2010 area n tree_sq_mi geometry
## <chr> <chr> [mi^2] <int> <dbl> <MULTIPOLYGON [US_survey>
## 1 Staten I… 5000900 0.08956… 269 3003. (((962269.1 173705.5, 96…
## 2 Manhattan 1009800 0.06836… 328 4797. (((994133.5 214848.9, 99…
## 3 Manhattan 1010000 0.06675… 194 2906. (((993108.3 216013.1, 99…
## 4 Manhattan 1010200 0.06675… 102 1528. (((992216.5 216507.7, 99…
## 5 Manhattan 1010400 0.06688… 188 2811. (((991325.9 217001.7, 99…
## 6 Manhattan 1011300 0.06782… 6 88.5 (((988650.3 214286.4, 98…
## 7 Manhattan 1011402 0.03814… 153 4011. (((994013.2 217645.3, 99…
## 8 Manhattan 1013000 0.06880… 399 5799. (((994920.1 221386.3, 99…
## 9 Manhattan 1014000 0.06908… 408 5906. (((996728.3 222545.7, 99…
## 10 Manhattan 1014801 0.02005… 93 4636. (((996994.4 223026.1, 99…
## # … with 2,156 more rows
And there you have it. Finding points in polygons in just a few easy (programatic!) steps. And how about a quick interactive choropleth map of this data. Bye, bye ArcGIS!!!
tmap_mode("view")
tm_shape(tract_tree_sf) +
tm_fill(
col = "tree_sq_mi",
palette = "Greens",
style = "cont",
contrast = c(0.1, 1),
title = "Trees per Sq Mile",
id = "boro_ct2010",
showNA = FALSE,
alpha = 0.8,
popup.vars = c(
"Total Trees" = "n",
"Trees/Sq Mi" = "tree_sq_mi"
),
popup.format = list(
n = list(format = "f", digits = 0),
tree_sq_mi = list(format = "f", digits = 0)
)
) +
tm_borders(col = "darkgray", lwd = 0.7) +
tm_view(basemaps = "Stamen.TonerLite")