Point-in-polygon with sf
Oct 9 2018

During graduate school, I started working with spatial data and making maps using ArcGIS. It was the primary tool used by my Geography professors, it was installed on all the computers in the lab, and hey, point and click, who doesn’t love point and click? But very quickly I became frustrated with the GUI and ArcGIS was really slow for seemingly simple operations.

For one project, I wanted to count the number of arrests made in certain neighborhoods. I had the XY point coordinates of about 50,000 arrests and polygons of each Census tract in New York City (about 2,200). So I slog my way through the spatial join menu, select my layers, and then…wait. I don’t remember how long it took, but I knew there had to be a better way.

This was around the time the sf package was getting off the ground and I dove into spatial processing in R. I soon discovered there was a much better way: I stopped using ArcGIS and did all my mapping in R.

Anyhow, that was a long-winded introduction to this short post about how to perform a point-in-polygon operation using sf.


For this example, I’m going to use data from the New York City 2015 Tree Census. Volunteers and Parks Department staff counted all 650,000+ trees on the streets of New York and fortunately for us, they put the results on the NYC Open Data portal.

The first step is to grab the tree data via the NYC Open Data API and convert it into an sf object. (If don’t already have an app token set up for NYC Open Data, the first first step is to create an account and key.)

library(tidyverse)
library(sf)
library(RSocrata) # to download data from the city api
library(keyring)  # for secure app token access
library(units)
library(tmap)

# set up url for api call
base <- "https://data.cityofnewyork.us/resource/"
resource <- "nwxe-4ae8"  # tree count 2015 data set
vars <- c("x_sp", "y_sp")  # get x and y coordinates only

call <- paste0(base, resource, ".json", "?$select=", paste(vars, collapse = ", "))

# download data -- this can take a little while!
# my api access token is saved via keyring package
tree <- read.socrata(call, app_token = key_get("NYC Open Data")) %>%
  as_tibble()

# convert tree data frame into sf object
tree_sf <- tree %>%
  mutate_at(vars(x_sp, y_sp), as.numeric) %>%   # coordinates must be numeric
  st_as_sf(
    coords = c("x_sp", "y_sp"),
    agr = "constant",
    crs = 2263,        # nad83 / new york long island projection
    stringsAsFactors = FALSE,
    remove = TRUE
    )

Next, we’ll pull down Census tract geometry for New York City, also from the NYC Open Data website. We have to convert this to the same projection as the point data to do the spaital join in the next step.

geo_url <- "https://data.cityofnewyork.us/api/geospatial/fxpq-c8ku?method=export&format=GeoJSON"

# read in geojson of tract geometry and calculate area of each tract in sq mi
tract_sf <- read_sf(geo_url) %>%
  st_transform(2263) %>%   # convert to same projection as above
  select(boro_name, boro_ct2010) %>%
  mutate(area = set_units(st_area(.), mi^2)) %>%
  print()
## Simple feature collection with 2166 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## epsg (SRID):    2263
## proj4string:    +proj=lcc +lat_1=41.03333333333333 +lat_2=40.66666666666666 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000.0000000001 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs
## # A tibble: 2,166 x 4
##    boro_name   boro_ct2010     area                                 geometry
##    <chr>       <chr>         [mi^2]          <MULTIPOLYGON [US_survey_foot]>
##  1 Staten Isl… 5000900     0.08956… (((962269.1 173705.5, 962288.7 173671.6…
##  2 Manhattan   1009800     0.06836… (((994133.5 214848.9, 994005.9 214620.8…
##  3 Manhattan   1010000     0.06675… (((993108.3 216013.1, 992982.2 215784.3…
##  4 Manhattan   1010200     0.06675… (((992216.5 216507.7, 992091 216280.4, …
##  5 Manhattan   1010400     0.06688… (((991325.9 217001.7, 991199.2 216773.6…
##  6 Manhattan   1011300     0.06782… (((988650.3 214286.4, 988517.8 214044.1…
##  7 Manhattan   1011402     0.03814… (((994013.2 217645.3, 993886.5 217417.3…
##  8 Manhattan   1013000     0.06880… (((994920.1 221386.3, 994791.9 221155, …
##  9 Manhattan   1014000     0.06908… (((996728.3 222545.7, 996600.4 222314.1…
## 10 Manhattan   1014801     0.02005… (((996994.4 223026.1, 996856.4 222777, …
## # … with 2,156 more rows

Now with the point object (trees) and polygon object (Census tracts), we perform a spatial join. Using st_join(), we’ll find which of the 2,166 polygons in tract_sf that each of the 683,788 points in tree_sf is within. The result of this function will be an sf data frame with each row of tree_sf appended with the columns from tract_sf of the appropriate tract. After that, we’ll count up the number of trees in each tract, join the count back into the original tract data frame and calculate the tree density as trees per square mile.

# find points within polygons
tree_in_tract <- st_join(tree_sf, tract_sf, join = st_within)

# count trees per census tract
tree_tract_count <- count(as_tibble(tree_in_tract), boro_ct2010) %>%
  print()
## # A tibble: 2,153 x 2
##    boro_ct2010     n
##    <chr>       <int>
##  1 1000201        70
##  2 1000202       217
##  3 1000600       187
##  4 1000700       144
##  5 1000800       293
##  6 1000900        83
##  7 1001001        24
##  8 1001002        56
##  9 1001200       111
## 10 1001300        97
## # … with 2,143 more rows

# join tree count with tract df, calc tree density
tract_tree_sf <- left_join(tract_sf, tree_tract_count) %>%
  mutate(tree_sq_mi = as.numeric(n / area))  %>%
  print()
## Simple feature collection with 2166 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## epsg (SRID):    2263
## proj4string:    +proj=lcc +lat_1=41.03333333333333 +lat_2=40.66666666666666 +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000.0000000001 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs
## # A tibble: 2,166 x 6
##    boro_name boro_ct2010     area     n tree_sq_mi                  geometry
##    <chr>     <chr>         [mi^2] <int>      <dbl> <MULTIPOLYGON [US_survey>
##  1 Staten I… 5000900     0.08956…   269     3003.  (((962269.1 173705.5, 96…
##  2 Manhattan 1009800     0.06836…   328     4797.  (((994133.5 214848.9, 99…
##  3 Manhattan 1010000     0.06675…   194     2906.  (((993108.3 216013.1, 99…
##  4 Manhattan 1010200     0.06675…   102     1528.  (((992216.5 216507.7, 99…
##  5 Manhattan 1010400     0.06688…   188     2811.  (((991325.9 217001.7, 99…
##  6 Manhattan 1011300     0.06782…     6       88.5 (((988650.3 214286.4, 98…
##  7 Manhattan 1011402     0.03814…   153     4011.  (((994013.2 217645.3, 99…
##  8 Manhattan 1013000     0.06880…   399     5799.  (((994920.1 221386.3, 99…
##  9 Manhattan 1014000     0.06908…   408     5906.  (((996728.3 222545.7, 99…
## 10 Manhattan 1014801     0.02005…    93     4636.  (((996994.4 223026.1, 99…
## # … with 2,156 more rows

And there you have it. Finding points in polygons in just a few easy (programatic!) steps. And how about a quick interactive choropleth map of this data. Bye, bye ArcGIS!!!

tmap_mode("view")
tm_shape(tract_tree_sf) +
  tm_fill(
    col = "tree_sq_mi",
    palette = "Greens",
    style = "cont",
    contrast = c(0.1, 1),
    title = "Trees per Sq Mile",
    id = "boro_ct2010",
    showNA = FALSE,
    alpha = 0.8,
    popup.vars = c(
      "Total Trees" = "n",
      "Trees/Sq Mi" = "tree_sq_mi"
    ),
    popup.format = list(
      n = list(format = "f", digits = 0),
      tree_sq_mi = list(format = "f", digits = 0)
    )
  ) +
  tm_borders(col = "darkgray", lwd = 0.7) + 
  tm_view(basemaps = "Stamen.TonerLite")