---
title: "Spatial Coverage Sampling"
author: "D G Rossiter"
date: "`r format(Sys.Date(), '%d-%B-%Y')`"
output:
   html_document:
     toc: TRUE
     toc_float: TRUE
     theme: "spacelab"
     number_section: TRUE
     fig_height: 6
     fig_width: 6
     fig_align: "center"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
options("rgdal_show_exportToProj4_warnings"="none")
options("warn = -1")
```

The  `spcosa` R package is described in Walvoort, D. J. J., Brus, D. J., & de Gruijter, J. J. (2010). An R package for spatial coverage sampling and random sampling from compact geographical strata by k-means. Computers & Geosciences, 36(10), 1261–1267. https://doi.org/10.1016/j.cageo.2010.04.005

The aim is to place the sample locations so that they cover the study area as uniformly as possible. In a rectangular area grid sampling will accomplish this, but most study areas are not rectangles. Further, this package accounts for two-stage sampling: we can use the existing points and add new ones in the optimal locations.

The sampling units are obtained by minimising a criterion that is defined in terms of the geographic distances between the nodes of a fine *discretisation grid* covering the polygon(s) within which to sample and the sampling units. The criterion is to minimise the *mean of the squared distances* of the grid nodes to their nearest sampling unit (mean squared shortest distance, MSSD):

$$
MSSD=\frac{1}{N}\sum_{k=1}^{N}\min_{j}\left(D_{kj}^{2}\right) \;,
$$
where $N$ is the total number of nodes of the discretisation grid and $D_{kj}$ is the distance between the $k$th grid node and the $j$th sampling point. This distance measure can be minimised by the k-means algorithm.

This  is a numerical, iterative procedure, as follows:

Repeat until at step (3) points are not moved:

(1) place the required number of points $k$ at random;
(2) compute the clusters represented by these points with the k-means algorithm;
(3) move the points to the centroids of their respective clusters.

See also \S17.2 "Spatial coverage sampling"[^3]
in Brus, D. J. (2022). Spatial sampling with R. https://dickbrus.github.io/SpatialSamplingwithR/ and the vignette for the `spcosa` package[^4]

[^3]:https://dickbrus.github.io/SpatialSamplingwithR/RegularGridSpatialCoverage.html#SpatialCoverage

[^4]:https://cran.r-project.org/web/packages/spcosa/vignettes/spcosa.html

# Setup

Load the required packages:

```{r packages}
require(sf, quietly = TRUE)        # Simple Features data structures
require(units, quietly = TRUE)
require(sp, quietly = TRUE)        # spcosa uses `sp` data structures
require(rJava, quietly = TRUE)     # required by spcosa
require(spcosa, quietly = TRUE)    # the "spatial coverage sampling" package
require(ggplot2, quietly = TRUE)   # graphics
```

# Import a study area polygon

Here is the boundary of New York State (USA) as a shapefile, extracted from the US Census [^1]:

[^1]: https://www.census.gov/geographies/mapping-files/2014/geo/carto-boundary-file.html

Read it as an `sf` object with the `st_read` function.

```{r import}
ny.poly <- st_read("./ds_spcosa/NYState_500k/NYState_500k.shp")
class(ny.poly)
st_crs(ny.poly)$proj4string
st_crs(ny.poly)$epsg
st_bbox(ny.poly)
names(ny.poly)
```

The CRS is WGS84 geographic coordinates.
Convert to an appropriate metric projection CRS, for example NAD83 / New York Central:[^2]

[^2]:https://epsg.org/crs_32116/NAD83-New-York-Central.html

```{r transform}
ny.poly <- st_transform(ny.poly, 32116)
st_crs(ny.poly)$proj4string
st_crs(ny.poly)$epsg
st_bbox(ny.poly)
st_area(ny.poly)
```

The area of the State is `r round(st_area(ny.poly)/10^6)` $km^2$.

Display the multipolygon:

```{r}
plot(ny.poly["NAME"], main = "NY State", col="lightgray", 
     graticule = st_crs(4326), axes = TRUE)
```

Convert this polygon from `sf` to `sp` for use with `spcosa`:

```{r}
ny.poly.sp <- as(ny.poly, "Spatial")
class(ny.poly.sp)
```

# Spatial coverage sampling

Now we are ready to divide this polygon into strata.  An important parameter is `nTry`: "the stratify method will try `nTry` initial configurations [for the k-means clustering] and will keep the best solution in order to reduce the risk of ending up with an unfavorable solution."
  
There are 62 counties in NY, so let's use that number to get the most geographically-compact "counties":

```{r strata}
ny.strata <- stratify(ny.poly.sp, nStrata = 62, nTry = 24)
plot(ny.strata) +
    scale_x_continuous(name = "Easting (km)") +
    scale_y_continuous(name = "Northing (km)") +
    labs(title = "62 compact strata")
class(ny.strata)
slot(ny.strata@centroids, "proj4string")
```

Note that the stratification has no CRS, even though the source polygon does. However the centroids of a `CompactStratification` do have a slot to store a CRS, so we make it compatible with the polygon:

```{r}
slot(ny.strata@centroids, "proj4string") <- slot(ny.poly.sp, "proj4string")
```


Examine the structure of the stratification object:

```{r str.strata}
str(ny.strata)
```

Notice the automatically-computed cellsize `r (ny.strata@cells@grid@cellsize)[1]` and dimension of the discretization grid. The total number of cells `r length(ny.strata@cells@grid.index)` is approximately 2500, the default `nGridCells` optional argument to the `stratify` function. 
See below for how to change the resolution of the discretization grid.

We may choose to have any number of sampling points per stratum.

First, one point per "county", at its centroid. 

```{r one.pt}
ny.sample.1 <- spsample(ny.strata)
plot(ny.strata, ny.sample.1) +
    scale_x_continuous(name = "Easting (km)") +
    scale_y_continuous(name = "Northing (km)") +
    labs(title = "62 compact strata, sample at centroids")
```


## Controlling the discretization

The `stratify` function works with a fine discretization of the polygon. This defaults to 2500 grid cells; however this can be controlled with the `nGridCells` optional parameter. Or, the cell size can be set with the `cellSize` optional parameter.

Let's see how the stratification changes with a finer grid.

```{r}
system.time( {
  ny.strata.fine <- stratify(ny.poly.sp, nStrata = 62, nTry = 24, nGridCells = 10000)
  slot(ny.strata.fine@centroids, "proj4string") <- 
    slot(ny.poly.sp, "proj4string")
  ny.sample.1.fine <- spsample(ny.strata.fine)
})
plot(ny.strata.fine, ny.sample.1.fine ) +
    scale_x_continuous(name = "Easting (km)") +
    scale_y_continuous(name = "Northing (km)") +
    labs(title = "62 compact strata (fine discretization)")
```

We see more detailed boundaries, which are more likely to be oblique.

# Stratified random sampling

We can use the stratification and select several random points per stratum. This would be applicable as a stratified random sampling plan to estimate population parameters for the State.

For example, five points per stratum (so, 310 points total):

```{r more.pts}
ny.sample.5 <- spsample(ny.strata.fine, n = 5)
plot(ny.strata.fine, ny.sample.5) +
    scale_x_continuous(name = "Easting (km)") +
    scale_y_continuous(name = "Northing (km)") +
    labs(title = "62 compact strata, 5 random points per stratum")
```

# Infill sampling

Here we locate a set of new weather stations as infill to a set of existing stations, from the Northeast Regional Climate Network. These have been prepared in the same CRS as the NY State boundary polygon.

```{r pre.points}
load("./ds_spcosa/ny_stations_sf.RData", verbose = TRUE)
dim(ny.pts)
class(ny.pts)
head(ny.pts)
st_crs(ny.pts)$epsg
st_crs(ny.pts)$proj4string
st_bbox(ny.pts)
ggplot() +
  geom_sf(data = ny.poly) +
  geom_sf(data = ny.pts) +
  labs(title = "NY State weather stations")
```

There are `r dim(ny.pts)[1]` points. Suppose we have a budget for 196 weather stations -- where should we locate the `r 196-dim(ny.pts)[1]` new stations?

First stratify, then identify the centroids. We also show how to specify a cell size with the `cellSize` optional parameter. Here we select a 5 x 5 km grid cell; this results in about `r round((st_area(ny.poly)/10^6)/25)` grid cells covering the State, which is about twice the default `nGridCells` (2500).

```{r sample.new}
(n.new <- 196-dim(ny.pts)[1])
ny.strata.new <- stratify(ny.poly.sp, 
                       # priorPoints must be of class sp::SpatialPoints*
                       priorPoints = as(ny.pts, "Spatial"),
                       nStrata = 196, nTry = 24,
                       cellSize = c(5000, 5000))
ny.pts.new <- spsample(ny.strata.new)
str(ny.pts.new)
```

Plot with two different symbols:

```{r plot.sample.new}
plot(ny.strata.new, ny.pts.new) +
  scale_x_continuous(name = "Easting (km)") +
  scale_y_continuous(name = "Northing (km)") +
  labs(title = "196 compact strata, stations at centroid")
```

# Exporting sample points

The selected locations can be exported for use in other spatial programs or GIS. The simplest format is as a two-column CSV file. We first convert to a `data.frame` and then export this.

We know the location of the existing points, so we only export the locations of the additional points. For ease of location in the field, we convert from the NY State metric system to WGS84 geographic coordinates.

```{r export}
# indices of the new points
which.infill <- which(ny.pts.new@isPriorPoint == FALSE)
# only select the rows which represent new points
ny.infill.points.df <- 
  as(ny.pts.new, "data.frame")[which.infill,]
# how many new points?
dim(ny.infill.points.df)
# convert to sf to change CRS
ny.infill.pts.sf <- st_as_sf(ny.infill.points.df, coords = c("x1", "x2"))
head(ny.infill.pts.sf)
st_crs(ny.infill.pts.sf) <- st_crs(ny.poly)
ny.infill.pts.sf <- st_transform(ny.infill.pts.sf, 4326)
head(ny.infill.pts.sf)
# convert back to data.frame for export
ny.infill.pts.df <- as.data.frame(st_coordinates(ny.infill.pts.sf))
# round to 3 decimal degrees, approx. 100 m; adequate for weather stations
names(ny.infill.pts.df) <- c("Long", "Lat")
ny.infill.pts.df <- round(ny.infill.pts.df, 3)
head(ny.infill.pts.df)
write.csv(ny.infill.pts.df, file="./ds_spcosa/InfillSamplePts.csv")
```

There are `r dim(ny.infill.points.df)[1]` new points.