---
title: "Introduction sdcSpatial: privacy protected density maps"
subtitle: "Privacy protected density maps"
output: 
  rmarkdown::html_vignette:
    df_print: default
    fig_width: 6
vignette: >
  %\VignetteIndexEntry{Introduction sdcSpatial: privacy protected density maps}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  out.width = "100%"
)

# preloading
library(sdcSpatial)
```
 q
Plotting data on a map is a popular and helpful tool to analyze spatial data. 
R makes it easy to plot spatial data with packages such as `ggplot2`, 
`tmap`, `mapview` or `leaflet`. 
However when plotting the spatial distribution of a 
_sensitive_ variable, e.g. income or unemployment, 
you may accidentally reveal a sensitive value of an 
individual observation. 
_Statistical disclosure control_ (SDC) deals with problems related to privacy in connection with 
publishing statistics. SDC provides measures to assess disclosure risk and methods to 
reduce disclosure risk in publications while trying to minimize information loss.

Several open source tools are available; see [sdcTools](https://github.com/sdcTools) for a collection of them. Commonly used tools include the standalone software tools [$\mu$-argus](https://research.cbs.nl/casc/mu.htm) and [$\tau$-argus](https://research.cbs.nl/casc/tau.htm) as well as the R-packages [sdcTable](https://cran.r-project.org/package=sdcTable)
and [sdcMicro](https://cran.r-project.org/package=sdcMicro).

Traditionally, SDC software operates on values of (aggregated) records, where it does not directly make use of spatial characteristics that might be present in the data.

`sdcSpatial` contains  functions to create spatial distribution maps, 
assess the risk of disclosure in locations on the map and to suppress 
or adjust locations with revealing sensitive values.

```{r}
library(sdcSpatial)
```

## Data

`sdcSpatial` contains two simulated datasets with realistic locations: 
`dwellings` and `enterprises`.
Lets have a look at the dataset `enterprises`.

```{r}
data("enterprises")
head(enterprises)
```

`enterprises` is a `SpatialPointsDataFrame` object, but `sdc_raster` works
equally well with `sf` and `data.frame` objects with point data (locations).

```{r}
summary(enterprises)
```

`enterprises` contains two simulated variables: `production` (numeric) and 
`fined` (logical), and we are interested in their spatial distribution.

## View the data 

Let's first plot the locations of enterprises:

```{r}
sp::plot(enterprises)
```
There are many locations and a lot of over-plotting or _occlusion_ is taking 
place: a better visualization method to reveal spatial patterns in this case 
is to create
a raster or density plot. Since we are interested in the spatial distribution
of `production` we would like to rasterize the data, which can be done with 
`raster::rasterize` or `ggplot2::geom_tile` but for didactic sake we use
`sdc_raster` to create a raster map with a 500m grid.

```{r}
production <- sdc_raster(enterprises, "production", r = 500)
plot(production, value="mean", sensitive=FALSE, main="mean production")
```
We have plotted the mean production, other stats are kept in the `production$value` object:

```{r}
raster::plot(production$value[[1:3]])
```

The important question is:

> Can we publish this map or does it contain sensitive values?

## Sensitive locations

Let us see how many of the values are sensitive:

```{r}
print(production)
```
Printing the `production` object shows that when we demand that a raster cell 
should at least have 10 observations 
(`min_count`) and its value should not be dominated by one enterprise 
(`max_risk`), then `r round(100*sensitivity_score(production))`% of the data in the map is sensitive!

For a 500m by 500m block a threshold of 10 enterprises is on the high side, 
so let us change that into 5:

```{r}
production$min_count <- 5
production$max_risk <- 0.9
# or equally 
production <- sdc_raster(enterprises, "production"
                        , r = 500, min_count = 5, max_risk = 0.9)
sensitivity_score(production)
```
The score dropped, but which cells are we talking about?

```{r}
plot(production)
sensitive_cells <- is_sensitive(production)
```

`sensitive_cells` is a `raster` which can be used for further inspection. 

## Reducing sensitivity

Let us try to reduce the sensitivity of the map using a smoothing method:

```{r}
production_smoothed <- protect_smooth(production, bw = 500)
plot(production_smoothed)
```
In this case smoothing reduced the number of sensitive locations drastically! 

Let us remove the remaining sensitive cells

```{r}
production_safe <- remove_sensitive(production_smoothed)
sensitivity_score(production_safe) # check, double check
```

We can improve upon the "blocky" map by using `raster::disaggregate`. We can 
plot the following:

```{r}
mean_production <- mean(production_safe)
mean_production <- raster::disaggregate(mean_production, 10, "bilinear")
# generated with R >= 3.6
# col <- hcl.colors(10, "YlOrRd", rev = TRUE)
col <- c("#FFFFC8", "#FEF1B2", "#FADC8A", "#F7C252", "#F5A400", "#F18000", 
         "#EB5500", "#D12D00", "#A90D00", "#7D0025")
raster::plot(mean_production, col=col)

# library(leaflet)
# leaflet() %>% 
#   leaflet::addTiles() %>% 
#   leaflet::addRasterImage(mean_production, colors = col, opacity = 0.5)
```


`protect_quadtree` is also a protecting method, which we demonstrate with
the variable `fined`. 

First we create a more fine grained (pun not intended) raster for the variable
`fined`. 

```{r}
fined <- sdc_raster(enterprises, "fined", min_count = 5, r = 200, max_risk = 0.8)
print(fined)
```

Which is rather sensitive, let us have a look at the locations:
```{r}
# col <- hcl.colors(10, rev=TRUE) # generated with R >= 3.6
col <- c("#FDE333", "#BBDD38", "#6CD05E", "#00BE7D", "#00A890"
        , "#008E98",  "#007094", "#185086", "#422C70", "#4B0055")
plot(fined, "mean", col=col)
```

The quadtree method aggregates sensitive cells with its 3 neighbors and does this
recursively: the result is as follows:

```{r}
fined_qt <- protect_quadtree(fined)
plot(fined_qt, col=col)
```
which has a sensitivity score of `r sensitivity_score(fined_qt)`.

The method has the advantage of locally selecting the necessary resolution to 
suppress sensitive values, while the `protect_smooth` method uses a fixed bandwidth.

The protection result is blocky in comparison with the smoothing method, but safer if you 
look at the sensitive cells in high fined areas.

```{r}
fined_smooth <- protect_smooth(fined, bw = 500, keep_resolution=FALSE)
plot(fined_smooth, col = col)
sensitivity_score(fined_smooth)
```


## Thanks to `raster`

`sdcSpatial` builds heavily upon the excellent [raster](https://cran.r-project.org/package=raster) package: it creates `raster` maps and uses the machinery of `raster` to calculate sensitivity
and to apply protection methods to raster maps.