---
title: "Find errors in data"
output:
  rmarkdown::html_vignette:
    df_print: kable
vignette: >
  %\VignetteIndexEntry{Find errors in data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(Ncpus = 1)
```

## Intro

`errorlocate` uses validation rules from the package `validate` to locate faulty values in observations (or in database slang: erroneous _fields_ in _records_).

It follows this simple recipe (Fellegi-Holt):

- Check if a record is valid (using the supplied validation rules).
- If it is not valid, adjust the minimum number of values needed to make it valid.

`errorlocate` does this by translating the problem into a mixed integer problem (see `vignette("inspect_mip", package="errorlocate")`) and solving it using `lpSolveAPI`.

## Methods

`errorlocate` has two main functions:

- `locate_errors` for detecting errors
- `replace_errors` for replacing faulty values with `NA`

```{r setup}
library(validate)
library(errorlocate)
```

Let's start with a simple example. We have a rule that age must be positive:

```{r}
rules <- validator(age > 0)
```

And we have the following data set:

```{r}
"age, income
-10, 0
15, 2000
25, 3000
NA, 1000
" -> csv
d <- read.csv(textConnection(csv), strip.white = TRUE)
```

```{r, echo = FALSE}
d
```

```{r}
le <- locate_errors(d, rules)
summary(le)
```

`summary(le)` gives an overview of the errors found in this data set.
The complete error listing can be found with:

```{r}
le$errors
```

This shows that record 1 has a faulty value for `age`.

Suppose we expand our rules:

```{r}
rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)
```

With `validate::confront` we can see that rule `r2` is violated (by record 2).

```{r}
summary(confront(d, rules))
```

What errors will be found by `locate_errors`?

```{r}
set.seed(1)
le <- locate_errors(d, rules)
le$errors
```

It now detects that `age` in observation 2 is also faulty, since it violates the second rule.
Note that we use `set.seed`. This is needed because in this example either `age` or `income` can be considered faulty; `set.seed` ensures that the procedure is reproducible.

With `replace_errors` we can remove the errors (the resulting missing values still need to be imputed).

```{r}
d_fixed <- replace_errors(d, le)
summary(confront(d_fixed, rules))
```

Here `replace_errors` has set all faulty values to `NA`:

```{r}
d_fixed
```

### Weights

`locate_errors` allows for supplying weights for the variables.
It is common that the quality of the observed variables differs.
When we have more trust in `age`, we can give it more weight, so that `locate_errors` chooses `income` when it has to decide between the two (record 2):

```{r}
set.seed(1) # good practice, although not needed in this example
weight <- c(age = 2, income = 1)
le <- locate_errors(d, rules, weight)
le$errors
```

Weights can be specified in different ways (see also `errorlocate::expand_weights`):

- not specified: all variables have weight 1.
- named `vector`: all records have the same set of weights. Unspecified columns have weight 1.
- named `matrix` or `data.frame` with the same dimensions as the data: weights are specified per record.
- Use `Inf` weights to fix a variable, so it won't be changed.

### Performance / Parallelisation

`locate_errors` solves a mixed integer problem. When the number of interactions between validation rules is large, finding an optimal solution can become computationally intensive. Both `locate_errors` and `replace_errors` have a parallelisation option, `Ncpus`, which makes use of multiple processors. The `$duration` property of each solution gives the time (in seconds) spent finding a solution for each record. This time can be restricted using the argument `timeout` (in seconds).

```{r}
# duration is in seconds.
le$duration
```
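
As a minimal sketch of how the two arguments mentioned above might be passed, the block below rebuilds the example data as a `data.frame` and calls `locate_errors` with `Ncpus` and `timeout` set explicitly. The values `Ncpus = 2` and `timeout = 10` are illustrative assumptions, not recommendations; suitable values depend on the machine and on the rule set.

```r
library(validate)
library(errorlocate)

# Same rules and records as in the example above
rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)

d <- data.frame(
  age    = c(-10, 15, 25, NA),
  income = c(0, 2000, 3000, 1000)
)

set.seed(1)

# Ncpus and timeout are illustrative values (assumptions):
# request 2 processors and restrict the solver to 10 seconds per record
le <- locate_errors(d, rules, Ncpus = 2, timeout = 10)

# time spent per record, in seconds
le$duration
```

For a data set this small the extra processors are unlikely to pay off; the call only shows where the arguments go.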