Title: | Validate Data in a Database using 'validate' |
---|---|
Description: | Check whether records in a database table are valid using validation rules in R syntax specified with R package 'validate'. R validation checks are automatically translated to SQL using 'dbplyr'. |
Authors: | Edwin de Jonge [aut, cre] |
Maintainer: | Edwin de Jonge <[email protected]> |
License: | GPL-3 |
Version: | 0.3.2.9000 |
Built: | 2024-12-01 06:41:48 UTC |
Source: | https://github.com/data-cleaning/validatedb |
See the number of valid and invalid checks either by rule or by record.
## S3 method for class 'tbl_validation' aggregate(x, by = c("rule", "record", "key"), ...)
## S3 method for class 'tbl_validation' aggregate(x, by = c("rule", "record", "key"), ...)
x |
|
by |
either by "rule" or by "record" |
... |
not used |
The result of a confront()
on a db tbl
results in a lazy squery. That
is it builds a query without executing it. To store the result in the database
use compute()
or values()
.
A dbplyr::tbl_dbi()
object that represents the aggregation query
(to be executed) on the database.
income <- data.frame(id = 1:2, age=c(12,35), salary = c(1000,NA)) con <- dbplyr::src_memdb() tbl_income <- dplyr::copy_to(con, income, overwrite=TRUE) print(tbl_income) # Let's define a rule set and confront the table with it: rules <- validator( is_adult = age >= 18 , has_income = salary > 0 ) # and confront! # in general with a db table it is handy to use a key cf <- confront(tbl_income, rules, key="id") aggregate(cf, by = "rule") aggregate(cf, by = "record") # to tweak performance of the db query the following options are available # 1) store validation result in db cf <- confront(tbl_income, rules, key="id", compute = TRUE) # or identical cf <- confront(tbl_income, rules, key="id") cf <- compute(cf) # 2) Store the validation sparsely cf_sparse <- confront(tbl_income, rules, key="id", sparse=TRUE ) show_query(cf_sparse) values(cf_sparse, type="tbl")
income <- data.frame(id = 1:2, age=c(12,35), salary = c(1000,NA)) con <- dbplyr::src_memdb() tbl_income <- dplyr::copy_to(con, income, overwrite=TRUE) print(tbl_income) # Let's define a rule set and confront the table with it: rules <- validator( is_adult = age >= 18 , has_income = salary > 0 ) # and confront! # in general with a db table it is handy to use a key cf <- confront(tbl_income, rules, key="id") aggregate(cf, by = "rule") aggregate(cf, by = "record") # to tweak performance of the db query the following options are available # 1) store validation result in db cf <- confront(tbl_income, rules, key="id", compute = TRUE) # or identical cf <- confront(tbl_income, rules, key="id") cf <- compute(cf) # 2) Store the validation sparsely cf_sparse <- confront(tbl_income, rules, key="id", sparse=TRUE ) show_query(cf_sparse) values(cf_sparse, type="tbl")
Retrieve validation results as a data.frame
## S3 method for class 'tbl_validation' as.data.frame(x, row.names = NULL, optional = FALSE, ...)
## S3 method for class 'tbl_validation' as.data.frame(x, row.names = NULL, optional = FALSE, ...)
x |
|
row.names |
ignored |
optional |
ignored |
... |
ignored |
data.frame
, result of the query on the database.
# create a table in a database income <- data.frame(id = letters[1:2], age=c(12,35), salary = c(1000,NA)) con <- dbplyr::src_memdb() tbl_income <- dplyr::copy_to(con, income, overwrite=TRUE) # Let's define a rule set and confront the table with it: rules <- validator( is_adult = age >= 18 , has_income = salary > 0 , mean_age = mean(age,na.rm=TRUE) > 20 ) # and confront! cf <- confront(tbl_income, rules, key = "id") as.data.frame(cf) # and now with a sparse result: cf <- confront(tbl_income, rules, key = "id", sparse=TRUE) as.data.frame(cf)
# create a table in a database income <- data.frame(id = letters[1:2], age=c(12,35), salary = c(1000,NA)) con <- dbplyr::src_memdb() tbl_income <- dplyr::copy_to(con, income, overwrite=TRUE) # Let's define a rule set and confront the table with it: rules <- validator( is_adult = age >= 18 , has_income = salary > 0 , mean_age = mean(age,na.rm=TRUE) > 20 ) # and confront! cf <- confront(tbl_income, rules, key = "id") as.data.frame(cf) # and now with a sparse result: cf <- confront(tbl_income, rules, key = "id", sparse=TRUE) as.data.frame(cf)
Checks whether validation rules are working on the database, and gives hints on non working rules.
check_rules(tbl, x, key = NULL)
check_rules(tbl, x, key = NULL)
tbl |
|
x |
|
key |
|
validatedb
translates validation rules using dbplyr
on a database. Every
database engine is different, so it may happen that some validation rules
will not work. This function helps in finding out why rules are not working.
In some (easy to fix) cases, this may be due to:
using variables that are not present in the table
using a different value type than the column in the database, e.g.using an integer value, while the database column is of type "varchar".
To debug your rules, a useful thing to do is first to test the rules on a small sub set of the table
e.g.
tbl |> head() |> # debugging on db as.data.frame() |> # debugging "rules", do they work on a data.frame confront(rules, key = "id") |> summary()
But it can also be that some R functions are not available on the database, in which case you have to reformulate the rule.
data.frame
with name
, rule
, working
, sql
for each rule.
person <- dbplyr::memdb_frame(id = letters[1:2], age = c(12, 20)) rules <- validator(age >= 18) check_rules(person, rules, key = "id") # use the result of check_rules to find out more on the translation res <- check_rules(person, rules, key = "id") print(res[-4]) writeLines(res$sql)
person <- dbplyr::memdb_frame(id = letters[1:2], age = c(12, 20)) rules <- validator(age >= 18) check_rules(person, rules, key = "id") # use the result of check_rules to find out more on the translation res <- check_rules(person, rules, key = "id") print(res[-4]) writeLines(res$sql)
Stores the validation result in the db using
the dplyr::compute()
of the db back-end.
This method changes the tbl_validation
object!
Note that for most back-ends the default setting is
a temporary table with a random name.
## S3 method for class 'tbl_validation' compute(x, name, ...)
## S3 method for class 'tbl_validation' compute(x, name, ...)
x |
|
name |
optional, when omitted, a random name is used. |
... |
passed through to |
A dbplyr::tbl_dbi()
object that refers to the computed (temporary)
table in the database. See dplyr::compute()
.
Other tbl_validation:
tbl_validation-class
Create a sparse confrontation query. Only errors and missing are stored.
This stores all results
of a tbl
validation in a table with length(rules)
columns and nrow(tbl)
rows. Note that the result of this function is a (lazy) query object that
still needs to be executed in the database, e.g. with dplyr::collect()
, dplyr::collapse()
or
dplyr::compute()
.
confront_tbl_sparse(tbl, x, key, union_all = TRUE, check_rules = TRUE)
confront_tbl_sparse(tbl, x, key, union_all = TRUE, check_rules = TRUE)
tbl |
|
x |
|
key |
|
union_all |
if |
check_rules |
if |
The return value of the function is a list with:
$query
: A dbplyr::tbl_dbi()
object that refers to the confrontation query.
$errors
: The validation rules that are not working on the database
$working
: A logical
with which expression are working on the database.
$exprs
: All validation expressions.
A object with the necessary information: see details
Other validation:
tbl_validation-class
,
values,tbl_validation-method
# create a table in a database income <- data.frame(id = letters[1:2], age=c(12,35), salary = c(1000,NA)) con <- dbplyr::src_memdb() tbl_income <- dplyr::copy_to(con, income, overwrite=TRUE) print(tbl_income) # Let's define a rule set and confront the table with it: rules <- validator( is_adult = age >= 18 , has_income = salary > 0 , mean_age = mean(age,na.rm=TRUE) > 20 ) # and confront! (we have to use a key, because a db...) cf <- confront(tbl_income, rules, key = "id") print(cf) summary(cf) # Values (i.e. validations on the table) can be retrieved like in `validate` # with`type="matrix"` (simplify = TRUE) values(cf, type = "matrix") # But often this seems more handy: values(cf, type = "tbl") # We can see the sql code by using `show_query`: show_query(cf) # identical show_query(values(cf, type = "tbl")) # sparse results in db (that the default) values(cf, type="tbl", sparse=TRUE) # or if you like data.frames values(cf, type="data.frame", sparse=TRUE)
# create a table in a database income <- data.frame(id = letters[1:2], age=c(12,35), salary = c(1000,NA)) con <- dbplyr::src_memdb() tbl_income <- dplyr::copy_to(con, income, overwrite=TRUE) print(tbl_income) # Let's define a rule set and confront the table with it: rules <- validator( is_adult = age >= 18 , has_income = salary > 0 , mean_age = mean(age,na.rm=TRUE) > 20 ) # and confront! (we have to use a key, because a db...) cf <- confront(tbl_income, rules, key = "id") print(cf) summary(cf) # Values (i.e. validations on the table) can be retrieved like in `validate` # with`type="matrix"` (simplify = TRUE) values(cf, type = "matrix") # But often this seems more handy: values(cf, type = "tbl") # We can see the sql code by using `show_query`: show_query(cf) # identical show_query(values(cf, type = "tbl")) # sparse results in db (that the default) values(cf, type="tbl", sparse=TRUE) # or if you like data.frames values(cf, type="data.frame", sparse=TRUE)
tbl
with validator
rules.Confront dbplyr::tbl_dbi()
objects with validate::validator()
rules, making it
possible to execute validator()
rules on database tables. Validation results
can be stored in the db or retrieved into R.
confront.tbl_sql(tbl, x, ref, key, sparse = FALSE, compute = FALSE, ...) ## S4 method for signature 'ANY,validator,ANY' confront(dat, x, ref, key = NULL, sparse = FALSE, ...)
confront.tbl_sql(tbl, x, ref, key, sparse = FALSE, compute = FALSE, ...) ## S4 method for signature 'ANY,validator,ANY' confront(dat, x, ref, key = NULL, sparse = FALSE, ...)
tbl |
|
x |
|
ref |
reference object (not working) |
key |
|
sparse |
|
compute |
|
... |
passed through to |
dat |
an object of class 'tbl_sql“. |
validatedb
builds upon dplyr
and dbplyr
, so it works on all databases
that have a dbplyr compatible database driver (DBI / odbc).
validatedb
translates validator
rules into dplyr
commands resulting in
a lazy query object. The result of a validation can be stored in the database
using compute
or retrieved into R with values
.
a tbl_validation()
object, containing the confrontation query and processing information.
Other validation:
tbl_validation-class
,
values,tbl_validation-method
# create a table in a database income <- data.frame(id = letters[1:2], age=c(12,35), salary = c(1000,NA)) con <- dbplyr::src_memdb() tbl_income <- dplyr::copy_to(con, income, overwrite=TRUE) print(tbl_income) # Let's define a rule set and confront the table with it: rules <- validator( is_adult = age >= 18 , has_income = salary > 0 , mean_age = mean(age,na.rm=TRUE) > 20 ) # and confront! (we have to use a key, because a db...) cf <- confront(tbl_income, rules, key = "id") print(cf) summary(cf) # Values (i.e. validations on the table) can be retrieved like in `validate` # with`type="matrix"` (simplify = TRUE) values(cf, type = "matrix") # But often this seems more handy: values(cf, type = "tbl") # We can see the sql code by using `show_query`: show_query(cf) # identical show_query(values(cf, type = "tbl")) # sparse results in db (that the default) values(cf, type="tbl", sparse=TRUE) # or if you like data.frames values(cf, type="data.frame", sparse=TRUE)
# create a table in a database income <- data.frame(id = letters[1:2], age=c(12,35), salary = c(1000,NA)) con <- dbplyr::src_memdb() tbl_income <- dplyr::copy_to(con, income, overwrite=TRUE) print(tbl_income) # Let's define a rule set and confront the table with it: rules <- validator( is_adult = age >= 18 , has_income = salary > 0 , mean_age = mean(age,na.rm=TRUE) > 20 ) # and confront! (we have to use a key, because a db...) cf <- confront(tbl_income, rules, key = "id") print(cf) summary(cf) # Values (i.e. validations on the table) can be retrieved like in `validate` # with`type="matrix"` (simplify = TRUE) values(cf, type = "matrix") # But often this seems more handy: values(cf, type = "tbl") # We can see the sql code by using `show_query`: show_query(cf) # identical show_query(values(cf, type = "tbl")) # sparse results in db (that the default) values(cf, type="tbl", sparse=TRUE) # or if you like data.frames values(cf, type="data.frame", sparse=TRUE)
Write sql statements of a tbl confrontation.
dump_sql(x, sql_file = stdout(), sparse = x$sparse, ...)
dump_sql(x, sql_file = stdout(), sparse = x$sparse, ...)
x |
|
sql_file |
filename/connection where the sql code should be written to. |
sparse |
not used |
... |
not used |
tests for each rule if it can be executed on the database
rule_works_on_tbl(tbl, x, key = NULL, show_errors = FALSE)
rule_works_on_tbl(tbl, x, key = NULL, show_errors = FALSE)
tbl |
a |
x |
a |
key |
|
show_errors |
if |
logical
encoding which validation rules "work" on the database.
Shows the generated sql code for the validation of the tbl.
## S3 method for class 'tbl_validation' show_query(x, ..., sparse = x$sparse)
## S3 method for class 'tbl_validation' show_query(x, ..., sparse = x$sparse)
x |
|
... |
passed through. |
sparse |
|
Same result as dplyr::show_query, i.e. the SQL text of the query.
tbl
objectValidation information for a database tbl
, result of a confront.tbl_sql()
.
The tbl_validation
object contains all information needed for the confrontation
of validation rules with the data in the database table. It contains:
$query
: a dbplyr::tbl_dbi object with the query to be executed on the database
$tbl
: the dbplyr::tbl_dbi pointing to the table in the database
$key
: Whether there is a key column, and if so, what it is.
$record_based
: logical
with which rules are record based.
$exprs
: list of validation rule expressions
$working
: logical
, which of the rules work on the database. (whether the database supports this expression)
$errors
: list of validation rules that did not execute on the database.
$sparse
: If TRUE
the query default presented as a sparse validation object.
$subqueries
: list of sparse queries for each of the rules.
tbl_validation
object. See details.
Other validation:
confront.tbl_sql()
,
values,tbl_validation-method
Other tbl_validation:
compute.tbl_validation()
Retrieve the result of a validation/confrontation.
## S4 method for signature 'tbl_validation' values( x, simplify = type == "matrix", drop = FALSE, type = c("tbl", "matrix", "list", "data.frame"), sparse = x$sparse, ... )
## S4 method for signature 'tbl_validation' values( x, simplify = type == "matrix", drop = FALSE, type = c("tbl", "matrix", "list", "data.frame"), sparse = x$sparse, ... )
x |
|
simplify |
only use when |
drop |
not used at the moment |
type |
whether to return a list/matrix or to return a query on the database. |
sparse |
whether to show the results as a sparse query (only fails and |
... |
not used |
Since the validation is done on a database, there are multiple options for storing the result of the validation. The results show per record whether they are valid according to the validation rules supplied.
Use compute
(see confront.tbl_sql()
) to store the result in the database
Use sparse
to only calculate "fails" and "missings"
Default type "tbl" is that everything is "lazy", so the query and/or storage has to
be done explicitly by the user.
The other types execute the query and retrieve the result into R. When this
creates memory problems, the tbl
option is to be preferred.
Results for type
:
tbl
: a dbplyr::tbl_dbi object, pointing to the database
matrix
: a R matrix, similar to validate::values()
.
list
: a R list, similar to validate::values()
.
data.frame
: the result of tbl
stored in a data.frame
.
depending on type
the result is different, see details
Other validation:
confront.tbl_sql()
,
tbl_validation-class
# create a table in a database income <- data.frame(id = letters[1:2], age=c(12,35), salary = c(1000,NA)) con <- dbplyr::src_memdb() tbl_income <- dplyr::copy_to(con, income, overwrite=TRUE) print(tbl_income) # Let's define a rule set and confront the table with it: rules <- validator( is_adult = age >= 18 , has_income = salary > 0 , mean_age = mean(age,na.rm=TRUE) > 20 ) # and confront! (we have to use a key, because a db...) cf <- confront(tbl_income, rules, key = "id") print(cf) summary(cf) # Values (i.e. validations on the table) can be retrieved like in `validate` # with`type="matrix"` (simplify = TRUE) values(cf, type = "matrix") # But often this seems more handy: values(cf, type = "tbl") # We can see the sql code by using `show_query`: show_query(cf) # identical show_query(values(cf, type = "tbl")) # sparse results in db (that the default) values(cf, type="tbl", sparse=TRUE) # or if you like data.frames values(cf, type="data.frame", sparse=TRUE)
# create a table in a database income <- data.frame(id = letters[1:2], age=c(12,35), salary = c(1000,NA)) con <- dbplyr::src_memdb() tbl_income <- dplyr::copy_to(con, income, overwrite=TRUE) print(tbl_income) # Let's define a rule set and confront the table with it: rules <- validator( is_adult = age >= 18 , has_income = salary > 0 , mean_age = mean(age,na.rm=TRUE) > 20 ) # and confront! (we have to use a key, because a db...) cf <- confront(tbl_income, rules, key = "id") print(cf) summary(cf) # Values (i.e. validations on the table) can be retrieved like in `validate` # with`type="matrix"` (simplify = TRUE) values(cf, type = "matrix") # But often this seems more handy: values(cf, type = "tbl") # We can see the sql code by using `show_query`: show_query(cf) # identical show_query(values(cf, type = "tbl")) # sparse results in db (that the default) values(cf, type="tbl", sparse=TRUE) # or if you like data.frames values(cf, type="data.frame", sparse=TRUE)