Title: | Checking and Simplifying Validation Rule Sets |
---|---|
Description: | Rule sets with validation rules may contain redundancies or contradictions. Functions for finding redundancies and problematic rules are provided, given a set of rules formulated with 'validate'. |
Authors: | Edwin de Jonge [aut, cre] , Mark van der Loo [aut], Jacco Daalmans [ctb] |
Maintainer: | Edwin de Jonge <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.5.2 |
Built: | 2024-11-22 04:52:12 UTC |
Source: | https://github.com/data-cleaning/validatetools |
Detect viable domains for categorical variables
detect_boundary_cat(x, ..., as_df = FALSE)
x | validator object |
... | not used |
as_df | return result as data.frame (before 0.4.5) |
data.frame with columns $variable, $value, $min, $max. Each row is a category/value of a categorical variable.
Other feasibility: detect_boundary_num(), detect_infeasible_rules(), is_contradicted_by(), is_infeasible(), make_feasible()
rules <- validator(
  x >= 1,
  x + y <= 10,
  y >= 6
)
detect_boundary_num(rules)

rules <- validator(
  job %in% c("yes", "no"),
  if (job == "no") income == 0,
  income > 0
)
detect_boundary_cat(rules)
For each numerical variable in a validation rule set, detect its minimum and maximum values.
This allows for manual rule set checking: does rule set x overly constrain numerical values?
detect_boundary_num(x, eps = 1e-08, ...)
x | validator object |
eps | detected fixed values will have this precision. |
... | currently not used |
This procedure only finds minimum and maximum values, but misses gaps.
data.frame with columns "variable", "lowerbound", "upperbound".
Statistical Data Cleaning with R (2017), Chapter 8, M. van der Loo, E. de Jonge
Simplifying constraints in data editing (2015). Technical Report 2015|18, Statistics Netherlands, J. Daalmans
Other feasibility: detect_boundary_cat(), detect_infeasible_rules(), is_contradicted_by(), is_infeasible(), make_feasible()
rules <- validator(
  x >= 1,
  x + y <= 10,
  y >= 6
)
detect_boundary_num(rules)

rules <- validator(
  job %in% c("yes", "no"),
  if (job == "no") income == 0,
  income > 0
)
detect_boundary_cat(rules)
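The bounds table can also be inspected programmatically. The following is a minimal sketch, assuming the column names "variable", "lowerbound" and "upperbound" described above; the `width` column and the 1e-8 tolerance are illustrative choices, not part of the package.

```r
library(validate)
library(validatetools)

rules <- validator(
  x >= 1,
  x + y <= 10,
  y >= 6
)

# Bounds implied by the full rule set, one row per numerical variable
bounds <- detect_boundary_num(rules)

# Width of the allowed interval for each variable (illustrative helper column)
bounds$width <- bounds$upperbound - bounds$lowerbound

# Variables whose interval has (nearly) collapsed to a single value
bounds[bounds$width < 1e-8, ]
```

A very narrow or collapsed interval is a hint that the rule set may constrain a variable more than intended. Gaps inside an interval are not visible in this table.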
Detects variables that have a fixed value in the rule set. To simplify a rule set, these variables can be substituted with their value.
detect_fixed_variables(x, eps = x$options("lin.eq.eps"), ...)
x | validator object |
eps | detected fixed values will have this precision. |
... | not used. |
Other redundancy: detect_redundancy(), is_implied_by(), remove_redundancy(), simplify_fixed_variables(), simplify_rules()
library(validate)
rules <- validator(
  x >= 0,
  x <= 0
)
detect_fixed_variables(rules)
simplify_fixed_variables(rules)

rules <- validator(
  x1 + x2 + x3 == 0,
  x1 + x2 >= 0,
  x3 >= 0
)
simplify_fixed_variables(rules)
Detect which rules cause infeasibility. This method tries to remove the minimum number of rules needed to make the system mathematically feasible. Note that this may not result in your desired system, because some rules may be more important to you than others. This can be mitigated by supplying weights for the rules. The default weight is 1.
detect_infeasible_rules(x, weight = numeric(), ...)
x | validator object |
weight | optional named |
... | not used |
character with the names of the rules that are causing infeasibility.
Other feasibility: detect_boundary_cat(), detect_boundary_num(), is_contradicted_by(), is_infeasible(), make_feasible()
rules <- validator(x > 0)
is_infeasible(rules)

rules <- validator(
  rule1 = x > 0,
  rule2 = x < 0
)
is_infeasible(rules)
detect_infeasible_rules(rules)
make_feasible(rules)

# find out the conflict with this rule
is_contradicted_by(rules, "rule1")
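Weights can be used to protect rules that matter more. The sketch below assumes that `weight` accepts a named numeric vector keyed by rule name (consistent with the `numeric()` default in the usage) and that rules not named in it keep the default weight of 1.

```r
library(validate)
library(validatetools)

rules <- validator(
  rule1 = x > 0,
  rule2 = x < 0
)

# Give rule1 a higher weight than the default of 1, so that removing it
# is more costly than removing rule2 (assumed: names refer to rule names)
culprits <- detect_infeasible_rules(rules, weight = c(rule1 = 10))
culprits
```

The returned character vector names the rules proposed for removal; compare it with the unweighted call to see the effect of the weights.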
Detect redundancies in a rule set.
detect_redundancy(x, ...)
x | validator object |
... | not used. |
For removal of duplicate rules, simplify
Other redundancy: detect_fixed_variables(), is_implied_by(), remove_redundancy(), simplify_fixed_variables(), simplify_rules()
rules <- validator(
  rule1 = x > 1,
  rule2 = x > 2
)

# rule1 is superfluous
remove_redundancy(rules)

# rule1 is implied by rule2
is_implied_by(rules, "rule1")

rules <- validator(
  rule1 = x > 2,
  rule2 = x > 2
)

# rule1 and rule2 are identical: the oldest rule wins
remove_redundancy(rules)

# Note that detection flags both rules!
detect_redundancy(rules)
expect values
expect_values(values, weights, ...)
values | named list of values. |
weights | named numeric of equal length as values. |
... | not used |
Check if rules are categorical
is_categorical(x, ...)
x | validator object |
... | not used |
logical indicating which rules are purely categorical/logical
v <- validator(
  A %in% c("a1", "a2"),
  B %in% c("b1", "b2"),
  if (A == "a1") B == "b1",
  y > x
)
is_categorical(v)
Check if rules are conditional rules
is_conditional(rules, ...)
rules | validator object containing validation rules |
... | not used |
logical indicating which rules are conditional
v <- validator(
  A %in% c("a1", "a2"),
  B %in% c("b1", "b2"),
  if (A == "a1") x > 1,     # conditional
  if (y > 0) x >= 0,        # conditional
  if (A == "a1") B == "b1"  # categorical
)
is_conditional(v)
For a contradicting rule, find out which rules conflict with it. This helps in determining and assessing conflicts in rule sets: which of the rules should stay and which should go?
is_contradicted_by(x, rule_name)
x | validator object |
rule_name | |
character with conflicting rules.
Other feasibility: detect_boundary_cat(), detect_boundary_num(), detect_infeasible_rules(), is_infeasible(), make_feasible()
rules <- validator(x > 0)
is_infeasible(rules)

rules <- validator(
  rule1 = x > 0,
  rule2 = x < 0
)
is_infeasible(rules)
detect_infeasible_rules(rules)
make_feasible(rules)

# find out the conflict with this rule
is_contradicted_by(rules, "rule1")
Find out which rules are causing rule_name(s) to be redundant.
is_implied_by(x, rule_name, ...)
x | validator object |
rule_name | |
... | not used |
character with the names of the rules that cause the implication.
Other redundancy: detect_fixed_variables(), detect_redundancy(), remove_redundancy(), simplify_fixed_variables(), simplify_rules()
rules <- validator(
  rule1 = x > 1,
  rule2 = x > 2
)

# rule1 is superfluous
remove_redundancy(rules)

# rule1 is implied by rule2
is_implied_by(rules, "rule1")

rules <- validator(
  rule1 = x > 2,
  rule2 = x > 2
)

# rule1 and rule2 are identical: the oldest rule wins
remove_redundancy(rules)

# Note that detection flags both rules!
detect_redundancy(rules)
An infeasible rule set cannot be satisfied by any data because of internal contradictions. This function checks whether the record-wise linear, categorical and conditional rules in a rule set are consistent.
is_infeasible(x, ...)
x | validator object |
... | not used |
TRUE or FALSE
Other feasibility: detect_boundary_cat(), detect_boundary_num(), detect_infeasible_rules(), is_contradicted_by(), make_feasible()
rules <- validator(x > 0)
is_infeasible(rules)

rules <- validator(
  rule1 = x > 0,
  rule2 = x < 0
)
is_infeasible(rules)
detect_infeasible_rules(rules)
make_feasible(rules)

# find out the conflict with this rule
is_contradicted_by(rules, "rule1")
Check which rules are linear rules.
is_linear(x, ...)
x | validator object |
... | not used |
logical indicating which rules are (purely) linear.
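The three rule-type checks each return one logical value per rule, so they can be used to split a rule set by type. This is a minimal sketch; it assumes that a validator object can be subset with a logical index via `[`, which is behaviour of the 'validate' package rather than something documented here, and the variable names are illustrative.

```r
library(validate)
library(validatetools)

v <- validator(
  A %in% c("a1", "a2"),
  y > x,
  if (A == "a1") x > 1
)

# One logical per rule for each rule type
lin  <- is_linear(v)
cat_ <- is_categorical(v)
cond <- is_conditional(v)

# Split the rule set by type (assumes logical subsetting of validator objects)
linear_rules      <- v[lin]
categorical_rules <- v[cat_]
conditional_rules <- v[cond]
```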
Make an infeasible system feasible by removing the minimum (weighted) number of rules, such that the remaining rules are not conflicting.
This function uses detect_infeasible_rules for determining the rules to be removed.
make_feasible(x, ...)
x | validator object |
... | passed to |
validator object with feasible rules.
Other feasibility: detect_boundary_cat(), detect_boundary_num(), detect_infeasible_rules(), is_contradicted_by(), is_infeasible()
rules <- validator(x > 0)
is_infeasible(rules)

rules <- validator(
  rule1 = x > 0,
  rule2 = x < 0
)
is_infeasible(rules)
detect_infeasible_rules(rules)
make_feasible(rules)

# find out the conflict with this rule
is_contradicted_by(rules, "rule1")
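A typical repair step checks feasibility, repairs the rule set, and then verifies the result. The sketch below assumes that arguments given through `...` are forwarded to detect_infeasible_rules (the description above says that function is used to select the rules to remove), so that `weight` can be supplied here; treat that forwarding as an assumption.

```r
library(validate)
library(validatetools)

rules <- validator(
  rule1 = x > 0,
  rule2 = x < 0
)

repaired <- rules
if (is_infeasible(rules)) {
  # weight is assumed to be forwarded to detect_infeasible_rules();
  # rule1 is made more costly to remove than rule2
  repaired <- make_feasible(rules, weight = c(rule1 = 10))
}

# The repaired rule set should no longer be infeasible
is_infeasible(repaired)
```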
Simplify a rule set by removing redundant rules
remove_redundancy(x, ...)
x | validator object |
... | not used |
simplified validator object, in which redundant rules are removed.
Other redundancy: detect_fixed_variables(), detect_redundancy(), is_implied_by(), simplify_fixed_variables(), simplify_rules()
rules <- validator(
  rule1 = x > 1,
  rule2 = x > 2
)

# rule1 is superfluous
remove_redundancy(rules)

# rule1 is implied by rule2
is_implied_by(rules, "rule1")

rules <- validator(
  rule1 = x > 2,
  rule2 = x > 2
)

# rule1 and rule2 are identical: the oldest rule wins
remove_redundancy(rules)

# Note that detection flags both rules!
detect_redundancy(rules)
Conditional rules may be constrained by the other rules in a validation rule set. This procedure tries to simplify conditional statements.
simplify_conditional(x, ...)
x | validator object |
... | not used. |
validator object with the simplified rule set.
TODO non-constraining, non-relaxing
library(validate)

# non-relaxing clause
rules <- validator(
  r1 = if (x > 1) y > 3,
  r2 = y < 2
)
# y > 3 is always FALSE so r1 can be simplified
simplify_conditional(rules)

# non-constraining clause
rules <- validator(
  r1 = if (x > 0) y > 0,
  r2 = if (x < 1) y > 1
)
simplify_conditional(rules)

rules <- validator(
  r1 = if (A == "a1") x > 0,
  r2 = if (A == "a2") x > 1,
  r3 = A == "a1"
)
simplify_conditional(rules)
Detect variables of which the values are restricted to a single value by the rule set. Simplify the rule set by replacing fixed variables with these values.
simplify_fixed_variables(x, eps = 1e-08, ...)
x | validator object |
eps | detected fixed values will have this precision. |
... | passed to |
validator object in which
Other redundancy: detect_fixed_variables(), detect_redundancy(), is_implied_by(), remove_redundancy(), simplify_rules()
library(validate)
rules <- validator(
  x >= 0,
  x <= 0
)
detect_fixed_variables(rules)
simplify_fixed_variables(rules)

rules <- validator(
  x1 + x2 + x3 == 0,
  x1 + x2 >= 0,
  x3 >= 0
)
simplify_fixed_variables(rules)
Simplifies a rule set by applying different simplification methods. This is a convenience function that works in common cases. The following simplification methods are executed:
substitute_values: fill in any parameters that are supplied via .values or ... .
simplify_fixed_variables: find out if there are fixed values. If this is the case, they are substituted.
simplify_conditional: simplify conditional statements by removing clauses that are superfluous.
remove_redundancy: remove redundant rules.
For more control, these methods can be called separately.
simplify_rules(.x, .values = list(...), ...)
.x | validator object |
.values | optional named list with values that will be substituted. |
... | parameters that will be used to substitute values. |
Other redundancy: detect_fixed_variables(), detect_redundancy(), is_implied_by(), remove_redundancy(), simplify_fixed_variables()
rules <- validator(
  x > 0,
  if (x > 0) y == 1,
  A %in% c("a1", "a2"),
  if (A == "a1") y > 1
)
simplify_rules(rules)
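For more control, the individual steps can be run one at a time so that the intermediate rule sets can be inspected. This is a minimal sketch that follows the order listed above; the substitute_values step is skipped because no values are supplied, and the names `step1` to `step3` are illustrative. It is not guaranteed to reproduce the internals of simplify_rules exactly.

```r
library(validate)
library(validatetools)

rules <- validator(
  x > 0,
  if (x > 0) y == 1,
  A %in% c("a1", "a2"),
  if (A == "a1") y > 1
)

# Step 1: substitute variables that the rule set fixes to a single value
step1 <- simplify_fixed_variables(rules)

# Step 2: drop superfluous clauses from conditional rules
step2 <- simplify_conditional(step1)

# Step 3: remove rules that are implied by other rules
step3 <- remove_redundancy(step2)
step3
```

Printing `step1` and `step2` shows which method was responsible for each change, which is harder to see from the combined call.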
Substitute values into expressions, thereby simplifying the rule set. Rules that evaluate to TRUE because of the substitution are removed.
substitute_values(.x, .values = list(...), ..., .add_constraints = TRUE)
.x | validator object |
.values | (optional) named list with values for variables to substitute |
... | alternative way of supplying values for variables (see examples). |
.add_constraints | |
library(validate)
rules <- validator(
  rule1 = z > 1,
  rule2 = y > z
)
# rule1 is dropped, since it is always true
substitute_values(rules, list(z = 2))

# you can also supply the values as separate parameters
substitute_values(rules, z = 2)

# you can choose to not add substituted values as a constraint
substitute_values(rules, z = 2, .add_constraints = FALSE)

rules <- validator(
  rule1 = if (gender == "male") age >= 18
)
substitute_values(rules, gender = "male")
substitute_values(rules, gender = "female")
Translate linear rules into an LP problem
translate_mip_lp(rules, objective = NULL, eps = 0.001)
rules | mip rules |
objective | function |
eps | accuracy for equality/inequality |
validatetools is a utility package for managing validation rule sets that are defined with validate. In production systems validation rule sets tend to grow organically and accumulate redundant or (partially) contradictory rules. 'validatetools' helps to identify problems with large rule sets and includes simplification methods for resolving issues.
The following methods allow for problem detection:
is_infeasible: checks a rule set for feasibility. An infeasible system must be corrected to be useful.
detect_boundary_num: shows for each numerical variable the allowed range of values.
detect_boundary_cat: shows for each categorical variable the allowed range of values.
detect_fixed_variables: shows variables whose value is fixed by the rule set.
detect_redundancy: shows which rules are already implied by other rules.
The following methods detect possible simplifications and apply them to a rule set:
substitute_values: replace variables with constants.
simplify_fixed_variables: substitute the fixed variables with their values in a rule set.
simplify_conditional: remove redundant (parts of) conditional rules.
remove_redundancy: remove redundant rules.
Statistical Data Cleaning with Applications in R, Mark van der Loo and Edwin de Jonge, ISBN: 978-1-118-89715-7