Introduction to the Pipeline Data Validation Workflow (VALID-II)
Source: vignettes/VALID-II.Rmd
The second data validation workflow, VALID-II: Pipeline Data Validation, simplifies the process of checking data directly. There is no agent involved here; instead, we call validation functions directly on the data table objects. Because there is no agent, there is no report: the side effects are what matter in this workflow. We can trigger warnings, raise errors, or write log entries when specified failure thresholds are exceeded. The data simply passes through the validation functions (some data in, the same data out).
Where would we do this? When importing data, for instance, we could pass the incoming data through a few validation functions, possibly with customized threshold levels set (by default, any failing test units will result in an error). We could also use a set of validation functions further down the script on transformed data as a QA/QC measure. If bad data quality might be ruinous for a downstream report (especially in an automated context), it's better to stop the process through pointblank validation tests and get to the heart of the matter. And, just like every workflow in pointblank, the target table can be a data frame or tibble, a database table (a tbl_dbi object), or a Spark DataFrame (a tbl_spark object, by way of the sparklyr package).
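As a sketch of the database case: the same validation functions can be piped onto a tbl_dbi object. This example assumes the DBI, RSQLite, and dplyr packages are available and uses an in-memory SQLite database loaded with small_table (the table name and connection setup here are illustrative, not prescribed by pointblank).

```r
library(DBI)
library(dplyr)
library(pointblank)

# Illustrative setup: copy `small_table` into an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "small_table", small_table)

# The same validation functions operate on the `tbl_dbi` object;
# failing test units would stop the script here, just as with a data frame
tbl(con, "small_table") %>%
  col_vals_lt(a, value = 10) %>%
  col_vals_in_set(f, set = c("low", "mid", "high"))

dbDisconnect(con)
```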
An Example of This Workflow
Let’s adapt an example used in the VALID-I: Data Quality Reporting Workflow article to the pipeline data validation workflow:
small_table %>%
  col_is_posix(date_time) %>%
  col_vals_in_set(f, set = c("low", "mid", "high")) %>%
  col_vals_lt(a, value = 10) %>%
  col_vals_regex(b, regex = "^[0-9]-[a-z]{3}-[0-9]{3}$") %>%
  col_vals_between(d, left = 0, right = 5000)
## Error: Exceedance of failed test units where values in `d` should have been between `0` and `5000`.
## The `col_vals_between()` validation failed beyond the absolute threshold level (1).
## * failure level (1) >= failure threshold (1)
In terms of the expressions used in the pipeline, you might notice
that the create_agent()
and interrogate()
functions are absent. This is due to the secondary role of the
validation functions, where they can operate directly and immediately on
the data, acting as a sort of filter. If all validations in this
pipeline pass (i.e., there are no failing test units in any of the
validation steps), then the small_table
data is returned.
Otherwise, as it’s currently written, a stoppage will occur on any such
failure.
As it turns out, the above validation pipeline did result in an error. The stringent default threshold setting stops evaluation of the pipeline at the point of failure (one or more failing test units). This, in turn, stops the running script (a key consideration if the script is deployed and runs automatically on some sort of schedule). In this type of workflow we don't need to define any actions: pointblank automatically does the sensible thing by stopping evaluation and providing a stock message. So, at the first such stoppage due to a failing validation, R scripts will stop at that point and R Markdown documents will correspondingly cease to render.
Modifying the Behavior of Validation Failures
There are ways to counteract the stopping behavior seen in the previous example. In an R Markdown document, we could set the chunk option error = TRUE where we expect an error to occur due to validation failure (allowing execution to continue no matter what happens). That's what was done for the small_table example in this very document, giving us the error message printed below the input. Another way to disable evaluation at the step level is to use the active = FALSE option on every validation function that shouldn't be evaluated. This works for both R scripts and R Markdown documents, but it is quite different: we can't determine whether a validation passed or failed because the check isn't actually performed (it is simply skipped). It is useful, however, if you want to stage checks or run data validations in a sort of 'debug mode', since logical global variables can be passed to the active argument of specific validation functions.
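That 'debug mode' idea might look like the following sketch, where VALIDATE is a hypothetical global flag of our own devising (not part of pointblank): flipping it to FALSE skips the checks entirely.

```r
library(pointblank)

# Hypothetical global switch: set to FALSE during development to skip checks
VALIDATE <- TRUE

small_table %>%
  col_vals_lt(a, value = 10, active = VALIDATE) %>%
  col_vals_in_set(f, set = c("low", "mid", "high"), active = VALIDATE)
```

With active = FALSE the steps are not evaluated at all, so no passes, failures, or side effects occur for those steps.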
Another strategy is to not stop() but instead warn().
Using warn_on_fail() and stop_on_fail() functions to generate simple action_levels
There are two helper functions that are convenient for this workflow: warn_on_fail() and stop_on_fail(). These functions return an action_levels object that either warns or stops, doing so with informative warning or error messages. The default failure threshold for each is set to 1: a single failing test unit means that the WARN or FAIL condition will be met. The stop_on_fail() helper is (behind the scenes) applied by default when using validation functions in the VALID-II workflow, so the previous example is exactly the same as this expanded form:
small_table %>%
  col_is_posix(
    date_time,
    actions = stop_on_fail(stop_at = 1)
  ) %>%
  col_vals_in_set(
    f, set = c("low", "mid", "high"),
    actions = stop_on_fail(stop_at = 1)
  ) %>%
  col_vals_lt(
    a, value = 10,
    actions = stop_on_fail(stop_at = 1)
  ) %>%
  col_vals_regex(
    b, regex = "^[0-9]-[a-z]{3}-[0-9]{3}$",
    actions = stop_on_fail(stop_at = 1)
  ) %>%
  col_vals_between(
    d, left = 0, right = 5000,
    actions = stop_on_fail(stop_at = 1)
  )
## Error: Exceedance of failed test units where values in `d` should have been between `0` and `5000`.
## The `col_vals_between()` validation failed beyond the absolute threshold level (1).
## * failure level (1) >= failure threshold (1)
If we want to instead issue warnings, perhaps with less stringent failure thresholds in certain steps, the warn_on_fail() function provides a simple way to express this.
small_table %>%
  col_is_posix(
    date_time,
    actions = warn_on_fail()
  ) %>%
  col_vals_in_set(
    f, set = c("low", "mid", "high"),
    actions = warn_on_fail(warn_at = 0.2)
  ) %>%
  col_vals_lt(
    a, value = 10,
    actions = warn_on_fail(warn_at = 3)
  ) %>%
  col_vals_regex(
    b, regex = "^[0-9]-[a-z]{3}-[0-9]{3}$",
    actions = warn_on_fail(warn_at = 0.2)
  ) %>%
  col_vals_between(
    d, left = 0, right = 5000,
    actions = warn_on_fail(warn_at = 1)
  )
## Warning: Exceedance of failed test units where values in `d` should have been between `0` and `5000`.
## The `col_vals_between()` validation failed beyond the absolute threshold level (1).
## * failure level (1) >= failure threshold (1)
## # A tibble: 13 × 8
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
## 2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
## 3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
## 4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 NA 3892. FALSE mid
## 5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 284. TRUE low
## 6 2016-01-11 06:15:00 2016-01-11 4 2-dhe-923 4 3291. TRUE mid
## 7 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
## 8 2016-01-17 11:27:00 2016-01-17 4 5-boe-639 2 1036. FALSE low
## 9 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 10 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 11 2016-01-26 20:07:00 2016-01-26 4 2-dmx-010 7 834. TRUE low
## 12 2016-01-28 02:51:00 2016-01-28 2 7-dmx-010 8 108. FALSE low
## 13 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 NA 2230. TRUE high
While warn_on_fail() and stop_on_fail() are handy ways to generate action_levels objects for the actions argument of every validation function, they don't provide a way to specify custom actions. For that, we'll need to use the action_levels() function directly.
Using action_levels() for More Control
The action_levels() function can be as useful here as it is for the VALID-I workflow with an agent. The function creates an action_levels object which can have two roles: (1) specification of threshold failure levels for entering certain conditions (WARN, STOP, and NOTIFY), and (2) setting of actions (i.e., function calls to invoke) when entering a specific condition. The fns argument of action_levels() allows us to define those custom functions that are evaluated upon entering each of the three states (and this acts at the 'step' level, per each validation function).
Compared to the VALID-I workflow, which deals more with reporting, having actions triggered by failures in the VALID-II workflow is probably more useful and important. We can imagine a situation where an R script is deployed with data validation interspersed throughout. Depending on the deployment, we may desire a hard stop (affecting the downstream components), or, we may want a softer approach with warning and logging.
Let's try a hybrid approach where each of the three available conditions has a failure threshold level set and an associated function to invoke. The functions to invoke at each condition can be whatever makes sense for the workflow (i.e., we don't have to issue warnings in the WARN condition if we want something else). Here, we will use a warning() for WARN, a stop() for STOP, and a logging function (log4r_step()) for NOTIFY. Here's how we might create such an action_levels object with action_levels():
al <-
  action_levels(
    warn_at = 0.1,
    stop_at = 0.2,
    notify_at = 0.3,
    fns = list(
      warn = ~ warning("WARN threshold exceeded."),
      stop = ~ stop("STOP threshold exceeded."),
      notify = ~ log4r_step(x)
    )
  )
Once we've assigned the action_levels object (in this case, to al), we can print it to get a summary of the settings.
al
## -- The `action_levels` settings
## WARN failure threshold of 0.1 of all test units.
## \fns\ ~ warning("WARN threshold exceeded.")
## STOP failure threshold of 0.2 of all test units.
## \fns\ ~ stop("STOP threshold exceeded.")
## NOTIFY failure threshold of 0.3 of all test units.
## \fns\ ~ log4r_step(x)
## ----
Finally, we will apply this object to every validation function call in the expression (changed slightly so that more test units fail).
small_table %>%
col_is_posix(date_time, actions = al) %>%
col_vals_in_set(f, set = c("low", "mid"), actions = al) %>%
col_vals_lt(a, value = 7, actions = al) %>%
col_vals_regex(b, regex = "^[0-9]-[a-w]{3}-[2-9]{3}$", actions = al) %>%
col_vals_between(d, left = 0, right = 4000, actions = al)
## Warning in rlang::eval_tidy(.): WARN threshold exceeded.
## Error in rlang::eval_tidy(.): STOP threshold exceeded.
In addition to an error and a warning, the log4r_step() function used for the NOTIFY condition generates, in this case, a new "pb_log_file" text file for logs. We can examine it with readLines(); it has a single entry that relates to Step 1 (the col_vals_in_set() step):
readLines("pb_log_file")
## FATAL [2020-11-09 00:23:48] Step 1 exceeded the NOTIFY failure threshold (f_failed = 0.46154) ['col_vals_in_set']
The log4r_step() function offered by pointblank is shown in examples and explained in more detail in the VALID-I: Data Quality Reporting Workflow article.