Overview of Validation Workflows
Source:vignettes/validation_workflows.Rmd
There are six validation workflows in pointblank:
- VALID-I: Data Quality Reporting
- VALID-II: Pipeline Data Validation
- VALID-III: Expectations in Unit Tests
- VALID-IV: Data Tests for Conditionals
- VALID-V: Table Scan
- VALID-VI: R Markdown Document Validation
The first workflow (VALID-I) is used for comprehensive reporting of the data quality of a target table. It typically uses as many validation functions as the user wishes to write to get an adequate level of validation coverage for that table. This is an agent-based workflow that uses: (1) the create_agent() function, (2) one or more validation functions, and (3) the interrogate() function. The agent generated by create_agent() is given a target table and accepts validation functions (e.g., col_vals_gt(), col_is_numeric(), rows_distinct()), building up a validation plan. It’s not until interrogate() is called that the validations are evaluated and the resulting intel is stored. An agent object can be printed both before and after interrogation, yielding a Validation Report. A separate set of post-interrogation functions (e.g., get_data_extracts(), get_sundered_data(), and more) can be called on the agent to collect the intel or variations/extracts of the target table.
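As a minimal sketch of this agent-based workflow, the following builds a validation plan, interrogates it, and then collects some of the intel. It assumes the small_table dataset that ships with pointblank; the column choices, the comparison value of 100, and the label text are illustrative assumptions, not taken from the text above.

```r
library(pointblank)

# Build a validation plan on the example table bundled with pointblank
# (the columns and the value of 100 are illustrative choices)
agent <-
  create_agent(
    tbl = small_table,
    label = "VALID-I sketch"
  ) %>%
  col_vals_gt(columns = vars(d), value = 100) %>%
  col_is_numeric(columns = vars(d)) %>%
  rows_distinct()

# Nothing has been evaluated so far; interrogate() runs every step
agent <- interrogate(agent)

# The Validation Report as an object (printing the agent shows the same report)
report <- get_agent_report(agent)

# Post-interrogation functions collect the intel
passed <- all_passed(agent)                          # one logical value
extracts <- get_data_extracts(agent)                 # failing rows, by step
passing <- get_sundered_data(agent, type = "pass")   # rows that passed
```

Keeping the plan and the interrogation as separate steps is what allows the same agent to be printed before interrogation (showing only the plan) and after it (showing results).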
The second workflow (VALID-II) is meant for repeated data-quality checks in a data-transformation pipeline that involves tabular data. The principal mode of operation there is to use validation functions to either warn the user of unforeseen data integrity problems or stop the pipeline dead so that dependent, downstream processes (that would use the data to some extent) are never initiated. Both the Data Quality Reporting and the Pipeline Data Validation workflows use a common set of validation functions, but the latter doesn’t use an agent, and the validations eagerly interrogate the data at each invocation.
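A pipeline of this kind might look like the sketch below, where validation functions are applied directly to the table rather than to an agent. The small_table dataset, the chosen columns, and the use of the stop_on_fail() and warn_on_fail() helpers with their default thresholds are assumptions made for illustration.

```r
library(pointblank)

# Validation functions applied directly to the data (no agent) interrogate
# immediately and pass the table along when the check succeeds
checked <-
  small_table %>%
  # stop the pipeline if any value in `d` is not greater than 100
  col_vals_gt(columns = vars(d), value = 100, actions = stop_on_fail()) %>%
  # warn, rather than stop, if any value in `a` is missing
  col_vals_not_null(columns = vars(a), actions = warn_on_fail())

# `checked` holds the table, ready for downstream transformation steps
```

Because each function returns the data it received, the checks can be placed between ordinary transformation steps without changing the shape of the pipeline.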
Data can be tested in the same way that function output is tested by using the Expectations in Unit Tests workflow (VALID-III). This uses a suite of expect_*() functions that are analogous to the validation functions but with simplified interfaces. These functions are used directly on data (no agent) and they serve as tests in the testthat testing framework. Unit testing on data is important if your package functions produce or transform data, and the testthat-compatible functions offered by pointblank make testing data a little bit easier and a lot more precise.
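A data expectation inside a testthat test could be sketched as follows. The small_table dataset stands in for the output of a package function, and the test description, columns, and comparison value are illustrative assumptions.

```r
library(testthat)
library(pointblank)

# In a package, `small_table` would be replaced by the data that a
# function under test produces or transforms
test_that("column d holds numeric values above 100", {
  expect_col_is_numeric(small_table, columns = vars(d))
  expect_col_vals_gt(small_table, columns = vars(d), value = 100)
})
```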
Evaluation of data can produce logical output (TRUE/FALSE) through use of the Data Tests for Conditionals workflow (VALID-IV). This uses the analogous suite of test_*() functions. Because these functions always return a logical value, this workflow suits programming contexts where the result of data validation determines which code path is taken. As in the VALID-III workflow, the function signatures are simplified. In fact, the arguments for the complementary expect_*() and test_*() functions are the same.
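The branching described here can be sketched as below. The small_table dataset, the column, the comparison value, and the fallback of filtering out failing rows are all assumptions chosen for illustration.

```r
library(pointblank)

# A test_*() call returns a single TRUE or FALSE, so it can drive a branch
d_ok <- test_col_vals_gt(small_table, columns = vars(d), value = 100)

if (d_ok) {
  # Every value passed: use the table as it is
  tbl <- small_table
} else {
  # Hypothetical fallback: keep only the rows that satisfy the condition
  tbl <- dplyr::filter(small_table, d > 100)
}
```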
A target table can be scanned and described with the Table Scan workflow (VALID-V). This is useful for getting table dimensions, important statistical values by column, interactions by column, and a view of missingness in easy-to-parse HTML output. Components of this HTML report can be reordered or omitted as needed.
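A table scan with some components omitted might be requested as in the sketch below. It assumes that the scan is produced by pointblank's scan_data() function, that its sections argument takes one letter per report component, and that any additional packages the function needs for plotting are installed; the small_table dataset is used only as a convenient example.

```r
library(pointblank)

# Each letter in `sections` selects one report component:
# O = overview, V = variables, I = interactions,
# C = correlations, M = missing values, S = sample
# Here the interactions and correlations components are left out
scan <- scan_data(small_table, sections = "OVMS")

# Printing `scan` in an interactive session displays the HTML report
```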
The R Markdown Document Validation workflow (VALID-VI) can contain a combination of workflow elements. The ideal workflow to use here is that from VALID-II (Pipeline Data Validation), since using it in chunks that have the option validate = TRUE set results in a special display of validation results in the rendered HTML document.
The VALID-I workflow can also be used since the agent report prints nicely as an HTML table. The scan_data() function of the VALID-V workflow likewise produces useful output. Finally, the test_*() functions of the VALID-IV workflow can be used should logical values be needed within the code.
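The contents of two chunks in such an R Markdown document could be sketched as follows, with the chunk boundaries shown as comments. This assumes that validate_rmd() is called in an early chunk to enable the feature and that the chunk option is spelled validate = TRUE, as in the validate_rmd() documentation; the small_table dataset and the particular checks are illustrative.

```r
# --- contents of the setup chunk of the .Rmd file ---
library(pointblank)
validate_rmd()   # enables validation testing in chunks marked validate = TRUE

# --- contents of a later chunk whose header sets validate = TRUE ---
# Pipeline-style (VALID-II) checks applied directly to the data
small_table %>%
  col_vals_gt(columns = vars(d), value = 100) %>%
  col_is_numeric(columns = vars(d))
```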