Overview of Validation Workflows

There are six validation workflows in pointblank:

The first workflow, Data Quality Reporting (VALID-I), is used for comprehensive reporting of the data quality of a target table. It typically uses as many validation functions as the user needs to reach adequate validation coverage for that table. This is an agent-based workflow that uses: (1) the create_agent() function, (2) one or more validation functions, and (3) the interrogate() function. The agent generated by create_agent() is given a target table and accepts validation functions (e.g., col_vals_gt(), col_is_numeric(), rows_distinct(), etc.), building up a validation plan. The validations are not evaluated until interrogate() is called, at which point the resulting intel is stored in the agent. An agent object can be printed both before and after interrogation, yielding a Validation Report. A separate set of post-interrogation functions (e.g., get_data_extracts(), get_sundered_data(), and more) can be called on the agent to collect the intel or to obtain variations or extracts of the target table.
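The three steps above can be sketched as follows. This is a minimal illustration rather than a recommended validation plan: it assumes the small_table example dataset that ships with pointblank, and the label text and the particular checks are chosen only for demonstration.

```r
library(pointblank)

# Build a validation plan against `small_table` (an example dataset bundled
# with pointblank). Nothing is evaluated until interrogate() is called.
agent <-
  create_agent(
    tbl = small_table,
    tbl_name = "small_table",
    label = "VALID-I sketch"  # illustrative label, not from the article
  ) %>%
  col_vals_gt(columns = vars(d), value = 100) %>%
  col_is_numeric(columns = vars(d)) %>%
  rows_distinct() %>%
  interrogate()

# Post-interrogation functions collect the stored intel.
passed_everything <- all_passed(agent)       # a single TRUE/FALSE
failing_rows <- get_data_extracts(agent)     # failing rows, listed by step
passing_data <- get_sundered_data(agent, type = "pass")
```

Printing the agent (for example, by evaluating agent at the console) would display the Validation Report.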

The second workflow, Pipeline Data Validation (VALID-II), is meant for repeated data-quality checks in a data-transformation pipeline that involves tabular data. The principal mode of operation there is to use validation functions either to warn the user of unforeseen data integrity problems or to stop the pipeline dead so that dependent, downstream processes (that would use the data to some extent) are never initiated. Both the Data Quality Reporting and the Pipeline Data Validation workflows use a common set of validation functions, but the latter doesn’t use an agent, and the validations eagerly interrogate the data at each invocation.
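A small sketch of the warn-or-stop behavior, again assuming the bundled small_table dataset. The choice of columns and values is illustrative, and the tryCatch() wrapper is only there so that the example can show a halted pipeline without ending the session.

```r
library(pointblank)

# Passing checks hand the table on unchanged, so the pipeline continues.
checked <-
  small_table %>%
  col_vals_gt(columns = vars(d), value = 100) %>%
  col_is_numeric(columns = vars(d))

# warn_on_fail() raises a warning when units fail; the data still flows on.
warned <-
  small_table %>%
  col_vals_not_null(columns = vars(c), actions = warn_on_fail())

# stop_on_fail() raises an error, so nothing downstream would run.
halted <- tryCatch(
  {
    small_table %>%
      col_vals_lt(
        columns = vars(a),
        value = 5,
        actions = stop_on_fail(stop_at = 1)
      )
    FALSE
  },
  error = function(e) TRUE
)
```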

Data can be tested in the same way that function output is tested by using the Expectations in Unit Tests workflow (VALID-III). This uses a suite of expect_*() functions that are analogous to the validation functions but have simplified interfaces. These functions are used directly on data (no agent) and they serve as tests in the testthat testing framework. Unit testing on data is important if your package functions produce or transform data, and the testthat-compatible functions offered by pointblank make testing data a little bit easier and a lot more precise.
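As a rough sketch, a testthat test using these expectations might look like the following. The test description and the specific expectations are hypothetical, and small_table stands in for whatever data a package function would produce.

```r
library(testthat)
library(pointblank)

test_that("small_table meets basic expectations", {
  # Every value in `d` should exceed 100
  expect_col_vals_gt(small_table, columns = vars(d), value = 100)

  # The named columns should be present
  expect_col_exists(small_table, columns = vars(a, b))

  # A fractional threshold tolerates some failing units: this expectation
  # fails only if half or more of the values in `c` are missing
  expect_col_vals_not_null(small_table, columns = vars(c), threshold = 0.5)
})
```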

Evaluation of data can produce logical output (TRUE/FALSE) through use of the Data Tests for Conditionals workflow (VALID-IV). This uses the analogous suite of test_*() functions. Because these functions always return a single logical value, the workflow is suitable in programming contexts where the result of a data validation should determine which code path is taken. As in the VALID-III workflow, the function signatures are simplified. In fact, the arguments for the complementary expect_*() and test_*() functions are the same.
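A brief sketch of branching on a test_*() result, assuming the bundled small_table dataset. The variable names and messages are hypothetical.

```r
library(pointblank)

# Each test_*() call returns a single logical value.
d_ok <- test_col_vals_gt(small_table, columns = vars(d), value = 100)
a_ok <- test_col_vals_lt(small_table, columns = vars(a), value = 5)

# Use the result to choose a code path.
if (d_ok && a_ok) {
  message("Both checks passed; continuing with the full table.")
} else {
  message("At least one check failed; taking the fallback path.")
}
```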

A target table can be scanned and described with the Table Scan workflow (VALID-V). This is useful for getting table dimensions, important statistical values by column, interactions by column, and a view of missingness in easy-to-parse HTML output. Components of this HTML report can be reordered or omitted as needed.
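A minimal sketch of a table scan, assuming that scan_data() is the function behind this workflow and that its sections argument takes a string of section letters (here O for overview, V for variables, M for missing values, and S for a sample of rows). The output filename is hypothetical, and some sections may require additional packages to be installed.

```r
library(pointblank)

# Keep only four sections, in this order; the omitted letters drop the
# corresponding report sections.
report <- scan_data(tbl = small_table, sections = "OVMS")

# Optionally write the HTML report to disk (filename is illustrative):
# export_report(report, filename = "small_table-scan.html")
```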

The R Markdown Document Validation workflow (VALID-VI) can combine elements of the other workflows. The ideal workflow to use here is VALID-II (Pipeline Data Validation): when its validation functions are used in chunks that have the option validate = TRUE set, the validation results get a special display in the rendered HTML document. The VALID-I workflow can also be used since the agent report prints nicely as an HTML table. The scan_data() function of the VALID-V workflow likewise produces useful output. Finally, the test_*() functions of the VALID-IV workflow can be used should logical values be needed within the code.
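An R Markdown fragment along these lines might look like the sketch below. It assumes that validate_rmd() is called in a setup chunk to enable the feature and that the chunk option is spelled validate = TRUE; the chunk contents are illustrative.

````markdown
```{r setup, include = FALSE}
library(pointblank)
validate_rmd()
```

```{r validate = TRUE}
# VALID-II style checks; results are summarized in the rendered document
small_table %>%
  col_vals_gt(columns = vars(d), value = 100) %>%
  col_vals_not_null(columns = vars(a))
```
````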