
The VALID-III: Expectations in Unit Tests workflow is all about checking your data alongside your usual testthat tests. The functions used for this workflow all begin with the expect_ prefix and are based on the set of validation functions used in the VALID-I and VALID-II workflows. Here’s the complete list of functions with a phrase for each function’s expectation:

Now that we know that the functions are similar in intent but different in name, let’s learn how to use these functions effectively.

Using expect_*() Functions in the testthat Way

The testthat package has a collection of functions that begin with expect_. It’s no coincidence that pointblank adopts the same naming convention for the VALID-III workflow. The idea is to use these functions interchangeably with those from testthat in the standard testthat workflow (in a test-<name>.R file, inside the tests/testthat folder). The big difference here is that instead of testing function outputs, we are testing data tables. However, tables may be returned from function calls, and the expect_*() functions offered by pointblank might offer more flexibility for testing that data. For instance, expect_col_vals_between() lets us write an expectation with fine control over the boundary values (and whether the bounds are inclusive) and over whether NA values should be ignored, and we can even set a failure threshold if that makes sense for the expectation.

Here’s an example of how to generate some tests on data with testthat and also with pointblank. For the small_table dataset, let’s write expectations that show that non-NA values in column c are between 2 and 9.

library(pointblank) # provides the `small_table` dataset
# Non-NA values in column `c` should fall between 2 and 9
testthat::expect_true(all(na.omit(small_table$c) >= 2))
testthat::expect_true(all(na.omit(small_table$c) <= 9))

There is no testthat function that tests whether values fall between two bounds. The original strategy was to use testthat::expect_gte() and testthat::expect_lte() with small_table$c as the object in both; however, that doesn’t work because the comparisons yield a logical vector of length greater than 1. There is also no allowance for skipping NA values. The best I could do was the above.

The pointblank version of this task makes for a more succinct and understandable expectation expression:

expect_col_vals_between(small_table, c, 2, 9, na_pass = TRUE)

The arguments of expect_col_vals_between() give us everything we need to check tabular data without having to do subsetting or other transformations first. There are a few added benefits. Should the data come from a source other than a local data frame, the SQL expressions are handled internally, and they have been tested extensively across all the supported database types as well as in Spark DataFrames.
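
To sketch that flexibility a bit further (the particular bounds here are only illustrative), the inclusive argument of expect_col_vals_between() controls, for each bound, whether the boundary value itself counts as passing:

# Left bound inclusive, right bound exclusive; NA values in `c` still pass
expect_col_vals_between(
  small_table, c,
  left = 2, right = 10,
  inclusive = c(TRUE, FALSE),
  na_pass = TRUE
)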

These expect_*() Functions Are Simpler Than Their Counterparts

All of the expect_*() functions have the same leading arguments as their validation function counterparts, but they omit the following arguments at the end of their signatures:

  • actions
  • step_id
  • label
  • brief
  • active

While we lose the actions argument, we get in its place the threshold argument. This is a simple failure threshold value for use with the expectation (expect_*) and the test (test_*) functions. By default, threshold is set to 1 which means that any single test unit failing will result in an overall failure (i.e., the expectation will fail).

As with the thresholds set via the action_levels() function (or the shortcut functions warn_on_fail() and stop_on_fail()), whole numbers beyond 1 act as an absolute failure threshold, where failing test units up to that value still result in a succeeding expectation. Likewise, fractional values (between 0 and 1) act as a proportional failure threshold, where 0.25 means that 25% of failing test units results in a failed expectation.
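
As a quick sketch (the threshold values here are chosen only for illustration), both forms might look like this:

# Absolute threshold: a single failing test unit no longer fails the expectation
expect_col_vals_between(small_table, c, 2, 9, na_pass = TRUE, threshold = 2)

# Proportional threshold: the expectation fails once 25% of test units fail
expect_col_vals_between(small_table, c, 2, 9, na_pass = TRUE, threshold = 0.25)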

The preconditions argument can be used to transform the input data before evaluation of the expectation. This is useful in some cases where you might need to summarize the input data table, mutate columns, perform some filtering, or even perform table joins beforehand.
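
As a minimal sketch (the filtering condition and the value being checked are just illustrative), preconditions can take a one-sided formula that transforms the table before the check runs; this one assumes the dplyr package is available:

# Keep only rows where `f` is "high" before checking that `d` is greater than 100
expect_col_vals_gt(
  small_table, d, 100,
  preconditions = ~ . %>% dplyr::filter(f == "high")
)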