It can be useful to set up all the data sources you need and just draw from
them when necessary. This upfront configuration with tbl_store()
lets us
define the methods for obtaining tabular data from mixed sources (e.g.,
database tables, tables generated from flat files, etc.) and provide
identifiers for these data preparation procedures.
What results from this work is a convenient way to materialize tables with
tbl_get()
. We can also get any table-prep formula from the table store
with tbl_source()
. The content of a table-prep formulas can involve reading
a table from a location, or, it can involve data transformation. One can
imagine scenarios where we might (1) procure several mutated variations of
the same source table, (2) generate a table using disparate data sources, or
(3) filter the rows of a database table according to the system time. Another
nice aspect of organizing table-prep formulas in a single object is supplying
it to the tbl
argument of create_agent()
or create_informant()
via $
notation (e.g, create_agent(tbl = <tbl_store>$<name>)
) or with
tbl_source()
(e.g.,
create_agent(tbl = ~ tbl_source("<name>", <tbl_store>))
).
Arguments
- ...
Expressions that contain table-prep formulas and table names for data retrieval. Two-sided formulas (e.g,
<LHS> ~ <RHS>
) are to be used, where the left-hand side is an identifier and the right-hand contains a statement that obtains a table (i.e., the table-prep formula). If the LHS is omitted then an identifier will be generated for you.- .list
Allows for the use of a list as an input alternative to
...
.- .init
We can optionally provide an initialization statement (in a one-sided formula) that should be executed whenever any of tables in the table store are obtained. This is useful, for instance, for including a
library()
call that can be executed before any table-prep formulas in...
.
YAML
A pointblank table store can be written to YAML with yaml_write()
and
the resulting YAML can be used in several ways. The ideal scenario is to have
pointblank agents and informants also in YAML form. This way the agent and
informant can refer to the table store YAML (via tbl_source()
), and, the
processing of both agents and informants can be performed with
yaml_agent_interrogate()
and yaml_informant_incorporate()
. With the
following R code, a table store with two table-prep formulas is generated and
written to YAML (if no filename is given then the YAML is written to
"tbl_store.yml"
).
R statement for generating the "tbl_store.yml"
file:
tbl_store(
tbl_duckdb ~ db_tbl(small_table, dbname = ":memory:", dbtype = "duckdb"),
sml_table_high ~ small_table %>% dplyr::filter(f == "high"),
.init = ~ library(tidyverse)
) %>%
yaml_write()
YAML representation ("tbl_store.yml"
):
type: tbl_store
tbls:
tbl_duckdb: ~ db_tbl(small_table, dbname = ":memory:", dbtype = "duckdb")
sml_table_high: ~ small_table %>% dplyr::filter(f == "high")
init: ~library(tidyverse)
This is useful when you want to get fresh pulls of prepared data from a
source materialized in an R session (with the tbl_get()
function. For
example, the sml_table_high
table can be obtained by using
tbl_get("sml_table_high", "tbl_store.yml")
. To get an agent to check this
prepared data periodically, then the following example with tbl_source()
will be useful:
R code to generate agent that checks sml_table_high
and writing the agent
to YAML:
create_agent(
tbl = ~ tbl_source("sml_table_high", "tbl_store.yml"),
label = "An example that uses a table store.",
actions = action_levels(warn_at = 0.10)
) %>%
col_exists(c(date, date_time)) %>%
write_yaml()
The YAML representation ("agent-sml_table_high.yml"
):
tbl: ~ tbl_source("sml_table_high", "tbl_store.yml")
tbl_name: sml_table_high
label: An example that uses a table store.
actions:
warn_fraction: 0.1
locale: en
steps:
- col_exists:
columns: c(date, date_time)
Now, whenever the sml_table_high
table needs to be validated, it can be
done with yaml_agent_interrogate()
(e.g.,
yaml_agent_interrogate("agent-sml_table_high.yml")
).
Examples
Creating an in-memory table store and adding table-prep formulas
The table store provides a way to get the tables we need fairly easily. Think of an identifier for the table you'd like and then provide the code necessary to obtain that table. Then repeat as many times as you like!
Here we'll define two tables that can be materialized later: tbl_duckdb
(an
in-memory DuckDB database table with pointblank's small_table
dataset)
and sml_table_high
(a filtered version of tbl_duckdb
):
store_1 <-
tbl_store(
tbl_duckdb ~
db_tbl(
pointblank::small_table,
dbname = ":memory:",
dbtype = "duckdb"
),
sml_table_high ~
db_tbl(
pointblank::small_table,
dbname = ":memory:",
dbtype = "duckdb"
) %>%
dplyr::filter(f == "high")
)
We can see what's in the table store store_1
by printing it out:
store_1
## -- The `table_store` table-prep formulas
## 1 tbl_duckdb // ~ db_tbl(pointblank::small_table, dbname = ":memory:",
## dbtype = "duckdb")
## 2 sml_table_high // ~ db_tbl(pointblank::small_table, dbname = ":memory:",
## dbtype = "duckdb") %>% dplyr::filter(f == "high")
## ----
It's good to check that the tables can be obtained without error. We can do
this with the tbl_get()
function. With that function, we need to supply the
given name of the table-prep formula (in quotes) and the table store object.
tbl_get(tbl = "tbl_duckdb", store = store_1)
## # Source: table<pointblank::small_table> [?? x 8]
## # Database: duckdb_connection
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
## 2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
## 3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
## 4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 NA 3892. FALSE mid
## 5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 284. TRUE low
## 6 2016-01-11 06:15:00 2016-01-11 4 2-dhe-923 4 3291. TRUE mid
## 7 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
## 8 2016-01-17 11:27:00 2016-01-17 4 5-boe-639 2 1036. FALSE low
## 9 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 10 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## # … with more rows
tbl_get(tbl = "sml_table_high", store = store_1)
## # Source: lazy query [?? x 8]
## # Database: duckdb_connection
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
## 2 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
## 3 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
## 4 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 5 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 6 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 NA 2230. TRUE high
We can shorten the tbl_store()
statement with some syntax that
pointblank provides. The sml_table_high
table-prep is simply a
transformation of tbl_duckdb
, so, we can use {{ tbl_duckdb }}
in place of
the repeated statement. Additionally, we can provide a library()
call to
the .init
argument of tbl_store()
so that dplyr is available (thus
allowing us to use filter(...)
instead of dplyr::filter(...)
). Here is
the revised tbl_store()
call:
store_2 <-
tbl_store(
tbl_duckdb ~
db_tbl(
pointblank::small_table,
dbname = ":memory:",
dbtype = "duckdb"
),
sml_table_high ~
{{ tbl_duckdb }} %>%
filter(f == "high"),
.init = ~ library(tidyverse)
)
Printing the table store store_2
now shows that we used an .init
statement:
store_2
## -- The `table_store` table-prep formulas
## 1 tbl_duckdb // ~ db_tbl(pointblank::small_table, dbname = ":memory:",
## dbtype = "duckdb")
## 2 sml_table_high // ~ {{tbl_duckdb}} %>% filter(f == "high")
## ----
## INIT // ~library(tidyverse)
## ----
Checking again with tbl_get()
should provide the same tables as before:
tbl_get(tbl = "tbl_duckdb", store = store_2)
## # Source: table<pointblank::small_table> [?? x 8]
## # Database: duckdb_connection
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
## 2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
## 3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
## 4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 NA 3892. FALSE mid
## 5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 284. TRUE low
## 6 2016-01-11 06:15:00 2016-01-11 4 2-dhe-923 4 3291. TRUE mid
## 7 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
## 8 2016-01-17 11:27:00 2016-01-17 4 5-boe-639 2 1036. FALSE low
## 9 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 10 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## # … with more rows
tbl_get(tbl = "sml_table_high", store = store_2)
## # Source: lazy query [?? x 8]
## # Database: duckdb_connection
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
## 2 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
## 3 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
## 4 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 5 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 6 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 NA 2230. TRUE high
Using a table store in a data validation workflow
Define a tbl_store
object by adding table-prep formulas inside the
tbl_store()
call.
store_3 <-
tbl_store(
small_table_duck ~ db_tbl(
table = small_table,
dbname = ":memory:",
dbtype = "duckdb"
),
~ db_tbl(
table = "rna",
dbname = "pfmegrnargs",
dbtype = "postgres",
host = "hh-pgsql-public.ebi.ac.uk",
port = 5432,
user = I("reader"),
password = I("NWDMCE5xdipIjRrp")
),
all_revenue ~ db_tbl(
table = file_tbl(
file = from_github(
file = "sj_all_revenue_large.rds",
repo = "rich-iannone/intendo",
subdir = "data-large"
)
),
dbname = ":memory:",
dbtype = "duckdb"
),
sml_table ~ pointblank::small_table
)
Let's get a summary of what's in the table store store_3
through printing:
store_3
## -- The `table_store` table-prep formulas
## 1 small_table_duck // ~ db_tbl(table = small_table, dbname = ":memory:",
## dbtype = "duckdb")
## 2 rna // ~db_tbl(table = "rna", dbname = "pfmegrnargs", dbtype =
## "postgres", host = "hh-pgsql-public.ebi.ac.uk", port = 5432, user =
## I("reader"), password = I("NWDMCE5xdipIjRrp"))
## 3 all_revenue // ~ db_tbl(table = file_tbl(file = from_github(file =
## "sj_all_revenue_large.rds", repo = "rich-iannone/intendo", subdir =
## "data-large")), dbname = ":memory:", dbtype = "duckdb")
## 4 sml_table // ~ pointblank::small_table
## ----
Once this object is available, you can check that the table of interest is
produced to your specification with the tbl_get()
function.
tbl_get(
tbl = "small_table_duck",
store = store_3
)
## # Source: table<small_table> [?? x 8]
## # Database: duckdb_connection
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
## 2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
## 3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
## 4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 NA 3892. FALSE mid
## 5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 284. TRUE low
## 6 2016-01-11 06:15:00 2016-01-11 4 2-dhe-923 4 3291. TRUE mid
## 7 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
## 8 2016-01-17 11:27:00 2016-01-17 4 5-boe-639 2 1036. FALSE low
## 9 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 10 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## # … with more rows
Another way to get the same table materialized is by using $
to get the
entry of choice for tbl_get()
.
## # Source: table<small_table> [?? x 8]
## # Database: duckdb_connection
## date_time date a b c d e f
## <dttm> <date> <int> <chr> <dbl> <dbl> <lgl> <chr>
## 1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423. TRUE high
## 2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 10000. TRUE low
## 3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343. TRUE high
## 4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 NA 3892. FALSE mid
## 5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 284. TRUE low
## 6 2016-01-11 06:15:00 2016-01-11 4 2-dhe-923 4 3291. TRUE mid
## 7 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843. TRUE high
## 8 2016-01-17 11:27:00 2016-01-17 4 5-boe-639 2 1036. FALSE low
## 9 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## 10 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 838. FALSE high
## # … with more rows
Creating an agent is easy when all table-prep formulas are encapsulated in a
tbl_store
object. Use $
notation to pass the appropriate procedure for
reading a table to the tbl
argument.
agent_1 <-
create_agent(
tbl = store_3$small_table_duck
)
There are other ways to use the table store to assign a target table to an
agent, like using the tbl_source()
function (which extracts the table-prep
formula from the table store).
agent_2 <-
create_agent(
tbl = ~ tbl_source(
tbl = "small_table_duck",
store = store_3
)
)
Writing a table store to a YAML file
The table store can be moved to YAML with yaml_write
and the tbl_source()
call could then refer to that on-disk table store. Let's do that YAML
conversion.
yaml_write(store_3)
The above writes the tbl_store.yml
file (by not providing a filename
this
default filename is chosen).
It can be convenient to read table-prep formulas from a YAML file that's a
table store. To achieve this, we can modify the tbl_source()
statement in
the create_agent()
call so that store
refers to the on-disk YAML file.
agent_3 <-
create_agent(
tbl = ~ tbl_source(
tbl = "small_table_duck",
store = "tbl_store.yml"
)
)
See also
Other Planning and Prep:
action_levels()
,
create_agent()
,
create_informant()
,
db_tbl()
,
draft_validation()
,
file_tbl()
,
scan_data()
,
tbl_get()
,
tbl_source()
,
validate_rmd()