2 dplyr Basics

2.1 Create a table variable

Basics to how to point a variable in R to a table or view inside the database

  1. Load the dplyr, DBI and dbplyr libraries
library(dplyr)
library(dbplyr)
library(DBI)
  1. (Optional) Open a connection to the database if it’s currently closed
con <- dbConnect(odbc::odbc(), "Postgres Dev")
  1. Use the tbl() and in_schema() functions to create a reference to a table
tbl(con, in_schema("datawarehouse", "airport"))
## # Source:   table<datawarehouse.airport> [?? x 7]
## # Database: postgres [rstudio_dev@localhost:/postgres]
##    airport airportname               city        state country   lat   long
##    <chr>   <chr>                     <chr>       <chr> <chr>   <dbl>  <dbl>
##  1 ABE     Lehigh Valley Internatio… Allentown   PA    USA      40.7  -75.4
##  2 ABI     Abilene Regional          Abilene     TX    USA      32.4  -99.7
##  3 ABQ     Albuquerque International Albuquerque NM    USA      35.0 -107. 
##  4 ABY     Southwest Georgia Region… Albany      GA    USA      31.5  -84.2
##  5 ACK     Nantucket Memorial        Nantucket   MA    USA      41.3  -70.1
##  6 ACT     Waco Regional             Waco        TX    USA      31.6  -97.2
##  7 ACV     Arcata                    Arcata/Eur… CA    USA      41.0 -124. 
##  8 ACY     Atlantic City Internatio… Atlantic C… NJ    USA      39.5  -74.6
##  9 ADK     Adak                      Adak        AK    USA      51.9 -177. 
## 10 ADQ     Kodiak                    Kodiak      AK    USA      57.7 -152. 
## # … with more rows
  1. Load the reference, not the table data, into a variable
airports <- tbl(con, in_schema("datawarehouse", "airport"))
  1. Call the variable to see preview the data in the table
airports
## # Source:   table<datawarehouse.airport> [?? x 7]
## # Database: postgres [rstudio_dev@localhost:/postgres]
##    airport airportname               city        state country   lat   long
##    <chr>   <chr>                     <chr>       <chr> <chr>   <dbl>  <dbl>
##  1 ABE     Lehigh Valley Internatio… Allentown   PA    USA      40.7  -75.4
##  2 ABI     Abilene Regional          Abilene     TX    USA      32.4  -99.7
##  3 ABQ     Albuquerque International Albuquerque NM    USA      35.0 -107. 
##  4 ABY     Southwest Georgia Region… Albany      GA    USA      31.5  -84.2
##  5 ACK     Nantucket Memorial        Nantucket   MA    USA      41.3  -70.1
##  6 ACT     Waco Regional             Waco        TX    USA      31.6  -97.2
##  7 ACV     Arcata                    Arcata/Eur… CA    USA      41.0 -124. 
##  8 ACY     Atlantic City Internatio… Atlantic C… NJ    USA      39.5  -74.6
##  9 ADK     Adak                      Adak        AK    USA      51.9 -177. 
## 10 ADQ     Kodiak                    Kodiak      AK    USA      57.7 -152. 
## # … with more rows
  1. Set up the pointers to the other of the tables
flights <- tbl(con, in_schema("datawarehouse", "vflight"))
carriers <- tbl(con, in_schema("datawarehouse", "carrier"))

2.2 Under the hood

Use show_query() to preview the SQL statement that will be sent to the database

  1. SQL statement that actually runs when we ran airports as a command
show_query(airports)
## <SQL>
## SELECT *
## FROM datawarehouse.airport
  1. Easily view the resulting query by adding show_query() in another piped command
airports %>%
  show_query()
## <SQL>
## SELECT *
## FROM datawarehouse.airport
  1. Insert head() in between the two statements to see how the SQL changes
airports %>%
  head() %>%
  show_query()
## <SQL>
## SELECT *
## FROM datawarehouse.airport
## LIMIT 6
  1. Use sql_render() and simulate_mssql() to see how the SQL statement changes from vendor to vendor
airports %>%
  head() %>%
  sql_render(con = simulate_mssql()) 
## <SQL> SELECT  TOP 6 *
## FROM datawarehouse.airport
  1. Use explain() to explore the query plan
airports %>% 
  head() %>% 
  explain()
## <SQL>
## SELECT *
## FROM datawarehouse.airport
## LIMIT 6
## 
## <PLAN>
## Limit  (cost=0.00..0.14 rows=6 width=56)
##   ->  Seq Scan on airport  (cost=0.00..7.05 rows=305 width=56)

2.3 Un-translated R commands

Review of how dbplyr handles R commands that have not been translated into a like-SQL command

  1. Preview how Sys.time() is translated
airports %>%
  mutate(today = Sys.time()) %>%
  show_query()
## <SQL>
## SELECT "airport", "airportname", "city", "state", "country", "lat", "long", SYS.TIME() AS "today"
## FROM datawarehouse.airport
  1. Use PostgreSQL’s native commands, in this case now()
airports %>%
  mutate(today = now()) %>%
  show_query()
## <SQL>
## SELECT "airport", "airportname", "city", "state", "country", "lat", "long", NOW() AS "today"
## FROM datawarehouse.airport
  1. Run the dplyr code to confirm it works
airports %>%
  mutate(today = now()) %>%
  select(today) %>%
  head()
## # Source:   lazy query [?? x 1]
## # Database: postgres [rstudio_dev@localhost:/postgres]
##   today              
##   <dttm>             
## 1 2019-01-11 03:43:50
## 2 2019-01-11 03:43:50
## 3 2019-01-11 03:43:50
## 4 2019-01-11 03:43:50
## 5 2019-01-11 03:43:50
## 6 2019-01-11 03:43:50

2.4 Using bang-bang

Intro on passing unevaluated code to a dplyr verb

  1. Preview how Sys.time() is translated
airports %>%
  mutate(today = Sys.time()) %>%
  show_query()
## <SQL>
## SELECT "airport", "airportname", "city", "state", "country", "lat", "long", SYS.TIME() AS "today"
## FROM datawarehouse.airport
  1. Preview how Sys.time() is translated when prefixing !!
airports %>%
  mutate(today = !!Sys.time()) %>%
  show_query()
## <SQL>
## SELECT "airport", "airportname", "city", "state", "country", "lat", "long", '2019-01-11T03:43:50Z' AS "today"
## FROM datawarehouse.airport
  1. Preview how Sys.time() is translated when prefixing !!
airports %>%
  mutate(today = !!Sys.time()) %>%
  select(today) %>%
  head()
## # Source:   lazy query [?? x 1]
## # Database: postgres [rstudio_dev@localhost:/postgres]
##   today               
##   <chr>               
## 1 2019-01-11T03:43:50Z
## 2 2019-01-11T03:43:50Z
## 3 2019-01-11T03:43:50Z
## 4 2019-01-11T03:43:50Z
## 5 2019-01-11T03:43:50Z
## 6 2019-01-11T03:43:50Z

2.5 knitr SQL engine

  1. Copy the result of the latest show_query() exercise
airports %>%
  mutate(today = !!Sys.time()) %>%
  show_query()
## <SQL>
## SELECT "airport", "airportname", "city", "state", "country", "lat", "long", '2019-01-11T03:43:50Z' AS "today"
## FROM datawarehouse.airport
  1. Paste the result in this SQL chunk
SELECT "airport", "airportname", "city", "state", "country", "lat", "long", '2018-01-26T14:50:10Z' AS "today"
FROM datawarehouse.airport
Table 2.1: Displaying records 1 - 10
airport airportname city state country lat long today
ABE Lehigh Valley International Allentown PA USA 40.65236 -75.44040 2018-01-26T14:50:10Z
ABI Abilene Regional Abilene TX USA 32.41132 -99.68190 2018-01-26T14:50:10Z
ABQ Albuquerque International Albuquerque NM USA 35.04022 -106.60919 2018-01-26T14:50:10Z
ABY Southwest Georgia Regional Albany GA USA 31.53552 -84.19447 2018-01-26T14:50:10Z
ACK Nantucket Memorial Nantucket MA USA 41.25305 -70.06018 2018-01-26T14:50:10Z
ACT Waco Regional Waco TX USA 31.61129 -97.23052 2018-01-26T14:50:10Z
ACV Arcata Arcata/Eureka CA USA 40.97812 -124.10862 2018-01-26T14:50:10Z
ACY Atlantic City International Atlantic City NJ USA 39.45758 -74.57717 2018-01-26T14:50:10Z
ADK Adak Adak AK USA 51.87796 -176.64603 2018-01-26T14:50:10Z
ADQ Kodiak Kodiak AK USA 57.74997 -152.49386 2018-01-26T14:50:10Z

2.6 Basic aggregation

A couple of dplyr commands that run in-database

  1. How many records are in the airport table?
tbl(con, in_schema("datawarehouse", "airport"))  %>%
  tally()
## # Source:   lazy query [?? x 1]
## # Database: postgres [rstudio_dev@localhost:/postgres]
##   n              
##   <S3: integer64>
## 1 305
  1. What is the average character length of the airport codes? How many characters is the longest and the shortest airport name?
airports %>%
  summarise(
    avg_airport_length = mean(str_length(airport), na.rm = TRUE),
    max_airport_name = max(str_length(airportname), na.rm = TRUE),
    min_airport_name = min(str_length(airportname), na.rm = TRUE),
    total_records = n()
  )
## # Source:   lazy query [?? x 4]
## # Database: postgres [rstudio_dev@localhost:/postgres]
##   avg_airport_length max_airport_name min_airport_name total_records  
##                <dbl>            <int>            <int> <S3: integer64>
## 1                  3               40                3 305
  1. How many records are in the carrier table?
carriers %>%
  tally()
## # Source:   lazy query [?? x 1]
## # Database: postgres [rstudio_dev@localhost:/postgres]
##   n              
##   <S3: integer64>
## 1 20
  1. How many characters is the longest carriername?
carriers %>%
  summarise(x = max(str_length(carriername), na.rm = TRUE))
## # Source:   lazy query [?? x 1]
## # Database: postgres [rstudio_dev@localhost:/postgres]
##       x
##   <int>
## 1    83
  1. What is the SQL statement sent in exercise 4?
carriers %>%
  summarise(x = max(str_length(carriername), na.rm = TRUE)) %>%
  show_query()
## <SQL>
## SELECT MAX(LENGTH("carriername")) AS "x"
## FROM datawarehouse.carrier