2 dplyr
Basics
2.1 Create a table variable
Basics to how to point a variable in R to a table or view inside the database
- Load the
dplyr
,DBI
anddbplyr
libraries
library(dplyr)
library(dbplyr)
library(DBI)
- (Optional) Open a connection to the database if it’s currently closed
con <- dbConnect(odbc::odbc(), "Postgres Dev")
- Use the
tbl()
andin_schema()
functions to create a reference to a table
tbl(con, in_schema("datawarehouse", "airport"))
## # Source: table<datawarehouse.airport> [?? x 7]
## # Database: postgres [rstudio_dev@localhost:/postgres]
## airport airportname city state country lat long
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 ABE Lehigh Valley Internatio… Allentown PA USA 40.7 -75.4
## 2 ABI Abilene Regional Abilene TX USA 32.4 -99.7
## 3 ABQ Albuquerque International Albuquerque NM USA 35.0 -107.
## 4 ABY Southwest Georgia Region… Albany GA USA 31.5 -84.2
## 5 ACK Nantucket Memorial Nantucket MA USA 41.3 -70.1
## 6 ACT Waco Regional Waco TX USA 31.6 -97.2
## 7 ACV Arcata Arcata/Eur… CA USA 41.0 -124.
## 8 ACY Atlantic City Internatio… Atlantic C… NJ USA 39.5 -74.6
## 9 ADK Adak Adak AK USA 51.9 -177.
## 10 ADQ Kodiak Kodiak AK USA 57.7 -152.
## # … with more rows
- Load the reference, not the table data, into a variable
airports <- tbl(con, in_schema("datawarehouse", "airport"))
- Call the variable to see preview the data in the table
airports
## # Source: table<datawarehouse.airport> [?? x 7]
## # Database: postgres [rstudio_dev@localhost:/postgres]
## airport airportname city state country lat long
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 ABE Lehigh Valley Internatio… Allentown PA USA 40.7 -75.4
## 2 ABI Abilene Regional Abilene TX USA 32.4 -99.7
## 3 ABQ Albuquerque International Albuquerque NM USA 35.0 -107.
## 4 ABY Southwest Georgia Region… Albany GA USA 31.5 -84.2
## 5 ACK Nantucket Memorial Nantucket MA USA 41.3 -70.1
## 6 ACT Waco Regional Waco TX USA 31.6 -97.2
## 7 ACV Arcata Arcata/Eur… CA USA 41.0 -124.
## 8 ACY Atlantic City Internatio… Atlantic C… NJ USA 39.5 -74.6
## 9 ADK Adak Adak AK USA 51.9 -177.
## 10 ADQ Kodiak Kodiak AK USA 57.7 -152.
## # … with more rows
- Set up the pointers to the other of the tables
flights <- tbl(con, in_schema("datawarehouse", "vflight"))
carriers <- tbl(con, in_schema("datawarehouse", "carrier"))
2.2 Under the hood
Use show_query()
to preview the SQL statement that will be sent to the database
- SQL statement that actually runs when we ran
airports
as a command
show_query(airports)
## <SQL>
## SELECT *
## FROM datawarehouse.airport
- Easily view the resulting query by adding
show_query()
in another piped command
airports %>%
show_query()
## <SQL>
## SELECT *
## FROM datawarehouse.airport
- Insert
head()
in between the two statements to see how the SQL changes
airports %>%
head() %>%
show_query()
## <SQL>
## SELECT *
## FROM datawarehouse.airport
## LIMIT 6
- Use
sql_render()
andsimulate_mssql()
to see how the SQL statement changes from vendor to vendor
airports %>%
head() %>%
sql_render(con = simulate_mssql())
## <SQL> SELECT TOP 6 *
## FROM datawarehouse.airport
- Use
explain()
to explore the query plan
airports %>%
head() %>%
explain()
## <SQL>
## SELECT *
## FROM datawarehouse.airport
## LIMIT 6
##
## <PLAN>
## Limit (cost=0.00..0.14 rows=6 width=56)
## -> Seq Scan on airport (cost=0.00..7.05 rows=305 width=56)
2.3 Un-translated R commands
Review of how dbplyr
handles R commands that have not been translated into a like-SQL command
- Preview how
Sys.time()
is translated
airports %>%
mutate(today = Sys.time()) %>%
show_query()
## <SQL>
## SELECT "airport", "airportname", "city", "state", "country", "lat", "long", SYS.TIME() AS "today"
## FROM datawarehouse.airport
- Use PostgreSQL’s native commands, in this case
now()
airports %>%
mutate(today = now()) %>%
show_query()
## <SQL>
## SELECT "airport", "airportname", "city", "state", "country", "lat", "long", NOW() AS "today"
## FROM datawarehouse.airport
- Run the
dplyr
code to confirm it works
airports %>%
mutate(today = now()) %>%
select(today) %>%
head()
## # Source: lazy query [?? x 1]
## # Database: postgres [rstudio_dev@localhost:/postgres]
## today
## <dttm>
## 1 2019-01-11 03:43:50
## 2 2019-01-11 03:43:50
## 3 2019-01-11 03:43:50
## 4 2019-01-11 03:43:50
## 5 2019-01-11 03:43:50
## 6 2019-01-11 03:43:50
2.4 Using bang-bang
Intro on passing unevaluated code to a dplyr verb
- Preview how
Sys.time()
is translated
airports %>%
mutate(today = Sys.time()) %>%
show_query()
## <SQL>
## SELECT "airport", "airportname", "city", "state", "country", "lat", "long", SYS.TIME() AS "today"
## FROM datawarehouse.airport
- Preview how
Sys.time()
is translated when prefixing!!
airports %>%
mutate(today = !!Sys.time()) %>%
show_query()
## <SQL>
## SELECT "airport", "airportname", "city", "state", "country", "lat", "long", '2019-01-11T03:43:50Z' AS "today"
## FROM datawarehouse.airport
- Preview how
Sys.time()
is translated when prefixing!!
airports %>%
mutate(today = !!Sys.time()) %>%
select(today) %>%
head()
## # Source: lazy query [?? x 1]
## # Database: postgres [rstudio_dev@localhost:/postgres]
## today
## <chr>
## 1 2019-01-11T03:43:50Z
## 2 2019-01-11T03:43:50Z
## 3 2019-01-11T03:43:50Z
## 4 2019-01-11T03:43:50Z
## 5 2019-01-11T03:43:50Z
## 6 2019-01-11T03:43:50Z
2.5 knitr SQL engine
- Copy the result of the latest
show_query()
exercise
airports %>%
mutate(today = !!Sys.time()) %>%
show_query()
## <SQL>
## SELECT "airport", "airportname", "city", "state", "country", "lat", "long", '2019-01-11T03:43:50Z' AS "today"
## FROM datawarehouse.airport
- Paste the result in this SQL chunk
SELECT "airport", "airportname", "city", "state", "country", "lat", "long", '2018-01-26T14:50:10Z' AS "today"
FROM datawarehouse.airport
airport | airportname | city | state | country | lat | long | today |
---|---|---|---|---|---|---|---|
ABE | Lehigh Valley International | Allentown | PA | USA | 40.65236 | -75.44040 | 2018-01-26T14:50:10Z |
ABI | Abilene Regional | Abilene | TX | USA | 32.41132 | -99.68190 | 2018-01-26T14:50:10Z |
ABQ | Albuquerque International | Albuquerque | NM | USA | 35.04022 | -106.60919 | 2018-01-26T14:50:10Z |
ABY | Southwest Georgia Regional | Albany | GA | USA | 31.53552 | -84.19447 | 2018-01-26T14:50:10Z |
ACK | Nantucket Memorial | Nantucket | MA | USA | 41.25305 | -70.06018 | 2018-01-26T14:50:10Z |
ACT | Waco Regional | Waco | TX | USA | 31.61129 | -97.23052 | 2018-01-26T14:50:10Z |
ACV | Arcata | Arcata/Eureka | CA | USA | 40.97812 | -124.10862 | 2018-01-26T14:50:10Z |
ACY | Atlantic City International | Atlantic City | NJ | USA | 39.45758 | -74.57717 | 2018-01-26T14:50:10Z |
ADK | Adak | Adak | AK | USA | 51.87796 | -176.64603 | 2018-01-26T14:50:10Z |
ADQ | Kodiak | Kodiak | AK | USA | 57.74997 | -152.49386 | 2018-01-26T14:50:10Z |
2.6 Basic aggregation
A couple of dplyr
commands that run in-database
- How many records are in the airport table?
tbl(con, in_schema("datawarehouse", "airport")) %>%
tally()
## # Source: lazy query [?? x 1]
## # Database: postgres [rstudio_dev@localhost:/postgres]
## n
## <S3: integer64>
## 1 305
- What is the average character length of the airport codes? How many characters is the longest and the shortest airport name?
airports %>%
summarise(
avg_airport_length = mean(str_length(airport), na.rm = TRUE),
max_airport_name = max(str_length(airportname), na.rm = TRUE),
min_airport_name = min(str_length(airportname), na.rm = TRUE),
total_records = n()
)
## # Source: lazy query [?? x 4]
## # Database: postgres [rstudio_dev@localhost:/postgres]
## avg_airport_length max_airport_name min_airport_name total_records
## <dbl> <int> <int> <S3: integer64>
## 1 3 40 3 305
- How many records are in the carrier table?
carriers %>%
tally()
## # Source: lazy query [?? x 1]
## # Database: postgres [rstudio_dev@localhost:/postgres]
## n
## <S3: integer64>
## 1 20
- How many characters is the longest carriername?
carriers %>%
summarise(x = max(str_length(carriername), na.rm = TRUE))
## # Source: lazy query [?? x 1]
## # Database: postgres [rstudio_dev@localhost:/postgres]
## x
## <int>
## 1 83
- What is the SQL statement sent in exercise 4?
carriers %>%
summarise(x = max(str_length(carriername), na.rm = TRUE)) %>%
show_query()
## <SQL>
## SELECT MAX(LENGTH("carriername")) AS "x"
## FROM datawarehouse.carrier