Create models with parsnip :: Cheatsheet

Basics

parsnip provides a tidy, unified interface to a range of models from other packages, so you do not have to remember the calling conventions of each underlying modeling function.

A parsnip specification is made up of 3 main components:

The type of model to be used, such as random forest (rand_forest()) or linear regression (linear_reg()).
The mode, or how the model will be used. The two most common are “regression” and “classification”.
The computational engine, or program that will actually execute the training. It could be an external R package, such as ranger, or even an engine outside of R, such as Stan or Apache Spark.

library(tidymodels)

rand_forest(mtry = 10, trees = 2000) |> # Define type of model
  set_engine("ranger", importance = "impurity") |> # Select an engine
  set_mode("regression") # Set the mode

Random Forest Model Specification (regression)

Main Arguments:
  mtry = 10
  trees = 2000

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger

set_engine(object, engine, ...) - Specifies which package or system will be used to fit the model, along with any arguments specific to that software.

set_args(object, ...) - Modifies the arguments of a model specification

set_mode(object, mode, …) - Changes the model’s mode.
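
As a minimal sketch of how set_args() and set_mode() can be chained, the example below adjusts an existing specification without fitting anything. The argument values are arbitrary and chosen only for illustration.

```r
library(parsnip)

# Start from a random forest specification (the default engine is "ranger").
rf_spec <- rand_forest(mtry = 10, trees = 2000, mode = "regression")

# set_args() returns a modified copy: `trees` is changed and `min_n` is added.
rf_updated <- set_args(rf_spec, trees = 500, min_n = 5)

# set_mode() changes the mode of the same specification.
rf_class <- set_mode(rf_updated, "classification")

rf_class
```

The original rf_spec is left unchanged, so several variants can be derived from one base specification.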

show_engines(x) - Lists the engines available for a model. The possible engines can depend on which packages are loaded, because some parsnip extensions add engines to existing models.

show_engines("linear_reg")

# A tibble: 8 × 2
  engine   mode               
  <chr>    <chr>              
1 lm       regression         
2 glm      regression         
3 glmnet   regression         
4 stan     regression         
5 spark    regression         
6 keras    regression         
7 brulee   regression         
8 quantreg quantile regression

Legends

Mode Support Numbers

1 - Classification
2 - Regression
3 - Censored Regression
4 - Quantile Regression

Engine Tags

Engine tags show the engine name and mode support numbers. For example, h2o ¹² means that the h2o engine supports classification (1) and regression (2).

Classification Only

logistic_reg(mode = "classification", engine = "glm", penalty, mixture) - Generalized linear model for binary outcomes. A linear combination of the predictors is used to model the log odds of an event.

brulee ¹ gee ¹ glm ¹ glmer ¹ glmnet ¹ h2o ¹ keras ¹ LiblineaR ¹ spark ¹ stan ¹ stan_glmer ¹
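
A small sketch of logistic_reg() with the "glm" engine follows. It assumes the built-in mtcars data, with the am column recoded as a two-level factor; the choice of predictors is arbitrary.

```r
library(parsnip)

# Recode transmission type as a factor so it can serve as a binary outcome.
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

logit_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(am ~ wt + hp, data = cars)

# type = "prob" returns one probability column per factor level.
class_probs <- predict(logit_fit, new_data = cars[1:3, ], type = "prob")
class_probs
```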
multinom_reg(mode = "classification", engine = "nnet", penalty, mixture) - Uses linear predictors to predict multiclass data using the multinomial distribution.

brulee ¹ glmnet ¹ h2o ¹ keras ¹ nnet ¹ spark ¹
naive_Bayes(mode = "classification", smoothness, Laplace, engine = "klaR") - Uses Bayes’ theorem to compute the probability of each class, given the predictor values.

h2o ¹ klaR ¹ naivebayes ¹
null_model(mode = "classification", engine = "parsnip") - Fits a single mean or largest class model. It serves as a simple, non-informative baseline.

parsnip ¹
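
To illustrate the "largest class" behavior, the sketch below fits null_model() to a recoded copy of mtcars (an assumption made for this example). Every prediction should be the most frequent class in the training data, regardless of the predictors.

```r
library(parsnip)

# Two-level outcome: 19 automatic and 13 manual cars in mtcars.
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

null_fit <- null_model(mode = "classification") |>
  set_engine("parsnip") |>
  fit(am ~ ., data = cars)

# The predictors are ignored; each row gets the majority class.
null_preds <- predict(null_fit, new_data = cars[1:5, ])
null_preds
```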
ordinal_reg(mode = "classification", ordinal_link, odds_link, penalty, mixture, engine = "polr") - Defines a generalized linear model that predicts an ordinal outcome.

rpartScore ¹ polr ¹ vgam ¹ vglm ¹

Regression Only

linear_reg(mode = "regression", engine = "lm", penalty, mixture) - Defines a model that can predict numeric values from predictors using a linear function.

brulee ² gee ² glm ² glmer ² glmnet ² gls ² h2o ² keras ² lm ² lme ² quantreg ² spark ² stan ² stan_glmer ²
poisson_reg(mode = "regression", penalty, mixture, engine = "glm") - Defines a generalized linear model for count data that follow a Poisson distribution.

gee ² glm ² glmer ² glmnet ² h2o ² hurdle ² stan ² stan_glmer ² zeroinfl ²

General Use

decision_tree(mode, engine = "rpart", cost_complexity, tree_depth, min_n) - A set of if/then statements creates a tree-based structure.

partykit ¹²³ rpart ¹²³ spark ¹² C5.0 ¹
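
The models in this section have no default mode, so it has to be set before fitting. A minimal sketch with decision_tree() and the "rpart" engine, assuming the rpart package is installed and using arbitrary tuning values:

```r
library(parsnip)

tree_spec <- decision_tree(tree_depth = 3, min_n = 5) |>
  set_engine("rpart") |>
  set_mode("regression")  # required: the mode is "unknown" until it is set

tree_fit <- fit(tree_spec, mpg ~ wt + hp, data = mtcars)

tree_preds <- predict(tree_fit, new_data = mtcars[1:3, ])
tree_preds
```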
mars(mode, engine = "earth", num_terms, prod_degree, prune_method) - Uses artificial features for some predictors. These features resemble hinge functions and the result is a model that is a segmented regression in small dimensions.

earth ¹²
mlp(mode, engine = "nnet", hidden_units, penalty, dropout, epochs, activation, learn_rate) - Defines a multilayer perceptron model (a.k.a. a single layer, feed-forward neural network).

nnet ¹² brulee ¹² brulee_two_layer ¹² keras ¹² grnn ¹²
gen_additive_mod(mode, select_features, adjust_deg_free, engine = "mgcv") - Uses smoothed functions of numeric predictors in a generalized linear model.

mgcv ¹²
nearest_neighbor(mode, engine = "kknn", neighbors, weight_func, dist_power) - Uses the K most similar data points from the training set to predict new samples.

kknn ¹²
pls(mode, predictor_prop, num_comp, engine = "mixOmics") - Uses latent variables to model the data. Similar to a supervised version of PCA.

mixOmics ¹²

Discriminant

discrim_flexible(mode = "classification", num_terms, prod_degree, prune_method, engine = "earth") - Fits a discriminant analysis model that uses nonlinear features created using MARS.

earth ¹
discrim_regularized(mode = "classification", frac_common_cov, frac_identity, engine = "klaR") - Estimates a multivariate distribution for the predictors separately for the data in each class. The model’s structure can be LDA, QDA, or a combination of the two. Each class’s probability is computed using Bayes’ theorem, given the predictor values.

klaR ¹

Estimates a multivariate distribution for the predictors separately for the data in each class using a method described below. Each class’ probability is computed using Bayes’ theorem, given the predictor values.

discrim_linear(mode = "classification", regularization_method, engine = "MASS", penalty) - Uses a Gaussian distribution with a common covariance matrix to perform the estimate.

MASS ¹ mda ¹ sda ¹ sparsediscrim ¹
discrim_quad(mode = "classification", regularization_method, engine = "MASS") - Uses a Gaussian distribution with separate covariance matrices to perform the estimate.

MASS ¹ sparsediscrim ¹

Support Vector Machine

Classification: Maximizes the width of the margin between classes using a method described below.

Regression: Optimizes a robust loss function only affected by very large model residuals and uses an additional method described below.

svm_linear(mode, cost, engine = "LiblineaR", margin) - Classification: A linear class boundary. Regression: Uses a linear fit.

kernlab ¹² LiblineaR ¹²
svm_poly(mode, cost, engine = "kernlab", degree, scale_factor) - Classification: A polynomial class boundary. Regression: Uses polynomial functions of the predictors.

kernlab ¹²
svm_rbf(mode, cost, engine = "kernlab", rbf_sigma) - Classification: A nonlinear class boundary. Regression: Uses nonlinear functions of the predictors.

kernlab ¹²

Feature Rules

rule_fit(mode, mtry, trees, min_n, tree_depth, learn_rate, loss_reduction, sample_size, stop_iter, penalty, engine = "xrf") - Derives simple feature rules from a tree ensemble and uses them as features in a regularized model.

xrf ¹² h2o ¹
C5_rules(mode = "classification", trees, min_n, engine = "C5.0") - Derives feature rules from a tree for prediction. A single tree or boosted ensemble can be used.

C5.0 ¹
cubist_rules(mode = "regression", committees, neighbors, max_rules, engine = "Cubist") - Derives simple feature rules from a tree ensemble and creates regression models within each rule.

Cubist ²

Ensemble

“E Pluribus Unum”

bag_mars(mode, num_terms, prod_degree, prune_method, engine = "earth") - Ensemble of generalized linear models that use artificial features for some predictors. These features resemble hinge functions and the result is a model that is a segmented regression in small dimensions.

earth ¹²
bag_mlp(mode, hidden_units, penalty, epochs, engine = "nnet") - An ensemble of single layer, feed-forward neural networks.

nnet ¹²
bag_tree(mode, cost_complexity = 0, tree_depth, min_n = 2, class_cost, engine = "rpart") - Ensemble of decision trees.

C5.0 ¹ rpart ¹²³
bart(mode, engine = "dbarts", trees, prior_terminal_node_coef, prior_terminal_node_expo, prior_outcome_range) - Tree ensemble model that uses Bayesian analysis to assemble the ensemble.

dbarts ¹²
boost_tree(mode, engine = "xgboost", mtry, trees, min_n, tree_depth, learn_rate, loss_reduction, sample_size, stop_iter) - Creates a series of decision trees forming an ensemble. Each tree depends on the results of previous trees. All trees in the ensemble are combined to produce a final prediction.

C5.0 ¹ catboost ¹² h2o ¹² lightgbm ¹² mboost ³ spark ¹² xgboost ¹²⁴
rand_forest(mode, engine = "ranger", mtry, trees, min_n) - Creates a large number of decision trees, each independent of the others. The final prediction uses all predictions from the individual trees and combines them.

aorsf ¹²³ grf ¹²⁴ h2o ¹² partykit ¹²³ randomForest ¹² ranger ¹² spark ¹²

Survival

proportional_hazards(mode = "censored regression", engine = "survival", penalty, mixture) - Defines a model for the hazard function as a multiplicative function of covariates times a baseline hazard.

glmnet ³ survival ³
survival_reg(mode = "censored regression", engine = "survival", dist) - Defines a parametric survival model.

flexsurv ³ flexsurvspline ³ survival ³

Operations

library(tidymodels)

lm_spec <- linear_reg() |> 
  set_engine("lm")

lm_spec

Linear Regression Model Specification (regression)

Computational engine: lm

Methods

fit(object, ...) - Estimates parameters for a given model from a set of data.

lm_fit <- fit(lm_spec, mpg ~ ., data = mtcars)

lm_fit

parsnip model object


Call:
stats::lm(formula = mpg ~ ., data = data)

Coefficients:
(Intercept)          cyl         disp           hp         drat           wt  
   12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530  
       qsec           vs           am         gear         carb  
    0.82104      0.31776      2.52023      0.65541     -0.19942
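
Besides the formula interface shown above, parsnip also exports fit_xy(), which takes the predictors and outcome as separate objects. It is not listed in this cheatsheet; the sketch below is included only to show that both interfaces can lead to the same model, using an arbitrary pair of predictors.

```r
library(parsnip)

lm_spec <- linear_reg() |>
  set_engine("lm")

# Formula interface.
formula_fit <- fit(lm_spec, mpg ~ wt + hp, data = mtcars)

# x/y interface: predictors as a data frame, outcome as a vector.
xy_fit <- fit_xy(lm_spec, x = mtcars[, c("wt", "hp")], y = mtcars$mpg)

# Both fits estimate the same coefficients.
coef(formula_fit$fit)
coef(xy_fit$fit)
```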

predict(object, ...) - Generates predictions from a fitted model for new data and returns them as a tibble.

predict(lm_fit, mtcars)

# A tibble: 32 × 1
   .pred
   <dbl>
 1  22.6
 2  22.1
 3  26.3
 4  21.2
 5  17.7
 6  20.4
 7  14.4
 8  22.5
 9  24.4
10  18.7
# ℹ 22 more rows
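
predict() also accepts a type argument that changes what is returned. As a sketch, the example below requests confidence intervals from an "lm" fit; the 90% level and the reduced formula are arbitrary choices for illustration.

```r
library(parsnip)

lm_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt + hp, data = mtcars)

new_cars <- mtcars[1:3, ]

# Default numeric predictions: a single .pred column.
point_preds <- predict(lm_fit, new_data = new_cars)

# Confidence intervals: .pred_lower and .pred_upper columns.
interval_preds <- predict(lm_fit, new_data = new_cars, type = "conf_int", level = 0.90)

point_preds
interval_preds
```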

autoplot(object, ...) - Uses ggplot2 to draw a particular plot for an object of a particular class
update(object, ...) - Updates and (by default) re-fits a model. It does this by extracting the call stored in the object, updating the call, and evaluating that call.
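
When update() is applied to a parsnip model specification rather than a fitted object, it changes the specification's arguments and nothing is re-fit. A minimal sketch, with arbitrary penalty values:

```r
library(parsnip)

spec <- linear_reg(penalty = 0.1, mixture = 0.5)

# By default, arguments that are not mentioned keep their current values.
spec_kept <- update(spec, penalty = 0.01)

# fresh = TRUE discards the existing arguments before applying the new ones.
spec_fresh <- update(spec, penalty = 0.01, fresh = TRUE)

spec_kept
spec_fresh
```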

Tidiers

augment(x, ...) - Augment data with model results

augment(lm_fit, mtcars)

# A tibble: 32 × 13
   .pred  .resid   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear
   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  22.6 -1.60    21       6  160    110  3.9   2.62  16.5     0     1     4
 2  22.1 -1.11    21       6  160    110  3.9   2.88  17.0     0     1     4
 3  26.3 -3.45    22.8     4  108     93  3.85  2.32  18.6     1     1     4
 4  21.2  0.163   21.4     6  258    110  3.08  3.22  19.4     1     0     3
 5  17.7  1.01    18.7     8  360    175  3.15  3.44  17.0     0     0     3
 6  20.4 -2.28    18.1     6  225    105  2.76  3.46  20.2     1     0     3
 7  14.4 -0.0863  14.3     8  360    245  3.21  3.57  15.8     0     0     3
 8  22.5  1.90    24.4     4  147.    62  3.69  3.19  20       1     0     4
 9  24.4 -1.62    22.8     4  141.    95  3.92  3.15  22.9     1     0     4
10  18.7  0.501   19.2     6  168.   123  3.92  3.44  18.3     1     0     4
# ℹ 22 more rows
# ℹ 1 more variable: carb <dbl>
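
Because augment() returns the predictions and residuals alongside the original columns, simple diagnostics can be computed directly from its output. The sketch below derives a training-set RMSE by hand; it is illustrative only, and a metrics package would normally be used for this.

```r
library(parsnip)

lm_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ ., data = mtcars)

aug <- augment(lm_fit, new_data = mtcars)

# .resid is the observed outcome minus .pred for each row.
train_rmse <- sqrt(mean(aug$.resid^2))
train_rmse
```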

glance(x, ...) - Construct a single row summary “glance” of a model fit

glance(lm_fit)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic     p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>       <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.869         0.807  2.65      13.9 0.000000379    10  -69.9  164.  181.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

tidy(x, ...) - Turn an object into a tidy tibble

tidy(lm_fit)

# A tibble: 11 × 5
   term        estimate std.error statistic p.value
   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)  12.3      18.7        0.657  0.518 
 2 cyl          -0.111     1.05      -0.107  0.916 
 3 disp          0.0133    0.0179     0.747  0.463 
 4 hp           -0.0215    0.0218    -0.987  0.335 
 5 drat          0.787     1.64       0.481  0.635 
 6 wt           -3.72      1.89      -1.96   0.0633
 7 qsec          0.821     0.731      1.12   0.274 
 8 vs            0.318     2.10       0.151  0.881 
 9 am            2.52      2.06       1.23   0.234 
10 gear          0.655     1.49       0.439  0.665 
11 carb         -0.199     0.829     -0.241  0.812

General

repair_call(x, data) - When the user passes a formula to fit() and the underlying model function uses a formula, the call object produced by fit() may not be usable by other functions.
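
As a sketch of what repair_call() changes: the call stored by the "lm" engine refers to a placeholder named data, and repair_call() substitutes the data object that is passed to it.

```r
library(parsnip)

lm_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ ., data = mtcars)

# The stored call points at `data`, which does not exist outside fit().
lm_fit$fit$call

# After the repair, the call refers to `mtcars` instead.
repaired <- repair_call(lm_fit, data = mtcars)
repaired$fit$call
```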
control_parsnip(verbosity = 1L, catch = FALSE) - Pass options to the fit.model_spec() function to control its output and computations.
```
control_parsnip(verbosity = 2)
```
```
parsnip control object
 - verbose level 2 
```
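
The control object takes effect when it is passed to fit() through the control argument. A minimal sketch, assuming a quiet fit is wanted:

```r
library(parsnip)

lm_spec <- linear_reg() |>
  set_engine("lm")

# verbosity = 0L suppresses parsnip's own messages during fitting.
quiet_ctrl <- control_parsnip(verbosity = 0L)

quiet_fit <- fit(lm_spec, mpg ~ wt, data = mtcars, control = quiet_ctrl)
quiet_fit
```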

show_engines(x) - Lists the engines available for a model. The possible engines can depend on which packages are loaded, because some parsnip extensions add engines to existing models.

translate(x, ...) - Translates a model specification into a code object that is specific to a particular engine (e.g. R package). It translates generic parameters to their counterparts.

translate(lm_spec)

Linear Regression Model Specification (regression)

Computational engine: lm 

Model fit template:
stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
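
The translation of generic parameters is easier to see with a model that has main arguments. The sketch below translates one random forest specification for two engines. No model is fitted, so the engine packages should not need to be installed for this step.

```r
library(parsnip)

rf_spec <- rand_forest(trees = 500, mode = "regression")

# The generic `trees` argument becomes `num.trees` for ranger ...
ranger_code <- translate(rf_spec, engine = "ranger")

# ... and `ntree` for randomForest.
rf_code <- translate(rf_spec, engine = "randomForest")

ranger_code
rf_code
```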

multi_predict(object, ...) - For some models, predictions can be made on sub-models in the model object.

Extract

extract_spec_parsnip(x, ...) - Returns a parsnip model specification.

extract_spec_parsnip(lm_fit)

Linear Regression Model Specification (regression)

Computational engine: lm 

Model fit template:
stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())

extract_fit_engine(x, ...) - Returns the engine specific fit embedded within a parsnip model fit. For example, when using linear_reg() with the “lm” engine, this returns the underlying lm object.

extract_fit_engine(lm_fit)


Call:
stats::lm(formula = mpg ~ ., data = data)

Coefficients:
(Intercept)          cyl         disp           hp         drat           wt  
   12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530  
       qsec           vs           am         gear         carb  
    0.82104      0.31776      2.52023      0.65541     -0.19942
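
Once extracted, the engine fit can be handed to functions that expect the native object. A small sketch using base R's summary() and confint() on the underlying lm object:

```r
library(parsnip)

lm_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ ., data = mtcars)

engine_fit <- extract_fit_engine(lm_fit)
class(engine_fit)

# Base R methods for lm objects work on the extracted fit.
r_squared <- summary(engine_fit)$r.squared
coef_intervals <- confint(engine_fit)

r_squared
```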

extract_parameter_dials(x, parameter, ...) - Returns a single dials parameter object.
extract_parameter_set_dials(x, ...) - Returns a set of dials parameter objects.
extract_fit_time(x, summarize = TRUE, ...) - Returns a tibble with fit times. The fit times correspond to the time for the parsnip engine to fit and do not include other portions of the elapsed time in fit.model_spec().
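
The two dials extractors apply to specifications whose arguments are marked for tuning with tune(). The sketch below assumes the tidymodels meta-package is attached, which provides dials; the "glmnet" engine is only named here, and nothing is fitted.

```r
library(tidymodels)

tune_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

# One row per argument marked with tune().
param_set <- extract_parameter_set_dials(tune_spec)

# A single dials parameter object, selected by its id.
penalty_param <- extract_parameter_dials(tune_spec, "penalty")

param_set
penalty_param
```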