Supported in Databricks Connect v2
Open your .Renviron file: usethis::edit_r_environ()
In the .Renviron file, add your Databricks Host URL and personal access token (PAT):
DATABRICKS_HOST = [Your Host URL]
DATABRICKS_TOKEN = [Your PAT]
Install extension: install.packages("pysparklyr")
Open connection:
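A minimal sketch; the cluster ID is a placeholder to fill in, and method = "databricks_connect" is provided by the pysparklyr extension:
library(sparklyr)
sc <- spark_connect(
  cluster_id = "[Your cluster ID]",
  method = "databricks_connect"
)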
Install RStudio Server on one of the existing nodes or a server in the same LAN
Open a connection
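For example (the version number and paths are illustrative placeholders):
sc <- spark_connect(
  master = "spark://host:port",
  version = "3.5",
  spark_home = "[Path to Spark installation]"
)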
Install RStudio Server on an edge node
Locate the path to the cluster's Spark home directory; it is normally "/usr/lib/spark"
Basic configuration example
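A sketch of a basic setup; the resource values shown are illustrative:
conf <- spark_config()
conf$spark.executor.memory <- "300M"
conf$spark.executor.cores <- 2
conf$spark.executor.instances <- 3
conf$spark.dynamicAllocation.enabled <- "false"
These are standard Spark properties; adjust them to your cluster's capacity.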
Make sure copies of the yarn-site.xml and hive-site.xml files are available on the RStudio Server machine
Point environment variables to the correct paths
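For example (all paths are placeholders for your installation):
Sys.setenv(JAVA_HOME = "[Path to Java]")
Sys.setenv(SPARK_HOME = "[Path to Spark]")
Sys.setenv(YARN_CONF_DIR = "[Path to the folder containing yarn-site.xml]")
# Then open the connection, reusing the configuration object from above
sc <- spark_connect(master = "yarn", config = conf)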
Use the following to obtain the Host and Port: system2("kubectl", "cluster-info")
Open a connection
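A connection sketch using spark_config_kubernetes(); the host, port, account, and image values are placeholders:
sc <- spark_connect(config = spark_config_kubernetes(
  master = "k8s://https://[HOST]:[PORT]",
  account = "default",
  image = "docker.io/owner/repo:version"
))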
No cluster required. Use for learning purposes only
Install a local version of Spark: spark_install()
Open a connection
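For example:
sc <- spark_connect(master = "local")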
Azure - spark_connect(method = "synapse")
Qubole - spark_connect(method = "qubole")


Import data into Spark, not R
Arguments that apply to all functions:
sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE
spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
spark_read_json()
spark_read_parquet()
spark_read_text()
spark_read_delta()
dplyr::tbl(sc, ...) - Creates a reference to the table without loading its data into memory
dbplyr::in_catalog() - Enables a three-part table address
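A sketch combining the two; the catalog, schema, and table names are hypothetical:
orders <- dplyr::tbl(sc, dbplyr::in_catalog("main", "sales", "orders"))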
Supported in Databricks Connect v2
Apache Arrow accelerates data transfer between R and Spark. To use it, simply load the library.
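For example; with arrow attached, subsequent transfers (e.g., copy_to(), collect()) use it automatically:
library(arrow)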
Supported in Databricks Connect v2
Translates into Spark SQL statements
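For example, dplyr verbs compose lazily, and show_query() prints the generated Spark SQL; tbl_mtcars is a hypothetical Spark table reference:
tbl_mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  show_query()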
pivot_longer() - Collapse several columns into two; see the sketch after this list. (Supported in Databricks Connect v2)
pivot_wider() - Expand two columns into several. (Supported in Databricks Connect v2)
nest() / unnest() - Convert groups of cells into list-columns, and vice versa.
unite() / separate() - Split a single column into several columns, and vice versa.
fill() - Fill NA with the previous value.
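A pivot_longer() sketch; tbl_weather and its columns are hypothetical:
tbl_weather %>%
  pivot_longer(cols = c(temp_min, temp_max), names_to = "measure", values_to = "temp")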
ft_binarizer() - Assigns values based on a threshold
ft_bucketizer() - Numeric column to discretized column
ft_count_vectorizer() - Extracts a vocabulary from documents
ft_discrete_cosine_transform() - 1D discrete cosine transform of a real vector
ft_elementwise_product() - Element-wise product between two columns
ft_hashing_tf() - Maps a sequence of terms to their term frequencies using the hashing trick
ft_idf() - Compute the Inverse Document Frequency (IDF) given a collection of documents
ft_imputer() - Imputation estimator for completing missing values, uses the mean or the median of the columns
ft_index_to_string() - Index labels back to labels as strings
ft_interaction() - Takes in Double and Vector columns and outputs a flattened vector of their feature interactions
ft_max_abs_scaler() - Rescale each feature individually to range [-1, 1] (Supported in Databricks Connect v2)
ft_min_max_scaler() - Rescale each feature to a common range [min, max] linearly
ft_ngram() - Converts the input array of strings into an array of n-grams
ft_bucketed_random_projection_lsh(), ft_minhash_lsh() - Locality Sensitive Hashing functions for Euclidean distance and Jaccard distance (MinHash)
ft_normalizer() - Normalize a vector to have unit norm using the given p-norm
ft_one_hot_encoder() - Continuous to binary vectors
ft_pca() - Project vectors to a lower dimensional space of top k principal components
ft_quantile_discretizer() - Continuous to binned categorical values
ft_regex_tokenizer() - Extracts tokens by using the provided regex pattern to split the text
ft_robust_scaler() - Removes the median and scales according to the standard scale
ft_standard_scaler() - Removes the mean and scales to unit variance using column summary statistics (Supported in Databricks Connect v2)
ft_stop_words_remover() - Filters out stop words from input
ft_string_indexer() - Column of labels into a column of label indices
ft_tokenizer() - Converts to lowercase and then splits by white spaces
ft_vector_assembler() - Combine vectors into a single row-vector
ft_vector_indexer() - Indexing categorical feature columns in a dataset of Vector
ft_vector_slicer() - Takes a feature vector and outputs a new feature vector with a subarray of the original features
ft_word2vec() - Word2Vec transforms a word into a code
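For example, a binarizer on a local copy of mtcars; the column names are illustrative:
cars <- copy_to(sc, mtcars)
ft_binarizer(cars, input_col = "mpg", output_col = "over_20", threshold = 20)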
Supported in Databricks Connect v2
ml_linear_regression() - Linear regression
ml_aft_survival_regression() - Parametric survival regression model named accelerated failure time (AFT) model
ml_generalized_linear_regression() - GLM
ml_isotonic_regression() - Uses parallelized pool adjacent violators algorithm
ml_random_forest_regressor() - Regression using random forests
ml_linear_svc() - Classification using linear support vector machines
ml_logistic_regression() - Logistic regression (Supported in Databricks Connect v2)
ml_multilayer_perceptron_classifier() - Based on the Multilayer Perceptron
ml_naive_bayes() - Supports Multinomial NB, which can handle finitely supported discrete data
ml_one_vs_rest() - Multiclass reduction using the one-against-all strategy
ml_decision_tree_classifier(), ml_decision_tree(), ml_decision_tree_regressor() - Classification and regression using decision trees
ml_gbt_classifier(), ml_gradient_boosted_trees(), ml_gbt_regressor() - Binary classification and regression using gradient boosted trees
ml_random_forest_classifier() - Classification and regression using random forests
ml_feature_importances(), ml_tree_feature_importance() - Feature importance for tree models
ml_bisecting_kmeans() - A bisecting k-means algorithm
ml_lda(), ml_describe_topics(), ml_log_likelihood(), ml_log_perplexity(), ml_topics_matrix() - LDA topic model designed for text documents
ml_gaussian_mixture() - Expectation maximization for multivariate Gaussian Mixture Models (GMMs)
ml_kmeans(), ml_compute_cost(), ml_compute_silhouette_measure() - Clustering with support for k-means
ml_power_iteration() - For clustering vertices of a graph given pairwise similarities as edge properties
ml_als(), ml_recommend() - Recommendation using Alternating Least Squares matrix factorization
ml_clustering_evaluator() - Evaluator for clustering
ml_evaluate() - Compute performance metrics
ml_binary_classification_evaluator(), ml_binary_classification_eval(), ml_classification_eval() - A set of functions to calculate performance metrics for prediction models
ml_fpgrowth(), ml_association_rules(), ml_freq_itemsets() - A parallel FP-growth algorithm to mine frequent itemsets
ml_freq_seq_patterns(), ml_prefixspan() - PrefixSpan algorithm for mining frequent sequential patterns
ml_summary() - Extracts a metric from the summary object of a Spark ML model
ml_corr() - Compute correlation matrix
ml_chisquare_test(x, features, label) - Pearson's independence test for every feature against the label
ml_default_stop_words() - Loads the default stop words for the given language
ml_call_constructor() - Identifies the associated sparklyr ML constructor for the JVM
ml_model_data() - Extracts data associated with a Spark ML model
ml_standardize_formula() - Generates a formula string from user inputs
ml_uid() - Extracts the UID of an ML object
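A model-fitting sketch, reusing cars (the Spark copy of mtcars from the earlier sketch); the formula is illustrative:
model <- ml_logistic_regression(cars, am ~ mpg + wt)
summary(model)
ml_evaluate(model, cars)  # performance metrics, here on the training data for brevity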
Easily create a formal Spark Pipeline model using R. Save the Pipeline in native Scala; it will have no dependencies on R.
Supported in Databricks Connect v2
ml_pipeline() - Initializes a new Spark Pipeline
ml_fit() - Trains the model, outputs a Spark Pipeline Model
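A pipeline sketch; ft_r_formula() is used here as an assumed feature-building stage, and cars is the Spark table from the earlier sketches:
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(am ~ mpg + wt) %>%
  ml_logistic_regression()
fitted <- ml_fit(pipeline, cars)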
Supported in Databricks Connect v2
ml_save() - Saves into a format that can be read by Scala and PySpark
ml_read() - Reads a Spark object into sparklyr
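For example (the path is a placeholder; the saved pipeline can later be re-loaded without R):
ml_save(fitted, path = "saved_pipeline_model", overwrite = TRUE)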
Supported in Databricks Connect v2
Run arbitrary R code at scale inside your cluster with spark_apply(). Useful when you need functionality only available in R, and to solve "embarrassingly parallel" problems.
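A minimal sketch; sdf_len() creates a one-column Spark data frame to map over:
sdf_len(sc, 10) %>%
  spark_apply(function(df) df * 10)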
CC BY SA Posit Software, PBC • info@posit.co • posit.co
Learn more at spark.posit.co and therinspark.com.
Updated: 2024-06.