R-ohjelmointi.org

Statistical programming in R

A quick note on Spark and R

Apache Spark is a cluster-computing system with add-on tools for functionality such as SQL queries and machine learning. Spark has APIs in several languages, one of which is R. At least two R packages implement an interface to Spark: SparkR and sparklyr. The sparklyr package makes installing Hadoop, Spark and Spark add-on packages rather easy, so let’s walk through the installation and a couple of examples.

Installation

The sparklyr package is developed by RStudio, and the basic installation is outlined on their website. Briefly, the following R script takes care of it:

# Installing packages
library(devtools)
devtools::install_github("rstudio/sparklyr")
library(sparklyr)
 
# List the Spark versions that can be installed
spark_available_versions()
# Download and install Spark and Hadoop on this computer
spark_install(version = "2.1.0", hadoop_version = "2.7")
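
To check that the download succeeded, the locally installed versions can be listed. This is a minimal sketch, assuming that sparklyr is installed and that its `spark_installed_versions()` function is available in your version of the package; the variable name `installed` is only illustrative:

```r
library(sparklyr)

# List the Spark/Hadoop combinations installed on this machine
installed <- spark_installed_versions()
installed
```

The Spark version requested in `spark_install()` should appear in the listing.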

On Windows, you need to copy Hadoop’s winutils.exe to the correct location. To get instructions on how to do this, try connecting to Spark:

# Download and copy winutils.exe as instructed by the message this call prints:
sc <- spark_connect(master = "local")

Finally, you might want to install machine learning libraries. The following command installs rsparkling, the R interface to Sparkling Water, which makes the h2o machine learning library available in a Spark cluster:

# rsparkling also installs h2o
install.packages("rsparkling")
# install.packages("h2o")

Note that the analyses might not work directly after the installation without restarting R first.

Usage

Connecting to Spark cluster

A connection from R to Spark can be established and tested as follows:

# Open connection
sc <- spark_connect(master = "local")
 
# Test connection
connection_is_open(sc)

A web interface for monitoring cluster performance can also be opened:

# Open the web interface
spark_web(sc)

Copying data to Spark cluster

After the connection has been established, data can be copied to the cluster:

library(dplyr)
iris_tbl <- copy_to(sc, iris)

Datasets on the cluster can be listed:

src_tbls(sc)

And datasets can also be queried with SQL:

# SQL Query
library(DBI)
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
iris_preview
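
The same table can also be queried with dplyr verbs, which sparklyr translates to Spark SQL behind the scenes. The following is a minimal sketch, assuming a working local Spark installation; it reconnects and copies the data again so that it can be run on its own, and `species_counts` is just an illustrative name:

```r
library(sparklyr)
library(dplyr)

# Assumes Spark has been installed locally with spark_install()
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# Count the rows per species; the counting is done in Spark,
# and collect() brings the small result back into R
species_counts <- iris_tbl %>%
  count(Species) %>%
  collect()

species_counts
```

Until `collect()` is called the pipeline is only a lazy query, so nothing large is moved from the cluster into the R session.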

Running analyses on the cluster

Let’s fit a simple generalized linear model (GLM) to the mtcars dataset. The rsparkling (and h2o) libraries need to be loaded first. The dataset is then copied to the cluster and converted to an h2o frame. After that, a GLM predicting mpg from the wt and cyl variables is fitted, and the results are printed to the screen:

library(rsparkling)
library(h2o)
 
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")
mtcars_h2o <- as_h2o_frame(sc, mtcars_tbl, strict_version_check = FALSE)
mtcars_glm <- h2o.glm(x = c("wt", "cyl"), 
                      y = "mpg",
                      training_frame = mtcars_h2o,
                      lambda_search = TRUE)
mtcars_glm

The model details are as follows:

Model Details:
==============
 
H2ORegressionModel: glm
Model ID:  GLM_model_R_1495887526376_1 
GLM Model: summary
    family     link                              regularization                                                              lambda_search number_of_predictors_total
1 gaussian identity Elastic Net (alpha = 0.5, lambda = 0.1013 ) nlambda = 100, lambda.max = 10.132, lambda.min = 0.1013, lambda.1se = -1.0                          2
  number_of_active_predictors number_of_iterations                                training_frame
1                           2                    0 frame_rdd_29_acf580c07eb0ec1f9b3ff42292864df0
 
Coefficients: glm coefficients
      names coefficients standardized_coefficients
1 Intercept    38.941654                 20.090625
2       cyl    -1.468783                 -2.623132
3        wt    -3.034558                 -2.969186
 
H2ORegressionMetrics: glm
** Reported on training data. **
 
MSE:  6.017684
RMSE:  2.453097
MAE:  1.940985
RMSLE:  0.1114801
Mean Residual Deviance :  6.017684
R^2 :  0.8289895
Null Deviance :1126.047
Null D.o.F. :31
Residual Deviance :192.5659
Residual D.o.F. :29
AIC :156.2425
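
The printed training metrics are related by simple arithmetic, which can be checked in base R without a cluster. The sketch below assumes a Gaussian GLM, so that the residual deviance equals the residual sum of squares, and takes the number of observations to be the null degrees of freedom plus one; the variable names are illustrative. It also fits an unregularized `lm` locally as a reference, whose coefficients should be close to, but not identical to, the elastic net estimates above:

```r
# Values copied from the printed H2O output
null_deviance     <- 1126.047
residual_deviance <- 192.5659
n_obs             <- 31 + 1   # Null D.o.F. + 1

# For a Gaussian model the residual deviance is the sum of squared errors
mse  <- residual_deviance / n_obs
rmse <- sqrt(mse)
r2   <- 1 - residual_deviance / null_deviance

# These should match the reported MSE, RMSE and R^2 up to rounding
c(MSE = mse, RMSE = rmse, R2 = r2)

# Reference fit without regularization, using base R only
mtcars_lm <- lm(mpg ~ wt + cyl, data = mtcars)
coef(mtcars_lm)
summary(mtcars_lm)$r.squared
```

Because the elastic net penalty shrinks the slopes towards zero, a small difference between the two sets of coefficients is expected.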

Closing the connection

All open connections to the cluster (here, only one) are closed with:

spark_disconnect_all()
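
If several connections are open and only one of them should be closed, sparklyr also provides `spark_disconnect()`, which takes a single connection object. A minimal sketch, assuming a working local Spark installation:

```r
library(sparklyr)

# Assumes Spark has been installed locally with spark_install()
sc <- spark_connect(master = "local")

# Close only this connection, leaving any others open
spark_disconnect(sc)

# Should now report that the connection is no longer open
connection_is_open(sc)
```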

Summary

This is a very short overview of installing and using a Spark (cluster) on Windows. For more information on machine learning functionality, see the MLlib help. Sparkling Water for R is documented in its help, too. The h2o package can also be used without Spark, but it appears that the version used with Spark and the stand-alone version might not be compatible (at least today), so they might need to be installed on separate R instances.
