R-ohjelmointi.org
Statistical programming with the R language
A quick note on Spark and R
Apache Spark is a cluster-computing system with a few add-on tools for extra functionality, such as SQL queries and machine learning. Spark has APIs in several languages, one of which is R. There are at least two R packages that implement an interface to Spark: SparkR and sparklyr. Package sparklyr makes installing Hadoop, Spark, and Spark add-on packages rather easy. Hence, let’s walk through the installation and a couple of examples.
Installation
Package sparklyr is developed by RStudio, and the basic installation is outlined on their website. Briefly, the following R script takes care of it:
    # Installing packages
    library(devtools)
    devtools::install_github("rstudio/sparklyr")
    library(sparklyr)

    # This downloads Hadoop + Spark to the computer
    spark_available_versions()
    spark_install(version = "2.1.0", hadoop_version = "2.7")
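If you want to check afterwards which versions ended up on the machine, sparklyr provides a helper for that:

    # List the locally installed Spark/Hadoop versions
    spark_installed_versions()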
On Windows you also need to copy Hadoop’s winutils.exe to the correct place. To get instructions on how to do this, try connecting to Spark:
    # Download and copy the winutils as suggested by the error message:
    sc <- spark_connect(master = "local")
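As a rough sketch of the manual step (the exact target path is an assumption here; use the one printed by spark_connect() on your machine), the idea is to download winutils.exe for the matching Hadoop version and copy it into the Hadoop bin directory under the Spark installation:

    # Sketch only: the paths below are assumptions, not the exact ones
    # spark_connect() will print for your system
    file.copy(from = "C:/Downloads/winutils.exe",
              to   = file.path(spark_install_dir(),
                               "spark-2.1.0-bin-hadoop2.7/tmp/hadoop/bin"))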
Last, you might want to install machine learning libraries. The following command installs rsparkling, an interface to Sparkling Water, which runs the H2O machine learning library in a Spark cluster:
    # rsparkling also installs h2o
    install.packages("rsparkling")
    # install.packages("h2o")
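Depending on the rsparkling release, you may also need to tell it which Sparkling Water version matches the installed Spark; the version string below is an assumption, so check the rsparkling documentation for the right one:

    # Assumed to match Spark 2.1.x; see rsparkling's docs for your setup
    options(rsparkling.sparklingwater.version = "2.1.0")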
Note that the analyses might not work right after the installation; restarting R first may be needed.
Usage
Connecting to Spark cluster
A connection from R to Spark can be established and tested as follows:
    # Open connection
    sc <- spark_connect(master = "local")

    # Test connection
    connection_is_open(sc)
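As a further sanity check, sparklyr can report the version of the Spark instance behind the connection:

    # Version of the connected Spark instance
    spark_version(sc)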
A web interface for monitoring cluster performance can also be opened:
    # Open the web interface
    spark_web(sc)
Copying data to Spark cluster
After the connection has been established, data can be copied to the cluster:
    library(dplyr)
    iris_tbl <- copy_to(sc, iris)
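The returned iris_tbl is a remote dplyr table: verbs applied to it are translated into Spark SQL and run on the cluster, and collect() brings the (small) result back into R. A minimal sketch, assuming the usual name sanitation where copy_to() turns Petal.Length into Petal_Length:

    # Aggregate on the cluster, then pull the summary back into R
    iris_tbl %>%
      group_by(Species) %>%
      summarise(mean_petal_length = mean(Petal_Length)) %>%
      collect()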
Datasets on the cluster can be listed:
    src_tbls(sc)
And datasets can also be queried with SQL:
    # SQL query
    library(DBI)
    iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
    iris_preview
Running analyses on the cluster
Let’s run a simple generalized linear model (GLM) on the mtcars dataset. Libraries rsparkling and h2o need to be loaded first. The dataset is then copied to the cluster and converted to an H2O frame. After that, a GLM predicting mpg with the wt and cyl variables is fitted, and the results are printed to the screen:
    library(rsparkling)
    library(h2o)

    mtcars_tbl <- copy_to(sc, mtcars, "mtcars")
    mtcars_h2o <- as_h2o_frame(sc, mtcars_tbl, strict_version_check = FALSE)

    mtcars_glm <- h2o.glm(x = c("wt", "cyl"),
                          y = "mpg",
                          training_frame = mtcars_h2o,
                          lambda_search = TRUE)
    mtcars_glm
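Once fitted, the model can be used for prediction on an H2O frame, here on the training data itself:

    # Predict on the training frame and inspect the first rows
    mtcars_pred <- h2o.predict(mtcars_glm, newdata = mtcars_h2o)
    head(mtcars_pred)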
The model details are as follows:
    Model Details:
    ==============
    H2ORegressionModel: glm
    Model ID:  GLM_model_R_1495887526376_1

    GLM Model: summary
        family     link                              regularization
    1 gaussian identity Elastic Net (alpha = 0.5, lambda = 0.1013 )
                                                                    lambda_search
    1 nlambda = 100, lambda.max = 10.132, lambda.min = 0.1013, lambda.1se = -1.0
      number_of_predictors_total number_of_active_predictors number_of_iterations
    1                          2                           2                    0
                                     training_frame
    1 frame_rdd_29_acf580c07eb0ec1f9b3ff42292864df0

    Coefficients: glm coefficients
          names coefficients standardized_coefficients
    1 Intercept    38.941654                 20.090625
    2       cyl    -1.468783                 -2.623132
    3        wt    -3.034558                 -2.969186

    H2ORegressionMetrics: glm
    ** Reported on training data. **

    MSE:  6.017684
    RMSE:  2.453097
    MAE:  1.940985
    RMSLE:  0.1114801
    Mean Residual Deviance :  6.017684
    R^2 :  0.8289895
    Null Deviance :1126.047
    Null D.o.F. :31
    Residual Deviance :192.5659
    Residual D.o.F. :29
    AIC :156.2425
Closing the connection
Closing the (only) connection to the cluster is accomplished with:
    spark_disconnect_all()
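A single connection can also be closed by its handle:

    # Close just this one connection
    spark_disconnect(sc)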
Summary
This is a very short overview of the installation and usage of Spark (a cluster) on Windows. For more information on the machine learning functionality, see the MLlib help. Sparkling Water for R is documented in its help, too. Package h2o can also be used without Spark, but it would appear that the version used with Spark and the stand-alone version might not be compatible (at least today), and they might need to be installed in separate R instances.