R-ohjelmointi.org

Statistical programming with the R language

Julia in a simple test

Earlier I wrote a short note describing the same statistical analysis done in R and Python, but I left Julia out. Since Julia keeps appearing on the blogs I follow, I finally thought it was time to rectify the situation. Even more so since I would welcome a huge speed boost in the execution time of the analyses I'm currently running. But let's first test Julia in a simple setting: running an ANOVA model on the Easter grass experiment I described in the previous post.

Julia is a "high-level, high-performance … language for technical computing". According to the benchmark results available on the Julia homepage, Julia comes very close to the speed of C, which sounds very impressive. There are currently close to 1,000 add-on packages for Julia, including packages for statistics and machine learning, which are also collected separately at JuliaStats.org.

The current release (0.4.5) is well documented at http://docs.julialang.org/en/release-0.4/, and the manual is very much worth reading. It also documents how to manage packages in Julia: you can first check which packages are available using the command Pkg.available(), and then install them using the command Pkg.add(). For example, Pkg.add("GLM") installs the GLM package.
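
As a concrete example, the packages used later in this post could be installed along these lines (a minimal sketch; the package names are simply the ones loaded in the Julia code further below):

# List the registered packages
Pkg.available()

# Install the packages used in this post
Pkg.add("DataFrames")
Pkg.add("GLM")
Pkg.add("Gadfly")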

But enough of the introduction; let's see how Julia works in this case.

Analysis

The aim of the analysis was to test for the possible effects of the bed and the watering scheme on the growth of the Easter grass. The response variable is the measured median length of the Easter grass at day 7 after sowing. The explanatory variables are the bed (mould or paper) and the treatment (just water or nutrient solution).

The analysis can be carried out in R with the following code:

# Loads the plotting library
library(ggplot2)
 
# Loads the data
d <- read.table("C:/Users/Acer/Desktop/rairuoho.txt", header=TRUE, sep="\t")
 
# ANOVA model
fit1 = glm(day7~bed+treatment, data=d)
summary(fit1)
 
# Results:
#Call:
#glm(formula = day7 ~ bed + treatment, data = d)
#
#Deviance Residuals: 
#     Min        1Q    Median        3Q       Max  
#-25.7292   -6.9375    0.7708    6.2708   26.2708  
#
#Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
#(Intercept)     105.229      3.543  29.699  < 2e-16 ***
#bedmould2         4.333      4.482   0.967 0.339017    
#bedpaper4       -19.500      4.482  -4.351  8.2e-05 ***
#bedpaper8        -7.750      4.482  -1.729 0.090947 .  
#treatmentwater  -11.292      3.169  -3.563 0.000912 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#(Dispersion parameter for gaussian family taken to be 120.5208)
#
#    Null deviance: 10646.0  on 47  degrees of freedom
#Residual deviance:  5182.4  on 43  degrees of freedom
#AIC: 372.95
#
#Number of Fisher Scoring iterations: 2
 
# Scatterplot
ggplot(d, aes(x=day6, y=day7)) +  geom_point(shape=1)

A similar analysis in Julia requires about the same amount of code:

# Loads the libraries
using DataFrames
using GLM
using Gadfly

# Read the dataset
d = readtable("C:/Users/Acer/Desktop/rairuoho.txt", header=true, separator='\t')

# Convert to 'factors'
pool!(d)

# Fit the model
fit1 = glm(day7~bed+treatment, d, Normal(), IdentityLink())

#Formula: day7 ~ 1 + bed + treatment
#
#Coefficients:
#                   Estimate Std.Error  z value Pr(>|z|)
#(Intercept)         105.229    3.5432  29.6989   <1e-99
#bed - mould2        4.33333   4.48183 0.966867   0.3336
#bed - paper4          -19.5   4.48183  -4.3509    <1e-4
#bed - paper8          -7.75   4.48183  -1.7292   0.0838
#treatment - water  -11.2917   3.16913 -3.56301   0.0004

# Plot the results
plot(d, x="day6", y="day7", Geom.point)
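
To double-check that the two fits agree beyond eyeballing the printed tables, the estimated coefficients and residual deviance can also be pulled out of the fitted object; a minimal sketch on the Julia side, assuming the coef and deviance accessors of the GLM/StatsBase interface and the fit1 object created above:

# Estimated coefficients and residual deviance of the Julia fit,
# for a numerical comparison against the R output above
coef(fit1)
deviance(fit1)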

Both programs report essentially the same results. What about the running time, then?

Running time

The analysis above is so simple that the actual running time of the regression is very short, at most a fraction of a second in either R or Julia. The speed comparison is still of interest, because for such a short analysis the startup time of the application becomes a significant part of the total time.

I timed both R and Julia from starting the program to the appearance of the plot. Both were tested three times. The average time for R from startup to completion was 5 seconds. For Julia the average was 130 seconds: starting up Julia took almost 45 seconds, and producing the plot easily took another 30 seconds. These times are roughly comparable to Python's, or just slightly longer.
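
The numbers above are wall-clock times measured by hand, from launching the program to the plot appearing on screen. For completeness, the time spent on the individual steps inside an already running Julia session could be measured with the built-in @time macro; a minimal sketch, assuming the session from the Julia listing above:

# Time the model fit and the plot call separately
# (the first call of each includes compilation time)
@time fit1 = glm(day7~bed+treatment, d, Normal(), IdentityLink())
@time plot(d, x="day6", y="day7", Geom.point)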

I'm sure Julia would beat R in longer jobs, but in this case both the load time and the plot initialization time took such a heavy chunk of the total time that Julia appeared much slower than R. Exactly the same can be said of Python, because it also takes a long time just to boot up.

Summary

Both R and Julia are fully capable of fitting ANOVA models, and the results are equivalent. For short analyses, R may well be faster than Julia, because Julia takes much longer to start up and to produce a plot.