I think the obvious answer here is the correct one. Matrix multiplication is not embarrassingly parallel. And you do not appear to have modified the serial multiplication code to parallelize it.
Instead, you are running two separate matrix multiplications. Since each multiplication is likely being handled by a single core, every core in excess of two is simply idle overhead. The result is that you only see a speed improvement of 2x.
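You could test this by running more than two matrix multiplications, but I'm not familiar with the foreach/doParallel framework (I use the parallel package), nor do I see where in your code to modify this to test it. An alternative test, used in the rest of this answer, is to parallelize the matrix multiplication itself with Rdsm.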
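The first kind of test, several independent multiplications spread over workers, could be sketched with the parallel package roughly as follows. This is not your foreach setup; the helper names `mult_once` and `time_independent_mults`, the matrix size, and the worker count are all assumptions made for illustration.

```r
library(parallel)

# Hypothetical worker task: build two n x n matrices and multiply them.
# The matrices are created on the worker so no large objects are shipped.
mult_once <- function(i, n) {
  a <- matrix(runif(n * n), n, n)
  b <- matrix(runif(n * n), n, n)
  sum(a %*% b)  # return a single number, not the whole product
}

# Hypothetical helper: elapsed seconds for k independent multiplications
time_independent_mults <- function(cl, k, n = 300) {
  system.time(parLapply(cl, seq_len(k), mult_once, n = n))[["elapsed"]]
}

cl <- makeCluster(2)  # assumed worker count
ks <- c(1, 2, 4)
elapsed <- sapply(ks, function(k) time_independent_mults(cl, k))
names(elapsed) <- paste0("k", ks)
print(elapsed)
stopCluster(cl)
```

If each multiplication really occupies one core, the time for one and for two multiplications should be similar on two workers, and four multiplications should take roughly twice as long. Timings will differ if R is linked against a multithreaded BLAS.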
mmulthread <- function(u, v, w) {
  require(parallel)
  # determine which rows this thread handles
  # (myinfo is set up on each worker by Rdsm)
  myidxs <- splitIndices(nrow(u), myinfo$nwrkrs)[[myinfo$id]]
  # compute this thread's portion of the result
  w[myidxs, ] <- u[myidxs, ] %*% v[, ]
  0  # don't return the result -- expensive
}

# test on snow cluster cls
test <- function(cls, n = 2^5) {
  # init Rdsm
  mgrinit(cls)
  # shared variables
  mgrmakevar(cls, "a", n, n)
  mgrmakevar(cls, "b", n, n)
  mgrmakevar(cls, "c", n, n)
  # fill in some test data
  a[, ] <- 1:n
  b[, ] <- rep(1, n)
  # export function
  clusterExport(cls, "mmulthread")
  # run function
  clusterEvalQ(cls, mmulthread(a, b, c))
  # print(c[, ])  # not print(c)!
}
library(parallel)
library(Rdsm)

c1 <- makeCluster(1)
c2 <- makeCluster(2)
c4 <- makeCluster(4)
c8 <- makeCluster(8)

library(microbenchmark)
microbenchmark(node1 = test(c1, n = 2^10),
               node2 = test(c2, n = 2^10),
               node4 = test(c4, n = 2^10),
               node8 = test(c8, n = 2^10))
Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval cld
 node1 715.8722 780.9861 818.0487 817.6826 847.5353 922.9746   100   d
 node2 404.9928 422.9330 450.9016 437.5942 458.9213 589.1708   100   c
 node4 255.3105 285.8409 309.5924 303.6403 320.8424 481.6833   100   a
 node8 304.6386 328.6318 365.5114 343.0939 373.8573 836.2771   100   b
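As expected, parallelizing the matrix multiplication itself does give the speed improvement we wanted: the median time falls from about 818 ms on one worker to about 304 ms on four. Parallel overhead is clearly substantial, though, since eight workers are slower than four.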
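The overhead can be put in numbers using the medians from the benchmark table. This is just arithmetic on the reported values; "efficiency" here means speedup divided by worker count, which is my choice of summary rather than something microbenchmark reports.

```r
workers   <- c(1, 2, 4, 8)
# median timings in milliseconds, copied from the benchmark output
median_ms <- c(817.6826, 437.5942, 303.6403, 343.0939)

speedup    <- median_ms[1] / median_ms  # relative to a single worker
efficiency <- speedup / workers         # 1.0 would be perfect scaling

print(round(data.frame(workers, median_ms, speedup, efficiency), 2))
```

The best speedup is at four workers, and efficiency drops steadily as workers are added.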
The mmulthread and test functions are borrowed directly from Matloff's Parallel Computing for Data Science; see page 27 of the draft.
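What each worker does in mmulthread can be illustrated without Rdsm or a cluster. The sketch below runs the row blocks one after another in a single R process and checks that they assemble into the full product. The matrix size and the worker count of 3 are assumptions chosen only to keep the example small.

```r
library(parallel)  # only needed for splitIndices()

n <- 8
u <- matrix(as.numeric(1:n), n, n)  # same fill pattern as a[, ] <- 1:n
v <- matrix(1, n, n)                # same as b[, ] <- rep(1, n)
nworkers <- 3                       # assumed worker count, for illustration

w <- matrix(0, n, n)
blocks <- splitIndices(nrow(u), nworkers)  # disjoint row sets, one per worker
for (idx in blocks) {
  # each "worker" fills in only its own rows of the result
  w[idx, ] <- u[idx, , drop = FALSE] %*% v
}

# the assembled blocks match the ordinary serial product
stopifnot(isTRUE(all.equal(w, u %*% v)))
```

Because the row blocks do not overlap, the workers never write to the same part of the result, which is why the shared-memory version needs no locking.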