How to efficiently calculate a sum of arrays with numpy and @parallel decorator?
Hello!
I have an algorithm to process a huge array by chunks. Each processing operation results in a matrix of size N*N, I need to calculate a sum of these matrices. For simplicity assume processing function does almost nothing and requires no input - just returns zeros. In that case working example looks like this:
import datetime
import numpy as np
import time
N = 1024 * 2
K = 256
def f():
return np.ones((N, N), dtype=np.complex128)
buffer = np.zeros((N, N), dtype=np.complex128)
start_time = datetime.datetime.now()
for i in range(K):
buffer += f()
print 'Elapsed time:', (datetime.datetime.now() - start_time)
Execution takes about 5 seconds on my PC. Now, as function f becomes more complex, I would like to run in parallel, so I modify code as follows:
import datetime
import numpy as np
N = 1024 * 2
K = 256
@parallel
def f(_):
return np.ones((N, N), dtype=np.complex128)
start_time = datetime.datetime.now()
for o in f(range(K)):
buffer += o[1]
print 'Elapsed time:', (datetime.datetime.now() - start_time)
And now it takes about 26 seconds to calculate! What am I doing wrong? Or what causes such a huge overhead? (it looks silly for if the cost of collecting the result of f() across parallel processes is more than calculating one iteration of f() itself, I better run f() without parallelism at all)
Inter-process communication has a cost. I don't think @parallel uses shared memory for communication (and even then there would be extra copying to/from the communication buffers), so with a function that spends all its time on allocating the result (the filling with ones is negligible) I am not surprised that the IPC is more expensive than executing the function itself.
For @parallel to work well you should make sure that the input and the output of the function are small compared to the amount of work done inside the function.
Thanks for the note about shared memory! I think that is the case here, I found a workaround with sharedmem and multiprocessing modules, gonna post it as an answer.