# How to efficiently calculate a sum of arrays with numpy and @parallel decorator?

Hello!

I have an algorithm to process a huge array by chunks. Each processing operation results in a matrix of size N*N, I need to calculate a sum of these matrices. For simplicity assume processing function does almost nothing and requires no input - just returns zeros. In that case working example looks like this:

import datetime
import numpy as np
import time

N = 1024 * 2
K = 256

def f():
return np.ones((N, N),  dtype=np.complex128)

buffer = np.zeros((N, N),  dtype=np.complex128)

start_time = datetime.datetime.now()

for i in range(K):
buffer += f()

print 'Elapsed time:', (datetime.datetime.now() - start_time)


Execution takes about 5 seconds on my PC. Now, as function f becomes more complex, I would like to run in parallel, so I modify code as follows:

import datetime
import numpy as np

N = 1024 * 2
K = 256

@parallel
def f(_):
return np.ones((N, N),  dtype=np.complex128)

start_time = datetime.datetime.now()

for o in f(range(K)):
buffer += o

print 'Elapsed time:', (datetime.datetime.now() - start_time)


And now it takes about 26 seconds to calculate! What am I doing wrong? Or what causes such a huge overhead? (it looks silly for if the cost of collecting the result of f() across parallel processes is more than calculating one iteration of f() itself, I better run f() without parallelism at all)

edit retag close merge delete

Inter-process communication has a cost. I don't think @parallel uses shared memory for communication (and even then there would be extra copying to/from the communication buffers), so with a function that spends all its time on allocating the result (the filling with ones is negligible) I am not surprised that the IPC is more expensive than executing the function itself.

For @parallel to work well you should make sure that the input and the output of the function are small compared to the amount of work done inside the function.

Thanks for the note about shared memory! I think that is the case here, I found a workaround with sharedmem and multiprocessing modules, gonna post it as an answer.

Sort by » oldest newest most voted

Workaround: use sharedmemory and multiprocessing packages (sharedmemory is available via pip). I tried the following code within a separate .py file:

import sharedmem
import multiprocessing as mp
import numpy as np

N = 1024 * 2
K = 256

_lock = mp.Lock()

def f(_shared_buffer):
partial_result = np.ones((N, N), dtype=np.complex128)
_lock.acquire()
_shared_buffer += partial_result
_lock.release()

def test():
shared_buffer = sharedmem.empty((N, N), dtype=np.complex128)
pool = mp.Pool(4)
pool.map(f, [shared_buffer for _ in range(K)])
return shared_buffer


• Executes as fast as the original code with no parallelism (about 5 seconds on my PC)

• Requires usage of the additional package sharedmem (although this package is available via pip)
• I was not able to pass _lock object to f() from within Sage's Notebook and have to put the code in a separate file, than import and only then execute
• The buffer from np.ndarray becomes sharedmem.sharedmem.anonymousmemmap (not sure if that is a disadvantage, I guess one can convert it back to numpy's type pretty fast)
more