# How to efficiently calculate a sum of arrays with numpy and @parallel decorator?

Hello!

I have an algorithm to process a huge array by chunks. Each processing operation results in a matrix of size N*N, I need to calculate a sum of these matrices. For simplicity assume processing function does almost nothing and requires no input - just returns zeros. In that case working example looks like this:

import datetime
import numpy as np
import time

N = 1024 * 2
K = 256

def f():
return np.ones((N, N),  dtype=np.complex128)

buffer = np.zeros((N, N),  dtype=np.complex128)

start_time = datetime.datetime.now()

for i in range(K):
buffer += f()

print 'Elapsed time:', (datetime.datetime.now() - start_time)


Execution takes about 5 seconds on my PC. Now, as function f becomes more complex, I would like to run in parallel, so I modify code as follows:

import datetime
import numpy as np

N = 1024 * 2
K = 256

@parallel
def f(_):
return np.ones((N, N),  dtype=np.complex128)

start_time = datetime.datetime.now()

for o in f(range(K)):
buffer += o[1]

print 'Elapsed time:', (datetime.datetime.now() - start_time)


And now it takes about 26 seconds to calculate! What am I doing wrong? Or what causes such a huge overhead? (it looks silly for if the cost of collecting the result of f() across parallel processes is more than calculating one iteration of f() itself, I better run f() without parallelism at all)

edit retag close merge delete

Inter-process communication has a cost. I don't think @parallel uses shared memory for communication (and even then there would be extra copying to/from the communication buffers), so with a function that spends all its time on allocating the result (the filling with ones is negligible) I am not surprised that the IPC is more expensive than executing the function itself.

For @parallel to work well you should make sure that the input and the output of the function are small compared to the amount of work done inside the function.

( 2017-02-03 02:30:33 -0600 )edit

Thanks for the note about shared memory! I think that is the case here, I found a workaround with sharedmem and multiprocessing modules, gonna post it as an answer.

( 2017-02-04 14:03:46 -0600 )edit

Sort by » oldest newest most voted

Workaround: use sharedmemory and multiprocessing packages (sharedmemory is available via pip). I tried the following code within a separate .py file:

import sharedmem
import multiprocessing as mp
import numpy as np

N = 1024 * 2
K = 256

_lock = mp.Lock()

def f(_shared_buffer):
partial_result = np.ones((N, N), dtype=np.complex128)
_lock.acquire()
_shared_buffer += partial_result
_lock.release()

def test():
shared_buffer = sharedmem.empty((N, N), dtype=np.complex128)
pool = mp.Pool(4)
pool.map(f, [shared_buffer for _ in range(K)])
return shared_buffer


• Executes as fast as the original code with no parallelism (about 5 seconds on my PC)

• Requires usage of the additional package sharedmem (although this package is available via pip)
• I was not able to pass _lock object to f() from within Sage's Notebook and have to put the code in a separate file, than import and only then execute
• The buffer from np.ndarray becomes sharedmem.sharedmem.anonymousmemmap (not sure if that is a disadvantage, I guess one can convert it back to numpy's type pretty fast)
more