How to efficiently calculate a sum of arrays with numpy and @parallel decorator?

asked 2017-02-02 10:27:41 +0200

Eugene
623 ●20 ●36 ●48

Hello!

I have an algorithm to process a huge array by chunks. Each processing operation results in a matrix of size N*N, I need to calculate a sum of these matrices. For simplicity assume processing function does almost nothing and requires no input - just returns zeros. In that case working example looks like this:

import datetime
import numpy as np
import time

N = 1024 * 2
K = 256

def f():
    return np.ones((N, N),  dtype=np.complex128)

buffer = np.zeros((N, N),  dtype=np.complex128)

start_time = datetime.datetime.now()

for i in range(K):
    buffer += f()

print 'Elapsed time:', (datetime.datetime.now() - start_time)

Execution takes about 5 seconds on my PC. Now, as function f becomes more complex, I would like to run in parallel, so I modify code as follows:

import datetime
import numpy as np

N = 1024 * 2
K = 256

@parallel
def f(_):
    return np.ones((N, N),  dtype=np.complex128)

start_time = datetime.datetime.now()

for o in f(range(K)):
    buffer += o[1]

print 'Elapsed time:', (datetime.datetime.now() - start_time)

And now it takes about 26 seconds to calculate! What am I doing wrong? Or what causes such a huge overhead? (it looks silly for if the cost of collecting the result of f() across parallel processes is more than calculating one iteration of f() itself, I better run f() without parallelism at all)

edit retag flag offensive close merge delete

Comments

Inter-process communication has a cost. I don't think @parallel uses shared memory for communication (and even then there would be extra copying to/from the communication buffers), so with a function that spends all its time on allocating the result (the filling with ones is negligible) I am not surprised that the IPC is more expensive than executing the function itself.

For @parallel to work well you should make sure that the input and the output of the function are small compared to the amount of work done inside the function.

nbruin ( 2017-02-03 09:30:33 +0200 )edit

Thanks for the note about shared memory! I think that is the case here, I found a workaround with sharedmem and multiprocessing modules, gonna post it as an answer.

Eugene ( 2017-02-04 21:03:46 +0200 )edit

add a comment

import sharedmem import multiprocessing as mp import numpy as np N = 1024 * 2 K = 256 _lock = mp.Lock() def f(_shared_buffer): partial_result = np.ones((N, N), dtype=np.complex128) _lock.acquire() _shared_buffer += partial_result _lock.release() def test(): shared_buffer = sharedmem.empty((N, N), dtype=np.complex128) pool = mp.Pool(4) pool.map(f, [shared_buffer for _ in range(K)]) return shared_buffer

Your Answer

Please start posting anonymously - your entry will be published after you log in or create a new account.

How to efficiently calculate a sum of arrays with numpy and @parallel decorator?

Comments

1 Answer

Your Answer

Question Tools

Stats

Related questions

How to efficiently calculate a sum of arrays with numpy and @parallel decorator? edit

Comments

1 Answer

Your Answer

Question Tools

Stats

Related questions

How to efficiently calculate a sum of arrays with numpy and @parallel decorator?