
How to efficiently calculate a sum of arrays with numpy and the @parallel decorator?

asked 2017-02-02 10:27:41 +0200 by Eugene

Hello!

I have an algorithm that processes a huge array in chunks. Each processing operation produces an N*N matrix, and I need to calculate the sum of these matrices. For simplicity, assume the processing function does almost nothing and requires no input - it just returns a matrix of ones. In that case a working example looks like this:

import datetime
import numpy as np

N = 1024 * 2
K = 256

def f():
    # Stand-in for the real processing: just return a constant N*N matrix
    return np.ones((N, N), dtype=np.complex128)

buffer = np.zeros((N, N), dtype=np.complex128)

start_time = datetime.datetime.now()

# Accumulate K matrices serially
for i in range(K):
    buffer += f()

print 'Elapsed time:', (datetime.datetime.now() - start_time)

Execution takes about 5 seconds on my PC. Now, as the function f becomes more complex, I would like to run it in parallel, so I modify the code as follows:

import datetime
import numpy as np

N = 1024 * 2
K = 256

@parallel
def f(_):
    # Same stand-in computation, now run via Sage's @parallel decorator
    return np.ones((N, N), dtype=np.complex128)

buffer = np.zeros((N, N), dtype=np.complex128)

start_time = datetime.datetime.now()

# @parallel yields ((args, kwargs), value) pairs; o[1] is the returned matrix
for o in f(range(K)):
    buffer += o[1]

print 'Elapsed time:', (datetime.datetime.now() - start_time)

And now it takes about 26 seconds to calculate! What am I doing wrong? Or what causes such a huge overhead? (It seems pointless: if collecting the results of f() from the parallel processes costs more than computing one iteration of f() itself, I might as well run f() without parallelism at all.)


Comments

Inter-process communication has a cost. I don't think @parallel uses shared memory for communication (and even then there would be extra copying to/from the communication buffers), so with a function that spends all its time on allocating the result (the filling with ones is negligible) I am not surprised that the IPC is more expensive than executing the function itself.

For @parallel to work well you should make sure that the input and the output of the function are small compared to the amount of work done inside the function.
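
For example, here is a rough, untested sketch of that idea (the partial_sum function and the CHUNKS count are just illustrative): let each parallel call process a whole block of iterations and return a single partial sum, so one N*N result is amortized over many units of work.

import numpy as np

N = 1024 * 2
K = 256
CHUNKS = 4  # roughly one chunk per worker process

@parallel
def partial_sum(chunk):
    # Do K/CHUNKS units of work per call, but return only one N*N matrix
    acc = np.zeros((N, N), dtype=np.complex128)
    for _ in range(chunk, K, CHUNKS):
        acc += np.ones((N, N), dtype=np.complex128)  # stand-in for the real work
    return acc

buffer = np.zeros((N, N), dtype=np.complex128)
for o in partial_sum(range(CHUNKS)):  # only CHUNKS results cross process boundaries
    buffer += o[1]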

nbruin ( 2017-02-03 09:30:33 +0200 )

Thanks for the note about shared memory! I think that is the case here. I found a workaround with the sharedmem and multiprocessing modules; I'll post it as an answer.

Eugene ( 2017-02-04 21:03:46 +0200 )

1 Answer


answered 2017-02-04 21:14:26 +0200 by Eugene

Workaround: use the sharedmem and multiprocessing packages (sharedmem is available via pip). I put the following code in a separate .py file:

import sharedmem
import multiprocessing as mp
import numpy as np

N = 1024 * 2
K = 256

# Lock guarding the shared buffer; the worker processes inherit it when
# the pool forks, so it does not need to be passed as an argument.
_lock = mp.Lock()


def f(_shared_buffer):
    # Stand-in for the real processing step
    partial_result = np.ones((N, N), dtype=np.complex128)
    # Accumulate directly into shared memory, one writer at a time
    _lock.acquire()
    _shared_buffer += partial_result
    _lock.release()


def test():
    # The accumulation buffer lives in shared memory, visible to all workers
    shared_buffer = sharedmem.empty((N, N), dtype=np.complex128)
    pool = mp.Pool(4)
    pool.map(f, [shared_buffer for _ in range(K)])
    pool.close()
    pool.join()
    return shared_buffer
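
Assuming the file is saved as, say, parallel_sum.py (the file name here is just an example), it can then be imported and run from Sage:

from parallel_sum import test  # hypothetical module name for the file above

buffer = test()  # N*N complex matrix accumulated by 4 worker processes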

Advantages of this workaround:

  • Executes as fast as the original code with no parallelism (about 5 seconds on my PC)

Disadvantages:

  • Requires the additional sharedmem package (although it is available via pip)
  • I was not able to pass the _lock object to f() from within the Sage Notebook, so I had to put the code in a separate file, import it, and only then execute it
  • The buffer is no longer an np.ndarray but a sharedmem.sharedmem.anonymousmemmap (not sure if that is really a disadvantage; I guess one can convert it back to numpy's type pretty fast - see the sketch after this list)
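
For the last point, converting back should indeed be cheap; something like this (untested) gives an ordinary array again:

result = test()
plain = np.array(result)  # copies the shared buffer into a regular np.ndarray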
