ASKSAGE: Sage Q&A Forum - RSS feedhttps://ask.sagemath.org/questions/Q&A Forum for SageenCopyright Sage, 2010. Some rights reserved under creative commons license.Sat, 04 Feb 2017 21:14:26 +0100How to efficiently calculate a sum of arrays with numpy and @parallel decorator?https://ask.sagemath.org/question/36437/how-to-efficiently-calculate-a-sum-of-arrays-with-numpy-and-parallel-decorator/ Hello!
I have an algorithm to process a huge array by chunks. Each processing operation results in a matrix of size N*N, I need to calculate a sum of these matrices. For simplicity assume processing function does almost nothing and requires no input - just returns zeros. In that case working example looks like this:
import datetime
import numpy as np
import time
N = 1024 * 2
K = 256
def f():
return np.ones((N, N), dtype=np.complex128)
buffer = np.zeros((N, N), dtype=np.complex128)
start_time = datetime.datetime.now()
for i in range(K):
buffer += f()
print 'Elapsed time:', (datetime.datetime.now() - start_time)
Execution takes about 5 seconds on my PC. Now, as function f becomes more complex, I would like to run in parallel, so I modify code as follows:
import datetime
import numpy as np
N = 1024 * 2
K = 256
@parallel
def f(_):
return np.ones((N, N), dtype=np.complex128)
start_time = datetime.datetime.now()
for o in f(range(K)):
buffer += o[1]
print 'Elapsed time:', (datetime.datetime.now() - start_time)
And now it takes about 26 seconds to calculate! What am I doing wrong? Or what causes such a huge overhead? (it looks silly for if the cost of collecting the result of f() across parallel processes is more than calculating one iteration of f() itself, I better run f() without parallelism at all)
Thu, 02 Feb 2017 10:27:41 +0100https://ask.sagemath.org/question/36437/how-to-efficiently-calculate-a-sum-of-arrays-with-numpy-and-parallel-decorator/Comment by nbruin for <p>Hello!</p>
<p>I have an algorithm to process a huge array by chunks. Each processing operation results in a matrix of size N*N, I need to calculate a sum of these matrices. For simplicity assume processing function does almost nothing and requires no input - just returns zeros. In that case working example looks like this:</p>
<pre><code>import datetime
import numpy as np
import time
N = 1024 * 2
K = 256
def f():
return np.ones((N, N), dtype=np.complex128)
buffer = np.zeros((N, N), dtype=np.complex128)
start_time = datetime.datetime.now()
for i in range(K):
buffer += f()
print 'Elapsed time:', (datetime.datetime.now() - start_time)
</code></pre>
<p>Execution takes about 5 seconds on my PC. Now, as function f becomes more complex, I would like to run in parallel, so I modify code as follows:</p>
<pre><code>import datetime
import numpy as np
N = 1024 * 2
K = 256
@parallel
def f(_):
return np.ones((N, N), dtype=np.complex128)
start_time = datetime.datetime.now()
for o in f(range(K)):
buffer += o[1]
print 'Elapsed time:', (datetime.datetime.now() - start_time)
</code></pre>
<p>And now it takes about 26 seconds to calculate! What am I doing wrong? Or what causes such a huge overhead? (it looks silly for if the cost of collecting the result of f() across parallel processes is more than calculating one iteration of f() itself, I better run f() without parallelism at all)</p>
https://ask.sagemath.org/question/36437/how-to-efficiently-calculate-a-sum-of-arrays-with-numpy-and-parallel-decorator/?comment=36447#post-id-36447Inter-process communication has a cost. I don't think @parallel uses shared memory for communication (and even then there would be extra copying to/from the communication buffers), so with a function that spends all its time on allocating the result (the filling with ones is negligible) I am not surprised that the IPC is more expensive than executing the function itself.
For @parallel to work well you should make sure that the input and the output of the function are small compared to the amount of work done inside the function.Fri, 03 Feb 2017 09:30:33 +0100https://ask.sagemath.org/question/36437/how-to-efficiently-calculate-a-sum-of-arrays-with-numpy-and-parallel-decorator/?comment=36447#post-id-36447Comment by Eugene for <p>Hello!</p>
<p>I have an algorithm to process a huge array by chunks. Each processing operation results in a matrix of size N*N, I need to calculate a sum of these matrices. For simplicity assume processing function does almost nothing and requires no input - just returns zeros. In that case working example looks like this:</p>
<pre><code>import datetime
import numpy as np
import time
N = 1024 * 2
K = 256
def f():
return np.ones((N, N), dtype=np.complex128)
buffer = np.zeros((N, N), dtype=np.complex128)
start_time = datetime.datetime.now()
for i in range(K):
buffer += f()
print 'Elapsed time:', (datetime.datetime.now() - start_time)
</code></pre>
<p>Execution takes about 5 seconds on my PC. Now, as function f becomes more complex, I would like to run in parallel, so I modify code as follows:</p>
<pre><code>import datetime
import numpy as np
N = 1024 * 2
K = 256
@parallel
def f(_):
return np.ones((N, N), dtype=np.complex128)
start_time = datetime.datetime.now()
for o in f(range(K)):
buffer += o[1]
print 'Elapsed time:', (datetime.datetime.now() - start_time)
</code></pre>
<p>And now it takes about 26 seconds to calculate! What am I doing wrong? Or what causes such a huge overhead? (it looks silly for if the cost of collecting the result of f() across parallel processes is more than calculating one iteration of f() itself, I better run f() without parallelism at all)</p>
https://ask.sagemath.org/question/36437/how-to-efficiently-calculate-a-sum-of-arrays-with-numpy-and-parallel-decorator/?comment=36453#post-id-36453Thanks for the note about shared memory! I think that is the case here, I found a workaround with sharedmem and multiprocessing modules, gonna post it as an answer.Sat, 04 Feb 2017 21:03:46 +0100https://ask.sagemath.org/question/36437/how-to-efficiently-calculate-a-sum-of-arrays-with-numpy-and-parallel-decorator/?comment=36453#post-id-36453Answer by Eugene for <p>Hello!</p>
<p>I have an algorithm to process a huge array by chunks. Each processing operation results in a matrix of size N*N, I need to calculate a sum of these matrices. For simplicity assume processing function does almost nothing and requires no input - just returns zeros. In that case working example looks like this:</p>
<pre><code>import datetime
import numpy as np
import time
N = 1024 * 2
K = 256
def f():
return np.ones((N, N), dtype=np.complex128)
buffer = np.zeros((N, N), dtype=np.complex128)
start_time = datetime.datetime.now()
for i in range(K):
buffer += f()
print 'Elapsed time:', (datetime.datetime.now() - start_time)
</code></pre>
<p>Execution takes about 5 seconds on my PC. Now, as function f becomes more complex, I would like to run in parallel, so I modify code as follows:</p>
<pre><code>import datetime
import numpy as np
N = 1024 * 2
K = 256
@parallel
def f(_):
return np.ones((N, N), dtype=np.complex128)
start_time = datetime.datetime.now()
for o in f(range(K)):
buffer += o[1]
print 'Elapsed time:', (datetime.datetime.now() - start_time)
</code></pre>
<p>And now it takes about 26 seconds to calculate! What am I doing wrong? Or what causes such a huge overhead? (it looks silly for if the cost of collecting the result of f() across parallel processes is more than calculating one iteration of f() itself, I better run f() without parallelism at all)</p>
https://ask.sagemath.org/question/36437/how-to-efficiently-calculate-a-sum-of-arrays-with-numpy-and-parallel-decorator/?answer=36454#post-id-36454Workaround: use sharedmemory and multiprocessing packages (sharedmemory is available via pip). I tried the following code within a separate .py file:
import sharedmem
import multiprocessing as mp
import numpy as np
N = 1024 * 2
K = 256
_lock = mp.Lock()
def f(_shared_buffer):
partial_result = np.ones((N, N), dtype=np.complex128)
_lock.acquire()
_shared_buffer += partial_result
_lock.release()
def test():
shared_buffer = sharedmem.empty((N, N), dtype=np.complex128)
pool = mp.Pool(4)
pool.map(f, [shared_buffer for _ in range(K)])
return shared_buffer
Advantages of this workaround:
- Executes as fast as the original code with no parallelism (about 5 seconds on my PC)
Disadvantages:
- Requires usage of the additional package sharedmem (although this package is available via pip)
- I was not able to pass _lock object to f() from within Sage's Notebook and have to put the code in a separate file, than import and only then execute
- The buffer from np.ndarray becomes sharedmem.sharedmem.anonymousmemmap (not sure if that is a disadvantage, I guess one can convert it back to numpy's type pretty fast)Sat, 04 Feb 2017 21:14:26 +0100https://ask.sagemath.org/question/36437/how-to-efficiently-calculate-a-sum-of-arrays-with-numpy-and-parallel-decorator/?answer=36454#post-id-36454