### How to efficiently calculate a sum of arrays with NumPy and the @parallel decorator?

Hello!

I have an algorithm that processes a huge array in chunks. Each processing step produces a matrix of size N*N, and I need to calculate the sum of these matrices. For simplicity, assume the processing function does almost nothing and takes no input: it just returns a matrix of ones. In that case a working example looks like this:

```
import datetime
import numpy as np

N = 1024 * 2  # each result is an N x N matrix
K = 256       # number of chunks to process

def f():
    # Stand-in for the real processing step
    return np.ones((N, N), dtype=np.complex128)

buffer = np.zeros((N, N), dtype=np.complex128)
start_time = datetime.datetime.now()
for i in range(K):
    buffer += f()  # accumulate the sum in place
print('Elapsed time: %s' % (datetime.datetime.now() - start_time))
```

Execution takes about 5 seconds on my PC. Since the real function f will be more complex, I would like to run it in parallel, so I modified the code as follows:

```
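import datetime
import numpy as np
# @parallel is provided by the environment (e.g. a Sage session),
# so it is not imported here

N = 1024 * 2
K = 256

@parallel
def f(_):
    return np.ones((N, N), dtype=np.complex128)

buffer = np.zeros((N, N), dtype=np.complex128)
start_time = datetime.datetime.now()
for o in f(range(K)):
    buffer += o[1]  # o[1] is the return value of one call to f
print('Elapsed time: %s' % (datetime.datetime.now() - start_time))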
```

Now it takes about 26 seconds to calculate! What am I doing wrong, or what causes such a huge overhead? (It seems silly: if collecting the result of f() from the parallel processes costs more than calculating one iteration of f() itself, I would be better off running f() without parallelism at all.)
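To put a rough number on that suspicion, here is a minimal sketch that computes how much data one result holds and times a single serialize/deserialize round trip. It assumes the worker processes hand their results back to the parent by pickling them, which is my guess and not something I have confirmed. The helper names `result_size_bytes` and `pickle_roundtrip` are made up for this illustration.

```python
import pickle
import time

import numpy as np


def result_size_bytes(n, dtype=np.complex128):
    """Size in bytes of the raw data of one n x n result matrix."""
    return n * n * np.dtype(dtype).itemsize


def pickle_roundtrip(n, dtype=np.complex128):
    """Pickle and unpickle one n x n matrix; return (copy, seconds).

    Assumption: this only approximates whatever transfer mechanism
    the parallel decorator really uses between processes.
    """
    a = np.ones((n, n), dtype=dtype)
    start = time.time()
    payload = pickle.dumps(a, protocol=2)
    b = pickle.loads(payload)
    return b, time.time() - start


if __name__ == "__main__":
    N, K = 2048, 256
    per_result = result_size_bytes(N)
    print("One result: %d MiB" % (per_result // 2**20))
    print("All K results: %d GiB" % (K * per_result // 2**30))
    _, seconds = pickle_roundtrip(N)
    print("One pickle round trip: %.3f s" % seconds)
```

With N = 2048 and complex128 entries, each result is 64 MiB, so K = 256 calls move about 16 GiB between processes in total. If the round-trip time multiplied by K is comparable to the extra 20 seconds or so, the transfer of results would explain most of the overhead.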