
Niles is right. I also ran into this problem recently but found a straightforward way to solve it. I have some sample code below with detailed explanation. First, let's see a simple comparison between serial and parallel execution of a function.

```
@parallel(p_iter='multiprocessing', ncpus=3)
def test_parallel(n):
    f = factor(2^128+n)
    return len(f)

t = walltime()
r = range(1000)
p = sorted(list(test_parallel(r)))
print p[-1]
print walltime(t)

t = walltime()
for i in range(1000):
    f = factor(2^128+i)
print f
print walltime(t)
```

```
(((999,), {}), 5)
6.359593153
5 * 23 * 383 * 1088533 * 7097431848554855065803619703
17.0849101543
```

`test_parallel` is a simple function that takes a nontrivial time to execute, for testing purposes: it returns the number of distinct factors in the factorization of 2^128+n. The argument of `test_parallel` is a list created by the `range` function. Note that this has to be an actual list; there is currently no alternative, so e.g. `xrange` cannot be used in place of `range`, because `xrange` generates numbers on the fly rather than creating the whole list up front. This can become a serious (mainly memory) problem, but it can be overcome, as shown further below.

`test_parallel`, like any `@parallel`-decorated function, returns a special object: an iterator over 2-tuples, and the order of the 2-tuples is entirely random! So the output `(((999,), {}), 5)` above, representing the last item `p[-1]` of the calculation, includes the input value `n=999`, an empty keyword-argument dictionary `{}`, and the return value `5`. Note that in order to parse the output of `test_parallel` conveniently, it should be cast to a sorted list.

In this particular run the parallel calculation (using 3 cores) took about 6 seconds, whereas at the bottom of the sample code the serial equivalent took about 17 seconds (and the result of the last factorization confirms that there were indeed 5 distinct factors).

This is all well, and the developers have my great appreciation. Unfortunately, a serious problem arises when the list argument to a parallel function grows too big. One solution that has worked very well for me combines chunks, as Niles suggested, with numpy arrays. The following code is a more robust and significantly more efficient alternative to the naive parallel code above.

```
%timeit
import numpy as np
sizof_chunk = 10^3
numof_chunk = 10^2
np_array = np.zeros((sizof_chunk*numof_chunk,), dtype=np.uint8)
for i in range(numof_chunk):
    beg = i * sizof_chunk
    end = beg + sizof_chunk
    tuples = sorted(list(test_parallel(range(beg, end))))
    iter = [x[1] for x in tuples]
    np_array[beg:end] = np.fromiter(iter, np.uint8)
print np_array
```

```
[1 2 3 ..., 6 8 3]
CPU time: 13.88 s, Wall time: 670.06 s
```

`sizof_chunk` is set to the same number, 1000, and `numof_chunk` can be set to anything. If it is set to 1, the calculation is exactly the same as above (and takes about 6 seconds), but in this example it is set to 100, meaning that we tackle a problem 100 times larger than before. The code is self-explanatory: it recycles the same size of list (1000) and fills up `np_array` (which is created up front in its full size via `np.zeros`) in consecutive slices `np_array[beg:end]`. The line `iter = [x[1] for x in tuples]` extracts just the returned values from the `test_parallel` output object (incidentally, the name `iter` shadows a Python builtin, so a different name would be safer in larger programs), and `np_array[beg:end] = np.fromiter(iter, np.uint8)` then fills the array slice with these values. Note that with numpy arrays one can choose the data type, which in this example allows for an 8-bit (one-byte) unsigned integer per entry, far less storage than a corresponding Python list needs.
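The storage saving is easy to check. The following is a plain-Python/numpy illustration (not Sage-specific) comparing the footprint of 100000 small integers held in a `uint8` numpy array versus a regular Python list of `int` objects.

```python
import sys
import numpy as np

# 100000 small integers (all below 256, so they fit in uint8)
n = 10**5
values = [i % 251 for i in range(n)]

np_array = np.fromiter(values, np.uint8)

# A Python list stores pointers plus a full int object per element;
# sum both to approximate its total footprint.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

print(np_array.nbytes)   # exactly n bytes: one byte per entry
print(list_bytes)        # more than an order of magnitude larger
```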

Altogether, varying `sizof_chunk`, `numof_chunk`, and of course `ncpus` can help solve a lot of these kinds of problems.
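Outside Sage, the same chunked-fill pattern can be sketched in plain Python. Everything here is a stand-in: `cheap_factor_count` is a hypothetical replacement for `len(factor(2^128+n))`, and a thread pool replaces the `@parallel`/multiprocessing machinery just to keep the sketch self-contained (because of the GIL it will not actually speed up CPU-bound pure-Python work; the chunking structure is what matters). Note that `executor.map` preserves input order, so the sorting step is not needed in this version.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def cheap_factor_count(n):
    """Hypothetical stand-in for len(factor(2^128 + n)): count the
    distinct prime factors of n + 2 by trial division."""
    m, count, d = n + 2, 0, 2
    while d * d <= m:
        if m % d == 0:
            count += 1
            while m % d == 0:
                m //= d
        d += 1
    if m > 1:
        count += 1
    return count

sizof_chunk, numof_chunk = 100, 5
np_array = np.zeros((sizof_chunk * numof_chunk,), dtype=np.uint8)

with ThreadPoolExecutor(max_workers=3) as executor:
    for i in range(numof_chunk):
        beg = i * sizof_chunk
        end = beg + sizof_chunk
        # executor.map preserves input order, so unlike @parallel
        # output no sorting step is required here
        np_array[beg:end] = np.fromiter(
            executor.map(cheap_factor_count, range(beg, end)), np.uint8)

print(np_array[:5])   # distinct-factor counts of 2, 3, 4, 5, 6
```

In Sage itself, the sorted output of `test_parallel` would take the place of `executor.map`, exactly as in the chunked code above.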

Istvan

Copyright Sage, 2010. Some rights reserved under creative commons license. Content on this site is licensed under a Creative Commons Attribution Share Alike 3.0 license.