Ask Your Question

Revision history [back]

Caching with @parallel

So I have a function which I want to run in a parallel manner as I'll need to run the function a few million times. The thing is, I know that quite a few times, I'll have duplicate information. So I want to speed up the process even more by skipping the ones I don't need. Here's what I have so far:

@parallel(ncpus=7)
def parallel_function(initial,allElts,e1,e2):
    untested = set([initial])
    verified = set([])
    potentials = getPotentials(e1,e2)
    while untested:
        curTest = untested.pop()
        verified.add(curTest)
        for e in allElts:
            newElt = someProp(potentials,curTest)
            if newElt not in verified:
                untested.add(newElt)
    return verified

def testElts(allElts, potentials):
    initial = allElts[0]
    returnInfo = {}
    for tup in list(parallel_function([(initial,allElts,e1,e2) for e1 in allElts for e2 in allElts)):
         returnInfo[tup[0][0][2]] = tup[1]

NOTE The above is an exmple. I've tried to dumb it down to try and give a better feel of the functions I'm working with. So consider the above more "pseudo" code than real code in case I missed something

We're assuming that allElts has at least 1 million elements, so this function will take a while. Notice that the parallel_function is a fairly intense function itself, but given any two element, the "getPotentials" function might return similar results. So basically I want to skip my while loop if I've already tested these "potentials" before.

I've tried looking into how caching works in parallel environments, but there doesn't seem to be too much. Would caching in this case be possible with @cache_function? If not, can I do a dictionary that stores the values and access the information that way? Since parallel processing creates forks I assume that means the dictionary information is not necessarily accessible by everyone? Or is that not accurate?

Any help on this would be beneficial. Thanks.

Caching with @parallel

So I have a function which I want to run in a parallel manner as I'll need to run the function a few million times. The thing is, I know that quite a few times, I'll have duplicate information. So I want to speed up the process even more by skipping the ones I don't need. Here's what I have so far:

@parallel(ncpus=7)
def parallel_function(initial,allElts,e1,e2):
    untested = set([initial])
    verified = set([])
    potentials = getPotentials(e1,e2)
    while untested:
        curTest = untested.pop()
        verified.add(curTest)
        for e in allElts:
            newElt = someProp(potentials,curTest)
someProp(potentials,curTest,e)
            if newElt not in verified:
                untested.add(newElt)
    return verified

def testElts(allElts, potentials):
    initial = allElts[0]
    returnInfo = {}
    for tup in list(parallel_function([(initial,allElts,e1,e2) for e1 in allElts for e2 in allElts)):
         returnInfo[tup[0][0][2]] = tup[1]

NOTE The above is an exmple. I've tried to dumb it down to try and give a better feel of the functions I'm working with. So consider the above more "pseudo" code than real code in case I missed something

We're assuming that allElts has at least 1 million elements, so this function will take a while. Notice that the parallel_function is a fairly intense function itself, but given any two element, the "getPotentials" function might return similar results. So basically I want to skip my while loop if I've already tested these "potentials" before.

I've tried looking into how caching works in parallel environments, but there doesn't seem to be too much. Would caching in this case be possible with @cache_function? If not, can I do a dictionary that stores the values and access the information that way? Since parallel processing creates forks I assume that means the dictionary information is not necessarily accessible by everyone? Or is that not accurate?

Any help on this would be beneficial. Thanks.