Parallelizing symbolic matrix*vector computations for sparse matrices and vectors    
   Hi everyone, I am currently running computations which involve a lot of operations of the type matrix*vector, where both objects are sparse and the result can be expected to be sparse as well. The matrices' base field is the symbolic ring. If I drop certain normalizations I could also work over QQ, I think. The above operations appear sequentially. Although there are some options for parallelizing these at least in part, the biggest acceleration would be achieved if the multiplication itself could be parallelized. Is there a way to do this in Sage?
Greetings and thank you very much.
Edit: Work in progress
Below I put some code that I produced so far which is very far from a solution but produces new slow-downs that I do not understand. The strategy is to save a sparse matrix by its nonzero rows which is facilitated by the following code snippet:
def default_to_row(M):
rows_of_M=M.rows()
relevant=[]
for r in range(len(rows_of_M)):
    row=rows_of_M[r]
    if dot_sparse(row,row)!=0:
        relevant.append([r,row])
return relevant
The row*vector multiplications are facilitated by the function
# indexed_row of type [i,v] where v is a sparse vector and i is 
# the row in the matrix it corresponds to.
def mult_sparse(indexed_row,v2):
    res=0
    for k in indexed_row[1].dict():
        res=res+indexed_row[1][k]*v2[k]
    return [indexed_row[0],res]
and regular vector multiplication is given by
def dot_sparse(v,w):
res=0
for k in v.dict():
    res=res+v[k]*w[k]
return res
Now to the actual benchmarks:
import concurrent.futures
import time
import itertools
dim=5000
M=random_matrix(QQ,nrows=dim,ncols=dim,sparse=True,density=0.0001)
M_by_rows=default_to_row(M)
v=vector(QQ,dim,list(range(dim)),sparse=True)
print(len(M_by_rows))
Using the concurrent.futures library yields the following performance:
start = time.perf_counter()
with concurrent.futures.ProcessPoolExecutor() as executor:
    start2 = time.perf_counter()
    results = executor.map(mult_sparse,M_by_rows,itertools.repeat(v) )
    finish2 = time.perf_counter()
    coefficients={}
    for result in results:
        if result[1]!=0:
            coefficients[result[0]]=result[1]
    new_v=vector(QQ,v.degree(),coefficients,sparse=True)
finish = time.perf_counter()
print(f'Finished in {round(finish-start, 2)} seconds with Pooling.')
print(f'The executor alone took {round(finish2-start2, 2)} seconds.')
where the dimensions are adjusted so that the computation does not exceed my RAM capacities because every spawned subprocess takes copies (at least of v). A comparison with serial approaches, both the standard one and the serial version of my particular straegy, show that the parallel version is significantly slower (18 and 7 seconds in the above outputs) while the serial versions do not differ (0.02 seconds in both cases):
start = time.perf_counter()
new_v=M*v
finish = time.perf_counter()
print(f'Finished in {round(finish-start, 2)} second(s) for the regular computation')
start = time.perf_counter()
coefficients={}
for row in M_by_rows:
    coefficients[row[0]]=mult_sparse(row,v)[1]
new_v=vector(QQ,v.degree(),coefficients,sparse=True)    
finish = time.perf_counter()
print(f'Finished in {round(finish-start, 2)} second(s) in a serial computation')
print(f'Check if the vectors coincide:')
test=new_v-M*v
n=dot_sparse(test,test)
if n==0:
    print(True)
else:
    print(False)
 
 
To be able to give an answer, we need to know the type of the entries in the matrix/vector. The symbolic ring is the certainly the most slow ring where to do addition and multiplication. So before using parallel computations, I would first try to find the best ring to perform the computations.
As I said, it may be possible to normalize everything such that QQ works. So for now, just assume the entries to be in QQ.
With QQ, sometimes the denominators gets huge and that becomes an issue. Is it your case? If precision is not an issue, than you can try with RDF or RealField(prec=...) which could be faster. For the parallelization question, what is more important for you? Do you need that each single multiplication
Mvis really fast (using low-level parallization done with more work in the internal of Sage or other libraries) or do you want to make millions of productsMvreally fast (using higher-level parallization done easily with python decorator @parallel)?It doesn't appear to be the case that the denominators are growing but I will check this more thoroughly, thanks a lot for this hint. Towards the type of parallelization: I really need to do one M*v operation at the time because the next steps depend on the result. I figured that splitting the matrix M into n lists of sparse row vectors, where n is the number of cores could potentially work. Then I can iterate over the lists in parallel so it should take roughly 1/n times the comptation time. The fortunate thing is that the matrices M do not change throughout the entire computation.
Do not hesitate to share your solution in the answers below.