Boost implementation of Dijkstra's algorithm

asked 2019-07-12 14:08:13 +0200

elastica gravatar image

updated 2019-07-14 11:16:28 +0200

I have benchmarked Dijkstra's algorithm using the Boost Graph library interface provided by SageMath against a basic and academic one but written in pure Python. And the later performs always better. This is very surprising since Boost Graph library is a templated C++ library. Perphaps I'using it in the wrong way. Here is the code :

from time import clock
from heapq import heappush, heappop

def make_graph(n, p):
    import networkx as nx
    G=nx.erdos_renyi_graph(n, p)

    G=nx.relabel_nodes(G, lambda k:'v'+str(k))    
    edges=list(G.edges())
    wedges=[]
    for u, v in edges:
        w=float(randrange(1, 10))
        G[u][v]["weight"]=w
        wedges.append((u, v, w))

    return G, wedges

def wedges2adjw(nodes, wedges):
    "From weighted wedges to weighted adjacency list"
    n=len(nodes)
    node2int={node:i for (i,node) in enumerate(nodes)}
    adj=[[] for _ in range(n)]
    for a, b,w in wedges:
        i=node2int[a]
        j=node2int[b]        
        adj[i].append((j,w))
        adj[j].append((i,w))

    return adj, node2int

def dijkstra_int(adj, src):
    """
    adj: weighted adjacency lists of a graph with n nodes
    vertices indexed fropm 0 to n-1
    Returns shortest path list from src to all vertices"""

    n=len(adj)
    processed=[0]*n
    processed[src]=1
    distances=[(0,src)]
    INF=float("inf")
    dist=[INF]*n
    dist[src]=0
    while distances:
        d, u=heappop(distances)
        if processed[u]==1:
            processed[u]=2

            for v,w in adj[u]:
                dd=d+w
                if processed[v]==0 or dd<dist[v]:
                    heappush(distances, (dd,v))
                    dist[v]=dd
                    if processed[v]==0:
                        processed[v]=1
    return dist

def dijkstra(nodes, wedges, n, src):
    adj, node2int=wedges2adjw(nodes, wedges)
    int2node={i:node for (node,i) in node2int.items()}
    src_int=node2int[src]
    return {int2node[i]:d for (i, d) in enumerate(dijkstra_int(adj, src_int)) if d!=float("inf")}



print("------------  Generating Graph --------------------")
debut=clock()

n=3000
p=0.3

Gnx, wedges=make_graph(n, p)
src=sample(Gnx.nodes(),1)[0]
nodes=Gnx.nodes()
gtime=(clock()-debut)
print "Number of nodes:", len(Gnx)
print "Number of edges:", len(Gnx.edges())
print("Graph generation: %.2fs" %gtime) 


print("------------  Boost --------------------")


G = Graph()
G.add_vertices(nodes)
G.add_edges(wedges)
G.weighted(True)

debut=clock()
dist_boost = G.shortest_path_lengths(src,by_weight = True, algorithm='Dijkstra_Boost')

boost=(clock()-debut)
print("Boost: %.2fs" %boost)  


print("------------  Pure Python --------------------")
debut=clock()
dist_pp=dijkstra(nodes, wedges, n, src)
py_time=(clock()-debut)
print("Pure Python: %.2fs" %py_time) 

print"----------"
print "Checking:", dist_boost==dist_pp

Output :

------------  Generating Graph --------------------
Number of nodes: 3000
Number of edges: 1350005
Graph generation: 11.76s
------------  Boost --------------------
Boost: 1.61s
------------  Pure Python --------------------
Pure Python: 1.48s
----------
Checking: True

EDIT

Now, the tested functions have the same interface:

from time import clock
from heapq import heappush, heappop

def make_graph(n, p):
    G = Graph()
    G.add_vertices(range(n))
    G.add_edges([(u, v, float(randrange(1, 10)))
                 for (u, v, _) in graphs.RandomGNP(n, p).edges()])
    G.weighted(True)

    return G

def dijkstra(G, src):
    adj = [[] for _ in range(len(G))]

    for i, j, w in G.edges():
        adj[i].append((j, w))
        adj[j].append((i, w))    

    n = len(adj)
    processed = [0] * n
    processed[src] = 1
    distances = [(0, src)]
    INF = float("inf")
    dist = [INF] * n
    dist[src] = 0
    while distances:
        d, u = heappop(distances)
        if processed[u] == 1:
            processed[u] = 2

            for v, w in adj[u]:
                dd = d + w
                if processed[v] == 0 or dd < dist[v]:
                    heappush(distances, (dd, v))
                    dist[v] = dd
                    if processed[v] == 0:
                        processed[v] = 1
    return {i: dist[i] for i in G if dist[i] != INF}


print("------------  Generating Graph --------------------")
debut = clock()

n = 3000
p = 0.3

G = make_graph(n, p)
src = randrange(n)
nodes = G.vertices()
gtime = (clock() - debut)
print "Number of nodes:", len(G)
print "Number of edges:", len(G.edges())
print("Graph generation: %.2fs" % gtime)


print("------------  Sage --------------------")

debut = clock()
dist_sage = G.shortest_path_lengths(
    src, by_weight=True)
default_time = (clock() - debut)
print("Default: %.2fs" % default_time)


print("------------  Boost --------------------")

debut = clock()
dist_boost = G.shortest_path_lengths(
    src, by_weight=True, algorithm='Dijkstra_Boost')
boost_time = (clock() - debut)
print("Boost: %.2fs" % boost_time)

print("------------  Pure Python --------------------")
debut = clock()
dist_pp = dijkstra(G, src)
py_time = (clock() - debut)
print("Pure Python: %.2fs" % py_time)

print "----------"
print "Checking:", dist_sage == dist_boost == dist_pp

print "Time ratio: %.2f" %(py_time/boost_time)

output:

------------  Generating Graph --------------------
Number of nodes: 3000
Number of edges: 1349390
Graph generation: 7.74s
------------  Sage --------------------
Default: 2.23s
------------  Boost --------------------
Boost: 1.53s
------------  Pure Python --------------------
Pure Python: 2.54s
----------
Checking: True
Time ratio: 1.66

By the way, the docs explain about the default implementation:

(default): Sage chooses the best algorithm: 'BFS' if by_weight is False, 'Dijkstra_Boost' if all weights are positive, 'Bellman-Ford_Boost' otherwise.

This is not consistent with the timing above. [As explained by David Coudert in the comments, the difference is due to sign checking]

edit retag flag offensive close merge delete

Comments

The question is formulated so that one can start to digest it only after digesting the code. Which is highly undocumented itself. Also, try as a rule to follow the pep8 conventions, this would make it at least readable. So which isolated line does not perform as good as "the other" code? Can we have a minimal example where the performance is worse. (Compared to the code we do not have.)

dan_fulea gravatar imagedan_fulea ( 2019-07-12 18:19:23 +0200 )edit

The reported experiment is unfair: the methods don't take the same parameters as input. To be fair, both methods should take as input a Sage Graph. If you count the time to build wedges from a Sage Graph, you will get very different results.

You can also use graphs.RandomGNP(n, p) to generate your random graph.

David Coudert gravatar imageDavid Coudert ( 2019-07-13 17:13:09 +0200 )edit

You are right, the test in unfair ... for the Pure Python code. Did you realize that, on the Boost side, I did'nt take into account the time elapsed to build the graph? On the contrary, on the Pure Python side, I took in account the correspondind time since the Dijkstra function built the graph from the weighted list of edges and converted from string node to integer node.

elastica gravatar imageelastica ( 2019-07-13 22:44:24 +0200 )edit

Do the following test: write a pure Python method that takes as input a Sage Graph (or a networkx graph) and a source node. This method will then build adj and run dijkstra_int. For instance:

def dijsktra_int_complete(Gnx, src):
    wedges = [(u, v, label['weight']) for u, v, label in Gnx.edges(data=True)]
    return dijkstra_int(Gnx.nodes(), wedges, -1, src)

On my computer the longest operation is to build list wedges from the networkx graph. It takes roughly twice the time of the call to dijkstra_int. But you don't include that time in your measurement... That's why it's unfair.

David Coudert gravatar imageDavid Coudert ( 2019-07-13 23:04:45 +0200 )edit

Depend on your starting point. If you start from a bunch of weighted edges, probably Sage graph + Boost is slower than pure Python code, this sounds a problem to me (recall that Sage graphs are written in Cython for speed), doesn't it? If your starting point is a Sage Graph (I avoid NetworkX graphs as their dict of dict of dict implementation is slow), my pure Python code has to extract the weighted adjacency list and pure Python code is now slower (cf. my update above), but the ratio (about x1.7) is far from the ratio speed between pure Python code and C++ code (about 25 in our case I imagine) and this was my initial topic.

elastica gravatar imageelastica ( 2019-07-14 01:12:32 +0200 )edit

The running difference in your experiment between Sage and Boost is the time to check if all weights are positive or not. When the algorithm is specified we don't do that test, and it's faster.

Sage graphs are written in Cython, but we must convert the format to boost graph before calling Dijkstra of boost (same for igraph, and also for BFS, etc.). This has a cost. We could try to speed up this conversion.

David Coudert gravatar imageDavid Coudert ( 2019-07-14 06:27:30 +0200 )edit