Loading a 5GB dictionary of matrices uses up all of 64GB RAM

answered 2019-01-16 07:12:52 +0200

nbruin
updated 2019-01-20 22:14:05 +0200

See https://bugs.python.org/issue26415. Python's parser is not memory-efficient on large expressions. For one thing, it compiles the entire expression to bytecode that then produces the data structure. In principle that could be done with memory usage linear in the input, but possibly with a nasty constant.

If you want to read in expressions in an efficient way, you should probably consider a more restricted file format that has parsers implemented that work more efficiently. For a matrix, a "csv" file or a json file may well work better.

The Python parser (and the Sage preparser!) make trade-offs that leave them unsuitable for parsing large data structures.

EDIT: for sparse matrices, CSV is probably not such a great solution, because it basically is a textual spreadsheet format. JSON would probably be fairly good at encoding a list of coordinate-and-value pairs that would be suitable for representing a sparse matrix in text form, but you'd have to read up on python JSON tools.
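As a rough sketch of the coordinate-and-value idea, using only the json module that ships with Python: the non-zero entries are stored as a list of [i, j, value] triples next to the dimensions. The field names "nrows", "ncols" and "entries" are made up for this illustration, and the values are plain Python ints rather than Sage integers.

```python
import json

# Hypothetical layout: dimensions plus a list of [i, j, value] triples.
# Only the non-zero entries are stored.
sparse_entries = {(0, 2): 5, (1, 0): -1, (3, 3): 7}

payload = {
    "nrows": 4,
    "ncols": 4,
    "entries": [[i, j, v] for (i, j), v in sorted(sparse_entries.items())],
}

text = json.dumps(payload)  # text form of the sparse matrix

# Decoding: rebuild the {(i, j): value} dictionary from the triples.
decoded = json.loads(text)
recovered = {(i, j): v for i, j, v in decoded["entries"]}
```

In Sage, a dictionary of this shape can be handed to the matrix constructor, for example matrix(ZZ, nrows, ncols, recovered, sparse=True). Note that json.loads still needs the whole text in memory at once.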

I would probably try to get a file in which each line contains i,j,A[i,j] and write a quick loop to read the lines from that file and fill in a matrix from it, but there might be more elegant solutions than that (and it may be easy to parse the file that you already have).

Mathematica can read the file without problem? That's a nice job. You have to take some care to parse data in such a way that extremely long data structures are parsed quickly and efficiently. Python's parser definitely does not have that property. I guess most Python solutions decide to read/write special data formats (or exchange formats such as JSON or CSV) instead.

Note that Python would probably also be able to write the file quite easily.

Using JSON is a bit hard-going, because only the basics are present, and I don't think sage types have particularly good JSON support. However, the following might give you some inspiration:

sage: import json
sage: D=dict( (str((i,j)),1r) for i in range(1000) for j in range(1000) )
sage: S=json.dumps(D) #encode as a json string (fast)
sage: D2=json.loads(S) #get a dictionary back (also fast)
sage: D == D2
True

(Note that the default json module has not been extended to handle anything beyond strings as keys, and it does not handle Sage integers either. There are probably better libraries out there. This is just what comes with Python by default.)

EDIT 2: if you want to produce a file with which you can do this, look at the string "S" and make sure to write your file in that format. You'd have to do some post-processing on the dictionary to make it suitable for input into the matrix constructor. Beware: a lot of reading tools in python tend to read the entire contents of the file into memory in one big string. For most cases that's pretty efficient and computers have a lot of memory nowadays, but for a 5GB file it's probably not a good idea.
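The post-processing step could look roughly like the following. This is a minimal sketch in plain Python that assumes the keys were written with str((i, j)), as in the string S from the session above; it turns them back into tuples with ast.literal_eval, which only evaluates literals. The name "entries" is just for illustration.

```python
import ast
import json

# Same shape as the string S above: keys are str((i, j)), values plain ints.
S = json.dumps({str((i, j)): 1 for i in range(3) for j in range(3)})

D2 = json.loads(S)

# Turn keys like "(0, 1)" back into tuples (0, 1).
# ast.literal_eval only evaluates Python literals, unlike eval.
entries = {ast.literal_eval(key): value for key, value in D2.items()}
```

A dictionary keyed by (i, j) tuples is the form the sparse matrix constructor accepts. Converting each key separately may be slow for millions of entries, so treat this as a starting point only.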

If you're going to write a custom routine to produce a file anyway, you might as well make up your own format. If I were to do this, I'd figure out on the Mathematica side how to write a file whose first line gives the dimensions of the matrix (number of rows, then number of columns) and whose remaining lines have the form i j v, indicating that A[i,j] = v, of course only for the non-zero entries of the matrix. The program to construct the matrix on the Sage side would then be

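# The "with" block closes the file even if a line fails to parse.
with open("matrix_file","r") as F:
    # First line: number of rows and number of columns.
    ns, ms = F.readline().split()
    A = matrix(ZZ, int(ns), int(ms), sparse=True)
    # Remaining lines: "i j v", meaning A[i,j] = v.
    for line in F:
        i_s, js, vs = line.split()
        A[int(i_s), int(js)] = ZZ(vs)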

(Problem: "is" is a reserved word in Python, so we use i_s instead.)

It may not be super-fast, but it is guaranteed to have only one copy in memory of the big object (the sparse matrix) and it really reads the file line-by-line from disk (the OS will buffer in larger blocks, though).
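To pin down the file format, here is a small round trip in plain Python. It is only a sketch: write_sparse and read_sparse are hypothetical helper names, the writer merely stands in for whatever export you end up doing on the Mathematica side, and the values are plain Python ints instead of Sage integers. The sample entries are the ones from the 7 x 21 matrix quoted in the comments.

```python
import os
import tempfile

def write_sparse(path, nrows, ncols, entries):
    """Write 'nrows ncols' on the first line, then one 'i j v' line per entry."""
    with open(path, "w") as f:
        f.write(f"{nrows} {ncols}\n")
        for (i, j), v in entries.items():
            f.write(f"{i} {j} {v}\n")

def read_sparse(path):
    """Read the file line by line; return (nrows, ncols, {(i, j): v})."""
    with open(path) as f:
        nrows, ncols = (int(tok) for tok in f.readline().split())
        entries = {}
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            i, j, v = line.split()
            entries[int(i), int(j)] = int(v)
    return nrows, ncols, entries

# Round trip through a temporary file.
original = {(3, 3): -1, (3, 9): -1, (3, 14): -1}
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    write_sparse(path, 7, 21, original)
    nrows, ncols, entries = read_sparse(path)
finally:
    os.remove(path)
```

Unlike the loop above, this version collects the entries in a dictionary first, so it briefly holds both the dictionary and whatever matrix is built from it. Filling the matrix directly, as in the Sage loop, avoids that second copy.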

Comments

Or for something that large, a binary format.

Iguananaut ( 2019-01-16 11:31:11 +0200 )

Oh, that's disappointing. I thought Sage was supposed to be much more efficient than Mathematica. Even when I try to import just the largest matrix (which takes up less than 800MB), Sage uses up 28GB of RAM and crashes.

Leon ( 2019-01-16 21:25:05 +0200 )

"I thought Sage was supposed to be much more efficient than Mathematica": it depends on what you mean by "efficient" and exactly what tasks you're judging on. For loading a huge matrix it may be just as efficient; you just have to be using the best data format for the task (which, generally, is not representing huge datasets as code).

Iguananaut ( 2019-01-17 17:05:50 +0200 )

Hmm, I still haven't been able to import the file, but probably due to my incompetence. Could you give me specific instructions on how (in what form) to export the file from Mathematica and how to import it in Sage? Do I export it to file.json? Should its content be e.g. [ "bdrs={", "1: matrix(ZZ,1,7,{}),", "2: matrix(ZZ,7,21,{(3,3):-1, (3,9):-1, (3,14):-1}),", "};" ] How do I import this into sage? If I run load('file.json'), I get No such file or directory: '/home/file.json.sobj'. If I run json.loads('file.json'), I get ValueError: No JSON object could be decoded.

Leon ( 2019-01-19 17:39:49 +0200 )

I prefer the option in EDIT 2; however, I would include the dimensions on the first line of the file. Note that in your script you use is, js for the strings but wrote int(i), int(j) for the conversion.

vdelecroix ( 2019-01-20 10:40:33 +0200 )
