
Loading and analyzing data, Datamining

asked 2011-04-15 14:11:22 +0200

dartdog

updated 2011-06-16 15:31:08 +0200

Kelvin Li

I am considering using Sage as a data-mining type tool. Is this way off base? If so, where should I head? I do not see much in the reference docs on loading and interfacing with large-ish data files, or on cleaning, standardizing, and faceting the data.

In particular, I have an 18 MB compressed CSV file that expands to at least 512 MB. It has about 20 fields per record and roughly 400,000 records (it is residential real estate listing data). I need to summarize, average, and count by area, subdivision, and dates (and more). I had been using an Access database, but am looking to move to Python and avoid an SQL structure if possible, since it now seems we have enough memory to hold the data in memory.

My brief search has not turned up any similar examples. At the least, I would appreciate pointers to loading the compressed CSV, and guidance on the right Python/Sage data structures to start off with.


3 Answers


answered 2012-06-20 08:51:33 +0200

updated 2012-06-20 08:53:27 +0200

This comes very late, but I thought I would post it anyway for future reference.

Apart from R, there is a pure Python-based data-mining and visualization package called Orange. Besides a nice custom frontend, the whole thing works as a Python library and can be installed into Sage with

easy_install orange

run from the sage shell.



Looks cool!

kcrisman ( 2012-06-20 12:22:40 +0200 )

answered 2011-04-15 15:23:36 +0200

Kelvin Li

Many of the operations you listed can be accomplished using pure Python and its standard library. As far as I know, Sage does not provide much more in this area than Python itself.

Custom data structures may be needed for the actual data manipulations. The data structure can be as "dumb" as a two-dimensional nested list (representing records and fields), or it could be well packaged using an object-oriented approach. Which one to choose depends on your specific use case.
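For the summarize/average/count-by-group operations mentioned in the question, even a plain dictionary keyed on the grouping field goes a long way. A minimal sketch (the `area` and `price` field names are assumptions; the real listing file's columns will differ):

```python
from collections import defaultdict

# Sample records standing in for rows parsed from the listing file.
rows = [
    {"area": "North", "price": "250000"},
    {"area": "North", "price": "310000"},
    {"area": "South", "price": "180000"},
]

# Group by area, accumulating a count and a running price total.
totals = defaultdict(lambda: [0, 0.0])  # area -> [count, price_sum]
for row in rows:
    totals[row["area"]][0] += 1
    totals[row["area"]][1] += float(row["price"])

# Report count and average price per area.
for area, (count, price_sum) in sorted(totals.items()):
    print(area, count, price_sum / count)
```

The same pattern extends to grouping on (area, subdivision, month) tuples as the dictionary key.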

Regarding the mundane parsing of CSV (I am assuming "Comma Separated Values") files, there is a Python module called csv. This is more convenient to use, but if for some reason it is not flexible enough, there is also the regular expression module re.
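A short sketch of `csv` in action, reading from an in-memory string here rather than a real file; `csv.DictReader` maps each row to a dict keyed by the header line:

```python
import csv
import io

# A two-row sample standing in for the real listing file.
sample = "area,price\nNorth,250000\nSouth,180000\n"

# DictReader uses the first row as field names.
reader = csv.DictReader(io.StringIO(sample))
records = list(reader)
print(records[0]["area"])  # North
```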

File compression/decompression is provided by the gzip, bz2, zipfile, and tarfile Python modules, for the respective formats.
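These combine naturally with `csv`: `gzip.open` in text mode yields a file-like object that `csv.DictReader` can stream row by row, so the compressed file never has to be fully decompressed to disk. A sketch (the file path is made up for the demo):

```python
import csv
import gzip
import os
import tempfile

# Create a small gzipped CSV to demonstrate; in practice the
# compressed listing file would already exist on disk.
path = os.path.join(tempfile.gettempdir(), "listings_demo.csv.gz")
with gzip.open(path, "wt", newline="") as f:
    f.write("area,price\nNorth,250000\n")

# Stream the compressed file without decompressing it to disk first.
with gzip.open(path, "rt", newline="") as f:
    rows = list(csv.DictReader(f))

print(rows)
```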

datetime, time, and calendar provide date/time-related functions.
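For grouping by date, `datetime.strptime` turns the date strings in the file into real date objects (the `%Y-%m-%d` format here is an assumption about how the listing dates look):

```python
from datetime import datetime

# Parse a listing date and derive a year-month key for grouping.
listed = datetime.strptime("2011-04-15", "%Y-%m-%d")
month_key = listed.strftime("%Y-%m")
print(month_key)  # 2011-04
```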

Although you stated that you wanted to avoid SQL, it might be worth mentioning that Python's standard library also features SQLite in the sqlite3 module.
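SQLite needs no server or separate installation, and an in-memory database sidesteps the "SQL structure" overhead while still giving you `GROUP BY` aggregation for free. A minimal sketch with made-up listing rows:

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (area TEXT, price REAL)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?)",
    [("North", 250000), ("North", 310000), ("South", 180000)],
)

# Count and average price per area in a single query.
result = conn.execute(
    "SELECT area, COUNT(*), AVG(price) FROM listings "
    "GROUP BY area ORDER BY area"
).fetchall()
print(result)
conn.close()
```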


answered 2011-04-15 15:20:30 +0200

kcrisman

You might be best off using one of the subcomponents of Sage for this. R in particular is well suited, and is widely used in the professional data-mining community.

(That doesn't mean you couldn't use Sage directly, or with Python tools - I'm sure a web search will turn up people using some library!)


