Loading and analyzing data, Datamining

asked 2011-04-15 14:11:22 +0200

dartdog
147 ●5 ●7 ●13

updated 2011-06-16 15:31:08 +0200

Kelvin Li
503 ●5 ●12 ●17

I am considering using Sage as a data-mining tool. Is this way off base? If so, where should I head instead? I do not see much in the reference docs on loading and interfacing with large-ish data files, or on cleaning, standardizing, and faceting the data.

In particular, I have an 18 MB compressed CSV file that expands to at least 512 MB. It has about 20 fields per record and 400,000 or so records (it is residential real estate listing data). I need to summarize, average, and count by area, subdivision, and dates (and more). I had been using an Access database, but I am looking to move to Python and avoid an SQL structure if possible, since it now seems we have enough memory to hold the data in memory.

My brief search has not turned up any similar examples. At the least, are there any pointers on loading the compressed CSV, and guidance on the right Python/Sage data structures to start off with?


1

answered 2012-06-20 08:51:33 +0200

javier
11 ●1 ●3 http://www.ucl.ac.uk/~...

updated 2012-06-20 08:53:27 +0200

This comes very late, but I thought I would post it anyway for future reference.

Apart from R, there is a Python-based data mining and visualization package called Orange. Besides a nice custom frontend, the whole thing works as a Python library and can be installed into Sage with

easy_install orange

run from the sage shell.


Comments

Looks cool!

kcrisman ( 2012-06-20 12:22:40 +0200 )

0

answered 2011-04-15 15:23:36 +0200

Kelvin Li
503 ●5 ●12 ●17

Many of the operations you listed can be accomplished using pure Python and its standard library. As far as I know, Sage does not provide much more in this area than Python itself.

Custom data structures may be needed for the actual data manipulations. The data structure can be as "dumb" as a two-dimensional nested list (representing records and fields), or it could be well-packaged using an object-oriented approach. Which one to choose depends on your specific use case.
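As a rough illustration of the middle ground between a nested list and a full class, here is a minimal sketch using a namedtuple per record and a dictionary of running totals. The field names (area, subdivision, price) and the helper summarize_by are made up for the example; the real file has about 20 fields.

```python
from collections import defaultdict, namedtuple

# Hypothetical record layout; the real data has many more fields.
Listing = namedtuple("Listing", ["area", "subdivision", "price"])

def summarize_by(records, key):
    """Return {group: (count, average price)} grouped by the named field."""
    totals = defaultdict(lambda: [0, 0.0])  # group -> [count, price sum]
    for rec in records:
        bucket = totals[getattr(rec, key)]
        bucket[0] += 1
        bucket[1] += rec.price
    return {group: (n, total / n) for group, (n, total) in totals.items()}

records = [
    Listing("North", "Oak Hills", 200000.0),
    Listing("North", "Pine Ridge", 300000.0),
    Listing("South", "Lakeside", 150000.0),
]
summary = summarize_by(records, "area")
```

The same function can group by "subdivision" instead, since the grouping field is passed in by name.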

Regarding the mundane parsing of CSV (I am assuming "Comma Separated Values") files, there is a Python module called csv. It is the more convenient option, but if for some reason it is not flexible enough, there is also the regular expression module re.

File compression/decompression is provided by the gzip, bz2, zipfile, and tarfile Python modules, for the respective formats.
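For the compressed CSV specifically, the gzip and csv modules can be chained so the file is never fully decompressed on disk. This is a sketch only, assuming Python 3, a gzip-compressed file, UTF-8 text, and a header row; the file name, column names, and the helper read_gzipped_csv are invented for the example.

```python
import csv
import gzip
import os
import tempfile

def read_gzipped_csv(path):
    """Yield each row of a gzip-compressed CSV file as a dict keyed by header."""
    # "rt" gives a decompressed text stream; newline="" is what the csv
    # module documentation recommends for file objects it reads from.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            yield row

# Build a tiny sample file so the sketch runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "listings.csv.gz")
with gzip.open(sample_path, "wt", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["area", "subdivision", "price"])
    writer.writerow(["North", "Oak Hills", "200000"])
    writer.writerow(["South", "Lakeside", "150000"])

rows = list(read_gzipped_csv(sample_path))
```

Because read_gzipped_csv is a generator, rows can also be processed one at a time instead of being collected into a list. Note that csv returns every value as a string, so numeric fields still need converting.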

datetime, time, and calendar provide date/time-related functions.
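For example, counting records per month might look like the sketch below. The date format is an assumption (ISO-style year-month-day); the real listing data may use something else, in which case the pattern passed to strptime needs adjusting. The name count_by_month is hypothetical.

```python
from collections import Counter
from datetime import datetime

# Assumed date format; change the pattern to match the real data.
DATE_FORMAT = "%Y-%m-%d"

def count_by_month(date_strings):
    """Count how many dates fall in each (year, month) pair."""
    counts = Counter()
    for text in date_strings:
        day = datetime.strptime(text, DATE_FORMAT).date()
        counts[(day.year, day.month)] += 1
    return counts

monthly = count_by_month(["2011-03-28", "2011-04-02", "2011-04-15"])
```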

Although you stated that you wanted to avoid SQL, it might be worth mentioning that Python's standard library also features SQLite in the sqlite3 module.
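If SQL turns out to be acceptable after all, sqlite3 can keep the whole database in memory, so no server or database file is needed. A minimal sketch, with a made-up table and columns:

```python
import sqlite3

# ":memory:" creates a temporary database that lives only in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (area TEXT, subdivision TEXT, price REAL)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [
        ("North", "Oak Hills", 200000.0),
        ("North", "Pine Ridge", 300000.0),
        ("South", "Lakeside", 150000.0),
    ],
)

# Count and average price per area, in a stable order.
result = conn.execute(
    "SELECT area, COUNT(*), AVG(price) FROM listings "
    "GROUP BY area ORDER BY area"
).fetchall()
conn.close()
```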


0

answered 2011-04-15 15:20:30 +0200

kcrisman
12222 ●41 ●135 ●254

You might be best off using one of the subcomponents of Sage for this. R in particular is the best suited, and there is a lot of activity around it in the professional data mining community.

(That doesn't mean you couldn't use Sage directly, or with Python tools - I'm sure a web search will turn up people using some library!)
