Loading and analyzing data, Datamining

0

dartdog

Kelvin Li

I am considering using Sage as a data-mining tool. Is this way off base? If so, where should I head? I do not see much in the reference docs on loading and interfacing with large-ish data files, or on cleaning, standardizing, and faceting the data.

In particular, I have an 18 MB compressed CSV file that expands to at least 512 MB. It has about 20 fields per record and 400,000 or so records (it is residential real estate listing data). I need to summarize, average, and count by area, subdivision, and dates (and more). I had been using an Access database, but I am looking to move to Python and avoid an SQL structure if possible, since it now seems we have enough memory to hold the data in memory.

My brief search has not turned up any similar examples. At the least, I would appreciate any pointers on loading the compressed CSV, and guidance on the right Python/Sage data structures to start with.


1

answered 12 years ago

javier
11 ●1 ●3 http://www.ucl.ac.uk/~...

updated 12 years ago

This comes very late, but I thought I would post it anyway for future reference.

Apart from R, there is a Python-based data mining and visualization package called Orange. Besides its nice custom frontend, the whole thing works as a Python library and can be installed into Sage just with

easy_install orange

run from the Sage shell.


Comments

Looks cool!

kcrisman ( 12 years ago )


0

answered 14 years ago

Kelvin Li

Many of the operations you listed can be accomplished using pure Python and its standard library. As far as I know, Sage does not provide much more in this area than Python itself.

Custom data structures may be needed for the actual data manipulations. The data structure can be as "dumb" as a two-dimensional nested list (representing records and fields), or it could be well-packaged using an object-oriented approach. Which one to choose depends on your specific use case.
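As a rough illustration of the "dumb" nested-list approach, here is a minimal sketch that counts and averages one column grouped by another. The sample rows, the column order, and the helper name `summarize_by` are all made up for the example; the real file's layout will differ.

```python
from collections import defaultdict

# Hypothetical records: each inner list is [area, subdivision, price].
records = [
    ["North", "Oak Hills", 250000.0],
    ["North", "Pine Ridge", 310000.0],
    ["South", "Lakeside", 180000.0],
]

def summarize_by(records, key_index, value_index):
    """Return {key: (count, average)} for one numeric column, grouped by another."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [count, running sum]
    for row in records:
        entry = totals[row[key_index]]
        entry[0] += 1
        entry[1] += row[value_index]
    return {key: (count, total / count) for key, (count, total) in totals.items()}

# Count and average price per area (column 0 is the key, column 2 the value).
summary = summarize_by(records, key_index=0, value_index=2)
```

Grouping by subdivision instead would only require passing `key_index=1`.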

Regarding the mundane parsing of CSV (I am assuming "Comma Separated Values") files, there is a Python module called csv. It is the more convenient option, but if for some reason it is not flexible enough, there is also the regular expression module re.

File compression/decompression is provided by the gzip, bz2, zipfile, and tarfile Python modules, for the respective formats.
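Putting csv and gzip together, a compressed CSV can be read record by record without unpacking it to disk first. This is only a sketch: it assumes the file is gzip-compressed, UTF-8 encoded, and has a header row, none of which the question confirms. The function name `read_gzipped_csv` and the sample columns are invented for the example.

```python
import csv
import gzip
import os
import tempfile

def read_gzipped_csv(path):
    """Yield each record of a gzip-compressed CSV file as a dict keyed by header name."""
    # "rt" decompresses on the fly and hands text to the csv module.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        for record in csv.DictReader(handle):
            yield record

# Build a tiny stand-in file so the sketch can be run as-is.
sample_path = os.path.join(tempfile.mkdtemp(), "listings.csv.gz")
with gzip.open(sample_path, "wt", newline="", encoding="utf-8") as handle:
    handle.write("area,subdivision,price\nNorth,Oak Hills,250000\nSouth,Lakeside,180000\n")

records = list(read_gzipped_csv(sample_path))
```

Note that every field comes back as a string, so numeric columns such as the hypothetical `price` still need converting (for example with `float`) before averaging.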

datetime, time, and calendar provide date/time-related functions.
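For counting by date, a sketch along these lines may help. It assumes the dates are stored as ISO-style strings (`YYYY-MM-DD`); the actual format in the listing data is unknown, so the `fmt` argument would need adjusting. The sample dates and the helper `month_key` are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical listing dates; the real file may use another layout.
listing_dates = ["2010-03-15", "2010-03-28", "2010-04-02"]

def month_key(text, fmt="%Y-%m-%d"):
    """Parse a date string and return a (year, month) tuple for grouping."""
    parsed = datetime.strptime(text, fmt)
    return (parsed.year, parsed.month)

# Number of listings per calendar month.
listings_per_month = Counter(month_key(d) for d in listing_dates)
```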

Although you stated that you wanted to avoid SQL, it might be worth mentioning that Python's standard library also features SQLite in the sqlite3 module.
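If SQL turns out to be acceptable after all, sqlite3 can keep the whole table in memory, which fits the goal of not setting up a database server. The following is a sketch with a made-up three-column table; the table name, column names, and rows are assumptions for illustration only.

```python
import sqlite3

# Hypothetical rows: (area, subdivision, price).
rows = [
    ("North", "Oak Hills", 250000.0),
    ("North", "Pine Ridge", 310000.0),
    ("South", "Lakeside", 180000.0),
]

conn = sqlite3.connect(":memory:")  # the database lives only in RAM
conn.execute("CREATE TABLE listings (area TEXT, subdivision TEXT, price REAL)")
conn.executemany("INSERT INTO listings VALUES (?, ?, ?)", rows)

# Count and average price per area.
query = """
    SELECT area, COUNT(*), AVG(price)
    FROM listings
    GROUP BY area
    ORDER BY area
"""
summary = conn.execute(query).fetchall()
conn.close()
```

The `GROUP BY` clause does the counting and averaging that would otherwise need hand-written loops.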


0

answered 14 years ago

kcrisman

You might be best off using one of the subcomponents of Sage for this. R in particular is well suited to it and is widely used in the professional data-mining community.

(That doesn't mean you couldn't use Sage directly, or with Python tools - I'm sure a web search will turn up people using some library!)


3 Answers
