Python - Optimizing a huge array that represents redundant and sparse data


I tried to search for an answer to this question, which probably exists, but I couldn't find any, because I don't know how to name it.

So let's get to the point.

I am working on a statistical model. My dataset contains information about 45 public transport stations: static information about them (so 45 rows and around 1000 feature columns), and regularly spaced temporal measurements (so 2 000 000 rows and 10 columns). The "real" amount of information is small enough (around 500 MB) to be processed easily.

The problem is that most statistical modules in Python require simple 2D arrays, like NumPy's arrays. So I have to combine my data: to each of the 2 000 000 rows of measurements I have to attach the 1000+ columns of features related to the station in which the measurement took place, and I end up with a 17 GB database... most of which is redundant, and I feel it is a waste of resources.
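To make the size blow-up concrete, here is a rough back-of-the-envelope calculation. It assumes every value is stored as an 8-byte float (float64), which is my assumption and not something I have measured; the exact column counts are also rounded, so the result only approximates the 17 GB and 500 MB figures above.

```python
# Rough memory estimate; bytes_per_value = 8 assumes float64 storage (an assumption).
n_stations, n_static = 45, 1000      # static station table
n_rows, n_dynamic = 2_000_000, 10    # temporal measurements
bytes_per_value = 8

# Keeping the two tables separate.
separate_bytes = (n_stations * n_static + n_rows * n_dynamic) * bytes_per_value

# Repeating the station features on every measurement row.
merged_bytes = n_rows * (n_static + n_dynamic) * bytes_per_value

print(f"separate: {separate_bytes / 1e9:.2f} GB")  # separate: 0.16 GB
print(f"merged:   {merged_bytes / 1e9:.2f} GB")    # merged:   16.16 GB
```

Almost all of the merged size comes from the 1000 static columns being copied onto every one of the 2 000 000 rows.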

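I have an idea, but I have absolutely no idea how to do it: an array is just a function that, given i and j, returns a value, so is it possible for me to "emulate" one, or come up with a fake array, a pseudo array, or an array interface, that can be accepted as a NumPy array by these modules? I don't see why I couldn't do it, because an array is just a function (i, j) -> a[i][j]. Access would be a little slower than the memory access of a classic array, but that's the only problem, and it would still be fast. I just don't know how to do it...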
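A minimal sketch of what I mean by a "fake array" is below. The class name `LazyJoinedArray` and the variables `measurements`, `station_ids` and `station_features` are hypothetical names made up for illustration, and the toy data is random. One caveat I am aware of: code that converts its input with `np.asarray` will trigger `__array__`, which builds the full merged array anyway, so an object like this only saves memory with code that reads rows through indexing.

```python
import numpy as np

class LazyJoinedArray:
    """Hypothetical read-only 2D 'array' that joins station features on access."""

    def __init__(self, measurements, station_ids, station_features):
        self.measurements = measurements          # shape (n_rows, n_dynamic)
        self.station_ids = station_ids            # shape (n_rows,), station index per row
        self.station_features = station_features  # shape (n_stations, n_static)
        self.shape = (measurements.shape[0],
                      measurements.shape[1] + station_features.shape[1])
        self.dtype = np.result_type(measurements, station_features)

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, key):
        # a[i, j] style access: build the requested rows, then pick the columns.
        if isinstance(key, tuple):
            rows, cols = key
            return self[rows][..., cols]
        # key is an int, a slice or an integer array of row indices.
        static_part = self.station_features[self.station_ids[key]]
        return np.concatenate([self.measurements[key], static_part], axis=-1)

    def __array__(self, dtype=None, copy=None):
        # Called by np.asarray: this materializes the whole merged array.
        return np.asarray(self[:], dtype=dtype)

# Toy data: 3 stations with 4 static features, 5 measurements with 2 values each.
rng = np.random.default_rng(0)
station_features = rng.random((3, 4))
station_ids = np.array([0, 2, 2, 1, 0])
measurements = rng.random((5, 2))

lazy = LazyJoinedArray(measurements, station_ids, station_features)
row = lazy[1]          # merged row built on demand
value = lazy[3, 2]     # first static feature of station 1
```

Only the two small tables and the `station_ids` vector stay in memory; each merged row exists just for the duration of the access.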

Thank you in advance, and tell me if I need to bring more information or change the way I ask my question!

Edit:

OK, maybe I can clarify my problem a little bit:

My data can be presented in a relational database or object database fashion. It is pretty small (500 MB) in this format, but when I merge the tables to make it possible for scikit-learn to process it, it becomes way too big (17 GB+), and it seems like a waste of memory! I could use techniques for handling big data, which may be a solution, but isn't it possible to avoid doing so? Can I use my data directly in scikit-learn without having to merge the tables explicitly? Can I make an implicit merge that emulates the data structure without taking additional memory?
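One way the "implicit merge" could look is to build the merged rows one batch at a time, so that the full 17 GB table never exists. The sketch below uses only NumPy; the function name `iter_merged_batches`, the variable names and the random toy data are hypothetical. Whether this helps depends on the estimator: as far as I understand it only works with models that can learn incrementally (for example scikit-learn estimators that offer a `partial_fit` method), which is an assumption about the model I would end up using.

```python
import numpy as np

def iter_merged_batches(measurements, station_ids, station_features, batch_size):
    """Yield merged blocks of at most batch_size rows, built on the fly."""
    n_rows = measurements.shape[0]
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        # Look up the static features of the station of each row in the batch.
        static_part = station_features[station_ids[start:stop]]
        yield np.hstack([measurements[start:stop], static_part])

# Toy data standing in for the real tables (sizes reduced for the example).
rng = np.random.default_rng(0)
station_features = rng.random((45, 20))        # 45 stations, 20 static features
station_ids = rng.integers(0, 45, size=1000)   # station index of each measurement
measurements = rng.random((1000, 10))          # 1000 measurements, 10 columns

batches = list(iter_merged_batches(measurements, station_ids,
                                   station_features, batch_size=256))
# In real use each batch would be consumed and discarded instead of collected,
# e.g. passed to an incremental estimator together with the matching targets.
```

At the real scale, a batch of 10 000 rows with 1010 float64 columns would take about 81 MB, so the peak memory is set by the batch size rather than by the 2 000 000 rows.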

