apache spark - Integrating scikit-learn with PySpark


I'm exploring PySpark and the possibility of integrating scikit-learn with PySpark. I'd like to train a model on each partition using scikit-learn. That means, when an RDD is defined and gets distributed among the different worker nodes, I'd like to use scikit-learn to train a model (let's say a simple k-means) on each partition that exists on each worker node. Since scikit-learn algorithms take a pandas DataFrame, my initial idea was to call toPandas for each partition and then train my model. However, the toPandas function collects the DataFrame to the driver, which is not what I'm looking for. Is there any other way to achieve such a goal?
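For reference, here is a minimal sketch of the per-partition idea the question describes, using RDD.mapPartitions so the data never leaves the workers. This is not from the original answer; the column names and the assumption that the RDD holds numeric tuples are illustrative only:

```python
import pandas as pd
from sklearn.cluster import KMeans

def train_on_partition(rows):
    # rows is an iterator over the records of one partition;
    # build a local pandas DataFrame on the worker instead of calling toPandas.
    pdf = pd.DataFrame(list(rows), columns=["x", "y"])  # column names are assumed
    if len(pdf) < 3:
        # not enough points in this partition for 3 clusters
        return []
    model = KMeans(n_clusters=3).fit(pdf)
    # return something small and picklable, e.g. the fitted cluster centres
    return [model.cluster_centers_]

# rdd is assumed to hold (x, y) tuples, e.g. sc.parallelize([(1.0, 2.0), ...])
centres_per_partition = rdd.mapPartitions(train_on_partition).collect()
```

Each worker trains its own independent k-means on its slice of the data; only the small fitted results (here, the cluster centres) are collected back to the driver.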

scikit-learn can't integrated spark now, , reason scikit-learn algorithms aren't implemented distributed work on single machine.

Nevertheless, you can find ready-to-use Spark/scikit-learn integration tools in spark-sklearn, which (for the moment) supports executing a grid search with cross-validation on Spark.
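A minimal sketch of the spark-sklearn usage mentioned above, based on the package's GridSearchCV wrapper (the exact API may differ between versions); sc is assumed to be an existing SparkContext:

```python
from sklearn import datasets, svm
from spark_sklearn import GridSearchCV  # pip install spark-sklearn

iris = datasets.load_iris()
param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# Each parameter combination is fitted as a separate Spark task,
# while the model training itself still runs single-node scikit-learn.
clf = GridSearchCV(sc, svm.SVC(), param_grid)
clf.fit(iris.data, iris.target)
print(clf.best_params_)
```

Note that this distributes the hyperparameter search, not the individual model fits, which matches the limitation described above.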

