Apache Spark - Accessing client-side objects and code
A Spark application needs to validate each element in an RDD.
Given a driver/client-side Scala object called validator, which of the following two solutions is better:
    rdd.filter { x => validator.isvalid(x.somefield) }

or something like:

    // list of fields to validate against
    val list = rdd.map(x => x.somefield)
    // use the validator to check which ones are invalid
    var invalidelements = validator.getvalidelements().diff(list)
    // remove the invalid elements from the rdd
    rdd.filter(x => !invalidelements.contains(x.somefield))
The second solution avoids referencing the driver-side object within the function passed to the RDD. The invalid elements are determined on the client, and then the list is passed to the RDD.
Or is neither recommended?
Thanks
If I understand correctly (i.e. you have an object validator), then that's not driver code, because the job's jar is distributed to the workers. The Scala object you define will be instantiated in each executor JVM. (That's why you don't receive a serialization exception, in contrast to using methods defined in the job, e.g. in Spark Streaming with checkpointing.)
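To illustrate the point about objects, here is a minimal sketch, assuming Spark core is on the classpath and the job runs in local mode. The names `Validator`, `isValid`, `Record`, and `FilterWithObject` are hypothetical stand-ins for the code in the question, not its actual definitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical validator: a top-level Scala object, so it lives in the job's jar.
object Validator {
  private val allowed = Set("a", "b") // assumed rule, for illustration only
  def isValid(field: String): Boolean = allowed.contains(field)
}

// Hypothetical record type standing in for the RDD's elements.
case class Record(someField: String)

object FilterWithObject {
  def validRecords(sc: SparkContext): Array[Record] = {
    val rdd = sc.parallelize(Seq(Record("a"), Record("x"), Record("b")))
    // The closure refers to Validator statically; no Validator instance is
    // serialized with the task. Each JVM that runs the task initializes it.
    rdd.filter(r => Validator.isValid(r.someField)).collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("validator-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try validRecords(sc).foreach(println)
    finally sc.stop()
  }
}
```

In local mode the driver and the executor share one JVM, so this only shows the shape of the code; on a cluster the same closure would cause `Validator` to be initialized separately on each executor.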
The first version should perform better because you filter first. Mapping over all of the data and then filtering is slower.
The second version is also problematic because if you are creating the list of valid elements on the driver, you then have to ship it to the workers.
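The cost of the second approach can be sketched as follows. This is an illustration under assumptions, not the asker's exact code: `FilterWithDriverSet`, `keepValid`, and the sample data are hypothetical, and the RDD holds plain strings instead of records.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FilterWithDriverSet {
  // validElements is hypothetical driver-side data (e.g. what the validator returns).
  def keepValid(rdd: RDD[String], validElements: Set[String]): Array[String] = {
    // Extra pass over the data: the distinct fields are pulled back to the driver.
    val fields = rdd.distinct().collect().toSet
    // Determined on the driver: fields that are not in the valid set.
    val invalid = fields.diff(validElements)
    // The invalid set is captured by the closure, so it is serialized
    // and shipped to the workers along with the tasks.
    rdd.filter(x => !invalid.contains(x)).collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("driver-set-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq("a", "x", "b", "y"))
      keepValid(rdd, Set("a", "b")).foreach(println)
    } finally sc.stop()
  }
}
```

Compared with a single `filter`, this version runs two jobs over the RDD and moves data in both directions between the driver and the workers.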