numpy - How to load a DataFrame from a Python requests stream that is downloading a CSV file?


I would like to create a DataFrame from a CSV file that I retrieve via streaming:

import requests

url = "https://{0}:8443/gateway/default/webhdfs/v1/{1}?op=open".format(host, filepath)

r = requests.get(url,
                 auth=(username, password),
                 verify=False,
                 allow_redirects=True,
                 stream=True)

chunk_size = 1024
for chunk in r.iter_content(chunk_size):
    # how do I load the data?
    pass

How can the data be loaded into Spark from the HTTP stream?

Note that it isn't possible to use the HDFS format for retrieving the data; WebHDFS must be used.

You can pre-generate an RDD of chunk boundaries and then use it to process the file inside the workers. For example:

def process(boundary):
    start, finish = boundary
    # download the file
    # process the downloaded content in the range [start, finish)
    items = []  # placeholder: fill with the parsed items
    return items  # return a list of items

partition_size = max(1, file_size // num_partition)
boundaries = [(i, min(i + partition_size, file_size))
              for i in range(0, file_size, partition_size)]
rdd = sc.parallelize(boundaries).flatMap(process)
df = sqlContext.createDataFrame(rdd)
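
One detail the outline above leaves open is that byte boundaries rarely fall on line boundaries. Below is a minimal sketch of one common convention, assuming newline-terminated CSV records with no newlines embedded in quoted fields: a range owns every line whose first byte lies in [start, finish), so it skips a partial first line and reads past finish to complete its last line. The name rows_in_range is hypothetical, and data stands in for the downloaded bytes. Inside a worker, the bytes could probably be fetched per range using the offset and length query parameters that WebHDFS accepts on op=OPEN, but that part is not shown here.

```python
import csv
import io


def rows_in_range(data, start, finish):
    """Return the CSV rows whose first byte lies in [start, finish).

    Hypothetical helper: `data` stands in for the downloaded file content.
    Assumes each record ends with a newline byte and contains no embedded
    newlines.
    """
    if start == 0:
        begin = 0
    else:
        # A line begins right after a newline, so search from start - 1.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        begin = newline + 1
    if begin >= finish:
        # No line begins inside this range.
        return []
    # Extend to the end of the last line that begins before `finish`.
    newline = data.find(b"\n", finish - 1)
    end = len(data) if newline == -1 else newline + 1
    text = data[begin:end].decode("utf-8")
    return list(csv.reader(io.StringIO(text)))


# Usage: split a small in-memory "file" into 5-byte ranges.
data = b"a,b\n1,2\n3,4\n5,6\n"
boundaries = [(i, min(i + 5, len(data))) for i in range(0, len(data), 5)]
rows = [row for b in boundaries for row in rows_in_range(data, *b)]
# Every row is produced by exactly one range, in file order.
```

With this convention the ranges can be processed independently, since no row is dropped or duplicated at a boundary. The header row, if any, is returned like any other row and still has to be filtered out.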
