numpy - How to load a DataFrame from a Python requests stream that is downloading a CSV file?
I would like to create a DataFrame from a CSV file that I retrieve via streaming:
    import requests

    url = "https://{0}:8443/gateway/default/webhdfs/v1/{1}?op=open".format(host, filepath)
    r = requests.get(url,
                     auth=(username, password),
                     verify=False,
                     allow_redirects=True,
                     stream=True)

    chunk_size = 1024
    for chunk in r.iter_content(chunk_size):
        # how do I load the data?
        pass
How can the data be loaded into Spark from an HTTP stream?
Note that it isn't possible to use the HDFS format for retrieving the data - WebHDFS must be used.
You can pre-generate an RDD of chunk boundaries, then use it to process the file inside the workers, so that each worker downloads and parses only its own byte range. For example:
    def process(bounds):
        # flatMap passes each (start, finish) tuple as a single argument
        start, finish = bounds
        rows = []
        # download the part of the file in the range [start, finish)
        # parse the downloaded content and append each item to rows
        return rows

    partition_size = max(1, file_size // num_partitions)
    boundaries = [(i, min(i + partition_size, file_size))
                  for i in range(0, file_size, partition_size)]
    rdd = sc.parallelize(boundaries).flatMap(process)
    df = sqlContext.createDataFrame(rdd)
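The skeleton above leaves open what happens when a CSV row straddles two byte ranges. The following is a minimal sketch of one way to handle that, not necessarily what the answer's author had in mind: a row belongs to the range in which its first byte falls, so each worker skips a leading partial row and reads a little past its end to finish its last row. The names `make_boundaries`, `read_rows`, `fetch_range`, `lookahead` and `has_header` are hypothetical. `fetch_range(offset, length)` is assumed to return the requested bytes; with WebHDFS it could be a `requests.get` on the `op=OPEN` URL with the `offset` and `length` query parameters, and `file_size` could come from `op=GETFILESTATUS`. The sketch also assumes UTF-8 data with no newlines inside quoted fields.

```python
import csv


def make_boundaries(file_size, num_partitions):
    """Split [0, file_size) into at most num_partitions half-open byte ranges."""
    size = max(1, -(-file_size // num_partitions))  # ceiling division
    return [(start, min(start + size, file_size))
            for start in range(0, file_size, size)]


def read_rows(bounds, fetch_range, file_size, lookahead=64, has_header=True):
    """Return the CSV rows whose first byte lies in [start, finish)."""
    start, finish = bounds
    # Fetch one extra byte before `start` to learn whether a row begins there.
    begin = max(start - 1, 0)
    data = fetch_range(begin, finish - begin)

    # Read past `finish` until the last row of this range is complete.
    end = finish
    while end < file_size and not data.endswith(b"\n"):
        more = fetch_range(end, min(lookahead, file_size - end))
        if not more:
            break
        newline = more.find(b"\n")
        if newline >= 0:
            data += more[:newline + 1]
            break
        data += more
        end += len(more)

    # Drop the leading partial row; it belongs to an earlier range.
    if start > 0:
        newline = data.find(b"\n")
        if newline < 0:
            return []
        data = data[newline + 1:]

    rows = [row for row in csv.reader(data.decode("utf-8").splitlines()) if row]
    if start == 0 and has_header:
        rows = rows[1:]  # the header row is only present in the first range
    return rows
```

With Spark, `read_rows` would take the place of `process`, for example `sc.parallelize(make_boundaries(file_size, num_partitions)).flatMap(lambda b: read_rows(b, fetch_range, file_size))`, provided `fetch_range` is a function that can be pickled and shipped to the workers.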