hadoop - HDFS compression at the block level


One of the big issues with HDFS compression: if you compress a file, you have to deal with splittable compression. Why does HDFS require you to compress an entire file, and not instead implement compression at the HDFS block level?

That would solve the problem: a 64 MB block is read or written in a single chunk, it's big enough to compress well, and it doesn't interfere with operations or require splittable compression.
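To make the idea concrete, here is a minimal sketch in plain Python, not HDFS code. It assumes zlib as the codec and a 64 KB block as a stand-in for a 64 MB HDFS block; `compress_blocks` and `read_block` are hypothetical names used only for illustration. Each block is compressed on its own, so any single block can be decompressed without looking at its neighbours.

```python
import zlib

# Assumption: 64 KB stands in for a 64 MB HDFS block to keep the demo small.
BLOCK_SIZE = 64 * 1024


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Compress each fixed-size block independently of its neighbours."""
    return [
        zlib.compress(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


def read_block(blocks: list[bytes], index: int) -> bytes:
    """Decompress one block without touching any other block."""
    return zlib.decompress(blocks[index])


# 320,000 bytes of repetitive data, which spans five 64 KB blocks.
data = b"some fairly repetitive log line\n" * 10_000
blocks = compress_blocks(data)

# The second block can be read on its own, with no splittable codec involved.
second = read_block(blocks, 1)
```

Note that the file format itself stays oblivious here: the splitting happens at the storage layer, which is exactly what the question proposes.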

Are there any implementations of this?

I'm speculating here, but I can see several problems.

HDFS contains a feature called local short-circuit reads. It allows the DataNode to open the block file, validate security, and pass the file descriptor on to the application running on the same node. This completely bypasses any file transfer via HTTP or other means from HDFS to the M/R app (or to whatever HDFS app is reading the file). On performant clusters short-circuit reads are the norm rather than the exception, since the processing occurs where the split is located. What you describe would require the reader to understand the block compression in order to read the block.

Other considerations relate to splits that span blocks. Compressed formats in general lack random access and require sequential access. Reading the last few bytes of a block, to make up a split that spans into the next block, could be as expensive as reading the entire block, because of the compression.
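The cost of reading the tail of a compressed block can be illustrated with a small sketch. It assumes a plain zlib stream as a stand-in for a non-seekable codec (real Hadoop codecs differ in detail), and `read_tail` is a hypothetical helper, not an HDFS API. To get the last few bytes, the whole stream has to be inflated from the start.

```python
import zlib


def read_tail(compressed: bytes, n: int, chunk_size: int = 4096) -> tuple[bytes, int]:
    """Return the last n (n > 0) uncompressed bytes and the number of bytes inflated."""
    inflater = zlib.decompressobj()
    tail = b""
    inflated = 0
    # A zlib stream has no index, so we must walk it from the first byte.
    for offset in range(0, len(compressed), chunk_size):
        piece = inflater.decompress(compressed[offset:offset + chunk_size])
        inflated += len(piece)
        tail = (tail + piece)[-n:]  # keep only the trailing n bytes seen so far
    piece = inflater.flush()
    inflated += len(piece)
    tail = (tail + piece)[-n:]
    return tail, inflated


# One "block" of 100,000 fixed-width 16-byte records (1.6 MB uncompressed).
block = b"".join(b"record %08d\n" % i for i in range(100_000))
compressed = zlib.compress(block)

# Asking for only the final record still inflates the entire block.
tail, inflated = read_tail(compressed, 16)
```

In this sketch `inflated` ends up equal to `len(block)`, even though only 16 bytes were wanted, which is the overhead a split reader would pay at every block boundary.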

I'm not saying that block compression is impossible, but I feel it is more complex than you'd expect.

Besides, block compression can be transparently delegated to the underlying filesystem.

And, last but not least, better compression formats exist at data layers above HDFS: ORC, Parquet.

