Hadoop Hive Input Format Selection


Input formats play a very important role in Hive performance. The primary choices of input format are Text, SequenceFile, RCFile, and ORC.

Text Input Format :-

i) The default format; JSON and CSV data are also stored as plain text

ii) Slow to read and write

iii) Can't split compressed files (leads to huge map tasks)

iv) Needs to read/decompress all fields, even when only a few columns are queried.
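For instance, a plain-text table can be declared as sketched below (the table and column names are illustrative, not from the original post):

```sql
-- TEXTFILE is Hive's default storage format, so STORED AS TEXTFILE
-- is optional here; it is written out for clarity.
CREATE TABLE logs_text (
  ts      STRING,
  level   STRING,
  message STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```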

Sequence File Input Format :-

i) Traditional MapReduce binary file format

a) Stores keys and values as writable classes

b) Not a natural fit for Hive, which works with SQL types

c) Hive always stores the entire row as the value

ii) Default block size is 1 MB

iii) Needs to read and decompress all the fields
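A SequenceFile-backed table might be declared as follows (again, the table name is illustrative; Hive writes each row as the value of a key/value pair and leaves the key unused, as noted above):

```sql
-- Stored as a binary SequenceFile; rows are still read and
-- decompressed in full, so there is no column pruning benefit.
CREATE TABLE logs_seq (
  ts      STRING,
  level   STRING,
  message STRING
)
STORED AS SEQUENCEFILE;
```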

RC (Record Columnar File) Input Format :-

i) Columns stored separately

a) Only the needed columns are read and decompressed

b) Better compression

ii) Columns stored as binary blobs

a) Depends on the metastore to supply data types

iii) Large blocks

a) 4 MB by default

b) Still has to search the file for split boundaries
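An RCFile table looks just like the earlier sketches, differing only in the storage clause (table and column names are illustrative):

```sql
-- Stored as RCFile: columns are written separately, so a query
-- touching only a few columns reads and decompresses less data.
CREATE TABLE logs_rc (
  ts      STRING,
  level   STRING,
  message STRING
)
STORED AS RCFILE;
```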

ORC (Optimized Row Columnar) Input Format :-

i) Columns stored separately

ii) Knows the data types

a) Uses type-specific encoders

b) Stores statistics (min, max, sum, count)

iii) Has a lightweight index

a) Skips over blocks of rows that don't matter

iv) Larger blocks

a) 256 MB by default

b) Has an index for block boundaries, so splits don't require scanning the file
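An ORC table can be declared as sketched below; `orc.compress` is a standard ORC table property, while the table and column names are illustrative:

```sql
-- Stored as ORC with ZLIB compression; ORC's built-in statistics
-- and lightweight index let Hive skip row groups during scans.
CREATE TABLE logs_orc (
  ts      STRING,
  level   STRING,
  message STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");
```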
