hadoop - Get line number in map method using FileInputFormat


I was wondering if it is possible to get line numbers in my map method? My input file is only a column of values, such as

    Apple
    Orange
    Banana

Is it possible to get key: 1, value: Apple; key: 2, value: Orange; and so on in my map method?

I am using CDH3 / CDH4. Changing the input data so that KeyValueInputFormat can be used is not an option. Thanks in advance.

The default behavior of input formats such as TextInputFormat is to give the byte offset of each record rather than the actual line number. This is primarily because the true line number cannot be determined when the input file is splittable and is being processed by two or more mappers.
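To make the byte-offset keys concrete, here is a small plain-Java sketch with no Hadoop dependency. It computes the offset at which each line of the sample input starts, which is the kind of key TextInputFormat hands to the mapper. The class name `ByteOffsetKeys` and the method `lineStartOffsets` are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: not part of Hadoop.
class ByteOffsetKeys {

    /** Returns the byte offset at which each line of the input starts. */
    public static List<Long> lineStartOffsets(byte[] data) {
        List<Long> offsets = new ArrayList<>();
        long lineStart = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                offsets.add(lineStart);   // key for the line that just ended
                lineStart = i + 1;        // next line begins after the newline
            }
        }
        if (lineStart < data.length) {
            offsets.add(lineStart);       // final line without a trailing newline
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] data = "Apple\nOrange\nBanana\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(lineStartOffsets(data)); // prints [0, 6, 13]
    }
}
```

A mapper whose split begins at byte 6 knows that its first key is 6, but it cannot tell that this is line 2 without reading the bytes that belong to another mapper's split.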

You could create your own InputFormat, based on TextInputFormat and the associated LineRecordReader, that produces line numbers instead of byte offsets, but it would need to return false from the isSplitable method (meaning that a large input file will not be processed by multiple mappers). If you have small files, or files close in size to the HDFS block size, this should not be a problem. Also, non-splittable compression formats (e.g. GZip, .gz) already mean that the entire file is processed by a single mapper.
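A minimal sketch of such an input format might look like the following. It assumes the new `org.apache.hadoop.mapreduce` API and needs the Hadoop libraries on the classpath. The names `LineNumberInputFormat` and `LineNumberRecordReader` are hypothetical. Rather than reimplementing line reading, the reader wraps Hadoop's own `LineRecordReader` and replaces the byte-offset key with a counter. Treat it as a starting point, not a tested implementation for any particular CDH release.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical class name; keys are 1-based line numbers, values are line text.
public class LineNumberInputFormat extends FileInputFormat<LongWritable, Text> {

    // Never split a file, so one reader sees every line from the start.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new LineNumberRecordReader();
    }

    // Delegates line reading to LineRecordReader and counts the records.
    public static class LineNumberRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader lines = new LineRecordReader();
        private final LongWritable lineNumber = new LongWritable(0);

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            lines.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (!lines.nextKeyValue()) {
                return false;                      // end of file
            }
            lineNumber.set(lineNumber.get() + 1);  // first line gets key 1
            return true;
        }

        @Override
        public LongWritable getCurrentKey() {
            return lineNumber;
        }

        @Override
        public Text getCurrentValue() {
            return lines.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException {
            return lines.getProgress();
        }

        @Override
        public void close() throws IOException {
            lines.close();
        }
    }
}
```

It would be enabled in the driver with `job.setInputFormatClass(LineNumberInputFormat.class)`. Note that the counter restarts at 1 for every input file, since each file gets its own reader.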
