Hadoop - Get line number in map method using FileInputFormat
I was wondering if it is possible to get line numbers in my map method. My input file is just a single column of values, such as
Apple
Orange
Banana
Is it possible to get key: 1, value: Apple, key: 2, value: Orange, ... in my map method?
I am using CDH3 / CDH4. Changing the input data so that I can use KeyValueInputFormat is not an option. Thanks in advance.
The default behavior of input formats such as TextInputFormat is to emit the byte offset of each record rather than the actual line number. This is primarily because the true line number cannot be determined when the input file is splittable and is being processed by two or more mappers.
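To make the reasoning concrete, here is a small plain-Java sketch (no Hadoop involved) that imitates a reader starting at an arbitrary byte position. The class name `OffsetVsLineNumber`, the `read` method, and the `offset:localLine:text` output format are hypothetical and exist only for this illustration. It assumes the start position falls on a line boundary, which a real split does not guarantee.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class OffsetVsLineNumber {

    // Reads newline-terminated records beginning at byte position `start`
    // (assumed to be a line boundary) and returns one entry per record in
    // the form "byteOffset:localLineCount:text".
    static List<String> read(byte[] data, int start) {
        List<String> out = new ArrayList<>();
        int localLine = 0; // counts only the lines this reader has seen
        int pos = start;
        while (pos < data.length) {
            int end = pos;
            while (end < data.length && data[end] != '\n') {
                end++;
            }
            localLine++;
            String text = new String(data, pos, end - pos, StandardCharsets.UTF_8);
            out.add(pos + ":" + localLine + ":" + text);
            pos = end + 1; // skip the newline
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "Apple\nOrange\nBanana\n".getBytes(StandardCharsets.UTF_8);
        // A single reader over the whole file: offsets and line counts agree.
        System.out.println(read(data, 0));
        // A second reader that starts at byte 6: the offsets are still
        // correct, but its local line count restarts at 1.
        System.out.println(read(data, 6));
    }
}
```

The reader that starts at byte 6 reports `Orange` at offset 6, the same offset the whole-file reader reports, but it numbers that line 1 instead of 2. It has no way to know how many lines precede its start position without reading them, which is why the byte offset is the only key that stays consistent across splits.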
You can create your own input format, based on TextInputFormat and the associated LineRecordReader, that produces line numbers instead of byte offsets. It must, however, be configured to return false from the isSplitable method (meaning that a large input file will not be processed by multiple mappers). If you have small files, or files that are close in size to the HDFS block size, then this should not be a problem. In addition, non-splittable compression formats (e.g. GZip .gz) mean that the entire file will be processed by a single mapper anyway.
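A minimal sketch of the custom input format described above might look like the following. It assumes the new `org.apache.hadoop.mapreduce` API and needs the Hadoop MapReduce libraries on the classpath. The names `LineNumberInputFormat` and `LineNumberRecordReader` are hypothetical. It wraps the stock LineRecordReader rather than copying it, which is one possible approach and not necessarily the one the answer had in mind.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical input format: key = 1-based line number, value = line text.
public class LineNumberInputFormat extends FileInputFormat<LongWritable, Text> {

    // Never split a file, so one reader sees every line from the start
    // and its running count is the true line number.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new LineNumberRecordReader();
    }

    // Delegates the actual reading to LineRecordReader and replaces the
    // byte-offset key with a counter.
    static class LineNumberRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private final LongWritable lineNumber = new LongWritable(0);

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            delegate.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (!delegate.nextKeyValue()) {
                return false;
            }
            lineNumber.set(lineNumber.get() + 1);
            return true;
        }

        @Override
        public LongWritable getCurrentKey() {
            return lineNumber;
        }

        @Override
        public Text getCurrentValue() {
            return delegate.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}
```

The format would be registered on the job with `job.setInputFormatClass(LineNumberInputFormat.class)`, after which the map method receives a `LongWritable` line number as its key. Note that the count restarts at 1 for each input file, because every file gets its own reader.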