How to manually set the number of mappers in a Hive job running on Tez

Products

Services Suite

Issue/Introduction

While troubleshooting Hive performance issues with the Tez execution engine, it may be necessary to increase the number of mappers used by a query.

In the example below, the Tez engine decided that only one mapper was needed for the query.

However, reading the data with a single mapper took over three minutes (log extract from `yarn logs -applicationId <ApplicationID>`):

2016-06-21 11:12:46,100 INFO [TezChild] log.PerfLogger: </PERFLOG method=TezInitializeOperators start=1466503964475 end=1466503966100 duration=1625 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
2016-06-21 11:12:46,107 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 1
2016-06-21 11:12:46,122 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 10
2016-06-21 11:12:46,208 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 100
2016-06-21 11:12:46,461 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 1000
2016-06-21 11:12:47,569 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 10000
2016-06-21 11:12:52,257 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 100000
2016-06-21 11:13:36,467 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 1000000
2016-06-21 11:13:47,485 INFO [TezChild] exec.Utilities: Could not find plan string in conf
2016-06-21 11:13:47,518 INFO [TezChild] orc.ReaderImpl: Reading ORC rows from hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts/delta_0046869_0046869/bucket_00001 with {include: null, offset: 0, length: 9223372036854775807}
2016-06-21 11:13:47,535 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts
2016-06-21 11:14:31,069 INFO [TezChild] exec.Utilities: Could not find plan string in conf
2016-06-21 11:14:31,099 INFO [TezChild] orc.ReaderImpl: Reading ORC rows from hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts/delta_0046869_0046869/bucket_00002 with {include: null, offset: 0, length: 9223372036854775807}
2016-06-21 11:14:31,115 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts
2016-06-21 11:15:15,038 INFO [TezChild] exec.Utilities: Could not find plan string in conf
2016-06-21 11:15:15,068 INFO [TezChild] orc.ReaderImpl: Reading ORC rows from hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts/delta_0046869_0046869/bucket_00003 with {include: null, offset: 0, length: 9223372036854775807}
2016-06-21 11:15:15,082 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts
2016-06-21 11:15:58,673 INFO [TezChild] exec.MapOperator: 5 finished. closing...
2016-06-21 11:15:58,673 INFO [TezChild] exec.MapOperator: DESERIALIZE_ERRORS:0
2016-06-21 11:15:58,673 INFO [TezChild] exec.MapOperator: RECORDS_IN_Map_1:4882272
2016-06-21 11:15:58,673 INFO [TezChild] log.PerfLogger: </PERFLOG method=TezRunProcessor start=1466503964235 end=1466504158673 duration=194438 from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
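The total mapper run time can be read from the closing `TezRunProcessor` PERFLOG entry, whose `duration` field is in milliseconds (it equals `end` minus `start`). The following is a minimal sketch of extracting it with standard shell tools. It assumes the log has been saved to a local file; the temporary file and the embedded sample line are for illustration only, and a job with several mappers would produce one such line per task.

```shell
#!/bin/sh
# Sample line copied from the extract above. In practice the file would be
# produced with: yarn logs -applicationId <ApplicationID> > some_file
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
2016-06-21 11:15:58,673 INFO [TezChild] log.PerfLogger: </PERFLOG method=TezRunProcessor start=1466503964235 end=1466504158673 duration=194438 from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
EOF

# Extract the duration (milliseconds) from the TezRunProcessor PERFLOG entry.
duration_ms=$(sed -n 's/.*method=TezRunProcessor .*duration=\([0-9][0-9]*\).*/\1/p' "$log_file")

# Integer division is enough for a rough figure in seconds.
duration_s=$((duration_ms / 1000))
echo "TezRunProcessor took ${duration_ms} ms (about ${duration_s} s)"

rm -f "$log_file"
```

For the extract above this reports 194438 ms, or roughly 194 seconds, which is the "over three minutes" figure quoted for the single mapper.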

After the number of mappers was set to four, the mapper phase completed in 76 seconds.

Environment

Resolution

To manually set the number of mappers for a Hive query when Tez is the execution engine, use the `tez.grouping.split-count` property in one of two ways:

Set it in the Hive CLI session before running the query. For example, `set tez.grouping.split-count=4` will create four mappers.
Add an entry to `hive-site.xml` through Ambari. If the property is set in `hive-site.xml`, Hive must be restarted for the change to take effect.
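As a rough sketch of the session-level approach, the property can be set immediately before the query within the same Hive session. The table name `workorder_mow` and the `COUNT(*)` query are placeholders and are not taken from the original case. The `hive.execution.engine=tez` line is included on the assumption that Tez may not be the default engine, and `hive -e` is used only so the example fits in a single command on a node where the Hive CLI is installed.

```shell
#!/bin/sh
# Number of mappers to request; 4 matches the example in this article.
split_count=4

# HiveQL to run: select Tez, set the split count, then run the query.
# The table name is a placeholder for illustration.
hiveql="set hive.execution.engine=tez;
set tez.grouping.split-count=${split_count};
SELECT COUNT(*) FROM workorder_mow;"

# Show what will be submitted.
echo "$hiveql"

# Only attempt to run it where the Hive CLI is available.
if command -v hive >/dev/null 2>&1; then
    hive -e "$hiveql"
fi
```

Because the setting is made with `set` inside the session, it applies only to that session and does not require a Hive restart.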

Additional Information

Further information on Hive performance troubleshooting is available through our partner Hortonworks here: