When troubleshooting Hive performance issues with Tez as the execution engine, it may be necessary to increase the number of mappers used by a query.
In the example query examined here, Tez decided that only one mapper was needed.
However, reading the data with that single mapper took over three minutes (log extract from `yarn logs -applicationId <ApplicationID>`):
2016-06-21 11:12:46,100 INFO [TezChild] log.PerfLogger: </PERFLOG method=TezInitializeOperators start=1466503964475 end=1466503966100 duration=1625 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
2016-06-21 11:12:46,107 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 1
2016-06-21 11:12:46,122 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 10
2016-06-21 11:12:46,208 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 100
2016-06-21 11:12:46,461 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 1000
2016-06-21 11:12:47,569 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 10000
2016-06-21 11:12:52,257 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 100000
2016-06-21 11:13:36,467 INFO [TezChild] exec.MapOperator: MAP[5]: records read - 1000000
2016-06-21 11:13:47,485 INFO [TezChild] exec.Utilities: Could not find plan string in conf
2016-06-21 11:13:47,518 INFO [TezChild] orc.ReaderImpl: Reading ORC rows from hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts/delta_0046869_0046869/bucket_00001 with {include: null, offset: 0, length: 9223372036854775807}
2016-06-21 11:13:47,535 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts
2016-06-21 11:14:31,069 INFO [TezChild] exec.Utilities: Could not find plan string in conf
2016-06-21 11:14:31,099 INFO [TezChild] orc.ReaderImpl: Reading ORC rows from hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts/delta_0046869_0046869/bucket_00002 with {include: null, offset: 0, length: 9223372036854775807}
2016-06-21 11:14:31,115 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts
2016-06-21 11:15:15,038 INFO [TezChild] exec.Utilities: Could not find plan string in conf
2016-06-21 11:15:15,068 INFO [TezChild] orc.ReaderImpl: Reading ORC rows from hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts/delta_0046869_0046869/bucket_00003 with {include: null, offset: 0, length: 9223372036854775807}
2016-06-21 11:15:15,082 INFO [TezChild] io.HiveContextAwareRecordReader: Processing file hdfs://PIVOTALCLUSTER/Data/Upstream/StructuredData/DATA/MAXDATA.MOW.workorder_mow.orc_4bkts
2016-06-21 11:15:58,673 INFO [TezChild] exec.MapOperator: 5 finished. closing...
2016-06-21 11:15:58,673 INFO [TezChild] exec.MapOperator: DESERIALIZE_ERRORS:0
2016-06-21 11:15:58,673 INFO [TezChild] exec.MapOperator: RECORDS_IN_Map_1:4882272
2016-06-21 11:15:58,673 INFO [TezChild] log.PerfLogger: </PERFLOG method=TezRunProcessor start=1466503964235 end=1466504158673 duration=194438 from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
By setting four mappers to read the data, we reduced the mapper phase from about 194 seconds (the `TezRunProcessor` duration of 194438 ms in the log above) to 76 seconds.
To set the number of mappers manually for a Hive query when Tez is the execution engine, use the `tez.grouping.split-count` configuration property.
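As a minimal sketch of one way to apply this property, assuming it is set at session level in the Hive CLI or Beeline before the query runs, the statements below request four split groups. The table name `example_table` is hypothetical and stands in for the table being queried. Tez treats the value as the desired number of grouped splits, so the mapper count actually chosen may not match it exactly in every case.

```sql
-- Assumption: these statements run in the same Hive session as the query.
SET hive.execution.engine=tez;

-- Ask Tez to group the input splits into four groups (one mapper per group).
SET tez.grouping.split-count=4;

-- Hypothetical query; substitute the table that was slow to read.
SELECT COUNT(*) FROM example_table;
```

A session-level setting like this affects only the current session, which makes it convenient for testing the effect on a single query before changing anything cluster-wide.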
Further information on Hive performance troubleshooting is available through our partner Hortonworks here: