When executing sum against a decimal column, Hive returns NULL as the value.
hive> select sum(retail_price) from retail_repro;
Job 0: Map: 10  Reduce: 1   Cumulative CPU: 530.03 sec   HDFS Read: 2600574562 HDFS Write: 3 SUCCESS
Total MapReduce CPU Time Spent: 8 minutes 50 seconds 30 msec
OK
NULL
However, if you cast the column to double or bigint, the query returns a result instead of NULL:
hive> select sum(double(retail_price)) from retail_repro;
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 16.48 sec   HDFS Read: 30644180 HDFS Write: 21 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 480 msec
OK
5.1994505839339055E7

hive> select sum(bigint(retail_price)) from retail_repro;
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 14.31 sec   HDFS Read: 30644180 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 310 msec
OK
48073762
Note that some values in the retail_price column are 32 digits long. For example, one of the values in retail_price is 0.0000000000000008881784197001252.
Because the maximum precision (MAX_PRECISION) of a HiveDecimal is 38 digits in v0.12.0, the SUM function returns NULL whenever HiveDecimal throws a NumberFormatException (NFE). The problem in this case is that the mapper process does not log any error when the merge function hits the NFE. As shown below, the catch block simply sets sum = null and returns.
./ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFSum.java

public void merge(AggregationBuffer agg, Object partial) throws HiveException {
  ...
  try {
    myagg.sum = myagg.sum.add(
        PrimitiveObjectInspectorUtils.getHiveDecimal(partial, inputOI));
  } catch (NumberFormatException e) {
    myagg.sum = null;
  }
You will need to insert a debug statement into the catch block to capture the stack trace and confirm the exception.
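For example, a minimal sketch of such a statement, assuming a Commons Logging LOG field is available in GenericUDAFSum (the class declares one) and using Hadoop's StringUtils to render the stack trace as a string:

import org.apache.hadoop.util.StringUtils;
...
  } catch (NumberFormatException e) {
    // Hypothetical debug statement: log the swallowed NFE and its full
    // stack trace before nulling out the sum, so it appears in the mapper syslog.
    LOG.error("NFE encountered during merge:" + StringUtils.stringifyException(e));
    myagg.sum = null;
  }

With a statement like this in place, rerunning the query surfaces the exception in the mapper container's syslog, as shown below.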
2014-11-30 22:14:40,654 ERROR [main] org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum: NFE encountered during merge:java.lang.NumberFormatException: Assignment would result in truncation
	at org.apache.hadoop.hive.common.type.HiveDecimal.<init>(HiveDecimal.java:53)
	at org.apache.hadoop.hive.common.type.HiveDecimal.<init>(HiveDecimal.java:47)
	at org.apache.hadoop.hive.common.type.HiveDecimal.add(HiveDecimal.java:178)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum$GenericUDAFSumHiveDecimal.merge(GenericUDAFSum.java:159)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum$GenericUDAFSumHiveDecimal.iterate(GenericUDAFSum.java:127)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:183)
	at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:629)
	at org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:826)
	at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:723)
	at org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:791)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
	at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
As a workaround, cast the column to double.
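The cast can be written either with the double() conversion used earlier or with standard CAST syntax, for example:

hive> select sum(cast(retail_price as double)) from retail_repro;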
This behavior has changed in Hive 0.13.x as per Hive_Decimal_Precision_Scale_Support. By tuning precision and scale in the table definition, you can control how HiveDecimal rounds the digits after the decimal point.
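For example, a sketch of a 0.13.x table definition with explicit precision and scale; the table name and the choice of decimal(38,10) are illustrative, and values with more fractional digits than the declared scale are rounded to that scale, so pick a scale that fits your data:

hive> create table retail_repro_13 (retail_price decimal(38,10));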
In addition, the error handling is improved so that the NumberFormatException will be caught and printed to the mapper container syslog.