public static class StatsRulesProcFactory.GroupByStatsRule extends StatsRulesProcFactory.DefaultStatsRule implements NodeProcessor
Suppose if we are grouping by attributes A,B,C and if statistics for columns A,B,C are available then a better estimate can be found by taking the smaller of product of V(R,[A,B,C]) (product of distinct cardinalities of A,B,C) and T(R)/2.
T(R) = min (T(R)/2 , V(R,[A,B,C]) ---> [1]
In the presence of grouping sets, map-side GBY will emit more rows depending on the size of grouping set (input rows * size of grouping set). These rows will get reduced because of map-side hash aggregation. Hash aggregation is an optimization in hive to reduce the number of rows shuffled between map and reduce stage. This optimization will be disabled if the memory used for hash aggregation exceeds 90% of max available memory for hash aggregation. The number of rows emitted from map-side will vary if hash aggregation is enabled throughout execution or disabled. In the presence of grouping sets, following rules will be applied
If hash-aggregation is enabled, for query SELECT * FROM table GROUP BY (A,B) WITH CUBE
T(R) = min(T(R)/2, T(R, GBY(A,B)) + T(R, GBY(A)) + T(R, GBY(B)) + 1))
where, GBY(A,B), GBY(B), GBY(B) are the GBY rules mentioned above [1]
If hash-aggregation is disabled, apply the GBY rule [1] and then multiply the result by number of elements in grouping set T(R) = T(R) * length_of_grouping_set. Since we do not know if hash-aggregation is enabled or disabled during compile time, we will assume worst-case i.e, hash-aggregation is disabled
NOTE: The number of rows from map-side GBY operator is dependent on map-side parallelism i.e, number of mappers. The map-side parallelism is expected from hive config "hive.stats.map.parallelism". If the config is not set then default parallelism of 1 will be assumed.
Worst case: If no column statistics are available, then T(R) = T(R)/2 will be used as heuristics.
For more information, refer 'Estimating The Cost Of Operations' chapter in "Database Systems: The Complete Book" by Garcia-Molina et. al.
Constructor and Description |
---|
StatsRulesProcFactory.GroupByStatsRule() |
public StatsRulesProcFactory.GroupByStatsRule()
public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx, Object... nodeOutputs) throws SemanticException
NodeProcessor
process
in interface NodeProcessor
process
in class StatsRulesProcFactory.DefaultStatsRule
nd
- operator to processprocCtx
- operator processor contextnodeOutputs
- A variable argument list of outputs from other nodes in the walkSemanticException
Copyright © 2017 The Apache Software Foundation. All rights reserved.