public static class StatsRulesProcFactory.JoinStatsRule extends StatsRulesProcFactory.DefaultStatsRule implements NodeProcessor
In the absence of histograms, we can use the following general case
2 Relations, 1 attribute
T(RXS) = (T(R)*T(S))/max(V(R,Y), V(S,Y)) where Y is the join attribute
2 Relations, 2 attributes
T(RXS) = T(R)*T(S) / (max(V(R,y1), V(S,y1)) * max(V(R,y2), V(S,y2))), where y1 and y2 are the join attributes
3 Relations, 1 attribute
T(RXSXQ) = T(R)*T(S)*T(Q)/top2largest(V(R,y), V(S,y), V(Q,y)), where y is the join attribute and top2largest is the product of the two largest of its arguments
3 Relations, 2 attributes
T(RXSXQ) = T(R)*T(S)*T(Q) / (top2largest(V(R,y1), V(S,y1), V(Q,y1)) * top2largest(V(R,y2), V(S,y2), V(Q,y2))), where y1 and y2 are the join attributes
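The four cases above follow one pattern: multiply the row counts, then for each join attribute divide by the product of every distinct-value count except the smallest. The following is a minimal sketch of that pattern, not the code used by this class. The class name JoinCardinalitySketch and its methods are hypothetical, and the "all but the smallest" generalization is an assumption that reduces to max for two relations and to top2largest for three.

```java
import java.util.Arrays;

// Hypothetical illustration of the formulas above; not part of Hive.
final class JoinCardinalitySketch {

    // Denominator term for one join attribute: the product of all V(., y)
    // values except the smallest. Two relations -> max; three -> top2largest.
    static double denominatorFor(long[] distinctCounts) {
        long[] sorted = distinctCounts.clone();
        Arrays.sort(sorted);
        double product = 1.0;
        for (int i = 1; i < sorted.length; i++) {
            // Guard against a zero distinct count so we never divide by zero.
            product *= Math.max(1L, sorted[i]);
        }
        return product;
    }

    // rowCounts[i] is T(relation i); distinctCounts[a][i] is V(relation i, attribute a).
    static long estimateRows(long[] rowCounts, long[][] distinctCounts) {
        double numerator = 1.0;
        for (long t : rowCounts) {
            numerator *= t;
        }
        double denominator = 1.0;
        for (long[] perAttribute : distinctCounts) {
            if (perAttribute.length != rowCounts.length) {
                throw new IllegalArgumentException("one V value per relation is required");
            }
            denominator *= denominatorFor(perAttribute);
        }
        return (long) (numerator / denominator);
    }

    public static void main(String[] args) {
        // 2 relations, 1 attribute: 1000 * 500 / max(100, 50)
        System.out.println(estimateRows(new long[] {1000, 500},
                new long[][] {{100, 50}})); // 5000
        // 2 relations, 2 attributes: 1000 * 500 / (max(100, 50) * max(10, 20))
        System.out.println(estimateRows(new long[] {1000, 500},
                new long[][] {{100, 50}, {10, 20}})); // 250
        // 3 relations, 1 attribute: 1000 * 500 * 200 / (100 * 50)
        System.out.println(estimateRows(new long[] {1000, 500, 200},
                new long[][] {{100, 50, 20}})); // 20000
    }
}
```

The row counts and distinct-value counts in main are made-up inputs chosen so the arithmetic is easy to check by hand.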
Worst case: If no column statistics are available, then T(RXS) = joinFactor * max(T(R), T(S)) * (numParents - 1) is used as a heuristic. joinFactor is taken from the hive.stats.join.factor Hive configuration property. In the worst case we know nothing about the join keys (and hence which of the cases above to use), so we leave it to the user to provide the join factor.
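The worst-case heuristic can be sketched in the same way. This is an illustration under stated assumptions, not the implementation in this class: JoinFallbackSketch is a hypothetical name, the maximum is taken over all parents when there are more than two, and the join factor 1.5 is an arbitrary example value rather than a Hive default.

```java
// Hypothetical illustration of the worst-case heuristic; not part of Hive.
final class JoinFallbackSketch {

    // joinFactor stands in for the value of hive.stats.join.factor.
    // parentRowCounts holds T(.) for each parent operator of the join.
    static long estimateRows(double joinFactor, long[] parentRowCounts) {
        if (parentRowCounts.length < 2) {
            throw new IllegalArgumentException("a join needs at least two parents");
        }
        long maxRows = 0L;
        for (long t : parentRowCounts) {
            maxRows = Math.max(maxRows, t);
        }
        int numParents = parentRowCounts.length;
        // joinFactor * max(T(R), T(S), ...) * (numParents - 1)
        return (long) (joinFactor * maxRows * (numParents - 1));
    }

    public static void main(String[] args) {
        // Two parents: 1.5 * max(1000, 500) * (2 - 1)
        System.out.println(estimateRows(1.5, new long[] {1000, 500})); // 1500
        // Three parents: 1.5 * max(1000, 500, 200) * (3 - 1)
        System.out.println(estimateRows(1.5, new long[] {1000, 500, 200})); // 3000
    }
}
```

Because the estimate scales linearly with joinFactor, a user who knows the join is selective can lower the factor, while one who expects fan-out can raise it.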
For more information, refer to the 'Estimating The Cost Of Operations' chapter in "Database Systems: The Complete Book" by Garcia-Molina et al.
Constructor and Description |
---|
StatsRulesProcFactory.JoinStatsRule() |
public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx, Object... nodeOutputs) throws SemanticException
Specified by:
process in interface NodeProcessor
Overrides:
process in class StatsRulesProcFactory.DefaultStatsRule
Parameters:
nd - operator to process
procCtx - operator processor context
nodeOutputs - A variable argument list of outputs from other nodes in the walk
Throws:
SemanticException
Copyright © 2017 The Apache Software Foundation. All rights reserved.