public class SparkUtilities extends Object
Constructor and Description |
---|
SparkUtilities() |
Modifier and Type | Method and Description |
---|---|
static void |
collectOp(Collection<Operator<?>> result,
Operator<?> root,
Class<?> clazz)
Recursively find all operators under root, that are of class clazz or are the sub-class of clazz, and
put them in result.
|
static <T extends Operator<?>> |
collectOp(Operator<?> root,
Class<T> cls,
Collection<T> result,
Set<Operator<?>> seen)
Collect operators of type T starting from root.
|
static org.apache.hadoop.io.BytesWritable |
copyBytesWritable(org.apache.hadoop.io.BytesWritable bw) |
static HiveKey |
copyHiveKey(HiveKey key) |
static SparkTask |
createSparkTask(HiveConf conf) |
static SparkTask |
createSparkTask(SparkWork work,
HiveConf conf) |
static SparkPartitionPruningSinkOperator |
findReusableDPPSink(Operator<? extends OperatorDesc> branchingOP,
List<Operator<? extends OperatorDesc>> list) |
static org.apache.hadoop.fs.Path |
generateTmpPathForPartitionPruning(org.apache.hadoop.fs.Path basePath,
String id)
Generate a temporary path for dynamic partition pruning in Spark branch
TODO: no longer need this if we use accumulator!
|
static SparkSession |
getSparkSession(HiveConf conf,
SparkSessionManager sparkSessionManager) |
static String |
getWorkId(BaseWork work)
Return the ID for this BaseWork, in String form.
|
static boolean |
isDedicatedCluster(org.apache.hadoop.conf.Configuration conf) |
static boolean |
isDirectDPPBranch(Operator<?> op) |
static boolean |
needUploadToHDFS(URI source,
org.apache.spark.SparkConf sparkConf) |
static void |
removeEmptySparkTask(SparkTask currTask)
Removes currTask from the children list of its parent task, and
removes currTask from the parent list of each of its child tasks.
|
static void |
removeNestedDPP(OptimizeSparkProcContext procContext)
For DPP sinks w/ common join, we'll split the tree and what's above the branching
operator is computed multiple times.
|
static String |
reverseDNSLookupURL(String url) |
static URI |
uploadToHDFS(URI source,
HiveConf conf)
Uploads a local file to HDFS
|
public static org.apache.hadoop.io.BytesWritable copyBytesWritable(org.apache.hadoop.io.BytesWritable bw)
public static URI uploadToHDFS(URI source, HiveConf conf) throws IOException
source
- conf
- IOException
public static boolean needUploadToHDFS(URI source, org.apache.spark.SparkConf sparkConf)
public static boolean isDedicatedCluster(org.apache.hadoop.conf.Configuration conf)
public static SparkSession getSparkSession(HiveConf conf, SparkSessionManager sparkSessionManager) throws HiveException
HiveException
public static org.apache.hadoop.fs.Path generateTmpPathForPartitionPruning(org.apache.hadoop.fs.Path basePath, String id)
basePath
- id
-
public static String getWorkId(BaseWork work)
work
- the input BaseWork
public static void collectOp(Collection<Operator<?>> result, Operator<?> root, Class<?> clazz)
result
- all operators under root that are of class clazz
root
- the root operator under which all operators will be examined
clazz
- class to collect. Must NOT be null.
public static <T extends Operator<?>> void collectOp(Operator<?> root, Class<T> cls, Collection<T> result, Set<Operator<?>> seen)
public static void removeEmptySparkTask(SparkTask currTask)
currTask
-
public static SparkPartitionPruningSinkOperator findReusableDPPSink(Operator<? extends OperatorDesc> branchingOP, List<Operator<? extends OperatorDesc>> list)
public static void removeNestedDPP(OptimizeSparkProcContext procContext)
public static boolean isDirectDPPBranch(Operator<?> op)
public static String reverseDNSLookupURL(String url) throws UnknownHostException
UnknownHostException
Copyright © 2022 The Apache Software Foundation. All rights reserved.