public enum BucketCodec extends Enum<BucketCodec>
Interprets the value returned by RecordIdentifier.getBucketProperty(). Up until ASF Hive 3.0 this
field was simply the bucket ID. Since 3.0 the value is bit-packed to store several things:
top 3 bits - version describing the format (so there can be at most 8 versions).
The rest is version specific - see below.

Enum Constant and Description |
---|
V0 - This is the "legacy" version. |
V1 - Represents format of "bucket" property in Hive 3.0. |
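The version check described above (format version in the top 3 bits of the 32-bit bucket property) can be sketched as follows. This is a minimal illustration, not Hive's implementation: the class `VersionBitsSketch` and the method `versionOf` are hypothetical names introduced here.

```java
// Hypothetical helper, not part of Hive: reads the format version from a bucket property.
final class VersionBitsSketch {
    // Number of high-order bits that hold the format version (from the description above).
    static final int VERSION_BITS = 3;

    /** Returns the version code stored in the top 3 bits of a 32-bit bucket property. */
    static int versionOf(int bucketProperty) {
        // Unsigned shift, so a set sign bit is treated as data rather than sign-extended.
        return bucketProperty >>> (Integer.SIZE - VERSION_BITS);
    }

    public static void main(String[] args) {
        int legacy = 5;               // a plain pre-3.0 bucket ID: top 3 bits are 000
        int packed = (1 << 29) | 5;   // version code 1 placed in the top 3 bits
        System.out.println(versionOf(legacy));
        System.out.println(versionOf(packed));
    }
}
```

Because a small legacy bucket ID leaves the top 3 bits at 000, the same check reports version 0 for data written before Hive 3.0 without any migration step.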
Modifier and Type | Method and Description |
---|---|
abstract int | decodeStatementId(int bucketProperty) |
abstract int | decodeWriterId(int bucketProperty) - For bucketed tables this is the bucketId, otherwise the writerId. |
static BucketCodec | determineVersion(int bucket) |
abstract int | encode(AcidOutputFormat.Options options) |
static BucketCodec | getCodec(int version) |
int | getVersion() |
static BucketCodec | valueOf(String name) - Returns the enum constant of this type with the specified name. |
static BucketCodec[] | values() - Returns an array containing the constants of this enum type, in the order they are declared. |
public static final BucketCodec V0
This is the "legacy" version. The bucket
value just has the bucket ID in it.
The numeric code for this version is 0. (Assumes the bucket ID fits in the low 29 bits, which
implies the top 3 bits are 000, so data written before Hive 3.0 is readable with this scheme.)

public static final BucketCodec V1
Represents format of "bucket" property in Hive 3.0.
By including both the bucket ID and the statement ID in RecordIdentifier
we ensure that RecordIdentifier is unique.
The intent is that sorting rows by RecordIdentifier groups rows in the same physical
bucket next to each other.
For any row created by a given version of Hive, the top 3 bits are constant. The next
most significant bits are the bucket ID, then the statement ID. This ensures that
SortedDynPartitionOptimizer works, since it is
designed so that each task only needs to keep 1 writer open at a time. It could be
configured such that a single writer sees data for multiple buckets, so it must "group" data
by bucket ID (and then sort within each bucket as required). This is achieved by sorting
by RecordIdentifier, which includes RecordIdentifier.getBucketProperty(),
which has the actual bucket ID in the high-order bits. This scheme also ensures that
FileSinkOperator.process(Object, int) works in case
numBuckets > numReducers. (The latter could be fixed by changing how writers are
initialized in "if (fpaths.acidLastBucket != bucketNum) {".)

public static BucketCodec[] values()
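Returns an array containing the constants of this enum type, in the order they are declared.
This method may be used to iterate over the constants as follows:
for (BucketCodec c : BucketCodec.values()) System.out.println(c);

public static BucketCodec valueOf(String name)
Returns the enum constant of this type with the specified name.
Parameters: name - the name of the enum constant to be returned.
Throws: IllegalArgumentException - if this enum type has no constant with the specified name
Throws: NullPointerException - if the argument is null

public static BucketCodec determineVersion(int bucket)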
public static BucketCodec getCodec(int version)
public abstract int decodeWriterId(int bucketProperty)
public abstract int decodeStatementId(int bucketProperty)
public abstract int encode(AcidOutputFormat.Options options)
public int getVersion()
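To illustrate why the V1 description places the bucket ID in more significant bits than the statement ID, the sketch below packs both into an int and sorts the results. The field widths used here (3 version bits, 1 reserved bit, 12 bucket ID bits, 4 reserved bits, 12 statement ID bits) are an assumption of this sketch and are not stated on this page. The class `BucketPropertySketch` and its methods are hypothetical names; it does not use AcidOutputFormat.Options or any other Hive class.

```java
import java.util.Arrays;

// Hypothetical stand-in for a V1-style codec; the bit widths are assumed for illustration.
final class BucketPropertySketch {
    static final int VERSION = 1;        // version code stored in the top 3 bits
    static final int VERSION_SHIFT = 29; // 32 bits minus the 3 version bits
    static final int BUCKET_SHIFT = 16;  // assumed: bucket ID sits above 4 reserved + 12 statement bits
    static final int FIELD_MASK = 0xFFF; // assumed: 12 bits each for bucket ID and statement ID

    /** Packs a version code, bucket ID and statement ID into one int. */
    static int encode(int bucketId, int statementId) {
        if (bucketId < 0 || bucketId > FIELD_MASK) {
            throw new IllegalArgumentException("bucketId out of range: " + bucketId);
        }
        if (statementId < 0 || statementId > FIELD_MASK) {
            throw new IllegalArgumentException("statementId out of range: " + statementId);
        }
        return (VERSION << VERSION_SHIFT) | (bucketId << BUCKET_SHIFT) | statementId;
    }

    static int decodeBucketId(int bucketProperty) {
        return (bucketProperty >>> BUCKET_SHIFT) & FIELD_MASK;
    }

    static int decodeStatementId(int bucketProperty) {
        return bucketProperty & FIELD_MASK;
    }

    public static void main(String[] args) {
        int[] props = { encode(1, 0), encode(0, 2), encode(1, 1), encode(0, 0) };
        // Plain integer sort: the bucket ID dominates, the statement ID breaks ties.
        Arrays.sort(props);
        for (int p : props) {
            System.out.println(decodeBucketId(p) + ":" + decodeStatementId(p));
        }
    }
}
```

With the bucket ID in the higher bits, an ordinary integer sort keeps all values for one bucket contiguous, which is the grouping property a single open writer per task relies on.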
Copyright © 2022 The Apache Software Foundation. All rights reserved.