Overview

The CRUSH (and MultiRoundCRUSH) algorithm ensures consistent partition distribution. It is based on pseudo-random partition placement: as the number of placed items (in our case, partitions) approaches infinity, the distribution approaches perfectly uniform.
However, it behaves worst with a small number of placed items. In particular, for single-tenant clusters with a small number of partitions, we may see an uneven partition distribution over the instances in the cluster.
According to data collected from our PROD environment, the distribution calculated by CRUSH is not satisfactory: there is about a 50% load difference between the most heavily loaded nodes and the most lightly loaded ones. Even with MultiRoundCRUSH, the difference is still about 30%.
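
As a toy illustration of this statistical effect (this is plain hash placement, not CRUSH itself, and all names here are ours), the sketch below hashes partition names onto nodes and reports the load spread. With few partitions the spread is wide; it narrows as the partition count grows.

import java.util.Arrays;

public class HashPlacementSkew {
  public static void main(String[] args) {
    int nodes = 10;
    for (int partitions : new int[]{20, 200, 20000}) {
      int[] load = new int[nodes];
      for (int p = 0; p < partitions; p++) {
        // Pseudo-random placement: hash the partition name to pick a node.
        load[Math.floorMod(("partition_" + p).hashCode(), nodes)]++;
      }
      System.out.printf("%d partitions on %d nodes: min=%d max=%d%n", partitions,
          nodes, Arrays.stream(load).min().getAsInt(), Arrays.stream(load).max().getAsInt());
    }
  }
}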

Alternatively, a card dealing strategy generates a more even distribution in such cases. However, compared with CRUSH, it causes many more extra partition movements when the cluster topology changes (a movement is "extra" if a host instance has one partition moved out and another moved in). The legacy AutoRebalanceStrategy, which is also based on card dealing, resolves this issue by taking the current mapping as an input and changing the mapping only for the delta. This means the algorithm is not deterministic. Worse, it does not support fault zone configuration. So the legacy AutoRebalanceStrategy is not an option.

Given that a deterministic algorithm is required, we find that load balance and minimal partition movement are hard to achieve at the same time. A trade-off between uniform distribution and partition movement during cluster changes is needed.

In this document, we propose a hybrid algorithm based on CRUSH that ensures both an even distribution and minimal extra partition movement. We call it CRUSH-ed.
The basic idea is to run additional rounds of rebalancing on the unevenly assigned partitions. Note that CRUSH-ed is stable (deterministic) and fault zone aware.

According to our experiments, CRUSH-ed results in a much more uniform distribution and very few extra partition movements (when nodes change unexpectedly) compared with the original CRUSH. The cost is additional runtime for the re-assignment calculation.
We believe CRUSH-ed achieves a better partition assignment in most cases.

Design

In general, we have two goals:

  1. Even distribution.
  2. Minimize partition movements when nodes are changed.

CRUSH has a very small movement count, but its distribution is not optimal.

As for MultiRoundCRUSH, it is designed for even distribution. The idea is to run CRUSH multiple times, each time applying it only to the partitions on overloaded (spiking) nodes, so the distribution eventually converges toward even.
However, the number of iterations required is not bounded, and there are more partition movements when the node topology changes. As a result, it more or less fails both goals. Still not ideal.
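
A rough sketch of the multi-round idea (a toy: the salted hash below stands in for CRUSH, and all names are ours, not the real implementation's):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiRoundSketch {
  // Toy stand-in for CRUSH: salted pseudo-random placement.
  static int place(String partition, int round, int nodeCount) {
    return Math.floorMod((partition + "#" + round).hashCode(), nodeCount);
  }

  public static void main(String[] args) {
    int nodeCount = 10, partitionCount = 40;
    int cap = (int) Math.ceil(partitionCount / (double) nodeCount);
    Map<String, Integer> assignment = new HashMap<>();
    for (int p = 0; p < partitionCount; p++) {
      assignment.put("p" + p, place("p" + p, 0, nodeCount));
    }
    // Each round re-places only the partitions on overloaded nodes; the
    // number of rounds needed to converge is not bounded.
    for (int round = 1; round <= 10; round++) {
      int[] kept = new int[nodeCount];
      List<String> spiking = new ArrayList<>();
      for (Map.Entry<String, Integer> e : assignment.entrySet()) {
        if (++kept[e.getValue()] > cap) spiking.add(e.getKey());
      }
      if (spiking.isEmpty()) break;
      for (String p : spiking) assignment.put(p, place(p, round, nodeCount));
    }
    System.out.println(assignment);
  }
}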

Since we already have a good base, we built CRUSH-ed on top of CRUSH. It modifies the uneven partition assignment generated by CRUSH as follows.
Note that the blue part stays unchanged before and after.

Before (CRUSH)

After (new strategy)

Since the problem is NP-hard, we do not expect the optimal assignment; a greedy algorithm works well enough.
After trying different designs, we found it hard to achieve both goals (even distribution and fewer movements) with a single strategy. So we decided to apply a hybrid algorithm that finishes the work step by step.

Step 1, run CRUSH to get a base assignment.
The base assignment usually contains a certain number of unevenly placed partitions, so the following steps redistribute them.

Step 2, run a card dealing algorithm on the uneven parts, assigning them to idle nodes.
This algorithm is conceptually simple. The result ensures that all partitions are assigned to instances with minimal count difference. Note that when fault zones come into play, our greedy algorithm may not be able to find a valid result, because a candidate assignment may have a fault zone conflict. So we add a buffer to tolerate a small amount of unevenness.
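
A minimal sketch of the card dealing pass, under simplifying assumptions (one list of partitions per node, no fault zone checks; the names are ours, not Helix's):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CardDealingSketch {
  // Greedy card dealing: repeatedly move one partition from the heaviest node
  // to the lightest node until per-node counts differ by at most one. The real
  // pass additionally rejects moves that cause fault zone conflicts and keeps
  // a small buffer of unevenness when no conflict-free move exists.
  static void deal(Map<String, List<String>> nodeToPartitions) {
    Collection<List<String>> nodes = nodeToPartitions.values();
    while (true) {
      List<String> heaviest = Collections.max(nodes, Comparator.comparingInt(List::size));
      List<String> lightest = Collections.min(nodes, Comparator.comparingInt(List::size));
      if (heaviest.size() - lightest.size() <= 1) {
        return;
      }
      lightest.add(heaviest.remove(heaviest.size() - 1));
    }
  }

  public static void main(String[] args) {
    Map<String, List<String>> map = new HashMap<>();
    map.put("node0", new ArrayList<>(Arrays.asList("p0", "p1", "p2", "p3", "p4")));
    map.put("node1", new ArrayList<>(Arrays.asList("p5")));
    map.put("node2", new ArrayList<>());
    deal(map);
    System.out.println(map); // every node ends up with exactly 2 partitions
  }
}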

Example of assignments after step 2:

Step 3, shuffle the partitions' preference lists.
Since replica states are assigned according to the node order in these lists, if the lists are randomly ordered, the states will also be random, which may cause an uneven state distribution.
To resolve this issue, before the final step 4, CRUSH-ed executes a simplified card dealing algorithm on the replica order within each preference list. This results in a much more even state distribution.
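
A simplified sketch of the idea (only the head of each list, which receives the top state such as MASTER, is dealt here; all names are ours):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StateShuffleSketch {
  // For each partition, promote to the head of the preference list the node
  // that currently holds the fewest head slots, so top states spread evenly.
  static void shuffle(Map<String, List<String>> preferenceLists) {
    Map<String, Integer> headCount = new HashMap<>();
    for (List<String> list : preferenceLists.values()) {
      String best = Collections.min(list,
          Comparator.comparingInt((String n) -> headCount.getOrDefault(n, 0)));
      list.remove(best);
      list.add(0, best);
      headCount.merge(best, 1, Integer::sum);
    }
  }

  public static void main(String[] args) {
    Map<String, List<String>> lists = new LinkedHashMap<>();
    lists.put("p0", new ArrayList<>(Arrays.asList("n1", "n2", "n3")));
    lists.put("p1", new ArrayList<>(Arrays.asList("n1", "n2", "n3")));
    lists.put("p2", new ArrayList<>(Arrays.asList("n1", "n2", "n3")));
    shuffle(lists);
    System.out.println(lists); // heads are n1, n2, n3: one top state each
  }
}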

Example of master distribution before step 3:

Example of master distribution after step 3:

Step 4, re-calculate the assignment for the partitions on temporarily disabled nodes using a consistent hashing algorithm.
Consistent hashing ensures minimal partition movement.
Note that the first 3 steps use the full node list, regardless of disabled or offline nodes, so the assignment is stable even though the algorithm contains random factors such as hashCode. Step 4 then ensures all disabled nodes are handled correctly without causing huge partition movements.
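
A generic consistent hash ring sketch (not the exact Helix implementation; the names and parameters are ours). Removing a node from the ring relocates only the partitions that were on that node:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ConsistentHashSketch {
  private final TreeMap<Integer, String> ring = new TreeMap<>();

  // Put each live node on the ring at several virtual points to spread load.
  ConsistentHashSketch(Collection<String> liveNodes, int virtualPoints) {
    for (String node : liveNodes) {
      for (int v = 0; v < virtualPoints; v++) {
        ring.put((node + "@" + v).hashCode(), node);
      }
    }
  }

  // A partition maps to the first node clockwise from its hash position.
  String placementOf(String partition) {
    Map.Entry<Integer, String> e = ring.ceilingEntry(partition.hashCode());
    return e != null ? e.getValue() : ring.firstEntry().getValue();
  }

  public static void main(String[] args) {
    List<String> nodes = new ArrayList<>(Arrays.asList("n0", "n1", "n2", "n3"));
    ConsistentHashSketch before = new ConsistentHashSketch(nodes, 100);
    nodes.remove("n0"); // simulate n0 going down
    ConsistentHashSketch after = new ConsistentHashSketch(nodes, 100);
    for (int p = 0; p < 12; p++) {
      String id = "p" + p;
      if (!before.placementOf(id).equals(after.placementOf(id))) {
        // Only partitions previously placed on n0 show up here.
        System.out.println(id + ": " + before.placementOf(id) + " -> " + after.placementOf(id));
      }
    }
  }
}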

Assume the first node is down. After step 4, the final assignment and master distribution are as follows. Note that there is no assignment on the first node anymore.


One potential issue with such a heuristic algorithm is that it may not converge. In that case, CRUSH-ed falls back to plain CRUSH.
Pseudocode is listed below.

Pseudo Code

// Round 1: Calculate mapping using the base strategy.
// Note: use all nodes, to minimize the influence of live node changes.
origPartitionMap = getBaseRebalanceStrategy().computePartitionAssignment(allNodes, clusterData);

// Transform the current assignment into an instance -> partitions map.
nodeToPartitionMap = convertMap(origPartitionMap);

// Round 2: Rebalance mapping using card dealing algorithm.
Topology allNodeTopo = new Topology(allNodes, clusterData);
cardDealer.computeMapping(allNodeTopo, nodeToPartitionMap);

// Since states are assigned according to preference list order, shuffle the preference lists for an even state distribution.
shufflePreferenceList(nodeToPartitionMap);

// Round 3: Re-map the partitions on non-live nodes using consistent hashing to reduce movement.
// Consistent hashing ensures minimal movement when nodes are disabled unexpectedly.
if (!liveNodes.containsAll(allNodes)) {
  Topology liveNodeTopo = new Topology(liveNodes, clusterData);
  hashPlacement.computeMapping(liveNodeTopo, nodeToPartitionMap);
}

if (!nodeToPartitionMap.isEmpty()) {
  // Rounds 2 and 3 finished successfully.
  return convertMap(nodeToPartitionMap);
} else {
  // Fall back to the base strategy (plain CRUSH on the live nodes) if the extra rounds failed to converge.
  return getBaseRebalanceStrategy().computePartitionAssignment(liveNodes, clusterData);
}
}

The re-assigning logic is generic and can be applied to any base algorithm. So we created an abstract class for the common logic, and then strategy classes for Helix users. The Java class layout looks like the following.
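
A simplified sketch of that layout (the class names match Helix's, but the signatures here are simplified stand-ins, not the exact Helix API):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified strategy interface (stand-in for the Helix one).
interface RebalanceStrategy {
  Map<String, List<String>> computePartitionAssignment(List<String> allNodes);
}

// CRUSH itself, used as the base strategy (stubbed out here).
class CrushRebalanceStrategy implements RebalanceStrategy {
  public Map<String, List<String>> computePartitionAssignment(List<String> allNodes) {
    return new HashMap<>(); // the real class runs CRUSH
  }
}

// Holds the generic re-assigning rounds; any base strategy can be plugged in.
abstract class AbstractEvenDistributionRebalanceStrategy implements RebalanceStrategy {
  protected abstract RebalanceStrategy getBaseRebalanceStrategy();

  public Map<String, List<String>> computePartitionAssignment(List<String> allNodes) {
    Map<String, List<String>> base = getBaseRebalanceStrategy().computePartitionAssignment(allNodes);
    // ... card dealing, preference list shuffle, consistent hash re-mapping ...
    return base;
  }
}

// The user-facing strategy: CRUSH plus the extra rounds.
class CrushEdRebalanceStrategy extends AbstractEvenDistributionRebalanceStrategy {
  protected RebalanceStrategy getBaseRebalanceStrategy() {
    return new CrushRebalanceStrategy();
  }
}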


Cap on participant load difference using CRUSH-ed

For each resource, CRUSH-ed tries its best to keep the partition count difference at most ONE (as long as such an assignment is possible given the fault zones). The worst case happens when all of these one-partition differences land on the same node: N resources each contribute one additional partition to that node, so the node has N additional partitions in total. Theoretically, the maximum difference between the most heavily loaded node and the other nodes is therefore the number of resources in the cluster. For example, in a cluster with 10 resources, one node could carry at most 10 more partitions than the lightest node.

Not a globally uniform, node-capacity-aware distribution solution

A global distribution algorithm has to take the other resources' partition distributions as part of its input. Since the result then depends on the distributions of all resources, it would be very unstable: any change in any resource changes it. As a result, the rebalance process may cause a large number of partition movements.
Therefore, the solution in this document cannot be used to calculate a globally uniform distribution, nor to do node-capacity-aware throttling.
This requirement will be investigated in another initiative. Please refer to Future Works.

Experiment

We tested CRUSH-ed by simulating rebalancing using topology data from real production clusters. We tested multiple scenarios:

  • Distribution based on cluster topology.
  • Disabling hosts to simulate hosts down.
  • Adding hosts to simulate expansion.
  • Rolling upgrade.

All results show that CRUSH-ed generates a more uniform global distribution than CRUSH.
Moreover, partition movements in most scenarios are minimized, except for the host expansion case, which can be controlled by configuring state transition throttling.

Partition Distribution

The following charts demonstrate the worst cases (min load vs. max load) and the STDEVs of the partition/master distributions of several Espresso PROD clusters.
Measured by STDEV, CRUSH-ed reduces partition distribution unevenness by 87% on average compared with CRUSH. For master replica distribution, the evenness improvement is 68% on average and 81% at maximum.


In addition, even clusters that already have a good distribution under CRUSH may see uneven real resource usage because of variance in partition sizes.
For example, Venice-0 has more than 50K replicas, so even with CRUSH the partition distribution is already good: the STDEV is 32.
However, because the partitions differ considerably in storage usage, the nodes' storage usage STDEV is 222, and the max disk usage is about 2956GB.

With CRUSH-ed, the partition distribution improves a little: the STDEV is 6.
Moreover, since every resource is uniformly distributed, the total storage usage is much more even than with CRUSH: the STDEV is 30, and the max disk usage is about 2644GB.

Given that host resources are reserved according to max usage, CRUSH-ed can save (2956 - 2644) / 2956 = 10.5% of the cost.


Disabling Nodes

When nodes go offline or are disabled, CRUSH-ed re-assigns their partitions to other live nodes. The algorithm ensures that only the necessary partitions are moved.
We simulated disabling a certain number of nodes and checked the resulting partition movements and master changes. We also used the expected movement (the partition/master count on the disabled nodes) as a baseline to measure whether any extra movements happened.

The results show that movement is highly correlated with the proportion of disabled nodes, and extra movements are minor (0 in most cases).

Note that "Rate" in this document means the changed count divided by the total partition or master count.


Rolling upgrade

A rolling upgrade differs from disabling nodes. Since nodes are reset one by one, in this test we assume at most 2 nodes differ at any time (for example, upgrading Node A and then Node B).
In this case, movements are still minimized.

Note that in real PROD clusters, since the admin can configure a delayed rebalance scheduler, there won't be any movement at all.
The following chart shows the maximum recorded movements and master changes during the whole rolling upgrade process. Even in the worst case, extra partition movements and master changes are close to 0%.

Adding Nodes

Adding nodes (cluster expansion) changes the topology. Since CRUSH-ed leverages card dealing for evenness, when the topology changes the movements are much greater than with CRUSH.
As we evaluated, the moved partitions are mostly the unevenly assigned ones.

According to our experiment using all Espresso clusters, the extra partition movement rate can be as high as 23%, and master changes as high as 58%. The percentages are again relative to the total partition or master count.

Note that the extra change rate is not correlated with the number of added nodes. So our recommendation is to finish an expansion in one operation, so that the partitions are shuffled only once.


Algorithm Performance

We compared CRUSH-ed with CRUSH using different instance counts. The tests were executed multiple times, and we recorded the median numbers.
The results are shown below.

Note that CRUSH-ed does not cost much additional calculation time for regular rebalancing: at most 30% more runtime than CRUSH, and less than MultiRoundCRUSH.


However, when nodes are down, CRUSH-ed needs to run an additional consistent-hashing-based redistribution, so the calculation time is much longer: more than 3 times that of CRUSH.


With some performance improvements, such as caching to avoid duplicate calculations, we greatly reduced CRUSH-ed's running time. According to our experiment, it is now close to MultiRoundCRUSH.


Additional Tests for Corner Cases

These tests show that the algorithm still works in extreme cases.

ESPRESSO_IDENTITY

This cluster has only one resource. We collected the distribution details for all 4 fabrics.


100 nodes, 100 resources and 101 partitions for each resource

The result is good: the STDEV is 0.11, the most loaded node has 104 partitions, and the least loaded node has 100 partitions.

100 nodes, 1 resource with 10100 partitions

The result is strictly even: the STDEV of the distribution is 0.

Test without Fault Zone

Distribution with CRUSH-ed

| Total Replica | Min Replica | Max Replica | Partition STDDEV | Min Master | Max Master | Master STDDEV |
| --- | --- | --- | --- | --- | --- | --- |
| 30720 | 520 | 523 | 0.09886598 | 170 | 177 | 0.21981398 |

Partition Change on Node down (disabled)

| Number of Node Changes | Total Partition Movements | Moved Partitions / Total Partitions (%) | Extra Movements (not on the changed nodes) | Total Master Changes | Moved Masters / Total Masters (%) | Extra Changes (not on the changed nodes) |
| --- | --- | --- | --- | --- | --- | --- |
| -7 | 3642 | 11.8554688 | 0 | 1212 | 11.8359375 | 0.09765625 |

Test with 5 Fault Zones (11 participants in one zone and 12 participants in the rest)

Distribution with CRUSH-ed

| Total Replica | Min Replica | Max Replica | Partition STDDEV | Min Master | Max Master | Master STDDEV |
| --- | --- | --- | --- | --- | --- | --- |
| 30720 | 518 | 524 | 0.12537857 | 169 | 177 | 0.20728582 |

Partition Change on Random Node down (disabled)

| Number of Node Changes | Total Partition Movements | Moved Partitions / Total Partitions (%) | Extra Movements (not on the changed nodes) | Total Master Changes | Moved Masters / Total Masters (%) | Extra Changes (not on the changed nodes) |
| --- | --- | --- | --- | --- | --- | --- |
| -7 | 3646 | 11.8684896 | 0 | 1206 | 11.7773438 | 0 |

Partition Change on a Whole Fault Zone down (disabled)

| Number of Node Changes | Total Partition Movements | Moved Partitions / Total Partitions (%) | Extra Movements (not on the changed nodes) | Total Master Changes | Moved Masters / Total Masters (%) | Extra Changes (not on the changed nodes) |
| --- | --- | --- | --- | --- | --- | --- |
| -12 | 6248 | 20.3385417 | 0 | 2079 | 20.3027344 | 0 |

Partition Change when Half of the Disabled Fault Zone Comes Back

| Number of Node Changes | Total Partition Movements | Moved Partitions / Total Partitions (%) | Extra Movements (not on the changed nodes) | Total Master Changes | Moved Masters / Total Masters (%) | Extra Changes (not on the changed nodes) |
| --- | --- | --- | --- | --- | --- | --- |
| +6 | 3767 | 12.2623698 | 0 | 1043 | 10.1855469 | 0 |

Additional Tests to Validate Strategy Stability

  • Disable and re-enable participants: the distribution is recovered.
  • Keep bouncing the cluster by resetting random participants: the distribution remains good.

Conclusion

CRUSH-ed achieves a more uniform distribution than CRUSH, at the cost of a longer running time and more partition movement when the cluster topology changes.

Moreover, according to our experiments, MultiRoundCRUSH-FEAP can achieve the most uniform distribution, but the gain does not justify its additional cost. So it is not recommended.

User Guide

  1. Ensure the resource is using FULL_AUTO mode.
  2. Set the rebalance strategy to "org.apache.helix.controller.rebalancer.strategy.CrushEdRebalanceStrategy" (see the sketch below).
  3. Expect more partition movement on topology changes when using the new strategy.
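
One way to apply these settings programmatically is via the Helix Java admin API. A minimal sketch, assuming a standard setup (the ZooKeeper address, cluster name, and resource name are placeholders):

import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class EnableCrushEd {
  public static void main(String[] args) {
    ZKHelixAdmin admin = new ZKHelixAdmin("localhost:2181"); // placeholder ZK address
    IdealState idealState = admin.getResourceIdealState("MyCluster", "MyResource");
    idealState.setRebalanceMode(IdealState.RebalanceMode.FULL_AUTO);
    idealState.setRebalanceStrategy(
        "org.apache.helix.controller.rebalancer.strategy.CrushEdRebalanceStrategy");
    admin.setResourceIdealState("MyCluster", "MyResource", idealState);
    admin.close();
  }
}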

IdealState SimpleFields Example

HELIX_ENABLED : "true"
IDEAL_STATE_MODE : "AUTO_REBALANCE"
REBALANCE_MODE : "FULL_AUTO"
REBALANCE_STRATEGY : "org.apache.helix.controller.rebalancer.strategy.CrushEdRebalanceStrategy"
MIN_ACTIVE_REPLICAS : "0"
NUM_PARTITIONS : "64"
REBALANCER_CLASS_NAME : "org.apache.helix.controller.rebalancer.DelayedAutoRebalancer"
REPLICAS : "1"
STATE_MODEL_DEF_REF : "LeaderStandby"

Future Works

Instance Level Capacity Limitation

Currently, each resource is assigned separately.
The pro of this design is that a change to one resource won't cause partitions of other resources to be re-assigned.
The cons are:

  1. It's hard to ensure strict global uniform distribution.
  2. Instance-level capacity control is not possible, given that the algorithm does not have a global view of the partition assignment.

Rebalance Algorithm Takes Partition Weight into Consideration

This algorithm still considers all partitions to be equally weighted, but in reality, different partitions may have different resource requirements.
Application admins need to configure partition weights, and Helix should assign partitions accordingly.

Note that this feature only makes sense when applied to a global assignment algorithm, since partitions within the same resource are weighted the same.
