Helix Monitoring Metrics
Helix monitoring metrics are exposed as MBean attributes. Which MBeans are registered depends on the role of the instance.
The easiest way to see the available metrics is to point jconsole at a running Helix instance, which lets you browse all metrics over JMX.
Note that unless the attribute name indicates otherwise, every attribute is a gauge.
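Besides jconsole, the same MBeans can be listed programmatically through the standard JMX API. The following is a minimal sketch, not part of Helix: the class name `MetricBrowser` and the method `listAttributes` are hypothetical, and the example assumes the code runs inside the same JVM as the Helix instance. For a remote instance, the connection would instead come from a `JMXConnector`.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical helper, not a Helix class.
public class MetricBrowser {

  /** Returns the attribute names of every MBean matching the pattern, keyed by canonical ObjectName. */
  public static Map<String, List<String>> listAttributes(MBeanServerConnection conn, String pattern)
      throws Exception {
    Map<String, List<String>> result = new TreeMap<>();
    for (ObjectName name : conn.queryNames(new ObjectName(pattern), null)) {
      List<String> attributes = new ArrayList<>();
      for (MBeanAttributeInfo info : conn.getMBeanInfo(name).getAttributes()) {
        attributes.add(info.getName());
      }
      result.put(name.getCanonicalName(), attributes);
    }
    return result;
  }

  public static void main(String[] args) throws Exception {
    // In-process lookup against the platform MBean server.
    MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
    // "ClusterStatus:*" matches every MBean in the ClusterStatus domain.
    listAttributes(conn, "ClusterStatus:*")
        .forEach((name, attributes) -> System.out.println(name + " -> " + attributes));
  }
}
```

Changing the pattern to `HelixZkClient:*` or `CLMParticipantReport:*` lists the MBeans of the other domains described on this page.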
Metrics on Both Controller and Participant
MBean ZkClientMonitor
ObjectName: “HelixZkClient:type=[client-type],key=[specified-client-key],PATH=[zk-client-listening-path]”
Attributes | Description |
---|---|
ReadCounter | ZK read counter. Can be used to identify unusually high or low ZK traffic. |
WriteCounter | ZK write counter. Can be used to identify unusually high or low ZK traffic. |
ReadBytesCounter | ZK read bytes counter. Can be used to identify unusually high or low ZK traffic. |
WriteBytesCounter | ZK write bytes counter. Can be used to identify unusually high or low ZK traffic. |
StateChangeEventCounter | ZK connection state change counter. Can be used to identify an unstable ZkClient connection. |
DataChangeEventCounter | ZK node data change counter. Can be used to identify an unusually high or low rate of ZK events, or slow event processing. |
PendingCallbackGauge | Number of the pending Zk callbacks. |
TotalCallbackCounter | Number of total received Zk callbacks. |
TotalCallbackHandledCounter | Number of total handled Zk callbacks. |
ReadTotalLatencyCounter | Total read latency in ms. |
WriteTotalLatencyCounter | Total write latency in ms. |
WriteFailureCounter | Total write failures. |
ReadFailureCounter | Total read failures. |
ReadLatencyGauge | Histogram (with all statistic data) of read latency. |
WriteLatencyGauge | Histogram (with all statistic data) of write latency. |
ReadBytesGauge | Histogram (with all statistic data) of read bytes of single Zk access. |
WriteBytesGauge | Histogram (with all statistic data) of write bytes of single Zk access. |
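The cumulative counters in this table can be combined into an interval average. The sketch below assumes that ReadTotalLatencyCounter and ReadCounter are both monotonically increasing and cover the same set of read operations, which the table implies but does not state. The class and method names are hypothetical.

```java
// Hypothetical helper, not a Helix class.
public class ZkLatency {

  /**
   * Average latency in ms of the operations that happened between two samples.
   * Returns NaN when no operation was recorded in the interval.
   */
  public static double averageLatencyMs(
      long prevTotalLatencyMs, long prevCount, long curTotalLatencyMs, long curCount) {
    long operations = curCount - prevCount;
    if (operations <= 0) {
      return Double.NaN;
    }
    return (double) (curTotalLatencyMs - prevTotalLatencyMs) / operations;
  }

  public static void main(String[] args) {
    // Two samples of (ReadTotalLatencyCounter, ReadCounter) taken some time apart.
    System.out.println(averageLatencyMs(1000, 100, 1600, 300));
  }
}
```

The same calculation applies to WriteTotalLatencyCounter and WriteCounter under the same assumption.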
MBean HelixCallbackMonitor
ObjectName: “HelixCallback:Type=[callback-type],Key=[cluster-name].[instance-name],Change=[callback-change-type]”
Attributes | Description |
---|---|
Counter | Zk Callback counter for each Helix callback type. |
UnbatchedCounter | Unbatched Zk Callback counter for each helix callback type. |
LatencyCounter | Callback handler latency counter in ms. |
LatencyGauge | Histogram (with all statistic data) of Callback handler latency. |
MBean MessageQueueMonitor
ObjectName: “ClusterStatus:cluster=[cluster-name],messageQueue=[instance-name]”
Attributes | Description |
---|---|
MessageQueueBacklog | Get the message queue size |
Metrics on Controller only
MBean ClusterStatusMonitor
ObjectName: “ClusterStatus:cluster=[cluster-name]”
Attributes | Description |
---|---|
DisabledInstancesGauge | Current number of disabled instances |
DisabledPartitionsGauge | Current number of disabled partitions |
DownInstanceGauge | Current number of down instances |
InstanceMessageQueueBacklog | The sum of all message queue sizes for instances in this cluster |
InstancesGauge | Current number of live instances |
MaxMessageQueueSizeGauge | The maximum message queue size across all instances including controller |
RebalanceFailureGauge | Non-zero if the previous rebalance failed unexpectedly. The gauge is set every time a rebalance completes. |
RebalanceFailureCounter | The number of failures during rebalance pipeline. |
Enabled | 1 if cluster is enabled, otherwise 0 |
Maintenance | 1 if cluster is in maintenance mode, otherwise 0 |
Paused | 1 if cluster is paused, otherwise 0 |
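As an illustration of the ObjectName format above, the following sketch builds the name for one cluster and reads a set of attributes in a single call. `ClusterStatusReader`, its methods, and the cluster name `MyCluster` are hypothetical. The example assumes in-process access to the controller's MBean server.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper, not a Helix class.
public class ClusterStatusReader {

  /** Builds ClusterStatus:cluster=[cluster-name]. */
  public static ObjectName clusterStatusName(String clusterName)
      throws MalformedObjectNameException {
    return new ObjectName("ClusterStatus", "cluster", clusterName);
  }

  /** Reads the named attributes; attributes that cannot be read are omitted from the result. */
  public static Map<String, Object> read(
      MBeanServerConnection conn, ObjectName name, String... attributeNames) throws Exception {
    Map<String, Object> values = new LinkedHashMap<>();
    for (Attribute attribute : conn.getAttributes(name, attributeNames).asList()) {
      values.put(attribute.getName(), attribute.getValue());
    }
    return values;
  }

  public static void main(String[] args) throws Exception {
    MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
    ObjectName name = clusterStatusName("MyCluster"); // hypothetical cluster name
    if (conn.isRegistered(name)) {
      System.out.println(read(conn, name, "Enabled", "Maintenance", "Paused"));
    } else {
      System.out.println(name + " is not registered in this JVM");
    }
  }
}
```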
MBean ClusterEventMonitor
ObjectName: “ClusterStatus:cluster=[cluster-name],eventName=ClusterEvent,phaseName=[event-handling-phase]”
Attributes | Description |
---|---|
TotalDurationCounter | Total event process duration for each stage. |
MaxSingleDurationGauge | Max event process duration for each stage within the recent hour. |
EventCounter | The number of events processed in each stage. |
DurationGauge | Histogram (with all statistic data) of event process duration for each stage. |
MBean InstanceMonitor
ObjectName: “ClusterStatus:cluster=[cluster-name],instanceName=[instance-name]”
Attributes | Description |
---|---|
Online | This instance is Online (1) or Offline (0) |
Enabled | This instance is Enabled (1) or Disabled (0) |
TotalMessageReceived | Number of messages sent to this instance by controller |
DisabledPartitions | Get the total number of disabled partitions for this instance |
MBean ResourceMonitor
ObjectName: “ClusterStatus:cluster=[cluster-name],resourceName=[resource-name]”
Attributes | Description |
---|---|
PartitionGauge | Get the number of partitions in the best possible ideal state for this resource |
ErrorPartitionGauge | Get the number of partitions currently in ERROR state for this resource |
DifferenceWithIdealStateGauge | Get the number of replicas whose current state differs from the ideal state for this resource |
MissingTopStatePartitionGauge | Get the number of partitions that do not have a top state replica for this resource |
ExternalViewPartitionGauge | Get number of partitions in ExternalView for this resource |
TotalMessageReceived | Get number of messages sent to this resource by controller |
LoadRebalanceThrottledPartitionGauge | Get number of partitions that need load rebalance but were throttled. |
RecoveryRebalanceThrottledPartitionGauge | Get number of partitions that need recovery rebalance but were throttled. |
PendingLoadRebalancePartitionGauge | Get number of partitions that have pending load rebalance requests. |
PendingRecoveryRebalancePartitionGauge | Get number of partitions that have pending recovery rebalance requests. |
MissingReplicaPartitionGauge | Get number of partitions that have fewer replicas than expected. |
MissingMinActiveReplicaPartitionGauge | Get number of partitions that have fewer replicas than the minimum requirement. |
MaxSinglePartitionTopStateHandoffDurationGauge | Get the max duration recorded when the top state is missing in any single partition. |
FailedTopStateHandoffCounter | Get the total number of failed top state transitions. |
SucceededTopStateHandoffCounter | Get the number of total top state transition done successfully. |
SuccessfulTopStateHandoffDurationCounter | Get the total duration of all top state transitions. |
PartitionTopStateHandoffDurationGauge | Histogram (with all statistic data) of top state transition duration. |
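The handoff counters above lend themselves to two derived values: a failure ratio and a mean handoff duration. The sketch below is an assumption-laden illustration. It presumes that SuccessfulTopStateHandoffDurationCounter accumulates the duration of successful handoffs only, as its name suggests, and it makes no claim about the time unit, which the table does not state. The class and method names are hypothetical.

```java
// Hypothetical helper, not a Helix class.
public class TopStateHandoffStats {

  /** Fraction of handoffs that failed; 0.0 when no handoff has been recorded. */
  public static double failureRatio(long failedCount, long succeededCount) {
    long total = failedCount + succeededCount;
    return total == 0 ? 0.0 : (double) failedCount / total;
  }

  /** Mean duration per successful handoff, in the counter's own unit; NaN when there were none. */
  public static double meanSuccessfulDuration(long totalDuration, long succeededCount) {
    return succeededCount == 0 ? Double.NaN : (double) totalDuration / succeededCount;
  }

  public static void main(String[] args) {
    // Sample values for FailedTopStateHandoffCounter, SucceededTopStateHandoffCounter
    // and SuccessfulTopStateHandoffDurationCounter.
    System.out.println(failureRatio(1, 3));
    System.out.println(meanSuccessfulDuration(900, 3));
  }
}
```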
MBean PerInstanceResourceMonitor
ObjectName: “ClusterStatus:cluster=[cluster-name],instanceName=[instance-name],resourceName=[resource-name]”
Attributes | Description |
---|---|
PartitionGauge | Get number of partitions of the resource in best possible ideal state for this resource on specific instance |
MBean JobMonitor
ObjectName: “ClusterStatus:cluster=[cluster-name],jobType=[job-type]”
Attributes | Description |
---|---|
SuccessfulJobCount | Get number of the succeeded jobs |
FailedJobCount | Get number of failed jobs |
AbortedJobCount | Get number of the aborted jobs |
ExistingJobGauge | Get number of existing jobs registered |
QueuedJobGauge | Get the number of queued jobs, that is, jobs that are not yet running |
RunningJobGauge | Get the number of running jobs |
MaximumJobLatencyGauge | Get the maximum job running time. It is cleared every hour |
JobLatencyCount | Get total job latency counter. |
MBean WorkflowMonitor
ObjectName: “ClusterStatus:cluster=[cluster-name],workflowType=[workflow-type]”
Attributes | Description |
---|---|
SuccessfulWorkflowCount | Get number of succeeded workflows |
FailedWorkflowCount | Get number of failed workflows |
FailedWorkflowGauge | Get number of current failed workflows |
ExistingWorkflowGauge | Get number of current existing workflows |
QueuedWorkflowGauge | Get number of queued but not started workflows |
RunningWorkflowGauge | Get number of running workflows |
WorkflowLatencyCount | Get workflow latency count |
MaximumWorkflowLatencyGauge | Get the maximum workflow latency. It is reset every hour. |
Metrics on Participant only
MBean StateTransitionStatMonitor
ObjectName: “CLMParticipantReport:Cluster=[cluster-name],Resource=[resource-name],Transition=[transaction-id]”
Attributes | Description |
---|---|
TotalStateTransitionGauge | Get the number of total state transitions |
TotalFailedTransitionGauge | Get the number of total failed state transitions |
TotalSuccessTransitionGauge | Get the number of total succeeded state transitions |
MeanTransitionLatency | Get the average state transition latency (from message read to finish) |
MaxTransitionLatency | Get the maximum state transition latency |
MinTransitionLatency | Get the minimum state transition latency |
PercentileTransitionLatency | Get the percentile of state transitions latency |
MeanTransitionExecuteLatency | Get the average execution latency of state transition (from task started to finish) |
MaxTransitionExecuteLatency | Get the maximum execution latency of state transition |
MinTransitionExecuteLatency | Get the minimum execution latency of state transition |
PercentileTransitionExecuteLatency | Get the percentile of execution latency of state transitions |
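Because one StateTransitionStatMonitor is registered per resource and transition, an ObjectName pattern is a convenient way to select all of them for a cluster. The sketch below uses the standard JMX property value wildcard. The class name and the sample cluster, resource, and transition values are hypothetical, and the cluster name is assumed to contain no characters that need quoting in an ObjectName.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper, not a Helix class.
public class TransitionMonitorPattern {

  /** Pattern matching every StateTransitionStatMonitor of the given cluster. */
  public static ObjectName forCluster(String clusterName) throws MalformedObjectNameException {
    return new ObjectName(
        "CLMParticipantReport:Cluster=" + clusterName + ",Resource=*,Transition=*");
  }

  public static void main(String[] args) throws MalformedObjectNameException {
    ObjectName pattern = forCluster("c1");
    ObjectName candidate =
        new ObjectName("CLMParticipantReport:Cluster=c1,Resource=db,Transition=t1");
    // apply() tests whether a concrete name matches the pattern.
    System.out.println(pattern.apply(candidate));
  }
}
```

The resulting pattern can be passed to `MBeanServerConnection.queryNames(pattern, null)` to retrieve the matching MBeans.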
MBean ThreadPoolExecutorMonitor
ObjectName: “HelixThreadPoolExecutor:Type=[threadpool-type]” (threadpool-type in Message.MessageType, BatchMessageExecutor, Task)
Attributes | Description |
---|---|
ThreadPoolCoreSizeGauge | Core thread pool size, as configured. Aggregate total thread pool size for the whole cluster. |
ThreadPoolMaxSizeGauge | Maximum thread pool size, as configured. Aggregated in the same way as ThreadPoolCoreSizeGauge. |
NumOfActiveThreadsGauge | Number of running threads. |
QueueSizeGauge | Queue size. Can be used to identify whether too many HelixTasks are blocked on the participant. |
MBean MessageLatencyMonitor
ObjectName: “CLMParticipantReport:ParticipantName=[instance-name],MonitorType=MessageLatencyMonitor”
Attributes | Description |
---|---|
TotalMessageCount | Total message count |
TotalMessageLatency | Total message latency in ms |
MessagelatencyGauge | Histogram (with all statistic data) of message processing latency. |
MBean ParticipantMessageMonitor
ObjectName: “CLMParticipantReport:ParticipantName=[instance-name]”
Attributes | Description |
---|---|
ReceivedMessages | Number of received messages |
DiscardedMessages | Number of discarded messages |
CompletedMessages | Number of completed messages |
FailedMessages | Number of failed messages |
PendingMessages | Number of pending messages to be processed |