Mesos Observability Metrics
This document describes the observability metrics provided by Mesos master and agent nodes. This document also provides some initial guidance on which metrics you should monitor to detect abnormal situations in your cluster.
Overview
Mesos master and agent nodes report a set of statistics and metrics that enable cluster operators to monitor resource usage and detect abnormal situations early. The information reported by Mesos includes details about available resources, used resources, registered frameworks, active agents, and task state. You can use this information to create automated alerts and to plot different metrics over time inside a monitoring dashboard.
Metric information is not persisted to disk at either master or agent nodes, which means that metrics will be reset when masters and agents are restarted. Similarly, if the current leading master fails and a new leading master is elected, metrics at the new master will be reset.
Metric Types
Mesos provides two different kinds of metrics: counters and gauges.
Counters keep track of discrete events and are monotonically increasing. The value of a metric of this type is always a natural number. Examples include the number of failed tasks and the number of agent registrations. For some metrics of this type, the rate of change is often more useful than the value itself.
Gauges represent an instantaneous sample of some magnitude. Examples include the amount of used memory in the cluster and the number of connected agents. For some metrics of this type, it is often useful to determine whether the value is above or below a threshold for a sustained period of time.
The tables in this document indicate the type of each available metric.
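As a minimal sketch of the two usage patterns described above, the following Python helpers compute the rate of change of a counter and check whether a gauge has stayed above a threshold. The function names and the sample format are assumptions made for illustration and are not part of Mesos. Because metrics are reset when a master or agent restarts, the rate helper treats a decreasing counter as a reset.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time):
    """Return the per-second rate of change of a counter between two samples.

    Assumption: a counter that goes backwards means the process restarted,
    so the current value is taken as the increase since the reset.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # Counter reset: everything counted since the restart is new.
        delta = curr_value
    return delta / elapsed


def sustained_above(samples, threshold, min_duration_secs):
    """Return True if a gauge has been above threshold for long enough.

    samples is a list of (timestamp_secs, value) pairs ordered by time.
    The check looks at the unbroken run of breaching samples that ends
    with the most recent sample.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            # Any sample at or below the threshold ends the run.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= min_duration_secs
```

For example, `counter_rate` could be applied to successive samples of `master/tasks_lost`, and `sustained_above` to samples of `master/cpus_percent` with a threshold of 0.9.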
Master Nodes
Metrics from each master node are available via the /metrics/snapshot master endpoint. The response is a JSON object that contains metric names and values as key-value pairs.
Observability Metrics
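The following Python sketch shows one way to retrieve and parse a snapshot using only the standard library. The helper names are hypothetical, and the base URL (host and port) is an assumption about your deployment that you need to supply.

```python
import json
from urllib.request import urlopen


def parse_snapshot(body):
    """Parse the JSON body of a /metrics/snapshot response into a dict."""
    metrics = json.loads(body)
    if not isinstance(metrics, dict):
        raise ValueError("expected a JSON object of metric names to values")
    return metrics


def fetch_snapshot(base_url, timeout_secs=5.0):
    """Fetch and parse the metrics snapshot from a master or agent.

    base_url is deployment-specific, for example a hypothetical
    "http://master.example.com:5050".
    """
    url = base_url.rstrip("/") + "/metrics/snapshot"
    with urlopen(url, timeout=timeout_secs) as response:
        return parse_snapshot(response.read().decode("utf-8"))
```

The returned dictionary maps metric names such as `master/uptime_secs` to numeric values.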
This section lists all available metrics from Mesos master nodes grouped by category.
Resources
The following metrics provide information about the total resources available in the cluster and their current usage. High resource usage for sustained periods of time may indicate that you need to add capacity to your cluster or that a framework is misbehaving.
Metric | Description | Type |
---|---|---|
master/cpus_percent | Percentage of allocated CPUs | Gauge |
master/cpus_used | Number of allocated CPUs | Gauge |
master/cpus_total | Number of CPUs | Gauge |
master/cpus_revocable_percent | Percentage of allocated revocable CPUs | Gauge |
master/cpus_revocable_total | Number of revocable CPUs | Gauge |
master/cpus_revocable_used | Number of allocated revocable CPUs | Gauge |
master/disk_percent | Percentage of allocated disk space | Gauge |
master/disk_used | Allocated disk space in MB | Gauge |
master/disk_total | Disk space in MB | Gauge |
master/disk_revocable_percent | Percentage of allocated revocable disk space | Gauge |
master/disk_revocable_total | Revocable disk space in MB | Gauge |
master/disk_revocable_used | Allocated revocable disk space in MB | Gauge |
master/gpus_percent | Percentage of allocated GPUs | Gauge |
master/gpus_used | Number of allocated GPUs | Gauge |
master/gpus_total | Number of GPUs | Gauge |
master/gpus_revocable_percent | Percentage of allocated revocable GPUs | Gauge |
master/gpus_revocable_total | Number of revocable GPUs | Gauge |
master/gpus_revocable_used | Number of allocated revocable GPUs | Gauge |
master/mem_percent | Percentage of allocated memory | Gauge |
master/mem_used | Allocated memory in MB | Gauge |
master/mem_total | Memory in MB | Gauge |
master/mem_revocable_percent | Percentage of allocated revocable memory | Gauge |
master/mem_revocable_total | Revocable memory in MB | Gauge |
master/mem_revocable_used | Allocated revocable memory in MB | Gauge |
Master
The following metrics provide information about whether a master is currently elected and how long it has been running. A cluster with no elected master for sustained periods of time is malfunctioning. This points to either leadership election issues (check the connection to ZooKeeper) or a flapping master process. A low uptime value indicates that the master has restarted recently.
Metric | Description | Type |
---|---|---|
master/elected | Whether this is the elected master | Gauge |
master/uptime_secs | Uptime in seconds | Gauge |
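Because each master reports its own master/elected value, deciding whether the cluster has an elected master requires looking at the snapshots from all masters. The sketch below assumes you have already collected one parsed snapshot per master into a dictionary keyed by a name of your choosing; the function name and input shape are hypothetical.

```python
def elected_masters(snapshots_by_master):
    """Return the sorted names of masters that report master/elected as 1.

    snapshots_by_master maps a master name to its parsed metrics snapshot.
    A missing master/elected key is treated as "not elected".
    """
    return sorted(
        name
        for name, snapshot in snapshots_by_master.items()
        if snapshot.get("master/elected", 0) == 1
    )
```

An empty result that persists across several polling intervals corresponds to the "no elected master" situation described above.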
System
The following metrics provide information about the resources available on this master node and their current usage. High resource usage in a master node for sustained periods of time may degrade the performance of the cluster.
Metric | Description | Type |
---|---|---|
system/cpus_total | Number of CPUs available in this master node | Gauge |
system/load_15min | Load average for the past 15 minutes | Gauge |
system/load_5min | Load average for the past 5 minutes | Gauge |
system/load_1min | Load average for the past minute | Gauge |
system/mem_free_bytes | Free memory in bytes | Gauge |
system/mem_total_bytes | Total memory in bytes | Gauge |
Agents
The following metrics provide information about agent events, agent counts, and agent states. A low number of active agents may indicate that agents are unhealthy or that they are not able to connect to the elected master.
Metric | Description | Type |
---|---|---|
master/slave_registrations | Number of agents that were able to cleanly re-join the cluster and connect back to the master after the master is disconnected. | Counter |
master/slave_removals | Number of agents removed for various reasons, including maintenance | Counter |
master/slave_reregistrations | Number of agent re-registrations | Counter |
master/slave_unreachable_scheduled | Number of agents which have failed their health check and are scheduled to be marked unreachable. They will not be marked unreachable immediately due to the Agent Removal Rate-Limit, but master/slave_unreachable_completed will start increasing as they do get removed. | Counter |
master/slave_unreachable_canceled | Number of times that an agent was due to be marked unreachable but this transition was cancelled. This happens when the agent removal rate limit is enabled and the agent sends a PONG response message to the master before the rate limit allows the agent to be marked unreachable. | Counter |
master/slave_unreachable_completed | Number of agents that were marked as unreachable because they failed health checks. These are agents which were not heard from despite the agent-removal rate limit, and have been marked as unreachable in the master's agent registry. | Counter |
master/slaves_active | Number of active agents | Gauge |
master/slaves_connected | Number of connected agents | Gauge |
master/slaves_disconnected | Number of disconnected agents | Gauge |
master/slaves_inactive | Number of inactive agents | Gauge |
master/slaves_unreachable | Number of unreachable agents. Unreachable agents are periodically garbage collected from the registry, which will cause this value to decrease. | Gauge |
Frameworks
The following metrics provide information about the registered frameworks in the cluster. An absence of active or connected frameworks may indicate that a scheduler is not registered or that it is misbehaving.
Metric | Description | Type |
---|---|---|
master/frameworks_active | Number of active frameworks | Gauge |
master/frameworks_connected | Number of connected frameworks | Gauge |
master/frameworks_disconnected | Number of disconnected frameworks | Gauge |
master/frameworks_inactive | Number of inactive frameworks | Gauge |
master/outstanding_offers | Number of outstanding resource offers | Gauge |
Tasks
The following metrics provide information about active and terminated tasks. A high rate of lost tasks may indicate that there is a problem with the cluster. The task states listed here match those of the task state machine.
Metric | Description | Type |
---|---|---|
master/tasks_error | Number of tasks that were invalid | Counter |
master/tasks_failed | Number of failed tasks | Counter |
master/tasks_finished | Number of finished tasks | Counter |
master/tasks_killed | Number of killed tasks | Counter |
master/tasks_killing | Number of tasks currently being killed | Gauge |
master/tasks_lost | Number of lost tasks | Counter |
master/tasks_running | Number of running tasks | Gauge |
master/tasks_staging | Number of staging tasks | Gauge |
master/tasks_starting | Number of starting tasks | Gauge |
master/tasks_unreachable | Number of unreachable tasks | Gauge |
Messages
The following metrics provide information about messages between the master and the agents and between the framework and the executors. A high rate of dropped messages may indicate that there is a problem with the network.
Metric | Description | Type |
---|---|---|
master/invalid_executor_to_framework_messages | Number of invalid executor to framework messages | Counter |
master/invalid_framework_to_executor_messages | Number of invalid framework to executor messages | Counter |
master/invalid_status_update_acknowledgements | Number of invalid status update acknowledgements | Counter |
master/invalid_status_updates | Number of invalid status updates | Counter |
master/dropped_messages | Number of dropped messages | Counter |
master/messages_authenticate | Number of authentication messages | Counter |
master/messages_deactivate_framework | Number of framework deactivation messages | Counter |
master/messages_decline_offers | Number of offers declined | Counter |
master/messages_executor_to_framework | Number of executor to framework messages | Counter |
master/messages_exited_executor | Number of terminated executor messages | Counter |
master/messages_framework_to_executor | Number of messages from a framework to an executor | Counter |
master/messages_kill_task | Number of kill task messages | Counter |
master/messages_launch_tasks | Number of launch task messages | Counter |
master/messages_reconcile_tasks | Number of reconcile task messages | Counter |
master/messages_register_framework | Number of framework registration messages | Counter |
master/messages_register_slave | Number of agent registration messages | Counter |
master/messages_reregister_framework | Number of framework re-registration messages | Counter |
master/messages_reregister_slave | Number of agent re-registration messages | Counter |
master/messages_resource_request | Number of resource request messages | Counter |
master/messages_revive_offers | Number of offer revival messages | Counter |
master/messages_status_update | Number of status update messages | Counter |
master/messages_status_update_acknowledgement | Number of status update acknowledgement messages | Counter |
master/messages_unregister_framework | Number of framework unregistration messages | Counter |
master/messages_unregister_slave | Number of agent unregistration messages | Counter |
master/messages_update_slave | Number of update agent messages | Counter |
master/recovery_slave_removals | Number of agents not re-registered during master failover | Counter |
master/slave_removals/reason_registered | Number of agents removed when new agents registered at the same address | Counter |
master/slave_removals/reason_unhealthy | Number of agents removed due to failed health checks | Counter |
master/slave_removals/reason_unregistered | Number of agents unregistered | Counter |
master/valid_framework_to_executor_messages | Number of valid framework to executor messages | Counter |
master/valid_status_update_acknowledgements | Number of valid status update acknowledgement messages | Counter |
master/valid_status_updates | Number of valid status update messages | Counter |
master/task_lost/source_master/reason_invalid_offers | Number of tasks lost due to invalid offers | Counter |
master/task_lost/source_master/reason_slave_removed | Number of tasks lost due to agent removal | Counter |
master/task_lost/source_slave/reason_executor_terminated | Number of tasks lost due to executor termination | Counter |
master/valid_executor_to_framework_messages | Number of valid executor to framework messages | Counter |
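Several of the message counters come in valid/invalid pairs. One possible way to use them, sketched here as an assumption rather than a recommendation from Mesos, is to track the share of invalid messages within a pair. The helper name is hypothetical.

```python
def invalid_fraction(snapshot, valid_key, invalid_key):
    """Return invalid / (valid + invalid) for a pair of message counters.

    Missing counters are treated as 0. Returns 0.0 when no messages of
    either kind have been counted, to avoid dividing by zero.
    """
    valid = snapshot.get(valid_key, 0)
    invalid = snapshot.get(invalid_key, 0)
    total = valid + invalid
    if total == 0:
        return 0.0
    return invalid / total
```

For example, the pair `master/valid_status_updates` and `master/invalid_status_updates` can be passed as `valid_key` and `invalid_key`. Since both values are counters that reset on restart, the fraction describes only the period since the master last started.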
Event queue
The following metrics provide information about different types of events in the event queue.
Metric | Description | Type |
---|---|---|
master/event_queue_dispatches | Number of dispatches in the event queue | Gauge |
master/event_queue_http_requests | Number of HTTP requests in the event queue | Gauge |
master/event_queue_messages | Number of messages in the event queue | Gauge |
Registrar
The following metrics provide information about read and write latency to the agent registrar.
Metric | Description | Type |
---|---|---|
registrar/state_fetch_ms | Registry read latency in ms | Gauge |
registrar/state_store_ms | Registry write latency in ms | Gauge |
registrar/state_store_ms/max | Maximum registry write latency in ms | Gauge |
registrar/state_store_ms/min | Minimum registry write latency in ms | Gauge |
registrar/state_store_ms/p50 | Median registry write latency in ms | Gauge |
registrar/state_store_ms/p90 | 90th percentile registry write latency in ms | Gauge |
registrar/state_store_ms/p95 | 95th percentile registry write latency in ms | Gauge |
registrar/state_store_ms/p99 | 99th percentile registry write latency in ms | Gauge |
registrar/state_store_ms/p999 | 99.9th percentile registry write latency in ms | Gauge |
registrar/state_store_ms/p9999 | 99.99th percentile registry write latency in ms | Gauge |
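The write latency metric is reported as a base value plus a set of statistics that share its name as a prefix. The sketch below groups such a family into one dictionary so that, for instance, the p99 value can be compared against a limit of your choosing. The helper name and the "value" key used for the base metric are assumptions made for illustration.

```python
def metric_family(snapshot, base_name):
    """Collect a metric and its "/"-suffixed statistics from a snapshot.

    The base metric is stored under the key "value"; each statistic is
    stored under its suffix, such as "max" or "p99".
    """
    prefix = base_name + "/"
    family = {}
    for name, value in snapshot.items():
        if name == base_name:
            family["value"] = value
        elif name.startswith(prefix):
            family[name[len(prefix):]] = value
    return family
```

Calling `metric_family(snapshot, "registrar/state_store_ms")` returns whichever of the statistics listed above are present in the snapshot.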
Replicated log
The following metrics provide information about the replicated log underneath the registrar, which is the persistent store for masters.
Metric | Description | Type |
---|---|---|
registrar/log/recovered | Whether the replicated log for the registrar has caught up with the other masters in the cluster. A cluster is operational as long as a quorum of "recovered" masters is available in the cluster. | Gauge |
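The quorum condition described in the table can be checked by counting the masters that report registrar/log/recovered as 1. This is a minimal sketch: the function name is hypothetical, and the quorum size is an input that must match the quorum configured for your cluster.

```python
def has_recovered_quorum(master_snapshots, quorum_size):
    """Return True if at least quorum_size masters report a recovered log.

    master_snapshots is an iterable of parsed snapshots, one per master.
    A master whose snapshot lacks the metric is counted as not recovered.
    """
    recovered = sum(
        1
        for snapshot in master_snapshots
        if snapshot.get("registrar/log/recovered", 0) == 1
    )
    return recovered >= quorum_size
```

With three masters and a quorum size of 2, the check still passes when one master has not yet caught up.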
Allocator
The following metrics provide information about performance and resource allocations in the allocator.
Metric | Description | Type |
---|---|---|
allocator/mesos/allocation_run_ms | Allocation algorithm latency in ms | Gauge |
allocator/mesos/allocation_run_ms/count | Number of allocation algorithm latency measurements in the window | Gauge |
allocator/mesos/allocation_run_ms/max | Maximum allocation algorithm latency in ms | Gauge |
allocator/mesos/allocation_run_ms/min | Minimum allocation algorithm latency in ms | Gauge |
allocator/mesos/allocation_run_ms/p50 | Median allocation algorithm latency in ms | Gauge |
allocator/mesos/allocation_run_ms/p90 | 90th percentile allocation algorithm latency in ms | Gauge |
allocator/mesos/allocation_run_ms/p95 | 95th percentile allocation algorithm latency in ms | Gauge |
allocator/mesos/allocation_run_ms/p99 | 99th percentile allocation algorithm latency in ms | Gauge |
allocator/mesos/allocation_run_ms/p999 | 99.9th percentile allocation algorithm latency in ms | Gauge |
allocator/mesos/allocation_run_ms/p9999 | 99.99th percentile allocation algorithm latency in ms | Gauge |
allocator/mesos/allocation_runs | Number of times the allocation algorithm has run | Counter |
allocator/mesos/roles/<role>/shares/dominant | Dominant resource share for the role, exposed as a fraction (0.0-1.0) | Gauge |
allocator/mesos/event_queue_dispatches | Number of dispatch events in the event queue | Gauge |
allocator/mesos/offer_filters/roles/<role>/active | Number of active offer filters for all frameworks within the role | Gauge |
allocator/mesos/quota/roles/<role>/resources/<resource>/offered_or_allocated | Amount of resources considered offered or allocated towards a role's quota guarantee | Gauge |
allocator/mesos/quota/roles/<role>/resources/<resource>/guarantee | Amount of resources guaranteed for a role via quota | Gauge |
allocator/mesos/resources/cpus/offered_or_allocated | Number of CPUs offered or allocated | Gauge |
allocator/mesos/resources/cpus/total | Number of CPUs | Gauge |
allocator/mesos/resources/disk/offered_or_allocated | Allocated or offered disk space in MB | Gauge |
allocator/mesos/resources/disk/total | Total disk space in MB | Gauge |
allocator/mesos/resources/mem/offered_or_allocated | Allocated or offered memory in MB | Gauge |
allocator/mesos/resources/mem/total | Total memory in MB | Gauge |
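Some allocator metric names embed the role, so the set of keys in a snapshot depends on the roles in your cluster. The sketch below extracts the dominant share for every role present in a parsed snapshot. The function name is hypothetical, and the pattern assumes the role occupies everything between the fixed prefix and the fixed suffix of the metric name.

```python
import re

# Matches allocator/mesos/roles/<role>/shares/dominant and captures <role>.
_DOMINANT_SHARE = re.compile(
    r"^allocator/mesos/roles/(?P<role>.+)/shares/dominant$"
)


def dominant_shares(snapshot):
    """Return a dict mapping each role to its dominant resource share."""
    shares = {}
    for name, value in snapshot.items():
        match = _DOMINANT_SHARE.match(name)
        if match:
            shares[match.group("role")] = value
    return shares
```

The result can be sorted by value to see which roles currently hold the largest share of the cluster.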
Basic Alerts
This section lists some examples of basic alerts that you can use to detect abnormal situations in a cluster.
master/uptime_secs is low
The master has restarted.
master/uptime_secs < 60 for sustained periods of time
The cluster has a flapping master node.
master/tasks_lost is increasing rapidly
Tasks in the cluster are disappearing. Possible causes include hardware failures, bugs in one of the frameworks, or bugs in Mesos.
master/slaves_active is low
Agents are having trouble connecting to the master.
master/cpus_percent > 0.9 for sustained periods of time
Cluster CPU utilization is close to capacity.
master/mem_percent > 0.9 for sustained periods of time
Cluster memory utilization is close to capacity.
master/elected is 0 for sustained periods of time
No master is currently elected.
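The threshold-style alerts above that depend on a single master's snapshot can be sketched as a small rule table. The rule format and function name are assumptions made for illustration. The sketch evaluates one snapshot at a time, so the "sustained periods" condition has to be applied separately by your monitoring system across repeated evaluations.

```python
# Each rule: (metric name, comparison, limit, message). Limits are taken
# from the examples in this section.
THRESHOLD_ALERTS = [
    ("master/uptime_secs", "<", 60, "master restarted recently"),
    ("master/cpus_percent", ">", 0.9, "cluster CPU utilization is close to capacity"),
    ("master/mem_percent", ">", 0.9, "cluster memory utilization is close to capacity"),
]


def evaluate_alerts(snapshot, rules=THRESHOLD_ALERTS):
    """Return a list of alert strings for rules breached by this snapshot.

    Rules whose metric is absent from the snapshot are skipped.
    """
    fired = []
    for metric, comparison, limit, message in rules:
        if metric not in snapshot:
            continue
        value = snapshot[metric]
        breached = value < limit if comparison == "<" else value > limit
        if breached:
            fired.append(f"{metric}: {message} (value={value})")
    return fired
```

Alerts described only as "low" or "increasing rapidly", such as those on master/slaves_active and master/tasks_lost, need limits chosen for your own cluster and are left out of the rule table for that reason.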
Agent Nodes
Metrics from each agent node are available via the /metrics/snapshot agent endpoint. The response is a JSON object that contains metric names and values as key-value pairs.
Observability Metrics
This section lists all available metrics from Mesos agent nodes grouped by category.
Resources
The following metrics provide information about the total resources available in the agent and their current usage.
Metric | Description | Type |
---|---|---|
slave/cpus_percent | Percentage of allocated CPUs | Gauge |
slave/cpus_used | Number of allocated CPUs | Gauge |
slave/cpus_total | Number of CPUs | Gauge |
slave/cpus_revocable_percent | Percentage of allocated revocable CPUs | Gauge |
slave/cpus_revocable_total | Number of revocable CPUs | Gauge |
slave/cpus_revocable_used | Number of allocated revocable CPUs | Gauge |
slave/disk_percent | Percentage of allocated disk space | Gauge |
slave/disk_used | Allocated disk space in MB | Gauge |
slave/disk_total | Disk space in MB | Gauge |
slave/gpus_percent | Percentage of allocated GPUs | Gauge |
slave/gpus_used | Number of allocated GPUs | Gauge |
slave/gpus_total | Number of GPUs | Gauge |
slave/gpus_revocable_percent | Percentage of allocated revocable GPUs | Gauge |
slave/gpus_revocable_total | Number of revocable GPUs | Gauge |
slave/gpus_revocable_used | Number of allocated revocable GPUs | Gauge |
slave/mem_percent | Percentage of allocated memory | Gauge |
slave/disk_revocable_percent | Percentage of allocated revocable disk space | Gauge |
slave/disk_revocable_total | Revocable disk space in MB | Gauge |
slave/disk_revocable_used | Allocated revocable disk space in MB | Gauge |
slave/mem_used | Allocated memory in MB | Gauge |
slave/mem_total | Memory in MB | Gauge |
slave/mem_revocable_percent | Percentage of allocated revocable memory | Gauge |
slave/mem_revocable_total | Revocable memory in MB | Gauge |
slave/mem_revocable_used | Allocated revocable memory in MB | Gauge |
Agent
The following metrics provide information about whether an agent is currently registered with a master and for how long it has been running.
Metric | Description | Type |
---|---|---|
slave/registered | Whether this agent is registered with a master | Gauge |
slave/uptime_secs | Uptime in seconds | Gauge |
System
The following metrics provide information about the agent system.
Metric | Description | Type |
---|---|---|
system/cpus_total | Number of CPUs available | Gauge |
system/load_15min | Load average for the past 15 minutes | Gauge |
system/load_5min | Load average for the past 5 minutes | Gauge |
system/load_1min | Load average for the past minute | Gauge |
system/mem_free_bytes | Free memory in bytes | Gauge |
system/mem_total_bytes | Total memory in bytes | Gauge |
Executors
The following metrics provide information about the executor instances running on the agent.
Metric | Description | Type |
---|---|---|
containerizer/mesos/container_destroy_errors | Number of containers destroyed due to launch errors | Counter |
slave/container_launch_errors | Number of container launch errors | Counter |
slave/executors_preempted | Number of executors destroyed due to preemption | Counter |
slave/frameworks_active | Number of active frameworks | Gauge |
slave/executor_directory_max_allowed_age_secs | Maximum allowed age in seconds to delete executor directory | Gauge |
slave/executors_registering | Number of executors registering | Gauge |
slave/executors_running | Number of executors running | Gauge |
slave/executors_terminated | Number of terminated executors | Counter |
slave/executors_terminating | Number of terminating executors | Gauge |
slave/recovery_errors | Number of errors encountered during agent recovery | Gauge |
Tasks
The following metrics provide information about active and terminated tasks.
Metric | Description | Type |
---|---|---|
slave/tasks_failed | Number of failed tasks | Counter |
slave/tasks_finished | Number of finished tasks | Counter |
slave/tasks_killed | Number of killed tasks | Counter |
slave/tasks_lost | Number of lost tasks | Counter |
slave/tasks_running | Number of running tasks | Gauge |
slave/tasks_staging | Number of staging tasks | Gauge |
slave/tasks_starting | Number of starting tasks | Gauge |
Messages
The following metrics provide information about messages between the agent and the master it is registered with.
Metric | Description | Type |
---|---|---|
slave/invalid_framework_messages | Number of invalid framework messages | Counter |
slave/invalid_status_updates | Number of invalid status updates | Counter |
slave/valid_framework_messages | Number of valid framework messages | Counter |
slave/valid_status_updates | Number of valid status updates | Counter |