If you're new to Mesos
See the getting started page for more information about downloading, building, and deploying Mesos.
If you'd like to get involved or you're looking for support
See our community page for more details.
Executor HTTP API
A Mesos executor can be built in two different ways:
By using the
ExecutorDriver
C++ interface. TheExecutorDriver
handles the details of communicating with the Mesos agent. Executor developers implement custom executor logic by registering callbacks with theExecutorDriver
for significant events, such as when a new task launch request is received. Because theExecutorDriver
interface is written in C++, this typically requires that executor developers either use C++ or use a C++ binding to their language of choice (e.g., JNI when using JVM-based languages).By using the new HTTP API. This allows Mesos executors to be developed without using C++ or a native client library; instead, a custom executor interacts with the Mesos agent via HTTP requests, as described below. Although it is theoretically possible to use the HTTP executor API “directly” (e.g., by using a generic HTTP library), most executor developers should use a library for their language of choice that manages the details of the HTTP API; see the document on HTTP API client libraries for a list.
The v1 Executor HTTP API was introduced in Mesos 0.28.0. As of Mesos 1.0, it is considered stable and is the recommended way to develop new Mesos executors.
Overview
The executor interacts with Mesos via the /api/v1/executor agent endpoint. We refer to this endpoint with its suffix “/executor” in the rest of this document. This endpoint accepts HTTP POST requests with data encoded as JSON (Content-Type: application/json) or binary Protobuf (Content-Type: application/x-protobuf). The first request that the executor sends to “/executor” endpoint is called SUBSCRIBE and results in a streaming response (“200 OK” status code with Transfer-Encoding: chunked).
Executors are expected to keep the subscription connection open as long as possible (barring network errors, agent process restarts, software bugs, etc.) and incrementally process the response. HTTP client libraries that can only parse the response after the connection is closed cannot be used. For the encoding used, please refer to Events section below.
All subsequent (non-SUBSCRIBE
) requests to the “/executor” endpoint (see details below in Calls section) must be sent using a different connection than the one used for subscription. The agent responds to these HTTP POST requests with “202 Accepted” status codes (or, for unsuccessful requests, with 4xx or 5xx status codes; details in later sections). The “202 Accepted” response means that a request has been accepted for processing, not that the processing of the request has been completed. The request might or might not be acted upon by Mesos (e.g., agent fails during the processing of the request). Any asynchronous responses from these requests will be streamed on the long-lived subscription connection. Executors can submit requests using more than one different HTTP connection.
Calls
The following calls are currently accepted by the agent. The canonical source of this information is executor.proto. When sending JSON-encoded Calls, executors should encode raw bytes in Base64 and strings in UTF-8.
SUBSCRIBE
This is the first step in the communication process between the executor and agent. This is also to be considered as subscription to the “/executor” events stream.
To subscribe with the agent, the executor sends an HTTP POST with a SUBSCRIBE
message. The HTTP response is a stream in RecordIO format; the event stream will begin with a SUBSCRIBED
event (see details in Events section).
Additionally, if the executor is connecting to the agent after a disconnection, it can also send a list of:
- Unacknowledged Status Updates: The executor is expected to maintain a list of status updates not acknowledged by the agent via the
ACKNOWLEDGE
events. - Unacknowledged Tasks: The executor is expected to maintain a list of tasks that have not been acknowledged by the agent. A task is considered acknowledged if at least one of the status updates for this task is acknowledged by the agent.
SUBSCRIBE Request (JSON):
POST /api/v1/executor HTTP/1.1
Host: agenthost:5051
Content-Type: application/json
Accept: application/json
{
"type": "SUBSCRIBE",
"executor_id": {
"value": "387aa966-8fc5-4428-a794-5a868a60d3eb"
},
"framework_id": {
"value": "49154f1b-8cf6-4421-bf13-8bd11dccd1f1"
},
"subscribe": {
"unacknowledged_tasks": [
{
"name": "dummy-task",
"task_id": {
"value": "d40f3f3e-bbe3-44af-a230-4cb1eae72f67"
},
"agent_id": {
"value": "f1c9cdc5-195e-41a7-a0d7-adaa9af07f81"
},
"command": {
"value": "ls",
"arguments": [
"-l",
"\/tmp"
]
}
}
],
"unacknowledged_updates": [
{
"framework_id": {
"value": "49154f1b-8cf6-4421-bf13-8bd11dccd1f1"
},
"status": {
"source": "SOURCE_EXECUTOR",
"task_id": {
"value": "d40f3f3e-bbe3-44af-a230-4cb1eae72f67"
},
"state": "TASK_RUNNING",
"uuid": "ZDQwZjNmM2UtYmJlMy00NGFmLWEyMzAtNGNiMWVhZTcyZjY3Cg=="
}
}
]
}
}
SUBSCRIBE Response Event (JSON):
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
<event-length>
{
"type": "SUBSCRIBED",
"subscribed": {
"executor_info": {
"executor_id": {
"value": "387aa966-8fc5-4428-a794-5a868a60d3eb"
},
"command": {
"value": "\/path\/to\/executor"
},
"framework_id": {
"value": "49154f1b-8cf6-4421-bf13-8bd11dccd1f1"
}
},
"framework_info": {
"user": "foo",
"name": "my_framework"
},
"agent_id": {
"value": "f1c9cdc5-195e-41a7-a0d7-adaa9af07f81"
},
"agent_info": {
"host": "agenthost",
"port": 5051
}
}
}
<more events>
NOTE: Once an executor is launched, the agent waits for a duration of --executor_registration_timeout
(configurable at agent startup) for the executor to subscribe. If the executor fails to subscribe within this duration, the agent forcefully destroys the container executor is running in.
UPDATE
Sent by the executor to reliably communicate the state of managed tasks. It is crucial that a terminal update (e.g., TASK_FINISHED
, TASK_KILLED
or TASK_FAILED
) is sent to the agent as soon as the task terminates, in order to allow Mesos to release the resources allocated to the task.
The scheduler must explicitly respond to this call through an ACKNOWLEDGE
message (see ACKNOWLEDGED
in the Events section below for the semantics). The executor must maintain a list of unacknowledged updates. If for some reason, the executor is disconnected from the agent, these updates must be sent as part of SUBSCRIBE
request in the unacknowledged_updates
field.
UPDATE Request (JSON):
POST /api/v1/executor HTTP/1.1
Host: agenthost:5051
Content-Type: application/json
Accept: application/json
{
"executor_id": {
"value": "387aa966-8fc5-4428-a794-5a868a60d3eb"
},
"framework_id": {
"value": "9aaa9d0d-e00d-444f-bfbd-23dd197939a0-0000"
},
"type": "UPDATE",
"update": {
"status": {
"executor_id": {
"value": "387aa966-8fc5-4428-a794-5a868a60d3eb"
},
"source": "SOURCE_EXECUTOR",
"state": "TASK_RUNNING",
"task_id": {
"value": "66724cec-2609-4fa0-8d93-c5fb2099d0f8"
},
"uuid": "ZDQwZjNmM2UtYmJlMy00NGFmLWEyMzAtNGNiMWVhZTcyZjY3Cg=="
}
}
}
UPDATE Response:
HTTP/1.1 202 Accepted
MESSAGE
Sent by the executor to send arbitrary binary data to the scheduler. Note that Mesos neither interprets this data nor makes any guarantees about the delivery of this message to the scheduler. The data
field is raw bytes encoded in Base64.
MESSAGE Request (JSON):
POST /api/v1/executor HTTP/1.1
Host: agenthost:5051
Content-Type: application/json
Accept: application/json
{
"executor_id": {
"value": "387aa966-8fc5-4428-a794-5a868a60d3eb"
},
"framework_id": {
"value": "9aaa9d0d-e00d-444f-bfbd-23dd197939a0-0000"
},
"type": "MESSAGE",
"data": "t+Wonz5fRFKMzCnEptlv5A=="
}
MESSAGE Response:
HTTP/1.1 202 Accepted
Events
Executors are expected to keep a persistent connection to the “/executor” endpoint (even after getting a SUBSCRIBED
HTTP Response event). This is indicated by the “Connection: keep-alive” and “Transfer-Encoding: chunked” headers with no “Content-Length” header set. All subsequent events that are relevant to this executor generated by Mesos are streamed on this connection. The agent encodes each Event in RecordIO format, i.e., string representation of length of the event in bytes followed by JSON or binary Protobuf (possibly compressed) encoded event. The length of an event is a 64-bit unsigned integer (encoded as a textual value) and will never be “0”. Also, note that the RecordIO
encoding should be decoded by the executor whereas the underlying HTTP chunked encoding is typically invisible at the application (executor) layer. The type of content encoding used for the events will be determined by the accept header of the POST request (e.g., “Accept: application/json”).
The following events are currently sent by the agent. The canonical source of this information is at executor.proto. Note that when sending JSON-encoded events, agent encodes raw bytes in Base64 and strings in UTF-8.
SUBSCRIBED
The first event sent by the agent when the executor sends a SUBSCRIBE
request on the persistent connection. See SUBSCRIBE
in Calls section for the format.
LAUNCH
Sent by the agent whenever it needs to assign a new task to the executor. The executor is required to send an UPDATE
message back to the agent indicating the success or failure of the task initialization.
The executor must maintain a list of unacknowledged tasks (see SUBSCRIBE
in Calls
section). If for some reason, the executor is disconnected from the agent, these tasks must be sent as part of SUBSCRIBE
request in the tasks
field.
LAUNCH Event (JSON)
<event-length>
{
"type": "LAUNCH",
"launch": {
"framework_info": {
"id": {
"value": "49154f1b-8cf6-4421-bf13-8bd11dccd1f1"
},
"user": "foo",
"name": "my_framework"
},
"task": {
"name": "dummy-task",
"task_id": {
"value": "d40f3f3e-bbe3-44af-a230-4cb1eae72f67"
},
"agent_id": {
"value": "f1c9cdc5-195e-41a7-a0d7-adaa9af07f81"
},
"command": {
"value": "sleep",
"arguments": [
"100"
]
}
}
}
}
LAUNCH_GROUP
This experimental event was added in 1.1.0.
Sent by the agent whenever it needs to assign a new task group to the executor. The executor is required to send UPDATE
messages back to the agent indicating the success or failure of each of the tasks in the group.
The executor must maintain a list of unacknowledged tasks (see LAUNCH
section above).
LAUNCH_GROUP Event (JSON)
<event-length>
{
"type": "LAUNCH_GROUP",
"launch_group": {
"task_group" : {
"tasks" : [
"task": {
"name": "dummy-task",
"task_id": {
"value": "d40f3f3e-bbe3-44af-a230-4cb1eae72f67"
},
"agent_id": {
"value": "f1c9cdc5-195e-41a7-a0d7-adaa9af07f81"
},
"command": {
"value": "sleep",
"arguments": [
"100"
]
}
}
]
}
}
}
KILL
The KILL
event is sent whenever the scheduler needs to stop execution of a specific task. The executor is required to send a terminal update (e.g., TASK_FINISHED
, TASK_KILLED
or TASK_FAILED
) back to the agent once it has stopped/killed the task. Mesos will mark the task resources as freed once the terminal update is received.
LAUNCH Event (JSON)
<event-length>
{
"type" : "KILL",
"kill" : {
"task_id" : {"value" : "d40f3f3e-bbe3-44af-a230-4cb1eae72f67"}
}
}
ACKNOWLEDGED
Sent by the agent in order to signal the executor that a status update was received as part of the reliable message passing mechanism. Acknowledged updates must not be retried.
ACKNOWLEDGED Event (JSON)
<event-length>
{
"type" : "ACKNOWLEDGED",
"acknowledged" : {
"task_id" : {"value" : "d40f3f3e-bbe3-44af-a230-4cb1eae72f67"},
"uuid" : "ZDQwZjNmM2UtYmJlMy00NGFmLWEyMzAtNGNiMWVhZTcyZjY3Cg=="
}
}
MESSAGE
Custom message generated by the scheduler and forwarded all the way to the executor. These messages are delivered “as-is” by Mesos and have no delivery guarantees. It is up to the scheduler to retry if a message is dropped for any reason. The data
field contains raw bytes encoded as Base64.
MESSAGE Event (JSON)
<event-length>
{
"type" : "MESSAGE",
"message" : {
"data" : "c2FtcGxlIGRhdGE="
}
}
SHUTDOWN
Sent by the agent in order to shutdown the executor. Once an executor gets a SHUTDOWN
event it is required to kill all its tasks, send TASK_KILLED
updates and gracefully exit. If an executor doesn’t terminate within a certain period MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD
(an environment variable set by the agent upon executor startup), the agent will forcefully destroy the container where the executor is running. The agent would then send TASK_LOST
updates for any remaining active tasks of this executor.
SHUTDOWN Event (JSON)
<event-length>
{
"type" : "SHUTDOWN"
}
ERROR
Sent by the agent when an asynchronous error event is generated. It is recommended that the executor abort when it receives an error event and retry subscription.
ERROR Event (JSON)
<event-length>
{
"type" : "ERROR",
"error" : {
"message" : "Unrecoverable error"
}
}
Executor Environment Variables
The following environment variables are set by the agent that can be used by the executor upon startup:
MESOS_FRAMEWORK_ID
:FrameworkID
of the scheduler needed as part of theSUBSCRIBE
call.MESOS_EXECUTOR_ID
:ExecutorID
of the executor needed as part of theSUBSCRIBE
call.MESOS_DIRECTORY
: Path to the working directory for the executor on the host filesystem (deprecated).MESOS_SANDBOX
: Path to the mapped sandbox inside of the container (determined by the agent flagsandbox_directory
) for either mesos container with image or docker container. For the case of command task without image specified, it is the path to the sandbox on the host filesystem, which is identical toMESOS_DIRECTORY
.MESOS_DIRECTORY
is always the sandbox on the host filesystem.MESOS_AGENT_ENDPOINT
: Agent endpoint (i.e., ip:port to be used by the executor to connect to the agent).MESOS_CHECKPOINT
: If set to true, denotes that framework has checkpointing enabled.MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD
: Amount of time the agent would wait for an executor to shut down (e.g., 60secs, 3mins etc.) after sending aSHUTDOWN
event.
If MESOS_CHECKPOINT
is set (i.e., if framework checkpointing is enabled), the following additional variables are also set that can be used by the executor for retrying upon a disconnection with the agent:
MESOS_RECOVERY_TIMEOUT
: The total duration that the executor should spend retrying before shutting itself down when it is disconnected from the agent (e.g.,15mins
,5secs
etc.). This is configurable at agent startup via the flag--recovery_timeout
.MESOS_SUBSCRIPTION_BACKOFF_MAX
: The maximum backoff duration to be used by the executor between two retries when disconnected (e.g.,250ms
,1mins
etc.). This is configurable at agent startup via the flag--executor_reregistration_timeout
.
NOTE: Additionally, the executor also inherits all the agent’s environment variables.
Disconnections
An executor considers itself disconnected if the persistent subscription connection (opened via SUBSCRIBE request) to “/executor” breaks. The disconnection can happen due to an agent process failure etc.
Upon detecting a disconnection from the agent, the retry behavior depends on whether framework checkpointing is enabled:
- If framework checkpointing is disabled, the executor is not supposed to retry subscription and gracefully exit.
- If framework checkpointing is enabled, the executor is supposed to retry subscription using a suitable backoff strategy for a duration of
MESOS_RECOVERY_TIMEOUT
. If it is not able to establish a subscription with the agent within this duration, it should gracefully exit.
Agent Recovery
Upon agent startup, an agent performs recovery. This allows the agent to recover status updates and reconnect with old executors. Currently, the agent supports the following recovery mechanisms specified via the --recover
flag:
- reconnect (default): This mode allows the agent to reconnect with any of it’s old live executors provided the framework has enabled checkpointing. The recovery of the agent is only marked complete once all the disconnected executors have connected and hung executors have been destroyed. Hence, it is mandatory that every executor retries at least once within the interval (
MESOS_SUBSCRIPTION_BACKOFF_MAX
) to ensure it is not shutdown by the agent due to being hung/unresponsive. - cleanup: This mode kills any old live executors and then exits the agent. This is usually done by operators when making a non-compatible agent/executor upgrade. Upon receiving a
SUBSCRIBE
request from the executor of a framework with checkpointing enabled, the agent would send it aSHUTDOWN
event as soon as it reconnects. For hung executors, the agent would wait for a duration of--executor_shutdown_grace_period
(configurable at agent startup) and then forcefully kill the container where the executor is running in.
Backoff Strategies
Executors are encouraged to retry subscription using a suitable backoff strategy like linear backoff, when they notice a disconnection with the agent. A disconnection typically happens when the agent process terminates (e.g., restarted for an upgrade). Each retry interval should be bounded by the value of MESOS_SUBSCRIPTION_BACKOFF_MAX
which is set as an environment variable.