~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements.  See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License.  You may obtain a copy of the License at
~~
~~     http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
~~
Introduction

  Apache Ambari™ is a monitoring, administration and lifecycle
  management project for Apache Hadoop™ clusters. Hadoop clusters
  require many inter-related components that must be installed,
  configured, and managed across the entire cluster. The set of
  components that are currently supported by Ambari includes:

  * {{{http://hbase.apache.org} Apache HBase™}}

  * {{{http://incubator.apache.org/hcatalog} Apache HCatalog™}}

  * {{{http://hadoop.apache.org/hdfs} Apache Hadoop HDFS™}}

  * {{{http://hive.apache.org} Apache Hive™}}

  * {{{http://hadoop.apache.org/mapreduce} Apache Hadoop MapReduce™}}

  * {{{http://pig.apache.org} Apache Pig™}}

  * {{{http://zookeeper.apache.org} Apache Zookeeper™}}

  []

  Ambari's audience is operators responsible for managing Hadoop clusters.
  It allow them to:

  * Deploy and configure Hadoop

    * Define a set of nodes as a cluster

    * Assign roles to particular nodes or let Ambari pick a mapping for them.

    * Override the default versions of components or configure 
    particular values.

  * Upgrade a cluster

    * Modify the versions or configuration of each component

    * Upgrade easily without losing data

  * Monitoring and other maintenance tasks

    * Check which servers are currently running across the cluster

    * Starting and stopping Hadoop services (like HDFS, MR, HBase)

  * Integrate with other tools

    * Provide a REST interface for defining or manipulating clusters.

  []

  Ambari provides a REST, command line, and graphical interface. The command 
  line and graphical interface are implemented using the REST interface and 
  all three have the same functionality. The graphical interface is 
  browser-based using JSON and JavaScript. 

  Ambari requires that the base operating system has been deployed and
  managed via existing tools, such as Chef or Puppet. Ambari is solely focused
  on simplifying configuring and managing the Hadoop stack. Ambari does support
  adding third party software packages to be deployed as part of the Hadoop 
  cluster.

Key concepts

  * <<Nodes>> are machines in the datacenter that are managed by Ambari to
  run Hadoop clusters.

  * <<Components>> are the individual software products that are
  installed to create a complete Hadoop cluster. Some components
  are active and include servers, such as HDFS, and some are passive
  libraries, such as Pig. The servers of active components provide a 
  <<service>>.

  * Components consist of <<roles>> that represent the different
  configurations required by the component. Components have a client
  role and a role for each server. HDFS roles, for example, are
  'client,' 'namenode,' 'secondary namenode,' and 'datanode.' The
  client role installs the client software and configuration, while
  each server role installs the appropriate software and configuration.

  * <<Stacks>> define the software and configuration for a
  cluster. Stacks can inherit from each and only need to specify
  the part that differ from their parent. Thus, although stacks
  can specify the version for each component, most will not.

  * A <<cluster>> uses a stack and a set of nodes to form a
  cluster. When a cluster is defined, the user may specify the nodes
  for each role or let Ambari automatically assign the roles based on
  the nodes characteristics.  Clusters' state can either be active,
  inactive, or retired. Active clusters will be started, inactive
  clusters have reserved nodes, but and will be stopped. Retired
  clusters will keep their definition, but their nodes are released.

Configuration

  Ambari abstracts cluster configuration into groups of string
  key/value pairs. This abstraction lets us manage and manipulate the
  configurations in a consistent and component agnostic way. The
  groups are named for the file that they end up in, and the groups
  are defined by the set of components. For Hadoop, the groups are:
 
  * hadoop/hadoop-env

  * hadoop/capacity-scheduler

  * hadoop/core-site

  * hadoop/hdfs-site

  * hadoop/log4j.properties

  * hadoop/mapred-queue-acl

  * hadoop/mapred-site

  * hadoop/metrics2.properties

  * hadoop/task-controller

* Configuration example

  Although users will typically define configurations via the web UI,
  it is useful to examine a sample JSON expression that would define a
  configuration in the REST api.

------
{
  "hadoop/hadoop-env": {
    "HADOOP_CONF_DIR": "/etc/hadoop",
    "HADOOP_NAMENODE_OPTS": "-Dsecurity.audit.logger=INFO,DRFAS",
    "HADOOP_CLIENT_OPTS": "-Xmx128m"
  },
  "hadoop/core-site": {
     "fs.default.name" : "hdfs://${namenode}:8020/",
     "hadoop.tmp.dir" : "/grid/0/hadoop/tmp",
     "hadoop.security.authentication" : "kerberos",
  }
  "hadoop/hdfs-site": {
     "hdfs.user": "hdfs"
  }
}
------

Stacks

  Stacks form the basis of defining what software needs to be
  installed and run and the configuration for that software. Rather
  than have the administrator define the entire stack from scratch,
  stacks inherit most of their properties from their parent. This
  allows the administrator to take a default stack and only modify the
  properties that need to be changed without dealing with a lot of
  boilerplate.

  Stacks include a list of repositories that contain the rpms or
  tarballs. The repositories will be searched in the given order and
  if the required component versions are not found, the next one will
  be searched. If the required file isn't found, the parent stack's
  repository list will be searched and so on.

  Stacks define the version of each component that they need. Most
  of the versions will come from the stack, but the operator can
  override the version as needed.

  The stack define the configuration parameters to be used by this
   stack.  To keep the stacks generic, the configuration values may
   refer to the nodes that hold a particular role. Thus,
   <<<fs.default.name>>> may be configured to
   <<<hdfs://${namenode}/>>> and the name of the namenode will be
   filled in during the configuration.  A few configuration settings
   need to set exclusively for particular roles. For example, the
   NameNode needs to enable the https security option.

* Stack example

  Here's a example JSON expression for defining a stack.

------
{
  "parent": "site",        /* declare parent as site, r42 */
  "parent-revision": "42",
  "repositories": {
    "yum": ["http://incubator.apache.org/ambari/stack/yum"],
    "tar": ["http://incubator.apache.org/ambari/stack/tar"]
  },
  "configuration": {    /* define the general configuration */
    "hadoop/hadoop-env": {
      "HADOOP_CONF_DIR": "/etc/hadoop",
      "HADOOP_NAMENODE_OPTS": "-Dsecurity.audit.logger=INFO,DRFAS",
      "HADOOP_CLIENT_OPTS": "-Xmx128m"
    },
    "hadoop/core-site": {
       "fs.default.name" : "hdfs://${namenode}:8020/",
       "hadoop.tmp.dir" : "/grid/0/hadoop/tmp",
       "hadoop.security.authentication" : "kerberos",
    }
    "hadoop/hdfs-site": {
       "hdfs.user": "hdfs"
    }
  }
  "components": {
    "common": {
      "version": "0.20.204.1" /* define a new version for common */
      "arch": "i386"
    },
    "hdfs": {
      "roles": {
        "namenode": { /* override one value on the namenode */
          "hadoop/hdfs-site": {
            "dfs.https.enable": "true"
          }
        }
      }
    },
    "pig": {
      "version": "0.9.0"
    }
  }
}
------

Component Definitions

  We are designing the Ambari infrastructure with a generic interface
  for defining components. The current version of Ambari doesn't
  publicize the interface, but the intention is to open it up to
  support thirrd party components. Ambari will search the configured
  repositories for the component definition and use that definition to
  install, manage, run, and remove the component. To have consistency
  in the architecture, the standard Hadoop services will also be
  plugged in to Ambari using the same mechanism.

  The component definitions are written as a text file that provides
  the commands to perform each kind of action, such as install, start,
  stop, or remove. There will be well defined environment that the
  commands run in to provide consistency between platforms.

Clusters

  Defining a cluster, involves picking a stack and assigning nodes to the
  cluster.

  Clusters have a goal state, which can be one of three values:

  * <<Active>> -- the user wants the cluster to be started

  * <<Inactive>> -- the user wants the cluster to be stopped

  * <<Retired>> -- the user wants the cluster to be stopped, the nodes 
  released, and the data deleted. This is useful, if the user expects
  to recreate the cluster eventually, but wants to release the nodes.

  []

  Clusters also have a list of active components that should be running. This 
  overrides the stack and provides a mechanism for the administrator to
  shutdown a service temporarily.

* Cluster example

------
{
  "description": "alpha cluster",
  "stack": "kryptonite",
  "nodes": ["node000-999", "gateway0-1"],
  "goal": "active",
  "services": ["hdfs", "mapreduce"],
  "roles": {
    "namenode": ["node000"],
    "jobtracker": ["node001"],
    "secondary-namenode": ["node002"],
    "gateway": ["gateway0-1"],
  }
}
------

Stack Deployment

  Ambari will deploy the software for its clusters from either
  OS-specific packages (rpms and debs) or tarballs. Rpms have the
  advantage of putting the software in a user-convenient location,
  such as <<</usr/bin>>>, but they are specific to an OS and don't
  support having multiple versions installed at once, while tarballs
  require rebuilding the entire deployment to change one component
  version.

  The layout on the nodes looks like:

------
${ambari}/clusters/${cluster}-${role}/stack/
                                     /logs/
                                     /data/disk-${0 to N}/
                                     /pkgs/
------

  The software and configuration for the role are installed in
  <<<stack>>>. The logs for the managed cluster are put into
  <<<logs>>>. The cluster's data is in <<<data>>> with symlinks to
  each of the disks that machine should use. Finally, the component
  tarballs are placed in the <<<pkgs>>> directory to be installed by
  the component.

Ambari Installation

  Ambari will be packaged as both OS-specific packages (rpms and debs)
  and tarballs, which need to be installed on each node. The user
  chooses one node as the Ambari controller, which is the point of
  interaction for both the web UI and the REST interface. If the user
  doesn't already have a Zookeeper service for Ambari to use, Ambari
  will run one internally for its own use.

Monitoring

  Monitoring the current state of the cluster is an important part of
  operating Hadoop. Ambari current supports running basic
  health-checks on processes running on nodes. The status will be
  aggregated up as the health of the corresponding Hadoop
  services). Roughly these checks will consist of pinging the RPC port
  of the server to see if it responds.

High-level Design

  Ambari is managed by the Ambari <<Controller>> – a central server, which
  provides the user interface and that directs the agent on each node.
  The agent is responsible for installing, configuring, running and
  cleaning up components of the Hadoop stack on the local node. Each agent will
  contact the controller when it has finished its work or N seconds have 
  passed. The controller stores all of the information about the clusters and
  stacks in Zookeeper, which is highly available and redundant.

  Ambari abstracts out the configuration and software stack in the
  cluster as stack. Every stack release provides a default
  stack. If a site has multiple clusters, they can define a "site"
  stack that provides the site-wide defaults and have the cluster 
  stacks derive from it. Ambari will keep the revision history of
  stacks to enable operators to diagnose problems and track changes.

Roadmap

  In the future, Ambari would integrate with and use existing
  datacenter management and monitoring infrastructure - Nagios,
  etc. The other area where Ambari will focus on is a store for
  metrics data. HBase is a likely candidate for such a store.

  We also need to support adding and removing nodes from a running cluster
  without brining it down first. This will require doing decommissioning
  of nodes before they are removed.

  A lot of support needs to be added for supporting secure clusters,
  including providing a single interface to manage access control lists
  for the cluster.

  Ambari would also host a KDC, especially for servers like the
  tasktracker and the datanode to have their own keytabs generated and
  deployed by Ambari. The native KDC could optionally hook up to the
  Corporate KDC for user management, or host user management within
  itself. Continuing on the security aspects, Ambari would also have a
  convenient way of allowing administrators to specify ACLs for
  services and queues.

  We plan to integrate an SNMP interface for integration with other cluster
  management tools.