The YARN Shared Cache provides the facility to upload and manage shared application resources to HDFS in a safe and scalable manner. YARN applications can leverage resources uploaded by other applications or previous runs of the same application without having to reupload and localize identical files multiple times. This will save network resources and reduce YARN application startup time.
Currently the YARN Shared Cache is released and ready to use. The major components are implemented and have been deployed in a large-scale production setting. There are still some pieces missing (i.e. strong authentication). These missing features will be implemented as part of a follow-up phase 2 effort. Please see YARN-7282 for more information.
The shared cache feature consists of 4 major components:
YARN application developers and users, should interact with the shared cache using the shared cache client. This client is responsible for interacting with the shared cache manager, computing the checksum of application resources, and claiming application resources in the shared cache. Once an application has claimed a resource, it is free to use that resource for the life-cycle of the application. Please see the SharedCacheClient.java javadoc for further documentation.
The shared cache HDFS directory stores all of the shared cache resources. It is protected by HDFS permissions and is globally readable, but writing is restricted to a trusted user. This HDFS directory is only modified by the shared cache manager and the resource uploader on the node manager. Resources are spread across a set of subdirectories using the resources’s checksum: /sharedcache/a/8/9/a896857d078/foo.jar /sharedcache/5/0/f/50f11b09f87/bar.jar /sharedcache/a/6/7/a678cb1aa8f/job.jar
The shared cache manager is responsible for serving requests from the client and managing the contents of the shared cache. It looks after both the meta data as well as the persisted resources in HDFS. It is made up of two major components, a back end store and a cleaner service. The SCM runs as a separate daemon process that can be placed on any node in the cluster. This allows for administrators to start/stop/upgrade the SCM without affecting other YARN components (i.e. the resource manager or node managers).
The back end store is responsible for maintaining and persisting metadata about the shared cache. This includes the resources in the cache, when a resource was last used and a list of applications that are currently using the resource. The implementation for the backing store is pluggable and it currently uses an in-memory store that recreates its state after a restart.
The cleaner service maintains the persisted resources in HDFS by ensuring that resources that are no longer used are removed from the cache. It scans the resources in the cache periodically and evicts resources if they are both stale and there are no live applications currently using the application.
The shared cache uploader is a service that runs on the node manager and adds resources to the shared cache. It is responsible for verifying a resources checksum, uploading the resource to HDFS and notifying the shared cache manager that a resource has been added to the cache. It is important to note that the uploader service is asynchronous from the container launch and does not block the startup of a yarn application. In addition adding things to the cache is done in a best effort way and does not impact running applications. Once the uploader has placed a resource in the shared cache, YARN uses the normal node manager localization mechanism to make resources available to the application.
To support the YARN shared cache, an application must use the shared cache client during application submission. The shared cache client returns a URL corresponding to a resource if it is in the shared cache. To use the cached resource, a YARN application simply uses the cached URL to create a LocalResource object and sets setShouldBeUploadedToSharedCache to true during application submission.
For example, here is how you would create a LocalResource using a cached URL: String localPathChecksum = sharedCacheClient.getFileChecksum(localPath); URL cachedResource = sharedCacheClient.use(appId, localPathChecksum); LocalResource resource = LocalResource.newInstance(cachedResource, LocalResourceType.FILE, LocalResourceVisibility.PUBLIC size, timestamp, null, true);
An administrator can initially set up the shared cache by following these steps:
The configuration parameters can be found in yarn-default.xml and should be set in the yarn-site.xml file. Here are a list of configuration parameters and their defaults:
Name | Description | Default value |
---|---|---|
yarn.sharedcache.enabled | Whether the shared cache is enabled | false |
yarn.sharedcache.root-dir | The root directory for the shared cache | /sharedcache |
yarn.sharedcache.nested-level | The level of nested directories before getting to the checksum directories. It must be non-negative. | 3 |
yarn.sharedcache.store.class | The implementation to be used for the SCM store | org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore |
yarn.sharedcache.app-checker.class | The implementation to be used for the SCM app-checker | org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker |
yarn.sharedcache.store.in-memory.staleness-period-mins | A resource in the in-memory store is considered stale if the time since the last reference exceeds the staleness period. This value is specified in minutes. | 10080 |
yarn.sharedcache.store.in-memory.initial-delay-mins | Initial delay before the in-memory store runs its first check to remove dead initial applications. Specified in minutes. | 10 |
yarn.sharedcache.store.in-memory.check-period-mins | The frequency at which the in-memory store checks to remove dead initial applications. Specified in minutes. | 720 |
yarn.sharedcache.admin.address | The address of the admin interface in the SCM (shared cache manager) | 0.0.0.0:8047 |
yarn.sharedcache.admin.thread-count | The number of threads used to handle SCM admin interface (1 by default) | 1 |
yarn.sharedcache.webapp.address | The address of the web application in the SCM (shared cache manager) | 0.0.0.0:8788 |
yarn.sharedcache.cleaner.period-mins | The frequency at which a cleaner task runs. Specified in minutes. | 1440 |
yarn.sharedcache.cleaner.initial-delay-mins | Initial delay before the first cleaner task is scheduled. Specified in minutes. | 10 |
yarn.sharedcache.cleaner.resource-sleep-ms | The time to sleep between processing each shared cache resource. Specified in milliseconds. | 0 |
yarn.sharedcache.uploader.server.address | The address of the node manager interface in the SCM (shared cache manager) | 0.0.0.0:8046 |
yarn.sharedcache.uploader.server.thread-count | The number of threads used to handle shared cache manager requests from the node manager (50 by default) | 50 |
yarn.sharedcache.client-server.address | The address of the client interface in the SCM (shared cache manager) | 0.0.0.0:8045 |
yarn.sharedcache.client-server.thread-count | The number of threads used to handle shared cache manager requests from clients (50 by default) | 50 |
yarn.sharedcache.checksum.algo.impl | The algorithm used to compute checksums of files (SHA-256 by default) | org.apache.hadoop.yarn.sharedcache.ChecksumSHA256Impl |
yarn.sharedcache.nm.uploader.replication.factor | The replication factor for the node manager uploader for the shared cache (10 by default) | 10 |
yarn.sharedcache.nm.uploader.thread-count | The number of threads used to upload files from a node manager instance (20 by default) | 20 |