Apache Hadoop YARN
YARN (Yet Another Resource Negotiator) is a framework that provides computational resources to execution engines. It was introduced with the release of Hadoop 2.0, as its Cluster Manager, and is now one of the core components of the Hadoop Ecosytem. It maintains a multi-tenant system, handles Hadoop’s high availability capabilities, and enforces security measures while monitoring and managing workloads. It improves performance by separating the processing engine and management function of MapReduce.
In Hadoop 1.0, MapReduce was responsible for resource management (Job Tracker/Task Tracker model) and data processing, which had a huge impact on batch processing times and performance. This was because a single Job Tracker was responsible for both assigning and tracking task execution progress, which caused a bottleneck in terms of scalability. In addition, Hadoop was designed to run MapReduce jobs only, and a problem arises when a non-MapReduce application tries to run in this framework.
YARN was designed to address these issues and to open up Hadoop to other types of distributed applications beyond MapReduce. Along with batch processing (MapReduce), Hadoop clusters are now able to run multiple data processing engines such as graph processing (Giraph), stream processing (Storm), in-memory processing (Spark), interactive processing (Tez) and many more. We no longer need to run multiple clusters to support different processing engines. This single-cluster approach, enabled by YARN, has several advantages:
- Optimized cluster utilization: resources are allocated efficiently according to their use.
- Less operational expenses: we only need to manage one cluster.
- Less data movement: we don’t need to move data between clusters depending on the processing tool we want to use.
This also introduces the idea of centralized resource management.
The basic design concept of YARN was to isolate the resource management and task scheduling/monitoring functions into independent daemons. The main components of the YARN architecture are as follows:
|Resource Manager||The Resource Manager (one per cluster), or RM, is the master service. It manages resources across the cluster and has the final say on resource allocation for applications.|
|Application Master||The Application Master, or AM, is an application-specific process that is started when an application is submitted to the framework. It negotiates resources with the Resource Manager, manages the status and monitors the progress of that particular application. The Application Master communicates with the Node Managers to launch application containers on the associated node (this is done by providing a Container Launch Context (CLC) that indicates all the configuration needed to run an application).|
|Node Manager||The Node Manager (one per node in the cluster), or NM, is the slave service. It is responsible for a specific node in the cluster and communicates with the Resource Manager by sending it the health status of that node. It is also responsible for managing the containers and applications running in the specific node and monitoring their resource usage.|
|Container||A container is a set of physical resources (CPU, disks, RAM, etc) used to run a process in a specific node. For example, let’s take a container with 4GB and 2 CPUs. It is launched by providing a Container Launch Context with the relevant dependencies to the Node Managers.|
|Client||It submits applications to the framework.|
Here is a diagram that summarizes the architecture of YARN
Let’s take a closer look at some of these components.
The resource manager is one of the most important components of the YARN architecture. It has two main components:
- The Scheduler: It is the service responsible for allocating resources for the applications running within the cluster. It does this according to the resource requirements of these applications as well as the capacities and queues. Its sole purpose is resource allocation and it has no responsibility for monitoring, managing and restarting applications.
- The Application Manager: It is the service responsible for job submissions (it can accept them or not). It negotiates the first container that starts the Application Master for that specific application. It is also able to restart the container in case of application or hardware failure.
It tracks the addition and deletion of nodes through the Resource Tracking Service, which receives heartbeats from node managers. It also tracks the status of these nodes (healthy or not) through the NM Liveliness Monitor and the Nodes List Manager.
Each application has its own Application Master instance that is created when the application submission is accepted. This is a specific framework that supports a large number of languages. The Application Masters within the cluster are managed by the Application Master Service.
The AM Liveliness Monitor is the service responsible for maintaining a list of the various Application Masters in the cluster. It also tracks their health by informing the Resource Manager of their last heartbeat (an Application Master will be rescheduled if no heartbeat is sent in a time interval).
We can find most of these components in TDP getting started (Resource Manager, Node Manager, Client). The App Timeline Server is also present on TDP. Its main purpose is to collect information about historical and current YARN applications. This information can be classified as follows:
- Application Specific Information: framework or application specific information, such as the number of map and reduce jobs for MapReduce applications.
- Generic Information about Completed Applications: generic information, like the list of containers, and their information, run under each application-attemps, queue-name, etc. This information is not framework specific.
Another component used by YARN on TDP is Ranger. It is used for fine-tuned access control and resource delegation.
Similar to how the NameNode was the single point of failure for HDFS. The Resource Manager is also the single point of failure for YARN. TDP is only available in High-Availability (HA) mode, meaning that there is an Active/Standby Resource Manager pair. In the event of a failure, the Standby RM is ready to take over the responsibilities of the Active RM.
The following are the 7 main steps in submitting an application in a YARN cluster:
- An application, a single job or a directed acyclic graph (DAG) of jobs, is submitted to the Resource Manager by a client.
- The Resource Manager assigns a Node Manager, based on the resource allocation in the cluster, and allocates the first container to launch the Application Master.
- The Application Master takes responsibility for executing the job and negotiates resources (in the form of containers) with the Resource Manager to run the application.
- Once the containers are allocated, the Application Master reaches out to Node Managers to launch the containers and execute the tasks.
- Node Managers launch the containers and monitor the resource usage of the tasks in those containers.
- Containers run the application tasks and the tasks report back their status and progress directly to the Application Master.
- Once all the tasks are complete, all containers, including Application Master, perform the necessary clean up and terminate.
Here is a diagram that summarizes the workflow
There are several tools available to monitor and manage YARN applications:
- YARN Web UI: A user interface for YARN Resouce Manager to monitor running applications.
- YARN CLI: Yarn command line interface to monitor applications without using a web interface.
Now that you’re familiar with YARN, you can try to submit your first jobs with this YARN tutorial.
Additionally, here is a list of articles to complement your knowledge on TDP components: