In earlier versions of Hadoop, the JobTracker runs as a master process and coordinates all activity with the TaskTrackers. Each node runs a TaskTracker process that manages tasks on that individual node. The TaskTrackers communicate with and are controlled by the JobTracker. The JobTracker is responsible for resource management (managing the job life-cycle, tracking resource availability, coordinating with the TaskTrackers, etc.).
All in all, in this Hadoop architecture:
1. MapReduce is responsible for both cluster resource management and data processing, tightly coupling the two concerns.
2. It is limited to batch-oriented processing and cannot support interactive, real-time, graph, machine-learning, or other memory-intensive workloads.
The introduction of YARN solves these two key issues: it shifts Hadoop from a single-use to a multi-purpose system, promotes a loosely coupled architecture, and divides the two prime responsibilities, resource management and job scheduling/monitoring, into separate daemons.
In YARN, the JobTracker and TaskTracker no longer exist; they are replaced by the following components:
1. ResourceManager
2. ApplicationMaster (AM), one per application
3. NodeManager (NM), one per node
In addition to the above, YARN introduces the notion of a “container”.
Apart from YARN, HDFS is also revamped in Hadoop2 (HDFS2) and provides the following features:
1. NameNode HA
2. Snapshots
3. Federation
Hadoop1 vs Hadoop2
Let’s quickly understand more on the Hadoop2 components:
1. ResourceManager
– is primarily a pure scheduler.
– It is strictly limited to arbitrating requests for available resources in the system made by the competing applications.
– It works together with the per-node NodeManagers (NMs) and the per-application ApplicationMasters (AMs).
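To make the “pure scheduler” idea concrete, here is a minimal, purely illustrative Python sketch: a toy arbiter that grants container requests from competing applications only while capacity remains. All names here (`ToyScheduler`, `request`, `release`) are invented for illustration and are not the real YARN API.

```python
# Toy sketch of a "pure scheduler" in the spirit of the ResourceManager.
# Hypothetical names; the real RM tracks memory, CPU, queues, and more.

class ToyScheduler:
    def __init__(self, total_memory_mb):
        self.available_mb = total_memory_mb

    def request(self, app_id, memory_mb):
        """Arbitrate a resource request: grant it only if capacity remains."""
        if memory_mb <= self.available_mb:
            self.available_mb -= memory_mb
            return True   # container granted to the application
        return False      # application must wait and ask again later

    def release(self, memory_mb):
        """Resources return to the pool when an application releases them."""
        self.available_mb += memory_mb

rm = ToyScheduler(total_memory_mb=4096)
print(rm.request("app-1", 2048))  # True: capacity available
print(rm.request("app-2", 3072))  # False: only 2048 MB left
rm.release(2048)
print(rm.request("app-2", 3072))  # True: capacity freed up
```

Note that the toy scheduler does nothing but arbitrate: it neither launches processes nor tracks application state, which is exactly the separation YARN introduces.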
2. NodeManager (per-node slave)
– runs on each node in the cluster and takes direction from the ResourceManager.
– It is responsible for managing resources available on a single node.
– The NM also oversees containers’ life-cycle management and monitors the resource usage (memory, CPU) of individual containers.
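As a sketch of the monitoring duty, the snippet below flags containers whose memory usage exceeds their allocation (the real NM kills such containers). The data layout and function name are hypothetical, chosen only to illustrate the check.

```python
# Hypothetical sketch of the NodeManager's per-container memory check.
# Each entry maps a container id to (used MB, allocated MB).

def over_limit(containers):
    """Return ids of containers whose usage exceeds their allocated memory."""
    return [cid for cid, (used_mb, limit_mb) in containers.items()
            if used_mb > limit_mb]

containers = {
    "container_01": (900, 1024),   # within its 1024 MB allocation
    "container_02": (1500, 1024),  # exceeds its allocation -> flagged
}
print(over_limit(containers))  # ['container_02']
```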
Before getting into ApplicationMaster, understanding “container” is very important.
A container is a collection of physical resources such as RAM, CPU cores, and disks on a single node. A single node can host multiple containers, each with a minimum amount of memory (e.g., 512 MB or 1 GB) and CPU. The ApplicationMaster can request containers sized at any multiple of this minimum. A container is supervised by the NodeManager and scheduled by the ResourceManager. Containers may be requested and released dynamically at run time, for instance during the map and reduce phases of a job.
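The “multiple of the minimum size” rule can be sketched as a rounding step: YARN normalizes a memory request up to the next multiple of the configured minimum allocation (the `yarn.scheduler.minimum-allocation-mb` property). The function below is an illustrative approximation, not the actual scheduler code.

```python
import math

# Sketch of request normalization: round a memory request up to a
# multiple of the minimum allocation, and never below the minimum.

def normalize(requested_mb, minimum_mb=512):
    """Round a memory request up to a multiple of the minimum allocation."""
    return max(minimum_mb, math.ceil(requested_mb / minimum_mb) * minimum_mb)

print(normalize(700))  # 1024: rounded up to 2 x 512 MB
print(normalize(512))  # 512: already a multiple of the minimum
print(normalize(100))  # 512: never below the minimum allocation
```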
3. ApplicationMaster
– is an instance of a processing-framework-specific library (e.g., Storm, Spark).
– The AM negotiates resources from the ResourceManager and works with the NodeManager(s) to execute the containers and monitor their resource consumption. Generally, containers are logical holders for the processes that actually perform the work.
– The actual data processing occurs within the containers launched by the ApplicationMaster. A container grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.
– The ApplicationMaster presents the container to the NodeManager managing the host on which the container was allocated, in order to use the allocated resources for launching its tasks.
– Because each AM manages the life-cycle and fault tolerance of its own application’s resources, rather than the RM managing them centrally, the system scales to a much greater extent.
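The AM’s negotiate-then-launch flow can be sketched as a simple loop: ask the RM for a container per task, then present each container to a NodeManager for launch. Everything below (`run_application`, the stub `ask`/`launch` callables) is invented for illustration; the real protocol is the `AMRMClient`/`NMClient` API in `org.apache.hadoop.yarn.client.api`.

```python
# Purely illustrative sketch of an ApplicationMaster's negotiation loop.
# Hypothetical names; real AMs handle asynchronous allocation, retries,
# and failures, none of which are modeled here.

def run_application(tasks, ask_container, launch):
    """Negotiate one container per task, then launch the task on its host."""
    results = []
    for task in tasks:
        container = ask_container(memory_mb=1024)  # negotiate with the RM
        results.append(launch(container, task))    # present to NM and launch
    return results

# Stub RM/NM behavior for demonstration:
ask = lambda memory_mb: {"host": "node-1", "memory_mb": memory_mb}
launch = lambda c, t: f"{t}@{c['host']}"
print(run_application(["map-0", "map-1"], ask, launch))
# ['map-0@node-1', 'map-1@node-1']
```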
