In this article, we take a high-level look at Hadoop and its ecosystem.
Apache Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. The following modules make up the Hadoop framework:
1. Hadoop Common: contains libraries and utilities needed by other Hadoop modules.
2. Hadoop Distributed File System (HDFS): a distributed file-system, which provides very high aggregate bandwidth across the cluster.
3. Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications.
4. Hadoop MapReduce: a programming model for large scale data processing.
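To make the MapReduce programming model concrete, here is a minimal in-memory sketch of its three phases (map, shuffle, reduce) applied to the classic word-count problem. This is an illustration of the model only, not the Hadoop API; all function names are our own.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values for one key.
    return key, sum(values)

def word_count(lines):
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# word_count(["hadoop stores data", "hadoop processes data"])
# → {"hadoop": 2, "stores": 1, "data": 2, "processes": 1}
```

In real Hadoop MapReduce the map and reduce functions run in parallel on many machines, and the shuffle moves data across the network; the logical flow, however, is exactly this.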
HDFS
The Hadoop Distributed File System (HDFS) is the distributed file system of the Hadoop framework. Running on large clusters of commodity machines, it provides high scalability, streaming access, throughput, and reliability, and it can store massive amounts of data under a single-writer/multiple-reader model.
A few HDFS concepts:
- Block: A disk has a block size, which is the minimum amount of data that it can read or write. The block size in HDFS is 64 MB by default. Each block is replicated to a small number of physically separate machines (typically three).
- NameNode and DataNode: An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (master) and a number of datanodes (workers).
- NameNode: The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located. An important caveat: the NameNode is a single point of failure.
- DataNode: A DataNode is a commodity (less expensive) machine that stores large amounts of data. It executes commands issued by the NameNode, such as the physical creation, deletion, and replication of blocks, and performs the low-level I/O operations that serve HDFS client requests. By nature, the DataNode is a worker: it sends a heartbeat to the NameNode every three seconds to report that it is healthy, and it periodically sends a block report listing which blocks it holds, from which the NameNode learns which blocks belong to which file. DataNodes also support pipelining, forwarding data being written to the next DataNode in the same cluster.
- Secondary NameNode: This node is also known as the CheckpointNode or helper node. It is a separate, highly reliable machine with plenty of CPU power and RAM. It periodically merges the NameNode's namespace image with its edit log to produce a fresh checkpoint of the filesystem metadata; despite the name, it is not a hot standby for the NameNode.
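The block and replication figures above lend themselves to simple arithmetic. The sketch below, using the defaults mentioned in the text (64 MB blocks, replication factor 3), shows how a file is divided into blocks and how much raw cluster storage it consumes; the function names are illustrative only.

```python
import math

BLOCK_SIZE_MB = 64   # HDFS default block size, per the text above
REPLICATION = 3      # default replication factor

def block_count(file_size_mb):
    # A file occupies ceil(size / block_size) blocks; the last block
    # may be smaller than the block size (HDFS does not pad it).
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

def raw_storage_mb(file_size_mb):
    # Every block is stored REPLICATION times across the cluster.
    return file_size_mb * REPLICATION

# Example: a 200 MB file spans 4 blocks (64 + 64 + 64 + 8 MB) and,
# with three replicas, consumes 600 MB of raw cluster storage.
```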
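The heartbeat mechanism described above can also be sketched in a few lines. This is a simplified model, not the Hadoop implementation: the class and method names are hypothetical, and the 630-second dead-node timeout is the commonly cited Hadoop default (10.5 minutes), derived from its recheck-interval and heartbeat settings.

```python
HEARTBEAT_INTERVAL_S = 3   # DataNodes heartbeat every 3 seconds
DEAD_NODE_TIMEOUT_S = 630  # assumed default: 10.5 minutes without a heartbeat

class NameNodeMonitor:
    """Toy model of the NameNode tracking DataNode liveness."""

    def __init__(self):
        self.last_seen = {}  # datanode id -> time of last heartbeat

    def heartbeat(self, datanode_id, now):
        # Record that this DataNode checked in at time `now` (seconds).
        self.last_seen[datanode_id] = now

    def dead_nodes(self, now):
        # Any node silent for longer than the timeout is considered dead;
        # the real NameNode would then re-replicate its blocks elsewhere.
        return [dn for dn, t in self.last_seen.items()
                if now - t > DEAD_NODE_TIMEOUT_S]
```

A node that misses a heartbeat or two is not immediately declared dead; the long timeout avoids needless re-replication during transient network hiccups.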