Hadoop for beginners

In this article, we will take a high-level look at Hadoop and its ecosystem.

Apache Hadoop is an open source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. Following are the modules that comprise the Hadoop framework:

1. Hadoop Common: contains libraries and utilities needed by other Hadoop modules.
2. Hadoop Distributed File System (HDFS): a distributed file-system, which provides very high aggregate bandwidth across the cluster.
3. Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications.
4. Hadoop MapReduce: a programming model for large scale data processing.

HDFS

The Hadoop Distributed File System (HDFS) is the distributed file system of the Hadoop framework. Running on large clusters of commodity machines, it provides high scalability, streaming access, high throughput, and reliability, and it can store massive amounts of data under a single-writer/multiple-reader model.

A few HDFS concepts:

  1. Block: A disk has a block size, which is the minimum amount of data that it can read or write. The block size in HDFS is 64 MB by default (128 MB in Hadoop 2 and later). Each block is replicated to a small number of physically separate machines (typically three).
  2. NameNode and DataNode: An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (master) and a number of datanodes (workers).
    1. NameNode: The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located. One important aspect of the NameNode is that it is a single point of failure.
    2. DataNode: A DataNode is a commodity (less expensive) machine that stores a large amount of data. It executes all commands driven by the NameNode, such as the physical creation, deletion, and replication of blocks, and also performs the low-level I/O operations for requests served to the HDFS client. By nature, the DataNode is a slave: it sends a heartbeat to the NameNode every three seconds, reporting its health, along with a block report. These block reports contain information about which block belongs to which file. A DataNode also enables pipelining of data and can forward data to another DataNode in the same cluster.
    3. Secondary NameNode: This node is also known as the CheckpointNode or helper node. It is a separate, highly reliable machine with plenty of CPU power and RAM. Despite its name, it is not a standby NameNode; it periodically merges the namespace image with the edit log to produce a checkpoint of the NameNode's metadata.
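The block and replication ideas above can be sketched with a few lines of plain Python. This is purely an illustration of the arithmetic, not a Hadoop API; the function name and file size are made up for the example:

```python
# Sketch: how HDFS conceptually splits a file into blocks and replicates each one.
# Names and numbers here are illustrative, not part of any Hadoop API.

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default HDFS block size
REPLICATION = 3                  # typical replication factor

def split_into_blocks(file_size_bytes):
    """Return the sizes of the blocks a file of this size would occupy."""
    full, remainder = divmod(file_size_bytes, BLOCK_SIZE)
    blocks = [BLOCK_SIZE] * full
    if remainder:
        blocks.append(remainder)  # the last block only uses what it needs
    return blocks

# A 200 MB file -> three full 64 MB blocks plus one 8 MB block.
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks))                # 4 blocks
print(len(blocks) * REPLICATION)  # 12 physical copies across the cluster
```

Note that, unlike a regular filesystem, a file smaller than one block does not occupy a full block's worth of storage; the last block only holds the remaining bytes.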

MapReduce without jargon

MapReduce
In simple terms, a list of <key, value> pairs is mapped into another list of <key, value> pairs, which is then grouped by key and reduced into a list of values.

E.g:

[A] dataset1.txt: Hadoop was created by Doug Cutting and Mike Cafarella
[B] dataset2.txt: Cutting, who was working at Yahoo! at the time, named it after his son’s toy elephant

These two data-sets [A & B] serve as input and are divided into splits. Each split yields key-value pairs: the key is the offset/line number and the value is the content of the line at that offset. The map function discards the line number and produces a (word, 1) pair for each word in the input line.

So the map phase takes (line number, text) pairs as input and emits (word, 1) pairs as output:

[(“Hadoop”, 1),(“was”, 1),(“created”, 1),(“by”, 1),(“Doug”, 1),(“Cutting”, 1),(“and”, 1),(“Mike”, 1),(“Cafarella”, 1),(“Cutting”, 1),(“who”, 1),(“was”, 1),(“working”, 1),(“at”, 1),(“Yahoo”, 1),(“at”, 1),(“the”, 1),(“time”, 1),(“named”, 1),(“it”, 1),(“after”, 1),(“his”, 1),(“son’s”, 1),(“toy”, 1),(“elephant”, 1)]

The output of the mapper contains multiple key-value pairs with the same key. Before entering the reduce phase, the MapReduce framework consolidates all the values for each key, so the input to the reducer is actually (key, list of values) pairs. Below is the output of the shuffle phase:

{“Hadoop”: [1], “was”: [1,1], “created”: [1], “by”: [1], “Doug”: [1], “Cutting”: [1,1], “and”: [1], “Mike”: [1], “Cafarella”: [1],
“who”: [1], “working”: [1], “at”: [1,1], “Yahoo”: [1], “the”: [1], “time”: [1], “named”: [1], “it”: [1], “after”: [1], “his”: [1], “son’s”: [1], “toy”: [1], “elephant”: [1]}

After the shuffle, the reducer takes the consolidated key-value pairs above, simply sums up each list of intermediate values, and produces the key and the sum as output:

[(“Hadoop”, 1),(“was”, 2),(“created”, 1),(“by”, 1),(“Doug”, 1),(“Cutting”, 2),(“and”, 1),(“Mike”, 1),(“Cafarella”, 1),(“who”, 1),(“working”, 1),(“at”, 2),(“Yahoo”, 1),(“the”, 1),(“time”, 1),(“named”, 1),(“it”, 1),(“after”, 1),(“his”, 1),(“son’s”, 1),(“toy”, 1),(“elephant”, 1)]
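The whole walk-through above can be reproduced in a few lines of plain Python that mimic the three phases. This is a simulation of the programming model on one machine, not Hadoop's actual (Java) API; the function names are ours:

```python
from collections import defaultdict

# The two input data-sets from the example above.
datasets = [
    "Hadoop was created by Doug Cutting and Mike Cafarella",
    "Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant",
]

def map_phase(line):
    # Emit a (word, 1) pair per word; strip punctuation so "Cutting," counts as "Cutting".
    return [(word.strip(",.!"), 1) for word in line.split()]

def shuffle_phase(pairs):
    # Group all values under the same key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the list of intermediate values for each key.
    return {key: sum(values) for key, values in grouped.items()}

mapped = [pair for line in datasets for pair in map_phase(line)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts["was"], counts["at"], counts["Cutting"])  # 2 2 2
```

In real Hadoop, the map and reduce functions run in parallel across many machines, and the shuffle moves intermediate pairs over the network so that all values for one key land on the same reducer; the logic, however, is exactly the one sketched here.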