Summary -

In this topic, we described about the below sections -

Hadoop is an open source framework, distributed, scalable, batch processing and fault- tolerance system that can store and process the huge amount of data (Bigdata). Hadoop efficiently stores large volumes of data on a cluster of commodity hardware.

Hadoop not only a storage system but also platform for processing large data along with storage. There are mainly five building blocks inside this runtime environment (from top to bottom) -

Architecture

Cluster: -

A Hadoop cluster is designed specifically for storing and analysing huge amounts of unstructured data in a distributed computing environment. Cluster is the set of nodes which are also known as host machines. Cluster is the hardware part of the infrastructure.

Hadoop's open source distributed processing software on low-cost commodity computers run by these clusters. Hadoop clusters are often referred to as "shared nothing" systems because the only thing that is shared between nodes is the network that connects them. Hadoop clusters are known for boosting the data analysis applications speed.

Hadoop clusters are highly scalable: If a cluster's processing data growing by volumes, new additional cluster nodes can be added to increase throughput. Hadoop clusters are highly resistant to failure because the data always copied onto multiple cluster nodes, which ensures that the data is not lost if one node fails. This is the hardware part of the infrastructure.

YARN Infrastructure: -

YARN is abbreviated as Yet Another Resource Negotiator. Apache Yarn is a part or outside of Hadoop that can act as a standalone resource manager.

YARN is the framework responsible for providing the computational resources needed for application executions. Yarn consists of two important elements are: Resource Manager and Node Manager.

Resource Manager: -

One resource manager can be assigned to one cluster per the master. Resource manager has the information where the slaves are located and how many resources they have. Resource manager runs several services.

The most important services is the Resource Scheduler that decides how to assign the resources. The Resource Manager does this with the Scheduler and Applications Manager.

Node Manager: -

More than one Node Managers can be assigned to one Cluster. Node Manager is the slave of the infrastructure. Node Manager sends a heartbeat to the Resource Manager periodically.

Node Manager takes instructions from the Yarn scheduler to decide which node should run which task. The Node Manager reports CPU, memory, disk and network usage to the Resource Manager to decide where to direct new tasks.

HDFS Federation: -

HDFS federation is the framework responsible for providing permanent, reliable and distributed storage. This is typically used for storing inputs and output (but not intermediate ones). It enables support for multiple namespaces in the cluster to improve scalability and isolation.

In order to scale the name service horizontally, federation uses multiple independent name nodes/namespaces. The name nodes are federated, i.e, the name nodes are independent and don’t require coordination with each other. The data nodes are used as common storage for blocks by all the name nodes.

Each data node registers with all the name nodes in the cluster. Data nodes send periodic heartbeats and handles commands from the name nodes.

Storage Solutions: -

Hadoop uses different storage solutions to store and process the data and the techniques will be discussed in the further chapters.

MapReduce Framework: -

MapReduce framework is the software layer implementing the MapReduce paradigm. Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). A MapReduce framework is usually composed of three steps -

  • Map: Each node applies the map function to the local data and writes the output to a temporary storage. A master node ensures that only one copy of redundant input data is processed.
  • Shuffle: Each node redistribute data based on the output keys, such that all data belonging to one key is located on the same node.
  • Reduce: Each node processes each group of output data, per key, in parallel.

The YARN infrastructure and the HDFS federation are completely decoupled and independent. The YARN provides resources for running an application while the HDFS federation provides storage. The MapReduce framework is only one which runs on top of YARN.