Hadoop Introduction
Why Hadoop?
Today, we create massive amounts of data, from social media posts and e-commerce logs to sensor readings and videos. Traditional systems struggle to store and process it because it is too big or grows too fast. This data is known as Big Data, and it must be processed to extract meaningful information. The need to process Big Data is the basic reason Hadoop was created: it provides a way to analyze Big Data and derive valuable insights from it.
What is Hadoop?
Hadoop is an open-source framework that is distributed, scalable, fault tolerant, and designed for batch processing. It can store and process vast amounts of data, commonly referred to as Big Data.
Although the framework itself is written primarily in Java, Hadoop jobs can also be written in languages such as C, C++, Perl, Python, and Ruby. Its efficiency comes from its ability to run jobs on multiple machines simultaneously, using a parallel processing model.
Developed by Apache, Hadoop provides a large-scale distributed batch processing infrastructure for handling vast amounts of data. It can scale to hundreds or even thousands of computers, each equipped with multiple processing cores, and it distributes heavy workloads efficiently across this network of computers.
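To make this parallel batch processing model concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (closely following the standard Hadoop tutorial example). Each mapper emits a (word, 1) pair for every word in its slice of the input, and the reducers sum the counts per word; the input and output paths are passed as command-line arguments and the class names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs in parallel on each input split and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: receives all counts emitted for a given word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this would typically be submitted with something like hadoop jar wordcount.jar WordCount /input /output, and the framework takes care of splitting the input and running the mappers and reducers in parallel across the cluster.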
The Hadoop framework consists of three core components:
- HDFS (Hadoop Distributed File System) - the storage layer of Hadoop, responsible for storing large amounts of data across the cluster (see the short sketch after this list).
- MapReduce - the data processing layer of Hadoop, responsible for processing vast quantities of data on the cluster.
- YARN (Yet Another Resource Negotiator) - manages and schedules resources across the cluster.
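As a small illustration of the storage layer, the sketch below uses Hadoop's Java FileSystem API to write a file into HDFS and read it back. The cluster address is taken from the core-site.xml configuration on the classpath, and the /user/demo/hello.txt path is only an assumed example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    // Loads fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Example path; adjust to a directory your user can write to.
    Path path = new Path("/user/demo/hello.txt");

    // Write a small file into HDFS, overwriting it if it already exists.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.writeUTF("Hello, HDFS!");
    }

    // Read the same file back to confirm the round trip.
    try (FSDataInputStream in = fs.open(path)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}
```

Behind this simple API, HDFS splits large files into blocks, spreads the blocks across DataNodes, and tracks their locations on the NameNode.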
Advantages -
- Scalable - Hadoop is a highly scalable storage platform that can store and distribute very large data sets across hundreds of systems or servers operating in parallel. Hadoop supports horizontal scalability, allowing new nodes to be added during processing without any system downtime.
- Cost effective - Hadoop offers a cost-effective storage solution for businesses' exploding data sets, since it runs on clusters of commodity hardware.
- Flexible - Hadoop effectively manages both structured and unstructured data, regardless of its encoding or format.
- Faster - Hadoop's storage is based on a distributed file system, and the data processing tools typically run on the same servers where the data resides (data locality), which leads to significantly faster data processing.
- Fault Tolerant - Data sent to an individual node is also replicated on other nodes within the same cluster. If the individual node fails to process the data, other available nodes in the cluster can take over the processing.
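To illustrate the fault-tolerance point, the sketch below checks and changes the replication factor of a file in HDFS. By default HDFS keeps three copies of every block (the dfs.replication setting); the file path and the target factor of 5 used here are only example values, and the file is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default replication factor for files created by this client (3 is the HDFS default).
    conf.set("dfs.replication", "3");

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/demo/hello.txt"); // example path, assumed to exist

    // Report the replication factor currently recorded for the file.
    short current = fs.getFileStatus(path).getReplication();
    System.out.println("Current replication factor: " + current);

    // Ask the NameNode to keep five copies of this file's blocks instead.
    boolean accepted = fs.setReplication(path, (short) 5);
    System.out.println("Replication change accepted: " + accepted);

    fs.close();
  }
}
```

Because copies of each block live on several nodes, the loss of a single node does not lose data, and work scheduled on a failed node can be re-run on another node that holds a replica.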
Disadvantages -
- Security Concerns - The Hadoop security model is disabled by default because of its complexity, and there is no encryption at the storage or network level out of the box.
- Vulnerable By Nature - The framework is primarily written in Java, a widely used programming language. However, Java has been heavily exploited, leading to numerous security breaches.
- Not Fit for Small Data - Due to its high-capacity design, HDFS is inefficient at handling large numbers of small files, since the metadata for every file is held in the NameNode's memory. Consequently, it is not recommended for organizations dealing with small amounts of data.
- Potential Stability Issues - Like much open-source software, Hadoop has had its share of stability issues.
History -
- Hadoop grew out of the Apache Nutch project and was split out into its own project under the Apache Software Foundation in early 2006.
- Its design was inspired by Google's research on the Google File System and MapReduce, published in the early 2000s.
- Named after co-founder Doug Cutting's son's toy elephant, Hadoop version 0.1.0 was released in April 2006, with core components like HDFS and MapReduce.
- Over time, new parts like YARN (resource manager) were added, making Hadoop more flexible and powerful.