Summary -

In this topic, we covered the following sections -

What is a distributed file system?

A file system is used to store data permanently. A distributed file system supports features such as concurrency, distribution, replication, and access to files on remote servers. Distributed file systems follow a network-based approach and store files across the systems on a network.

What is HDFS?

HDFS stands for Hadoop Distributed File System. It is a distributed file system designed to run on commodity hardware and to hold very large amounts of data, on the order of terabytes or petabytes. HDFS has many similarities with existing distributed file systems, but it also differs from them significantly.

HDFS is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.

HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now developed as an Apache Hadoop subproject.

HDFS is a highly fault-tolerant, self-healing distributed file system. It is designed to turn a cluster of industry-standard servers into a highly scalable pool of storage.

HDFS was developed specifically for large-scale data processing workloads where scalability, flexibility and throughput are critical. HDFS accepts data in any format regardless of schema, is optimized for high-bandwidth streaming, and scales to proven deployments of 100 PB and beyond. By default, every file is split into 64 MB blocks (128 MB in Hadoop 2.x and later), and the block size can be configured.
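The block splitting described above can be illustrated with a little arithmetic. The following is a minimal sketch, not HDFS code: the function name `split_into_blocks` is hypothetical, and the 64 MB default is taken from the text above.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=64 * MB):
    """Return the length in bytes of each block for a file of file_size bytes.

    Hypothetical helper for illustration only; it models the arithmetic,
    not the actual HDFS implementation.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full_blocks
    if remainder:
        # The last block only holds the bytes that are left over.
        blocks.append(remainder)
    return blocks

blocks = split_into_blocks(200 * MB)
print(len(blocks), [b // MB for b in blocks])  # 4 [64, 64, 64, 8]
```

A 200 MB file therefore occupies three full 64 MB blocks plus one 8 MB block. Passing a different `block_size` models a cluster where the block size has been reconfigured.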

In HDFS, each file is stored as a sequence of blocks that are kept redundantly on multiple machines. This approach provides the following -

  1. High availability
  2. Durability
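To see why redundant storage gives availability and durability, the sketch below places copies of each block on several machines and then checks which blocks stay readable after machines fail. It is a toy model under stated assumptions: the round-robin placement, the function names, and the node names are all hypothetical, and the replication factor of 3 is used here only as a commonly seen default. Real HDFS placement is more involved.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes, round-robin.

    Illustrative placement only; this is not the HDFS placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the ids of blocks that still have at least one live replica."""
    failed = set(failed_nodes)
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]

nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(4, nodes)

print(readable_blocks(placement, ["node2"]))                    # [0, 1, 2, 3]
print(readable_blocks(placement, ["node1", "node2"]))           # [0, 1, 2, 3]
print(readable_blocks(placement, ["node1", "node2", "node3"]))  # [1, 2, 3]
```

With three copies of each block, every block survives any two machine failures in this model. Data is lost only when all machines holding a given block fail together.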

HDFS Features -

  • Supports very large files.
  • Uses distributed storage and processing.
  • Runs on commodity hardware.
  • Optimized for throughput over latency.
  • Tolerates high-latency data access in exchange for that throughput.
  • Provides streaming access to the file system data.
  • Works best with a modest number of large files; lots of small files are handled poorly.
  • Provides file authentication and permissions.

HDFS Advantages -

  1. HDFS can store large amounts of information.
  2. HDFS has a simple and robust coherency model.
  3. HDFS is scalable and provides fast access to this information.
  4. HDFS can also serve a substantial number of clients by adding more machines to the cluster.
  5. HDFS provides streaming read access.
  6. Data is written to HDFS once and can then be read many times.
  7. The overhead of caching is high enough that data should simply be re-read from the HDFS source.
  8. Failures are detected and recovered from quickly.
  9. Processing logic is moved close to the data rather than the data close to the processing logic.
  10. Portability across heterogeneous commodity hardware and operating systems.
  11. High economy by distributing data and processing across clusters of commodity personal computers.
  12. High efficiency by distributing data and logic across parallel nodes so that data is processed where it is located.
  13. High Reliability by automatically maintaining multiple copies of data and automatically redeploying processing logic in the event of failures.
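The write-once, read-many access pattern behind the coherency model can be sketched with a toy class. This is a simplified illustration, assuming a file is written, closed, and never changed afterwards. The class `WriteOnceFile` is hypothetical and is not part of any Hadoop API.

```python
class WriteOnceFile:
    """Toy model of a file that is written once and then only read."""

    def __init__(self):
        self._chunks = []
        self._closed = False

    def write(self, data: bytes) -> None:
        # Writes are only allowed until the file has been closed.
        if self._closed:
            raise PermissionError("file is closed; its contents cannot change")
        self._chunks.append(data)

    def close(self) -> None:
        self._closed = True

    def read(self) -> bytes:
        # Reads can be repeated any number of times.
        return b"".join(self._chunks)


f = WriteOnceFile()
f.write(b"first record\n")
f.write(b"second record\n")
f.close()

print(f.read() == f.read())  # True: every reader sees the same bytes

try:
    f.write(b"late update\n")
except PermissionError as err:
    print("rejected:", err)
```

Because the contents never change after `close()`, readers do not have to coordinate with writers, which is what keeps the model simple.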

HDFS Disadvantages -

The files are distributed across the system, and this behavior has some disadvantages, as listed below.

  • Data that resides on only one machine has no reliability guarantee if that machine goes down.
  • A single machine must handle an enormous number of clients if all of those clients need data stored on it.
  • Clients outside the cluster need to copy the data to their local machines before they can operate on it.

Goals of HDFS -

  1. Large distributed file system: - HDFS is intended to be a very large distributed system, on the order of 10K nodes, 100 million files and 10 PB of storage.
  2. Commodity hardware: - Files are replicated so that hardware failures can be detected, tolerated and recovered from in minimum time. This removes the need for hardware with high specifications.
  3. Optimized for batch processing: - Data locations are exposed so that computations can move to where the data resides. This provides very high aggregate bandwidth.
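The idea of moving computation to where the data resides can be sketched as a small scheduling routine. This is a minimal illustration under stated assumptions, not Hadoop's scheduler: the function `schedule_tasks`, the block ids, and the node names are hypothetical.

```python
def schedule_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that stores the block.

    block_locations: dict of block id -> list of nodes holding a copy
    free_slots:      dict of node -> number of tasks it can still accept
    Returns a dict of block id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No node with the data has capacity, so the data must travel.
            candidates = [node for node, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free task slots left")
            node, is_local = candidates[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


locations = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node1", "node3"],
    "blk_3": ["node1"],
}
slots = {"node1": 1, "node2": 1, "node3": 1}

for block, (node, is_local) in schedule_tasks(locations, slots).items():
    print(block, node, "local" if is_local else "remote")
```

In this run two of the three tasks read their block from local disk, and only `blk_3` has to be read over the network. Because most reads are local, many nodes can read in parallel, which is where the high aggregate bandwidth comes from.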