Apache Hadoop HDFS

HDFS stands for Hadoop Distributed File System. It is a highly available, distributed file system for storing very large amounts of data. Data is stored on multiple nodes within a cluster. It is achieved by dividing files into chuncks which are fixed-length blocks. Chunks are distributed and replicated across the cluster’s nodes.

Features

The main features of HDFS are:

  • Fault-tolerance: Each file is split into blocks which are replicated across different machines, based on a replication factor, ensuring that if one machine fails, the data in its blocks can be served by another machine.
  • Write Once Read Many: Data is transferred as a continuous stream when an HDFS client requests it. This implies that HDFS does not wait to read the entire file and sends the data as soon as it reads it, which makes data processing efficient. On the other hand, HDFS is immutable: no update/modification can be performed on a file hosted in HDFS.
  • Scalable: There is no limitation on the number of machines in terms of storage capacity. Also note that HDFS is designed for large files. Large numbers of small files consume unused RAM space on the master nodes and can reduce the performance of HDFS operations.
  • Highly Available: HDFS is built with an automated mechanism to detect and recover from faults. Its distributed structure makes it resilient.

HDFS architecture

HDFS is optimized for storing large files, from a few megabytes to several gigabytes and more. Files are divided into chunks. Chunks are stored at different locations in the cluster and each chunk is replicated multiple times. The size of the chunks and their number of replications is configurable. In TDP, they default to 128MB and 3 respectively. These values can have a significant impact on the performance of Hadoop. This configuration allows for parallel processing, so the number of records processed in a given amount of time, called throughput, is quite high. On the other hand, due to the large number of records present in HDFS, the time required to access the data, called latency, is also quite high. HDFS favors high throughput at the expense of low latency.

The HDFS architecture is composed of master nodes (NameNode, Secondary NameNode and Standby NameNode) and slave nodes (DataNodes). The NameNode is responsible for processing all incoming requests and organizing the storage of files and the associated metadata in the Datanodes. The system is designed for large amounts of data and can handle the storage of several million files.

Note that the Secondary NameNode is not a backup NameNode, but a dedicated node in HDFS, whose primary function is to take checkpoints of the file system metadata present on the NameNode. The Standby NameNode is a backup NameNode that comes into action in the event of an unexpected event such as a machine crash. This is a typical architecture for high-availability Hadoop clusters, where we remove the single point of failure (the NameNode) by providing the ability to run two NameNodes in an active/passive configuration.

HDFS in TDP

To speed up initial developments, the TDP team agreed to only target production environments only, which is why it is only available in HA (High-Avaibility) mode with Kerberos security enabled. In the TDP architecture, HDFS consists of these components:

  • HDFS NameNode: Processes incoming requests and manages storage.
  • HDFS DataNode: Stores data.
  • HDFS JournalNode: Daemon used in a high-availability Hadoop cluster that keeps track of the status of the active NameNode and synchronizes it with the Standby NameNode.
  • HDFS Client: An interface used to communicate with HDFS and perform basic file-related tasks. It connects directly to DataNodes to read/write block data and uses the ClientProtocol to communicate with a NameNode daemon.

Another component used by HDFS in TDP is Ranger. It is used for fine-grained access control and delegation to resources.