HDFS is the Hadoop Distributed File System. It is distributed in that it uses a multitude of machines to implement its functionality. Contrast that with NTFS, FAT32, ext3, etc., which are all single-machine filesystems.
HDFS is architected such that the metadata, i.e. the information about file names, directories, permissions, etc., is separated from the user data. HDFS consists of the NameNode, which is HDFS’s metadata server, and DataNodes, where user data is stored. There can be only one active instance of the NameNode. A number of DataNodes (from a handful to several thousand) can be part of the HDFS namespace served by this single NameNode.
Here is how a client RPC request flows through the Hadoop HDFS NameNode. This pertains to the Hadoop trunk code base as of Dec 2, 2012, i.e. a few months after Hadoop 2.0.2-alpha was released.
The Hadoop NameNode receives requests from HDFS clients in the form of Hadoop RPC requests over a TCP connection. Typical client requests include mkdir, getBlockLocations, create file, etc. Remember – HDFS separates metadata from actual file data, and the NameNode is the metadata server. Hence, these requests are pure metadata requests – no data transfer is involved. The following diagram traces the path of an HDFS client request through the NameNode. The various thread pools used by the system, the locks taken and released by these threads, the queues used, etc. are described in detail below.
- As shown in the diagram, a Listener object listens on the TCP port serving RPC requests from clients. It accepts new connections from clients and adds them to the Server object’s connectionList.
- Next, a number of RPC Reader threads read requests from the connections in connectionList, decode the RPC requests, and add them to the rpc call queue – Server.callQueue.
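The handoff from the Reader threads to the Handler threads can be modeled with a bounded blocking queue. This is a minimal, hypothetical sketch (plain strings stand in for decoded RPC calls, and the names Call and callQueue are illustrative, not Hadoop’s exact classes); a bounded queue means slow handlers naturally push back on the readers.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class CallQueueSketch {
    // Stand-in for Hadoop's Server.Call: a decoded RPC request
    static class Call {
        final int id;
        final String payload;
        Call(int id, String payload) { this.id = id; this.payload = payload; }
    }

    // Stand-in for Server.callQueue: bounded, so readers block when handlers lag
    static final BlockingQueue<Call> callQueue = new ArrayBlockingQueue<>(100);

    public static void main(String[] args) throws InterruptedException {
        // A "reader" thread decodes requests off connections and enqueues them
        Thread reader = new Thread(() -> {
            try {
                callQueue.put(new Call(1, "mkdir /foo"));
                callQueue.put(new Call(2, "getBlockLocations /bar"));
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        reader.start();

        // A "handler" thread dequeues calls in FIFO order and processes them
        Call c = callQueue.take();
        System.out.println("handling call " + c.id + ": " + c.payload);
        c = callQueue.take();
        System.out.println("handling call " + c.id + ": " + c.payload);
        reader.join();
    }
}
```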
- Now, the actual worker threads kick in – these are the Handler threads. The Handlers pick up RPC calls from Server.callQueue and process them. The processing involves the following:
  - First, grab the write lock for the namespace
  - Change the in-memory namespace
  - Write to the in-memory FSEdits log (journal)
  - Now, release the write lock on the namespace. Note that the journal has not been sync’d yet – this means we cannot return success to the RPC client yet
- Next, each Handler thread calls logSync. Upon returning from this call, it is guaranteed that the logfile modifications have been sync’d to disk. Exactly how this is guaranteed is messy. Here are the details:
  - Every time an edit entry is written to the edits log, a unique txid is assigned for this specific edit. The Handler retrieves this txid and saves it; it is later used to verify whether this specific edit log entry has been sync’d to disk
  - When a Handler calls logSync, it first compares its saved txid against the txid of the last edit entry sync’d to disk. If the Handler’s txid is less than or equal to the last sync’d txid, its edit is already durable, so the Handler can mark the RPC call as complete. Otherwise, the Handler has to do one of the following:
    - Grab the sync lock and sync all pending transactions itself
    - If it cannot grab the sync lock, wait 1000ms and try again in a loop
  - At this point, the log entry for the transaction made by this Handler has been persisted. The Handler can now mark the RPC as complete.
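The lock-edit-release-sync ordering above can be sketched as follows. This is a simplified, hypothetical model – the names fsLock, lastSyncedTxid, applyEdit, and logSync are illustrative, not Hadoop’s actual API – but it captures the key design point: the namespace write lock is released before the slow disk sync, and a Handler may find that another thread has already sync’d its edit.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class HandlerFlowSketch {
    // Stand-in for the namespace lock held while mutating in-memory state
    static final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
    static long nextTxid = 0;        // assigned under the write lock
    static long lastSyncedTxid = -1; // advanced by whichever thread performs the sync
    static final Object syncLock = new Object();

    static long applyEdit(String op) {
        fsLock.writeLock().lock();
        try {
            // 1. mutate the in-memory namespace (elided)
            // 2. append to the in-memory edit log, receiving a unique txid
            return nextTxid++;
        } finally {
            fsLock.writeLock().unlock(); // 3. release BEFORE the slow disk sync
        }
    }

    static void logSync(long myTxid) {
        synchronized (syncLock) {
            // If another thread's sync already covered our txid, we're done
            if (lastSyncedTxid >= myTxid) return;
            // 4. otherwise sync all buffered transactions to disk (elided), then:
            lastSyncedTxid = nextTxid - 1;
        }
    }

    public static void main(String[] args) {
        long txid = applyEdit("mkdir /foo");
        logSync(txid); // returns only once the edit is durable
        System.out.println("txid " + txid + " synced, lastSyncedTxid=" + lastSyncedTxid);
    }
}
```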
- Now, the single Responder thread picks up completed RPCs and returns the result of each RPC call to the RPC client. Note that the Responder thread uses NIO to asynchronously send responses back to waiting clients. Hence one thread is sufficient.
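The reason one Responder thread suffices is that an NIO Selector lets a single thread multiplex writes across many channels, writing to each only when it is ready. Here is a minimal, hypothetical sketch of that pattern (a Pipe stands in for a client socket; the pending response is attached to the selection key, roughly as the Responder attaches unsent response buffers to connections):

```java
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class ResponderSketch {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        Pipe pipe = Pipe.open();            // stands in for a client connection
        pipe.sink().configureBlocking(false);
        // Register interest in writability, attaching the pending response bytes
        pipe.sink().register(selector, SelectionKey.OP_WRITE,
                ByteBuffer.wrap("response for call 1".getBytes()));

        selector.select();                  // block until some channel is writable
        for (SelectionKey key : selector.selectedKeys()) {
            ByteBuffer pending = (ByteBuffer) key.attachment();
            ((Pipe.SinkChannel) key.channel()).write(pending); // non-blocking write
            if (!pending.hasRemaining()) {
                key.cancel();               // response fully sent; stop watching
            }
        }
        System.out.println("responder flushed all pending responses on one thread");
    }
}
```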
There is one thing about this design that bothers me:
- The threads that wait for their txid to sync sleep 1000ms, check, sleep 1000ms, check, and continue with this poll. It may make sense to remove the polling mechanism and replace it with an actual wait/notify mechanism.
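A wait/notify version of that wait could look like the following. This is a hypothetical sketch, not Hadoop code – the names awaitSynced and markSynced are made up – but it shows the idea: waiters block on a monitor with no fixed timeout, and the syncing thread wakes them the moment the sync lands, eliminating the up-to-1000ms polling latency.

```java
public class SyncNotifySketch {
    static final Object monitor = new Object();
    static long lastSyncedTxid = -1;

    // A handler waits here until the sync has covered its txid
    static void awaitSynced(long myTxid) throws InterruptedException {
        synchronized (monitor) {
            while (lastSyncedTxid < myTxid) {
                monitor.wait();     // woken promptly whenever a sync completes
            }
        }
    }

    // The syncing thread calls this after each successful disk sync
    static void markSynced(long upToTxid) {
        synchronized (monitor) {
            lastSyncedTxid = upToTxid;
            monitor.notifyAll();    // wake every handler waiting on a txid
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            try {
                awaitSynced(5);
                System.out.println("txid 5 is durable, RPC can complete");
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        waiter.start();
        Thread.sleep(50);   // let the waiter block first (demo only)
        markSynced(10);     // this sync covers txids 0..10, including 5
        waiter.join();
    }
}
```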
That’s all in this writeup, folks.