The code I am writing is using HDFS doing the following operations:
- open a file,
- seeking to a position in the file (hint: this is the wrong thing to do),
- reading a few KB, sometimes up to some MB.
During my tests, I wanted to measure how many such operations could be done from a multi-threaded client (not a data-node nor a name-node) and I got the following results:
- I am able to do ~1000 such random IOs per seconds (reading less than 1 KB.
- At this rate, I am using ~10MB/s of outbound name-node bandwidth (block size of 64MB, replication factor of 3 and filesize of ~1GB).
- I am using ~115MB/s (all the 1Gb/s) of inbound client bandwidth.
The name-node bandwidth consumption is an interesting finding by itself:
- The meta-data of a 1GB file with block size of 64MB and replication factor of 3 is ~10KB.
But the most interesting finding is that I saturate the inbound network interface of the client. Part of this comes from the name-node but the rest comes from the data-nodes. It looks like the data-nodes are sending much more data than what is needed (this is confirmed by a tcpdump). This is surprising as I am using a very small (1024) buffer size when opening the file and I am reading less than 1KB from the file after seeking, where is this data coming from ?
Note: those tests were done using CDH 4.4 which is using HDFS (Hadoop) 2.0.0.
To understand where this bandwidth consumption is coming from, I had to open the HDFS sources. This led me to the org.apache.hadoop.hdfs.DFSInputStream class. In this class, we have 2 calls to getBlockReader:
blockReader = getBlockReader(targetAddr, chosenNode, src, blk,
accessToken, offsetIntoBlock, blk.getNumBytes() - offsetIntoBlock,
buffersize, verifyChecksum, dfsClient.clientName);
reader = getBlockReader(targetAddr, chosenNode, src, block.getBlock(),
blockToken, start, len, buffersize, verifyChecksum,
The second is used when calling read(long, byte, int, int) and the first one is used when doing other reads (from the beginning of file or after a seek). Notice the "blk.getNumBytes() - offsetIntoBlock" of the first call, it will try to read up to the end of the HDFS block, filling the TCP buffer of the client operating system and consuming lot of bandwidth. Hadoop 2.0.0 basically assumes, when you are not using the special "read at position" from FSDataInputStream, that you will read up to the end of the current block. This might be good for most of HDFS use-cases but is very bad for my random IOs.
A Partial Solution:
With this knowledge, I changed my code to remove the seek+read and use read(long, byte, int, int) provided by FSDataInputStream. With this change, I have the following results:
- I am able to do ~6500 random read IOs per seconds (less than 1 KB read),
- I am using ~25MB/s of inbound client bandwidth,
- most of this is coming from the name-node (block size of 512MB).
I am still looking for a better solution to control the amount of data streamed by data-nodes. I like the idea of having some data available in the operating system TCP buffers, as this allows prefetching, but sending the whole HDFS block is too much for my use-case. I will try to play with the TCP receive window, more to come on that later.
Update 2014-01-02: playing with the TCP receive window did not help, I will explain why in a follow up post.
Update 2014-01-04: the follow up post is here.