Sunday, January 4, 2015

Follow up on HDFS Client Bandwidth Utilisation

In a previous post, I explained the source of unexpected bandwidth consumption in the HDFS client.  This is a follow-up to that post.  Sadly, I still do not have a new solution for keeping bandwidth utilization low for random "small" reads with the HDFS client, but I have new insight into how the HDFS client protocol works.

In my previous post, I showed that in Hadoop 2.0.0 (CDH 4.x), when a client reads data in an HDFS block (from the beginning of the file or after seeking), the data-node keeps sending data up to the end of that block.  To avoid that, one can use read(long, byte[], int, int) from org.apache.hadoop.fs.FSDataInputStream, but this puts the burden of buffer management on the user.  I mentioned the possibility of using the TCP window size to solve that; how to do this is described in the first section below.  It does not work as well as I had hoped, but I learned new things, which are explained in the second section.
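To illustrate what "the burden of buffer management" means, here is a minimal sketch of the loop a caller ends up writing around a positional read.  PositionalSource is a hypothetical stand-in interface with the same shape as read(long, byte[], int, int), and inMemory is a hypothetical test source; neither is part of Hadoop.

```java
import java.io.EOFException;
import java.io.IOException;

/** Hypothetical stand-in with the same shape as read(long, byte[], int, int). */
interface PositionalSource {
    // May return fewer bytes than requested, or -1 when there is no more data.
    int read(long position, byte[] buffer, int offset, int length) throws IOException;
}

final class PositionalReadSketch {
    /** Allocates a buffer and keeps calling read until exactly `length` bytes are in it. */
    static byte[] readExactly(PositionalSource in, long position, int length) throws IOException {
        byte[] buffer = new byte[length];
        int done = 0;
        while (done < length) {
            int n = in.read(position + done, buffer, done, length - done);
            if (n < 0) {
                throw new EOFException("end of data after " + done + " bytes");
            }
            done += n;
        }
        return buffer;
    }

    /** In-memory source returning at most `chunk` bytes per call, to exercise the loop. */
    static PositionalSource inMemory(byte[] data, int chunk) {
        return (pos, buf, off, len) -> {
            if (pos >= data.length) {
                return -1;
            }
            int n = (int) Math.min(Math.min(len, chunk), data.length - pos);
            System.arraycopy(data, (int) pos, buf, off, n);
            return n;
        };
    }
}
```

The caller owns the buffer, tracks the offset, and handles short reads, which is exactly the bookkeeping a plain streaming read hides.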

Lowering the TCP Window Size:

The TCP window size is how one peer advertises to the other how much data it is able to accept.  Every TCP segment carries a window size field, which is part of TCP flow control.  I naively thought that lowering the window size would stop the data-node from flooding me with data.  The rest of this section explains how I tried (and failed) to do that; if you are only interested in the findings, skip to the next section.

In org.apache.hadoop.fs.FSDataInputStream, the function that opens a TCP connection to a data-node is shown below.  By adding one line, I was able to lower the TCP window size of this connection.
private Peer newTcpPeer(InetSocketAddress addr) throws IOException {
  Socket sock = null;
  try {
    sock = dfsClient.socketFactory.createSocket();
    sock.setReceiveBufferSize(1024); // <<<===--- I added this line.
    NetUtils.connect(sock, addr,
But using this version of FSDataInputStream does not lead to significant bandwidth savings.  Examining the exchange with the data-node with tcpdump, I could see that the window size was indeed lowered, but that did not prevent a lot of useless data from arriving from the data-node.  Less data does arrive, which explains a small saving in bandwidth, but a little over 66KB was still received from the data-node, versus ~123KB without lowering the TCP window size.  At this point, my hypothesis was that the client was reading most of this data, and I was right.  I just needed to find where.
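The receive-buffer request can be tried outside Hadoop with a plain java.net.Socket.  This is only a sketch: the class and method names are made up for illustration, and the value passed to setReceiveBufferSize is a hint, so the operating system is free to grant a different size (which is why the sketch reads the value back instead of assuming 1024).

```java
import java.io.IOException;
import java.net.Socket;

final class ReceiveBufferSketch {
    /** Requests a receive buffer size on an unconnected socket and returns what the OS granted. */
    static int grantedReceiveBuffer(int requestedBytes) throws IOException {
        try (Socket sock = new Socket()) {              // not connected yet, as in newTcpPeer
            sock.setReceiveBufferSize(requestedBytes);  // a hint to the OS, not a guarantee
            return sock.getReceiveBufferSize();         // the size actually in effect
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("requested 1024, granted " + grantedReceiveBuffer(1024));
    }
}
```

If the granted size is larger than the requested one, the advertised window will be larger too, which is worth checking before drawing conclusions from a tcpdump.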

HDFS Transfers Data in Packets of 64KB:

While still trying to make sense of why reducing the TCP receive window does not save bandwidth, and trying to confirm that the client reads more data than it uses, I turned my attention to the org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver class.  Reading the doRead method made it clear: the data-node sends data in packets of 64KB (packet is the name used in the code, hence the name of the class) and the HDFS API reads the whole packet before feeding the first byte of data to the reader.

So now it makes sense: even with the TCP window size lowered, the client still had to read one whole packet, which matches the ~66KB seen in the tcpdump.
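One possible accounting for "a little over 66KB" when the payload is 64KB is the checksum data that travels with each packet.  The sketch below assumes one 4-byte checksum per 512 bytes of data; I believe these are the HDFS defaults, but they are assumptions here, not something measured in the tcpdump, and packet, TCP and IP headers are ignored.

```java
final class PacketSizeEstimate {
    /** Payload plus one checksum per chunk; all headers are ignored. */
    static int bytesWithChecksums(int payloadBytes, int bytesPerChecksum, int checksumSize) {
        int chunks = (payloadBytes + bytesPerChecksum - 1) / bytesPerChecksum; // round up
        return payloadBytes + chunks * checksumSize;
    }

    public static void main(String[] args) {
        // Assumed values: 64KB payload, 512 bytes per checksum, 4-byte checksums.
        System.out.println(bytesWithChecksums(64 * 1024, 512, 4)); // prints 66048
    }
}
```

Under those assumptions one packet carries 65536 + 128 * 4 = 66048 bytes, which is in the same range as the figure observed above.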

I was not able to find any way for the client to ask for smaller packets.  I could modify the server code to send smaller packets, but that would apply to all clients and might not be optimal in all cases.  A solution would be for the client to be able to tell the server which packet size it wants, but this would need a protocol change.

Another solution could be, instead of reading the whole packet, to read it lazily from the local operating system's TCP buffer.  This would allow the TCP receive window tuning to work.  I might hack on this one day.
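The difference between the two reading strategies can be simulated without a network.  In this sketch, which is not the PacketReceiver code, a counting stream stands in for the socket, and all names are hypothetical.  It ignores checksums and the real packet layout, both of which a real lazy reader would have to deal with.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class LazyPacketSketch {
    static final int PACKET_SIZE = 64 * 1024;

    /** Counts bytes pulled from the wrapped stream, standing in for bytes taken off the socket. */
    static final class CountingStream extends FilterInputStream {
        long consumed = 0;

        CountingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                consumed++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                consumed += n;
            }
            return n;
        }
    }

    /** Returns how many bytes leave the simulated socket to serve a read of `wanted` bytes. */
    static long bytesPulled(int wanted, boolean wholePacketFirst) throws IOException {
        CountingStream net = new CountingStream(new ByteArrayInputStream(new byte[PACKET_SIZE]));
        // Eager: buffer the full packet first. Lazy: take only what the caller asked for.
        int toRead = wholePacketFirst ? PACKET_SIZE : wanted;
        new DataInputStream(net).readFully(new byte[toRead]);
        return net.consumed;
    }
}
```

With a 100-byte request, the eager strategy drains 65536 bytes from the simulated socket while the lazy one drains 100.  With a real socket, the bytes left unread would stay in the kernel buffer, and a small receive window could then push back on the sender.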

That is all for now: no new solution for reducing the bandwidth used by the HDFS client, but a better understanding of how HDFS sends data over TCP (in 64KB packets).
