The output below was produced by running hdfs dfs -ls, which appears to hang for several minutes (15 failover retries with a 20-second connect timeout each) before failing with a socket timeout exception.
[gpadmin@etl1 ~]$ hdfs dfs -ls /
Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
>>>KinitOptions cache name is /tmp/krb5cc_500
>>>DEBUG client principal is gpadmin/[email protected]
>>>DEBUG server principal is krbtgt/[email protected]
>>>DEBUG key type: 18
>>>DEBUG auth time: Mon Aug 18 11:57:12 PDT 2014
>>>DEBUG start time: Mon Aug 18 11:57:12 PDT 2014
>>>DEBUG end time: Tue Aug 19 11:57:12 PDT 2014
>>>DEBUG renew_till time: Mon Aug 18 11:57:12 PDT 2014
>>> CCacheInputStream: readFlags() FORWARDABLE; RENEWABLE; INITIAL;
>>>DEBUG client principal is gpadmin/[email protected]
>>>DEBUG server principal is X-CACHECONF:/krb5_ccache_conf_data/fast_avail/krbtgt/[email protected]
>>>DEBUG key type: 0
>>>DEBUG auth time: Wed Dec 31 16:00:00 PST 1969
>>>DEBUG start time: null
>>>DEBUG end time: Wed Dec 31 16:00:00 PST 1969
>>>DEBUG renew_till time: null
>>> CCacheInputStream: readFlags()
14/08/18 12:12:48 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo. Not retrying because failovers (15) exceeded maximum allowed (15)
org.apache.hadoop.net.ConnectTimeoutException: Call From etl1.phd.local/10.110.127.23 to hdw1.phd.local:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/10.110.127.23:45865 remote=hdw1.phd.local/172.28.17.4:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
    at org.apache.hadoop.ipc.Client.call(Client.java:1351)
    at org.apache.hadoop.ipc.Client.call(Client.java:1300)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:688)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1796)
    at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1116)
    at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1112)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1112)
    at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1701)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1647)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1622)
    at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)
    at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:224)
    at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:207)
    at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
    at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:255)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:305)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/10.110.127.23:45865 remote=hdw1.phd.local/172.28.17.4:8020]
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:547)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:642)
    at org.apache.hadoop.ipc.Client$Connection.access$2600(Client.java:314)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1399)
    at org.apache.hadoop.ipc.Client.call(Client.java:1318)
    ... 28 more
ls: Call From etl1.phd.local/10.110.127.23 to hdw1.phd.local:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/10.110.127.23:45865 remote=hdw1.phd.local/172.28.17.4:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
In this case, the user's Kerberos principal includes the hostname "etl1" (etl1.phd.local). When the Kerberos principal includes a hostname, the HDFS client resolves that hostname to an IP address and binds its socket to the corresponding local interface, regardless of what the NameNode's hostname resolves to. This is by design, as per HADOOP-7215; a sketch of the logic follows the klist output below.
[gpadmin@etl1 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: gpadmin/[email protected]

Valid starting     Expires            Service principal
08/18/14 11:57:12  08/19/14 11:57:12  krbtgt/[email protected]
    renew until 08/18/14 11:57:12
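To make the bind behavior concrete, here is a minimal Java sketch of the bind-then-connect sequence, using the hostnames and addresses from the log above. It is not the actual Hadoop source (the real logic lives in the connection setup of org.apache.hadoop.ipc.Client, visible as Client$Connection.setupConnection in the stack trace); the class name and the principal parsing here are illustrative assumptions.

import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

// Minimal sketch (not Hadoop source) of the HADOOP-7215 behavior:
// bind the client socket to the interface named in the Kerberos
// principal, then connect to the NameNode.
public class PrincipalBindSketch {
    public static void main(String[] args) throws Exception {
        String principal = "gpadmin/[email protected]";

        // Host component of the principal: "etl1"
        String host = principal.substring(principal.indexOf('/') + 1,
                                          principal.indexOf('@'));

        // On this ETL node, "etl1" resolves to 10.110.127.23
        InetAddress localAddr = InetAddress.getByName(host);

        try (Socket socket = new Socket()) {
            // Bind the outgoing socket to that local interface, regardless
            // of which interface the route to the NameNode would pick.
            socket.bind(new InetSocketAddress(localAddr, 0));

            // Connect to the NameNode RPC port; with the asymmetric routing
            // described below, this is where the 20000 ms timeout fires.
            socket.connect(new InetSocketAddress("hdw1.phd.local", 8020), 20000);
        }
    }
}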
In the diagram, the ETL client attempts to connect from vlan0 to the NameNode on vlan 123. The ETL node is able to reach the NameNode, but the connection eventually times out: the NameNode sends the TCP SYN/ACK back through the public default route, which does not know how to get back to vlan0 on the 172 subnet.
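One way to confirm this kind of asymmetric-routing failure is to attempt the NameNode connection from each local interface and see which source addresses ever receive a SYN/ACK back. The sketch below is a hypothetical diagnostic, not part of Hadoop; the class name is an assumption, and the 20000 ms timeout simply mirrors the connect timeout in the log above.

import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.Socket;
import java.util.Collections;

// Hypothetical diagnostic: probe the NameNode RPC port from every local
// IPv4 address and report which source interfaces complete the handshake.
public class SourceInterfaceProbe {
    public static void main(String[] args) throws Exception {
        InetSocketAddress namenode = new InetSocketAddress("hdw1.phd.local", 8020);
        for (NetworkInterface nic : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                if (addr.isLoopbackAddress() || addr instanceof Inet6Address) {
                    continue;
                }
                try (Socket s = new Socket()) {
                    s.bind(new InetSocketAddress(addr, 0)); // fix the source interface
                    s.connect(namenode, 20000);             // same 20 s timeout as the log
                    System.out.println(addr.getHostAddress() + " -> connected");
                } catch (Exception e) {
                    System.out.println(addr.getHostAddress() + " -> " + e);
                }
            }
        }
    }
}

Only source addresses that the NameNode can route replies back to will complete the handshake; the rest will reproduce the ConnectTimeoutException shown above.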