Hadoop HA: active NameNode outage does not automatically fail over to the standby NameNode

Problem encountered

Recently I set up HA on the company's Hadoop experimental cluster and found that if I directly kill the active NameNode process, failover to the standby NameNode happens automatically. However, if the active NameNode host goes down completely (init 0), the standby NameNode is never promoted automatically.

Checking the ZKFC log on the standby node, I found that it keeps trying to reach the original active NameNode host over SSH, but that host is down and cannot be reached, so the fencing attempt fails and the same error is logged in a loop. Isn't Hadoop's HA designed exactly for this scenario?
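For reference, the two test cases looked roughly like this when run on the active NameNode host; this is only a sketch, the exact commands are my own reconstruction and the host name shell04 comes from the log below:

# Case 1: kill only the NameNode process -> standby takes over automatically
kill -9 $(jps | grep -w NameNode | awk '{print $1}')

# Case 2: shut down the whole host -> standby is never promoted
init 0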

The error log is as follows
2018-12-03 19:11:47,484 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
2018-12-03 19:11:47,484 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2018-12-03 19:11:47,484 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to shell04...
2018-12-03 19:11:47,484 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to shell04 port 22
2018-12-03 19:11:50,488 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to connect to shell04 as user root
com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route to host
        at com.jcraft.jsch.Util.createSocket(Util.java:394)
        at com.jcraft.jsch.Session.connect(Session.java:215)
        at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
        at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
        at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:532)
        at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:921)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:820)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.NoRouteToHostException: No route to host
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at java.net.Socket.connect(Socket.java:528)
        at java.net.Socket.<init>(Socket.java:425)
        at java.net.Socket.<init>(Socket.java:208)
        at com.jcraft.jsch.Util$1.run(Util.java:362)
        at java.lang.Thread.run(Thread.java:745)
2018-12-03 19:11:50,490 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2018-12-03 19:11:50,490 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
2018-12-03 19:11:50,490 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
java.lang.RuntimeException: Unable to fence NameNode at shell04/192.168.254.143:9000
        at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
        at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
        at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
        at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:921)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:820)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

Solution

Since the root cause of this problem is the sshfence fencing method, I tried using a shell fence as well, so I added shell(/bin/true) to dfs.ha.fencing.methods (at first I did not know it could coexist with sshfence):

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>
    sshfence
    shell(/bin/true)
  </value>
</property>
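For completeness, the sshfence method also needs a private key that the ZKFC can use to log in to the other NameNode host; a minimal sketch of that property is below (the key path is an assumption based on the log showing a connection as user root; adjust it to your environment):

<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/root/.ssh/id_rsa</value>
</property>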

/bin/true is used because the shell method does not need to actually kill the NameNode: if the old active host is reachable, sshfence has already fenced it, and if it is unreachable, the shell method simply returns success so that fencing is considered complete and failover can proceed. After testing, the ZKFC log on the standby NameNode shows that when the active host is shut down with init 0, sshfence is still attempted first and fails, then the shell method runs and succeeds, which resolves the problem.
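To double-check the behaviour after the change, one simple way is to query the NameNode states before and after powering off the active host; the service IDs nn1 and nn2 below are hypothetical, use the ones defined in your own dfs.ha.namenodes.* setting:

hdfs haadmin -getServiceState nn1   # e.g. reports "active" before the test
hdfs haadmin -getServiceState nn2   # should report "active" shortly after nn1's host is shut down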
