VPXD crash caused by VCHA node running GetNetworkIfaceInfoPeer function

Article ID: 389466

Products

VMware vCenter Server

Issue/Introduction

- vCenter HA (VCHA) failovers occur repeatedly, starting more than 90 days after the VCHA configuration was deployed.

- In /var/log/vmware/vcha/vcha.log on the previously active node, the failover is shown as initiated by vMon:

[DATE/TIME] info vcha[03167] [Originator@6876 sub=Agent] Triggered vMon initiated failover
[DATE/TIME] info vcha[02995] [Originator@6876 sub=Agent] Processing event kVmonFailover
[DATE/TIME] info vcha[02995] [Originator@6876 sub=Agent] vMon initiated failover

- In the vmon log under /var/log/vmware/vmon, the vpxd service is shown crashing on the previously active node:

[DATE/TIME] Wa(03) host-2528 <vpxd> Service exited. Exit code 1
[DATE/TIME] Wa(03) host-2528 <vpxd> Service exited unexpectedly. Crash count 0. Taking configured recovery action.
[DATE/TIME] In(05) host-2528 SOCKET creating new socket, connecting to /storage/vmware-vmon/vchalistener
[DATE/TIME] In(05) host-2528 <vpxd> Initiated VCHA failover for service.

- The /var/core directory on the active/passive nodes contains vpxd core dumps.

 (There may be many dump files if failover has occurred many times.)

-rw-rw-r-- 1 [USER] [GROUP] 824M [DATE/TIME] core.vpxd-worker.108474
-rw-rw-r-- 1 [USER] [GROUP] 918M [DATE/TIME] core.vpxd-worker.114953
-rw-rw-r-- 1 [USER] [GROUP] 1.3G [DATE/TIME] core.vpxd-worker.98069

- The backtrace of the core dump is similar to the following.

 Note the 'GetNetworkIfaceInfoPeer' function, as in frame #9:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1  0x00007f88fba30546 in __GI_abort () at abort.c:79
#2  0x00007f8901d062c2 in Vmacore::System::SignalTerminateHandler (info=0x7f88f9f535b0, ctx=0x7f88f9f53480) at bora/vim/lib/vmacore/posix/defSigHandlers.cpp:62
#3  <signal handler called>
#4  std::char_traits<char>::copy (__n=37, __s2=0x90 <error: Cannot access memory at address 0x90>, __s1=0x7f882c02dec0 "}o\222ԏ\177")
    at external/cayman_esx_toolchain_gcc12/usr/bin/../lib/gcc/x86_64-vmk-linux-gnu/12.1.0/../../../../x86_64-vmk-linux-gnu/include/c++/12.1.0/bits/char_traits.h:431
#5  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy (__d=0x7f882c02dec0 "}o\222ԏ\177", __s=0x90 <error: Cannot access memory at address 0x90>, __n=37)
    at external/cayman_esx_toolchain_gcc12/usr/bin/../lib/gcc/x86_64-vmk-linux-gnu/12.1.0/../../../../x86_64-vmk-linux-gnu/include/c++/12.1.0/bits/basic_string.h:423
#6  0x000056553a43103e in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign (this=0x7f88f9f54038, __str=<error: Cannot access memory at address 0x90>)
    at external/cayman_esx_toolchain_gcc12/usr/bin/../lib/gcc/x86_64-vmk-linux-gnu/12.1.0/../../../../x86_64-vmk-linux-gnu/include/c++/12.1.0/bits/basic_string.h:234
#7  0x000056553b519e37 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign (__str=<error: Cannot access memory at address 0x90>, this=0x7f88f9f54038)
    at external/cayman_esx_toolchain_gcc12/usr/bin/../lib/gcc/x86_64-vmk-linux-gnu/12.1.0/../../../../x86_64-vmk-linux-gnu/include/c++/12.1.0/bits/basic_string.h:1571
#8  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator= (__str=<error: Cannot access memory at address 0x90>, this=0x7f88f9f54038)
    at external/cayman_esx_toolchain_gcc12/usr/bin/../lib/gcc/x86_64-vmk-linux-gnu/12.1.0/../../../../x86_64-vmk-linux-gnu/include/c++/12.1.0/bits/basic_string.h:805
#9  Vpxd::Vcha::GetNetworkIfaceInfoPeer (peerIp="[WITNESS_IP]", ifName="eth0", nicInfo=...) at bora/vpx/vpxd/vcha/utils.cpp:448
#10 0x000056553b511e30 in Vpxd::Vcha::PopulatePlacement (info=std::shared_ptr<Com::Vmware::Vcenter::Vcha::ClusterSvc::Info> (use count 1, weak count 0) = {...}, configInfo=<optimized out>,
    vcAccess=0x7f882c2c7920, partial=...)
    at external/cayman_esx_toolchain_gcc12/usr/bin/../lib/gcc/x86_64-vmk-linux-gnu/12.1.0/../../../../x86_64-vmk-linux-gnu/include/c++/12.1.0/bits/new_allocator.h:80
#11 Vpxd::Vcha::FailoverClusterOperator::GetClusterInfo (this=<optimized out>, vcSpec=..., partial=...,
    activation=std::shared_ptr<Vapi::Core::AsyncActivation> (use count 3, weak count 0) = {...},
    resultInfo=std::shared_ptr<Com::Vmware::Vcenter::Vcha::ClusterSvc::Info> (use count 1, weak count 0) = {...}) at bora/vpx/vpxd/vcha/failoverClusterOperator.cpp:2277
#12 0x000056553b4f2117 in Vpxd::Vcha::ClusterSvc::AsyncClusterImpl::Get (this=<optimized out>, vcSpec=..., partial=...,
    activation_=std::shared_ptr<Vapi::Core::AsyncActivation> (use count 3, weak count 0) = {...}, resultCb_=...) at bora/vpx/vpxd/vcha/AsyncClusterImpl.cpp:159

- In /etc/vmware-vcha/vcha.cfg, the <uuid> element of the <witness> section contains stray text such as 'Changing password for root.':

    <vcha>
      ...
        <witness>
         <ip>[WITNESS_IP]</ip>
         <uuid>Changing password for root.
         ########-####-####-####-##########</uuid>
       </witness>
       <witnessIP>[WITNESS_IP]</witnessIP>

- Connecting to the witness node over SSH with the private key file shows an expired root password message and 'Changing password for root.' instead of logging in directly.

 (Replace [WITNESS_IP] with the actual IP address of the witness node when running this test.)

# ssh vcha@[WITNESS_IP] -i /home/vcha/.ssh/id_rsa

FIPS mode initialized

VMware vCenter Server 8.0.1.00000

Type: vCenter Server with an embedded Platform Services Controller

Last login: 
sudo: Account or password is expired, reset your password and try again
Changing password for root.
Current password:

 

Environment

vCenter 8.x

Cause

VCHA replicates the entire /etc directory from the active node to the passive node.

Because /etc/shadow is part of that replication, any change to the root password, including changes to the password aging values (minimum age, maximum age, warning period, and so on), is copied to the passive node. As a result, the root account on the passive node expires only if it has also expired on the active node.

The witness node, by contrast, does not sync with the active node, so it never receives the updated /etc/shadow file.

Even when the root password is updated on the active node, it is not updated on the witness node, and the root password on the witness node eventually expires. From that point on, logins to the witness node return the password-change prompt shown in the Issue section instead of the expected output.
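
To see how the aging fields in /etc/shadow translate into an expiry, the following sketch computes the days remaining from a single shadow entry. It assumes the standard shadow(5) layout, where the third field is the day of the last password change (in days since 1970-01-01) and the fifth field is the maximum password age in days. The function name days_until_expiry is hypothetical, and treating 99999 as "never" is a convention assumed here rather than something vCenter defines.

```shell
# Hypothetical helper: days_until_expiry SHADOW_LINE TODAY
#   SHADOW_LINE  one entry in shadow(5) format (name:hash:lastchg:min:max:...)
#   TODAY        current date as days since 1970-01-01
# Prints the number of days left (negative if already expired) or "never".
# Edge cases such as an empty or zero last-change field are not handled.
days_until_expiry() {
    printf '%s\n' "$1" | awk -F: -v today="$2" '{
        # An empty or very large maximum age is treated as "no expiry".
        if ($5 == "" || $5 + 0 >= 99999) { print "never"; exit }
        print $3 + $5 - today
    }'
}
```

As root on a node, it could be called as `days_until_expiry "$(grep '^root:' /etc/shadow)" "$(( $(date +%s) / 86400 ))"`. Comparing the result on the active node with the result on the witness node should show whether the two have diverged.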

Resolution

This issue has been fixed in vCenter Server 8.0 Update 3e.

 

If you cannot upgrade, use the following workaround.

  1. Change the root password on the witness node to a new one. This resolves the vpxd crashes.
    To prevent the password from expiring again, refer to the KB article below and adjust the password aging values on the witness node.
    (VCHA uses a private key for inter-node communication, so changing the root password does not affect VCHA.)

    # sudo chage -I -1 -m 0 -M 99999 -E -1 root
    https://knowledge.broadcom.com/external/article/322247/resetting-root-password-in-vcenter-serve.html

    After updating the witness node root password, verify the login again by running this command from the active node:

    # ssh vcha@[WITNESS_IP] -i /home/vcha/.ssh/id_rsa

  2. Also check the /etc/vmware-vcha/vcha.cfg file:

      <vcha>
        ...
          <witness>
            <ip>[WITNESS_IP]</ip>
            <uuid>Changing password for root.
           ########-####-####-####-##########</uuid>
          </witness>
          <witnessIP>[WITNESS_IP]</witnessIP>

    If the file still looks like this after the witness root password has been changed, correct it as follows.
    Otherwise the witness VM may not be found when you try to destroy VCHA through the vSphere Client (vsphere-ui).

    1) In the vSphere Client, go to the [Configure] tab -> [vCenter HA] -> [VM Settings] tab, which should trigger the update.
        (The update may take a few minutes.)

    2) If the file is not updated automatically, edit /etc/vmware-vcha/vcha.cfg as shown below.
       Before modifying the file, note its owner and permissions and make a backup copy.
       The <uuid></uuid> element must contain only the correct UUID value.
      <vcha>
        ...
          <witness>
            <ip>[WITNESS_IP]</ip>
            <uuid>########-####-####-####-##########</uuid>
          </witness>
          <witnessIP>[WITNESS_IP]</witnessIP>
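
Step 2 can be partly scripted. The sketch below checks whether the witness <uuid> value is a bare UUID and makes a backup that keeps the file's permissions before any manual edit. Both function names are hypothetical. The check assumes the UUID uses the standard 8-4-4-4-12 hexadecimal layout and that the file contains a single <witness> element, as in the excerpt above.

```shell
# Hypothetical helpers (not shipped with vCenter).

# Prints "ok" if the <uuid> inside <witness> is a bare UUID, else "invalid".
check_witness_uuid() {
    # Join lines first so a value that spans several lines is seen whole.
    value=$(tr '\n' ' ' < "$1" \
        | sed -n 's:.*<witness>\(.*\)</witness>.*:\1:p' \
        | sed -n 's:.*<uuid>\(.*\)</uuid>.*:\1:p' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
    if printf '%s\n' "$value" \
        | grep -Eq '^[0-9A-Fa-f]{8}-([0-9A-Fa-f]{4}-){3}[0-9A-Fa-f]{12}$'; then
        echo "ok"
    else
        echo "invalid"
    fi
}

# Shows owner and permissions on stderr, then copies the file with
# cp -p (preserves mode and timestamps, and ownership where permitted).
# Prints the backup path on stdout.
backup_cfg() {
    backup="$1.bak.$(date +%Y%m%d%H%M%S)"
    ls -l "$1" >&2
    cp -p "$1" "$backup" || return 1
    echo "$backup"
}
```

For example, run `check_witness_uuid /etc/vmware-vcha/vcha.cfg` first, and only if it prints "invalid" run `backup_cfg /etc/vmware-vcha/vcha.cfg` before editing the file by hand.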

 

Additional Information

VCHA uses a private key for communication between nodes, so updating the witness root password does not affect VCHA functionality.