VPXD crashes due to race condition of host member runtime in DVS

Article ID: 379923

Products

VMware vCenter Server 7.0
VMware vCenter Server 8.0

Issue/Introduction

  • The VPXD service on vCenter Server crashes shortly after it starts.
  • A core.vpxd-worker core file is generated each time the service crashes.
  • The call stack in the core file may resemble the following trace:

    (gdb) bt
    #0  0x00007f57dfa9c041 in raise () from /lib/libc.so.6
    #1  0x00007f57dfa85536 in abort () from /lib/libc.so.6
    #2  0x00007f57e5a7e3c0 in Vmacore::System::SignalTerminateHandler (info=0x7f56de09b1f0, ctx=0x7f56de09b0c0) at bora/vim/lib/vmacore/posix/defSigHandlers.cpp:62
    #3  <signal handler called>
    #4  0x00005643861db305 in PersistStrTab::Insert (str=<error reading variable: Cannot access memory at address 0x0>, this=0x7f56de09bee8)
        at bora/vim/lib/public/stringTable/StringTable.h:54
    #5  PersistStrTab::Insert (str=..., this=0x7f56de09bee8) at bora/vim/lib/public/stringTable/StringTable.h:55
    #6  Vpxd::VmDvs::InitializeHostMemberStatus (this=this@entry=0x7f56d92196f0, checkState=checkState@entry=0x7f56de09bd10) at bora/vpx/vpxd/vmcheck/vmDvs.cpp:204
    #7  0x00005643861dc475 in Vpxd::VmDvs::VmDvs (this=0x7f56d92196f0, dvs=<optimized out>, checkState=0x7f56de09bd10) at bora/vpx/vpxd/vmcheck/vmDvs.cpp:97
    #8  0x00005643861c5d8d in Vpxd::CompatCheckState::CreateDvs (this=this@entry=0x7f56de09bd10, datacenter=datacenter@entry=0x7f576c01d7e0, uuid=...)
        at bora/vim/lib/public/vmacore/Ref.h:239
    #9  0x00005643861e1b2b in Vpxd::VmHost::VmHost (this=this@entry=0x7f57cc946070, host=host@entry=0x7f56f08c6a70, checkState=checkState@entry=0x7f56de09bd10,
        isXvc=isXvc@entry=false) at bora/vim/lib/public/vmacore/Ref.h:239
    #10 0x00005643861c3e7a in Vpxd::CompatCheckState::CreateHost (this=this@entry=0x7f56de09bd10, host=0x7f56f08c6a70, isXvc=isXvc@entry=false)
        at bora/vpx/vpxd/vmcheck/compatCheckState.cpp:83
    #11 0x00005643861f03b8 in Vpxd::ConstructVmHosts (hosts=..., checkState=checkState@entry=0x7f56de09bd10, isXvc=isXvc@entry=false, hostStates=...)
        at bora/vpx/vpxd/vmcheck/vmTestDriver.cpp:157
    #12 0x00005643861f1911 in Vpxd::FastVmTestDriver (hosts=..., vms=..., setType=setType@entry=HOST_SET_FOR_VMOTION, testOptions=testOptions@entry=0x7f56de09bcd8,
        testFamily=testFamily@entry=Vpxd::VMTESTFAMILY_PROV, opType=opType@entry=Vpxd::VmOperation::relocate, checkState=..., compatible=<optimized out>,
        dasCompatible=<optimized out>) at bora/vpx/vpxd/vmcheck/vmTestDriver.cpp:203
    #13 0x00005643861c6a60 in Vpxd::MoVmCompatChecker::ComputeCompatSetWithDrmReason (vms=..., allHosts=..., type=type@entry=HOST_SET_FOR_VMOTION,
        strictDrsCheck=strictDrsCheck@entry=true, fromHa=fromHa@entry=false, drmReason=drmReason@entry=kUnspecified, result=...)
        at bora/vpx/vpxd/vmcheck/moVmCompatChecker.cpp:603
    #14 0x00005643861c6aec in Vpxd::MoVmCompatChecker::ComputeCompatSet (vms=..., allHosts=..., type=type@entry=HOST_SET_FOR_VMOTION,
        strictDrsCheck=strictDrsCheck@entry=true, fromHa=fromHa@entry=false, result=...) at bora/vpx/vpxd/vmcheck/moVmCompatChecker.cpp:479
    #15 0x00005643864b58cf in DrsDumpWriter::GetVMSnapshot[abi:cxx11](std::vector<Vmacore::Ref<VmMo>, std::allocator<Vmacore::Ref<VmMo> > > const&, std::vector<Vmacore::Ref<HostMo>, std::allocator<Vmacore::Ref<HostMo> > > const&) (vms=..., hosts=...) at bora/vpx/drs/interface/drsDump.cpp:243
    #16 0x00005643864b86b1 in DrsDumpWriter::DumpClusterSnapshot (cluster=<optimized out>) at bora/vpx/drs/interface/drsDump.cpp:318
    #17 0x0000564386411aa0 in operator() (__closure=0x7f56f124e970) at bora/vim/lib/public/vmacore/Ref.h:239
    #18 std::__invoke_impl<void, CdrsLoadBalancer::DoAsynchronousClusterDump()::<lambda()>&> (__f=...)
        at external/cayman_esx_toolchain_gcc12/usr/bin/../lib/gcc/x86_64-vmk-linux-gnu/12.1.0/../../../../x86_64-vmk-linux-gnu/include/c++/12.1.0/bits/invoke.h:61
    #19 std::__invoke_r<void, CdrsLoadBalancer::DoAsynchronousClusterDump()::<lambda()>&> (__fn=...)
        at external/cayman_esx_toolchain_gcc12/usr/bin/../lib/gcc/x86_64-vmk-linux-gnu/12.1.0/../../../../x86_64-vmk-linux-gnu/include/c++/12.1.0/bits/invoke.h:111
    #20 std::_Function_handler<void(), CdrsLoadBalancer::DoAsynchronousClusterDump()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
        at external/cayman_esx_toolchain_gcc12/usr/bin/../lib/gcc/x86_64-vmk-linux-gnu/12.1.0/../../../../x86_64-vmk-linux-gnu/include/c++/12.1.0/bits/std_function.h:290
    #21 0x000056438655287d in VpxUtil_InvokeWithOpId (opID=..., funcName=funcName@entry=0x56438468db0a "ClusterSnapshot", functor=...)
        at bora/vpx/common/vpxAppsUtil.cpp:395
    #22 0x000056438655293c in VpxUtil_InvokeWrapper (opID=..., funcName=0x56438468db0a "ClusterSnapshot", functor=..., doNotCatchVmacoreException=<optimized out>)
        at bora/vpx/common/vpxAppsUtil.cpp:420
    #23 0x00007f57e5918be6 in Vmacore::System::ThreadPoolFair::InvokeItem(std::function<void ()>&) const (this=<optimized out>, item=...)
        at bora/vim/lib/vmacore/asio/ThreadPoolFair.cpp:641
    #24 0x00007f57e591e4f9 in Vmacore::System::ThreadPoolFair::RunWorkerThread (this=0x564388247780) at bora/vim/lib/vmacore/asio/ThreadPoolFair.cpp:1298
    #25 0x00007f57e5ab8093 in std::function<void ()>::operator()() const (this=0x7f5728808728)
        at external/cayman_esx_toolchain_gcc12/usr/bin/../lib/gcc/x86_64-vmk-linux-gnu/12.1.0/../../../../x86_64-vmk-linux-gnu/include/c++/12.1.0/bits/std_function.h:591
    #26 Vmacore::System::ThreadPosix::ThreadBegin (data=0x7f5728808720) at bora/vim/lib/vmacore/posix/thread.cpp:122
    #27 0x00007f57dfc2deae in start_thread () from /lib/libpthread.so.0
    #28 0x00007f57dfb5ce2f in clone () from /lib/libc.so.6
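
To confirm the core file symptom, the files can be listed from a shell on the appliance. The following is a minimal sketch only: the function name list_vpxd_cores is hypothetical, and the default directory /var/core is an assumption that may need adjusting for your environment.

```shell
# List VPXD worker core files, newest first.
# The default directory /var/core is an assumption; pass another path if needed.
list_vpxd_cores() {
    core_dir="${1:-/var/core}"
    # The glob follows the core.vpxd-worker file name described above.
    # Errors are suppressed so that "no core files" simply prints nothing.
    ls -1t "$core_dir"/core.vpxd-worker* 2>/dev/null
}

# Example invocation (hypothetical):
# list_vpxd_cores /var/core
```

If a new file appears after every VPXD start attempt, that matches the behavior described in this article.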

Environment

VMware vCenter Server 7.0.x
VMware vCenter Server 8.0.x

Cause

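When multiple ESXi hosts join a new Distributed Virtual Switch (DVS), a race condition in the vCenter host sync handler causes the new DVS host member runtime status to fail to be saved for ESXi hosts. The crash in Vpxd::VmDvs::InitializeHostMemberStatus shown in the trace above is consistent with a member status that was never set.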

Resolution

This is a known issue that is fixed in vCenter Server 8.0 U3b.

To work around the issue:

  1. SSH to vCenter Server.
  2. Disconnect all hosts from vCenter by updating the vPostgres database:

    /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres -c "UPDATE vpx_host SET enabled = 0"

  3. Start the VPXD service:

    service-control --start vmware-vpxd

  4. Find which ESXi host has an invalid member status in the DVS runtime:
    • Open the following URL in a browser:

      https://<vc-fqdn-or-ip>/mob/?moid=<dvs-moid>&doPath=runtime.hostMemberRuntime

    • Check which ESXi host has 'status' = 'Unset'.

  5. Disable DRS on the cluster that contains the ESXi host.
  6. Manually connect the ESXi host back to vCenter Server. See "Changing an ESXi host's connection status in vCenter Server".
  7. Repeat step 6 for the other ESXi hosts.
  8. Enable DRS again.
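
When several distributed switches need to be checked in step 4, it can help to generate the MOB URL for each one instead of editing it by hand. The following is a minimal sketch: the function name dvs_runtime_url and the example values are hypothetical, and only the URL pattern comes from step 4.

```shell
# Build the MOB URL from step 4 for a given vCenter address and DVS managed object ID.
# Both arguments are placeholders that the operator supplies.
dvs_runtime_url() {
    vc_host="$1"
    dvs_moid="$2"
    if [ -z "$vc_host" ] || [ -z "$dvs_moid" ]; then
        echo "usage: dvs_runtime_url <vc-fqdn-or-ip> <dvs-moid>" >&2
        return 1
    fi
    printf 'https://%s/mob/?moid=%s&doPath=runtime.hostMemberRuntime\n' "$vc_host" "$dvs_moid"
}

# Example with hypothetical values:
# dvs_runtime_url vcenter.example.com dvs-21
```

Open the printed URL in a browser and look for hosts whose 'status' is 'Unset', as described in step 4.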