Multiple VMs crash or reboot around the time witness node disconnects from data nodes in vSAN Cluster
search cancel

Multiple VMs crash or reboot around the time witness node disconnects from data nodes in vSAN Cluster

book

Article ID: 414080

calendar_today

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms: 

  • Its vSAN stretched cluster running on 8.0 U3.

  • Multiple VMs crash or reboot themselves on frequent basis.

  • Witness node frequently disconnects from data nodes especially around time when VMs crash or reboot.

  • VMs that crash or reboot (at times) do report following error in vSphere Client few minutes prior to crash/reboot :

    "Configuration file cannot be found"

  • Most of crashed Windows VMs report BSOD and following error observed in vmware.log of crashed virtual machine:

    YYYY-MM-DDTHH:MM:SS.XXXZ Wa(03) vcpu-5 - WinBSOD: Synthetic MSR[0x40000100] 0x7a
    YYYY-MM-DDTHH:MM:SS.XXXZ Wa(03)+ vcpu-5 -
    YYYY-MM-DDTHH:MM:SS.XXXZ Wa(03) vcpu-5 - WinBSOD: Synthetic MSR[0x40000101] 0xffffe44c84ee8878
    YYYY-MM-DDTHH:MM:SS.XXXZ Wa(03)+ vcpu-5 -
    YYYY-MM-DDTHH:MM:SS.XXXZ Wa(03) vcpu-5 - WinBSOD: Synthetic MSR[0x40000102] 0xffffffffc0000185
    YYYY-MM-DDTHH:MM:SS.XXXZ Wa(03)+ vcpu-5 -
    YYYY-MM-DDTHH:MM:SS.XXXZ Wa(03) vcpu-5 - WinBSOD: Synthetic MSR[0x40000103] 0x20009f6cbbe0
    YYYY-MM-DDTHH:MM:SS.XXXZ Wa(03)+ vcpu-5 -
    YYYY-MM-DDTHH:MM:SS.XXXZ Wa(03) vcpu-5 - WinBSOD: Synthetic MSR[0x40000104] 0xffff9909dd10f000

  • VM namespace object fails to open and throws 'Input/output error':

    (from /var/run/log/hostd.log):

    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2102143]: [Originator@6876 sub=Libs] DictionaryLoad: Cannot open file "/vmfs/volumes/vsan:################-################/########-####-####-####-#############/################.vmx": Input/output error.

    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2111502]: [Originator@6876 sub=Libs] OBJLIB-LIB: ObjLib_GetNameSpaceObjectUniqueIdFromPath : failed to obtain canonical path for "/vmfs/volumes/vsan:################-################/########-####-####-####-#############/################.vmx": Input/output error. (5)

  • VM namespace object shows up as not found and reports heartbeat timeout:

    (from /var/run/log/hostd.log)

    YYYY-MM-DDTHH:MM:SS.XXXZ Wa(164) Hostd[2102141]: [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore '########-####-####-####-#############'

    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2102141]: [Originator@6876 sub=Hostsvc.VmkVprobSource] VmkVprobSource::Post event: (vim.event.EventEx) {
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->    key = 127,
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->    chainId = 0,
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->    createdTime = "1970-01-01T00:00:00Z",
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->    userName = "",
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->    host = (vim.event.HostEventArgument) {
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->       name = "ESXi-host.domain.com",
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->       host = 'vim.HostSystem:ha-host'
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->    },
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->    eventTypeId = "esx.problem.vmfs.heartbeat.timedout",
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->    arguments = (vmodl.KeyAnyValue) [
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->       (vmodl.KeyAnyValue) {
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->          key = "1",
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->          value = "########-####-####-####-#############"
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->       },
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->       (vmodl.KeyAnyValue) {
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->          key = "2",
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->          value = "########-####-####-####-#############'"
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->       }
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->    ],
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->    objectId = "ha-host",
    YYYY-MM-DDTHH:MM:SS.XXXZ In(166) Hostd[2101673]: -->    objectType = "vim.HostSystem",

  • VM name space object reports need for repair as one of data component is absent or degraded.

    (from /var/run/log/clomd.log)

    YYYY-MM-DDTHH:MM:SS.XXXZ No(29) clomd[12612580]: [Originator@6876] CLOM_AddToQueuedObjectList: Queueing workitem for ########-####-####-####-############# type: REPAIR, priority: 19 with delay 0
    YYYY-MM-DDTHH:MM:SS.XXXZ No(29) clomd[12612580]: [Originator@6876] CLOMCleanup_ConfigNeedsCleanup: Object ########-####-####-####-############# needsCleanup: 0, needsDeltaCleanup: 0, needsRepairCleanup: 0, needsConsolidateCleanup: 0, needsTransientCleanup: 0, needsRefinalize: 0, inCrawler: 0
    YYYY-MM-DDTHH:MM:SS.XXXZ No(29) clomd[12612580]: [Originator@6876] CLOM_PostWorkItem: Posted a work item opID:1804388030 for ########-####-####-####-############# group: 00000000-0000-0000-0000-000000000000 Type: REPAIR delay 0 (Success)
    YYYY-MM-DDTHH:MM:SS.XXXZ No(29) clomd[12612580]: [Originator@6876 opID=1804388030] CLOMReconfigure: Reconfiguring ########-####-####-####-############# workItem type REPAIR
    YYYY-MM-DDTHH:MM:SS.XXXZ Er(27) clomd[12612580]: [Originator@6876 opID=1804388030] CLOMReplacementPreWorkRepair: Repair needed. 1 absent/degraded data components for ########-####-####-####-#############


  • VM Namespace UUID reports heartbeat timeout:

    (from /var/run/log/vobd.log)

    YYYY-MM-DDTHH:MM:SS.XXXZ In(14) vobd[2097811]: [vmfsCorrelator] 10682910301000us: [vob.vmfs.heartbeat.timedout] ########-####-####-####-############# ########-####-####-####-#############
    YYYY-MM-DDTHH:MM:SS.XXXZ In(14) vobd[2097811]: [vmfsCorrelator] 10683827650869us: [esx.problem.vmfs.heartbeat.timedout] ########-####-####-####-############# ########-####-####-####-#############

Environment

VMware vSAN 8.x

Cause

  • Network connectivity issue of Witness node with Data nodes where it fails to respond within stipulated time of 500 ms.

  • vSAN Native traces report Witness node response received ~90 seconds which ideally should be <500 ms:

YYYY-MM-DDTHH:MM:SS.XXXZ [536236267] [cpu0] [4549e681bf08] RDTTraceSlowMessageTx:5183: {'newTxState': 'Acked', 'oldTxState': 'Sent', 'type': 'Response', 'opId': 0x4ae09942, 'requestId': 2, 'lastTimeMS': 89680, 'totalMS': 0, 'bytesAtEnd': 3776, 'bytes': 132, 'assoc': 0x4316c7b0b880, 'conn': 0x4316c7966500}
YYYY-MM-DDTHH:MM:SS.XXXZ [536236270] [cpu0] [4549e6825c08] RDTTraceSlowMessageTx:5183: {'newTxState': 'Acked', 'oldTxState': 'Sent', 'type': 'Response', 'opId': 0x4ae09948, 'requestId': 2, 'lastTimeMS': 89675, 'totalMS': 0, 'bytesAtEnd': 3888, 'bytes': 112, 'assoc': 0x4316c7e9a9c0, 'conn': 0x4316c7966500}
YYYY-MM-DDTHH:MM:SS.XXXZ [536236273] [cpu0] [4549e6803a08] RDTTraceSlowMessageTx:5183: {'newTxState': 'Acked', 'oldTxState': 'Sent', 'type': 'Response', 'opId': 0x4ae098fb, 'requestId': 2, 'lastTimeMS': 89671, 'totalMS': 0, 'bytesAtEnd': 4020, 'bytes': 132, 'assoc': 0x4316c83c8e00, 'conn': 0x4316c7966500}

Resolution

Investigate network disconnection between data nodes and Witness node to understand why network drops are observed between them and why Witness node takes longer than expected to respond.