vSAN node may experience cluster partition and/or data unable to resync due to corruption in transit of CMMDS and/or RDT data packets.

Article ID: 414383

Products

VMware vSAN

Issue/Introduction

vSAN relies on a stable network to carry object-data I/O and CMMDS heartbeat/membership packets between nodes.

If data is impaired in transit on the inter-node network (e.g. it reaches the destination node with content/state that differs from what the source node transmitted), vSAN is unable to use this data. This in turn impacts data state (e.g. data being unable to resync) and cluster membership (e.g. cluster partition).

Environment

vSAN 8.x
vSAN 9.x

Cause

This issue can be caused by a fault on any network component that results in data being transmitted unfaithfully between the point where it is offloaded from vmkernel on the source node and the point where it is received on the destination node.

vSAN has detection mechanisms in place that identify impaired data (e.g. when the checksum calculated on the destination does not match the checksum calculated on the source) and reject it.
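
The general idea of checksum-based rejection can be illustrated with a minimal sketch. This is not vSAN's actual algorithm or packet format: CRC32 from Python's standard `zlib` module stands in for whatever checksum RDT and CMMDS use, and the function names `make_packet` and `accept_packet` are hypothetical.

```python
import zlib


def make_packet(payload: bytes) -> tuple[bytes, int]:
    """Sender side: attach a CRC32 computed over the payload."""
    return payload, zlib.crc32(payload)


def accept_packet(payload: bytes, sent_checksum: int) -> bool:
    """Receiver side: recompute the checksum and reject on mismatch."""
    return zlib.crc32(payload) == sent_checksum


payload, checksum = make_packet(b"object-data IO")

# Simulate impairment in transit by flipping one bit of the first byte.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]

print(accept_packet(payload, checksum))    # intact payload is accepted
print(accept_packet(corrupted, checksum))  # impaired payload is rejected
```

The receiver never needs to know what went wrong on the wire; a mismatch between the transmitted and recomputed values is enough to discard the data.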

vmkernel.log may contain messages such as the following, indicating that data has been impaired in transit on the vSAN network:

(where '0xXXXXXXXXXXXX(0xYYYYYYYYYYYY)' indicates that different checksum values were calculated on source vs destination)

vmkwarning: cpuXX:XXXXXXXX)WARNING: RDTTCPConn: RDTTCPRxParseHeader:XXXXX: 0xXXXXXXXXXXXX(0xYYYYYYYYYYYY): RDT checksum does not match (mode 1 type 0) - corrupted header. ctr: 45

vmkernel: cpuXX:XXXXXXXX)CMMDS: CMMDSAgentlikeRxLeaderBatchUpdate:XXXX: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX: Receive update failed with error VMK_CHECKSUM_MISMATCH
vmkernel: cpuXX:XXXXXXXX)CMMDS: CMMDSStateDestroyNode:708: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX: Destroying node XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX: Entry content checksum mismatch
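
When reviewing a copy of vmkernel.log, a small script can help tally these messages. The following is a sketch only: the search strings are taken from the sample messages above, the category names and the `count_checksum_events` helper are hypothetical, and the exact message wording may differ between builds.

```python
import re
from collections import Counter

# Substrings copied from the sample messages; category names are made up.
PATTERNS = {
    "rdt_checksum": re.compile(r"RDT checksum does not match"),
    "cmmds_rx_checksum": re.compile(r"VMK_CHECKSUM_MISMATCH"),
    "cmmds_entry_checksum": re.compile(r"Entry content checksum mismatch"),
}


def count_checksum_events(lines):
    """Count log lines that match each checksum-related pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Placeholder lines modelled on the samples, plus one unrelated line.
SAMPLE_LINES = [
    "vmkwarning: cpuXX:XXXXXXXX)WARNING: RDTTCPConn: RDTTCPRxParseHeader:XXXXX: "
    "0xXXXXXXXXXXXX(0xYYYYYYYYYYYY): RDT checksum does not match "
    "(mode 1 type 0) - corrupted header. ctr: 45",
    "vmkernel: cpuXX:XXXXXXXX)CMMDS: CMMDSAgentlikeRxLeaderBatchUpdate:XXXX: "
    "Receive update failed with error VMK_CHECKSUM_MISMATCH",
    "vmkernel: cpuXX:XXXXXXXX)CMMDS: CMMDSStateDestroyNode:708: "
    "Destroying node: Entry content checksum mismatch",
    "unrelated line",
]

for name, count in sorted(count_checksum_events(SAMPLE_LINES).items()):
    print(name, count)
```

To use it on a real log, pass an open file object (for example a copy of vmkernel.log from each node) to `count_checksum_events` instead of `SAMPLE_LINES`.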

Resolution

The source of the data corruption in transit on the network needs to be identified and addressed. Start by narrowing down the scope of the issue. If data is being impaired from/to all nodes, then a switch or some other network infrastructure component used by all nodes is the likely source. If impairment is only being noted for data to/from a single node, then the network components connected to only that node (e.g. network card/cable/switchport) are implicated.
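
The scoping logic can be sketched as follows. It assumes the affected (source, destination) node pairs have already been worked out by other means, since the sample messages do not name the peer. The host names and the `common_nodes` and `likely_scope` helpers are hypothetical.

```python
def common_nodes(affected_pairs):
    """Return the nodes that appear in every affected (source, destination) pair."""
    node_sets = [set(pair) for pair in affected_pairs]
    return set.intersection(*node_sets) if node_sets else set()


def likely_scope(affected_pairs):
    """Suggest where to look first, based on which node pairs see impairment."""
    if not affected_pairs:
        return "no impairment observed"
    common = common_nodes(affected_pairs)
    if len(common) == 1:
        # Every affected pair involves the same node.
        return "single node: " + next(iter(common))
    if common:
        # Only one pair observed; either end could be at fault.
        return "inconclusive: " + ", ".join(sorted(common))
    # No node is common to all pairs, so a shared component is more likely.
    return "shared infrastructure"


print(likely_scope([("esx01", "esx02"), ("esx03", "esx01"), ("esx01", "esx04")]))
print(likely_scope([("esx01", "esx02"), ("esx03", "esx04")]))
```

The first call points at `esx01` because it is the only node present in every pair; the second has no common node, which is consistent with a fault on a component used by all nodes.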

If the issue is occurring from/to just a single node, it may be narrowed down further, for example to a hardware/other issue on the network card. Validate whether the issue occurs when only a single vmnic from that card is actively used, testing each vmnic in turn (this assumes the vmnics are connected to different switches as per configuration best practices). If the issue occurs on both/all vmnics, then the network card is likely problematic.
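
The outcome of the per-vmnic test can be interpreted with a small helper such as the sketch below. The input maps each vmnic of the card to whether the issue was observed while only that vmnic was active. The function name, vmnic names and result strings are hypothetical. The "some but not all" branch is an inference from the scoping guidance above, not something this article states explicitly.

```python
def interpret_vmnic_tests(results: dict[str, bool]) -> str:
    """Interpret single-active-vmnic test results for one network card."""
    if len(results) < 2:
        return "inconclusive: test each vmnic of the card on its own"
    failing = sorted(name for name, seen in results.items() if seen)
    if len(failing) == len(results):
        # Issue follows the card regardless of which uplink path is used.
        return "network card likely problematic"
    if failing:
        # Issue follows specific uplinks (assumed: cable, switchport or switch).
        return "check the path used by: " + ", ".join(failing)
    return "issue not reproduced on this card"


print(interpret_vmnic_tests({"vmnic0": True, "vmnic1": True}))
print(interpret_vmnic_tests({"vmnic0": True, "vmnic1": False}))
```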