ESXi hosts intermittently disconnect from vCenter with "SSL3_GET_RECORD : decryption failed or bad record mac" error in logs

Article ID: 430250

Products

VMware vSphere ESXi

Issue/Introduction

ESXi hosts intermittently change state to NO_RESPONSE in vCenter Server because of missed heartbeats. Log analysis reveals a cryptographic integrity failure on the SSL/TLS connection.

  • ESXi hosts repeatedly disconnect and reconnect.

  • vpxd.log shows info messages like:

    <timedatestamp> info vpxd[#####] [Originator@#### sub=InvtHostCnx opID=HeartbeatStartHandler-########] Missed 11 heartbeats for host [vim.HostSystem:host-######, ###.###.###]

  • vpxd.log shows warning messages like:

    <timedatestamp> warning rhttpproxy [#######] [Originator@#### sub=IO.Connection] Failed to read buffer from stream; SSL(<io_obj p:0x000000##########, h:21, <TCP '##.##.##.## : 443'>, <TCP '##.##.##.## : 40116'>>) e: 104(Connection reset by peer), async: true, duration: 1059msec
    <timedatestamp> warning rhttpproxy [#######] [Originator@#### sub=Proxy Req #####] Error reading from client while waiting for header: N7Vmacore15SystemExceptionE (Connection reset by peer: The connection is terminated by the remote end with a reset packet. Usually, this is a sign of a network problem, timeout, or service overload.)

  • vpxd.log shows error messages like:

    <timedatestamp> error vpxd[#####] [Originator@#### sub=IO.Http opID=TaskLoop-host-######] User agent failed to send request; SSL(<io_obj p:0x0000########, h:-1, <TCP '##.#.#.## : 50150'>, <TCP '##.##.#.## : 443'>>), N7Vmacore3Ssl12SSLExceptionE(SSL Exception: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac)

Environment

VMware Cloud Foundation (VCF)

vCenter Server 7.x / 8.x

ESXi 7.x / 8.x

Cause

Network packet corruption is occurring in the transit path between vCenter and the ESXi hosts.

  • The transit path consists of a virtual part (the ESXi networking stack) and a physical part (everything external to the vmnics).
  • To find the root cause, each part must be investigated.
  • Broadcom VCF Support can assist in investigating the virtual part.
  • The physical part is the responsibility of the team(s) managing the physical infrastructure external to the vmnics.

The bad record mac error indicates that the SSL/TLS payload was altered or truncated after the sender calculated the Message Authentication Code (MAC), so the receiver rejects the record because its integrity check fails.

  • In this error message, MAC stands for "message authentication code", not "media access control" address.
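
The integrity check described above can be illustrated with a minimal sketch. It assumes HMAC-SHA256 as a stand-in for the record MAC and uses the openssl command line tool on a workstation; the key, the payload text, and the function names compute_mac and verify_record are hypothetical. Real TLS derives its keys during the handshake and may use a different MAC or an AEAD cipher, so this only shows the principle.

```shell
#!/bin/sh
# Illustration only: HMAC-SHA256 stands in for the TLS record MAC.

# compute_mac KEY PAYLOAD: prints the hex HMAC-SHA256 of PAYLOAD.
compute_mac() {
    printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" | awk '{print $NF}'
}

# verify_record KEY PAYLOAD MAC: recomputes the MAC over the received
# payload and compares it with the MAC that the sender attached.
verify_record() {
    expected=$(compute_mac "$1" "$2")
    if [ -n "$expected" ] && [ "$expected" = "$3" ]; then
        echo "ok"
    else
        echo "bad record mac"
    fi
}

key="hypothetical-session-key"      # assumption: shared by both peers
sent="heartbeat from host"          # assumption: example payload
mac=$(compute_mac "$key" "$sent")   # sender calculates the MAC

verify_record "$key" "$sent" "$mac"                 # payload arrives intact
verify_record "$key" "heartbeat from hosT" "$mac"   # one byte changed in transit
```

The first call prints ok and the second prints bad record mac. A single altered byte is enough for the receiver to reject the record, which is why this error points to corruption in transit rather than to a certificate or configuration problem.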

Resolution

To investigate the virtual path, determine whether packets are altered between the point where they enter the host at the vmnic and the point where they are delivered to the vCenter guest VM.

Refer to KB 341568 Packet capture on ESXi using the pktcap-uw tool for instructions on how to perform packet captures:

  1. Locate the ESXi host on which the vCenter server in question is running.

  2. SSH into the ESXi host on which the vCenter server in question is running.

  3. Use the following command to determine the switchport number of the vCenter server:

    net-stats -l | grep <name of the vCenter server>

    where <name of the vCenter server> is the name of the vCenter server in question

  4. Identify the vmnic# that carries the traffic for the vCenter VM. In the network view, find the row for the switchport number of the vCenter server from step 3. Open the view by typing:

    esxtop

    and pressing the [n] key

  5. Determine a suitable VMFS volume and create a folder for the packet captures as described in KB 341568 Packet capture on ESXi using the pktcap-uw tool.

  6. Capture the packets simultaneously in the following ways:

    1. Packet Capture where the packets enter the host at the vmnic

      pktcap-uw --uplink vmnic# --capture UplinkRcvKernel --ip <IP Address of the vCenter Server> -o /vmfs/volumes/<Datastore Name>/<Folder Name>/<Host Name>.vmnic#.UplinkRcvKernel.<IP Address of the vCenter Server>.pcapng

      • vmnic# is the vmnic identified in the esxtop display
      • <IP Address of the vCenter Server> is the IP address of the vCenter Server in question
      • <Datastore Name> is the name of the datastore where this pcap will be saved
      • <Folder Name> is the name of the folder where this pcap will be saved (example: case_###### where ###### is the Broadcom case number)
      • <Host Name> is the name of the ESXi host where the vCenter Server VM in question resides
      • <Host Name>.vmnic#.UplinkRcvKernel.<IP Address of the vCenter Server>.pcapng will be the file name of the resulting pcap

    2. Packet capture where the packets are delivered to the vCenter guest VM  

      pktcap-uw --switchport <Switchport Number> --capture VnicRx -o /vmfs/volumes/<Datastore Name>/<Folder Name>/<Host Name>.switchport<Switchport Number>.VnicRx.pcapng

      • <Switchport Number> is the switchport number from the net-stats -l output
      • <Datastore Name> is the name of the datastore where this pcap will be saved
      • <Folder Name> is the name of the folder where this pcap will be saved (example: case_###### where ###### is the Broadcom case number)
      • <Host Name> is the name of the ESXi host where the vCenter Server VM in question resides
      • <Host Name>.switchport<Switchport Number>.VnicRx.pcapng will be the file name of the resulting pcap
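
The steps above can be sketched as a small helper script. This is a minimal sketch, not a supported tool: the function names find_switchport and print_capture_cmds are hypothetical, the sample net-stats -l output and all names and addresses in it are invented for illustration, and the script assumes the switchport number is the first column of the net-stats -l output. The script only prints the two pktcap-uw commands so they can be reviewed before being run. The trailing & places each capture in the background so both run at the same time.

```shell
#!/bin/sh
# find_switchport NAME: reads `net-stats -l` output on stdin and prints the
# first column (port number) of every row that contains NAME.
find_switchport() {
    awk -v name="$1" 'NR > 1 && index($0, name) { print $1 }'
}

# print_capture_cmds VMNIC SWITCHPORT VC_IP OUTDIR HOST: prints the uplink
# and switchport capture commands from the article with the values filled in.
print_capture_cmds() {
    vmnic=$1 port=$2 ip=$3 dir=$4 host=$5
    echo "pktcap-uw --uplink $vmnic --capture UplinkRcvKernel --ip $ip -o $dir/$host.$vmnic.UplinkRcvKernel.$ip.pcapng &"
    echo "pktcap-uw --switchport $port --capture VnicRx -o $dir/$host.switchport$port.VnicRx.pcapng &"
}

# Hypothetical sample of `net-stats -l` output, for illustration only.
# On the host, use instead:  net-stats -l | find_switchport "<name>"
sample='PortNum          Type SubType SwitchName       MACAddress         ClientName
2214592527          4       0 DvsPortset-0     00:50:56:00:00:01  vmnic0
67108885            5       9 DvsPortset-0     00:50:56:00:00:02  vcsa01.eth0'

port=$(printf '%s\n' "$sample" | find_switchport "vcsa01")
print_capture_cmds vmnic0 "$port" 192.0.2.10 /vmfs/volumes/datastore1/case_000000 esx01
```

After reproducing the symptom, stop both captures as described in KB 341568.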

 

If every packet seen in the --uplink capture (where packets enter the host at the vmnic) also appears in the --switchport capture (where packets are delivered to the vCenter guest VM) and the contents of each packet match, you can rule out the ESXi networking stack (the virtual part of the transit path) as a cause of the symptoms.

  • When the packets match, any given packet appears in both captures, and the only difference is the timestamp, which typically differs by microseconds.

    In that case, ask the team(s) responsible for the infrastructure external to the vmnics to investigate the physical part of the transit path for the root cause.

 

If the packets do not match as described above, log a case with Broadcom Support to further investigate the virtual part of the transit path, as per KB Creating and managing Broadcom cases. Attach the pcap files captured above and a support bundle for the ESXi host where the vCenter resides.
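
The comparison of the two captures can be done by hand in a packet analyzer, or roughly automated as in the following sketch. It assumes the pcapng files have been copied to a workstation that has tshark installed and that the traffic is IPv4 TCP. The function names packet_keys and unmatched_packets and the file names in the comments are hypothetical. The sketch does not account for offloads that can legitimately re-segment packets inside the host, so treat any reported difference as a starting point for manual inspection, not as proof of corruption.

```shell
#!/bin/sh
# packet_keys PCAP: prints one line per TCP packet containing the addresses,
# IP ID, absolute sequence number, and payload bytes. Requires tshark.
packet_keys() {
    tshark -r "$1" -Y tcp -o tcp.relative_sequence_numbers:FALSE \
        -T fields -e ip.src -e ip.dst -e ip.id -e tcp.seq -e tcp.payload
}

# unmatched_packets UPLINK_KEYS SWITCHPORT_KEYS: prints every line of the
# uplink key file that has no identical line in the switchport key file.
unmatched_packets() {
    a=$(mktemp) b=$(mktemp)
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Example (file names are placeholders):
#   packet_keys uplink.pcapng     > uplink.keys
#   packet_keys switchport.pcapng > switchport.keys
#   unmatched_packets uplink.keys switchport.keys
```

Empty output from unmatched_packets means every packet in the uplink capture was found with identical contents in the switchport capture. Any printed line identifies a packet that was dropped or changed between the vmnic and the vCenter VM, or re-segmented by an offload.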