PSOD with "tcp_slowtimo" in the stack trace due to "Spin count exceeded - possible deadlock"

Article ID: 316422

Products

VMware vSphere ESX 7.x

VMware vSphere ESXi 8.0

Issue/Introduction

  • The ESXi host may fail with a purple diagnostic screen (PSOD) and a backtrace similar to the following:
Panic Details: Crash at yyyy-mm-ddThh:mm:ss.msZ on CPU 12 running world 2098438 - tq:tcpip4. VMK Uptime:4:18:53:03.040
Panic Message: @BlueScreen: Spin count exceeded - possible deadlock
Backtrace:
 0x453962e1bcb0:[0x42000bcfee4f]PanicvPanicInt@vmkernel#nover+0x327 stack: 0x453962e1bd88, 0x0, 0x42000bcfee4f, 0x453962e1be00, 0x453962e1bcb0
 0x453962e1bd80:[0x42000bcff3a8]Panic_NoSave@vmkernel#nover+0x4d stack: 0x453962e1bde0, 0x453962e1bda0, 0x453962e1be38, 0x453962e1be38, 0xc
 0x453962e1bde0:[0x42000bc1b9a7]Lock_CheckSpinCount@vmkernel#nover+0x2a0 stack: 0xfffffffffffffff8, 0x2ef6eb37ff1e8, 0x3ff, 0x431a5f7ec008, 0x1
 0x453962e1be30:[0x42000bd08ea1]SP_WaitReadLock@vmkernel#nover+0xba stack: 0x431a5f7ec008, 0x431a5f7ec00c, 0x0, 0x431a5f4dcf50, 0x431a5f4dcf50
 0x453962e1be70:[0x42000bd08f3e]SPAcqWriteLockWork@vmkernel#nover+0x33 stack: 0x41ffd4014b40, 0x42000d207631, 0x41ffd400b7a0, 0x42000d271760, 0x41ffd40030b0
 0x453962e1be90:[0x42000d207630]rw_wlock@(tcpip4)#<None>+0x8d stack: 0x41ffd40030b0, 0x41ffd4002f60, 0x16, 0x41ffd4008760, 0x0
 0x453962e1bea0:[0x42000d27175f]tcp_slowtimo@(tcpip4)#<None>+0xa8 stack: 0x16, 0x41ffd4008760, 0x0, 0x42000d21d2d3, 0x42000d200584
 0x453962e1bed0:[0x42000d21d2d2]pfslowtimo@(tcpip4)#<None>+0x27 stack: 0x0, 0x42000d2007d2, 0x8000000, 0x0, 0x431a5f212670
 0x453962e1bef0:[0x42000d2007d1]callout_timer@(tcpip4)#<None>+0x24e stack: 0x431a5f212670, 0x2770bfe00002001, 0x431a5f4d8118, 0x431a5f61ce00, 0x4e
 0x453962e1bf50:[0x42000bc2c4cb]VmkTimerQueueWorldFunc@vmkernel#nover+0x258 stack: 0x7e2c431a5f4d80c7, 0x0, 0x0, 0x431a5f61ce10, 0x4e
 0x453962e1bfe0:[0x42000bfb33e1]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0, 0x42000bcc4b50, 0x0, 0x0, 0x0
 0x453962e1c000:[0x42000bcc4b4f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0, 0x0, 0x0, 0x0, 0x0

 

  • On the ESXi host, /var/run/log/vmkernel.log may contain entries similar to the following, recorded before the crash and indicating the lock wait:
yyyy-mm-ddThh:mm:ss.msZ cpu77:2457091)WARNING: Heartbeat: 827: PCPU 4 didn't have a heartbeat for 7 seconds, timeout is 14, 1 IPIs sent; *may* be locked up.
yyyy-mm-ddThh:mm:ss.msZ cpu4:2098438)ALERT: NMI: 710: NMI IPI: RIPOFF(base):RBP:CS [0x1120b2(0x42000bc00000):0x3ff:0xf48] (Src 0x1, CPU4)
yyyy-mm-ddThh:mm:ss.msZ cpu4:2098438)0x453962e1be28:[0x42000bd120b1]Timer_GetCycles@vmkernel#nover+0x2 stack: 0x2ef5983c218d3
yyyy-mm-ddThh:mm:ss.msZ cpu4:2098438)0x453962e1be30:[0x42000bd08e84]SP_WaitReadLock@vmkernel#nover+0x9d stack: 0x431a5f7ec008
yyyy-mm-ddThh:mm:ss.msZ cpu4:2098438)0x453962e1be70:[0x42000bd08f3e]SPAcqWriteLockWork@vmkernel#nover+0x33 stack: 0x41ffd4014b40
yyyy-mm-ddThh:mm:ss.msZ cpu4:2098438)0x453962e1be90:[0x42000d207630]rw_wlock@(tcpip4)#<None>+0x8d stack: 0x41ffd40030b0
yyyy-mm-ddThh:mm:ss.msZ cpu4:2098438)0x453962e1bea0:[0x42000d27175f]tcp_slowtimo@(tcpip4)#<None>+0xa8 stack: 0x16
yyyy-mm-ddThh:mm:ss.msZ cpu4:2098438)0x453962e1bed0:[0x42000d21d2d2]pfslowtimo@(tcpip4)#<None>+0x27 stack: 0x0
yyyy-mm-ddThh:mm:ss.msZ cpu4:2098438)0x453962e1bef0:[0x42000d2007d1]callout_timer@(tcpip4)#<None>+0x24e stack: 0x431a5f2125c0
yyyy-mm-ddThh:mm:ss.msZ cpu4:2098438)0x453962e1bf50:[0x42000bc2c4cb]VmkTimerQueueWorldFunc@vmkernel#nover+0x258 stack: 0x7e2c431a5f4d80c7
yyyy-mm-ddThh:mm:ss.msZ cpu4:2098438)0x453962e1bfe0:[0x42000bfb33e1]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0
yyyy-mm-ddThh:mm:ss.msZ cpu4:2098438)0x453962e1c000:[0x42000bcc4b4f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
yyyy-mm-ddThh:mm:ss.msZ cpu4:2098438)WARNING: Lock: 1658: (held by -1: Spin count exceeded 2 time(s) - possible deadlock.
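The two symptoms above can be checked mechanically against a saved panic screen or log excerpt. The following is a minimal sketch, not a VMware tool: the function names frame_functions and matches_signature are hypothetical, and the frame pattern is inferred only from the excerpts shown in this article.

```python
import re

# Frame lines in the excerpts look like:
#   0x453962e1bea0:[0x42000d27175f]tcp_slowtimo@(tcpip4)#<None>+0xa8 stack: 0x16
# The pattern below is an assumption derived from those lines only.
FRAME_RE = re.compile(r"\[0x[0-9a-fA-F]+\]([A-Za-z_]\w*)@")
PANIC_TEXT = "Spin count exceeded"


def frame_functions(text):
    """Return the function names of all backtrace frames found in text, in order."""
    return FRAME_RE.findall(text)


def matches_signature(text):
    """True if text contains the spin-count message and a backtrace in which
    tcp_slowtimo is waiting on rw_wlock, as in this article's excerpts."""
    funcs = frame_functions(text)
    return (
        PANIC_TEXT in text
        and "tcp_slowtimo" in funcs
        and "rw_wlock" in funcs
    )
```

A typical use would be to read a copy of vmkernel.log, or the text of the purple screen, into a string and pass it to matches_signature. A match only suggests that the host hit this signature; it does not replace analysis of the core dump.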

Environment

VMware ESXi 7.0 U3

VMware ESXi 8.0 GA

 

Cause

While the TCP slow timer walks the list of connections in TIME_WAIT and closes the expired ones, contention on the global TCP pcbinfo lock might cause it to starve TCP input processing. As a result, the VMkernel might fail with a purple diagnostic screen and the error Spin count exceeded - possible deadlock while cleaning up TCP TIME_WAIT sockets. The backtrace points to tcpip functions such as tcp_slowtimo() or tcp_twstart().

Resolution

This issue is resolved in VMware ESXi 7.0 Update 3o. To download, go to Download Broadcom products and software. Refer to the VMware ESXi 7.0 Update 3o release notes, PR 3165374: ESXi hosts might become unresponsive and fail with a purple diagnostic screen during TCP TIME_WAIT.

This issue is resolved in VMware ESXi 8.0 Update 1. To download, go to Download Broadcom products and software.
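When triaging several hosts, a release name can be compared against the two fixed releases named above. The sketch below makes two assumptions that this article does not state: that later updates and patch letters in the same release line retain the fix, and that patch letters sort alphabetically. The has_fix helper and the accepted name formats ("7.0 Update 3o", "7.0 U3", "8.0 GA") are illustrative only.

```python
import re

# Fixed releases named in this article: (update number, patch letter).
FIXED_IN = {"7.0": (3, "o"), "8.0": (1, "")}

# Accepts names ending in "X.Y", "X.Y GA", "X.Y Update N[letter]" or "X.Y UN[letter]".
RELEASE_RE = re.compile(
    r"(\d+\.\d+)(?:\s+(?:GA|(?:Update\s*|U)(\d+)([a-z]?)))?\s*$"
)


def has_fix(release):
    """Return True or False for 7.0 and 8.0 release names, or None when the
    name is not recognized or belongs to a release line not covered here."""
    m = RELEASE_RE.search(release.strip())
    if not m or m.group(1) not in FIXED_IN:
        return None
    update = int(m.group(2)) if m.group(2) else 0  # GA counts as update 0
    letter = m.group(3) or ""
    # Assumption: releases at or after the fixed one retain the fix.
    return (update, letter) >= FIXED_IN[m.group(1)]
```

With these rules, has_fix("VMware ESXi 7.0 U3") and has_fix("VMware ESXi 8.0 GA") both return False, matching the Environment section, while has_fix("ESXi 7.0 Update 3o") returns True. Names the pattern does not recognize return None rather than a guess; confirm the actual fix status against the release notes for the installed build.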