ESXi host fails with PSOD "#PF Exception 14 in world xxxx:nsx-cfgagent" during bulk vMotions in an NSX-T environment

Article ID: 318942

Products

VMware Cloud Foundation
VMware NSX Networking
VMware vSphere ESXi

Issue/Introduction

Symptoms:
  • The installed NSX version is NSX-T Data Center 3.1.3 or 3.1.3.x
  • The ESXi host fails with PSOD "#PF Exception 14 in world xxxx:nsx-cfgagent"
  • The issue is observed when bulk vMotions occur in the NSX-T environment. Probable scenarios include:
    • Migration of multiple VMs, each with multiple vNICs
    • Multiple IP sets configured in CIDR form per rule
    • Multiple rules containing the same IP sets
    • VMs migrated from a non-upgraded NSX-T host to an upgraded NSX-T host
  • These scenarios may lead to a PSOD with one of the following backtraces:
Panic Message: @BlueScreen: #PF Exception 14 in world 58524293:nsx-cfgagent IP 0x4180320430b3 addr 0x10
Backtrace:
  0x451ac5c1b158:[0x4180320430b3]rn_walktree@(nsxt-vsip-19068435)#<None>+0x5b stack: 0x43261e082c18, 0x451ac5c1b288, 0x43261e082c18, 0x418031fb0890, 0x0

OR
 
Panic Message: @BlueScreen: #PF Exception 14 in world 99566160:NetWorld-VM- IP 0x41801bf58a33 addr 0x0
Backtrace:
2021-12-01T01:38:16.127Z cpu19:14797741)0x45393ac18ec0:[0x42002a413ef2]pfp_policy_lookup@(nsxt-vsip-18504670)#<None>+0xcbe stack: 0x43314afbd860
2021-12-01T01:38:16.151Z cpu19:14797741)0x45393ac19450:[0x42002a3b3dc3]pf_test_tcp@(nsxt-vsip-18504670)#<None>+0x5ac stack: 0x45dab513d7a8
2021-12-01T01:38:16.175Z cpu19:14797741)0x45393ac1abc0:[0x42002a3bcd87]pf_validate_state@(nsxt-vsip-18504670)#<None>+0x6c0 stack: 0x14
2021-12-01T01:38:16.201Z cpu19:14797741)0x45393ac1af00:[0x42002a3bd29b]pf_validate_session@(nsxt-vsip-18504670)#<None>+0x158 stack: 0x45393ac1af42
2021-12-01T01:38:16.227Z cpu19:14797741)0x45393ac1afd0:[0x42002a3bebe8]pf_test_state_tcp@(nsxt-vsip-18504670)#<None>+0x389 stack: 0x45da00000000
2021-12-01T01:38:16.251Z cpu19:14797741)0x45393ac1b0d0:[0x42002a3c53e7]pf_test@(nsxt-vsip-18504670)#<None>+0x25c4 stack: 0x45393ac1b160
2021-12-01T01:38:16.273Z cpu19:14797741)0x45393ac1b2e0:[0x42002a44bfb7]PFFilterPacket@(nsxt-vsip-18504670)#<None>+0x754 stack: 0x0
2021-12-01T01:38:16.298Z cpu19:14797741)0x45393ac1b5b0:[0x42002a372dd3]VSIPDVFProcessPacketsInt@(nsxt-vsip-18504670)#<None>+0x450 stack: 0x0
2021-12-01T01:38:16.324Z cpu19:14797741)0x45393ac1bc10:[0x420028f353b6][email protected]#v2_8_0_0+0xa3 stack: 0x1
2021-12-01T01:38:16.346Z cpu19:14797741)0x45393ac1bc50:[0x420028608cbd]IOChain_Resume@vmkernel#nover+0x2e6 stack: 0x43057b6ab3f0
2021-12-01T01:38:16.365Z cpu19:14797741)0x45393ac1bcf0:[0x42002864c946]Port_InputResume@vmkernel#nover+0xbf stack: 0x2
2021-12-01T01:38:16.386Z cpu19:14797741)0x45393ac1bd40:[0x4200286b8b9d]Vmxnet3VMKDevTQDoTx@vmkernel#nover+0xeca stack: 0x80
2021-12-01T01:38:16.407Z cpu19:14797741)0x45393ac1bee0:[0x4200286c1e5b]Vmxnet3VMKDev_AsyncTx@vmkernel#nover+0xb0 stack: 0x330
2021-12-01T01:38:16.427Z cpu19:14797741)0x45393ac1bf50:[0x4200286378a0]NetWorldPerVMCB@vmkernel#nover+0x5b9 stack: 0x0
2021-12-01T01:38:16.447Z cpu19:14797741)0x45393ac1bfe0:[0x420028781e69]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0
2021-12-01T01:38:16.467Z cpu19:14797741)0x45393ac1c000:[0x4200284c2c23]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
  • The VMkernel log file /var/run/log/vmkernel.log on the ESXi host shows entries similar to the following:
2022-01-05T17:17:05.259Z cpu39:2098345)ImportStateTLV entry type 12, len 52, cnt 1
2022-01-05T17:17:05.259Z cpu39:2098345)Importing from source version RELEASEbuild-19068435
2022-01-05T17:17:05.259Z cpu39:2098345)ImportStateTLV entry type 1, len 566915, cnt 2081
2022-01-05T17:17:05.259Z cpu39:2098345)pfr_unroute_kentry: delete failed.
2022-01-05T17:17:05.260Z cpu39:2098345)pfr_unroute_kentry: delete failed.
2022-01-05T17:17:05.260Z cpu39:2098345)pfr_unroute_kentry: delete failed.
2022-01-05T17:17:05.260Z cpu39:2098345)pfp_add_table_one_addr: failed to add ke
rn_addmask: mask impossibly already in tree2022-01-05T17:17:05.262Z cpu4:26296491)pfp_add_addr_with_rule: failed
2022-01-05T17:17:05.262Z cpu4:26296491)pfp_add: failed for dst
2022-01-05T17:17:05.262Z cpu4:26296491)pfp_del_addr_with_rule: cannot find matching entry flags 2
2022-01-05T17:17:05.262Z cpu4:26296491)pfp_del_port: fpp NULL, port 443, flags 8
2022-01-05T17:17:05.262Z cpu4:26296491)pfp_del_ruleid: rule not found 26238 rs 1
2022-01-05T17:17:05.262Z cpu4:26296491)pfioctl: failed to add rules (0)
2022-01-05T17:17:05.262Z cpu4:26296491)VSIPConversionCreateRuleSet: Cannot insert #1060 rule 26238: 22
2022-01-05T17:17:05.341Z cpu39:2098345)ImportStateTLV entry type 2, len 2086977, cnt 3
2022-01-05T17:17:05.341Z cpu4:26296491)pf_rollback_rules: rs_num: 1, anchor: mainrs
2022-01-05T17:17:05.342Z cpu4:26296491)pf_rollback_rules: rs_num: 2, anchor: mainrs
2022-01-05T17:17:05.342Z cpu4:26296491)pf_rollback_rules: rs_num: 4, anchor: mainrs
2022-01-05T17:17:05.342Z cpu4:26296491)pf_rollback_rules: rs_num: 5, anchor: mainrs
2022-01-05T17:17:05.342Z cpu4:26296491)pf_rollback_rules: rs_num: 6, anchor: mainrs
2022-01-05T17:17:05.342Z cpu39:2098345)configured filter nic-28500556-eth0-vmware-sfw.2
2022-01-05T17:17:05.370Z cpu39:2098345)ImportStateTLV entry type 5, len 804, cnt 20
2022-01-05T17:17:05.370Z cpu39:2098345)ImportStateTLV entry type 6, len 24, cnt 0
2022-01-05T17:17:05.370Z cpu39:2098345)ImportStateTLV entry type 13, len 24, cnt 0
2022-01-05T17:17:05.370Z cpu39:2098345)ImportStateTLV entry type 3, len 17874, cnt 35
2022-01-05T17:17:05.370Z cpu39:2098345)ImportStateTLV entry type 9, len 24, cnt 0
2022-01-05T17:17:05.370Z cpu39:2098345)ImportStateTLV entry type 11, len 45, cnt 4
2022-01-05T17:17:05.370Z cpu39:2098345)ImportStateTLV entry type 8, len 24, cnt 0
2022-01-05T17:17:05.370Z cpu39:2098345)ImportStateTLV entry type 7, len 2224, cnt 5
2022-01-05T17:17:05.370Z cpu39:2098345)Importing succeeded

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
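
The log signature above can be searched for with a short script. The following is a minimal sketch, not a supported tool: the choice of message fragments is an assumption based only on the sample excerpt in this article, and the function names count_signatures and scan_file are hypothetical. A match indicates that the log resembles the excerpt; it does not by itself confirm this issue.

```python
from collections import Counter

# Message fragments copied from the sample vmkernel.log excerpt in this article.
# Treating these as the "signature" of the issue is an assumption.
SIGNATURES = (
    "pfr_unroute_kentry: delete failed",
    "pfp_add_table_one_addr: failed to add",
    "rn_addmask: mask impossibly already in tree",
    "pfp_add_addr_with_rule: failed",
    "pfioctl: failed to add rules",
    "VSIPConversionCreateRuleSet: Cannot insert",
)


def count_signatures(lines):
    """Count occurrences of each signature fragment in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for signature in SIGNATURES:
            # str.count handles two messages fused on one line, as in the sample.
            counts[signature] += line.count(signature)
    return counts


def scan_file(path="/var/run/log/vmkernel.log"):
    """Scan a vmkernel log file; the default path is the one named in this article."""
    with open(path, errors="replace") as handle:
        return count_signatures(handle)
```

Calling scan_file() on the host, or scan_file("vmkernel.log") against a copy taken from a support bundle, returns a per-message count; non-zero counts for several fragments around the time of a vMotion import match the pattern shown in the excerpt.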

Sample PSOD Screenshot:
PSOD_cfagent.PNG


Environment

VMware NSX-T Data Center
VMware vSphere ESXi 6.7
VMware vSphere ESXi 7.x
VMware NSX-T Data Center 3.x
VMware Cloud Foundation 4.x

Cause

This issue is caused by corruption of an internal data structure in the firewall code.
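
To compare a PSOD against the sample backtraces under Symptoms, the panic message and the faulting frame can be parsed as sketched below. This is an illustrative sketch only: the regular expressions are derived from the two samples in this article, the helper names are hypothetical, and treating a module name that starts with nsxt-vsip as the firewall code is an assumption based on those samples.

```python
import re

# Pattern derived from the sample panic messages in this article.
PANIC_RE = re.compile(
    r"#PF Exception 14 in world (?P<world_id>\d+):(?P<world_name>\S+)"
    r" IP (?P<ip>0x[0-9a-fA-F]+) addr (?P<addr>0x[0-9a-fA-F]+)"
)

# Matches frames such as "[0x...]rn_walktree@(nsxt-vsip-19068435)#<None>+0x5b"
# and "[0x...]IOChain_Resume@vmkernel#nover+0x2e6".
FRAME_RE = re.compile(
    r"\[0x[0-9a-fA-F]+\](?P<func>\w+)@\(?(?P<module>[\w.\-]+)\)?#"
)


def parse_panic(text):
    """Return world id, world name, IP and faulting address, or None if no match."""
    match = PANIC_RE.search(text)
    return match.groupdict() if match else None


def top_frame(backtrace):
    """Return (function, module) for the first parsable frame, or None."""
    match = FRAME_RE.search(backtrace)
    return (match.group("func"), match.group("module")) if match else None


def in_nsx_firewall_module(backtrace):
    """True if the first frame is in a module named nsxt-vsip-* (assumption)."""
    frame = top_frame(backtrace)
    return frame is not None and frame[1].startswith("nsxt-vsip")
```

A panic in world nsx-cfgagent or NetWorld-VM- whose first frame resolves to an nsxt-vsip module is consistent with the samples shown in this article.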

Resolution

This issue is resolved in VMware NSX-T Data Center 3.1.3.7 and later releases, available at Broadcom Downloads.
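
When auditing several environments, the version check implied above can be sketched as follows. The function names are hypothetical, and the sketch assumes that every 3.1.3 and 3.1.3.x build below 3.1.3.7 is affected, which is how this article describes the affected and fixed releases; it makes no claim about other release trains.

```python
AFFECTED_BASE = (3, 1, 3)   # NSX-T Data Center 3.1.3 and 3.1.3.x (per Symptoms)
FIXED_IN = (3, 1, 3, 7)     # resolved in 3.1.3.7 and later (per Resolution)


def parse_version(text):
    """Turn a dotted version string such as '3.1.3.6' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(version_text):
    """True for 3.1.3 / 3.1.3.x versions below 3.1.3.7 (assumption, see lead-in)."""
    version = parse_version(version_text)
    # Pad short versions so that '3.1.3' compares as 3.1.3.0.
    padded = version + (0,) * (4 - len(version))
    return padded[:3] == AFFECTED_BASE and padded[:4] < FIXED_IN
```

For example, is_affected("3.1.3.6") is True, while is_affected("3.1.3.7.4"), the NSX-T version listed for VCF 4.4.1, is False.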

Be aware of the following known issue in NSX-T Data Center 3.1.3.6, which impacts customers with L4 Load Balancers configured. Review the following KB for details:
NSX-T 3.1.3.6 Edge configured with an L4 LB stops passing all traffic (87627) 


For VCF environments, this issue is resolved in VCF 4.4.1 (NSX-T version 3.1.3.7.4).
To address this issue in a VCF environment, upgrade to VCF 4.4.1.


If the VCF environment cannot be upgraded to 4.4.1 and you need to stay on your current VCF release while still upgrading NSX-T to 3.1.3.7, create a Support Request with VMware Support to design a manual update plan that is specific to your environment and accommodates all VCF considerations.


Workaround:

  • Set DRS to manual on the ESXi cluster and avoid performing bulk vMotions.
Refer to this link to set DRS to Manual



Additional Information

NSX-T 3.1.3.6 Release Notes:
https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.1/rn/VMware-NSX-T-Data-Center-3136-Release-Notes.html