VDR instance on multiple ESXi hosts had incomplete routing table resulting in packet drops

Article ID: 319972

Updated On: 10-21-2024

Products

VMware NSX

Issue/Introduction

Symptoms:
  • The VDR instance on multiple ESXi hosts has an incomplete routing table, which results in packet drops.
  • The VDR instances on multiple ESXi hosts received incomplete route updates.
  • The issue can appear whenever a full sync takes place, for example after a vMotion of the Control VM, a flap of the netcpa connection, or a controller refresh.
  • Sample messages such as the following can be seen on the DLR Control VM:

##### MSR logs of the DLR Control VM show "bundle exceeds max buffer size, code 4108" #####

 

 **** PROBLEM 0x3d02 - 75 (0000) **** I:000f9a64 F:00000001
vmw_sync.c 2043 :at <time> <Date> (362408300 ms)
Sync layer problem: Type: VMCI, misconfiguration append eom: bundle exceeds max buffer size, code 4108

**** AUDIT 0x3d02 - 76 (0000) **** I:000f9a64 F:00000002
vmw_sync.c 1246 :at <time> <Date> (362408300 ms)
Sync layer info: Type: VMCI, vmw_sync_fsm_transit: state SYNCING event SCAN_STOP

**** AUDIT 0x3d02 - 76 (0000) **** I:000f9a64 F:00000002
vmw_sync.c 792 :at <time> <Date> (362408300 ms)
Sync layer info: Type: VMCI, unfreeze sync

  • Sample messages such as the following can be seen on the ESXi host:

##### vmkernel.log on the ESXi hosts shows VDR route flush timer set and route deleted messages #####

<Timestamp> cpu4:22387124)vdrb: VdrRtCreateFlushTimer:1309: INST:[I:0xc355] Route defer flush timer is set to 300 Secs (status = 0)
<Timestamp> cpu67:6685725)vdrb: VdrDeleteRtSFList:1114: CP:[I:0xc355]: Deleted rt prefix:0x0088d90a prefix len:0x00000017
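To see which route a "Deleted rt prefix" line refers to, the two hexadecimal fields can be decoded into CIDR notation. The following is a minimal sketch, not an official tool. It assumes that the prefix value is printed in little-endian byte order; that assumption is not confirmed by this article, but with it the sample value decodes to a properly aligned network. The function name `decode_deleted_route` is hypothetical.

```python
import ipaddress
import re

# Matches the two hex fields of the "Deleted rt prefix" vmkernel.log message.
DELETED_ROUTE = re.compile(
    r"Deleted rt prefix:0x([0-9a-fA-F]{8}) prefix len:0x([0-9a-fA-F]{8})"
)


def decode_deleted_route(line):
    """Return the deleted route as an IPv4Network, or None if the line does not match."""
    match = DELETED_ROUTE.search(line)
    if match is None:
        return None
    raw_prefix = int(match.group(1), 16)
    prefix_len = int(match.group(2), 16)
    # ASSUMPTION: the integer is logged in little-endian byte order, so the
    # bytes are reversed to obtain the usual dotted-quad address.
    address = ipaddress.IPv4Address(raw_prefix.to_bytes(4, "little"))
    # strict=False tolerates host bits in case the byte-order assumption is wrong.
    return ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)


sample = (
    "<Timestamp> cpu67:6685725)vdrb: VdrDeleteRtSFList:1114: CP:[I:0xc355]: "
    "Deleted rt prefix:0x0088d90a prefix len:0x00000017"
)
print(decode_deleted_route(sample))  # 10.217.136.0/23 under the stated assumption
```

If the decoded networks do not match any route configured on the DLR, the byte-order assumption should be revisited before drawing conclusions.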

 

Environment

VMware NSX-V Datacenter

Cause

  • The DLR Control VM does not push all of its routes to the controllers in one go. Instead, it sends the routes in bundles of 4096 bytes each.
  • The last bundle is supposed to carry the "eom" header. In one specific case, however, this header is not added, which breaks the sync state machine of the Control VM.
  • With 2-way ECMP, a bundle accommodates 116 routes.
  • With no ECMP, a bundle accommodates 140 routes.
  • The issue occurs only when the number of routes is an exact multiple of 116 (2-way ECMP) or 140 (no ECMP) and the last route is a connected route.
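The trigger condition above comes down to simple modular arithmetic, which the following minimal sketch illustrates. It is not part of NSX: the `Route` type, the `kind` labels, and the `is_at_risk` function are hypothetical, and the sketch assumes the list is ordered the same way the Control VM sends its routes, which this article does not specify. Only the capacities 116 and 140 come from the article.

```python
from dataclasses import dataclass

# Routes per bundle, as quoted in this article.
ROUTES_PER_BUNDLE = {"ecmp_2way": 116, "no_ecmp": 140}


@dataclass
class Route:
    prefix: str
    kind: str  # hypothetical labels, e.g. "connected", "static", "ospf", "bgp"


def is_at_risk(routes, mode):
    """Return True when the route list matches the trigger condition described above."""
    if not routes:
        return False
    per_bundle = ROUTES_PER_BUNDLE[mode]
    # The last bundle is completely full when the count is an exact multiple.
    fills_last_bundle = len(routes) % per_bundle == 0
    last_is_connected = routes[-1].kind == "connected"
    return fills_last_bundle and last_is_connected


# 139 static routes followed by one connected route: 140 routes in total.
table = [Route(f"10.0.{i}.0/24", "static") for i in range(139)]
table.append(Route("192.168.10.0/24", "connected"))

print(is_at_risk(table, "no_ecmp"))    # True: 140 % 140 == 0 and the last route is connected
print(is_at_risk(table, "ecmp_2way"))  # False: 140 % 116 == 24
```

A check like this could be run against an exported route list to spot DLRs that sit exactly on a bundle boundary.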

 

 

Resolution

There is currently no resolution for this issue.


Workaround:

Option 1:

  • Add a dummy static route to the DLR so that the last chunk of routes is no longer completely full. The last chunk then has a different size, and the eom header is added to the message sent to the controller. Make sure that the static route is only a dummy route that no traffic will ever hit, and that its next hop is in a valid network that always remains reachable from the DLR.

Option 2:

  • Ensure that the last route on the DLR Control VM is not a connected route. This prevents the issue from occurring.

 

Reactive workaround: Log monitoring can be configured to detect the issue in the environment, using the error keywords shown in the logs below, so that one of the workarounds above can be applied when the issue occurs.

 

##### MSR logs of the DLR Control VM show "bundle exceeds max buffer size, code 4108" #####

 **** PROBLEM 0x3d02 - 75 (0000) **** I:000f9a64 F:00000001
vmw_sync.c 2043 :at <time> <Date> (362408300 ms)
Sync layer problem: Type: VMCI, misconfiguration append eom: bundle exceeds max buffer size, code 4108

**** AUDIT 0x3d02 - 76 (0000) **** I:000f9a64 F:00000002
vmw_sync.c 1246 :at <time> <Date> (362408300 ms)
Sync layer info: Type: VMCI, vmw_sync_fsm_transit: state SYNCING event SCAN_STOP

**** AUDIT 0x3d02 - 76 (0000) **** I:000f9a64 F:00000002
vmw_sync.c 792 :at <time> <Date> (362408300 ms)
Sync layer info: Type: VMCI, unfreeze sync

##### vmkernel.log on the ESXi hosts shows VDR route flush timer set and route deleted messages #####

<Timestamp> cpu4:22387124)vdrb: VdrRtCreateFlushTimer:1309: INST:[I:0xc355] Route defer flush timer is set to 300 Secs (status = 0)
<Timestamp> cpu67:6685725)vdrb: VdrDeleteRtSFList:1114: CP:[I:0xc355]: Deleted rt prefix:0x0088d90a prefix len:0x00000017
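For the reactive approach, a small script can scan collected log files for these keywords. The following is a minimal sketch under stated assumptions: the signature names and the `scan_lines` function are hypothetical, the patterns are taken only from the sample messages in this article, and how the logs are collected (for example from a syslog server) is left to the environment.

```python
import re

# Keywords taken from the sample log messages in this article.
SIGNATURES = {
    "control_vm_eom_failure": re.compile(r"bundle exceeds max buffer size, code 4108"),
    "host_flush_timer_set": re.compile(r"VdrRtCreateFlushTimer:.*Route defer flush timer is set"),
    "host_route_deleted": re.compile(r"VdrDeleteRtSFList:.*Deleted rt prefix"),
}


def scan_lines(lines):
    """Yield (line_number, signature_name, line) for every line that matches a signature."""
    for number, line in enumerate(lines, start=1):
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                yield number, name, line.rstrip("\n")


sample_log = [
    "Sync layer problem: Type: VMCI, misconfiguration append eom: "
    "bundle exceeds max buffer size, code 4108",
    "Sync layer info: Type: VMCI, unfreeze sync",
    "<Timestamp> cpu4:22387124)vdrb: VdrRtCreateFlushTimer:1309: INST:[I:0xc355] "
    "Route defer flush timer is set to 300 Secs (status = 0)",
]

hits = list(scan_lines(sample_log))
for number, name, line in hits:
    print(f"line {number}: {name}: {line}")
```

The same generator works on a real file by passing an open file object, for example `scan_lines(open(path))`, where `path` points to a copy of the Control VM log or of vmkernel.log.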


Additional Information

 

 


Impact/Risks:

Data path traffic is interrupted.