VMs do not process firewall rules even though they are part of the specified NSX Security Group.

Article ID: 314281

Updated On:

Products

VMware vDefend Firewall

Issue/Introduction

  • The Distributed Firewall (DFW) blocks traffic even though rules exist that allow it.

  • You see log messages in vsm.log similar to:

WARN SimpleAsyncTaskExecutor-1 AbstractMessageListenerContainer:431 - Execution of Rabbit message listener failed, and no ErrorHandler has been set.
org.springframework.amqp.rabbit.listener.ListenerExecutionFailedException: Listener threw exception

  • You find that there is no IP listed for the VM in the address set.

Get filter name:

# summarize-dvfilter | grep "<VMNAME>" -A 4
world 4311603 vmm0:<VMNAME> vcUuid:'50 04 a8 76 ## ## ## ##-## ## ## ## 45 ef 1a f0'
 port 67109049 <VMNAME>.eth0
  vNic slot 2
   name: nic-4311603-eth0-vmware-sfw.2

Get Rule ID:

# vsipioctl getrules -f nic-4311603-eth0-vmware-sfw.2 | grep <Rule ID>
    rule <Rule ID> at 1468 inout protocol any from addrset ip-securitygroup-15 to addrset ip-securitygroup-15 accept;

Get IP in address set:

# vsipioctl getaddrsets -f nic-4311603-eth0-vmware-sfw.2 -a ip-securitygroup-15 | grep <IPaddress>

If no output is returned, the address set does not contain an IP for that VM.

  • You see log messages on NSX Manager for firewall updates, but you never see the final message that the update completed successfully.

Start of task:

2018-05-22 11:37:27.626 CDT INFO TaskFrameworkExecutor-18 NotificationProcessor:240 - Setting rule update for domain-c7003 , update number 6 , generation number 1527007047480, object generation number 1527007047480.
2018-05-22 11:37:27.638 CDT INFO TaskFrameworkExecutor-18 NotificationProcessor:240 - Setting rule update for domain-c93544 , update number 4 , generation number 1527007047480, object generation number 1527007047480.

 

End of task:

2018-05-22 08:36:25.056 CDT INFO TaskFrameworkExecutor-8 NotificationProcessor:417 - Processing Context domain-c7003 : 1 rule updates, 3/4 container updates, 0 spoofguard updates.
2018-05-22 08:36:25.092 CDT INFO TaskFrameworkExecutor-23 NotificationProcessor:417 - Processing Context domain-c93544 : 1 rule updates, 3/4 container updates, 0 spoofguard updates.
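
The comparison between "start of task" and "end of task" entries can be automated. The following is a minimal Python sketch, not a supported tool. It assumes that every publish logs a "Setting rule update for <cluster>" line and that a completed publish logs a later "Processing Context <cluster>" line, as in the excerpts above. The function name and the regular expressions are illustrative, and other NSX builds may format these lines differently.

```python
import re
from datetime import datetime

# Patterns derived from the log excerpts in this article (an assumption;
# other NSX versions may word or format these lines differently).
TIMESTAMP = r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
START_RE = re.compile(TIMESTAMP + r"Setting rule update for (domain-c\d+)")
END_RE = re.compile(TIMESTAMP + r"Processing Context (domain-c\d+)")


def _parse_ts(text):
    """Parse the leading timestamp; the time zone token is ignored."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def clusters_with_pending_publish(lines):
    """Return cluster IDs whose latest 'Setting rule update' entry has no
    later 'Processing Context' entry."""
    last_start, last_end = {}, {}
    for line in lines:
        for pattern, seen in ((START_RE, last_start), (END_RE, last_end)):
            match = pattern.match(line)
            if match:
                ts, cluster = _parse_ts(match.group(1)), match.group(2)
                # Keep only the most recent timestamp per cluster.
                if cluster not in seen or ts > seen[cluster]:
                    seen[cluster] = ts
                break
    return sorted(
        cluster
        for cluster, started in last_start.items()
        if cluster not in last_end or last_end[cluster] < started
    )


# Log lines copied from the excerpts above.
SAMPLE = [
    "2018-05-22 11:37:27.626 CDT INFO TaskFrameworkExecutor-18 NotificationProcessor:240 - Setting rule update for domain-c7003 , update number 6 , generation number 1527007047480, object generation number 1527007047480.",
    "2018-05-22 11:37:27.638 CDT INFO TaskFrameworkExecutor-18 NotificationProcessor:240 - Setting rule update for domain-c93544 , update number 4 , generation number 1527007047480, object generation number 1527007047480.",
    "2018-05-22 08:36:25.056 CDT INFO TaskFrameworkExecutor-8 NotificationProcessor:417 - Processing Context domain-c7003 : 1 rule updates, 3/4 container updates, 0 spoofguard updates.",
    "2018-05-22 08:36:25.092 CDT INFO TaskFrameworkExecutor-23 NotificationProcessor:417 - Processing Context domain-c93544 : 1 rule updates, 3/4 container updates, 0 spoofguard updates.",
]

if __name__ == "__main__":
    print(clusters_with_pending_publish(SAMPLE))
```

A cluster reported by this check has a publish that started but never logged completion, which matches the symptom described in this bullet. A publish that is still in progress looks the same, so the result is only meaningful once enough time has passed for a normal publish to finish.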



Resolution

This issue is resolved in VMware NSX Data Center for vSphere 6.4.3.
 
A firewall publish starts when the Firewall Publish Section API is invoked (by Service Composer, the DFW UI, or the REST API). This API internally calls SimpleTaskManager, which creates a new job in the database for each cluster-id that needs a rule update. The cache of cluster-id to job-id mappings is then updated.

Note that at this point the job object has only been created in the database; it does not start executing until the transaction is committed. After the transaction commits, a thread is allocated to execute the job. Once the job starts executing, its entry is removed from the cache.

The next time a publish is triggered, SimpleTaskManager checks its cache for an existing entry for the given cluster-id. If an entry is found, no new job is created; only a counter is updated and the flow returns. An entry in the cache means that a job to publish rules on this cluster has already been created but has not yet started, and that job will take care of publishing rules for this cluster-id.

The exception seen in vsm.log caused the transaction to be rolled back, which also rolled back the job objects created in the database. The job-ids, however, remained in the cache, leaving the cache corrupted. From this point onward, any publish request for one of these clusters (those with a corrupt cache entry) only increments the counter, and no new task is scheduled. The end result is that the IP for the VM is never pushed down to the address set local to the ESXi host.
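
The sequence above can be illustrated with a small toy model. This is only a sketch of the described behavior, assuming a simple in-memory stand-in for the database and the cache. It is not NSX Manager code, and every name in it other than the SimpleTaskManager concept (the class, its attributes, and the `fail_before_commit` flag) is hypothetical.

```python
class SimpleTaskManagerModel:
    """Toy model of the publish flow described in this article (not NSX code)."""

    def __init__(self):
        self.db_jobs = {}    # committed jobs: job_id -> cluster_id
        self.cache = {}      # cluster_id -> job_id for jobs created but not yet started
        self.skipped = {}    # cluster_id -> publishes absorbed by a cached job
        self.executed = []   # cluster_ids whose rules were actually published
        self._next_id = 1

    def publish(self, cluster_ids, fail_before_commit=False):
        pending = {}         # jobs created inside the still-open transaction
        for cluster in cluster_ids:
            if cluster in self.cache:
                # A job is believed to exist already: only bump the counter.
                self.skipped[cluster] = self.skipped.get(cluster, 0) + 1
                continue
            job_id = self._next_id
            self._next_id += 1
            pending[job_id] = cluster
            self.cache[cluster] = job_id
        if fail_before_commit:
            # Rollback: the pending jobs are discarded, but the cache entries
            # are left behind. This models the defect described above.
            return
        # Commit: jobs are persisted, start executing, and leave the cache.
        self.db_jobs.update(pending)
        for cluster in pending.values():
            self.cache.pop(cluster, None)
            self.executed.append(cluster)

    def restart(self):
        """Model an NSX Manager restart, which resets the cache."""
        self.cache.clear()


if __name__ == "__main__":
    mgr = SimpleTaskManagerModel()
    mgr.publish(["domain-c7003"])                           # normal publish
    mgr.publish(["domain-c7003"], fail_before_commit=True)  # exception, rollback
    mgr.publish(["domain-c7003"])                           # only the counter changes
    print(mgr.executed, mgr.cache, mgr.skipped)
    mgr.restart()
    mgr.publish(["domain-c7003"])                           # publishes again
    print(mgr.executed, mgr.cache)
```

In this model, once a rollback leaves a stale entry in the cache, every later publish for that cluster is absorbed by the counter and nothing is executed until the cache is cleared.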

The only option to recover from this situation is to restart the NSX Manager, which resets the cache.




Additional Information