Problem: Service sa-scheduler-services is degraded on NSX Application Platform (NAPP).
Impact: Verdicts for various submissions may not be up to date in MPS.
Events that were rescored in Malscape may not be updated in the MPS service running on NAPP.
Affected versions: All NAPP versions 4.x and earlier.
Cause: The Analyst Sync service, which runs as part of sa-scheduler-services on the NSX Application Platform, restarts frequently when a large number of rescored events is received from Malscape's Analyst Sync API, due to a code issue.
Symptoms:
1. SSH into one of the NSX Manager nodes and check the status of the sa-scheduler-services pod running in NAPP with the following command:
root:~# napp-k get pods | grep sa-scheduler-services
Example output is shown below; note the high restart count.
root:~# napp-k get pods | grep sa-scheduler-services
NAME READY STATUS RESTARTS
sa-scheduler-services-7fb585897f-7gz5x 1/1 Running 998 (83m ago)
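Optionally, the reason for the pod's most recent container termination can be checked with the 'describe' command (shown here with the example pod name from the output above; the suffix will differ in your environment):
root:~# napp-k describe pod sa-scheduler-services-7fb585897f-7gz5x | grep -A 5 'Last State'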
2. Check the logs of the sa-scheduler-services pod to understand why it is continuously restarting. Kubernetes allows you to view the logs of containers running inside pods with the 'logs' command.
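For example, using the example pod name from the output above (the suffix will differ in your environment):
root:~# napp-k logs sa-scheduler-services-7fb585897f-7gz5x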
Scheduler services pod logs
?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) and verdict_source = ?]; An I/O error occurred while sending to the backend.; nested exception is org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
at org.springframework.jdbc.support.SQLStateSQLExceptionTranslator.doTranslate(SQLStateSQLExceptionTranslator.java:107)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:82)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:82)
at org.springframework.jdbc.core.JdbcTemplate.translateException(JdbcTemplate.java:1575)
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:667)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:713)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:738)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:794)
at org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate.query(NamedParameterJdbcTemplate.java:212)
at com.vmware.nsx.sa.analystsync.dao.AnalystInterceptedEntitiesDao.getTasksByAnalystUuids(AnalystInterceptedEntitiesDao.java:56)
at com.vmware.nsx.sa.analystsync.dao.AnalystInterceptedEntitiesDao$$FastClassBySpringCGLIB$$13c9052d.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:792)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:762)
at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:137)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:762)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:707)
at com.vmware.nsx.sa.analystsync.dao.AnalystInterceptedEntitiesDao$$EnhancerBySpringCGLIB$$222b9595.getTasksByAnalystUuids(<generated>)
at com.vmware.nsx.sa.analystsync.service.AnalystSyncService.updateVerdictInDb(AnalystSyncService.java:241)
at com.vmware.nsx.sa.analystsync.service.AnalystSyncService.updateVerdictInDbAndPublishToTopics(AnalystSyncService.java:268)
at com.vmware.nsx.sa.analystsync.scraper.AnalystSyncDataScraper.sync(AnalystSyncDataScraper.java:70)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
3. The postgresql-ha-postgresql-0.log logs show repeated "Connection reset by peer" messages for connections from the worker node "napp-worker-nodepool-a1-ws577-84fd4695dcxdcghw-vfb9m", on which both the Postgres and SA pods reside:
2024-11-13 08:39:44.891 GMT [10082] LOG: could not receive data from client: Connection reset by peer","kubernetes":{"pod_name":"postgresql-ha-postgresql-0","namespace_name":"nsxi-platform","pod_id":"11d8c8eb-6d68-499c-a90c-bdef23d23f24","host":"ns-napp-worker-nodepool-a1-ws577-84fd4695dcxdcghw-vfb9m","container_name":"postgresql","docker_id":"074bf9b8d1258cee3dd980d5d3995f5af97d6f8d297686ba55a74b9abc363296","container_hash":"projects.registry.vmware.com/nsx_application_platform/clustering/third-party/postgresql-repmgr@sha256:d4407bf09643709bf1ecc63ceb4c42cf49eba0ece823befdfd388d50b259d732","container_image":"sha256:bd7a898d40959c78d0f32f19c4f2a8781ae2f2a7d928b5ba08d9c3c47de03529"}}
{"log":"2024-11-13T08:40:10.947976661Z stdout F 2024-11-13 08:40:10.947 GMT [16160] LOG: could not receive data from client: Connection reset by peer","kubernetes":{"pod_name":"postgresql-ha-postgresql-0","namespace_name":"nsxi-platform","pod_id":"11d8c8eb-6d68-499c-a90c-bdef23d23f24","host":"ns-napp-worker-nodepool-a1-ws577-84fd4695dcxdcghw-vfb9m","container_name":"postgresql","docker_id":"074bf9b8d1258cee3dd980d5d3995f5af97d6f8d297686ba55a74b9abc363296","container_hash":"projects.registry.vmware.com/nsx_application_platform/clustering/third-party/postgresql-repmgr@sha256:d4407bf09643709bf1ecc63ceb4c42cf49eba0ece823befdfd388d50b259d732","container_image":"sha256:bd7a898d40959c78d0f32f19c4f2a8781ae2f2a7d928b5ba08d9c3c47de03529"}}
{"log":"2024-11-13T08:40:11.339627339Z stdout F 2024-11-13 08:40:11.337 GMT [13551] LOG: could not receive data from client: Connection reset by peer","kubernetes":{"pod_name":"postgresql-ha-postgresql-0","namespace_name":"nsxi-platform","pod_id":"11d8c8eb-6d68-499c-a90c-bdef23d23f24","host":"ns-napp-worker-nodepool-a1-ws577-84fd4695dcxdcghw-vfb9m","container_name":"postgresql","docker_id":"074bf9b8d1258cee3dd980d5d3995f5af97d6f8d297686ba55a74b9abc363296","container_hash":"projects.registry.vmware.com/nsx_application_platform/clustering/third-party/postgresql-repmgr@sha256:d4407bf09643709bf1ecc63ceb4c42cf49eba0ece823befdfd388d50b259d732","container_image":"sha256:bd7a898d40959c78d0f32f19c4f2a8781ae2f2a7d928b5ba08d9c3c47de03529"}}
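The same messages can also be retrieved directly from the Postgres pod with the 'logs' command; a sketch, assuming the container name "postgresql" shown in the records above:
root:~# napp-k logs postgresql-ha-postgresql-0 -c postgresql | grep 'Connection reset by peer'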
Workaround:
Remove the rescoring sync time entry from the Postgres database on the NAPP platform to reset it to the most recent time. This ensures the problematic event is bypassed in the backend. The sa-scheduler-services pod processes events at 5-minute intervals, so after the sync time is deleted, events from the past 5 minutes will be processed successfully.
Steps:
1. Access the Postgres database
a. From the NSX Manager CLI, execute:
napp-k exec -it postgresql-ha-postgresql-0 -- /bin/bash
b. Fetch the Postgres password:
echo $POSTGRES_PASSWORD
c. Launch the psql CLI and enter the password when prompted:
psql -U postgres -h localhost
2. On the psql CLI, connect to the relevant database:
\c malwareprevention
3. Execute the deletion command:
DELETE FROM sa_configurations WHERE key = 'rescoring-sync-time';
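Optionally, the stored value can be inspected before deletion (or its removal verified afterwards) with a simple query against the same table and key used above:
SELECT * FROM sa_configurations WHERE key = 'rescoring-sync-time';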
This action resets the rescoring sync time and skips the problematic event. Events are processed by the sa-scheduler-services pod every 5 minutes.
The pod restart issue should no longer be observed after applying this workaround.
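To confirm, re-run the pod status check from the Symptoms section and verify that the RESTARTS count is no longer increasing:
root:~# napp-k get pods | grep sa-scheduler-services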