Problem: Service sa-scheduler-services is degraded on NSX Application Platform (NAPP).
Impact: Verdicts for various submissions may not be up to date in MPS.
Events that were rescored in Malscape may not be updated in the MPS service running on NAPP.
Affected versions: All NAPP versions 4.x and earlier.
Cause: The Analyst Sync service, which runs as part of sa-scheduler-services on the NSX Application Platform, restarts frequently when a large number of rescored events is received from Malscape's Analyst Sync API, due to a code issue.
Symptoms:
1. SSH into one of the NSX Manager nodes and check the status of the sa-scheduler-services pod running in NAPP with the following command:
root:~# napp-k get pods | grep sa-scheduler-services
Example output is shown below; note the high restart count.
root:~# napp-k get pods | grep sa-scheduler-services
NAME READY STATUS RESTARTS
sa-scheduler-services-7fb585897f-7gz5x 1/1 Running 998 (83m ago)
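Optionally, the reason for the pod's most recent container termination can be checked with the 'describe' command (shown here with the example pod name from the output above; the suffix will differ in your environment):
root:~# napp-k describe pod sa-scheduler-services-7fb585897f-7gz5x | grep -A 5 'Last State'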
2. Check the logs of the sa-scheduler-services pod to understand why it is continuously restarting. Kubernetes allows you to view the logs of containers running inside pods with the 'logs' command.
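For example, using the example pod name from the output above (the suffix will differ in your environment):
root:~# napp-k logs sa-scheduler-services-7fb585897f-7gz5x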
Scheduler services pod logs
?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) and verdict_source = ?]; An I/O error occurred while sending to the backend.; nested exception is org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
at org.springframework.jdbc.support.SQLStateSQLExceptionTranslator.doTranslate(SQLStateSQLExceptionTranslator.java:107)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:82)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:82)
at org.springframework.jdbc.core.JdbcTemplate.translateException(JdbcTemplate.java:1575)
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:667)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:713)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:738)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:794)
at org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate.query(NamedParameterJdbcTemplate.java:212)
at com.vmware.nsx.sa.analystsync.dao.AnalystInterceptedEntitiesDao.getTasksByAnalystUuids(AnalystInterceptedEntitiesDao.java:56)
at com.vmware.nsx.sa.analystsync.dao.AnalystInterceptedEntitiesDao$$FastClassBySpringCGLIB$$13c9052d.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:792)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:762)
at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:137)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:762)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:707)
at com.vmware.nsx.sa.analystsync.dao.AnalystInterceptedEntitiesDao$$EnhancerBySpringCGLIB$$222b9595.getTasksByAnalystUuids(<generated>)
at com.vmware.nsx.sa.analystsync.service.AnalystSyncService.updateVerdictInDb(AnalystSyncService.java:241)
at com.vmware.nsx.sa.analystsync.service.AnalystSyncService.updateVerdictInDbAndPublishToTopics(AnalystSyncService.java:268)
at com.vmware.nsx.sa.analystsync.scraper.AnalystSyncDataScraper.sync(AnalystSyncDataScraper.java:70)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
3. The postgresql-ha-postgresql-0.log logs show repeated "Connection reset by peer" messages for connections from the worker node "napp-worker-nodepool-a1-ws577-84fd4695dcxdcghw-vfb9m", on which both the Postgres and SA pods reside:
2024-11-13 08:39:44.891 GMT [10082] LOG: could not receive data from client: Connection reset by peer","kubernetes":{"pod_name":"postgresql-ha-postgresql-0","namespace_name":"nsxi-platform","pod_id":"11d8c8eb-6d68-499c-a90c-bdef23d23f24","host":"ns-napp-worker-nodepool-a1-ws577-84fd4695dcxdcghw-vfb9m","container_name":"postgresql","docker_id":"074bf9b8d1258cee3dd980d5d3995f5af97d6f8d297686ba55a74b9abc363296","container_hash":"projects.registry.vmware.com/nsx_application_platform/clustering/third-party/postgresql-repmgr@sha256:d4407bf09643709bf1ecc63ceb4c42cf49eba0ece823befdfd388d50b259d732","container_image":"sha256:bd7a898d40959c78d0f32f19c4f2a8781ae2f2a7d928b5ba08d9c3c47de03529"}}
{"log":"2024-11-13T08:40:10.947976661Z stdout F 2024-11-13 08:40:10.947 GMT [16160] LOG: could not receive data from client: Connection reset by peer","kubernetes":{"pod_name":"postgresql-ha-postgresql-0","namespace_name":"nsxi-platform","pod_id":"11d8c8eb-6d68-499c-a90c-bdef23d23f24","host":"ns-napp-worker-nodepool-a1-ws577-84fd4695dcxdcghw-vfb9m","container_name":"postgresql","docker_id":"074bf9b8d1258cee3dd980d5d3995f5af97d6f8d297686ba55a74b9abc363296","container_hash":"projects.registry.vmware.com/nsx_application_platform/clustering/third-party/postgresql-repmgr@sha256:d4407bf09643709bf1ecc63ceb4c42cf49eba0ece823befdfd388d50b259d732","container_image":"sha256:bd7a898d40959c78d0f32f19c4f2a8781ae2f2a7d928b5ba08d9c3c47de03529"}}
{"log":"2024-11-13T08:40:11.339627339Z stdout F 2024-11-13 08:40:11.337 GMT [13551] LOG: could not receive data from client: Connection reset by peer","kubernetes":{"pod_name":"postgresql-ha-postgresql-0","namespace_name":"nsxi-platform","pod_id":"11d8c8eb-6d68-499c-a90c-bdef23d23f24","host":"ns-napp-worker-nodepool-a1-ws577-84fd4695dcxdcghw-vfb9m","container_name":"postgresql","docker_id":"074bf9b8d1258cee3dd980d5d3995f5af97d6f8d297686ba55a74b9abc363296","container_hash":"projects.registry.vmware.com/nsx_application_platform/clustering/third-party/postgresql-repmgr@sha256:d4407bf09643709bf1ecc63ceb4c42cf49eba0ece823befdfd388d50b259d732","container_image":"sha256:bd7a898d40959c78d0f32f19c4f2a8781ae2f2a7d928b5ba08d9c3c47de03529"}}
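The same messages can also be retrieved directly from the Postgres pod with the 'logs' command; a sketch, assuming the container name "postgresql" shown in the records above:
root:~# napp-k logs postgresql-ha-postgresql-0 -c postgresql | grep 'Connection reset by peer'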
Workaround:
Remove the rescoring sync time entry from the Postgres database on the NAPP platform to reset it to the most recent time. This ensures the problematic event is bypassed in the backend. The sa-scheduler-services pod processes events at 5-minute intervals, so after the sync time is deleted, events from the past 5 minutes will be processed successfully.
Steps:
1. Access the Postgres database
a. From the NSX Manager CLI, execute:
napp-k exec -it postgresql-ha-postgresql-0 -- /bin/bash
b. Fetch the Postgres password:
echo $POSTGRES_PASSWORD
c. Launch the psql CLI and enter the password when prompted:
psql -U postgres -h localhost
2. On the psql CLI, connect to the relevant database:
\c malwareprevention
3. Execute the deletion command:
DELETE FROM sa_configurations WHERE key = 'rescoring-sync-time';
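Optionally, the stored value can be inspected before deletion (or its removal verified afterwards) with a simple query against the same table and key used above:
SELECT * FROM sa_configurations WHERE key = 'rescoring-sync-time';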
This action resets the rescoring sync time and skips the problematic event. Events are processed by the sa-scheduler-services pod every 5 minutes.
The pod restart issue should no longer be observed after applying this workaround.
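To confirm, re-run the pod status check from the Symptoms section and verify that the RESTARTS count is no longer increasing:
root:~# napp-k get pods | grep sa-scheduler-services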