Vol VMs are failing to migrate with error: "A general System Error Occurred" due to Storage Policy Service (SPS) instability.
Log evidence:
In Envoy-access logs, we observed occurrences of http2.remote_reset and stream_idle_timeout errors.
----------------
Envoy-access logs
-----------------
envoy-access.log: YYYY-MM-DDT16:56:30.237Z info envoy[2754] [Originator@6876 sub=Default] YYYY-MM-DDT16:26:28.081Z >POST /sms/sdk 0 http2.remote_reset - 450 0 - 1800004 - - 127.0.0.1:51516 HTTP/2 - 127.0.0.1:1080 127.0.0.1:43902 HTTP/1.1 - 127.0.0.1:22000 - "QuerySessionManager"
envoy-access.log: YYYY-MM-DDT16:56:44.261Z info envoy[2754] [Originator@6876 sub=Default] YYYY-MM-DDT16:56:38.121Z POST /sms/sdk 500 via_upstream - 1050 625 - 2 1 0 127.0.0.1:51516 HTTP/2 - 127.0.0.1:1080 127.0.0.1:33856 HTTP/1.1 - 127.0.0.1:22000 - "SelectCapabilities"
envoy-access.log: YYYY-MM-DDT16:57:13.790Z info envoy[2754] [Originator@6876 sub=Default] YYYY-MM-DDT16:57:08.159Z GET /sms/sdk/service 404 via_upstream - 0 745 - 1 0 0 127.0.0.1:51516 HTTP/2 - 127.0.0.1:1080 127.0.0.1:59986 HTTP/1.1 - 127.0.0.1:22000 - -
envoy-access.log: YYYY-MM-DDT17:26:44.402Z info envoy[2754] [Originator@6876 sub=Default] YYYY-MM-DDT16:56:38.159Z POST /sms/sdk 0 http2.remote_reset - 450 0 - 1800004 - - 127.0.0.1:51516 HTTP/2 - 127.0.0.1:1080 127.0.0.1:33872 HTTP/1.1 - 127.0.0.1:22000 - "QuerySessionManager"
envoy-access.log: YYYY-MM-DDT17:26:55.295Z info envoy[2754] [Originator@6876 sub=Default] YYYY-MM-DDT17:26:48.190Z POST /sms/sdk 500 via_upstream - 1050 625 - 1 1 0 127.0.0.1:51516 HTTP/2 - 127.0.0.1:1080 127.0.0.1:41374 HTTP/1.1 - 127.0.0.1:22000 - "SelectCapabilities"
envoy-access.log: YYYY-MM-DDT17:27:25.308Z info envoy[2754] [Originator@6876 sub=Default] YYYY-MM-DDT17:27:18.232Z GET /sms/sdk/service 404 via_upstream - 0 745 - 2 1 0 127.0.0.1:51516 HTTP/2 - 127.0.0.1:1080 127.0.0.1:57986 HTTP/1.1 - 127.0.0.1:22000 - -
envoy-access.log: YYYY-MM-DDT20:57:50.857Z info envoy[2754] [Originator@6876 sub=Default] YYYY-MM-DDT20:27:48.763Z POST /sms/sdk 0 http2.remote_reset - 450 0 - 1800004 - - 127.0.0.1:51516 HTTP/2 - 127.0.0.1:1080 127.0.0.1:56340 HTTP/1.1 - 127.0.0.1:22000 - "QuerySessionManager"
envoy-access.log: YYYY-MM-DDT20:57:59.775Z info envoy[2754] [Originator@6876 sub=Default] YYYY-MM-DDT20:57:58.807Z POST /sms/sdk 500 via_upstream - 1050 625 - 1 1 0 127.0.0.1:51516 HTTP/2 - 127.0.0.1:1080 127.0.0.1:56052 HTTP/1.1 - 127.0.0.1:22000 - "SelectCapabilities"
envoy-access.log: YYYY-MM-DDT20:58:29.743Z info envoy[2754] [Originator@6876 sub=Default] YYYY-MM-DDT20:58:28.846Z GET /sms/sdk/service 404 via_upstream - 0 745 - 1 1 0 127.0.0.1:51516 HTTP/2 - 127.0.0.1:1080 127.0.0.1:36460 HTTP/1.1 - 127.0.0.1:22000 - -
envoy-access.log: YYYY-MM-DDT21:27:59.716Z info envoy[2754] [Originator@6876 sub=Default] YYYY-MM-DDT20:57:58.845Z POST /sms/sdk 0 http2.remote_reset - 450 0 - 1800005 - - 127.0.0.1:51516 HTTP/2 - 127.0.0.1:1080 127.0.0.1:56056 HTTP/1.1 - 127.0.0.1:22000 - "QuerySessionManager"
envoy-access.log: YYYY-MM-DDT21:28:11.069Z info envoy[2754] [Originator@6876 sub=Default] YYYY-MM-DDT21:28:08.887Z POST /sms/sdk 500 via_upstream - 1050 625 - 1 1 0 127.0.0.1:51516 HTTP/2 - 127.0.0.1:1080 127.0.0.1:34576 HTTP/1.1 - 127.0.0.1:22000 - "SelectCapabilities"
envoy-access.log: YYYY-MM-DDT21:28:40.204Z info envoy[2754] [Originator@6876 sub=Default] YYYY-MM-DDT21:28:38.928Z GET /sms/sdk/service 404 via_upstream - 0 745 - 2 1 0 127.0.0.1:51516 HTTP/2 - 127.0.0.1:1080 127.0.0.1:57946 HTTP/1.1 - 127.0.0.1:22000 - -
envoy-access.log: YYYY-MM-DDT21:52:22.028Z info envoy[2762] [Originator@6876 sub=Default] YYYY-MM-DDT13:52:19.825Z POST /sms/sdk 504 stream_idle_timeout SI 450 14 - 28799999 - - #.#.#.#:47280 HTTP/1.1 TLSv1.2 #.#.#.#:443 127.0.0.1:47238 HTTP/1.1 - 127.0.0.1:22000 - "QueryStorageManager"
The SMS service received the last QueryProviderVersionForUpgrade call. However, this call did not completed, as we do not see the corresponding ending timer set.
--------
SPS logs
---------
sps.log: YYYY-MM-DDT02:53:46.360-04:00 [pool-36-thread-5] INFO opId=q-858238:h5ui-getProperties:urn:vmomi:Folder:g######:#####-46c5-4674-86c4-afb4166264b5:1871466536:SmsPropertyProvider:3457197-7duj9-h5:70770818 com.vmware.vim.sms.provider.vasa.VersionHandler - Upgrade of https://#.#.#.#:8084/vasa is not possible
opId.log: [YYYY-MM-DD 03:13:45.327] [ERROR] data-service-pool-773 70770818 131087 202159 Invocation of 'void com.vmware.vim.binding.sms.StorageManager.queryProviderVersionForUpgrade(java.lang.String[],com.vmware.vim.vmomi.core.Future)' to http://localhost:1080/external-vecs/http1/###########/443/sms/sdk for opId 'q-858238:h5ui-getProperties:urn:vmomi:Folder:######:#####-46c5-4674-86c4-afb4166264b5:1871466536:SmsPropertyProvider:3457197-7duj9-h5:70770818' failed in ? ms, error: com.vmware.vim.vmomi.client.exception.ConnectionException: http://localhost:1080/external-vecs/http1/###########/443/sms/sdk invocation failed with "java.net.SocketTimeoutException: Read timed out"
Since, SMS endpoint is not working properly, Vpxd gets below error during vMotion of VMs:
--------
VPXD logs
---------
vpxd.log: YYYY-MM-DDT23:30:00.042-04:00 warning vpxd[10823] [Originator@6876 sub=VmProv opID=WorkQueue-4ddcf94b-214ab08c-01-01] Received exception from SMS: N7Vmacore21InvalidStateExceptionE(Invalid state)
vCenter 8.x
The SPS service became unresponsive due to blocked or incomplete callback operations with VASA providers.
These failures cascaded through Envoy as HTTP/2 resets and idle timeouts, preventing vCenter from completing SMS storage policy queries during vMotion.
Workaround:
Post this, collect the log bundles and reach out to Broadcom Technical Support for further investigation.