An application performance degradation event impacting critical Guest OS services (including Aerospike, ckey-mariaDB, and core application engines), which caused dropped success rates across communication paths.
Application-layer monitoring dashboards (Grafana) recorded high write latency spikes on multiple nodes.
Infrastructure-level telemetry does not reflect any corresponding storage latency spikes or backend congestion during the reported event timeframe.
Environment
ESXi: 7.0.3 EP13
VCF: 4.5.2
TCP: 2.7
Cause
Discrepancy in metric aggregation where application-layer tracking calculates cumulative processing time (including internal application processing threads, Guest OS kernel queuing, and scheduling delays) rather than the actual physical storage subsystem I/O round-trip time.
Resolution
Validate infrastructure-layer storage health by reviewing vSphere Performance Charts or VMware Aria Operations metrics for the specific timestamp of the reported latency event.
Confirm that backend vSAN write latencies remain within normal operational baselines (typically <7ms).
Analyze Guest OS internal statistics to verify if disk I/O wait thresholds inside the virtual machine nodes exceeded the expected baseline
Review the configuration and metric gathering mechanism of the application-level monitoring tool to verify how latency is computed at the user space tier versus the kernel level.
If infrastructure performance data confirms storage latency remained within nominal limits engage the application vendor to isolate Guest OS thread scheduling anomalies.