GPU temperature metrics show incorrect values at cluster level in vCenter Server

Article ID: 406640

Products

VMware vCenter Server

Issue/Introduction

  • When viewing GPU performance charts in VMware vCenter Server, cluster-level GPU temperature metrics show incorrect values. The charts display sharp temperature spikes to 100°C or higher, and the spikes appear consistently at the end of the chart period.
  • Host-level GPU temperature readings for the same time range show normal values, typically around 30-40°C. Cluster-level temperature charts appear in the interface even though GPU temperature metrics are designed for host-level monitoring only.
  • The misleading cluster-level data can trigger false temperature alarms, prompt investigation of hardware issues that do not exist, and undermine trust in cluster-level GPU monitoring.

Environment

VMware vCenter Server 8.0 Update 2b and newer, managing ESXi hosts with graphics processing unit (GPU) hardware in clustered configurations

Cause

The cluster-level GPU temperature chart shows incorrect values because no aggregation logic was implemented for temperature metrics. vCenter Server 8.0 Update 2b introduced cluster-level aggregation for GPU memory and utilization metrics only. Temperature metrics were not included in this implementation and remain host-level only. Cluster-level temperature charts nevertheless appear in the interface and produce readings that do not reflect actual hardware temperatures.
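
To illustrate why temperature needs its own aggregation logic, the following minimal Python sketch compares a summation-style rollup with per-host values. It is an illustration only: this article does not state how vCenter Server computes the incorrect cluster-level value, and the host names, sample temperatures, and function names are hypothetical.

```python
from statistics import mean

# Hypothetical per-host GPU temperatures in degrees Celsius (illustrative values only).
host_temps_c = {"esx-01": 34.0, "esx-02": 36.0, "esx-03": 38.0}


def summed_rollup(temps):
    """Summation-style rollup: sensible for memory used, meaningless for temperature."""
    return sum(temps.values())


def hottest_host(temps):
    """Return the (host, temperature) pair with the highest reading."""
    host = max(temps, key=temps.get)
    return host, temps[host]


print(summed_rollup(host_temps_c))   # 108.0, although no single GPU is above 38.0
print(mean(host_temps_c.values()))   # 36.0
print(hottest_host(host_temps_c))    # ('esx-03', 38.0)
```

Whatever the actual cause, a cluster-wide temperature figure is only meaningful if it is derived per host, for example as a maximum or an average, which is why the host-level charts remain the reliable source.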

Resolution

Workaround

  1. Log in to the vSphere Client.
  2. Navigate to an ESXi host in the affected cluster.
  3. Click Monitor in the host navigation menu.
  4. Select Performance from the Monitor options.
  5. Click Advanced to access detailed performance charts.
  6. In the Chart Options, select GPU Temperature from the available metrics.
  7. Configure the desired time range for temperature monitoring.
  8. Review the accurate host-level GPU temperature readings.
  9. Repeat these steps for each ESXi host in the cluster to monitor individual GPU temperatures.
  10. Configure host-level alerting for GPU temperature thresholds if automated monitoring is required.
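
If the host-level readings collected in the steps above are exported for review, a per-host threshold check could look like the following Python sketch. It does not call any vCenter Server API. The function name, host names, sample values, and the 85°C threshold are hypothetical and should be replaced with values appropriate to your GPU hardware.

```python
from typing import Dict, List


def hosts_over_threshold(readings_c: Dict[str, List[float]],
                         threshold_c: float) -> Dict[str, float]:
    """Return each host whose peak GPU temperature exceeds threshold_c, mapped to that peak."""
    over = {}
    for host, samples in readings_c.items():
        if not samples:
            continue  # no data for this host, so there is nothing to evaluate
        peak = max(samples)
        if peak > threshold_c:
            over[host] = peak
    return over


# Hypothetical host-level samples in degrees Celsius, one list per ESXi host.
readings = {
    "esx-01": [33.0, 35.5, 34.0],
    "esx-02": [36.0, 88.0, 41.0],
    "esx-03": [],
}

print(hosts_over_threshold(readings, threshold_c=85.0))  # {'esx-02': 88.0}
```

Evaluating each host separately avoids relying on the cluster-level chart, which is the source of the false readings described in this article.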

Additional Information