Understanding VMware vSphere Bitfusion 2.x Health Checks

Article ID: 336868

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

When you work with VMware vSphere Bitfusion through the main vCenter UI or the Bitfusion CLI, you have access to health checks that help determine the status of the Bitfusion infrastructure. This article describes some common health checks that you may see.

For more information, see the VMware vSphere Bitfusion Documentation.

Environment

VMware vSphere Bitfusion 2.x

Resolution

Common health checks are:
  • Shadow Memory Check
    • Determines whether there is adequate system memory to accommodate the GPU memory managed by the Bitfusion server. Ideally, there should be at least as much system memory as there is combined GPU memory managed by Bitfusion. If this check fails, increase the available system memory so that it is somewhat greater than the total GPU memory under management.
  • MTU Size Check
    • Bitfusion recommends a 4,096-byte MTU for networking used by Bitfusion. This check validates that Bitfusion servers can successfully handle large MTUs. If this check fails, verify that the interface MTU is correctly configured on the Bitfusion server and that all intermediate switches are configured to pass large frames.
  • Network Error Check
    • Bitfusion tracks networking artifacts such as ongoing error statistics and dropped packets. If this health check is flagged as a problem, refer to the Bitfusion Documentation for specific steps to examine the Linux networking statistics in the appliance OS. If errors are present but not incrementing, the issue was transient, and the counters can be cleared by rebooting the appliance during a maintenance window. If errors are present and incrementing, there is an ongoing issue. Verify that the underlying networking (ESXi host to physical infrastructure) is healthy and free of ongoing error conditions such as packets dropped by the physical network interfaces, incrementing CRC error counts, or other deleterious networking artifacts.
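
The reasoning behind the three checks above can be sketched in a few lines of Python. This is only an illustration of the logic described in this article, not Bitfusion's own implementation: the function names, the counter names passed in the dictionaries, and the "clean" / "transient" / "ongoing" labels are all hypothetical, and the header sizes assume plain IPv4 with ICMP.

```python
from typing import Mapping, Sequence

GIB = 1024 ** 3


def shadow_memory_ok(system_mem_bytes: int, gpu_mem_bytes: Sequence[int]) -> bool:
    """Return True when system memory is at least the combined GPU memory.

    Hypothetical helper: mirrors the rule of thumb in the Shadow Memory Check.
    """
    return system_mem_bytes >= sum(gpu_mem_bytes)


def max_ping_payload(mtu: int, ip_header: int = 20, icmp_header: int = 8) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet.

    Assumes a 20-byte IPv4 header (no options) and an 8-byte ICMP header.
    """
    return mtu - ip_header - icmp_header


def classify_errors(before: Mapping[str, int], after: Mapping[str, int]) -> str:
    """Compare two snapshots of interface error counters taken some time apart.

    Returns "clean" if no errors are recorded, "transient" if errors exist
    but did not grow between snapshots, and "ongoing" if any counter grew.
    """
    if not any(after.values()):
        return "clean"
    growing = any(count > before.get(name, 0) for name, count in after.items())
    return "ongoing" if growing else "transient"


if __name__ == "__main__":
    # Four 16 GiB GPUs under management need at least 64 GiB of system memory.
    print(shadow_memory_ok(64 * GIB, [16 * GIB] * 4))
    # Payload size to use when testing a 4,096-byte MTU path end to end.
    print(max_ping_payload(4096))
    # Example counter snapshots (illustrative values only).
    print(classify_errors({"rx_errors": 5}, {"rx_errors": 9}))
```

As a usage note, the payload size returned for a 4,096-byte MTU (4,068 bytes) is the value one would typically pass to a Linux `ping -M do -s <size>` test, which forbids fragmentation and so fails if any hop cannot pass the large frame. The counter snapshots could be taken from the Linux interface statistics that the Bitfusion Documentation describes how to examine.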