Understanding and Responding to Out-of-Memory Errors in Bitfusion 2.0

Article ID: 336867

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

When running large models or training with large batch sizes, ML frameworks such as TensorFlow can report out-of-memory (OOM) errors. For example:
W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:211] Ran out of memory trying to allocate 877.38MiB.  See logs for memory state
W tensorflow/core/kernels/cwise_ops_common.cc:56] Resource exhausted: OOM when allocating tensor with shape[10000,23000]
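
The two messages are consistent with each other: a tensor with shape [10000, 23000] stored as 4-byte float32 values (TensorFlow's default dtype) requires 10000 × 23000 × 4 bytes = 920,000,000 bytes ≈ 877.38 MiB, which is exactly the allocation that failed.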


Environment

VMware vSphere Bitfusion 2.x

Cause

OOM conditions in the ML framework can occur when the batch size or model size exceeds the amount of GPU memory available on the Bitfusion server. Bitfusion itself consumes some GPU memory for its own operations, so somewhat less GPU memory is available to the framework than the total amount installed. You can check how much GPU memory is actually free before starting a job, as shown below.
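
One way to inspect free GPU memory is to query the GPUs directly, for example with nvidia-smi or with the NVML Python bindings. The following is a minimal sketch, assuming the pynvml package is installed on a host where the GPUs are visible (e.g., the Bitfusion server):

  import pynvml

  pynvml.nvmlInit()
  try:
      for i in range(pynvml.nvmlDeviceGetCount()):
          handle = pynvml.nvmlDeviceGetHandleByIndex(i)
          mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
          # Free memory is what the framework can actually allocate; it is
          # smaller than the installed total because Bitfusion and the driver
          # reserve some memory for their own operations.
          print(f"GPU {i}: total={mem.total >> 20} MiB, "
                f"used={mem.used >> 20} MiB, free={mem.free >> 20} MiB")
  finally:
      pynvml.nvmlShutdown()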

Resolution

If you routinely encounter OOM conditions, there are several possible resolutions, depending on your configuration:

  • Reduce the batch size so the model's tensors fit in the available GPU memory
  • Switch to GPU models with more memory
  • Increase parallelism by splitting the work into smaller chunks spread across more GPUs (see the sketch after this list)
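
The first and third options can usually be applied in the training script itself. Below is a minimal sketch, assuming TensorFlow 2.x with the Keras API; the model, batch size, and dataset are illustrative placeholders, not values from this article. tf.distribute.MirroredStrategy splits each global batch across all visible GPUs, so requesting more GPUs through Bitfusion lowers the per-GPU memory footprint:

  import tensorflow as tf

  BATCH_SIZE = 32  # reduce this first if OOM errors persist

  # MirroredStrategy divides each batch across every GPU visible to the
  # Bitfusion client, so per-GPU memory use shrinks as GPUs are added.
  strategy = tf.distribute.MirroredStrategy()

  with strategy.scope():
      model = tf.keras.Sequential([
          tf.keras.layers.Dense(128, activation="relu", input_shape=(23000,)),
          tf.keras.layers.Dense(10),
      ])
      model.compile(
          optimizer="adam",
          loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      )

  # train_dataset is assumed to already exist as a tf.data.Dataset of
  # (features, label) pairs; rebatching it is where the batch size is reduced.
  # model.fit(train_dataset.batch(BATCH_SIZE), epochs=5)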