When running large models or using large batch sizes, ML frameworks like TensorFlow can report out-of-memory (OOM) errors. For example:
W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:211] Ran out of memory trying to allocate 877.38MiB. See logs for memory state
W tensorflow/core/kernels/cwise_ops_common.cc:56] Resource exhausted: OOM when allocating tensor with shape[10000,23000]
OOM conditions in the ML framework can occur when the batch size or model size exceeds the amount of available GPU memory on the Bitfusion server. Bitfusion itself consumes some GPU memory for its own operations, so somewhat less GPU memory is available to applications than the total amount installed.
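You can estimate whether a tensor will fit before running. As a sketch, the 877.38MiB figure in the log above is exactly the footprint of the shape[10000,23000] tensor at 4 bytes per float32 element (the helper function below is illustrative, not part of any framework):

```python
def tensor_mib(shape, bytes_per_element=4):
    """Estimate a tensor's memory footprint in MiB (float32 by default)."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element / (1024 ** 2)

# The tensor from the log: 10000 * 23000 * 4 bytes
print(round(tensor_mib([10000, 23000]), 2))  # 877.38
```

Summing such estimates for your model's weights, activations, and batch-sized input tensors, and comparing against the GPU memory the Bitfusion server actually exposes, indicates whether an OOM is likely.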
If you routinely encounter OOM conditions, there are several possible resolutions, depending on your configuration: