TKGm - Adding multiple GPUs to a single node for LLMs that require more memory (vRAM)



Article ID: 382439


Updated On:

Products

Tanzu Kubernetes Runtime
Tanzu Kubernetes Grid
VMware Tanzu Kubernetes Grid Management

Issue/Introduction

LLMs in your environment require more GPU memory (vRAM) per worker node to run successfully.

Environment

TKGm 2.5.2

Resolution

When deploying a cluster, follow Deploy a GPU-Enabled Workload Cluster and see step 3 of the Procedure.

Multiple PCI devices can be added to VSPHERE_WORKER_PCI_DEVICES.
To add multiple PCI devices, simply list each vendor/device ID pair separated by a comma (",").

See example below:

VSPHERE_WORKER_PCI_DEVICES="0x<VENDOR_ID1>:0x<DEVICE_ID1>,0x<VENDOR_ID2>:0x<DEVICE_ID2>"
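For instance, to pass through two identical NVIDIA GPUs, the entry would look like the following (NVIDIA's PCI vendor ID is 0x10DE; the device ID 0x1EB8 here corresponds to a Tesla T4 and is illustrative only, substitute the IDs of your own GPUs):

VSPHERE_WORKER_PCI_DEVICES="0x10DE:0x1EB8,0x10DE:0x1EB8"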

The MMIO size must also be set appropriately, i.e. it must accommodate the total BAR1 memory of all the GPUs you wish to add.

pciPassthru.64bitMMIOSizeGB=<TotalGPUGiBSize>
The example below illustrates the required MMIO size in GiB for multiple GPUs.

mmio-space-in-gb
The required amount of MMIO space in GiB that you calculated previously. For example, if you are assigning four GPUs to a VM that each use a total of 128 GiB of BAR1 memory, the amount of MMIO space that is required for all the GPUs is 512 GiB.
pciPassthru.64bitMMIOSizeGB = "512"
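As a smaller worked example: two GPUs that each expose 64 GiB of BAR1 memory need 2 x 64 = 128 GiB of MMIO space (the 64 GiB figure is illustrative; check the BAR1 size of your specific GPU model, for example in the "BAR1 Memory Usage" section of nvidia-smi -q output):

pciPassthru.64bitMMIOSizeGB = "128"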


An alternative for updating an existing cluster is to edit the cluster's YAML to include multiple deviceId/vendorId combinations, similar to below:

kubectl get cluster <cluster_name> -o yaml | grep -A 10 "name: pci"

- name: pci
  value:
    worker:
      devices:
      - deviceId: xxxx
        vendorId: xxxx
      - deviceId: xxxx
        vendorId: xxxx
.....
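The change can be applied directly to the live cluster object, for example:

kubectl edit cluster <cluster_name> -n <namespace>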


Similarly, the MMIO size (pciPassthru.64bitMMIOSizeGB) will also need to be updated to accommodate the additional GPUs.
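In the cluster object, this VMX key is typically carried through the worker custom VMX keys (set via VSPHERE_WORKER_CUSTOM_VMX_KEYS at deployment time, per the same Deploy a GPU-Enabled Workload Cluster documentation; verify where it appears in your own cluster's spec). A quick way to locate the current value before editing it:

kubectl get cluster <cluster_name> -o yaml | grep -i "mmio"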