LLMs in your environment require more memory per worker node in order to perform successfully.
TKGm 2.5.2
When deploying a cluster, follow Deploy a GPU-Enabled Workload Cluster; see step 3 of the Procedure.
Multiple PCI devices can be added to the VSPHERE_WORKER_PCI_DEVICES variable.
To add multiple PCI devices, simply list each vendor ID and device ID pair separated by a comma (,).
See the example below:
VSPHERE_WORKER_PCI_DEVICES="0x<VENDOR_ID1>:0x<DEVICE_ID1>,0x<VENDOR_ID2>:0x<DEVICE_ID2>"
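If you need to discover the vendor and device IDs, one hedged option is to run lspci on a Linux host where the GPUs are visible; with the -nn flag it prints the IDs in [vendor:device] form:
lspci -nn | grep -i nvidia
NVIDIA's PCI vendor ID is 0x10DE; the device ID varies by GPU model.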
The GPU MMIO size must also be set appropriately, i.e. it must accommodate the total number of GPUs you wish to add.
pciPassthru.64bitMMIOSizeGB=<TotalGPUGiBSize>
The example below explains the MMIO space in GiB required for multiple GPUs.
mmio-space-in-gb: the required amount of MMIO space in GiB that you calculated previously. For example, if you are assigning four GPUs to a VM and each uses a total of 128 GiB of BAR1 memory, the amount of MMIO space required for all the GPUs is 4 x 128 = 512 GiB.
pciPassthru.64bitMMIOSizeGB = "512"
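If you are unsure of a GPU's BAR1 size, a hedged way to check is to query a host that already has the NVIDIA driver loaded; the output layout may vary by driver version:
nvidia-smi -q -d MEMORY | grep -A 2 "BAR1 Memory Usage"
Multiply the per-GPU BAR1 total by the number of GPUs being passed through to arrive at the value for pciPassthru.64bitMMIOSizeGB.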
An alternative when updating an existing cluster is to edit the cluster's YAML to include multiple deviceId/vendorId combinations, similar to the below:
kubectl get cluster <cluster_name> -o yaml | grep -A 10 "name: pci"
- name: pci
  value:
    worker:
      devices:
      - deviceId: xxxx
        vendorId: xxxx
      - deviceId: xxxx
        vendorId: xxxx
      .....
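One way to apply this edit to a live cluster object, assuming your cluster class permits in-place changes to the pci variable, is to open it in an editor:
kubectl edit cluster <cluster_name>
Add the additional deviceId/vendorId pairs under the pci variable as shown above; the change should trigger a rolling update of the worker machines.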
Similarly, the GPU MMIO size (pciPassthru.64bitMMIOSizeGB) will also need to be updated to account for the additional GPUs.
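For example, in the cluster configuration file the MMIO size is typically carried through the worker custom VMX keys. The sketch below assumes the VSPHERE_WORKER_CUSTOM_VMX_KEYS variable described in the TKG GPU deployment documentation; verify the exact keys and the size value against your TKG version and your GPUs' BAR1 sizes:
VSPHERE_WORKER_CUSTOM_VMX_KEYS='pciPassthru.use64bitMMIO=true,pciPassthru.64bitMMIOSizeGB=512'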