Deploying an AI Kubernetes RAG Cluster through the Private AI Automation Services catalog in VMware Aria Automation fails.
The NVIDIA Inference Microservice (NIM) pod fails to start with the following error:
"RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported"
You see the following entries in the pod logs:
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024051400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 16384
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 498, in <module>
    engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 412, in from_engine_args
    engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 290, in from_engine_args
    return cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 267, in __init__
    self.engine: _AsyncTRTLLMEngine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 124, in __init__
    self._tllm_engine = TrtllmModelRunner(
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/trtllm_model_runner.py", line 280, in __init__
    self._tllm_exec, self._cfg = self._create_engine(
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/trtllm_model_runner.py", line 569, in _create_engine
    return create_trt_executor(
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/trtllm/utils.py", line 224, in create_trt_executor
    trtllm_exec = trtllm.Executor(
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported (/home/jenkins/agent/workspace/LLM/release-0.10/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/runtime/bufferManager.cpp:211)
1   0x7f146fffb29e void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 94
2   0x7f147192c85c tensorrt_llm::runtime::BufferManager::initMemoryPool(int) + 124
3   0x7f147192e897 tensorrt_llm::runtime::BufferManager::BufferManager(std::shared_ptr<tensorrt_llm::runtime::CudaStream>, bool) + 663
4   0x7f14719eebbf tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, float, nvinfer1::ILogger&) + 335
5   0x7f1471bf7e0a tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::executor::SchedulerConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1290
6   0x7f1471bba018 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::executor::SchedulerConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 968
7   0x7f1471c1e937 tensorrt_llm::executor::Executor::Impl::createModel(std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 711
8   0x7f1471c1f559 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2105
9   0x7f1471c15e12 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 501
10  0x7f14e6842802 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xae802) [0x7f14e6842802]
11  0x7f14e67ea53c /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x5653c) [0x7f14e67ea53c]
12  0x55b62b4b7c9e python3(+0x15ac9e) [0x55b62b4b7c9e]
13  0x55b62b4ae3cb _PyObject_MakeTpCall + 603
14  0x55b62b4c6540 python3(+0x169540) [0x55b62b4c6540]
15  0x55b62b4c2c87 python3(+0x165c87) [0x55b62b4c2c87]
16  0x55b62b4ae77b python3(+0x15177b) [0x55b62b4ae77b]
17  0x7f15523d69a7 /usr/local/lib/python3.10/dist-packages/triton/_C/libtriton.so(+0x2d59a7) [0x7f15523d69a7]
18  0x55b62b4ae3cb _PyObject_MakeTpCall + 603
19  0x55b62b4a6fab _PyEval_EvalFrameDefault + 28251
20  0x55b62b4b859c _PyFunction_Vectorcall + 124
21  0x55b62b4a0827 _PyEval_EvalFrameDefault + 1751
22  0x55b62b4b859c _PyFunction_Vectorcall + 124
23  0x55b62b4a096e _PyEval_EvalFrameDefault + 2078
24  0x55b62b4b859c _PyFunction_Vectorcall + 124
25  0x55b62b4ad60d _PyObject_FastCallDictTstate + 365
26  0x55b62b4c2705 python3(+0x165705) [0x55b62b4c2705]
27  0x55b62b4ae36c _PyObject_MakeTpCall + 508
28  0x55b62b4a763b _PyEval_EvalFrameDefault + 29931
29  0x55b62b4b859c _PyFunction_Vectorcall + 124
30  0x55b62b4ad60d _PyObject_FastCallDictTstate + 365
31  0x55b62b4c2664 python3(+0x165664) [0x55b62b4c2664]
32  0x55b62b4ae77b python3(+0x15177b) [0x55b62b4ae77b]
33  0x55b62b4c6d4b PyObject_Call + 187
34  0x55b62b4a2a9d _PyEval_EvalFrameDefault + 10573
35  0x55b62b4c6111 python3(+0x169111) [0x55b62b4c6111]
36  0x55b62b4c6db2 PyObject_Call + 290
37  0x55b62b4a2a9d _PyEval_EvalFrameDefault + 10573
38  0x55b62b4b859c _PyFunction_Vectorcall + 124
39  0x55b62b4ad60d _PyObject_FastCallDictTstate + 365
40  0x55b62b4c2664 python3(+0x165664) [0x55b62b4c2664]
41  0x55b62b4ae77b python3(+0x15177b) [0x55b62b4ae77b]
42  0x55b62b4c6d4b PyObject_Call + 187
43  0x55b62b4a2a9d _PyEval_EvalFrameDefault + 10573
44  0x55b62b4c6111 python3(+0x169111) [0x55b62b4c6111]
45  0x55b62b4a659a _PyEval_EvalFrameDefault + 25674
46  0x55b62b4c6111 python3(+0x169111) [0x55b62b4c6111]
47  0x55b62b4a1b77 _PyEval_EvalFrameDefault + 6695
48  0x55b62b49cf96 python3(+0x13ff96) [0x55b62b49cf96]
49  0x55b62b592c66 PyEval_EvalCode + 134
50  0x55b62b59881d python3(+0x23b81d) [0x55b62b59881d]
51  0x55b62b4b87f9 python3(+0x15b7f9) [0x55b62b4b87f9]
52  0x55b62b4a0827 _PyEval_EvalFrameDefault + 1751
53  0x55b62b4b859c _PyFunction_Vectorcall + 124
54  0x55b62b4a0827 _PyEval_EvalFrameDefault + 1751
55  0x55b62b4b859c _PyFunction_Vectorcall + 124
56  0x55b62b5b061d python3(+0x25361d) [0x55b62b5b061d]
57  0x55b62b5af2c8 Py_RunMain + 296
58  0x55b62b585a3d Py_BytesMain + 45
59  0x7f162f6ccd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f162f6ccd90]
60  0x7f162f6cce40 __libc_start_main + 128
61  0x55b62b585935 _start + 37
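If needed, the failing pod and the log above can be retrieved with kubectl. A minimal example, using placeholder names (replace <nim-namespace> and <nim-pod> with the values from your cluster):

kubectl get pods -A | grep -i nim
kubectl -n <nim-namespace> describe pod <nim-pod>
kubectl -n <nim-namespace> logs <nim-pod> --previous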
VMware Private AI Foundation
This issue is specific to the TRT-LLM inference engine, which requires Unified Memory (UVM) to be enabled on the vGPU-backed VM. Without it, the CUDA runtime cannot create the default memory pool that TensorRT-LLM requests at startup.
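As an optional check, you can verify whether the vGPU currently exposes CUDA memory-pool support, which is the capability that cudaDeviceGetDefaultMemPool relies on. This is a sketch only: it assumes a CUDA-capable pod running on the same GPU worker node with the cuda-python package installed, and <nim-namespace>/<cuda-pod> are placeholders:

kubectl -n <nim-namespace> exec <cuda-pod> -- python3 -c "from cuda import cudart; err, ok = cudart.cudaDeviceGetAttribute(cudart.cudaDeviceAttr.cudaDevAttrMemoryPoolsSupported, 0); print('memory pools supported:', bool(ok))"

If this prints False, the GPU is running without Unified Memory support, which matches the error above.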
To resolve the issue, set the following parameter on the VM class that backs the GPU-enabled worker nodes:

pciPassthru0.cfg.enable_uvm = 1

# dcli +i +show-unreleased
# update vmclass guaranteed-xlarge-v100-4c as an example
dcli> com vmware vcenter namespacemanagement virtualmachineclasses update --vm-class guaranteed-xlarge-v100-4c --config-spec '{"_typeName":"VirtualMachineConfigSpec","extraConfig":[{"_typeName":"OptionValue","key":"pciPassthru0.cfg.enable_uvm","value":{"_typeName":"string","_value":"1"}}],"deviceChange":[{"_typeName":"VirtualDeviceConfigSpec","device":{"_typeName":"VirtualPCIPassthrough","key":0,"backing":{"_typeName":"VirtualPCIPassthroughVmiopBackingInfo","vgpu":"grid_v100-4c"}},"operation":"add"}]}'
Verify the change:
dcli> com vmware vcenter namespacemanagement virtualmachineclasses get --vm-class guaranteed-xlarge-v100-4c
devices:
    dynamic_direct_path_IO_devices:
    vgpu_devices:
        - profile_name: grid_v100-4c
instance_storage:
description:
config_status: READY
cpu_count: 4
cpu_reservation: 100
config_spec:
    extraConfig:
        - _typeName: OptionValue
          value:
              _typeName: string
              _value: 1
          key: pciPassthru0.cfg.enable_uvm
    _typeName: VirtualMachineConfigSpec
    deviceChange:
        - _typeName: VirtualDeviceConfigSpec
          device:
              backing:
                  vgpu: grid_v100-4c
                  _typeName: VirtualPCIPassthroughVmiopBackingInfo
              _typeName: VirtualPCIPassthrough
              key: 0
          operation: add
For a Workstation VM, add the following parameter to the .vmx file and then power on the VM:

pciPassthru0.cfg.enable_uvm = 1
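Note that changes to a VM class generally apply only to VMs deployed after the change, so the GPU worker nodes of an existing cluster may need to be redeployed before the setting takes effect. Once it is in place, a quick way to confirm the fix, again with placeholder names:

kubectl -n <nim-namespace> delete pod <nim-pod>
kubectl -n <nim-namespace> get pods -w
kubectl -n <nim-namespace> logs <nim-pod>

The TensorRT-LLM engine should now initialize without the cudaDeviceGetDefaultMemPool error.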