Deploying an AI Kubernetes RAG Cluster through the Private AI Automation Services in VMware Aria Automation Catalog fails.
The NVIDIA Inference Microservice (NIM) pod fails to start with the following error: "RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported"
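The same error can be confirmed by inspecting the NIM pod directly on the workload cluster; as a rough sketch (the pod name and namespace below are placeholders, not values from this article):
kubectl get pods -A | grep -i nim
kubectl logs <nim-pod-name> -n <namespace>
# if the container is restarting, check the previous attempt's output
kubectl logs <nim-pod-name> -n <namespace> --previous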
You see the following events in the pod logs:
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024051400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 16384
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 498, in <module>
engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 412, in from_engine_args
engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 290, in from_engine_args
return cls(
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 267, in __init__
self.engine: _AsyncTRTLLMEngine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 124, in __init__
self._tllm_engine = TrtllmModelRunner(
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/trtllm_model_runner.py", line 280, in __init__
self._tllm_exec, self._cfg = self._create_engine(
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/trtllm_model_runner.py", line 569, in _create_engine
return create_trt_executor(
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/trtllm/utils.py", line 224, in create_trt_executor
trtllm_exec = trtllm.Executor(
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported (/home/jenkins/agent/workspace/LLM/release-0.10/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/runtime/bufferManager.cpp:211)
1 0x7f146fffb29e void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 94
2 0x7f147192c85c tensorrt_llm::runtime::BufferManager::initMemoryPool(int) + 124
3 0x7f147192e897 tensorrt_llm::runtime::BufferManager::BufferManager(std::shared_ptr<tensorrt_llm::runtime::CudaStream>, bool) + 663
4 0x7f14719eebbf tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, float, nvinfer1::ILogger&) + 335
5 0x7f1471bf7e0a tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::executor::SchedulerConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1290
6 0x7f1471bba018 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::executor::SchedulerConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 968
7 0x7f1471c1e937 tensorrt_llm::executor::Executor::Impl::createModel(std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 711
8 0x7f1471c1f559 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2105
9 0x7f1471c15e12 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 50
10 0x7f14e6842802 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xae802) [0x7f14e6842802]
11 0x7f14e67ea53c /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x5653c) [0x7f14e67ea53c]
12 0x55b62b4b7c9e python3(+0x15ac9e) [0x55b62b4b7c9e]
13 0x55b62b4ae3cb _PyObject_MakeTpCall + 603
14 0x55b62b4c6540 python3(+0x169540) [0x55b62b4c6540]
15 0x55b62b4c2c87 python3(+0x165c87) [0x55b62b4c2c87]
16 0x55b62b4ae77b python3(+0x15177b) [0x55b62b4ae77b]
17 0x7f15523d69a7 /usr/local/lib/python3.10/dist-packages/triton/_C/libtriton.so(+0x2d59a7) [0x7f15523d69a7]
18 0x55b62b4ae3cb _PyObject_MakeTpCall + 603
19 0x55b62b4a6fab _PyEval_EvalFrameDefault + 28251
20 0x55b62b4b859c _PyFunction_Vectorcall + 124
21 0x55b62b4a0827 _PyEval_EvalFrameDefault + 1751
22 0x55b62b4b859c _PyFunction_Vectorcall + 124
23 0x55b62b4a096e _PyEval_EvalFrameDefault + 2078
24 0x55b62b4b859c _PyFunction_Vectorcall + 124
25 0x55b62b4ad60d _PyObject_FastCallDictTstate + 365
26 0x55b62b4c2705 python3(+0x165705) [0x55b62b4c2705]
27 0x55b62b4ae36c _PyObject_MakeTpCall + 508
28 0x55b62b4a763b _PyEval_EvalFrameDefault + 29931
29 0x55b62b4b859c _PyFunction_Vectorcall + 124
30 0x55b62b4ad60d _PyObject_FastCallDictTstate + 365
31 0x55b62b4c2664 python3(+0x165664) [0x55b62b4c2664]
32 0x55b62b4ae77b python3(+0x15177b) [0x55b62b4ae77b]
33 0x55b62b4c6d4b PyObject_Call + 187
34 0x55b62b4a2a9d _PyEval_EvalFrameDefault + 10573
35 0x55b62b4c6111 python3(+0x169111) [0x55b62b4c6111]
36 0x55b62b4c6db2 PyObject_Call + 290
37 0x55b62b4a2a9d _PyEval_EvalFrameDefault + 10573
38 0x55b62b4b859c _PyFunction_Vectorcall + 124
39 0x55b62b4ad60d _PyObject_FastCallDictTstate + 365
40 0x55b62b4c2664 python3(+0x165664) [0x55b62b4c2664]
41 0x55b62b4ae77b python3(+0x15177b) [0x55b62b4ae77b]
42 0x55b62b4c6d4b PyObject_Call + 187
43 0x55b62b4a2a9d _PyEval_EvalFrameDefault + 10573
44 0x55b62b4c6111 python3(+0x169111) [0x55b62b4c6111]
45 0x55b62b4a659a _PyEval_EvalFrameDefault + 25674
46 0x55b62b4c6111 python3(+0x169111) [0x55b62b4c6111]
47 0x55b62b4a1b77 _PyEval_EvalFrameDefault + 6695
48 0x55b62b49cf96 python3(+0x13ff96) [0x55b62b49cf96]
49 0x55b62b592c66 PyEval_EvalCode + 134
50 0x55b62b59881d python3(+0x23b81d) [0x55b62b59881d]
51 0x55b62b4b87f9 python3(+0x15b7f9) [0x55b62b4b87f9]
52 0x55b62b4a0827 _PyEval_EvalFrameDefault + 1751
53 0x55b62b4b859c _PyFunction_Vectorcall + 124
54 0x55b62b4a0827 _PyEval_EvalFrameDefault + 1751
55 0x55b62b4b859c _PyFunction_Vectorcall + 124
56 0x55b62b5b061d python3(+0x25361d) [0x55b62b5b061d]
57 0x55b62b5af2c8 Py_RunMain + 296
58 0x55b62b585a3d Py_BytesMain + 45
59 0x7f162f6ccd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f162f6ccd90]
60 0x7f162f6cce40 __libc_start_main + 128
61 0x55b62b585935 _start + 37
This issue applies to VMware Private AI Foundation.
The issue is specific to the TRT-LLM inference engine, which requires Unified Memory to be enabled on the vGPU device of the GPU worker node VM.
To resolve the issue, enable Unified Memory by adding the following advanced configuration parameter to the VM class (or VM) that backs the GPU worker node:
pciPassthru0.cfg.enable_uvm = 1
# dcli +i +show-unreleased
# update the VM class; guaranteed-xlarge-v100-4c and the grid_v100-4c vGPU profile are used as an example
dcli> com vmware vcenter namespacemanagement virtualmachineclasses update --vm-class guaranteed-xlarge-v100-4c --config-spec '{"_typeName":"VirtualMachineConfigSpec","extraConfig":[{"_typeName":"OptionValue","key":"pciPassthru0.cfg.enable_uvm","value":{"_typeName":"string","_value":"1"}}],"deviceChange":[{"_typeName":"VirtualDeviceConfigSpec","device":{"_typeName":"VirtualPCIPassthrough","key":0,"backing":{"_typeName":"VirtualPCIPassthroughVmiopBackingInfo","vgpu":"grid_v100-4c"}},"operation":"add"}]}'
dcli> com vmware vcenter namespacemanagement virtualmachineclasses get --vm-class guaranteed-xlarge-v100-4c
devices:
    dynamic_direct_path_IO_devices:
    vgpu_devices:
        - profile_name: grid_v100-4c
instance_storage:
description:
config_status: READY
cpu_count: 4
cpu_reservation: 100
config_spec:
    extraConfig:
        - _typeName: OptionValue
          value:
              _typeName: string
              _value: 1
          key: pciPassthru0.cfg.enable_uvm
    _typeName: VirtualMachineConfigSpec
    deviceChange:
        - _typeName: VirtualDeviceConfigSpec
          device:
              backing:
                  vgpu: grid_v100-4c
                  _typeName: VirtualPCIPassthroughVmiopBackingInfo
              _typeName: VirtualPCIPassthrough
              key: 0
          operation: add
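With the extraConfig entry now present in the class definition, Unified Memory is enabled for worker node VMs deployed from this VM class; the change typically applies only to VMs created after the update, so redeploy the RAG cluster (or re-create the GPU worker nodes) and verify that the NIM pod starts, for example (namespace is a placeholder):
kubectl get pods -n <namespace> -w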
For a Workstation VM, add the following parameter to the .vmx file and then power on the VM:
pciPassthru0.cfg.enable_uvm = 1
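As a rough sketch, assuming the VM has an NVIDIA vGPU device at index 0 (the grid_v100-4c profile is only an example), the relevant lines in the .vmx file would then look similar to the following; values are written as quoted strings when edited directly in the .vmx file:
pciPassthru0.present = "TRUE"
pciPassthru0.vgpu = "grid_v100-4c"
pciPassthru0.cfg.enable_uvm = "1"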