CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported


Article ID: 386585


Products

VMware Private AI Foundation

Issue/Introduction

Deploying an AI Kubernetes RAG Cluster through Private AI Automation Services in the VMware Aria Automation Catalog fails.

The NVIDIA Inference Microservice (NIM) pod fails to start with the following error:

RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported
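
For reference, the pod status and logs can typically be collected with kubectl from the Tanzu Kubernetes cluster hosting the NIM deployment. This is a minimal sketch assuming kubectl access to the cluster; the namespace and pod name below are placeholders that vary by deployment:

    # locate the NIM pod (names vary by deployment)
    kubectl get pods -A | grep -i nim
    # retrieve the pod logs shown below
    kubectl logs <nim-pod-name> -n <nim-namespace>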

You see the following events in the pod logs:

[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024051400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 16384
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 498, in <module>
    engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 412, in from_engine_args
    engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 290, in from_engine_args
    return cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 267, in __init__
    self.engine: _AsyncTRTLLMEngine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 124, in __init__
    self._tllm_engine = TrtllmModelRunner(
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/trtllm_model_runner.py", line 280, in __init__
    self._tllm_exec, self._cfg = self._create_engine(
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/trtllm_model_runner.py", line 569, in _create_engine
    return create_trt_executor(
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/trtllm/utils.py", line 224, in create_trt_executor
    trtllm_exec = trtllm.Executor(
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported (/home/jenkins/agent/workspace/LLM/release-0.10/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/runtime/bufferManager.cpp:211)
1       0x7f146fffb29e void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 94
2       0x7f147192c85c tensorrt_llm::runtime::BufferManager::initMemoryPool(int) + 124
3       0x7f147192e897 tensorrt_llm::runtime::BufferManager::BufferManager(std::shared_ptr<tensorrt_llm::runtime::CudaStream>, bool) + 663
4       0x7f14719eebbf tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, float, nvinfer1::ILogger&) + 335
5       0x7f1471bf7e0a tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::executor::SchedulerConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1290
6       0x7f1471bba018 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::executor::SchedulerConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 968
7       0x7f1471c1e937 tensorrt_llm::executor::Executor::Impl::createModel(std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 711
8       0x7f1471c1f559 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2105
9       0x7f1471c15e12 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 50
10      0x7f14e6842802 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xae802) [0x7f14e6842802]
11      0x7f14e67ea53c /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x5653c) [0x7f14e67ea53c]
12      0x55b62b4b7c9e python3(+0x15ac9e) [0x55b62b4b7c9e]
13      0x55b62b4ae3cb _PyObject_MakeTpCall + 603
14      0x55b62b4c6540 python3(+0x169540) [0x55b62b4c6540]
15      0x55b62b4c2c87 python3(+0x165c87) [0x55b62b4c2c87]
16      0x55b62b4ae77b python3(+0x15177b) [0x55b62b4ae77b]
17      0x7f15523d69a7 /usr/local/lib/python3.10/dist-packages/triton/_C/libtriton.so(+0x2d59a7) [0x7f15523d69a7]
18      0x55b62b4ae3cb _PyObject_MakeTpCall + 603
19      0x55b62b4a6fab _PyEval_EvalFrameDefault + 28251
20      0x55b62b4b859c _PyFunction_Vectorcall + 124
21      0x55b62b4a0827 _PyEval_EvalFrameDefault + 1751
22      0x55b62b4b859c _PyFunction_Vectorcall + 124
23      0x55b62b4a096e _PyEval_EvalFrameDefault + 2078
24      0x55b62b4b859c _PyFunction_Vectorcall + 124
25      0x55b62b4ad60d _PyObject_FastCallDictTstate + 365
26      0x55b62b4c2705 python3(+0x165705) [0x55b62b4c2705]
27      0x55b62b4ae36c _PyObject_MakeTpCall + 508
28      0x55b62b4a763b _PyEval_EvalFrameDefault + 29931
29      0x55b62b4b859c _PyFunction_Vectorcall + 124
30      0x55b62b4ad60d _PyObject_FastCallDictTstate + 365
31      0x55b62b4c2664 python3(+0x165664) [0x55b62b4c2664]
32      0x55b62b4ae77b python3(+0x15177b) [0x55b62b4ae77b]
33      0x55b62b4c6d4b PyObject_Call + 187
34      0x55b62b4a2a9d _PyEval_EvalFrameDefault + 10573
35      0x55b62b4c6111 python3(+0x169111) [0x55b62b4c6111]
36      0x55b62b4c6db2 PyObject_Call + 290
37      0x55b62b4a2a9d _PyEval_EvalFrameDefault + 10573
38      0x55b62b4b859c _PyFunction_Vectorcall + 124
39      0x55b62b4ad60d _PyObject_FastCallDictTstate + 365
40      0x55b62b4c2664 python3(+0x165664) [0x55b62b4c2664]
41      0x55b62b4ae77b python3(+0x15177b) [0x55b62b4ae77b]
42      0x55b62b4c6d4b PyObject_Call + 187
43      0x55b62b4a2a9d _PyEval_EvalFrameDefault + 10573
44      0x55b62b4c6111 python3(+0x169111) [0x55b62b4c6111]
45      0x55b62b4a659a _PyEval_EvalFrameDefault + 25674
46      0x55b62b4c6111 python3(+0x169111) [0x55b62b4c6111]
47      0x55b62b4a1b77 _PyEval_EvalFrameDefault + 6695
48      0x55b62b49cf96 python3(+0x13ff96) [0x55b62b49cf96]
49      0x55b62b592c66 PyEval_EvalCode + 134
50      0x55b62b59881d python3(+0x23b81d) [0x55b62b59881d]
51      0x55b62b4b87f9 python3(+0x15b7f9) [0x55b62b4b87f9]
52      0x55b62b4a0827 _PyEval_EvalFrameDefault + 1751
53      0x55b62b4b859c _PyFunction_Vectorcall + 124
54      0x55b62b4a0827 _PyEval_EvalFrameDefault + 1751
55      0x55b62b4b859c _PyFunction_Vectorcall + 124
56      0x55b62b5b061d python3(+0x25361d) [0x55b62b5b061d]
57      0x55b62b5af2c8 Py_RunMain + 296
58      0x55b62b585a3d Py_BytesMain + 45
59      0x7f162f6ccd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f162f6ccd90]
60      0x7f162f6cce40 __libc_start_main + 128
61      0x55b62b585935 _start + 37

Environment

VMware Private AI Foundation

Cause

This issue is specific to the TensorRT-LLM (TRT-LLM) inference engine. During startup, TRT-LLM's BufferManager initializes a CUDA memory pool by calling cudaDeviceGetDefaultMemPool(), as shown in the stack trace above, and that call returns "operation not supported" when Unified Memory is not enabled for the vGPU device. Unified Memory must be enabled on the VM class used by the GPU worker nodes.
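
For background, the failing call can be reproduced outside of NIM with a minimal CUDA probe. This is a sketch, not part of the product; the file name mempool_check.cu is arbitrary, and the comment about vGPU behavior reflects this article's cause statement:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int device = 0;
        cudaSetDevice(device);

        // Stream-ordered memory pools require cudaDevAttrMemoryPoolsSupported.
        // Per this article, a vGPU device reports support only when Unified
        // Memory (enable_uvm) is enabled for the VM's vGPU device.
        int poolsSupported = 0;
        cudaDeviceGetAttribute(&poolsSupported,
                               cudaDevAttrMemoryPoolsSupported, device);
        printf("cudaDevAttrMemoryPoolsSupported = %d\n", poolsSupported);

        // The same call that fails in the TRT-LLM stack trace
        // (BufferManager::initMemoryPool) with "operation not supported".
        cudaMemPool_t memPool;
        cudaError_t err = cudaDeviceGetDefaultMemPool(&memPool, device);
        printf("cudaDeviceGetDefaultMemPool: %s\n", cudaGetErrorString(err));
        return (err == cudaSuccess) ? 0 : 1;
    }

Compile with nvcc (for example, nvcc -o mempool_check mempool_check.cu) and run it inside the GPU worker VM: on a correctly configured node the call succeeds ("no error"), while an affected node reproduces the same "operation not supported" error seen in the pod logs.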

Resolution

To resolve the issue:

  1. SSH to the vCenter Server.
  2. Update the VM class used by the GPU worker nodes to add the parameter pciPassthru0.cfg.enable_uvm = 1, as shown in the example below.

    For example, to update the VM class guaranteed-xlarge-v100-4c:
    # dcli +i +show-unreleased
     
    # update vmclass guaranteed-xlarge-v100-4c as an example
    dcli> com vmware vcenter namespacemanagement virtualmachineclasses update --vm-class guaranteed-xlarge-v100-4c --config-spec '{"_typeName":"VirtualMachineConfigSpec","extraConfig":[{"_typeName":"OptionValue","key":"pciPassthru0.cfg.enable_uvm","value":{"_typeName":"string","_value":"1"}}],"deviceChange":[{"_typeName":"VirtualDeviceConfigSpec","device":{"_typeName":"VirtualPCIPassthrough","key":0,"backing":{"_typeName":"VirtualPCIPassthroughVmiopBackingInfo","vgpu":"grid_v100-4c"}},"operation":"add"}]}'
     
    # verify the updated vmclass configuration
    dcli> com vmware vcenter namespacemanagement virtualmachineclasses get --vm-class guaranteed-xlarge-v100-4c
    devices:
       dynamic_direct_path_IO_devices:
       vgpu_devices:
          - profile_name: grid_v100-4c
     
    instance_storage:
    description:
    config_status: READY
    cpu_count: 4
    cpu_reservation: 100
    config_spec:
       extraConfig:
          - _typeName: OptionValue
            value:
               _typeName: string
               _value: 1
            key: pciPassthru0.cfg.enable_uvm
     
       _typeName: VirtualMachineConfigSpec
       deviceChange:
          - _typeName: VirtualDeviceConfigSpec
            device:
               backing:
                  vgpu: grid_v100-4c
                  _typeName: VirtualPCIPassthroughVmiopBackingInfo
               _typeName: VirtualPCIPassthrough
               key: 0
            operation: add
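
    For readability, this is the same --config-spec payload from the update command above, pretty-printed (identical content to the one-line JSON):

    {
      "_typeName": "VirtualMachineConfigSpec",
      "extraConfig": [
        {
          "_typeName": "OptionValue",
          "key": "pciPassthru0.cfg.enable_uvm",
          "value": { "_typeName": "string", "_value": "1" }
        }
      ],
      "deviceChange": [
        {
          "_typeName": "VirtualDeviceConfigSpec",
          "operation": "add",
          "device": {
            "_typeName": "VirtualPCIPassthrough",
            "key": 0,
            "backing": {
              "_typeName": "VirtualPCIPassthroughVmiopBackingInfo",
              "vgpu": "grid_v100-4c"
            }
          }
        }
      ]
    }

    Note that the deviceChange entry re-specifies the class's existing vGPU profile (grid_v100-4c) alongside the new extraConfig key, so the vGPU assignment is preserved in the updated class.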

 

Additional Information

For a Workstation VM, add the following parameter to the VM's .vmx configuration file while the VM is powered off, and then power on the VM:

    pciPassthru0.cfg.enable_uvm = 1
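
Because .vmx files conventionally store values as quoted strings, the entry as written in the file typically looks like the following (back up the .vmx file before editing):

    pciPassthru0.cfg.enable_uvm = "1"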