Post PAIF RAG Deployment "rag-application-multiturn-chatbot", "nemollm-inference-microservice" and "nemo-retriever-embedding-microservice" containers are in exited state
Article ID: 399711

Updated On:

Products

VMware Private AI Foundation
VMware vRealize Automation 8.x

Issue/Introduction

  • Post PAIF RAG Deployment "rag-application-multiturn-chatbot", "nemollm-inference-microservice" and "nemo-retriever-embedding-microservice" containers are in exited state.

(base) vmware@rcvrag:~$ docker container ps -a
CONTAINER ID   IMAGE                                                                COMMAND                  CREATED        STATUS                      PORTS                                       NAMES
3##########4   nvcr.io/nvidia/aiworkflows/rag-playground:24.08                      "python3.10 -m front…"   12 hours ago   Up 12 hours                 0.0.0.0:3001->3001/tcp, :::3001->3001/tcp   rag-playground
d##########1   nvcr.io/nvidia/aiworkflows/rag-application-multiturn-chatbot:24.08   "uvicorn RAG.src.cha…"   12 hours ago   Exited (3) 12 hours ago                                                 rag-application-multiturn-chatbot
c##########7   nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.0                            "/opt/nvidia/nvidia_…"   12 hours ago   Exited (139) 12 hours ago                                               nemo-retriever-embedding-microservice
4##########c   nvcr.io/nim/meta/llama3-8b-instruct:1.0.0                            "/opt/nvidia/nvidia_…"   12 hours ago   Exited (1) 12 hours ago                                                 nemollm-inference-microservice
7##########d   pgvector/pgvector:pg16                                               "docker-entrypoint.s…"   12 hours ago   Up 12 hours                 0.0.0.0:5432->5432/tcp, :::5432->5432/tcp   pgvector
3##########d   nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04             "/usr/local/dcgm/dcg…"   12 hours ago   Up 12 hours                 0.0.0.0:9400->9400/tcp, :::9400->9400/tcp   romantic_keller

  • The "nemollm-inference-microservice" ("nim") container may fail for one of two reasons:
    • (1) The UVM parameter is not set on the VM Class used for the DLVM

      • From the ESXi host, navigate to the RAG VM's directory and check the .vmx file for the parameter below
      • pciPassthru0.cfg.enable_uvm = "1"
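The check above can be scripted. A minimal sketch (the real file lives under /vmfs/volumes/<datastore>/<vm-name>/ on the ESXi host; a sample .vmx is created here so the check can be demonstrated anywhere):

```shell
# Sketch: verify the UVM flag in a VM's .vmx file.
# On a real ESXi host, point the check at /vmfs/volumes/<datastore>/<vm-name>/<vm-name>.vmx
# instead of the sample file created below.
VMX=$(mktemp)
printf 'pciPassthru0.present = "TRUE"\npciPassthru0.cfg.enable_uvm = "1"\n' > "$VMX"

if grep -q '^pciPassthru0.cfg.enable_uvm = "1"$' "$VMX"; then
  RESULT="UVM enabled"
else
  RESULT="UVM parameter missing"
fi
echo "$RESULT"
rm -f "$VMX"
```

If the parameter is missing, see "Update VM Class with UVM parameter" in the Resolution section below.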

    • (2) An incorrect NVIDIA vGPU driver is installed on the ESXi host
       
      • [root@ESX##:~] nvidia-smi vgpu -c
        GPU 00000000:C1:00.0
        No vGPUs found on this device
        [root@ESX##:~]
         
      • [root@ESX##:~] nvidia-smi vgpu -s
        GPU 00000000:C1:00.0
            NVIDIA L40S-1B
            NVIDIA L40S-2B
            NVIDIA L40S-1Q
            NVIDIA L40S-2Q
            NVIDIA L40S-3Q
            NVIDIA L40S-4Q
            NVIDIA L40S-6Q
            NVIDIA L40S-8Q
            NVIDIA L40S-12Q
            NVIDIA L40S-16Q
            NVIDIA L40S-24Q
            NVIDIA L40S-48Q
            NVIDIA L40S-1A
            NVIDIA L40S-2A
            NVIDIA L40S-3A
            NVIDIA L40S-4A
            NVIDIA L40S-6A
            NVIDIA L40S-8A
            NVIDIA L40S-12A
            NVIDIA L40S-16A
            NVIDIA L40S-24A
            NVIDIA L40S-48A

    • If the correct NVIDIA driver is installed, the output should look similar to the one below (note the additional C-series profiles)

      • [root@ESX##:~] nvidia-smi vgpu -c
        GPU 00000000:C1:00.0
            NVIDIA L40S-1B
            NVIDIA L40S-2B
            NVIDIA L40S-1Q
            NVIDIA L40S-2Q
            NVIDIA L40S-3Q
            NVIDIA L40S-4Q
            NVIDIA L40S-6Q
            NVIDIA L40S-8Q
            NVIDIA L40S-12Q
            NVIDIA L40S-16Q
            NVIDIA L40S-24Q
            NVIDIA L40S-48Q
            NVIDIA L40S-1A
            NVIDIA L40S-2A
            NVIDIA L40S-3A
            NVIDIA L40S-4A
            NVIDIA L40S-6A
            NVIDIA L40S-8A
            NVIDIA L40S-12A
            NVIDIA L40S-16A
            NVIDIA L40S-24A
            NVIDIA L40S-48A
            NVIDIA L40S-4C
            NVIDIA L40S-6C
            NVIDIA L40S-8C
            NVIDIA L40S-12C
            NVIDIA L40S-16C
            NVIDIA L40S-24C
            NVIDIA L40S-48C

 

Environment

vRA 8.18.1 release (Affected version)

Cause

This issue is seen only in vRA 8.18.1. The nemollm-inference-microservice container fails to start because an incorrect vGPU driver is installed on the ESXi host.

Resolution

To resolve the issue, update the blueprint for the RAG catalog item in vRA so that the containers pick up the latest images.

  • Log in to vRA and locate the RAG catalog item
    • In Assembler > Design, edit the template
    • Search for "softwareCloudInit"
    • Within the "CloudInit" code, search for the line below
      • .services."nemollm-inference".deploy.resources.reservations.devices[0].device_ids = ["${LLM_MS_GPU_ID:-0}"] |
    • Add the new line below to the code so the deployment picks up the latest image
      • .services."nemollm-embedding".image = "nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest" |
    • Click "Version" at the bottom of the page and check the box "Release this version to the catalog"
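The added blueprint line uses yq-style path syntax. Assuming the cloud-init applies it to docker-compose-nim-ms.yaml (an assumption based on the syntax, not confirmed in this article), its effect is equivalent to the following rewrite, simulated here with sed on a minimal compose fragment:

```shell
# Sketch: simulate the effect of the added blueprint line on a minimal
# fragment of docker-compose-nim-ms.yaml - the image tag is replaced
# with "latest" for the embedding service.
UPDATED=$(cat <<'EOF' | sed 's|\(nv-embedqa-e5-v5\):[0-9.]*|\1:latest|'
services:
  nemollm-embedding:
    image: nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.0
EOF
)
printf '%s\n' "$UPDATED"
```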


Workaround:
For containers to pick up the latest image

    1. Within the deployed RAG VM, update the YAML file
    2. Navigate to and locate the YAML file /opt/data/ai-chatbot-docker-workflow_v24.08/docker-compose-nim-ms.yaml
    3. Run the commands below
      • cp /opt/data/ai-chatbot-docker-workflow_v24.08/docker-compose-nim-ms.yaml /var/tmp/docker-compose-nim-ms.yaml.orig
      • vi /opt/data/ai-chatbot-docker-workflow_v24.08/docker-compose-nim-ms.yaml
      • Change the image tag from '1.0.0' to 'latest' on the line below
      • From
        • image: nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.0
      • To
        • image: nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest
    4. Run the commands below to stop and start the containers
      • To remove the containers:
      • docker compose -f /opt/data/ai-chatbot-docker-workflow_v24.08/rag-app-multiturn-chatbot/docker-compose.yaml --profile local-nim --profile pgvector down
      • To start the containers:
      • /opt/dlvm/dl_app.sh
    5. Run the docker ps -a command to validate that all the containers started, with the embedding microservice now using the latest image.
      • (base) vmware@airagworkstation6:/opt/data/ai-chatbot-docker-workflow_v24.08$ docker ps -a
        CONTAINER ID   IMAGE                                                                COMMAND                  CREATED              STATUS                        PORTS                                       NAMES
        0##########0   nvcr.io/nvidia/aiworkflows/rag-playground:24.08                      "python3.10 -m front…"   About a minute ago   Up About a minute             0.0.0.0:3001->3001/tcp, :::3001->3001/tcp   rag-playground
        c##########a   nvcr.io/nvidia/aiworkflows/rag-application-multiturn-chatbot:24.08   "uvicorn RAG.src.cha…"   About a minute ago   Up About a minute             0.0.0.0:8081->8081/tcp, :::8081->8081/tcp   rag-application-multiturn-chatbot
        5##########5   nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest                           "/opt/nim/start_serv…"   About a minute ago   Up About a minute (healthy)   0.0.0.0:9080->8000/tcp, :::9080->8000/tcp   nemo-retriever-embedding-microservice
        4##########b   nvcr.io/nim/meta/llama3-8b-instruct:1.0.0                            "/opt/nvidia/nvidia_…"   About a minute ago   Up About a minute (healthy)   0.0.0.0:8000->8000/tcp, :::8000->8000/tcp   nemollm-inference-microservice
        9##########8   pgvector/pgvector:pg16                                               "docker-entrypoint.s…"   About a minute ago   Up About a minute             0.0.0.0:5432->5432/tcp, :::5432->5432/tcp   pgvector
        e##########3   nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04             "/usr/local/dcgm/dcg…"   About an hour ago    Up About an hour              0.0.0.0:9400->9400/tcp, :::9400->9400/tcp   mystifying_montalcini
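The validation in step 5 can be scripted. A sketch that checks each expected container's status, run here against captured names and statuses (on the VM, `docker ps --format '{{.Names}} {{.Status}}'` produces output of the same shape):

```shell
# Sketch: verify all expected RAG containers are Up.
# Captured sample output is used here; on the VM use:
#   STATUSES=$(docker ps --format '{{.Names}} {{.Status}}')
STATUSES='rag-playground Up 2 minutes
rag-application-multiturn-chatbot Up 2 minutes
nemo-retriever-embedding-microservice Up 2 minutes (healthy)
nemollm-inference-microservice Up 2 minutes (healthy)
pgvector Up 2 minutes'

FAILED=0
for c in rag-playground rag-application-multiturn-chatbot \
         nemo-retriever-embedding-microservice \
         nemollm-inference-microservice pgvector; do
  if printf '%s\n' "$STATUSES" | grep -q "^$c Up"; then
    echo "$c: Up"
  else
    echo "$c: NOT running"
    FAILED=1
  fi
done
[ "$FAILED" -eq 0 ] && echo "all containers running"
```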

Update VM Class with UVM parameter:

  • If the UVM parameter is missing, navigate to Workload Management > Services > VM Classes, select the appropriate VM Class, and click Edit VM Class

  • Click "Advanced Parameters" and add the new attribute pciPassthru0.cfg.enable_uvm with the value "1".

  • When a new RAG VM is deployed using this VM Class, the advanced parameter will be applied to the VM.

Additional Information