(base) vmware@rcvrag:~$ docker container ps -a
CONTAINER ID   IMAGE                                                                COMMAND                  CREATED        STATUS                      PORTS                                       NAMES
3##########4   nvcr.io/nvidia/aiworkflows/rag-playground:24.08                      "python3.10 -m front…"   12 hours ago   Up 12 hours                 0.0.0.0:3001->3001/tcp, :::3001->3001/tcp   rag-playground
d##########1   nvcr.io/nvidia/aiworkflows/rag-application-multiturn-chatbot:24.08   "uvicorn RAG.src.cha…"   12 hours ago   Exited (3) 12 hours ago                                                 rag-application-multiturn-chatbot
c##########7   nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.0                            "/opt/nvidia/nvidia_…"   12 hours ago   Exited (139) 12 hours ago                                               nemo-retriever-embedding-microservice
4##########c   nvcr.io/nim/meta/llama3-8b-instruct:1.0.0                            "/opt/nvidia/nvidia_…"   12 hours ago   Exited (1) 12 hours ago                                                 nemollm-inference-microservice
7##########d   pgvector/pgvector:pg16                                               "docker-entrypoint.s…"   12 hours ago   Up 12 hours                 0.0.0.0:5432->5432/tcp, :::5432->5432/tcp   pgvector
3##########d   nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04             "/usr/local/dcgm/dcg…"   12 hours ago   Up 12 hours                 0.0.0.0:9400->9400/tcp, :::9400->9400/tcp   romantic_keller
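To see why the NIM containers exited, the standard Docker CLI can be used on the same VM; this is a minimal diagnostic sketch using the container names from the output above:

# Show the last log lines and the recorded exit code for the failed containers
docker logs --tail 50 nemollm-inference-microservice
docker logs --tail 50 nemo-retriever-embedding-microservice
docker inspect -f '{{.Name}} exited with code {{.State.ExitCode}}' \
  nemollm-inference-microservice nemo-retriever-embedding-microservice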
nim"/"nemollm-inference-microservice" might fail for two reasons
pciPassthru0.cfg.enable_uvm = "1"[root@ESX##:~] nvidia-smi vgpu -cGPU 00000000:C1:00.0No vGPUs found on this device[root@ESX##:~][root@ESX##:~] nvidia-smi vgpu -sGPU 00000000:C1:00.0 NVIDIA L40S-1B NVIDIA L40S-2B NVIDIA L40S-1Q NVIDIA L40S-2Q NVIDIA L40S-3Q NVIDIA L40S-4Q NVIDIA L40S-6Q NVIDIA L40S-8Q NVIDIA L40S-12Q NVIDIA L40S-16Q NVIDIA L40S-24Q NVIDIA L40S-48Q NVIDIA L40S-1A NVIDIA L40S-2A NVIDIA L40S-3A NVIDIA L40S-4A NVIDIA L40S-6A NVIDIA L40S-8A NVIDIA L40S-12A NVIDIA L40S-16A NVIDIA L40S-24A NVIDIA L40S-48A[root@ESX##:~] nvidia-smi vgpu -cGPU 00000000:C1:00.0 NVIDIA L40S-1B NVIDIA L40S-2B NVIDIA L40S-1Q NVIDIA L40S-2Q NVIDIA L40S-3Q NVIDIA L40S-4Q NVIDIA L40S-6Q NVIDIA L40S-8Q NVIDIA L40S-12Q NVIDIA L40S-16Q NVIDIA L40S-24Q NVIDIA L40S-48Q NVIDIA L40S-1A NVIDIA L40S-2A NVIDIA L40S-3A NVIDIA L40S-4A NVIDIA L40S-6A NVIDIA L40S-8A NVIDIA L40S-12A NVIDIA L40S-16A NVIDIA L40S-24A NVIDIA L40S-48A NVIDIA L40S-4C
NVIDIA L40S-6C
NVIDIA L40S-8C NVIDIA L40S-12C
NVIDIA L40S-16C
NVIDIA L40S-24C
NVIDIA L40S-48C
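A quick way to confirm that the compute (C-series) profiles required by the NIM containers are reported by the host driver is to filter the nvidia-smi output; this is a sketch that assumes grep is available in the ESXi shell (it is in the standard busybox shell):

# Run on the ESXi host: after the correct vGPU host driver is installed, both the
# supported (-s) and creatable (-c) lists should include the L40S C-series profiles.
[root@ESX##:~] nvidia-smi vgpu -s | grep -E 'L40S-[0-9]+C'
[root@ESX##:~] nvidia-smi vgpu -c | grep -E 'L40S-[0-9]+C'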
Affected version: vRA 8.18.1 release
This issue is seen only in vRA 8.18.1. The nemollm-inference-microservice container fails to start because an incorrect vGPU driver is installed on the ESXi host.
To resolve the issue, update the blueprint for the RAG catalog item in vRA so that the containers pick up the latest images.
In Assembler > Design, edit the template. Under the software component, open the "CloudInit" code and search for the line below:

.services."nemollm-inference".deploy.resources.reservations.devices[0].device_ids = ["${LLM_MS_GPU_ID:-0}"] |
.services."nemollm-embedding".image = "nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest"

Ensure the "nemollm-embedding" image is set to the "latest" tag as shown above, then click "Version" at the bottom of the page and check the box "Release this version to the catalog".
Workaround (for the containers to pick up the latest image on an already-deployed VM):
1. Back up the NIM compose file /opt/data/ai-chatbot-docker-workflow_v24.08/docker-compose-nim-ms.yaml:
   cp /opt/data/ai-chatbot-docker-workflow_v24.08/docker-compose-nim-ms.yaml /var/tmp/docker-compose-nim-ms.yaml.orig
2. Edit the file:
   vi /opt/data/ai-chatbot-docker-workflow_v24.08/docker-compose-nim-ms.yaml
   Replace 'image: nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.0' with 'image: nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest'.
3. Bring the RAG application down:
   docker compose -f /opt/data/ai-chatbot-docker-workflow_v24.08/rag-app-multiturn-chatbot/docker-compose.yaml --profile local-nim --profile pgvector down
4. Restart the application:
   /opt/dlvm/dl_app.sh
5. Run the docker ps -a command to validate that all the containers are started with the latest image.
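The interactive edit in step 2 can also be scripted; a minimal sketch, assuming GNU sed on the Deep Learning VM and the same file paths as above:

# Back up, swap the pinned embedding image tag for :latest, then recreate the stack.
cp /opt/data/ai-chatbot-docker-workflow_v24.08/docker-compose-nim-ms.yaml /var/tmp/docker-compose-nim-ms.yaml.orig
sed -i 's|nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.0|nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest|' \
  /opt/data/ai-chatbot-docker-workflow_v24.08/docker-compose-nim-ms.yaml
docker compose -f /opt/data/ai-chatbot-docker-workflow_v24.08/rag-app-multiturn-chatbot/docker-compose.yaml \
  --profile local-nim --profile pgvector down
/opt/dlvm/dl_app.sh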
(base) vmware@airagworkstation6:/opt/data/ai-chatbot-docker-workflow_v24.08$ docker ps -a
CONTAINER ID   IMAGE                                                                COMMAND                  CREATED              STATUS                        PORTS                                       NAMES
0##########0   nvcr.io/nvidia/aiworkflows/rag-playground:24.08                      "python3.10 -m front…"   About a minute ago   Up About a minute             0.0.0.0:3001->3001/tcp, :::3001->3001/tcp   rag-playground
c##########a   nvcr.io/nvidia/aiworkflows/rag-application-multiturn-chatbot:24.08   "uvicorn RAG.src.cha…"   About a minute ago   Up About a minute             0.0.0.0:8081->8081/tcp, :::8081->8081/tcp   rag-application-multiturn-chatbot
5##########5   nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest                           "/opt/nim/start_serv…"   About a minute ago   Up About a minute (healthy)   0.0.0.0:9080->8000/tcp, :::9080->8000/tcp   nemo-retriever-embedding-microservice
4##########b   nvcr.io/nim/meta/llama3-8b-instruct:1.0.0                            "/opt/nvidia/nvidia_…"   About a minute ago   Up About a minute (healthy)   0.0.0.0:8000->8000/tcp, :::8000->8000/tcp   nemollm-inference-microservice
9##########8   pgvector/pgvector:pg16                                               "docker-entrypoint.s…"   About a minute ago   Up About a minute             0.0.0.0:5432->5432/tcp, :::5432->5432/tcp   pgvector
e##########3   nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu22.04             "/usr/local/dcgm/dcg…"   About an hour ago    Up About an hour              0.0.0.0:9400->9400/tcp, :::9400->9400/tcp   mystifying_montalcini
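For a more compact check of which image tag each container is actually running, the standard docker ps format flag can be used (sketch only):

# Print only name, image, and status for each container
docker ps -a --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}'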
pciPassthru0.cfg.enable_uvm = "1" as shown below.