NVIDIA driver will not start the GPU

I have not been able to get the GPU attached to an Azure Ubuntu VM to start. Here are the details I have gathered:

  • IMAGE: NVIDIA GPU-Optimized VMI with vGPU driver
  • GPU: Standard NV6ads A10 v5
  • OS: Ubuntu 22.04.5 LTS
  • lspci confirms the GPU is attached and visible on the PCI bus
    • azureuser@img-seg-vm:~$ lspci
      • 0002:00:00.0 3D controller: NVIDIA Corporation Device 2236 (rev a1)
  • PROBLEM: azureuser@img-seg-vm:~$ nvidia-smi
    • No devices were found
  • Tried running a container anyway, but the GPU still does not start.
    • Container: pytorch:25.01-py3
    • docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.01-py3
      • ERROR: The NVIDIA Driver is present, but CUDA failed to initialize.
      • GPU functionality will not be available.
      • [[ No CUDA-capable device is detected (error 100) ]]
      • Failed to detect NVIDIA driver version.

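The issue appears to be that the NVIDIA driver is either not correctly installed or not properly loaded: nvidia-smi runs, so the user-space tools are present, but it cannot see the device that lspci lists.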

Since you are able to successfully spin up the VM, it is unlikely that the problem is related to permissions or access roles.

Given this, I recommend performing some basic SysAdmin troubleshooting before further escalation, as this issue does not seem to require our direct intervention at this time.

These troubleshooting steps may be helpful:

  • Check if the NVIDIA driver is installed:

    • dpkg -l | grep -i nvidia
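  • Check / install the correct NVIDIA driver. The commands below install the generic 535 driver from the Ubuntu repositories. The image ships with a vGPU driver and NV6ads A10 v5 exposes only part of an A10, so the generic driver may not be the right one for this size; check the Azure driver guidance for the VM series before purging.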

sudo apt purge -y 'nvidia*'
sudo apt autoremove -y
sudo apt update
sudo apt install -y nvidia-driver-535 nvidia-utils-535
sudo reboot
  • The instructions in this article also seem applicable.

  • It can be productive to start a new VM for a “do-over-from-scratch” test.
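
As a rough sketch of how the "not installed" and "not properly loaded" cases might be told apart before reinstalling anything, the script below gathers three facts and prints one label. It assumes a Debian/Ubuntu VM where dpkg and /proc/modules are available; the function name classify_driver_state and the four state labels are made up for illustration and are not part of any NVIDIA or Azure tooling.

```shell
# classify_driver_state PKG_COUNT MODULE_LOADED GPU_LISTED
#   PKG_COUNT      number of installed (state "ii") packages with "nvidia" in the name
#   MODULE_LOADED  1 if the nvidia kernel module is loaded, else 0
#   GPU_LISTED     1 if "nvidia-smi -L" lists at least one GPU, else 0
# Hypothetical helper: the labels it prints are for illustration only.
classify_driver_state() {
    if [ "$3" -eq 1 ]; then
        echo "ok"
    elif [ "$1" -eq 0 ]; then
        echo "not-installed"
    elif [ "$2" -eq 0 ]; then
        echo "installed-not-loaded"
    else
        echo "loaded-no-device"
    fi
}

# Count fully installed NVIDIA packages; stays 0 if dpkg is unavailable.
pkg_count=0
if command -v dpkg >/dev/null 2>&1; then
    pkg_count=$(dpkg -l 2>/dev/null |
        awk '$1 == "ii" && tolower($2) ~ /nvidia/ { n++ } END { print n + 0 }')
fi

# /proc/modules is the file lsmod reads; look for the core "nvidia" module.
module_loaded=0
if [ -r /proc/modules ] && grep -q '^nvidia ' /proc/modules; then
    module_loaded=1
fi

# "nvidia-smi -L" prints one "GPU <n>: ..." line per device it can see.
gpu_listed=0
if command -v nvidia-smi >/dev/null 2>&1 &&
    nvidia-smi -L 2>/dev/null | grep -q '^GPU '; then
    gpu_listed=1
fi

state=$(classify_driver_state "$pkg_count" "$module_loaded" "$gpu_listed")
echo "driver state: $state"

# Suggest a next step; nothing here changes the system.
case "$state" in
    not-installed)        echo "next: install a driver suited to this VM size" ;;
    installed-not-loaded) echo "next: try 'sudo modprobe nvidia' and read the error" ;;
    loaded-no-device)     echo "next: inspect 'sudo dmesg | grep -i nvrm'" ;;
    ok)                   echo "next: retry the container" ;;
esac
```

On the VM described in the question, where packages are present and nvidia-smi reports no devices, the result would be one of the two middle states depending on whether the kernel module is loaded. The kernel log is usually the quickest place to see why the driver did not claim the device.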