Post-mortem of the Proxmox 6.5.11-4-pve kernel upgrade with NVIDIA driver 470.199 and DKMS

On Debian Bookworm, upgrading the Proxmox kernel with apt ended in package installation failures because DKMS could not build the NVIDIA driver.

After rebooting, I could not boot into the latest kernel (version 6.5.11-4-pve) because the package configuration had not completed. Fortunately, I was able to boot into a previously installed kernel with a version below 6.5.11-4. After some research, I discovered that simplefb had been removed from the initramfs-tools modules.

To resolve the issue, I added simplefb to the /etc/initramfs-tools/modules file and rebuilt the initramfs; after that, the 6.5.11-4-pve kernel booted successfully.

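#!/bin/bash

# Append simplefb only if it is not already listed, so re-running adds no duplicate line
grep -qx "simplefb" /etc/initramfs-tools/modules || echo "simplefb" >> /etc/initramfs-tools/modules
# Rebuild the initramfs of every installed kernel
update-initramfs -u -k all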
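
Before rebooting, it can be worth checking that the rebuilt image really contains the module. The following is a minimal sketch, not part of the original procedure: has_module is a hypothetical helper name, and the initrd path assumes the usual Debian naming under /boot.

```shell
#!/bin/bash

# has_module (hypothetical helper): reads an lsinitramfs-style file listing on
# stdin and succeeds if it contains a kernel module file with the given name.
# The pattern also matches compressed modules such as simplefb.ko.zst.
has_module() {
    grep -q "/$1\.ko"
}

# Usage on the host (assumed initrd path):
#   lsinitramfs /boot/initrd.img-6.5.11-4-pve | has_module simplefb && echo "simplefb is in the initramfs"
```

lsinitramfs ships with initramfs-tools. If the check fails, the module line was probably not picked up when the image was rebuilt.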


I then tried to build the NVIDIA driver again via DKMS, without success.

After some searching, I found a Debian changelog entry at https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1940441.html describing a fix for building the NVIDIA kernel module against Linux 6.5:
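
When a DKMS build fails, the compiler errors end up in a make.log file inside the DKMS tree. As a rough sketch for locating them (show_dkms_logs is a hypothetical helper name, and /var/lib/dkms is assumed to be the DKMS tree, which is the usual default):

```shell
#!/bin/bash

# show_dkms_logs (hypothetical helper): print the last lines of every make.log
# found under a DKMS tree. The tree defaults to /var/lib/dkms (assumption).
show_dkms_logs() {
    local root="${1:-/var/lib/dkms}"
    find "$root" -name make.log -print 2>/dev/null | while IFS= read -r log; do
        echo "== $log"
        tail -n 20 "$log"
    done
}

# Usage on the host:
#   dkms status        # lists each module/version and its state per kernel
#   show_dkms_logs     # shows the tail of the build logs
```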

nvidia-graphics-drivers (525.125.06-2) unstable; urgency=medium
 * Backport get_user_pages and pin_user_pages changes from
  535.86.05 to fix kernel module build for Linux 6.5.
 -- Andreas Beckmann <...> Thu, 17 Aug 2023 00:34:55 +0200


I decided to install the NVIDIA driver from the testing repository. There are several ways to install a single Debian package from another repository; I chose to temporarily replace the bookworm references in my apt sources file with testing, refresh the package lists with apt, and install only the nvidia-tesla-470-kernel-dkms package, whose testing version (470.223.02-1) includes the patch above. The build and installation succeeded. I made sure that no other packages were installed along with the driver, to avoid pulling in unwanted system packages.

#!/bin/bash

sed -i 's/bookworm/testing/g' /etc/apt/sources.list
apt update
apt install nvidia-tesla-470-kernel-dkms
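
A slightly more cautious variant of the same idea is sketched below. It keeps a backup of the sources file so the switch can be undone by restoring the file rather than by a reverse substitution, and it simulates the install first to see which packages apt would touch. to_testing and restore_sources are hypothetical helper names; the .bak suffix is an arbitrary choice.

```shell
#!/bin/bash

# to_testing (hypothetical helper): point an apt sources file at testing,
# keeping the original as <file>.bak (GNU sed -i with a backup suffix).
to_testing() {
    sed -i.bak 's/bookworm/testing/g' "$1"
}

# restore_sources (hypothetical helper): put the backed-up file back in place.
restore_sources() {
    mv "$1.bak" "$1"
}

# Usage on the host, as root:
#   to_testing /etc/apt/sources.list && apt update
#   apt-get -s install nvidia-tesla-470-kernel-dkms   # -s only simulates, nothing is installed
#   apt install nvidia-tesla-470-kernel-dkms
#   restore_sources /etc/apt/sources.list && apt update
```

The simulated run prints the full list of packages that would be installed or upgraded, which is one way to confirm that only the driver package is affected before committing to the real install.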


After upgrading this single package, I switched the testing references back to bookworm and was able to run nvidia-smi successfully again.

#!/bin/bash

sed -i 's/testing/bookworm/g' /etc/apt/sources.list
apt update
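
To confirm which driver version the running kernel has actually loaded, the version file exposed by the NVIDIA module can be read. This is only a sketch: nvrm_version is a hypothetical helper name, and it assumes the file has a line starting with "NVRM version" that carries the version number.

```shell
#!/bin/bash

# nvrm_version (hypothetical helper): extract the driver version from text in
# the format of /proc/driver/nvidia/version, read on stdin.
nvrm_version() {
    grep -m1 'NVRM version' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n1
}

# Usage on the host:
#   nvrm_version < /proc/driver/nvidia/version
#   apt-cache policy nvidia-tesla-470-kernel-dkms   # installed vs. candidate package version
```

If the reported version is still the old one, the previous module is probably still loaded and a reboot or module reload is needed.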


Because my LXC containers use the NVIDIA GPU and I had upgraded the driver on the host, I also had to upgrade the driver inside these containers to the same version, without the kernel modules.

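#!/bin/bash

# Open a shell in container 100; the remaining commands are typed inside that shell
pct enter 100
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/470.223.02/NVIDIA-Linux-x86_64-470.223.02.run
chmod +x NVIDIA-Linux-x86_64-470.223.02.run
# Install the user-space part only; the kernel module is provided by the host
./NVIDIA-Linux-x86_64-470.223.02.run --no-kernel-module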
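
With several containers, a quick loop from the host shows whether each one can talk to the driver again. The sketch below assumes that nvidia-smi is installed in every container; check_containers is a hypothetical helper name and the container IDs are examples.

```shell
#!/bin/bash

# check_containers (hypothetical helper): for each container ID given, run
# nvidia-smi inside it via pct exec and print the driver version it reports.
check_containers() {
    local id
    for id in "$@"; do
        printf '%s: ' "$id"
        pct exec "$id" -- nvidia-smi --query-gpu=driver_version --format=csv,noheader \
            || echo "nvidia-smi failed"
    done
}

# Usage on the Proxmox host:
#   check_containers 100
```

A container whose user-space libraries do not match the host's kernel module will typically fail here with a version mismatch error instead of printing a version.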



Pierre FILSTROFF

Senior Software Engineer - IC - Ruby on Rails/Hotwire - Android/iOS - DevOPS