claude and i go on a minor adventure in the world of nvidia on linux
i know the footnote is fake, my dream footnote UI is on the backlog
I use deploy-rs to push configuration changes to my NixOS homelab. It implements an auto-rollback feature that will roll back the deployed profile to the previous generation on activation failure. I like this feature because it lets me iterate on my configuration without constantly worrying about breaking my system.
Most of the time, this deployment process works flawlessly. However, I experienced intermittent deployment failures that looked like this:
$ deploy .#agrotera --remote-build
# omitted for brevity ...
Failed to start nvidia-persistenced.service
⭐ ⚠️ [activate] [WARN] De-activating due to error
switching profile from version 55 to 54
⭐ ⚠️ [activate] [WARN] Removing generation by ID 55
removing profile version 55
⭐ ℹ️ [activate] [INFO] Attempting to re-activate the last generation
# omitted for brevity ...
This would happen when I was just updating the flake lockfile and rebuilding my system with newer packages.
In the universe of changes I could be trying to deploy, version bumping is a pretty low-risk chore; when this came up, I would forgo the auto-rollback and just switch to the new configuration anyway. The service would fail to start but the configuration change would persist. Then, I would reboot and move on. This never broke anything for me, but the randomly-added friction did annoy me.
After suffering this minor inconvenience for a while, I finally sat down to debug this failure. I wanted to be able to use deploy and its rollback features consistently.
I started with logs from the failing service.
$ journalctl -b -u nvidia-persistenced.service
...
systemd[1]: nvidia-persistenced.service: Deactivated successfully.
systemd[1]: Stopped NVIDIA Persistence Daemon.
systemd[1]: nvidia-persistenced.service: Consumed 24ms CPU time, 1.8M memory peak, 288K memory swap peak, 1016K read from disk, 284K written to disk, 240B incoming IP traffic, 320B outgoing IP traffic.
systemd[1]: /etc/systemd/system/nvidia-persistenced.service:10: PIDFile= references a path below legacy directory /var/run/, updating /var/run/nvidia-persistenced/nvidia-persistenced.pid → /run/nvidia-persistenced/nvidia-persistenced.pid; please update the unit file accordingly.
systemd[1]: Starting NVIDIA Persistence Daemon...
nvidia-persistenced[271345]: Verbose syslog connection opened
nvidia-persistenced[271345]: Started (271345)
nvidia-persistenced[271345]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.
nvidia-persistenced[271345]: PID file unlocked.
nvidia-persistenced[271334]: nvidia-persistenced failed to initialize. Check syslog for more details.
nvidia-persistenced[271345]: PID file closed.
nvidia-persistenced[271345]: Shutdown (271345)
systemd[1]: nvidia-persistenced.service: Control process exited, code=exited, status=1/FAILURE
systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
systemd[1]: Failed to start NVIDIA Persistence Daemon.
The NVIDIA NixOS module starts nvidia-persistenced with the --verbose flag already, so this is all the information I was going to get from this journalctl command. The complaint about the provided PID file is just a warning, leaving me with: “Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.”
I also harbored a suspicion that NVIDIA driver updates were causing these failures. To check that they were at least correlated, I confirmed that my failing flake update included a version bump to my NVIDIA package, production, by peeking at the nixpkgs commit history since my last update (lol).
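That peek was nothing fancy, roughly a scoped git log over a local nixpkgs checkout (the path and the revisions here are placeholders, standing in for wherever your checkout lives and the old and new locked commits):

$ git -C ~/src/nixpkgs log --oneline <old-rev>..<new-rev> -- pkgs/os-specific/linux/nvidia-x11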
With no relevant experience to draw on and no useful results from Google, I put my notes and
thoughts together and then prompted Claude (3.7 at the time). The first suggestion was to directly
address the seemingly missing device handles by trying various tweaks to the nvidia-persistenced
service configuration. This made sense to me; if the device really was missing at service start
time, maybe I could have this service express a dependency on device availability somehow.
Fortunately for me, NixOS makes it dead-easy to iterate on this:
{ config, ... }: {
  # ...
  systemd.services.nvidia-persistenced = {
    # attribute tweaks here
  };
}
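To give a flavor of what “attribute tweaks” means, the attempts were all variations on attributes like these (the specific units and settings below are illustrative guesses, not a working fix):

{ config, ... }: {
  systemd.services.nvidia-persistenced = {
    # e.g. wait until udev has processed pending device events before starting
    after = [ "systemd-udev-settle.service" ];
    wants = [ "systemd-udev-settle.service" ];
    # or just retry shortly after a failed start
    serviceConfig = {
      Restart = "on-failure";
      RestartSec = 2;
    };
  };
}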
Unfortunately for me, none of the changes I tried worked.
Time to take a step back then. I decided to scroll through the unfiltered logs, reading all the events following a deployment. Lo and behold, a key piece of information was hiding behind the kernel log source:
$ journalctl -b
...
kernel: NVRM: API mismatch: the client has the version 570.133.07, but
NVRM: this kernel module has the version 570.124.04. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
This looked like the foothold I was looking for. This message comes up right before the failure message from nvidia-persistenced.service[^1]. Anyway, if nvidia-persistenced needs the new kernel module, maybe I can just switch them when the driver updates! Two years ago, one person on the NVIDIA developer forums said loading new NVIDIA kernel modules is possible on Ubuntu?
Say less.
Again with Claude, I started working on a fix. I would add a NixOS module to my configuration that:

- defines a script to check whether the loaded NVIDIA kernel modules match the configured driver package and, if not, swap in the modules from the new package
- configures the nvidia-persistenced service to run this script on service start

It took some experimentation and some spelunking to iterate on this.
First, since modprobe is a higher-level tool that loads modules by name, the mid-upgrade module resolution was “succeeding” but picking modules from the old package. I had to go looking for the right Nix attribute to pass the script the correct store path and then use insmod to directly load .ko files from that store path.
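In shell terms, the difference amounts to this (the path is a placeholder following the pattern shown just below):

# modprobe resolves the module by name against the installed module tree,
# which mid-upgrade still points at the old package:
modprobe nvidia

# insmod loads exactly the file it is given, so handing it the new package's
# store path sidesteps that resolution:
insmod "<new driver store path>/lib/modules/<kernel version>/misc/nvidia.ko"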
Once I fixed that, I was still failing to use the right files. These kernel modules are generally found at a path that looks like this:
/nix/store/${input_hash}-${package_name}/lib/modules/${KERNEL_VERSION}/
I couldn’t open the right directory because my update included a kernel update; my earlier code was looking for a directory named after the currently running kernel, but the new driver package only provided modules for the incoming kernel.
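In Nix terms, both pieces of that path come from attributes of the incoming configuration; the module at the end of this post binds them like so:

kernelVersion = pkgs.lib.getVersion config.boot.kernelPackages.kernel;  # the incoming kernel, not the running one
nvidiaPackage = config.hardware.nvidia.package;                         # the incoming driver package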
Thinking about this kernel update, it hit me that not only was a kernel update annoying to account for, it required a reboot anyway. While I still needed to get deploy to stop failing, I wasn’t going to avoid a reboot forever; there was less value in swapping parts in-place when the reboot I might have to do would also reload kernel modules.
I stopped focusing on module swapping to re-evaluate. Finally stepping back, I wondered, “why am I stopping the world for nvidia-persistenced?”
Oh no.
The NixOS option hardware.nvidia.nvidiaPersistenced has the following description: “nvidia-persistenced a update for NVIDIA GPU headless mode, i.e. It ensures all GPUs stay awake even during headless mode”.
Months ago, when I was first configuring this system, I read this description and thought, “Headless
applies to me and ‘awake’ sounds good, why wouldn’t I want that? Let’s enable it.”
This old README makes it much clearer what problems nvidia-persistenced is meant to solve: nvidia-persistenced maintains device state for use-cases where the latency cost of “repetitive device initialization” should not be incurred. I have no such workloads on my gaming-laptop-turned-hobby-media-server, so I actually have no use for nvidia-persistenced.
Turns out, the easiest solution to my problem was flipping a boolean and disabling nvidia-persistenced outright. Now, there is no service to complain about the version mismatch and I can use deploy in peace.
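In configuration terms, that boolean flip is just:

{
  hardware.nvidia.nvidiaPersistenced = false;
}

(Or, since false is the default, simply deleting the line that enabled it in the first place.)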
I was so close to getting my module swapping script hooked up though. For the sake of completeness, I fixed the remaining bugs (but did not try to account for kernel updates) and then tested it by pushing just an NVIDIA driver update. The barebones module is below, and it actually works (for me lol, no promises)!
Ultimately, I spent several hours of my life solving a self-inflicted problem with a trivial change. Go me.
It wasn’t all bad though. I had some valuable lessons reinforced, like “don’t underestimate staring at unfiltered logs” and “try not to tunnel-vision on the immediate next step” and “think harder about components you add to your system”. Extra experience messing with a NixOS system doesn’t hurt either.
I also got a slightly better feel for Claude’s ability to work on NixOS things. While I was trying systemd service definition tweaks, I ran into minor inaccuracies that could probably be blamed on insufficient context. For example, Claude happily wrote in a dependency on a non-existent systemd device, dev-nvidia0.device. In fairness, /dev/nvidia0 does exist on my system (and likely many others), and there are also similar-sounding existing devices like dev-sda1.device for /dev/sda1. I can’t blame Claude that much for that hallucinated NVIDIA device.
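For what it’s worth, checking a claim like that is quick; listing the device units systemd actually knows about shows whether something like dev-nvidia0.device exists:

$ systemctl list-units --type=device | grep -i nvidia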
The module swapping NixOS module was mostly a shell script that Claude also wrote before I refined it. It doesn’t surprise me that the generated script read more like a standalone shell script than something meant to be one piece of a NixOS module; for example, instead of using a package’s version attribute, Claude would get the store path of an NVIDIA package but then extract the version with regex. In general, the use of string interpolation was kind of brittle and often wrong; starting with the module that Claude initially wrote, I can’t quite imagine working this to completion without at least an okay understanding of how to think in Nix, so to speak.
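As a concrete example: the refined script below just lets Nix interpolate the package’s version attribute, where the first draft got a store path and regexed a version out of it (the second line here is my loose reconstruction of that draft, not the original):

# from the module below: Nix fills in the version at build time
EXPECTED_VERSION="${nvidiaPackage.version}"

# roughly what the first draft did instead (reconstructed for illustration)
EXPECTED_VERSION=$(echo "$NVIDIA_MODULE_PATH" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+')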
Finally, this endeavor was a win for my “write literally everything down” approach to homelab tinkering. I worked on this specific problem for like 3 afternoons over the course of 2 months; without the reproducibility affordances of Nix and my copious notes, I don’t think I could have overcome the cost of context-switching and explored this in my free time.
[^1]:
    It doesn’t really matter, but I went looking afterward to see if I could connect the two messages definitively. I found a seemingly matching error here in the NVML docs (which provides functionality for nvidia-smi) for a driver/library version mismatch. I also saw this line in nvidia-persistenced.c that fails, logging the observed error. Unfortunately, that failing call appears to go through the non-free libnvidia-cfg.so, which is as far as I care to dig right now.
nvidia-kmod-swap.nix
{ config, pkgs, ... }:
let
  kernelVersion = pkgs.lib.getVersion config.boot.kernelPackages.kernel;
  nvidiaPackage = config.hardware.nvidia.package;

  nvidia-check-swap-kernel-modules = (pkgs.writeShellScriptBin "nvidia-check-swap-kernel-modules" ''
    set -euo pipefail

    log() {
      echo "$1" >&2
    }

    log "Begin checking whether NVIDIA module swap needed"

    # Find the exact module path from the package
    NVIDIA_MODULE_PATH="${nvidiaPackage.bin}/lib/modules"
    log "NVIDIA module path: $NVIDIA_MODULE_PATH"

    # Check if nvidia module is loaded
    # (`|| true` so an empty match doesn't trip `set -e` via pipefail)
    log "Checking if NVIDIA modules are loaded..."
    NVIDIA_LOADED_MODULES=$(${pkgs.kmod}/bin/lsmod | ${pkgs.ripgrep}/bin/rg nvidia | ${pkgs.gawk}/bin/awk '{print $1}' || true)
    if [ -z "$NVIDIA_LOADED_MODULES" ]; then
      log "NVIDIA modules not loaded. Loading modules."
      ${pkgs.kmod}/bin/modprobe nvidia
      exit $?
    fi
    log "Found modules: $(echo "$NVIDIA_LOADED_MODULES" | ${pkgs.coreutils}/bin/tr '\n' ' ')"

    # Get current kernel module version
    MODULE_VERSION=$(${pkgs.kmod}/bin/modinfo nvidia | ${pkgs.ripgrep}/bin/rg '^version:' | ${pkgs.gawk}/bin/awk '{print $2}')
    log "Module version: $MODULE_VERSION"

    # Get the expected version from the package
    EXPECTED_VERSION="${nvidiaPackage.version}"
    log "NVIDIA package version: $EXPECTED_VERSION"

    # Check if versions match
    if [ "$EXPECTED_VERSION" != "$MODULE_VERSION" ]; then
      log "Version mismatch detected. Attempting to fix..."

      # Check for running processes using the GPU
      GPU_PROCS=$(${pkgs.psmisc}/bin/fuser -v /dev/nvidia* 2>/dev/null || true)
      if [ -n "$GPU_PROCS" ]; then
        log "WARNING: Processes are using the GPU. This may prevent module unloading:"
        log "$GPU_PROCS"
        # Optional: add logic to kill or stop these processes
      fi

      # Define the correct unload order for NVIDIA modules
      MODULES_TO_UNLOAD="nvidia_drm nvidia_uvm nvidia_modeset nvidia"

      # Try to unload modules in the correct order
      for module in $MODULES_TO_UNLOAD; do
        if ${pkgs.kmod}/bin/lsmod | ${pkgs.ripgrep}/bin/rg -q "^$module "; then
          log "Unloading $module..."
          if ! ${pkgs.kmod}/bin/rmmod $module; then
            log "Failed to unload $module. Services might be using the GPU."
            exit 1
          fi
        else
          log "Module $module not loaded, skipping."
        fi
      done

      # Force the module path when loading to ensure we get the new modules
      log "Loading NVIDIA modules with correct version..."
      KERNEL_VERSION="${kernelVersion}"

      # Use insmod directly with the full path to the new modules if possible
      if [ -d "$NVIDIA_MODULE_PATH" ]; then
        log "Looking for modules in $NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc"
        if [ -f "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia.ko" ]; then
          ${pkgs.kmod}/bin/insmod "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia.ko"
          if [ -f "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia-modeset.ko" ]; then
            ${pkgs.kmod}/bin/insmod "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia-modeset.ko"
          fi
          if [ -f "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia-uvm.ko" ]; then
            ${pkgs.kmod}/bin/insmod "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia-uvm.ko"
          fi
          if [ -f "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia-drm.ko" ]; then
            ${pkgs.kmod}/bin/insmod "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia-drm.ko"
          fi
        else
          log "Could not find nvidia.ko at expected path, falling back to modprobe"
          ${pkgs.kmod}/bin/modprobe nvidia
        fi
      else
        # untested branch
        log "Could not find the module directory at all, falling back to modprobe"
        # For modprobe to find the right modules, we need to tell modprobe where to look
        # Create a temporary modprobe.d config to point to the right nvidia driver location
        TEMP_MODPROBE_CONF=$(mktemp)
        echo "search $NVIDIA_MODULE_PATH/$KERNEL_VERSION" > "$TEMP_MODPROBE_CONF"
        ${pkgs.kmod}/bin/modprobe -C "$TEMP_MODPROBE_CONF" nvidia
        rm -f "$TEMP_MODPROBE_CONF"
      fi

      # Verify using nvidia-smi
      # Guessing based on the manpage, `modinfo` is not a reliable tool to *verify* the loaded kernel module without more tweaking
      # because it will search for modules the same way modprobe does
      if ${nvidiaPackage.bin}/bin/nvidia-smi --query-gpu=driver_version --format=csv,noheader &>/dev/null; then
        NVIDIA_SMI_VERSION_AFTER=$(${nvidiaPackage.bin}/bin/nvidia-smi --query-gpu=driver_version --format=csv,noheader)
        log "Successfully loaded NVIDIA modules. NVIDIA-SMI reports version: $NVIDIA_SMI_VERSION_AFTER"

        # Trigger udev to recreate device nodes
        log "Triggering udev to recreate NVIDIA device nodes..."
        ${pkgs.systemd}/bin/udevadm trigger --subsystem-match=nvidia
        ${pkgs.systemd}/bin/udevadm settle

        log "Module swap completed successfully."
        exit 0
      else
        log "Failed to load NVIDIA modules correctly. nvidia-smi cannot communicate with the driver."
        log "Current modules loaded from: $(${pkgs.kmod}/bin/modinfo nvidia | ${pkgs.ripgrep}/bin/rg '^filename:' | ${pkgs.gawk}/bin/awk '{print $2}')"
        exit 1
      fi
    else
      log "NVIDIA versions match. No action needed."
    fi

    exit 0
  '');
in
{
  environment.systemPackages = [
    nvidia-check-swap-kernel-modules
  ];

  # Modify the nvidia-persistenced service
  systemd.services.nvidia-persistenced = {
    # Keep the existing configuration and layer our hooks on top
    serviceConfig = {
      # Add our script as a pre-start check
      ExecStartPre = [
        "${pkgs.bash}/bin/bash -c '${pkgs.coreutils}/bin/sleep 1'" # Small delay to ensure devices are ready
        "${nvidia-check-swap-kernel-modules}/bin/nvidia-check-swap-kernel-modules"
      ];
      # Add additional logging for failures
      ExecStopPost = [
        "+${pkgs.bash}/bin/bash -c 'if [ \"$EXIT_STATUS\" != \"0\" ]; then ${pkgs.systemd}/bin/systemd-cat -t nvidia-persistenced -p err echo \"Service failed with status $EXIT_STATUS\"; fi'"
      ];
    };

    # Rate-limit repeated start attempts
    startLimitIntervalSec = 60;
    startLimitBurst = 3;
  };
}
<3