claude and i go on a minor adventure in the world of nvidia on linux
i know the footnote is fake, my dream footnote UI is on the backlog
I use deploy-rs to push configuration changes to my NixOS homelab. It implements an auto-rollback feature that will roll back the deployed profile to the previous generation on activation failure. I like this feature because it lets me iterate on my configuration without constantly worrying about breaking my system.
Most of the time, this deployment process works flawlessly. However, I experienced intermittent deployment failures that looked like this:
$ deploy .#agrotera --remote-build
# omitted for brevity ...
Failed to start nvidia-persistenced.service
⭐ ⚠️ [activate] [WARN] De-activating due to error
switching profile from version 55 to 54
⭐ ⚠️ [activate] [WARN] Removing generation by ID 55
removing profile version 55
⭐ ℹ️ [activate] [INFO] Attempting to re-activate the last generation
# omitted for brevity ...
This would happen when I was just updating the flake lockfile and rebuilding my system with newer packages.
In the universe of changes I could be trying to deploy, version bumping is a pretty low-risk chore; when this came up, I would forgo the auto-rollback and just switch to the new configuration anyway. The service would fail to start but the configuration change would persist. Then, I would reboot and move on. This never broke anything for me, but the randomly-added friction did annoy me.
After suffering this minor inconvenience for a while, I finally sat down to debug this failure. I wanted to be able to use deploy and its rollback features consistently.
I started with logs from the failing service.
$ journalctl -b -u nvidia-persistenced.service
...
systemd[1]: nvidia-persistenced.service: Deactivated successfully.
systemd[1]: Stopped NVIDIA Persistence Daemon.
systemd[1]: nvidia-persistenced.service: Consumed 24ms CPU time, 1.8M memory peak, 288K memory swap peak, 1016K read from disk, 284K written to disk, 240B incoming IP traffic, 320B outgoing IP traffic.
systemd[1]: /etc/systemd/system/nvidia-persistenced.service:10: PIDFile= references a path below legacy directory /var/run/, updating /var/run/nvidia-persistenced/nvidia-persistenced.pid → /run/nvidia-persistenced/nvidia-persistenced.pid; please update the unit file accordingly.
systemd[1]: Starting NVIDIA Persistence Daemon...
nvidia-persistenced[271345]: Verbose syslog connection opened
nvidia-persistenced[271345]: Started (271345)
nvidia-persistenced[271345]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.
nvidia-persistenced[271345]: PID file unlocked.
nvidia-persistenced[271334]: nvidia-persistenced failed to initialize. Check syslog for more details.
nvidia-persistenced[271345]: PID file closed.
nvidia-persistenced[271345]: Shutdown (271345)
systemd[1]: nvidia-persistenced.service: Control process exited, code=exited, status=1/FAILURE
systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
systemd[1]: Failed to start NVIDIA Persistence Daemon.
The NVIDIA NixOS module starts nvidia-persistenced with the --verbose flag already, so this is all the information I was going to get from this journalctl command. The complaint about the provided PID file is just a warning, leaving me with: “Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.”
I also harbored a suspicion that NVIDIA driver updates were causing these failures. To check that they were at least correlated, I confirmed that my failing flake update included a version bump to my NVIDIA package, production, by peeking at the nixpkgs commit history since my last update (lol).
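That peek was nothing fancy, roughly a scoped git log over a local nixpkgs checkout (the path and the revisions here are placeholders, standing in for wherever your checkout lives and the old and new locked commits):

$ git -C ~/src/nixpkgs log --oneline <old-rev>..<new-rev> -- pkgs/os-specific/linux/nvidia-x11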
With no relevant experience to draw on and no useful results from Google, I put my notes and
thoughts together and then prompted Claude (3.7 at the time). The first suggestion was to directly
address the seemingly missing device handles by trying various tweaks to the nvidia-persistenced
service configuration. This made sense to me; if the device really was missing at service start
time, maybe I could have this service express a dependency on device availability somehow.
Fortunately for me, NixOS makes it dead-easy to iterate on this:
{ config, ... }: {
  # ...
  systemd.services.nvidia-persistenced = {
    # attribute tweaks here
  };
}
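To give a flavor of what “attribute tweaks” means, the attempts were all variations on attributes like these (the specific units and settings below are illustrative guesses, not a working fix):

{ config, ... }: {
  systemd.services.nvidia-persistenced = {
    # e.g. wait until udev has processed pending device events before starting
    after = [ "systemd-udev-settle.service" ];
    wants = [ "systemd-udev-settle.service" ];
    # or just retry shortly after a failed start
    serviceConfig = {
      Restart = "on-failure";
      RestartSec = 2;
    };
  };
}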
Unfortunately for me, none of the changes I tried worked.
Time to take a step back then. I decided to scroll through the unfiltered logs, reading all the events following a deployment. Lo and behold, a key piece of information was hiding behind the kernel log source:
$ journalctl -b
...
kernel: NVRM: API mismatch: the client has the version 570.133.07, but
NVRM: this kernel module has the version 570.124.04. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
This looked like the foothold I was looking for. This message comes up right before the failure message from nvidia-persistenced.service[^1]. Anyway, if nvidia-persistenced needs the new kernel module, maybe I can just switch them when the driver updates! Two years ago, one person on the NVIDIA developer forums said loading new NVIDIA kernel modules is possible on Ubuntu?
Say less.
Again with Claude, I started working on a fix. I would add a NixOS module to my configuration that:

- defines a script to check whether the loaded NVIDIA kernel modules match the configured driver package and, if not, swap in the modules from the new package
- configures the nvidia-persistenced service to run this script on service start

It took some experimentation and some spelunking to iterate on this.
First, since modprobe is a higher-level tool that loads modules by name, the mid-upgrade module resolution was “succeeding” but picking modules from the old package. I had to go looking for the right Nix attribute to pass the script the correct store path and then use insmod to directly load .ko files from that store path.
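In shell terms, the difference amounts to this (the path is a placeholder following the pattern shown just below):

# modprobe resolves the module by name against the installed module tree,
# which mid-upgrade still points at the old package:
modprobe nvidia

# insmod loads exactly the file it is given, so handing it the new package's
# store path sidesteps that resolution:
insmod "<new driver store path>/lib/modules/<kernel version>/misc/nvidia.ko"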
Once I fixed that, I was still failing to use the right files. These kernel modules are generally found at a path that looks like this:
/nix/store/${input_hash}-${package_name}/lib/modules/${KERNEL_VERSION}/
I couldn’t open the right directory because my update included a kernel update; my earlier code was looking for a directory named after the currently running kernel, but the new driver package only provided modules for the incoming kernel.
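In Nix terms, both pieces of that path come from attributes of the incoming configuration; the module at the end of this post binds them like so:

kernelVersion = pkgs.lib.getVersion config.boot.kernelPackages.kernel;  # the incoming kernel, not the running one
nvidiaPackage = config.hardware.nvidia.package;                         # the incoming driver package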
Thinking about this kernel update, it hit me that not only was a kernel update annoying to account for, it required a reboot anyway. While I still needed to get deploy to stop failing, I wasn’t going to avoid a reboot forever; there was less value in swapping parts in-place when the reboot I might have to do would also reload kernel modules.
I stopped focusing on module swapping to re-evaluate. Finally stepping back, I wondered, “why am I stopping the world for nvidia-persistenced?”
Oh no.
The NixOS option hardware.nvidia.nvidiaPersistenced has the following description: “nvidia-persistenced a update for NVIDIA GPU headless mode, i.e. It ensures all GPUs stay awake even during headless mode”.
Months ago, when I was first configuring this system, I read this description and thought, “Headless
applies to me and ‘awake’ sounds good, why wouldn’t I want that? Let’s enable it.”
This old README makes it much clearer what problems nvidia-persistenced is meant to solve: nvidia-persistenced maintains device state for use-cases where the latency cost of “repetitive device initialization” should not be incurred. I have no such workloads on my gaming-laptop-turned-hobby-media-server, so I actually have no use for nvidia-persistenced.
Turns out, the easiest solution to my problem was flipping a boolean and disabling nvidia-persistenced outright. Now, there is no service to complain about the version mismatch and I can use deploy in peace.
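In configuration terms, that boolean flip is just:

{
  hardware.nvidia.nvidiaPersistenced = false;
}

(Or, since false is the default, simply deleting the line that enabled it in the first place.)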
I was so close to getting my module swapping script hooked up though. For the sake of completeness, I fixed the remaining bugs (but did not try to account for kernel updates) and then tested it by pushing just an NVIDIA driver update. The barebones module is below, and it actually works (for me lol, no promises)!
Ultimately, I spent several hours of my life solving a self-inflicted problem with a trivial change. Go me.
It wasn’t all bad though. I had some valuable lessons reinforced, like “don’t underestimate staring at unfiltered logs” and “try not to tunnel-vision on the immediate next step” and “think harder about components you add to your system”. Extra experience messing with a NixOS system doesn’t hurt either.
I also got a slightly better feel for Claude’s ability to work on NixOS things. While I was trying systemd service definition tweaks, I ran into minor inaccuracies that could probably be blamed on insufficient context. For example, Claude happily wrote in a dependency on a non-existent systemd device, dev-nvidia0.device. In fairness, /dev/nvidia0 does exist on my system (and likely many others), and there are also similar-sounding existing devices like dev-sda1.device for /dev/sda1. I can’t blame Claude that much for that hallucinated NVIDIA device.
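For what it’s worth, checking a claim like that is quick; listing the device units systemd actually knows about shows whether something like dev-nvidia0.device exists:

$ systemctl list-units --type=device | grep -i nvidia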
The module swapping NixOS module was mostly a shell script that Claude also wrote before I refined it. It doesn’t surprise me that the generated script read more like a standalone shell script than something meant to be one piece of a NixOS module; for example, instead of using a package’s version attribute, Claude would get the store path of an NVIDIA package but then extract the version with regex. In general, the use of string interpolation was kind of brittle and often wrong; starting with the module that Claude initially wrote, I can’t quite imagine working this to completion without at least an okay understanding of how to think in Nix, so to speak.
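As a concrete example: the refined script below just lets Nix interpolate the package’s version attribute, where the first draft got a store path and regexed a version out of it (the second line here is my loose reconstruction of that draft, not the original):

# from the module below: Nix fills in the version at build time
EXPECTED_VERSION="${nvidiaPackage.version}"

# roughly what the first draft did instead (reconstructed for illustration)
EXPECTED_VERSION=$(echo "$NVIDIA_MODULE_PATH" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+')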
Finally, this endeavor was a win for my “write literally everything down” approach to homelab tinkering. I worked on this specific problem for like 3 afternoons over the course of 2 months; without the reproducibility affordances of Nix and my copious notes, I don’t think I could have overcome the cost of context-switching and explored this in my free time.
[^1]:
    It doesn’t really matter, but I went looking afterward to see if I could connect the two messages definitively. I found a seemingly matching error here in the NVML docs (which provides functionality for nvidia-smi) for a driver/library version mismatch. I also saw this line in nvidia-persistenced.c that fails, logging the observed error. Unfortunately, that failing call appears to go through the non-free libnvidia-cfg.so, which is as far as I care to dig right now.
nvidia-kmod-swap.nix
{ config, pkgs, ... }:
let
  kernelVersion = pkgs.lib.getVersion config.boot.kernelPackages.kernel;
  nvidiaPackage = config.hardware.nvidia.package;

  nvidia-check-swap-kernel-modules = (pkgs.writeShellScriptBin "nvidia-check-swap-kernel-modules" ''
    set -euo pipefail

    log() {
      echo "$1" >&2
    }

    log "Begin checking whether NVIDIA module swap needed"

    # Find the exact module path from the package
    NVIDIA_MODULE_PATH="${nvidiaPackage.bin}/lib/modules"
    log "NVIDIA module path: $NVIDIA_MODULE_PATH"

    # Check if nvidia module is loaded
    # (`|| true` so an empty match doesn't trip `set -e` via pipefail)
    log "Checking if NVIDIA modules are loaded..."
    NVIDIA_LOADED_MODULES=$(${pkgs.kmod}/bin/lsmod | ${pkgs.ripgrep}/bin/rg nvidia | ${pkgs.gawk}/bin/awk '{print $1}' || true)
    if [ -z "$NVIDIA_LOADED_MODULES" ]; then
      log "NVIDIA modules not loaded. Loading modules."
      ${pkgs.kmod}/bin/modprobe nvidia
      exit $?
    fi
    log "Found modules: $(echo "$NVIDIA_LOADED_MODULES" | ${pkgs.coreutils}/bin/tr '\n' ' ')"

    # Get current kernel module version
    MODULE_VERSION=$(${pkgs.kmod}/bin/modinfo nvidia | ${pkgs.ripgrep}/bin/rg '^version:' | ${pkgs.gawk}/bin/awk '{print $2}')
    log "Module version: $MODULE_VERSION"

    # Get the expected version from the package
    EXPECTED_VERSION="${nvidiaPackage.version}"
    log "NVIDIA package version: $EXPECTED_VERSION"

    # Check if versions match
    if [ "$EXPECTED_VERSION" != "$MODULE_VERSION" ]; then
      log "Version mismatch detected. Attempting to fix..."

      # Check for running processes using the GPU
      GPU_PROCS=$(${pkgs.psmisc}/bin/fuser -v /dev/nvidia* 2>/dev/null || true)
      if [ -n "$GPU_PROCS" ]; then
        log "WARNING: Processes are using the GPU. This may prevent module unloading:"
        log "$GPU_PROCS"
        # Optional: add logic to kill or stop these processes
      fi

      # Define the correct unload order for NVIDIA modules
      MODULES_TO_UNLOAD="nvidia_drm nvidia_uvm nvidia_modeset nvidia"

      # Try to unload modules in the correct order
      for module in $MODULES_TO_UNLOAD; do
        if ${pkgs.kmod}/bin/lsmod | ${pkgs.ripgrep}/bin/rg -q "^$module "; then
          log "Unloading $module..."
          if ! ${pkgs.kmod}/bin/rmmod $module; then
            log "Failed to unload $module. Services might be using the GPU."
            exit 1
          fi
        else
          log "Module $module not loaded, skipping."
        fi
      done

      # Force the module path when loading to ensure we get the new modules
      log "Loading NVIDIA modules with correct version..."
      KERNEL_VERSION="${kernelVersion}"

      # Use insmod directly with the full path to the new modules if possible
      if [ -d "$NVIDIA_MODULE_PATH" ]; then
        log "Looking for modules in $NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc"
        if [ -f "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia.ko" ]; then
          ${pkgs.kmod}/bin/insmod "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia.ko"
          if [ -f "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia-modeset.ko" ]; then
            ${pkgs.kmod}/bin/insmod "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia-modeset.ko"
          fi
          if [ -f "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia-uvm.ko" ]; then
            ${pkgs.kmod}/bin/insmod "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia-uvm.ko"
          fi
          if [ -f "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia-drm.ko" ]; then
            ${pkgs.kmod}/bin/insmod "$NVIDIA_MODULE_PATH/$KERNEL_VERSION/misc/nvidia-drm.ko"
          fi
        else
          log "Could not find nvidia.ko at expected path, falling back to modprobe"
          ${pkgs.kmod}/bin/modprobe nvidia
        fi
      else
        # untested branch
        log "Could not find the module directory at all, falling back to modprobe"
        # For modprobe to find the right modules, we need to tell modprobe where to look
        # Create a temporary modprobe.d config to point to the right nvidia driver location
        TEMP_MODPROBE_CONF=$(mktemp)
        echo "search $NVIDIA_MODULE_PATH/$KERNEL_VERSION" > "$TEMP_MODPROBE_CONF"
        ${pkgs.kmod}/bin/modprobe -C "$TEMP_MODPROBE_CONF" nvidia
        rm -f "$TEMP_MODPROBE_CONF"
      fi

      # Verify using nvidia-smi
      # Guessing based on the manpage, `modinfo` is not a reliable tool to *verify* the loaded kernel module without more tweaking
      # because it will search for modules the same way modprobe does
      if ${nvidiaPackage.bin}/bin/nvidia-smi --query-gpu=driver_version --format=csv,noheader &>/dev/null; then
        NVIDIA_SMI_VERSION_AFTER=$(${nvidiaPackage.bin}/bin/nvidia-smi --query-gpu=driver_version --format=csv,noheader)
        log "Successfully loaded NVIDIA modules. NVIDIA-SMI reports version: $NVIDIA_SMI_VERSION_AFTER"

        # Trigger udev to recreate device nodes
        log "Triggering udev to recreate NVIDIA device nodes..."
        ${pkgs.systemd}/bin/udevadm trigger --subsystem-match=nvidia
        ${pkgs.systemd}/bin/udevadm settle

        log "Module swap completed successfully."
        exit 0
      else
        log "Failed to load NVIDIA modules correctly. nvidia-smi cannot communicate with the driver."
        log "Current modules loaded from: $(${pkgs.kmod}/bin/modinfo nvidia | ${pkgs.ripgrep}/bin/rg '^filename:' | ${pkgs.gawk}/bin/awk '{print $2}')"
        exit 1
      fi
    else
      log "NVIDIA versions match. No action needed."
    fi

    exit 0
  '');
in
{
  environment.systemPackages = [
    nvidia-check-swap-kernel-modules
  ];

  # Modify the nvidia-persistenced service
  systemd.services.nvidia-persistenced = {
    # Keep the existing configuration and layer our hooks on top
    serviceConfig = {
      # Add our script as a pre-start check
      ExecStartPre = [
        "${pkgs.bash}/bin/bash -c '${pkgs.coreutils}/bin/sleep 1'" # Small delay to ensure devices are ready
        "${nvidia-check-swap-kernel-modules}/bin/nvidia-check-swap-kernel-modules"
      ];
      # Add additional logging for failures
      ExecStopPost = [
        "+${pkgs.bash}/bin/bash -c 'if [ \"$EXIT_STATUS\" != \"0\" ]; then ${pkgs.systemd}/bin/systemd-cat -t nvidia-persistenced -p err echo \"Service failed with status $EXIT_STATUS\"; fi'"
      ];
    };

    # Rate-limit repeated start attempts
    startLimitIntervalSec = 60;
    startLimitBurst = 3;
  };
}
<3