The Zombie in My GPU: One Friday Night, a Custom NVIDIA Driver, and the Bug Nobody Else Has
Linux & Open Source • Jun 20, 2026 • 9 min read

A bleeding-edge stack — Arrow Lake-S, kernel 7.0, the NVIDIA open kernel module, KDE Plasma on Wayland — that slept fine and woke up dead one resume in four. This is the war story of the night I stopped waiting for a fix that was never coming, built a custom 610.43.02 driver against my own kernel, registered it with DKMS, and watched the zombie not come back. Honest about the dead-ends, the netconsole blind spot, and the three failed installs along the way.

It started, as these things do, with a window manager dying mid-task and a half-joking thought that I had been attacked by zombies and had to restart.

By 5am the next morning I had compiled and installed a custom NVIDIA driver that Ubuntu does not ship, registered it with DKMS so it would survive future kernel updates, and watched the machine wake cleanly from a five-and-a-half-hour sleep that used to freeze it solid. This is the story of the night in between — and of how deep you have to dig before Linux finally shows you its seams.

If you have already read the companion piece, Why KDE Plasma Survives the NVIDIA Linux Suspend Mess, this is the follow-up: the night the bug behind all of that finally got beaten on my own hardware. That article explained why your compositor decides whether the bug ruins your day. This one is what happened when I stopped relying on the compositor to paper over it and went after the driver itself.


The machine, and why I am alone out here

My workstation is an early adopter’s dream and a stability engineer’s nightmare, all at once:

  • Intel Core Ultra 5 245K — Arrow Lake-S, a platform only months old
  • MSI PRO B860-P — one of the newest chipsets going
  • Kernel 7.0 — bleeding-edge mainline, not the battle-tested LTS most people run
  • Hybrid graphics — the Arrow Lake integrated GPU and an NVIDIA RTX 3060
  • The NVIDIA open kernel module — newer and less hammered-on than the decade-old proprietary blob
  • KDE Plasma on Wayland, suspending to RAM several times a day

Each of those layers, alone, has a healthy user base. The intersection, that exact six-way stack, is realistically a few hundred to a couple thousand people on the planet right now. And the slice of those who would trace a resume hang to a GPU firmware race and build tooling to catalogue it is a rounding error.

That matters, because it explains why the bug was still there. Bugs survive in the dark when their victims are too few, or too non-technical, to file a clean report. I could not change the first. I could be the exception to the second.


The symptom: a machine that sleeps fine and wakes up dead

On resume from suspend, one of three things happened:

  1. Clean — rare and beautiful.
  2. Degraded — the desktop shell, plasmashell, crashed the instant the screen came back. It respawned on its own a second later, so the desktop returned, but the crash dialog was a daily ritual.
  3. Zombie — roughly one resume in four, the entire machine wedged. Black screen, fans spinning, nothing alive. The kernel log simply ended mid-sentence at PM: suspend entry. The only way out was holding the power button — an ugly, filesystem-risking hard reset, several times a week.

The zombie was the real enemy. The crash dialog was just the enemy’s calling card.
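The three outcomes above can be told apart mechanically from the log of a single cycle. What follows is a minimal sketch of that idea, not the actual classifier from this hunt. The two `PM:` strings are the kernel's own suspend messages; the shell name check and the crash hints are assumptions that would need adjusting to whatever a plasmashell crash really leaves in your journal.

```python
from enum import Enum


class Outcome(Enum):
    CLEAN = "clean"
    DEGRADED = "degraded"
    ZOMBIE = "zombie"


SUSPEND_ENTRY = "PM: suspend entry"
SUSPEND_EXIT = "PM: suspend exit"
SHELL_NAME = "plasmashell"
# Hypothetical crash hints: adjust these to match your own logs.
CRASH_HINTS = ("segfault", "dumped core", "SIGABRT")


def classify_cycle(lines):
    """Classify one suspend cycle from its log lines, in order."""
    saw_entry = False
    saw_exit = False
    shell_crashed = False
    for line in lines:
        if SUSPEND_ENTRY in line:
            saw_entry = True
        elif SUSPEND_EXIT in line:
            saw_exit = True
        elif saw_exit and SHELL_NAME in line:
            # Only count shell crashes that happen after the resume.
            if any(hint in line for hint in CRASH_HINTS):
                shell_crashed = True
    if not saw_entry:
        raise ValueError("no suspend entry found in these lines")
    if not saw_exit:
        # The log stops at the entry: the machine never came back.
        return Outcome.ZOMBIE
    return Outcome.DEGRADED if shell_crashed else Outcome.CLEAN
```

Feeding it the lines of each boot's journal, split at every suspend entry, gives a per-cycle tally that can replace gut feeling with counts.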


Ruling out the obvious (and being wrong twice)

The first hours were demolition work — killing comfortable wrong answers:

  • The scary Permission denied and Atomic modeset test failed lines kwin printed on every suspend? Benign. They fired on the clean resumes too. I had pointed at them as the smoking gun early on; the data corrected me.
  • Maybe it was the RAM, maybe I should reseat it. No. A deterministic idle-suspend trigger plus a known driver-and-kernel combo is software, not contacts. RAM faults do not politely wait for PM: suspend entry.
  • My own pet theory — that anything over an hour goes zombie — fell apart against my own data: a 26-minute suspend once collapsed, and a six-and-a-half-hour one resumed perfectly clean. Duration was a loaded die, not a switch.

The most important thing I ruled out came from reading the driver source itself: the NVIDIA video-memory-preservation setting that everyone blames was already on and working. The compositor does not control it; the driver does, and it was doing its job. An entire class of internet advice, eliminated by reading the actual code instead of reciting lore. (Worth saying plainly: do not set PreserveVideoMemoryAllocations=1 on the open kernel module — that is the proprietary-driver knob, and on the open module it half-wires a suspend path that does not exist there. The open-module-correct mechanism is UseKernelSuspendNotifiers=1, which was already active.)
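Checking which of these knobs is live does not require guessing. The sketch below assumes the loaded driver exposes its parameters at `/proc/driver/nvidia/params` in a `Name: value` form, and that the two settings appear there under the names used above. Both are assumptions to verify on your own machine; the helper names are made up for illustration.

```python
from pathlib import Path

# Assumed location of the loaded driver's parameter dump.
PARAMS_PATH = Path("/proc/driver/nvidia/params")


def parse_params(text):
    """Parse 'Name: value' lines into a dict of strings."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            params[key.strip()] = value.strip()
    return params


def check_suspend_params(params):
    """Return a list of problems for an open-module setup (empty if fine)."""
    problems = []
    if params.get("PreserveVideoMemoryAllocations") == "1":
        problems.append("PreserveVideoMemoryAllocations is set; "
                        "that is the proprietary-driver knob")
    if params.get("UseKernelSuspendNotifiers") != "1":
        problems.append("UseKernelSuspendNotifiers is not active")
    return problems


def check_live_system():
    """Read the live parameters, or return None if the file is absent."""
    if not PARAMS_PATH.exists():
        return None
    return check_suspend_params(parse_params(PARAMS_PATH.read_text()))
```

An empty list from `check_live_system()` would match the state described above; `None` means the file was not found and the assumption about its location does not hold.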


The capture rig, and the limit of clever

To see what the kernel said in its dying moment, I set up netconsole — a kernel feature that streams the log over the network as UDP, straight from the kernel, before anything touches disk. I pointed it at three NICs on my NAS, with a tiny rootless Python listener catching the packets.

It worked beautifully. And it taught me its own hard limit: there is a roughly seven-second blind spot on resume, because the network card is still asleep when the GPU bring-up messages — the ones I needed — get printed. The death itself happens inside the blind window. Sometimes the cleverest instrument tells you mostly that you need a different instrument.
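For readers who want the receiving end, here is a minimal sketch of a rootless listener of this kind. It is not the listener from this hunt. It assumes netconsole is targeting UDP port 6666 (netconsole's default target port, and above 1024, so no root is needed to bind it); the log file name is a placeholder.

```python
import socket
import time


def format_record(payload, sender, now=None):
    """Turn one netconsole datagram into a timestamped log line."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    return f"{stamp} {sender} {text}"


def listen(host="0.0.0.0", port=6666, log_path="netconsole.log"):
    """Append every datagram received on (host, port) to log_path, forever."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(log_path, "a", buffering=1) as log:  # line-buffered
        sock.bind((host, port))
        while True:
            payload, (sender, _port) = sock.recvfrom(65535)
            log.write(format_record(payload, sender) + "\n")
```

The timestamp is added on the receiving side on purpose: gaps between consecutive lines are how a blind window like the one above shows up in the capture.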


The fix: build the driver Ubuntu would not give me

Here is the part Windows users do not get to have. The NVIDIA open kernel module is source-available. So when I traced the failure to the driver’s resume path and found that Ubuntu was frozen at the version I already had — 595.71.05, the newest in the repo — I did the thing a closed platform forbids: I went upstream, pulled a newer release NVIDIA had published but Ubuntu had not packaged (610.43.02), and built it against my own kernel.

It compiled clean. Then it refused to install — three times, each failure peeling back one layer:

  1. Secure Boot blocked the unsigned module. Off it went.
  2. The old driver was pinned by a background daemon, so the new module half-loaded into a version mismatch — a 610 nvidia.ko next to a 595 nvidia-modeset.ko. Stopped the daemon, unloaded the old stack first.
  3. The old driver’s files were still on disk, so the loader kept mixing new and old modules. Purged the distribution driver entirely — after simulating the purge to prove it would not drag the desktop, kernel, or Xorg core out with it — and installed clean.
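The purge simulation in step 3 can be scripted. This is a sketch under stated assumptions rather than the script that was actually used: it relies on `apt-get --simulate purge` listing each affected package on a line starting with `Remv` or `Purg`, and the protected prefixes are a hypothetical list to replace with whatever your desktop cannot live without.

```python
import subprocess

# Hypothetical list of things a driver purge must never take with it.
PROTECTED_PREFIXES = ("linux-image", "plasma-desktop", "kwin", "xserver-xorg-core")


def packages_removed(simulation_output):
    """Extract package names from the Remv/Purg lines of a simulated run."""
    removed = []
    for line in simulation_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in ("Remv", "Purg"):
            removed.append(parts[1])
    return removed


def collateral(removed, protected=PROTECTED_PREFIXES):
    """Return the removed packages that match a protected prefix."""
    return [pkg for pkg in removed if pkg.startswith(protected)]


def simulate_purge(pattern):
    """Run a simulated purge; nothing is changed on disk."""
    result = subprocess.run(
        ["apt-get", "--simulate", "purge", pattern],
        capture_output=True, text=True, check=True,
    )
    return packages_removed(result.stdout)
```

If `collateral(simulate_purge(...))` comes back non-empty, the real purge should not be run until the dependency chain is understood.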

Each bounce was not a failure so much as a layer of the onion. The third attempt took: /proc/driver/nvidia/version read 610.43.02, both modules matched, nvidia-smi came back healthy, the desktop returned.
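The half-loaded state from step 2 is the kind of thing worth checking for explicitly. A minimal sketch, assuming each loaded module reports its version under `/sys/module/<name>/version` (true for modules that declare one, but verify on your system); the function names are illustrative.

```python
from pathlib import Path

# sysfs names use underscores, even where the .ko file uses a hyphen.
MODULES = ("nvidia", "nvidia_modeset")


def loaded_module_version(name):
    """Return the version a loaded module reports, or None if unavailable."""
    path = Path(f"/sys/module/{name}/version")
    return path.read_text().strip() if path.exists() else None


def find_mismatches(expected, versions):
    """Return the modules whose reported version differs from expected."""
    return {name: ver for name, ver in versions.items() if ver != expected}


def check_loaded_stack(expected, modules=MODULES):
    """Compare every listed module against the expected driver version."""
    versions = {name: loaded_module_version(name) for name in modules}
    return find_mismatches(expected, versions)
```

Calling `check_loaded_stack("610.43.02")` and getting an empty dict back corresponds to "both modules matched" above; a module that is not loaded at all shows up with `None`.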

Then the one step that turns a one-night hack into something that survives: I registered the driver with DKMS, so it rebuilds itself automatically on every future kernel update instead of silently breaking the next time the kernel bumps. The cost is that driver updates are now manual — off the apt track — but that is a fair trade for a module the distro will never ship.
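In outline, DKMS registration is three commands: add, build, install. The sketch below only assembles them. It assumes the module source already sits where DKMS expects it (conventionally `/usr/src/<module>-<version>`) with a working `dkms.conf`, and the module name `nvidia` in the usage note is an assumption about how the tree was named, not something confirmed here.

```python
import subprocess


def dkms_commands(module, version):
    """Build the add/build/install command lines for one module version."""
    ident = ["-m", module, "-v", version]
    return [
        ["dkms", "add", *ident],
        ["dkms", "build", *ident],
        ["dkms", "install", *ident],
    ]


def register(module, version, dry_run=True):
    """Print the commands, or run them in order (needs root) if dry_run is False."""
    for cmd in dkms_commands(module, version):
        if dry_run:
            print(" ".join(cmd))
        else:
            # Stop at the first failing step instead of installing a bad build.
            subprocess.run(cmd, check=True)
```

`register("nvidia", "610.43.02")` prints the three commands for review; passing `dry_run=False` executes them. Afterwards `dkms status` should list the module as installed for the running kernel.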


The verdict

I left it asleep overnight as the final test — a long suspend being exactly the condition that used to kill it. In the morning:

23:35 suspend entry  →  05:05 suspend exit  →  clean

Five and a half hours of deep sleep, woken without a hitch. The original zombie was an 83-minute sleep that never woke. This one lasted about four times as long, and it came back like nothing happened. No hung tasks, no GPU faults, no hard reset.
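The arithmetic behind that comparison, as a small sketch. The only inputs are the two timestamps from the log line above and the 83-minute figure; the helper name is made up.

```python
from datetime import datetime, timedelta


def sleep_minutes(entry, exit_):
    """Minutes between two HH:MM wall-clock times, allowing one midnight wrap."""
    fmt = "%H:%M"
    start = datetime.strptime(entry, fmt)
    end = datetime.strptime(exit_, fmt)
    if end <= start:
        end += timedelta(days=1)  # the sleep crossed midnight
    return int((end - start).total_seconds() // 60)


minutes = sleep_minutes("23:35", "05:05")  # 330 minutes, i.e. 5.5 hours
ratio = minutes / 83                       # roughly 3.98 times the original zombie
```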

I am being deliberately careful with the word fixed. One clean overnight is a very strong signal, not a proof — a bug that fired one resume in four can hide behind a single good night. I want a run of clean cycles before I carve solved into stone. But the change in character is unmistakable, and the kernel-level GSP firmware recovery that used to be load-bearing now reads clean on every cycle, which is exactly what you would expect if the root cause was the thing I just replaced.
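How long a run is long enough can be estimated. The sketch below makes one loud simplifying assumption: that every resume fails independently with the same probability, taken here as the observed one in four. The duration data earlier suggests the real odds are not that uniform, so treat the result as a rough guide only.

```python
import math


def chance_of_clean_streak(p_fail, n):
    """Probability that n resumes in a row are clean if the bug is still there."""
    return (1 - p_fail) ** n


def cycles_needed(p_fail, max_chance):
    """Smallest streak length whose chance of being pure luck is <= max_chance."""
    return math.ceil(math.log(max_chance) / math.log(1 - p_fail))


one_night = chance_of_clean_streak(0.25, 1)  # 0.75: one clean resume proves little
needed = cycles_needed(0.25, 0.05)           # 11 clean cycles for under 5% luck
```

Under that assumption, a single clean night would still happen three times out of four with the bug fully present, and it takes eleven clean cycles in a row before luck falls below five percent.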

The plasmashell crash, it turns out, is a separate and lesser thing. It persisted across both drivers — the 595 module aborted with a lost GL context, the 610 module segfaulted instead, but the shell still died on resume either way. Persisting across two very different drivers is the tell: this one is KDE-side, not the driver. KWin underneath it survives intact every cycle, so the display always returns; the crash is cosmetic and auto-respawns. Annoying, and a problem for another day.


What I actually learned

Linux is not one machine. It is a stack of layers — firmware, kernel, driver, compositor, shell — each handling failure in its own way, and you do not see the seams until you push a layer to its breaking point. The compositor survived what the shell could not. The kernel survived once the driver was right. Every layer chose where to be fragile.

Two things made the difference between a bricked machine and a fixed one:

  • Reversibility. Every dangerous step had a tested way back before I took it — a recovery script that reinstalls the stock driver and rebuilds the initramfs, sitting ready on disk the whole time. That is the entire reason being bold was safe.
  • Openness. On a closed platform, the story ends at file a ticket and pray. On Linux, the source was on disk, the toolchain was installed, and the fix was mine to build.

I am not unlucky. I am standing where the road is not paved yet — and on Linux, you are allowed to pave it.


Tooling and sources

The suspend-cycle classifier, the netconsole capture rig, and the install and recovery scripts from this hunt are open source, alongside the full case study with the kernel logs and version details.

By Mahmud Farooque