Bug 1174204

Summary: NVIDIA driver after update 440.100 --> 450.57 fails due to remaining old kernel modules
Product: [openSUSE] openSUSE Distribution Reporter: Matthias Bach <marix>
Component: X11 3rd Party Driver Assignee: Stefan Dirsch <sndirsch>
Status: RESOLVED FIXED QA Contact: Stefan Dirsch <sndirsch>
Severity: Normal    
Priority: P3 - Medium CC: george.spiggott, jamesrome, marix, valurolafsson, vkrevs
Version: Leap 15.2   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: Result of nvidia-bug-report.sh

Description Matthias Bach 2020-07-16 11:12:51 UTC
After applying the NVIDIA driver update to 450.57 I end up with an unusable NVIDIA driver and X.org falling back to 1024p with software rendering. The journal shows the following:

NVRM: API mismatch: the client has the version 450.57, but
NVRM: this kernel module has the version 440.100.  Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
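
The two version numbers in this message can be pulled out with a small filter. This is only a minimal sketch: the function name nvrm_versions is hypothetical, and it assumes the kernel log is piped in on stdin with the wording shown above.

```shell
# Hypothetical helper: extract the client (userspace) and kernel-module
# versions from an NVRM "API mismatch" message read on stdin.
nvrm_versions() {
    awk '
        # The token following the word "version" is the version number.
        /client has the version/        { for (i = 1; i < NF; i++) if ($i == "version") c = $(i + 1) }
        /kernel module has the version/ { for (i = 1; i < NF; i++) if ($i == "version") m = $(i + 1) }
        END {
            # Strip the trailing comma or full stop of the sentence.
            sub(/[.,]$/, "", c); sub(/[.,]$/, "", m)
            if (c != "" && m != "") print "client=" c " module=" m
        }'
}
```

It could be fed from the kernel messages of the current boot, for example with `journalctl -k -b | nvrm_versions`.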

However, in the meantime I have completely removed the driver, rebooted with Nouveau (which I am also using to write this), and reinstalled the driver, so it is completely unclear to me where the old kernel module could come from. Zypper also claims that all my packages are on the same version:

S  | Name                       | Type       | Version                             | Arch   | Repository
---+----------------------------+------------+-------------------------------------+--------+------------------------
   | nvidia-computeG04          | package    | 390.138-lp152.14.1                  | x86_64 | nVidia Graphics Drivers
i+ | nvidia-computeG05          | package    | 450.57-lp152.28.1                   | x86_64 | nVidia Graphics Drivers
   | nvidia-firmware-installer  | package    | 1.1-lp152.1.1                       | noarch | hardware
   | nvidia-firmware-installer  | srcpackage | 1.1-lp152.1.1                       | noarch | hardware
   | nvidia-gfxG04-kmp-default  | package    | 390.138_k5.3.18_lp152.19-lp152.14.1 | x86_64 | nVidia Graphics Drivers
   | nvidia-gfxG04-kmp-preempt  | package    | 390.138_k5.3.18_lp152.19-lp152.14.1 | x86_64 | nVidia Graphics Drivers
i+ | nvidia-gfxG05-kmp-default  | package    | 450.57_k5.3.18_lp152.19-lp152.28.1  | x86_64 | nVidia Graphics Drivers
   | nvidia-gfxG05-kmp-preempt  | package    | 450.57_k5.3.18_lp152.19-lp152.28.1  | x86_64 | nVidia Graphics Drivers
   | nvidia-glG04               | package    | 390.138-lp152.14.1                  | x86_64 | nVidia Graphics Drivers
i+ | nvidia-glG05               | package    | 450.57-lp152.28.1                   | x86_64 | nVidia Graphics Drivers
   | nvidia-texture-tools       | package    | 2.0.8-lp152.3.9                     | x86_64 | Haupt-Repository (OSS)
   | pcp-pmda-nvidia-gpu        | package    | 4.3.1-lp152.4.3                     | x86_64 | Haupt-Repository (OSS)
   | skelcd-EULA-NVIDIA-compute | package    | 2020.05.04-lp152.1.1                | x86_64 | Haupt-Repository (OSS)
   | x11-video-nvidiaG04        | package    | 390.138-lp152.14.1                  | x86_64 | nVidia Graphics Drivers
i+ | x11-video-nvidiaG05        | package    | 450.57-lp152.28.1                   | x86_64 | nVidia Graphics Drivers

I have tried explicitly running mkinitrd but this did not change the situation.

One additional thing I ran into is that when removing the driver to switch to Nouveau, the system still behaves as before the uninstallation and lsmod will show the Nvidia driver still being loaded after reboot:

nvidia_drm             53248  0
nvidia_modeset       1118208  1 nvidia_drm
nvidia              20721664  1 nvidia_modeset
ipmi_msghandler        69632  1 nvidia
drm_kms_helper        229376  2 nvidia_drm,nouveau
drm                   544768  5 drm_kms_helper,nvidia_drm,ttm,nouveau

Only explicitly invoking mkinitrd will actually cause the Nvidia driver not to be loaded on boot and provide me with a working Nouveau driver.
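
To see at a glance which GPU driver modules are loaded after a reboot, the lsmod output can be filtered. A minimal sketch, with a hypothetical function name, assuming lsmod output is piped in on stdin:

```shell
# Hypothetical helper: read `lsmod` output on stdin and print the names of
# loaded GPU driver modules (anything starting with nvidia or nouveau).
loaded_gpu_modules() {
    awk '$1 ~ /^(nvidia|nouveau)/ { print $1 }' | LC_ALL=C sort
}
```

Usage would be `lsmod | loaded_gpu_modules`. Assuming a dracut-based initrd, `lsinitrd | grep -i nvidia` should similarly show whether the NVIDIA modules are still packed into the initrd, which would explain why only a fresh mkinitrd run gets rid of them.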
Comment 1 Stefan Dirsch 2020-07-16 13:02:25 UTC
It seems the kernel module build of 450 failed, or the 440 module is being preferred for some reason. I suggest uninstalling the nvidia-gfxG05-kmp-default package and removing all remaining nvidia modules below /lib/modules:

cd /lib/modules
find . -name 'nvidia*.ko' -print | xargs rm

and then reinstall the nvidia-gfxG05-kmp-default package. Then check the result with:

find /lib/modules -name 'nvidia*.ko'
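
The check can be wrapped in a small function that takes the modules root as an argument, so it can also be tried on a scratch directory. This is a sketch with a hypothetical function name. The pattern is quoted because an unquoted nvidia*.ko would be expanded by the shell if a matching file exists in the current directory.

```shell
# Hypothetical helper: list every nvidia*.ko below a modules root
# (default: /lib/modules), one path per line.
list_nvidia_modules() (
    root=${1:-/lib/modules}
    # Quoted pattern: find, not the shell, expands the wildcard.
    find "$root" -name 'nvidia*.ko' | LC_ALL=C sort
)
```

After the cleanup and before the reinstall, `list_nvidia_modules` should print nothing.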
Comment 2 Stefan Dirsch 2020-07-16 13:21:56 UTC
And if it still doesn't work, please also attach the output of nvidia-bug-report.sh.
Comment 3 Mark Scott 2020-07-16 15:30:06 UTC
Dear Stefan, I too have been hit by the same issue as above, and your solution worked for me.

Thanks
Comment 4 Matthias Bach 2020-07-16 16:51:14 UTC
Created attachment 839785 [details]
Result of nvidia-bug-report.sh

Sadly this didn't fix the issue for me.

One interesting thing I noted: Before removing the modules I had a /lib/modules/5.3.18-lp152.20.7-default/updates/nvidia.ko, along with many modules for Leap 15.1 and 15.2 kernels.

After removing all modules and running the driver installation I have /lib/modules/5.3.18-lp152.19-default/updates/nvidia.ko. So I did actually have a module with a higher version number lying around.
Comment 5 James Rome 2020-07-16 16:57:51 UTC
I have this same issue in 15.2. I get no graphics at all.
The NVidia drivers got updated. Now I cannot activate them with
# prime-select nvidia
It says it cannot query the GPU. 
I uninstalled and reinstalled the packages, and prime-select still fails.
Help please.
Comment 6 James Rome 2020-07-16 16:58:43 UTC
Can we delete all the 4.4 and 4.12 files in /lib/modules?
Comment 7 James Rome 2020-07-16 17:00:06 UTC
(In reply to James Rome from comment #6)
> Can we delete all the 4.4 and 4.12 files in /lib/modules?

And, I do not have an nvidia file in /lib/modules:
drwxr-xr-x 1 root root  14 Aug 18  2018 4.12.14-lp150.12.10-default
drwxr-xr-x 1 root root  14 Oct  8  2018 4.12.14-lp150.12.13-default
drwxr-xr-x 1 root root  14 Oct 16  2018 4.12.14-lp150.12.16-default
drwxr-xr-x 1 root root  14 Nov  7  2018 4.12.14-lp150.12.19-default
drwxr-xr-x 1 root root  14 Dec 15  2018 4.12.14-lp150.12.22-default
drwxr-xr-x 1 root root  14 Jan 17  2019 4.12.14-lp150.12.25-default
drwxr-xr-x 1 root root  14 Feb 19  2019 4.12.14-lp150.12.28-default
drwxr-xr-x 1 root root  24 Aug  7  2018 4.12.14-lp150.12.4-default
drwxr-xr-x 1 root root  14 Apr 12  2019 4.12.14-lp150.12.45-default
drwxr-xr-x 1 root root  14 May 16  2019 4.12.14-lp150.12.48-default
drwxr-xr-x 1 root root  14 May 27  2019 4.12.14-lp150.12.58-default
drwxr-xr-x 1 root root  14 Jun 17  2019 4.12.14-lp150.12.61-default
drwxr-xr-x 1 root root  14 Aug 18  2018 4.12.14-lp150.12.7-default
drwxr-xr-x 1 root root  14 Sep 22  2019 4.12.14-lp151.28.10-default
drwxr-xr-x 1 root root  14 Oct 10  2019 4.12.14-lp151.28.13-default
drwxr-xr-x 1 root root  14 Oct 30  2019 4.12.14-lp151.28.16-default
drwxr-xr-x 1 root root  14 Nov 13  2019 4.12.14-lp151.28.20-default
drwxr-xr-x 1 root root  14 Dec  9  2019 4.12.14-lp151.28.25-default
drwxr-xr-x 1 root root  14 Mar  8 10:06 4.12.14-lp151.28.32-default
drwxr-xr-x 1 root root  14 Mar 25 18:30 4.12.14-lp151.28.36-default
drwxr-xr-x 1 root root  14 Jul 16  2019 4.12.14-lp151.28.4-default
drwxr-xr-x 1 root root  14 Apr 20 11:14 4.12.14-lp151.28.40-default
drwxr-xr-x 1 root root  14 Jun 11 15:02 4.12.14-lp151.28.44-default
drwxr-xr-x 1 root root  14 Jul  3 10:36 4.12.14-lp151.28.48-default
drwxr-xr-x 1 root root  14 Jul  3 12:44 4.12.14-lp151.28.52-default
drwxr-xr-x 1 root root  14 Aug 11  2019 4.12.14-lp151.28.7-default
drwxr-xr-x 1 root root 278 Jul 30  2017 4.4.27-2-default
drwxr-xr-x 1 root root 278 May 26  2018 4.4.76-1-default
drwxr-xr-x 1 root root 292 Jul 16 12:53 5.3.18-lp152.19-default
drwxr-xr-x 1 root root 292 Jul 16 12:53 5.3.18-lp152.19-preempt
drwxr-xr-x 1 root root 462 Jul 16 12:53 5.3.18-lp152.20.7-default
drwxr-xr-x 1 root root 292 Jul 15 18:23 5.3.18-lp152.20.7-preempt
drwxr-xr-x 1 root root 484 Jul 16 12:53 5.3.18-lp152.26-default
drwxr-xr-x 1 root root 314 Jul 15 18:19 5.3.18-lp152.26-preempt
Comment 8 James Rome 2020-07-16 17:06:26 UTC
I wish this was editable. There are NVidia modules in /lib/modules/5.3.18-lp152.19-preempt/updates. But surely /lib/modules/5.3.18-lp152.26-preempt/updates would be newer, but nothing is there.
Comment 9 Matthias Bach 2020-07-16 17:30:10 UTC
(In reply to Matthias Bach from comment #4)
> Sadly this didn't fix the issue for me.

I just realised I failed. I only ran `find /lib/modules -name nvidia.ko -delete`. Will retry with `find /lib/modules -name 'nvidia*.ko' -delete`.
Comment 10 Matthias Bach 2020-07-16 17:44:19 UTC
(In reply to Matthias Bach from comment #9)
> (In reply to Matthias Bach from comment #4)
> > Sadly this didn't fix the issue for me.
> 
> I just realised I failed. I only ran `find /lib/modules -name nvidia.ko
> -delete`. Will retry with `find /lib/modules -name 'nvidia*.ko' -delete`.

So doing this properly does fix the issue. Thanks!

Still weird that I had /lib/modules/5.3.18-lp152.20.7-default/updates/nvidia*.ko though when the current package builds /lib/modules/5.3.18-lp152.19-default/updates/nvidia*.ko which now gets linked from /lib/modules/5.3.18-lp152.20.7-default/weak-updates/updates/nvidia*.ko.
Comment 11 Stefan Dirsch 2020-07-16 18:19:52 UTC
(In reply to Matthias Bach from comment #10)
> So doing this properly does fix the issue. Thanks!

Good!

> Still weird that I had
> /lib/modules/5.3.18-lp152.20.7-default/updates/nvidia*.ko though 

So I assume these were still the 440.100 ones, which for some reason weren't removed when the old package was uninstalled.

> when the current package builds
> /lib/modules/5.3.18-lp152.19-default/updates/nvidia*.ko 

That's correct.

> which now gets  linked from
> /lib/modules/5.3.18-lp152.20.7-default/weak-updates/updates/nvidia*.ko.

That's how it is supposed to be: symlinks are created for all kernels sharing the same kABI. This is our weak-updates concept.
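
To tell real modules from weak-updates symlinks, something like the following could be used. It is a minimal sketch with a hypothetical function name. It only inspects the file type and the link target and makes no claim about how the weak-modules tooling itself works.

```shell
# Hypothetical helper: for each nvidia*.ko below a modules root (default:
# /lib/modules), report whether it is a real file or a symlink.
classify_nvidia_modules() (
    root=${1:-/lib/modules}
    find "$root" -name 'nvidia*.ko' | LC_ALL=C sort | while IFS= read -r path; do
        if [ -L "$path" ]; then
            echo "symlink $path -> $(readlink "$path")"
        else
            echo "file    $path"
        fi
    done
)
```

On a healthy system one would expect "file" entries only below the fixed kernel's updates directory and "symlink" entries below weak-updates for the other kernels.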
Comment 12 Stefan Dirsch 2020-07-16 18:23:35 UTC
@James Rome Please follow the instructions in comment #1. They make sure nothing is left below /lib/modules.
Comment 13 James Rome 2020-07-16 18:29:49 UTC
Yes, using find /lib/modules -name 'nvidia*.ko' -delete
and removing and reinstalling the drivers fixed it.
Comment 14 Stefan Dirsch 2020-07-16 18:34:40 UTC
Ok. So at least we have a workaround. But now I'm afraid this happens for everyone for this update 440.100 --> 450.57. :-(
Comment 15 Stefan Dirsch 2020-07-16 19:26:22 UTC
Now I know what happens. Up to 440.100, kernel modules were mistakenly rebuilt and installed for the kernel they were locally built against; currently this is 5.3.18-lp152.20.7. With 450.57 I switched back to our weak-modules concept: kernel modules are installed for a fixed kernel version (here 5.3.18-lp152.19, even if that kernel doesn't exist on the system), and weak-modules symlinks are then created for all other installed kernels.

Example               440.100 packages    450.57 packages
--------------------  ------------------  --------------------------------------------------
.19 fixed GA kernel   no kernel modules   450.57 modules
.20 build kernel      440.100 modules     440.100 modules (no weak symlinks created) ***
.85 another kernel    no kernel modules   weak symlinks to .19 fixed kernel (450.57 modules)

*** because modules with the same name already exist

As a fix I could remove the old modules before installing the new ones.
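
Leftover modules of the kind marked *** above can be located by looking for real (non-symlink) nvidia modules outside the fixed kernel's directory. This is a sketch only: the function name is hypothetical, and the name of the fixed kernel directory has to be passed in by the caller.

```shell
# Hypothetical helper: print real (non-symlink) nvidia*.ko files that live
# outside the fixed kernel directory.
# Usage: stale_nvidia_modules FIXED_KERNEL_DIR_NAME [MODULES_ROOT]
stale_nvidia_modules() (
    fixed=$1
    root=${2:-/lib/modules}
    # -type f skips symlinks; ! -path excludes the fixed kernel's own tree.
    find "$root" -type f -name 'nvidia*.ko' ! -path "$root/$fixed/*" | LC_ALL=C sort
)
```

For the situation in this report the call would be `stale_nvidia_modules 5.3.18-lp152.19-default`. Any path it prints is a candidate for an old module that prevents the symlink from being created.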
Comment 16 Stefan Dirsch 2020-07-16 21:13:24 UTC
Fixed, and packages pushed to NVIDIA. Until this update is available, consider the following a reliable workaround:

rpm -e nvidia-gfxG05-kmp-default --nodeps
find /lib/modules -name 'nvidia*.ko' -delete
zypper in nvidia-gfxG05-kmp-default
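
For anyone who wants to review the steps before running them as root, the three commands can be wrapped in a function that defaults to a dry run. This is a sketch under assumptions: the function name and the DRY_RUN variable are hypothetical and not part of any package.

```shell
# Hypothetical wrapper around the three workaround commands.
# With DRY_RUN=1 (the default here) it only prints what it would run;
# set DRY_RUN=0 and run as root to actually execute the steps.
nvidia_kmp_workaround() (
    pkg=${1:-nvidia-gfxG05-kmp-default}
    run() {
        if [ "${DRY_RUN:-1}" = 1 ]; then
            # Note: the printed form drops the shell quoting.
            echo "+ $*"
        else
            "$@"
        fi
    }
    # Stop at the first step that fails.
    run rpm -e "$pkg" --nodeps &&
    run find /lib/modules -name 'nvidia*.ko' -delete &&
    run zypper in "$pkg"
)
```

Calling `nvidia_kmp_workaround` prints the three steps; `DRY_RUN=0 nvidia_kmp_workaround` executes them.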

Fixed packages contain the following RPM changelog:

Thu Jul 16 19:36:52 UTC 2020 - Stefan Dirsch <sndirsch@suse.com>

- remove still existing old kernel modules during installation of
  new modules, since otherwise weak-modules doesn't work 
  (boo#1174204)