Bug 1131028

Summary: Nvidia rpm fails to build nvidia.ko, yet claims success
Product: [openSUSE] openSUSE Distribution
Reporter: Carlos Robinson <carlos.e.r>
Component: X11 3rd Party Driver
Assignee: Stefan Dirsch <sndirsch>
Status: RESOLVED FIXED
QA Contact: Stefan Dirsch <sndirsch>
Severity: Enhancement
Priority: P4 - Low
CC: carlos.e.r, stschoettl
Version: Leap 15.0
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Bug Depends on:
Bug Blocks: 1145316
Attachments: var/log/zypp/history section

Description Carlos Robinson 2019-03-29 17:20:46 UTC
Due to a bad link, the nvidia rpm (installed via *yast*) failed to build nvidia.ko (and nvidia-uvm.ko). Make fails, yet rpm continues with the installation, runs dracut, and does not inform the user of the problem, which causes graphics mode to fail completely.

Only by installing the rpm manually (via rpm --install --force) and pausing the process at the right place could I see what the problem was:

Telcontar:/data/storage_c/repositorios_zypp/15_0/download.nvidia.com-leap/x86_64
# rpm --install --force nvidia-gfxG03-kmp-default-340.107_k4.12.14_lp150.11-lp150.12.1.x86_64.rpm
make: *** /usr/src/linux-obj/x86_64/default: No such file or directory.  Stop.  <----------
make: *** /usr/src/linux-obj/x86_64/default: No such file or directory.  Stop.  <----------
/usr/src/kernel-modules/nvidia-340.107-default /
NVIDIA: calling KBUILD...
make[1]: *** /lib/modules//source: No such file or directory.  Stop.
NVIDIA: left KBUILD.
nvidia.ko failed to build!  <----------
make: *** [Makefile:192: nvidia.ko] Error 1  <=========
/
install: cannot stat '/usr/src/kernel-modules/nvidia-340.107-default/nvidia.ko': No such file or directory
depmod: WARNING: could not open modules.order at /lib/modules/4.12.14-lp150.11-default: No such file or directory
depmod: WARNING: could not open modules.builtin at /lib/modules/4.12.14-lp150.11-default: No such file or directory

Modprobe blacklist files have been created at /etc/modprobe.d to prevent
Nouveau from loading. This can be reverted by deleting
/etc/modprobe.d/nvidia-*.conf.

... then it goes on with dracut, many hundreds of lines that I did not copy.
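
A minimal sketch of how such errors could be located without pausing the install, assuming the rpm output has been redirected to a file. The function name and the log path are hypothetical; the patterns are taken only from the failure lines shown above.

```shell
# Hypothetical helper: scan a captured install log for the failure markers
# seen in the output above. The function name and log path are illustrative.
scan_build_log() {
    log=$1
    # -E: extended regex; -n: print line numbers so each hit is easy to locate.
    # Matches "... failed to build!" and make errors such as "make[1]: *** ...".
    grep -nE 'failed to build!|^make(\[[0-9]+\])?: \*\*\* ' "$log"
}

# Usage (as root), capturing stdout and stderr of the manual install:
#   rpm --install --force nvidia-gfxG03-kmp-default-*.rpm > /tmp/nvidia-kmp.log 2>&1
#   scan_build_log /tmp/nvidia-kmp.log && echo "module build reported errors"
```

grep exits with status 0 only when at least one line matches, so the function's return value can be used directly as a "build failed" test.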
Comment 1 Carlos Robinson 2019-03-29 17:22:51 UTC
Note that the rpm was built by openSUSE. This is not a driver problem but an rpm issue, so this is the correct place to report it.

http://http.download.nvidia.com/opensuse/README
Comment 2 Stefan Dirsch 2019-04-01 12:16:28 UTC
So you made an update of the NVIDIA RPMs or what exactly?

Which kernel are you currently running? 

  --> uname -r

Which kernel packages are installed?

  --> rpm -qa | grep kernel

Which nvidia packages are installed?

 --> rpm -qa | grep -i nvidia
Comment 3 Carlos Robinson 2019-04-01 12:38:55 UTC
First I installed the 15.1 kernel and the 15.1 Nvidia driver, which failed (Bug 1131029). That Nvidia rpm did not report the failure to build the kernel module.

Then I reverted. I uninstalled the 15.1 kernel and Nvidia modules, and forced a reinstall of the 15.0 Nvidia modules. The Nvidia rpm again failed to build the kernel module (for a different reason) but kept silent about the error. I tracked down the error and solved it: basically, the kernel rpms of 15.0 have to be reinstalled to create the appropriate symlinks.

The problem reported here, however, is the failure of the Nvidia rpms to abort on a Make error and report the issue. The rpm continues, runs dracut, and claims success when it has in fact failed.

Why Make failed is irrelevant now.

cer@Telcontar:~> uname -r
4.12.14-lp150.12.48-default
cer@Telcontar:~> rpm -qa | grep kernel
kernel-source-4.12.14-lp150.12.45.1.noarch
texlive-l3kernel-2017.133.svn44483-lp150.5.4.noarch
kernel-source-4.12.14-lp150.12.48.1.noarch
texlive-l3kernel-doc-2017.133.svn44483-lp150.5.4.noarch
kernel-syms-4.12.14-lp150.12.48.1.x86_64
kernel-devel-4.12.14-lp150.12.48.1.noarch
kernel-firmware-20190118-lp150.2.12.1.noarch
kernel-devel-4.12.14-lp150.12.45.1.noarch
kernel-default-devel-4.12.14-lp150.12.48.1.x86_64
kernel-docs-4.12.14-lp150.12.48.1.noarch
nfs-kernel-server-2.1.1-lp150.4.6.1.x86_64
kernel-macros-4.12.14-lp150.12.48.1.noarch
kernel-default-4.12.14-lp150.12.48.1.x86_64
cer@Telcontar:~> rpm -qa | grep -i nvidia
nvidia-computeG03-340.107-lp150.12.2.x86_64
nvidia-uvm-gfxG03-kmp-default-340.107_k4.12.14_lp150.11-lp150.12.1.x86_64
x11-video-nvidiaG03-340.107-lp150.12.2.x86_64
nvidia-glG03-340.107-lp150.12.2.x86_64
nvidia-gfxG03-kmp-default-340.107_k4.12.14_lp150.11-lp150.12.2.x86_64
cer@Telcontar:~>


If you wish, I can try to reproduce the error and post the thousands of lines of rpm output... But IMHO, it is pointless.
Comment 4 Stefan Dirsch 2019-04-01 13:09:00 UTC
Ok. Not sure who creates

  /usr/src/linux-obj/x86_64/default

It should point to the latest kernel-sources I believe. In your case this would be

  /usr/src/linux-4.12.14-lp150.12.48

or similar.

Maybe you can remove

  kernel-devel-4.12.14-lp150.12.45.1.noarch
  kernel-source-4.12.14-lp150.12.45.1.noarch
  kernel-syms-4.12.14-lp150.12.48.1.x86_64

(you no longer have such a kernel on your system anyway)

and reinstall

  kernel-default-4.12.14-lp150.12.48.1.x86_64
  kernel-default-devel-4.12.14-lp150.12.48.1.x86_64
  kernel-devel-4.12.14-lp150.12.48.1.noarch
  kernel-docs-4.12.14-lp150.12.48.1.noarch
  kernel-macros-4.12.14-lp150.12.48.1.noarch
  kernel-source-4.12.14-lp150.12.48.1.noarch

so the missing symlink gets created. 

Unfortunately, updating the kernel and then downgrading again doesn't work reliably when (re-)building the kernel module, especially if such a symlink is pointing to nowhere ...
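
A minimal sketch of how a link pointing to nowhere could be detected before attempting the build. The helper name is hypothetical; the path in the usage comment is the one from the make error above.

```shell
# Hypothetical helper: succeed (exit 0) only if the path is a symlink whose
# target does not exist. The name and usage are illustrative.
is_dangling_link() {
    # -L: the path itself is a symlink.
    # -e: follows the link, so it fails when the target is missing.
    [ -L "$1" ] && [ ! -e "$1" ]
}

# Usage:
#   if is_dangling_link /usr/src/linux-obj/x86_64/default; then
#       echo "dangling: $(readlink /usr/src/linux-obj/x86_64/default)"
#   fi
```

A path that does not exist at all is not reported as dangling by this check; it only flags links that are present but broken.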
Comment 5 Carlos Robinson 2019-04-01 17:22:18 UTC
Ok. Yes, the trick seems to be to reinstall the kernel links

However, the problem I report is that the "nvidia-gfxG03-kmp-default-340.107_k4.12.14_lp150.11-lp150.12.1.x86_64.rpm" doesn't say "I failed to build the module, installation failed".


Look. With the rpms I listed above (#3) I intentionally break the link:

Telcontar:/usr/src/linux-obj/x86_64 # l default 
lrwxrwxrwx 1 root root 50 Apr  1 19:02 default -> ../../linux-4.12.14-lp151.12.48-obj/x86_64/default

(I edited 150 to 151 in the target name)

The target does not exist. Now, I try to install the nvidia-gfxG03-kmp-default rpm.

Look:

[paste begin]
Telcontar:/data/... #  rpm --install --force nvidia-gfxG03-kmp-default-340.107_k4.12.14_lp150.11-lp150.12.2.x86_64.rpm
make: *** /usr/src/linux-obj/x86_64/default: No such file or directory.  Stop.
make: *** /usr/src/linux-obj/x86_64/default: No such file or directory.  Stop.
/usr/src/kernel-modules/nvidia-340.107-default /
NVIDIA: calling KBUILD...
make[1]: *** /lib/modules//source: No such file or directory.  Stop.
NVIDIA: left KBUILD.
nvidia.ko failed to build!
make: *** [Makefile:192: nvidia.ko] Error 1
/
install: cannot stat '/usr/src/kernel-modules/nvidia-340.107-default/nvidia.ko': No such file or directory
depmod: WARNING: could not open modules.order at /lib/modules/4.12.14-lp150.11-default: No such file or directory
depmod: WARNING: could not open modules.builtin at /lib/modules/4.12.14-lp150.11-default: No such file or directory

Modprobe blacklist files have been created at /etc/modprobe.d to prevent Nouveau from loading. This can be reverted by deleting /etc/modprobe.d/nvidia-*.conf.

*** Reboot your computer and verify that the NVIDIA graphics driver can be loaded. ***
[paste pause]



Notice the message that starts with: "reboot". It is claiming success. But see above:  "nvidia.ko failed to build!" Make failed with "Error 1".

THAT is the problem I'm reporting. Not why it failed, but that it fails and *claims success*.

rpm then calls dracut to create initrd - why, when there is no module?

[paste continues]

*** Reboot your computer and verify that the NVIDIA graphics driver can be loaded. ***

Creating initrd: /boot/initrd-4.12.14-lp150.12.48-default
dracut: Executing: /usr/bin/dracut --logfile /var/log/YaST2/mkinitrd.log --force --force-drivers "pata_jmicron ata_piix ata_generic netconsole xennet xenblk" /boot/initrd-4.12.14-lp150.12.48-default 4.12.14-lp150.12.48-default
dracut: *** Including module: bash ***
dracut: *** Including module: systemd ***
dracut: *** Including module: warpclock ***
dracut: *** Including module: systemd-initrd ***
dracut: *** Including module: i18n ***
dracut: *** Including module: kernel-modules ***
dracut: *** Including module: resume ***
dracut: *** Including module: rootfs-block ***
dracut: *** Including module: suse-xfs ***
dracut: *** Including module: terminfo ***
dracut: *** Including module: udev-rules ***
dracut: Skipping udev rule: 40-redhat.rules
dracut: Skipping udev rule: 50-firmware.rules
dracut: Skipping udev rule: 50-udev.rules
dracut: Skipping udev rule: 91-permissions.rules
dracut: Skipping udev rule: 80-drivers-modprobe.rules
dracut: *** Including module: dracut-systemd ***
dracut: *** Including module: haveged ***
dracut: *** Including module: usrmount ***
dracut: *** Including module: base ***
dracut: *** Including module: fs-lib ***
dracut: *** Including module: shutdown ***
dracut: *** Including module: suse ***
dracut: *** Including modules done ***
dracut: *** Installing kernel module dependencies and firmware ***
dracut: *** Installing kernel module dependencies and firmware done ***
dracut: *** Resolving executable dependencies ***
dracut: *** Resolving executable dependencies done***
dracut: *** Hardlinking files ***
dracut: *** Hardlinking files done ***
dracut: *** Stripping files ***
dracut: *** Stripping files done ***
dracut: *** Generating early-microcode cpio image ***
dracut: *** Constructing GenuineIntel.bin ****
dracut: *** Store current command line parameters ***
dracut: Stored kernel commandline:
dracut: rd.driver.pre=pata_jmicron
rd.driver.pre=ata_piix
rd.driver.pre=ata_generic
rd.driver.pre=netconsole
rd.driver.pre=xennet
rd.driver.pre=xenblk
dracut:  resume=UUID=4feaa6f5-38c4-4674-ae54-8e22389731a1
dracut:  root=UUID=ac173013-18ad-4c4e-921e-fd2ecfb56495 rootfstype=ext4 rootflags=rw,relatime,lazytime,data=ordered
dracut: *** Creating image file '/boot/initrd-4.12.14-lp150.12.48-default' ***
dracut: *** Creating initramfs image file '/boot/initrd-4.12.14-lp150.12.48-default' done ***
Telcontar:/data/... # 

[paste ends]

It is not displayed here, but when this is done from YaST, *YaST* says *nothing* about the failure to build the kernel module. The first sign that something is wrong is when video fails.
Comment 6 Stefan Dirsch 2019-04-01 17:50:56 UTC
Well, we're using the build structure from NVIDIA ...

What would be the benefit of letting the build fail officially? The build is done in %post of package install, so the package is already installed and can't be reverted automatically.

IIRC the build even gives you errors when the build succeeds! I don't plan to reimplement NVIDIA's driver build. Seriously.
Comment 7 Carlos Robinson 2019-04-01 20:24:50 UTC
I said nothing about the build structure from Nvidia headquarters.

I said nothing about reverting the package install.

The Make process errors are not reported to the user inside yast.

YaST says that the package succeeded, and proceeds to reboot. Surely you can add a message telling YaST that there were problems with make. Write the log somewhere and tell the user.

I only want YaST to tell the user that something happened.

This situation is not acceptable.
Comment 8 Stefan Dirsch 2019-04-02 18:25:10 UTC
(In reply to Carlos Robinson from comment #7)
> The Make process errors are not reported to the user inside yast.
> 
> Yast says that the package succeeded, and proceed to reboot. Surely you can
> add a message telling yast that there were problems with make. Write the log
> somewhere and tell.
> 
> I only want YaST to tell the user that something happened.

No, this has never been possible. This would be a feature request for YaST/zypper.
Comment 9 Stefan Dirsch 2019-05-22 14:01:18 UTC
Ok. Could you please open a bug or feature request against YaST, so it shows the output of appropriate scripts, when any of these exit with an error code != 0? So I can close this one as duplicate ...
Comment 10 Stefan Dirsch 2019-06-11 13:35:03 UTC
Hmm. Still interested?
Comment 11 Carlos Robinson 2019-06-11 15:57:36 UTC
Yes, sorry. Just trying to find some time when I can boot the machine to do it (repeat the rpm run and risk a reboot to obtain the text and error code).

Just as an aside: when I have the money, I intend to get new hardware, AMD, not Intel or Nvidia. I have 3 unsolvable problems with my current hardware, and this is one.
Comment 12 Stefan Dirsch 2019-06-11 17:09:28 UTC
Hmm. I was talking about opening a feature request against YaST... (comment #9)
Comment 13 Stephan Schöttl 2019-06-28 12:57:44 UTC
If I may add a comment: apart from error messages, it would also be nice if the kernel module were built correctly. That's for my use case anyway.

Thanks anyway for the analysis above. With this information I was finally able to build the modules after doing
cd /usr/src/ && ln -s linux-4.12.14-lp151.28.7-obj linux-obj
Comment 14 Stefan Dirsch 2019-07-05 13:17:47 UTC
(In reply to Stefan Dirsch from comment #12)
> Hmm. I was talking about opening a feature request against YaST... (comment
> #9)

Ok. Looks like Carlos is not (any longer).
Comment 15 Carlos Robinson 2019-07-05 19:08:28 UTC
Bug 1140563 submitted.
It took an hour just to extract sample logs. Now I have to reboot the machine so that it is consistent.
Comment 16 Stefan Dirsch 2019-07-05 19:40:35 UTC
Thanks!
Comment 17 Stefan Dirsch 2019-07-08 13:29:27 UTC
(In reply to Stefan Dirsch from comment #9)
> Ok. Could you please open a bug or feature request against YaST, so it shows
> the output of appropriate scripts, when any of these exit with an error code
> != 0? So I can close this one as duplicate ...

Looks like this has already been implemented according to boo#1140563. So reopening this bugreport.
Comment 18 Stefan Dirsch 2019-07-08 14:25:00 UTC
nvidia-gfxG05.changes
-------------------------------------------------------------------
Mon Jul  8 14:04:20 UTC 2019 - Stefan Dirsch <sndirsch@suse.com>

- kmp-post.sh/kmp-trigger.sh
  * exit with error code 1 from %post/%trigger, if kernel module
    build/install fails (boo#1131028)

Changed this for nvidia-gfx{,G01,G02,G03,G04,G05}. Will be in place with the next update on the nvidia server. Sources (changes) are in
  
  https://build.opensuse.org/project/show/X11:Drivers:Video

if you're interested. Closing as fixed ...
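
A minimal sketch of the idea behind the changelog entry above, assuming a scriptlet that runs a build command and expects a module file afterwards. This is not the actual kmp-post.sh; the function name, the `$dir` variable, and the make invocation in the usage comment are hypothetical.

```shell
# Hypothetical sketch, NOT the actual kmp-post.sh: run a build command and
# report failure unless it exits 0 and the module file exists afterwards.
build_module_or_fail() {
    module=$1      # expected output file, e.g. .../nvidia.ko
    shift          # remaining arguments: the build command to run
    if ! "$@"; then
        echo "kernel module build failed" >&2
        return 1
    fi
    if [ ! -f "$module" ]; then
        echo "kernel module $module is missing after build" >&2
        return 1
    fi
    return 0
}

# Usage inside a scriptlet (the path and make invocation are assumptions):
#   build_module_or_fail "$dir/nvidia.ko" make -C "$dir" || exit 1
```

Checking for the output file as well as the exit status covers a build command that returns 0 without producing the module.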
Comment 19 Carlos Robinson 2019-07-08 14:33:47 UTC
Created attachment 809702 [details]
var/log/zypp/history section

I attach the section of "/var/log/zypp/history" where it logs the failure of the script on finding that '/usr/src/kernel-modules/nvidia-340.107-default/nvidia.ko' is missing.

I do not know how to see if:

 107 - ZYPPER_EXIT_INF_RPM_SCRIPT_FAILED

is set. But YaST did not display that to me, that's certain.

I may try to reproduce the failure on 15.1, if that is of interest.


The text is visible when using zypper, but with thousands of lines scrolling by, it is impossible to notice that it happened unless the admin is watching the output attentively.
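
A minimal sketch of how the exit status could be inspected when the install is run through zypper on the command line, so that a scriptlet failure does not depend on spotting it in the scrolling output. The helper name and the package name in the usage comment are hypothetical; only the status 107 quoted above is treated specially.

```shell
# Hypothetical helper: map a zypper exit status to a short message.
# Only the code quoted above (107) is handled specially.
explain_zypper_status() {
    case $1 in
        0)   echo "ok" ;;
        107) echo "installed, but an rpm scriptlet failed (ZYPPER_EXIT_INF_RPM_SCRIPT_FAILED)" ;;
        *)   echo "zypper exited with status $1" ;;
    esac
}

# Usage (the package name is an example):
#   zypper --non-interactive install nvidia-gfxG03-kmp-default
#   explain_zypper_status $?
```

This only helps on the command line; it says nothing about what YaST displays.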
Comment 20 Carlos Robinson 2019-07-08 14:35:14 UTC
(In reply to Stefan Dirsch from comment #18)
...
> if you're interested. Closing as fixed ...

Yes, I'll test that when I notice the update on servers, thanks.
Comment 21 Stefan Dirsch 2019-07-08 15:04:05 UTC
(In reply to Carlos Robinson from comment #20)
> (In reply to Stefan Dirsch from comment #18)
> ...
> > if you're interested. Closing as fixed ...
> 
> Yes, I'll test that when I notice the update on servers, thanks.

Sure. Go ahead! :-)
Comment 22 Stefan Dirsch 2019-07-11 13:33:08 UTC
RPM updates for G05 430.34 driver will already include this change. Should be available shortly ...
Comment 23 Carlos Robinson 2019-07-22 21:06:13 UTC
I'd like to test this, but my card needs the G03 driver, and these have not been updated in the NVidia repository since mid June (maybe mid May) -- nor has the G05, anyway:


http://http.download.nvidia.com/opensuse/leap/15.1/x86_64/

nvidia-glG03-340.107-lp151.12.2.x86_64.rpm 28MB 2019-06-12 18:07
nvidia-glG04-390.116-lp151.7.1.x86_64.rpm 28MB 2019-06-12 18:07
nvidia-glG05-430.26-lp151.14.1.x86_64.rpm 26MB 2019-06-12 18:07 


I'm not in any hurry, just mentioning it in case you wonder why I don't report back ;-)
Comment 24 Stefan Dirsch 2019-07-22 21:59:21 UTC
Packages are already in the pipeline. They just need to be signed by Nvidia. Unfortunately, this time that can't be done before sometime in August. Usually this takes a few days at most. This time we have bad luck.