Bug 1000625

Summary: systemd: NVIDIA driver no longer working due to patches missing in Leap 42.2/sle12-sp2
Product: [openSUSE] openSUSE Distribution
Reporter: Luigi Baldoni <aloisio>
Component: Basesystem
Assignee: Stefan Dirsch <sndirsch>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Critical
Priority: P2 - High
CC: aloisio, ddadap, fbui, forgotten_zfHS33mKgr, lnussel, opensuse_org, sndirsch
Version: Leap 42.2
Flags: sndirsch: SHIP_STOPPER-
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: No
Marketing QA Status: ---
IT Deployment: ---
Bug Depends on:
Bug Blocks: 1001470
Attachments: nvidia-bug-report-sanitised, nvidia dmesg log, udevadm info -e

Description Luigi Baldoni 2016-09-23 07:47:30 UTC
Zypper returns an error when trying to install the nvidia driver
on Leap 42.2 beta 2.

---
# zypper install x11-video-nvidiaG04
Loading repository data...
Warning: Repository 'openSUSE-Leap-42.2-Update-Non-Oss' appears to be outdated. Consider using a different mirror or server.
Reading installed packages...
Resolving package dependencies...

Problem: nothing provides ksym(default:module_layout) = 958ed937 needed by nvidia-gfxG04-kmp-default-367.44_k4.4.19_1-12.1.x86_64
 Solution 1: do not install x11-video-nvidiaG04-367.44-12.1.x86_64
 Solution 2: break nvidia-gfxG04-kmp-default-367.44_k4.4.19_1-12.1.x86_64 by ignoring some of its dependencies

Choose from above solutions by number or cancel [1/2/c] (c):
---

No other nonstandard package or repository added.
Comment 1 Stefan Dirsch 2016-09-23 08:21:09 UTC
Sigh. Apparently kABI has changed again. :-( It should not matter, since the whole kernel module is built on the target system. So please try solution 2 and let me know whether this fixes the issue.
Comment 2 Luigi Baldoni 2016-09-23 10:30:22 UTC
By ignoring the warning the packages install correctly but then I can only see a black screen when display-manager is started.

Not sure if the two problems are related.
Comment 3 Stefan Dirsch 2016-09-23 12:06:23 UTC
Ok. Please attach the result when running nvidia-bug-report.sh.
Comment 4 Luigi Baldoni 2016-09-23 14:59:52 UTC
Created attachment 693933 [details]
nvidia-bug-report-sanitised
Comment 5 Stefan Dirsch 2016-09-26 08:07:32 UTC
Hmm. Nothing obvious I could find. :-(

nvidia-gfxG04-kmp-default
nvidia-computeG04
nvidia-glG04
x11-video-nvidiaG04

are all installed?

Does running

  X -retro :99

from the Linux console give you a nice Xserver picture, where you can move the mouse around?

Daniel, any idea? Anything obvious you could spot?
Comment 6 Luigi Baldoni 2016-09-26 10:01:27 UTC
(In reply to Stefan Dirsch from comment #5)
> nvidia-gfxG04-kmp-default
> nvidia-computeG04
> nvidia-glG04-
> x11-video-nvidiaG04
> 
> are all installed?

Yes.

> Does running
> 
>   X -retro :99
> 
> running from Linux console give you a nice Xserver picture, where you can
> move the mouse around?

And yes.

Should I try anything else? Would it make sense to give the official installer a try or possibly even 370.28 ?

Regards
Comment 7 Stefan Dirsch 2016-09-26 10:58:18 UTC
Then this sounds like the issue is being triggered by the display manager or the Xsession (if autologin is enabled, the session already starts when X starts). I suggest switching to lightdm and choosing an Xsession other than GNOME/KDE.

See /etc/sysconfig/displaymanager. The display manager needs to be restarted afterwards.
Comment 8 Luigi Baldoni 2016-09-26 12:28:20 UTC
(In reply to Stefan Dirsch from comment #7)
> Then this sounds like the issue is being triggered by the displaymanager or
> Xsession (if autologin has been enabled and already starts when X starts). I
> suggest to switch to lightdm and chose another Xsession than GNOME/KDE?
> 
> See /etc/sysconfig/displaymanager. Displaymanager needs to be restarted.

LightDM starts correctly, but then this happens:

plasma5 and KDE plasma workspace:

"Plasma failed to start
Plasma is unable to start as it could not correctly
use OpenGL 2.
Please check that your graphic drivers are set up
correctly."

Gnome:

"Oh no! Something has gone wrong.
A problem has occurred and the system can't recover
Please log out and try again."

IceWM:

No problem detected.
Comment 9 Stefan Dirsch 2016-09-26 12:40:25 UTC
Sounds like the wrong libGL is being used. Check that the entries in

  /etc/ld.so.conf.d/nvidia-gfxG04.conf

are active and not commented out. Running ldd on some OpenGL binary should mention

  /usr/X11R6/lib64/libGL.so.1

instead of

   /usr/lib64/libGL.so.1
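
The check described above can be scripted. The following is only a minimal sketch, assuming a POSIX shell with awk; the function names resolved_libgl and libgl_is_nvidia are made up for illustration and merely parse ldd-style output given on stdin:

```sh
# Print the path that libGL.so.1 resolves to, given ldd output on stdin.
# resolved_libgl is a hypothetical helper name, not part of any package.
resolved_libgl() {
    awk '$1 == "libGL.so.1" && $2 == "=>" { print $3 }'
}

# Classify the resolved libGL: the NVIDIA copy is expected under /usr/X11R6.
libgl_is_nvidia() {
    case "$(resolved_libgl)" in
        /usr/X11R6/*) echo "nvidia" ;;
        "")           echo "not linked against libGL.so.1" ;;
        *)            echo "other" ;;
    esac
}
```

It could be used as `ldd /usr/bin/plasmashell | libgl_is_nvidia`, with any OpenGL binary in place of plasmashell.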
Comment 10 Luigi Baldoni 2016-09-26 13:06:26 UTC
(In reply to Stefan Dirsch from comment #9)
> Sounds like the wrong libGL is being used. Check that entries in 
> 
>   /etc/ld.so.conf.d/nvidia-gfxG04.conf

That file contains:

/usr/X11R6/lib64
/usr/X11R6/lib

> are active and not commented out. ldd to some OpenGL binary should mention
> 
>   /usr/X11R6/lib64/libGL.so.1
> 
> instead of
> 
>    /usr/lib64/libGL.so.1

# ldd /usr/bin/plasmashell|grep GL
        libGL.so.1 => /usr/X11R6/lib64/libGL.so.1 (0x00007f9b65c76000)

# ls /usr/X11R6/lib64/libGL.so.1 -l
lrwxrwxrwx 1 root root 15 Sep 23 12:04 /usr/X11R6/lib64/libGL.so.1 -> libGL.so.367.44

Regards
Comment 11 Stefan Dirsch 2016-09-26 13:49:46 UTC
Weird. This looks ok. There are some simple OpenGL demos in the Mesa-demo-x package, like glxgears and glxinfo. Do these work when selecting failsafe, icewm or xfce as the Xsession in lightdm?
Comment 12 Luigi Baldoni 2016-09-26 14:12:08 UTC
(In reply to Stefan Dirsch from comment #11)
> Weird. This looks ok. There are some simple OpenGL demos in Mesa-demo-x
> package like glxgears, glxinfo. Are these working when selecting failsafe,
> icewm or xfce as Xsession in lightdm?

$ glxinfo
name of display: :0
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  153 (GLX)
  Minor opcode of failed request:  24 (X_GLXCreateNewContext)
  Value in failed request:  0x0
  Serial number of failed request:  76
  Current serial number in output stream:  77

$ glxgears 
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  153 (GLX)
  Minor opcode of failed request:  3 (X_GLXCreateContext)
  Value in failed request:  0x0
  Serial number of failed request:  28
  Current serial number in output stream:  29
Comment 13 Stefan Dirsch 2016-09-26 14:17:15 UTC
Wow! That's bad. No idea what's going on. According to the nvidia bug report log file, the glx Xserver extension is the NVIDIA one and its version number is 367.44, the same as that of libGL.so.1.
Comment 14 Luigi Baldoni 2016-09-26 15:09:00 UTC
(In reply to Stefan Dirsch from comment #13)
> Wow! That's bad. No idea what's going on. According to the nvidia bugreport
> logfile glx Xserver extension is the NVIDIA one and also is the version
> number 367.44 as the libGL.so.1 is.

367.44 installed from the .run doesn't appear to have problems...

Not sure what to think, perhaps I should compare the hashes of each relevant library?
Comment 15 Stefan Dirsch 2016-09-27 13:46:11 UTC
Indeed, I could reproduce the issue. For security reasons we have used the line

options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=33 NVreg_DeviceFileMode=0660

in /etc/modprobe.d/50-nvidia.conf for ages, so only members of the video group can access the nvidia devices.

Since the default user has not been a member of the video group for some time, we added two patches to systemd which add the appropriate ACLs to the devices during session startup.

- apply-ACL-for-nvidia-device-nodes.patch
- apply-ACL-for-nvidia-uvm-device-node.patch

These patches have been silently removed between Leap 42.1/sle12-sp1 and Leap 42.2/sle12-sp2.

We want these back !!! Seriously.
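
To see the effect of those module options from a user session, one can test which device nodes the current user is unable to open read-write. This is a minimal sketch, assuming a POSIX shell; inaccessible_nodes is a hypothetical helper name, and the shell's own -r/-w tests are used, which take both group membership and ACLs into account:

```sh
# Print every given device node the current user cannot open read-write.
# With mode 0660 and group video (GID 33), that is every node unless the
# user is in the video group or an ACL grants access.
# inaccessible_nodes is a hypothetical helper name.
inaccessible_nodes() {
    for node in "$@"; do
        [ -e "$node" ] || continue      # unmatched glob or missing node
        if [ ! -r "$node" ] || [ ! -w "$node" ]; then
            printf '%s\n' "$node"
        fi
    done
}
```

Running `inaccessible_nodes /dev/nvidia*` as the desktop user should print nothing once the ACLs are in place; root can always access the nodes, so the check is only meaningful for an unprivileged user.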
Comment 18 Franck Bui 2016-09-27 16:00:46 UTC
@Luigi, could you please show the output of the following commands?

 udevadm info /dev/nvidiactl
 udevadm info /dev/nvidia*

Thanks.
Comment 19 Luigi Baldoni 2016-09-27 16:38:11 UTC
(In reply to Franck Bui from comment #18)
> @Luigy, could you please show the output of the following commands ?
> 
>  udevadm info /dev/nvidiactl

$ udevadm info /dev/nvidiactl
Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.

>  udevadm info /dev/nvidia*

$ udevadm info /dev/nvidia*
Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.

Regards
Comment 20 Franck Bui 2016-09-28 06:07:19 UTC
(In reply to Luigi Baldoni from comment #19)
> 
> >  udevadm info /dev/nvidia*
> 
> $ udevadm info /dev/nvidia*
> Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.

Huh ? Do the device nodes actually exist ?

IOW, what does "ls -l /dev/nvidia*" show ?

Thanks.
Comment 21 Luigi Baldoni 2016-09-28 06:28:03 UTC
(In reply to Franck Bui from comment #20)
> (In reply to Luigi Baldoni from comment #19)
> > 
> > >  udevadm info /dev/nvidia*
> > 
> > $ udevadm info /dev/nvidia*
> > Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
> 
> Huh ? Do the device nodes actually exist ?

Yes.
 
> IOW, what does "ls -l /dev/nvidia*" show ?

$ ls -l /dev/nvidia*
crw-rw---- 1 root video 195,   0 Sep 28 08:24 /dev/nvidia0
crw-rw---- 1 root video 195, 255 Sep 28 08:24 /dev/nvidiactl
crw-rw-rw- 1 root root  195, 254 Sep 28 08:24 /dev/nvidia-modeset
crw-rw---- 1 root video 247,   0 Sep 28 08:24 /dev/nvidia-uvm

Regards.
Comment 22 Dr. Werner Fink 2016-09-28 06:45:38 UTC
(In reply to Luigi Baldoni from comment #21)

The error was not ENOENT (No such file or directory) but ENODEV, which means that the driver and/or kernel modules have not initialized the device nodes.

Make sure that all nvidia kernel modules are loaded. You can check this with

   lsmod | grep ^nv

as well as with

   ls -ld /sys/module/nv*
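
The same check can be turned into a small script that names the modules which are missing. A minimal sketch, assuming a POSIX shell; missing_modules is a hypothetical helper name, and the module root is passed as an argument so that it can be pointed at /sys/module:

```sh
# Print each named module that has no directory under the given sysfs
# module root (normally /sys/module). Note that sysfs spells module names
# with underscores, e.g. nvidia_uvm.
# missing_modules is a hypothetical helper name.
missing_modules() {
    root=$1
    shift
    for mod in "$@"; do
        [ -d "$root/$mod" ] || printf '%s\n' "$mod"
    done
}
```

For example, `missing_modules /sys/module nvidia nvidia_modeset nvidia_uvm nvidia_drm` prints nothing when all four modules are loaded.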
Comment 23 Luigi Baldoni 2016-09-28 06:53:09 UTC
(In reply to Dr. Werner Fink from comment #22)
> (In reply to Luigi Baldoni from comment #21)
> 
> The error was not ENOENT (No such file or directory) but ENODEV that is that
> the driver and/or kernel modules do not have initialized the device nodes.
> 
> Make sure that all nvidia kernel modules are loaded, you might check this
> with
> 
>    lsmod | grep ^nv

$ lsmod | grep ^nv
nvidia_drm             49152  2 
nvidia_modeset        770048  3 nvidia_drm
nvidia_uvm            794624  0 
nvidia              11493376  43 nvidia_modeset,nvidia_uvm
 
> as well as with
> 
>    ls -ld /sys/module/nv*

$ ls -ld /sys/module/nv*
drwxr-xr-x 6 root root 0 Sep 28 08:50 /sys/module/nvidia
drwxr-xr-x 6 root root 0 Sep 28 08:50 /sys/module/nvidia_drm
drwxr-xr-x 5 root root 0 Sep 28 08:50 /sys/module/nvidia_modeset
drwxr-xr-x 6 root root 0 Sep 28 08:50 /sys/module/nvidia_uvm

Regards.
Comment 24 Dr. Werner Fink 2016-09-28 07:04:34 UTC
(In reply to Luigi Baldoni from comment #23)

Then why does

  sudo udevadm info /dev/nvidia*

fail here? To debug this, stop the X server, unload and reload the modules with modprobe, and afterwards have a look at the kernel messages with

  dmesg | grep -i3 nvidia
Comment 25 Franck Bui 2016-09-28 07:15:37 UTC
(In reply to Dr. Werner Fink from comment #24)
> (In reply to Luigi Baldoni from comment #23)
> 
> Then explain why
> 
>   sudo udevadm info /dev/nvidia*
> 
> does fail here? 

I fear that the driver is probably not following the standard kernel "rules" when it comes to init/create/declare a device (at least).

IOW the API used to send uevents to userspace (udev) is not used at all (I guess) :-/

@Luigi, could you attach the output of "udevadm info -e" ?
Comment 26 Luigi Baldoni 2016-09-28 07:20:40 UTC
Created attachment 694673 [details]
nvidia dmesg log
Comment 27 Ludwig Nussel 2016-09-28 07:55:42 UTC
Franck, the NVidia device nodes are not known to the kernel. The proprietary drivers have no access to the GPL interface required for that.
See bug 808319 for the full story of the patches.
Comment 29 Dr. Werner Fink 2016-09-28 08:06:47 UTC
(In reply to Luigi Baldoni from comment #26)

[    1.257580] nvidia_modeset: Unknown symbol nvidia_register_module (err 0)
[    1.257593] nvidia_modeset: Unknown symbol nvidia_get_rm_ops (err 0)
[    1.257603] nvidia_modeset: Unknown symbol nvidia_unregister_module (err 0)

... looks like this module will not work at all.
Comment 30 Franck Bui 2016-09-28 08:11:42 UTC
(In reply to Ludwig Nussel from comment #27)
> Franck, the NVidia device nodes are not known to the kernel. The proprietary
> drivers have no access to the GPL interface required for that.
> See bug 808319 for the full story of the patches.

Thanks Ludwig for the pointer.

According to your comment here https://bugzilla.suse.com/show_bug.cgi?id=808319#c34 it seems that there was an alternative fix (http://people.freedesktop.org/~kay/0001-udev-export-dead-device-nodes-to-run-udev-devnode-ua.patch) but the patch has been removed so I can't look at it anymore.

Do you remember what it was about? Or maybe you still have it?
Comment 32 Stefan Dirsch 2016-09-28 08:17:45 UTC
(In reply to Dr. Werner Fink from comment #29)
> (In reply to Luigi Baldoni from comment #26)
> 
> [    1.257580] nvidia_modeset: Unknown symbol nvidia_register_module (err 0)
> [    1.257593] nvidia_modeset: Unknown symbol nvidia_get_rm_ops (err 0)
> [    1.257603] nvidia_modeset: Unknown symbol nvidia_unregister_module (err
> 0)
> 
> ... looks like this module will not work at all.

Loading this in addition was necessary for your Optimus setup, if you can still remember. It won't hurt either. ;-)

And this is totally unrelated to this bug!
Comment 33 Luigi Baldoni 2016-09-28 08:52:40 UTC
Created attachment 694685 [details]
udevadm info -e
Comment 34 Stefan Dirsch 2016-09-28 14:12:45 UTC
Franck, even if Ludwig still found this patch, I doubt that our kernel guys would still accept it for sle12-sp2. It is a patch which will probably never be accepted upstream, and udev rules using this hack haven't been written either.

Please bring these ACL patches back to systemd for sle12-sp2 in time, i.e. ASAP. I'm happy to test the updated RPMs. Otherwise half of our desktop users will no longer be able to start their desktops. The same applies to Leap 42.2, of course.

JFYI, I already gave the two patches a try on my NVIDIA test system on top of systemd in current Leap 42.2, and they do indeed fix the issue.
Comment 35 Franck Bui 2016-09-28 15:01:18 UTC
(In reply to Stefan Dirsch from comment #34)
> Franck, even if Ludwig would still find this patch, I doubt that our kernel
> guys would accept it still for sle12-sp2.

That was "just out of curiosity".

But if the patch 

> A patch, which probably will never
> be accepted upstream. udev rules using this hack haven't been written either.
>

And do you really think that udev/systemd upstream would have accepted such a hack if you had sent it to them?

The sad thing is that people seem to think that udev/systemd is the place for keeping hacks that we'll have to maintain forever.
Comment 36 Franck Bui 2016-09-28 15:02:47 UTC
@luigi, could you try this:

cat >/etc/tmpfiles.d/nvidia-hack.conf<<EOF
L /run/udev/static_node-tags/uaccess/nvidiactl - - - - /dev/nvidiactl
L /run/udev/static_node-tags/uaccess/nvidia-uvm - - - - /dev/nvidia-uvm
L /run/udev/static_node-tags/uaccess/nvidia0 - - - - /dev/nvidia0
EOF

and reboot and see if it fixes your issue.

Thanks.
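
The snippet above hardcodes three nodes. On machines with more than one GPU there may be further /dev/nvidiaN nodes, so the lines could be generated instead. This is only a sketch, assuming a POSIX shell and that the nodes already exist when it runs; nvidia_uaccess_lines is a hypothetical helper name:

```sh
# Emit one tmpfiles.d symlink line per existing NVIDIA node found in the
# given directory (normally /dev), in the same format as the snippet above.
# nvidia_uaccess_lines is a hypothetical helper name; nvidia-modeset is
# skipped on the assumption that it stays world-accessible (crw-rw-rw-).
nvidia_uaccess_lines() {
    devdir=$1
    for node in "$devdir"/nvidia*; do
        [ -e "$node" ] || continue      # unmatched glob: nothing to emit
        name=${node##*/}
        [ "$name" = "nvidia-modeset" ] && continue
        printf 'L /run/udev/static_node-tags/uaccess/%s - - - - %s\n' \
            "$name" "$node"
    done
}
```

Run as root once the driver has created its nodes, for example `nvidia_uaccess_lines /dev > /etc/tmpfiles.d/nvidia-hack.conf`, followed by a reboot as above.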
Comment 37 Franck Bui 2016-09-28 15:05:40 UTC
(In reply to Franck Bui from comment #35)
> (In reply to Stefan Dirsch from comment #34)
> > Franck, even if Ludwig would still find this patch, I doubt that our kernel
> > guys would accept it still for sle12-sp2.
> 
> That was "just out of curiosity".
> 
> But if the patch 
> 

oops forgot to finish my sentence :)

So here it is: "But if the patch looked better than the actual hack, we should consider using it in the future."
Comment 39 Stefan Dirsch 2016-09-28 15:26:06 UTC
I haven't been asked, but since I observe the same issue here ...

(In reply to Franck Bui from comment #36)
> @luigi, could you try this:
> 
> cat >/etc/tmpfiles.d/nvidia-hack.conf<<EOF
> L /run/udev/static_node-tags/uaccess/nvidiactl - - - - /dev/nvidiactl
> L /run/udev/static_node-tags/uaccess/nvidia-uvm - - - - /dev/nvidia-uvm
> L /run/udev/static_node-tags/uaccess/nvidia0 - - - - /dev/nvidia0
> EOF
> 
> and reboot and see if it fixes your issue.

This magic appears to fix the issue.
Comment 40 Luigi Baldoni 2016-09-28 15:38:24 UTC
(In reply to Franck Bui from comment #36)
> @luigi, could you try this:
> 
> cat >/etc/tmpfiles.d/nvidia-hack.conf<<EOF
> L /run/udev/static_node-tags/uaccess/nvidiactl - - - - /dev/nvidiactl
> L /run/udev/static_node-tags/uaccess/nvidia-uvm - - - - /dev/nvidia-uvm
> L /run/udev/static_node-tags/uaccess/nvidia0 - - - - /dev/nvidia0
> EOF
> 
> and reboot and see if it fixes your issue.

I can also confirm this fixes the problem.
Comment 41 Franck Bui 2016-09-28 15:39:46 UTC
(In reply to Stefan Dirsch from comment #39)
> 
> This magic appears to fix the issue.

Very good.

Would you then accept keeping this in the nvidia package, so that all the black magic is kept in one single place?
Comment 42 Stefan Dirsch 2016-09-29 07:57:43 UTC
Franck, I do not understand at all how this hack works. If you explain to me what it does and why it works, I may consider adding it permanently to the NVIDIA packages.

It seems that only the user who is currently logged in to the session has access to the /dev/nvidia* files. Listing the ACLs of the device files via getfacl confirms that.

Reading the tmpfiles.d manual page didn't explain the magic to me. :-(

And if this is an undocumented feature of tmpfiles.d, I'm wondering whether it may be silently removed at some point, so that we're in trouble again ...
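
For reference, the getfacl check mentioned above can be reduced to the list of users that hold an ACL entry. A minimal sketch, assuming a POSIX shell with awk; acl_named_users is a hypothetical helper name that parses getfacl output given on stdin:

```sh
# Read getfacl output on stdin and print the named users that hold an
# ACL entry (lines of the form "user:NAME:PERMS"). The owner entry
# "user::PERMS" has an empty name and is skipped.
# acl_named_users is a hypothetical helper name.
acl_named_users() {
    awk -F: '$1 == "user" && $2 != "" { print $2 }'
}
```

With the hack active, `getfacl /dev/nvidiactl 2>/dev/null | acl_named_users` would be expected to print only the user of the active session.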
Comment 43 Franck Bui 2016-09-29 15:57:48 UTC
(In reply to Stefan Dirsch from comment #42)
> Franck, I do not understand at all how this hack works. If you try to
> explain to me what it does and why this works I may consider adding it
> permanently to the NVIDIA packages.
> 

Sure and sorry for not explaining earlier.

udev has support for static devices; see man udev and search for the "static_node" option. I must admit that this part is poorly documented (IMHO).

Basically, such devices (like the nvidia ones) don't trigger any events, so regular rules don't apply to them. However, when it starts, udev looks at the rules that define a "static_node" option and applies the permissions to the static devices/nodes named by that option.

It's also possible to use the "TAG" key for those devices. That is pretty useful if you want to set the "uaccess" tag, which logind later uses to find the devices whose access needs to be granted to the logged-in user.

IOW using this rule:

 TAG+="uaccess", OPTIONS+="static_node=foo"

allows one to mark the static device /dev/foo with the "uaccess" tag.

Since static devices are not part of the udev DB, the fact that a device node has a tag is recorded by creating symlinks in /run/udev/static_node-tags/tag, as described in the man page.

But things are not so easy when it comes to NVIDIA ;)

Indeed, the nvidia device nodes don't exist when udev is started (since they're not really static nodes after all: they're created manually by something). Therefore udev refuses to apply the "static" rule to them.

And here comes the trick: I used a tmpfiles snippet to create the symlinks manually.

As I said, a rule could be used, but the nvidia nodes would need to be present when udev is started (so very early).

One way to do that would be to create the nodes from inside the initrd. But I don't know what creates those nodes, so I don't know whether that is doable.
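
To check which static nodes currently carry a tag, the symlinks in the tag directory can simply be listed. A minimal sketch, assuming a POSIX shell plus the readlink utility; tagged_static_nodes is a hypothetical helper name:

```sh
# List the static-node tags present in a tag directory (normally
# /run/udev/static_node-tags/uaccess) as "NAME -> TARGET" pairs.
# tagged_static_nodes is a hypothetical helper name.
tagged_static_nodes() {
    tagdir=$1
    for link in "$tagdir"/*; do
        [ -L "$link" ] || continue      # skip non-symlinks and empty globs
        printf '%s -> %s\n' "${link##*/}" "$(readlink "$link")"
    done
}
```

After a reboot with the tmpfiles snippet installed, `tagged_static_nodes /run/udev/static_node-tags/uaccess` should include the nvidia entries next to whatever udev created itself.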
Comment 44 Stefan Dirsch 2016-09-30 09:47:47 UTC
Thanks a bunch, Franck. This explanation has been *very* useful to me. This way it was easy for me to add the creation of the udev/uaccess symlinks to /etc/modprobe.d/50-nvidia.conf. This is no longer considered a showstopper, since the issue can be fixed in the NVIDIA packages themselves.
Comment 45 Stefan Dirsch 2016-10-08 09:14:12 UTC
*** Bug 1003701 has been marked as a duplicate of this bug. ***
Comment 46 Stefan Dirsch 2016-10-08 09:15:27 UTC
Fixed packages will be available soon.