|
Bugzilla – Full Text Bug Listing |
| Summary: | systemd: NVIDIA driver no longer working due to patches missing in Leap 42.2/sle12-sp2 | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE Distribution | Reporter: | Luigi Baldoni <aloisio> |
| Component: | Basesystem | Assignee: | Stefan Dirsch <sndirsch> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Critical | ||
| Priority: | P2 - High | CC: | aloisio, ddadap, fbui, forgotten_zfHS33mKgr, lnussel, opensuse_org, sndirsch |
| Version: | Leap 42.2 | Flags: | sndirsch: SHIP_STOPPER- |
| Target Milestone: | --- | ||
| Hardware: | Other | ||
| OS: | Other | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | No | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Bug Depends on: | |||
| Bug Blocks: | 1001470 | ||
| Attachments: | nvidia-bug-report-sanitised; nvidia dmesg log; udevadm info -e | ||
|
Description
Luigi Baldoni
2016-09-23 07:47:30 UTC
Sigh. Apparently kABI has changed again. :-( It should not matter, since the whole kernel module is built on the target system. So please try solution 2 and let me know whether this fixes the issue.

By ignoring the warning the packages install correctly, but then I can only see a black screen when display-manager is started. Not sure if the two problems are related.

Ok. Please attach the output of nvidia-bug-report.sh.

Created attachment 693933 [details]
nvidia-bug-report-sanitised
Hmm. Nothing obvious I could find. :-(

    nvidia-gfxG04-kmp-default
    nvidia-computeG04
    nvidia-glG04-
    x11-video-nvidiaG04

are all installed? Does running

    X -retro :99

from the Linux console give you a nice Xserver picture, where you can move the mouse around?

Daniel, any idea? Anything obvious you could spot?

(In reply to Stefan Dirsch from comment #5)
> nvidia-gfxG04-kmp-default
> nvidia-computeG04
> nvidia-glG04-
> x11-video-nvidiaG04
>
> are all installed?

Yes.

> Does running
>
> X -retro :99
>
> from the Linux console give you a nice Xserver picture, where you can
> move the mouse around?

And yes.
Should I try anything else? Would it make sense to give the official installer a try, or possibly even 370.28?

Regards

Then this sounds like the issue is being triggered by the displaymanager or Xsession (if autologin has been enabled and already starts when X starts). I suggest switching to lightdm and choosing an Xsession other than GNOME/KDE.

See /etc/sysconfig/displaymanager. Displaymanager needs to be restarted.

(In reply to Stefan Dirsch from comment #7)
> Then this sounds like the issue is being triggered by the displaymanager or
> Xsession (if autologin has been enabled and already starts when X starts). I
> suggest switching to lightdm and choosing an Xsession other than GNOME/KDE.
>
> See /etc/sysconfig/displaymanager. Displaymanager needs to be restarted.

LightDM starts correctly, but then this happens:

plasma5 and KDE plasma workspace:
"Plasma failed to start
Plasma is unable to start as it could not correctly use OpenGL 2.
Please check that your graphic drivers are set up correctly."

Gnome:
"Oh no! Something has gone wrong.
A problem has occurred and the system can't recover
Please log out and try again."

IceWM:
No problem detected.

Sounds like the wrong libGL is being used. Check that entries in

    /etc/ld.so.conf.d/nvidia-gfxG04.conf

are active and not commented out. ldd to some OpenGL binary should mention

    /usr/X11R6/lib64/libGL.so.1

instead of

    /usr/lib64/libGL.so.1

(In reply to Stefan Dirsch from comment #9)
> Sounds like the wrong libGL is being used. Check that entries in
>
> /etc/ld.so.conf.d/nvidia-gfxG04.conf

That file contains:

    /usr/X11R6/lib64
    /usr/X11R6/lib

> are active and not commented out. ldd to some OpenGL binary should mention
>
> /usr/X11R6/lib64/libGL.so.1
>
> instead of
>
> /usr/lib64/libGL.so.1

    # ldd /usr/bin/plasmashell|grep GL
    libGL.so.1 => /usr/X11R6/lib64/libGL.so.1 (0x00007f9b65c76000)
    # ls /usr/X11R6/lib64/libGL.so.1 -l
    lrwxrwxrwx 1 root root 15 Sep 23 12:04 /usr/X11R6/lib64/libGL.so.1 -> libGL.so.367.44

Regards

Weird. This looks ok. There are some simple OpenGL demos in the Mesa-demo-x package, like glxgears and glxinfo. Are these working when selecting failsafe, icewm or xfce as Xsession in lightdm?

(In reply to Stefan Dirsch from comment #11)
> Weird. This looks ok. There are some simple OpenGL demos in the Mesa-demo-x
> package, like glxgears and glxinfo. Are these working when selecting failsafe,
> icewm or xfce as Xsession in lightdm?

    $ glxinfo
    name of display: :0
    X Error of failed request: BadValue (integer parameter out of range for operation)
    Major opcode of failed request: 153 (GLX)
    Minor opcode of failed request: 24 (X_GLXCreateNewContext)
    Value in failed request: 0x0
    Serial number of failed request: 76
    Current serial number in output stream: 77

    $ glxgears
    X Error of failed request: BadValue (integer parameter out of range for operation)
    Major opcode of failed request: 153 (GLX)
    Minor opcode of failed request: 3 (X_GLXCreateContext)
    Value in failed request: 0x0
    Serial number of failed request: 28
    Current serial number in output stream: 29

Wow! That's bad. No idea what's going on. According to the nvidia bugreport logfile, the glx Xserver extension is the NVIDIA one and its version number is 367.44, the same as libGL.so.1.

(In reply to Stefan Dirsch from comment #13)
> Wow! That's bad. No idea what's going on. According to the nvidia bugreport
> logfile, the glx Xserver extension is the NVIDIA one and its version
> number is 367.44, the same as libGL.so.1.

367.44 installed from the .run doesn't appear to have problems...
Not sure what to think, perhaps I should compare the hashes of each relevant library?

Indeed I could reproduce the issue. For security reasons we have long used the line

    options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=33 NVreg_DeviceFileMode=0660

in /etc/modprobe.d/50-nvidia.conf, so only members of the video group can still access the nvidia devices. Since the default user has for some time no longer been a member of the video group, we added two patches to systemd, which add the appropriate ACLs to the devices during session startup.

- apply-ACL-for-nvidia-device-nodes.patch
- apply-ACL-for-nvidia-uvm-device-node.patch

These patches have been silently removed between Leap 42.1/sle12-sp1 and Leap 42.2/sle12-sp2. We want these back !!!

Seriously.

@Luigi, could you please show the output of the following commands ?

    udevadm info /dev/nvidiactl
    udevadm info /dev/nvidia*

Thanks.

(In reply to Franck Bui from comment #18)
> @Luigi, could you please show the output of the following commands ?
>
> udevadm info /dev/nvidiactl

    $ udevadm info /dev/nvidiactl
    Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.

> udevadm info /dev/nvidia*

    $ udevadm info /dev/nvidia*
    Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.

Regards

(In reply to Luigi Baldoni from comment #19)
> > udevadm info /dev/nvidia*
>
> $ udevadm info /dev/nvidia*
> Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.

Huh ? Do the device nodes actually exist ?

IOW, what does "ls -l /dev/nvidia*" show ?

Thanks.
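The group restriction described above (device nodes owned by root with GID 33 and mode 0660) means a user outside the video group needs an ACL to open /dev/nvidia*. A minimal sketch for checking group membership from a shell follows; the helper name `has_group` is hypothetical and not part of any package, while `id -nG` is the standard command that lists a user's group names.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not from the bug report):
# succeeds if GROUP appears among the remaining arguments.
# Usage: has_group GROUP NAME...
has_group() {
    wanted=$1
    shift
    for g in "$@"; do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

# Example (not run automatically); id -nG prints the current user's groups:
#   has_group video $(id -nG) || echo "not in video: access to /dev/nvidia* needs an ACL"
```

If the check fails, `getfacl /dev/nvidiactl` shows whether an ACL entry for the logged-in user exists instead.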
(In reply to Franck Bui from comment #20)
> (In reply to Luigi Baldoni from comment #19)
> > > udevadm info /dev/nvidia*
> >
> > $ udevadm info /dev/nvidia*
> > Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
>
> Huh ? Do the device nodes actually exist ?

Yes.

> IOW, what does "ls -l /dev/nvidia*" show ?

    $ ls -l /dev/nvidia*
    crw-rw---- 1 root video 195, 0 Sep 28 08:24 /dev/nvidia0
    crw-rw---- 1 root video 195, 255 Sep 28 08:24 /dev/nvidiactl
    crw-rw-rw- 1 root root 195, 254 Sep 28 08:24 /dev/nvidia-modeset
    crw-rw---- 1 root video 247, 0 Sep 28 08:24 /dev/nvidia-uvm

Regards.

(In reply to Luigi Baldoni from comment #21)

The error was not ENOENT (No such file or directory) but ENODEV, meaning that the driver and/or kernel modules have not initialized the device nodes.

Make sure that all nvidia kernel modules are loaded. You can check this with

    lsmod | grep ^nv

as well as with

    ls -ld /sys/module/nv*

(In reply to Dr. Werner Fink from comment #22)
> (In reply to Luigi Baldoni from comment #21)
>
> The error was not ENOENT (No such file or directory) but ENODEV, meaning that
> the driver and/or kernel modules have not initialized the device nodes.
>
> Make sure that all nvidia kernel modules are loaded. You can check this
> with
>
> lsmod | grep ^nv

    $ lsmod | grep ^nv
    nvidia_drm 49152 2
    nvidia_modeset 770048 3 nvidia_drm
    nvidia_uvm 794624 0
    nvidia 11493376 43 nvidia_modeset,nvidia_uvm

> as well as with
>
> ls -ld /sys/module/nv*

    $ ls -ld /sys/module/nv*
    drwxr-xr-x 6 root root 0 Sep 28 08:50 /sys/module/nvidia
    drwxr-xr-x 6 root root 0 Sep 28 08:50 /sys/module/nvidia_drm
    drwxr-xr-x 5 root root 0 Sep 28 08:50 /sys/module/nvidia_modeset
    drwxr-xr-x 6 root root 0 Sep 28 08:50 /sys/module/nvidia_uvm

Regards.

(In reply to Luigi Baldoni from comment #23)

Then explain why

    sudo udevadm info /dev/nvidia*

does fail here? Go on and debug this: stop the X server, unload and reload the modules with modprobe, and afterwards have a look at the kernel messages with

    dmesg | grep -i3 nvidia

(In reply to Dr. Werner Fink from comment #24)
> (In reply to Luigi Baldoni from comment #23)
>
> Then explain why
>
> sudo udevadm info /dev/nvidia*
>
> does fail here?

I fear that the driver is probably not following the standard kernel "rules" when it comes to init/create/declare a device (at least). IOW the API used to send uevents to userspace (udev) is not used at all (I guess) :-/

@Luigi, could you attach the output of "udevadm info -e" ?

Created attachment 694673 [details]
nvidia dmesg log
Franck, the NVidia device nodes are not known to the kernel. The proprietary drivers have no access to the GPL interface required for that.
See bug 808319 for the full story of the patches.

(In reply to Luigi Baldoni from comment #26)

    [ 1.257580] nvidia_modeset: Unknown symbol nvidia_register_module (err 0)
    [ 1.257593] nvidia_modeset: Unknown symbol nvidia_get_rm_ops (err 0)
    [ 1.257603] nvidia_modeset: Unknown symbol nvidia_unregister_module (err 0)

... looks like this module will not work at all.

(In reply to Ludwig Nussel from comment #27)
> Franck, the NVidia device nodes are not known to the kernel. The proprietary
> drivers have no access to the GPL interface required for that.
> See bug 808319 for the full story of the patches.

Thanks Ludwig for the pointer.

According to your comment here https://bugzilla.suse.com/show_bug.cgi?id=808319#c34 it seems that there was an alternative fix (http://people.freedesktop.org/~kay/0001-udev-export-dead-device-nodes-to-run-udev-devnode-ua.patch) but the patch has been removed so I can't look at it anymore.

Do you remember what it was about ? or maybe you still have it ?

(In reply to Dr. Werner Fink from comment #29)
> (In reply to Luigi Baldoni from comment #26)
>
> [ 1.257580] nvidia_modeset: Unknown symbol nvidia_register_module (err 0)
> [ 1.257593] nvidia_modeset: Unknown symbol nvidia_get_rm_ops (err 0)
> [ 1.257603] nvidia_modeset: Unknown symbol nvidia_unregister_module (err
> 0)
>
> ... looks like this module will not work at all.

Loading this in addition was necessary for your Optimus setup, if you can still remember. It won't hurt either. ;-) And this is totally unrelated to this bug!

Created attachment 694685 [details]
udevadm info -e
Franck, even if Ludwig would still find this patch, I doubt that our kernel guys would accept it still for sle12-sp2. A patch, which probably will never be accepted upstream. udev rules using this hack haven't been written either.

Please, bring these ACL patches back to systemd for sle12-sp2 in time, i.e. ASAP. I'm happy to test the updated RPMs. Otherwise half of our desktop users are no longer able to start their desktops. Same applies for Leap 42.2 of course.

JFYI, I already gave the two patches a try on my NVIDIA test system on top of systemd in current Leap 42.2 and it does indeed fix the issue.

(In reply to Stefan Dirsch from comment #34)
> Franck, even if Ludwig would still find this patch, I doubt that our kernel
> guys would accept it still for sle12-sp2.

That was "just out of curiosity".

But if the patch

> A patch, which probably will never
> be accepted upstream. udev rules using this hack haven't been written either.
>

And do you really think that udev/systemd upstream would have accepted such a hack if you had sent it to them ?

The sad thing is that people seem to think that udev/systemd is the place for keeping hacks that we'll have to maintain forever.

@luigi, could you try this:

    cat >/etc/tmpfiles.d/nvidia-hack.conf<<EOF
    L /run/udev/static_node-tags/uaccess/nvidiactl - - - - /dev/nvidiactl
    L /run/udev/static_node-tags/uaccess/nvidia-uvm - - - - /dev/nvidia-uvm
    L /run/udev/static_node-tags/uaccess/nvidia0 - - - - /dev/nvidia0
    EOF

and reboot and see if it fixes your issue.

Thanks.

(In reply to Franck Bui from comment #35)
> (In reply to Stefan Dirsch from comment #34)
> > Franck, even if Ludwig would still find this patch, I doubt that our kernel
> > guys would accept it still for sle12-sp2.
>
> That was "just out of curiosity".
>
> But if the patch
>

oops forgot to finish my sentence :)

So here it is: "But if the patch looked better than the actual hack, we should consider using it in the future."
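The tmpfiles.d snippet proposed above lists three device nodes by hand. As a sketch only, the same lines could be generated for any set of nodes (for example a second GPU at /dev/nvidia1). The function name `uaccess_tmpfiles_lines` is hypothetical; the line format is copied from the snippet in this bug, where `L` tells systemd-tmpfiles to create a symlink.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not from the bug report): print one
# tmpfiles.d "L" (symlink) line per device node, in the same format as the
# nvidia-hack.conf snippet quoted in this bug.
# Usage: uaccess_tmpfiles_lines NODE...
uaccess_tmpfiles_lines() {
    for node in "$@"; do
        name=${node##*/}    # strip the directory part: /dev/nvidia0 -> nvidia0
        printf 'L /run/udev/static_node-tags/uaccess/%s - - - - %s\n' "$name" "$node"
    done
}

# Example (as root; not run automatically):
#   uaccess_tmpfiles_lines /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia0 \
#       > /etc/tmpfiles.d/nvidia-hack.conf
```

The function only prints text, so its output can be inspected before anything is written under /etc.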
I haven't been asked, but since I observe the same issue here ...

(In reply to Franck Bui from comment #36)
> @luigi, could you try this:
>
> cat >/etc/tmpfiles.d/nvidia-hack.conf<<EOF
> L /run/udev/static_node-tags/uaccess/nvidiactl - - - - /dev/nvidiactl
> L /run/udev/static_node-tags/uaccess/nvidia-uvm - - - - /dev/nvidia-uvm
> L /run/udev/static_node-tags/uaccess/nvidia0 - - - - /dev/nvidia0
> EOF
>
> and reboot and see if it fixes your issue.

This magic appears to fix the issue.

(In reply to Franck Bui from comment #36)
> @luigi, could you try this:
>
> cat >/etc/tmpfiles.d/nvidia-hack.conf<<EOF
> L /run/udev/static_node-tags/uaccess/nvidiactl - - - - /dev/nvidiactl
> L /run/udev/static_node-tags/uaccess/nvidia-uvm - - - - /dev/nvidia-uvm
> L /run/udev/static_node-tags/uaccess/nvidia0 - - - - /dev/nvidia0
> EOF
>
> and reboot and see if it fixes your issue.

I can also confirm this fixes the problem.

(In reply to Stefan Dirsch from comment #39)
>
> This magic appears to fix the issue.

Very good.

Would you then accept keeping this in the nvidia package, so all the black magic is kept in one single place ?

Franck, I do not understand at all how this hack works. If you try to explain to me what it does and why this works I may consider adding it permanently to the NVIDIA packages.

It seems that only the user who is currently logged into the session has access to the /dev/nvidia* files. ACL listings of the device files via getfacl confirm that. Reading the tmpfiles.d manual page didn't explain the magic to me. :-( And if this is an undocumented feature of tmpfiles.d, I'm wondering whether it may be silently removed at some point, so we're in trouble again ...

(In reply to Stefan Dirsch from comment #42)
> Franck, I do not understand at all how this hack works. If you try to
> explain to me what it does and why this works I may consider adding it
> permanently to the NVIDIA packages.
>

Sure and sorry for not explaining earlier.

udev has support for static devices, see man udev and search for the "static_node" option. I must admit that this part is poorly documented (IMHO).

Basically such devices don't trigger any events (like the nvidia ones), so regular rules don't apply to them. However, when it starts, udev will look at the rules with a "static_node" option defined and will apply the permissions to the static devices/nodes defined by the option.

It's also possible to use the "TAG" key for those devices. It's pretty useful if you want to set the "uaccess" tag, which is later used by logind to find devices whose access needs to be granted to the logged-in user.

IOW using this rule:

    TAG+="uaccess", OPTIONS+="static_node=foo"

allows one to mark the static device /dev/foo with the "uaccess" tag.

Since static devices are not part of the udev DB, the fact that the device node has a tag is recorded by creating symlinks in /run/udev/static_node-tags/tag, as described in the man page.

But things are not so easy when it comes to NVIDIA ;)

Indeed nvidia device nodes don't exist when udev is started (since they're not really static nodes after all: they're created manually by something). Therefore udev doesn't accept to apply the "static" rule to them.

And here comes the trick: I used a tmpfiles snippet to create the symlinks manually.

As I said, a rule could be used, but the nvidia nodes need to be present when udev is started (so very early). One way to do that would be to create the nodes from inside the initrd. But I don't know what creates those nodes, so I don't know if that could be doable.

Thanks a bunch, Franck. This explanation has been *very* useful to me. This way it was easy for me to add the creation of the udev/uaccess symlinks to /etc/modprobe.d/50-nvidia.conf.

This is no longer considered a showstopper, since the issue can be fixed in the NVIDIA packages themselves.

*** Bug 1003701 has been marked as a duplicate of this bug. ***

Fixed packages will be available soon. |
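The bug does not show the actual change made to the NVIDIA packages. As a rough sketch of the mechanism explained above, the uaccess symlinks can also be created directly from a shell instead of through tmpfiles.d. The helper name `tag_uaccess` is hypothetical; the directory and link layout follow the tmpfiles snippet quoted in this bug.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not the packaged fix): create one symlink
# per device node inside TAGDIR, named after the node and pointing at it.
# Usage: tag_uaccess TAGDIR NODE...
tag_uaccess() {
    tagdir=$1
    shift
    mkdir -p "$tagdir" || return 1
    for node in "$@"; do
        # Link name is the node's basename; the link target is the node itself.
        # The node does not have to exist yet: a dangling symlink is allowed.
        ln -sfn "$node" "$tagdir/${node##*/}" || return 1
    done
}

# Example (as root; not run automatically):
#   tag_uaccess /run/udev/static_node-tags/uaccess \
#       /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia0
```

Because /run is a tmpfs, links created this way are lost on reboot, which is why the thread uses tmpfiles.d or a hook in the package's own configuration to recreate them on every boot.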