Bug 219456

Summary: Restart Xserver not possible after upgrading the driver version
Product: [openSUSE] openSUSE 10.2 Reporter: Stefan Dirsch <sndirsch>
Component: X11 3rd PartyAssignee: Stefan Dirsch <sndirsch>
Status: RESOLVED FIXED QA Contact: Stefan Dirsch <sndirsch>
Severity: Enhancement    
Priority: P1 - Urgent CC: aritger, eich, lfriedman, werner
Version: Beta 2   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: Other Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: xdm.diff
kdm.diff
gdm.diff
Xorg
diff
/usr/bin/X.%name

Description Stefan Dirsch 2006-11-09 16:23:04 UTC
I just updated the nvidia driver. After terminating my Xsession, the Xserver tried to restart (default xdm/kdm/gdm setup on SUSE since ever), which was not possible since the already loaded kernel module does not match the new driver version. The displaymanager tries 3 times to start the Xserver then terminates. Unloading the kernel module and restarting the displaymanger fixes this problem, but is rather inconvenient for the average user. Most of them
will reboot after having seen the error message "login:". :-(

Would it be possible that the NVIDIA driver unloads the kernel module if it doesn't match (or unload it in general when it's loaded - if a check is not possible) and tries again with a possibly new one loaded?

Let us know what you think about the proposal.
Comment 1 Lonni Friedman 2006-11-09 16:26:16 UTC
Stefan,
I'm not sure that I understand how you updated the nvidia driver.  The problem that you're reporting shouldn't even be possible if you used the official nvidia driver package.  Were you using an RPM for the nvidia driver?

thanks
Comment 2 Stefan Dirsch 2006-11-09 16:35:19 UTC
Yes, I used the driver RPMs we build for SLES10/SLED10. Background: We would like to provide a driver update due to the "Rapid7 Advisory R7-0025" security issue. Did you understand the problem or was my decsription to confusing for you?
Comment 3 Lonni Friedman 2006-11-09 16:44:19 UTC
The official NVIDIA installer already unloads the nvidia kernel module as part of the installation process.  It sounds like your RPM isn't doing that (probably because the RPM can be installed while X is using it?).  I guess I'm just not clear what you're asking us to change?
Comment 4 Stefan Dirsch 2006-11-09 16:57:55 UTC
> The official NVIDIA installer already unloads the nvidia kernel module as
> part of the installation process.
Sure. I'm aware of. Makes sense.

> It sounds like your RPM isn't doing that
> (probably because the RPM can be installed while X is using it?).
That's it! :-)

> I guess I'm just not clear what you're asking us to change?
During Xserver start unload already loaded kernel module, when it does not match the driver version, and load a possibly updated kernel module.

You're right. When using the official installer, such a feature is not required at all.
Comment 5 Stefan Dirsch 2006-11-09 17:07:27 UTC
Oops. I think we can do a braindead solution to address this problem ourselves by forcing to always unload it first.

cat >> /etc/modprobe.de/nvidia < EOF
install nvidia { /sbin/rmmod nvidia; sbin/modprobe -i nvidia }
EOF



Comment 6 Lonni Friedman 2006-11-09 17:09:37 UTC
Thanks Stefan.  Let me know if you need anything further from NVIDIA.
Comment 7 Stefan Dirsch 2006-11-09 17:27:59 UTC
> Oops. I think we can do a braindead solution
It's braindead since it doesn't work at all. This would only work when the driver always tries to load the kernel module, but it doesn't. So now we're back to 

"During Xserver start unload already loaded kernel module, when it does not
match the driver version, and load a possibly updated kernel module."
Comment 8 Lonni Friedman 2006-11-09 17:32:27 UTC
I've opened RFE 266558 for this issue.  thanks.
Comment 9 Stefan Dirsch 2006-11-09 17:46:19 UTC
Thanks. Just for curiosity. What does RFE stand for?
Comment 11 Lonni Friedman 2006-11-09 17:48:30 UTC
Request for enhancement
Comment 14 Stefan Dirsch 2006-11-09 18:01:09 UTC
Thanks, Lonni. Matthias has access to the RFE database and will watch this issue. :-)
Comment 15 Lonni Friedman 2006-12-21 18:37:43 UTC
This RFE has been marked will not fix, as the requested enhancement is only neccesary with 3rd party repackaging of the NVIDIA driver.
Comment 16 Stefan Dirsch 2006-12-21 18:55:18 UTC
So NVIDIA is not interested in easy driver integration into Linux products? This is definitely bad news. :-(
Comment 17 andy ritger 2006-12-21 23:11:10 UTC
> So NVIDIA is not interested in easy driver integration into Linux products?
> This is definitely bad news. :-(

I'm sorry for the miscommunication, Stefan.  After all the great work SuSE
and NVIDIA has done together, we both know the above statement isn't true.


Several NVIDIA engineers discussed this bug a while ago, and the consensus
was that this was cumbersome to solve within the driver:

    - we wouldn't want running *any* user space driver component to try
      to unload the kernel module, so probably limit this approach to just 
      the X driver, but what about non-X components? libnvidia-cfg.so
      (used by `nvidia-xconfig --query-gpu-info`)?  what about the
      upcoming CUDA compute driver?

    - unloading would need to be done in userspace rather than in the
      kernel, but the user space portion doesn't have easy access to the
      module usage count, so it would require some more complicated
      handshaking to negotiate when it was safe to unload (or atleast
      when an unload is not expected to succeed)

    - if we wanted to backport this to the 71xx legacy drivers, we'd
      have problems solving this in nvidia_drv.o because the driver is
      inside the XF86 libc wrapper; that's solvable, but cumbersome.

It feels to me like the problem you encountered here would be better
solved through package management.  I remember when we first developed
nvidia-installer, you suggested having nvidia-installer not let you
install while the X server was running (I think you found problems with
the nv driver running while the NVIDIA kernel module was loaded).

In general, it seems cleaner to need to first uninstall the current
NVIDIA driver before installing the new driver.  And I think that's
our recommendation.

Of course NVIDIA's willing to work with you on this, but it doesn't seem
like having the X driver unload the kernel module is the right solution.

FWIW, NVIDIA does still need to solve the problem of allowing multiple
driver versions to coexist on the file system at once, even if only one
is allowed to be used at a time (we receive this request from various
distros building LiveCDs), so it may be that as we work through the
issues for that, some better solutions to this current problem may arise.

But I think the best course of action is for your rpm packaging to more
strictly enforce when the driver can be installed, similar to what
nvidia-installer does.

Let me know what I can do to help.

Thanks,
- Andy
Comment 18 Stefan Dirsch 2006-12-22 04:25:03 UTC
Thanks for detailed explanation, Andy. 

I'll try to unload the kernel module right after Xsession has finished, i.e. before the new Xserver starts. Not sure how to tell this exactly xdm/kdm/gdm. Need to figure out. Of course the unload will fail when a second Xsever is running. Bad luck.
Comment 19 Egbert Eich 2007-01-02 09:59:31 UTC
The kernel rpm asks the user to reboot the system after installation to make sure the upgraded kernel and its modules are loaded. So this is a non issue for all free drivers.
In this particular case the situation is a little more complicated as a user land component has a dependency on specific kernel driver user space/ kernel ABI version. I'm not exactly sure how we solve this problem with user space components /drivers we control (ie. the free ones) but I think we try to avoid upgrading these products to a new ABI for an existing product (Stefan?).
There are different ways to solve this:
1. Make the NVIDIA kernel/user land interfaces backward compatible on both sides (this would require the new X driver to run with an old kernel module - possible with a degraded feature set). 
2. force the user to reboot the system whenever ugrading an NVIDIA component.
Juding from Andy's comment above this seems to be the only way to gurantee a consistent system under all circumstances although it seems to be overkill in most real world use cases.

I'd assume that the former solution would create a temendous burdeon and possible introduce regression for a feature of temorary value while the latter approach creates an inconvenience for the user. 
This inconcenience could be reduced if the reboot would only be required if there really was an ABI change between kernel and user land (assuming this is not always the case).

Comment 20 Stefan Dirsch 2007-01-03 17:28:58 UTC
I propose to simply hack xdm/kmd/gdm to unload the kernel module right after terminating the Xserver. I don't have a better idea right now. Trying to do this in /etc/X11/xdm/Xreset doesn't work since the Xserver is still running
at this time. :-(
Comment 21 Stefan Dirsch 2007-01-04 14:26:29 UTC
Created attachment 111497 [details]
xdm.diff

Proposal for xdm.
Comment 22 Stefan Dirsch 2007-01-04 14:31:03 UTC
It would be better to unload the module only when the version of the currently loaddd kernel module differs from the one, which is currently installed. You can check the version of the currently loaded kernel module via "cat /proc/driver/version/nvidia", but the version from the installed one cannot be checked with modinfo. Looks like it's only available via the "strings" command. :-(
Comment 23 Stefan Dirsch 2007-01-04 15:13:54 UTC
Created attachment 111504 [details]
kdm.diff

Proposal for kdm (untested).
Comment 24 Stefan Dirsch 2007-01-04 15:32:55 UTC
Created attachment 111512 [details]
gdm.diff

Proposal for gdm (untested).
Comment 25 Egbert Eich 2007-01-04 16:36:55 UTC
I'm against hacking kdm/xdm to address the few times where an update happened. This is even more hacky than putting it into the driver. It doesn't address the cases where no X driver is running (as mentioned in Andy's mail).
Also it will always return false (unless someone manages to install an Nvidia and a FireGL card).
The install procedure can test if it can unload the driver instead and reboot if this fails.
How about if we create a script to replace the /var/X11R6/bin link when a driver is installed:
Put the following into /var/X11R6/bin/X.install:
#/bin/bash
mods=`lsmod | grep -o -E "nvidia|fglrx"`;
test $? -eq 0 ] &&  for i in ${mods}; do rmmod $i; done;
test -f /var/X11R6/bin/X._install && mv /var/X11R6/bin/X._install /var/X11R6/bin/X;
exec /var/X11R6/bin/X ${1+"$@"}

The installation procedure can do someting like:

test -f /var/X11R6/bin/X._install \
  || { mv /var/X11R6/bin/X /var/X11R6/bin/X._install; \
       ln -sf /var/X11R6/bin/X.install /var/X11R6/bin/X' };
Comment 26 Egbert Eich 2007-01-04 16:37:58 UTC
This will of course not address Andy's other concerns. But it will keep actions confined to where they are required.
Comment 27 Stefan Dirsch 2007-01-04 17:37:58 UTC
> The install procedure can test if it can unload the driver instead and
> rebootif this fails.
IMHO this is overkill and I'm afraid there's no way to trigger a reboot, i.e. to tell YaST to do so, in a safe way during a package installation.

> How about if we create a script to replace the /var/X11R6/bin link when a
> driver is installed:
> Put the following into /var/X11R6/bin/X.install:
> #/bin/bash
> mods=`lsmod | grep -o -E "nvidia|fglrx"`;
> test $? -eq 0 ] &&  for i in ${mods}; do rmmod $i; done;
> test -f /var/X11R6/bin/X._install && mv /var/X11R6/bin/X._install
> /var/X11R6/bin/X;
> exec /var/X11R6/bin/X ${1+"$@"}
This would be possible, but ...

> The installation procedure can do someting like:
>
> test -f /var/X11R6/bin/X._install \
>   || { mv /var/X11R6/bin/X /var/X11R6/bin/X._install; \
>        ln -sf /var/X11R6/bin/X.install /var/X11R6/bin/X' };
What do you mean with installation procedure? 

- YaST? 
- RPM %post install section?

At least SaX2 and (/etc/X11/xdm/)SuSEconfig.xdm then will needs to be
adjusted as well (otherwise the wrapper script would be overwritten). My 
first idea has also been a wrapper script, but I abandonned it immediately when thinking about all the places where changes have to be made.
Comment 28 Stefan Dirsch 2007-01-04 22:18:50 UTC
When using a wrapper script I vote for replacing /usr/bin/Xorg (instead of /var/X11R6/bin/X) and moving /usr/bin/Xorg to /usr/bin/Xorg.bin. Then we only need to replace package xorg-x11-server (and package permissions on SLES10).
Comment 29 Stefan Dirsch 2007-01-04 22:40:23 UTC
Proposal for new /usr/bin/Xorg (untested):

#!/bin/sh

# force kernel module reload after driver update; kernel module and
# X driver version need to match (Novell Bugzilla #219456)

if [ "$UID" == "0" ]; then
  mods=$(lsmod | grep -o -E "^(nvidia|fglrx)")
  if [ $? -eq 0 ]; then
    for m in ${mods}; do
      /sbin/rmmod $m
    done
  fi
fi

exec /usr/bin/Xorg.bin ${1+"$@"}
Comment 30 Stefan Dirsch 2007-01-04 23:25:06 UTC
Created attachment 111584 [details]
Xorg

Improved Xorg wrapper.
Comment 31 Egbert Eich 2007-01-05 11:20:43 UTC
I'm not sure if I have made myself clear. My idea was that the installation program for the nvidia/fireglx driver *temporarily* replaces the link of the Xserver binary. Once the binary has run sucessfully it will itself move back the link.
Unloading the kernel module every time will trigger complaints by people who are concerned about short startup times as it will require the Xserver to reload the kernel module every time it is restarted.
Maybe I'm missing someting but I don't see any benefit unloading this module every time the Xserver gets restarted. I only see a need in the particular situation where the X and kernel drivers have been updated and the loaded kernel driver doesn't match what is needed by the X driver any more.
I don't care if you want to put the 'wrapper' into /usr/bin/X11. I just don't want to see it used all the time. However my idea was that the installer itself can provide this script.
It is confusing and people who don't know about the  nvidia/fireglx situation (and don't care about either driver) will wonder what we are up to.
Comment 32 Stefan Dirsch 2007-01-05 11:47:42 UTC
Thanks for explanation, Egbert. Now I understand what you want to achieve. :-)
A wrapper script /var/X11R6/bin/X, which unloads the possibly existing old kernel module and replaces itself with a symlink to /usr/bin/Xorg again after the first run. Hmm ... sounds cool ... but also problematic ...

Things like this could be done in %post of a package, but I'm afraid that 
SaX2 will kill this mechanism (overwrite the wrapper script), when it is used for configuration on top of a running Xserver right after installation of this package.

I'll need to think about a solution based on your idea, which will work in
all cases.
Comment 33 Stefan Dirsch 2007-01-05 14:39:27 UTC
To prevent the SaX2 prolematic it would be possible to use /usr/bin/Xorg as wrapper script (untested). /usr/bin/Xorg needs to be moved to /usr/bin/Xorg.bin before in %post. But this won't work, when /usr is readonly. :-(

#!/bin/sh

# force kernel module reload after driver update; kernel module and
# X driver version need to match (Novell Bugzilla #219456)

if [ "$UID" == "0" ]; then
  mods=$(lsmod | grep -o -E "^(nvidia|fglrx)")
  if [ $? -eq 0 ]; then
    for m in ${mods}; do
      /sbin/rmmod $m
    done
  fi
fi

mv /usr/bin/Xorg.bin /usr/bin/Xorg
exec /usr/bin/Xorg ${1+"$@"}
Comment 34 Egbert Eich 2007-01-05 15:14:54 UTC
Is there a need for SaX2 to still touch the link?
Also is this a realistic scenario? When the nvidia kernel module is laoded and cannot be unloaded the user would already have a running Xserver with NVIDIA driver (from where he is doing the update). In all other cases the installation procedure would be able to uninstall the module. 
I personally don't like to replace the Xorg binary with a wrapper script - not even temporarily.
Also if w ecan get SaX2 to unload the driver module (using some 'flag' that's been set up by the installation procedure) it can do with the link anything it likes.
Comment 35 Stefan Dirsch 2007-01-05 15:54:21 UTC
> Is there a need for SaX2 to still touch the link?
I'm not sure. Marcus?

> When the nvidia kernel module is laoded and cannot be unloaded the user 
> would already have a running Xserver with NVIDIA
> driver (from where he is doing the update)
This bugreport is about this situation. .-(

> I personally don't like to replace the Xorg binary with a wrapper script -
> not even temporarily.
I do understand this. I'm not happy with it either.

> Also if we can get SaX2 to unload the driver module (using some 'flag'
> that's been set up by the installation procedure) it can do with the link
> anything it likes.
Sure, but we are talking about the situation, where it isn't possible to unload the module, because it runs on top of a NVIDIA Xserver.
Comment 36 Stefan Dirsch 2007-01-05 16:00:56 UTC
> I personally don't like to replace the Xorg binary with a wrapper script -
> not even temporarily.
Hmm ... we could replace the /usr/bin/X symlink instead.
Comment 37 Egbert Eich 2007-01-08 10:35:49 UTC
(In reply to comment #35)
> > Also if we can get SaX2 to unload the driver module (using some 'flag'
> > that's been set up by the installation procedure) it can do with the link
> > anything it likes.
> Sure, but we are talking about the situation, where it isn't possible to unload
> the module, because it runs on top of a NVIDIA Xserver.
> 
This would only be the case if SaX was running inside the X session where the installation took place. Why would the user do that?
Anyway if Marcus was willing to touch SaX2 there are two ways to remedy the problem.
1. SaX2 detects the situation and refuses to run - opening a window telling the user to restart X first.
2. SaX2 would detect the situation and do something smart about the link ie. changing the backup copy of the installer link - which could cause problems if multiple stacked backup copies existed or better assume the link is correct and leave it alone.

Comment 38 Egbert Eich 2007-01-08 10:50:42 UTC
(In reply to comment #36)
> > I personally don't like to replace the Xorg binary with a wrapper script -
> > not even temporarily.
> Hmm ... we could replace the /usr/bin/X symlink instead.
> 

This would not allow for a solution that gets installed by the update process and deinstalls itself when not needed any more. The merit of the /var/X11R6/bin/ solution was that this is guaranteed to be writable even after the installation. 

We are trying to design our installation procedure here and SaX2 is part of this (and under our control) therefore why not make it aware of this procedure?
Indeed I can see scenarios where my solution would not work well (ie. centralized installation scenarios where /usr is shared among many systems) however these systems need other heuristics to deal with changes to the individual /var copies. No such scenario is presently supported as far as I know.
Comment 39 Stefan Dirsch 2007-01-08 11:25:16 UTC
My biggest concern about replacing /var/X11R6/bin/X is that we also need to adjust SuSEconfig.xdm (xorg-x11 package). Otherwise /var/X11R6/bin/X will be immediately overwritten at the end of installation (SLES10) / during (re)start of xdm (openSUSE >= 10.2). But of course changing this would be pretty trivial. 

BTW, maybe we even won't need to replace /var/X11R6/bin/X if we set $DISPLAYMANAGER_XSERVER in /etc/sysconfig/displaymanager appropriately during installation and call SuSEconfig.xdm afterwards. The wrapper script could
set $DISPLAYMANAGER_XSERVER back after the first run.
Comment 40 Stefan Dirsch 2007-01-08 11:58:35 UTC
Looks like SaX2 doesn't create the /var/X11R6/bin/X symlink. At least it doesn't when started on top of a running Xserver. This is good. :-)
Comment 41 Matthias Hopf 2007-01-08 15:43:56 UTC
(In reply to comment #37)
> > Sure, but we are talking about the situation, where it isn't possible to unload
> > the module, because it runs on top of a NVIDIA Xserver.
> > 
> This would only be the case if SaX was running inside the X session where the
> installation took place. Why would the user do that?

Hm - why *wouldn't* he do that? It's standard in about every OS to download and install a new graphics driver while it (the old version) is running.

(In reply to comment #38)
> This would not allow for a solution that gets installed by the update process
> and deinstalls itself when not needed any more. The merit of the
> /var/X11R6/bin/ solution was that this is guaranteed to be writable even after
> the installation. 

While I do not think that supporting exactly this scenario in cases where /usr is mounted r/o only (e.g. clusters with shared /usr), /var is certainly the better place for this. /usr should be immutable after installation.

Comment 42 Stefan Dirsch 2007-01-08 16:51:54 UTC
(In reply to comment #39)
> BTW, maybe we even won't need to replace /var/X11R6/bin/X if we set
> $DISPLAYMANAGER_XSERVER in /etc/sysconfig/displaymanager appropriately
> during installation and call SuSEconfig.xdm afterwards. The wrapper script
> could set $DISPLAYMANAGER_XSERVER back after the first run.

I've now implemented this. Works fine for me. I'll attach the diff for the specfile and the wrapper script JFYI.


Comment 43 Stefan Dirsch 2007-01-08 16:52:45 UTC
Created attachment 111853 [details]
diff

Patch for specfile.
Comment 44 Stefan Dirsch 2007-01-08 16:53:46 UTC
Created attachment 111855 [details]
/usr/bin/X.%name

wrapper script
Comment 45 Stefan Dirsch 2007-01-08 17:29:53 UTC
I plan to use the new mechanism (see comments #43/44) for future NVIDIA/ATI driver releases.
Comment 47 Stefan Dirsch 2007-01-08 17:31:18 UTC
Closing as fixed.
Comment 48 andy ritger 2007-01-08 17:39:15 UTC
(sorry for joining the discussion late, just got back from holiday)

Wow, I'm sorry that this got so involved.  It sounds like you have a solution
that works, and I don't have any better ideas.

I suppose one other approach might be to defer installation of the new driver
until the next time the X server starts -- either when the user reboots or when
the user restarts X.  At that point (after the X server scripts have started,
but before the server binary has started), the old driver could be uninstalled
and the new driver could be installed, then server startup could continue.

Maybe the X startup scripts could be made to cheaply determine if there is a
new driver to install?  (e.g., look in a special directory)

Anyway, just throwing out the idea; if your current solution works, then fine
with me.
Comment 49 Stefan Dirsch 2007-01-08 17:51:53 UTC
Unfortunately defering a package installation is not possible ATM. Cool idea nevertheless. :-)

I think we have a working solution now. :-) At least for most of the cases. It cannot work if more than one Xserver is active.
Comment 50 Egbert Eich 2007-01-12 15:35:49 UTC
The installation cannot be deferred but the files could be installed in a scratch area and moved into place later. This however may fail when /usr is mounted ro as at the time the Xserver is restarted it may already be mounted ro (while it must be rw at installation time.)
In any case it would be just as troublesome as it would also require a wrapper script.
Another caveat: this procedure only works with xdm - it doesn't work with startx. The advantage of the /var/X11R6/bin/X solution was that it would have worked with any X starting mechanism.
In any case I will reopen this as the startx script may need some additional tweaking.
A reply to comment #41:
> Hm - why *wouldn't* he do that? It's standard in about every OS to download and
> install a new graphics driver while it (the old version) is running.
SaX2 isn't needed for the installation of the package. When the user is running X when installing the driver he either has a configured Xserver (with the NVIDIA driver) running already or he is installing from a server that doesn't use the NVIDIA driver in which case the kernel module is not a concern. 
Comment 51 Egbert Eich 2007-01-12 15:36:34 UTC
Opps, too fast. Didn't hit reopen right.
Comment 52 Stefan Dirsch 2007-01-12 15:46:32 UTC
startx is only for the experienced user and no longer supported since a long time. I don't plan to handle this scenario. If you want to provide a new startx patch, let me know.

BTW, it's much more easier to unload the kernel module when you use startx anyway. :-)