Bug 810022

Summary: nVidia proprietary drivers broken for 12.3
Product: [openSUSE] openSUSE 12.3
Reporter: Tony Su <tonysu>
Component: X11 3rd Party Driver
Assignee: E-mail List <xorg-maintainer-bugs>
Status: RESOLVED WONTFIX
QA Contact: Stefan Dirsch <sndirsch>
Severity: Critical
Priority: P3 - Medium
CC: johanp, linreg
Version: Final
Target Milestone: ---
Hardware: x86-64
OS: SUSE Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: All Loaded Target Units on this Machine
boot.log
journal.log
Plymouth-start Status
Xorg.0.log
nvidia bug report generated from script
G03 build stdout

Description Tony Su 2013-03-18 14:44:39 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0

Scope of affected Users
Unknown. I would guess this affects at least everyone who has avoided the nouveau driver in the past because of the notorious "kaleidoscope" problem, which has existed in approximately openSUSE versions 11.1 through 12.2
Affects CUDA users

Both KMS on/off

Included are all the relevant logfiles I used to troubleshoot
All logfiles were retrieved during the failed graphical boot by dropping into a command line console.

Each logfile was modified by adding, as its first line, the command used to retrieve it

Summary:
Cause of problem appears to be
an attempt to load a non-existent "nvidia" module

NOTE:
Secondary issue: when changing from the nVidia driver to the nouveau or nv driver, the xorg.conf file is not automatically modified or deleted, so manual User intervention is required. Apparently this is a long-standing unaddressed issue. Unlike installing the proprietary nvidia driver, where the User is prompted to run nvidia-config, installing the nouveau or nv driver gives the User no prompt or any clue that this step is required.

For this reason, although this is primarily a 3rd party driver problem, IMO the xorg.conf issue should be fixed by whoever develops kdm or whoever manages the xorg.conf file.

Reproducible: Always

Steps to Reproduce:
Install or upgrade with nVidia Repo enabled.
ftp://download.nvidia.com/opensuse/12.3/

As prompted, run nvidia-config to create the required xorg.conf

When booting normally, press ESC to view the bootup sequence

Immediately before reaching the Plymouth startup screen, note that bootup hangs on "Reached graphical user interface"

This is where most people cannot proceed further
Actual Results:  
System bootup hangs

Expected Results:  
Proceed to Plymouth

Continuing from the point where the system hangs:
This is the point at which the accompanying logs were retrieved.
Do not wait too long to escape to a console by pressing ALT-F1, otherwise after a minute or so the system will freeze.

The following commands were used to retrieve the logs

systemctl --all --type=target
cat /var/log/boot.log 
journalctl -b
systemctl status plymouth-start.service
cat /var/log/Xorg.0.log
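
For anyone who wants to collect the same logs in the same form, the following is a minimal sketch of how each file could be written with its retrieval command as the first line. The helper name capture_with_header and the output file names are assumptions for illustration, not something from the original troubleshooting session.

```shell
#!/bin/sh
# Sketch only: capture_with_header is a hypothetical helper name.
# Usage: capture_with_header OUTFILE COMMAND [ARGS...]
capture_with_header() {
    outfile=$1
    shift
    # Line 1 of the log records the exact command that produced it.
    printf '%s\n' "$*" > "$outfile"
    # Append stdout and stderr; do not abort if the command itself fails.
    "$@" >> "$outfile" 2>&1 || true
}
```

Possible usage from the console, run as root: capture_with_header journal.log journalctl -b, and likewise for the other commands listed above.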

IMO the most important information from these logs: the X server clearly starts and attempts to [LoadModule: "nvidia"], but when that fails the X server stops. The system then proceeds to graphical.target and, unable to find a running X server, hangs (maybe there should also be a graceful fallback, such as automatically escaping to a console?)

journal.log 
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK xdm[736]: Starting service kdm..done
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK systemd[1]: Started LSB: X Display Manager.
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK acpid[580]: client connected from 801[0:0]
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK acpid[580]: 1 client rule loaded
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK kdm[771]: X server died during startup
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK kdm[771]: X server for display :0 cannot be started, session disabled
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK fail2ban[734]: Starting fail2ban ..done
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK systemd[1]: Started LSB: Bans IPs with too many authentication failures.
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK systemd[1]: Starting Multi-User.
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK systemd[1]: Reached target Multi-User.
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK systemd[1]: Starting Graphical Interface.
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK systemd[1]: Reached target Graphical Interface.
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK systemd[1]: Starting Stop Read-Ahead Data Collection 10s After Completed Startup.
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK systemd[1]: Starting Update UTMP about System Runlevel Changes...
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK systemd[1]: Started Update UTMP about System Runlevel Changes.
Mar 17 18:12:44 SUSEMACHINE.LOCALNETWORK systemd[1]: Startup finished in 3s 576ms 851us (kernel) + 4s 647ms 496us (userspace) = 8s 224ms 347us.
Mar 17 18:12:51 SUSEMACHINE.LOCALNETWORK acpid[580]: client 801[0:0] has disconnected


Xorg.0.log 
[     7.899] Loading extension GLX
[     7.899] (II) LoadModule: "nvidia"
[     7.900] (WW) Warning, couldn't open module nvidia
[     7.900] (II) UnloadModule: "nvidia"
[     7.900] (II) Unloading nvidia
[     7.900] (EE) Failed to load module "nvidia" (module does not exist, 0)
[     7.900] (EE) No drivers available.
[     7.900] 
Fatal server error:
[     7.900] no screens found
[     7.900] (EE)
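
The relevant lines can be pulled out of a long Xorg log without reading all of it. This is a rough sketch that relies only on the (EE) and (WW) markers visible in the excerpt above; the function names are made up for illustration, and the default path is the one used in this report.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers.

# Print every error (EE) and warning (WW) line from an Xorg log.
xorg_problems() {
    logfile=${1:-/var/log/Xorg.0.log}
    grep -E '\((EE|WW)\)' "$logfile"
}

# Print only the names of modules that the X server failed to load.
xorg_failed_modules() {
    logfile=${1:-/var/log/Xorg.0.log}
    sed -n 's/.*(EE) Failed to load module "\([^"]*\)".*/\1/p' "$logfile"
}
```

Run against the log in this report, xorg_failed_modules should print just the driver name from the "Failed to load module" line.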
Comment 1 Tony Su 2013-03-18 14:47:19 UTC
Created attachment 530200 [details]
All Loaded Target Units on this Machine
Comment 2 Tony Su 2013-03-18 14:47:53 UTC
Created attachment 530201 [details]
boot.log
Comment 3 Tony Su 2013-03-18 14:48:29 UTC
Created attachment 530202 [details]
journal.log
Comment 4 Tony Su 2013-03-18 14:50:16 UTC
Created attachment 530204 [details]
Plymouth-start Status
Comment 5 Tony Su 2013-03-18 14:51:09 UTC
Created attachment 530207 [details]
Xorg.0.log
Comment 6 Stefan Dirsch 2013-03-18 15:24:42 UTC
So how did you install NVIDIA drivers? Your GPU is supported by G03 driver RPMs.
Comment 7 Tony Su 2013-03-18 15:27:43 UTC
FYI - openSUSE Technical Forums Thread, another User with identical problem on different GPU.
https://forums.opensuse.org/english/get-technical-help-here/install-boot-login/484402-upgrade-12-3-12-2-cannot-start-x-nvidia.html
Comment 8 Tony Su 2013-03-18 15:29:43 UTC
RPM install method was as described in posting...
Enable nVidia repo
zypper up, which allows auto-detection and selection of drivers
There were two drivers offered; yes, G03 was one of the drivers offered and tried.
Comment 9 Stefan Dirsch 2013-03-18 16:03:14 UTC
Apparently nouveau kernel module has not been blacklisted. Hence the nvidia kernel module could not be loaded. There should be the file 
/etc/modprobe.d/nvidia-<kernel_flavor>.conf with the blacklist entry. It's part of the G03 KMP. 

Please run nvidia-bug-report.sh and attach the result.
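
A quick way to check for the blacklist entry before generating the full report might look like the sketch below. It assumes the entry is a plain "blacklist nouveau" line in a .conf file under /etc/modprobe.d; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: nouveau_blacklisted is a hypothetical helper.
# Returns 0 if any .conf file in the directory has an uncommented
# "blacklist nouveau" line, non-zero otherwise.
nouveau_blacklisted() {
    confdir=${1:-/etc/modprobe.d}
    grep -Eqs '^[[:space:]]*blacklist[[:space:]]+nouveau([[:space:]]|$)' "$confdir"/*.conf
}
```

For example: nouveau_blacklisted && echo "nouveau is blacklisted". Whether nouveau is actually loaded at that moment can be checked separately with lsmod | grep nouveau.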
Comment 10 tom master 2013-03-18 16:09:46 UTC
Solution for me:
Upgrade nvidia 304 ==> 310 driver

error description: 
1) After a migration from 12.2 to 12.3, the old 304 driver is not uninstalled
2) The path for nvidia.ko is wrong.
 The RPM installs nvidia 310 in /lib/modules/3.7.9-1.1-desktop/updates/nvidia.ko
3) modprobe or xconfig cannot find nvidia, or show a message like "module has version 304 and not 310 ..."
4) The nvidia path should be /lib/modules/3.7.10-1.1-desktop/weak-updates/updates/nvidia.ko

Solution:
Copy or Link nvidia.ko from /lib/modules/3.7.9-1.1-desktop/updates/nvidia.ko ==> /lib/modules/3.7.10-1.1-desktop/weak-updates/updates/nvidia.ko

I hope this is the solution for this Problem
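
The "link" variant of this workaround could be sketched roughly as follows. The function name and its arguments are assumptions for illustration. Note that linking by hand skips whatever compatibility check normally decides whether a weak-updates link gets created, so treat it as an experiment rather than a fix.

```shell
#!/bin/sh
# Sketch only: link_weak_update is a hypothetical helper.
# Usage: link_weak_update BUILT_FOR_KERNEL TARGET_KERNEL [MODULES_ROOT]
link_weak_update() {
    built_for=$1
    target=$2
    root=${3:-/lib/modules}
    src="$root/$built_for/updates/nvidia.ko"
    dst_dir="$root/$target/weak-updates/updates"
    # Refuse to create a dangling link if the module was never built.
    [ -f "$src" ] || { echo "missing $src" >&2; return 1; }
    mkdir -p "$dst_dir" && ln -sf "$src" "$dst_dir/nvidia.ko"
}
```

With the paths from this comment that would be link_weak_update 3.7.9-1.1-desktop 3.7.10-1.1-desktop, presumably followed by depmod -a 3.7.10-1.1-desktop so that modprobe can find the module.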
Comment 11 Tony Su 2013-03-18 16:12:32 UTC
(In reply to comment #9)
> Apparently nouveau kernel module has not been blacklisted. Hence the nvidia
> kernel module could not be loaded. There should be the file 
> /etc/modprobe.d/nvidia-<kernel_flavor>.conf with the blacklist entry. It's part
> of the G03 KMP. 
> 
> Please run nvidia-bug-report.sh and attach the result.

Curiously,
Although "locate" has nvidia-bug-report.sh in its database, it's not on my system now. Am I supposed to re-install the proprietary driver package, re-produce the problem and then run this script?
Comment 12 Stefan Dirsch 2013-03-18 16:16:31 UTC
(In reply to comment #11)
> (In reply to comment #9)
> > Apparently nouveau kernel module has not been blacklisted. Hence the nvidia
> > kernel module could not be loaded. There should be the file 
> > /etc/modprobe.d/nvidia-<kernel_flavor>.conf with the blacklist entry. It's part
> > of the G03 KMP. 
> > 
> > Please run nvidia-bug-report.sh and attach the result.
> 
> Curiously,
> Although "locate" has nvidia-bug-report.sh in its database, it's not on my
> system now. Am I supposed to re-install the proprietary driver package,
> re-produce the problem and then run this script?

If you're lacking this script (it should be in /usr/bin/nvidia-bug-report.sh; it's part of x11-video-nvidiaG03 package), your system is messed up anyway. So the answer is yes.
Comment 13 Stefan Dirsch 2013-03-18 16:17:50 UTC
(In reply to comment #10)
> Solution for me:
> Upgrade nvidia 304 ==> 310 driver
> 
> error description: 
> 1) After a migration from 12.2 to 12.3, the old 304 driver is not uninstalled
> 2) The path for nvidia.ko is wrong.
>  The RPM installs nvidia 310 in /lib/modules/3.7.9-1.1-desktop/updates/nvidia.ko
> 3) modprobe or xconfig cannot find nvidia, or show a message like "module has
> version 304 and not 310 ..."
> 4) The nvidia path should be
> /lib/modules/3.7.10-1.1-desktop/weak-updates/updates/nvidia.ko
> 
> Solution:
> Copy or Link nvidia.ko from /lib/modules/3.7.9-1.1-desktop/updates/nvidia.ko
> ==> /lib/modules/3.7.10-1.1-desktop/weak-updates/updates/nvidia.ko
> 
> I hope this is the solution for this Problem

This sounds more like bnc#802624.
Comment 14 Tony Su 2013-03-18 17:02:04 UTC
(In reply to comment #9)
> Apparently nouveau kernel module has not been blacklisted. Hence the nvidia
> kernel module could not be loaded. There should be the file 
> /etc/modprobe.d/nvidia-<kernel_flavor>.conf with the blacklist entry. It's part
> of the G03 KMP. 
> 
> Please run nvidia-bug-report.sh and attach the result.

Curiously,
Although "locate" has nvidia-bug-report.sh in its database, it's not on my system now. Am I supposed to re-install the proprietary driver package, re-produce the problem and then run this script?
Comment 15 Tony Su 2013-03-18 18:19:03 UTC
Created attachment 530244 [details]
nvidia bug report generated from script

From the machine fixed using the nouveau driver,
zypper mr -e download.nvidia.com-opensuse
zypper in nvidia-gfxG03-kmp-default
reboot
ALT-F1 (and login as root)
execute the bug reporting script.
BTW- Note that I'm also uploading the stdout of the install which includes some specific build errors
Comment 16 Tony Su 2013-03-18 18:23:09 UTC
Created attachment 530246 [details]
G03 build stdout

When the G03 driver was being installed, an error appeared in the stdout. The entire stdout is attached; this is the part of note:

  Building modules, stage 2.
  MODPOST 1 modules
WARNING: could not find /usr/src/kernel-modules/nvidia-310.32-default/.nv-kernel.o.cmd for /usr/src/kernel-modules/nvidia-310.32-default/nv-kernel.o
  CC      /usr/src/kernel-modules/nvidia-310.32-default/nvidia.mod.o
In file included from /usr/src/linux-3.7.10-1.1/include/linux/kernel.h:10:0,
                 from /usr/src/linux-3.7.10-1.1/include/linux/cache.h:4,
                 from /usr/src/linux-3.7.10-1.1/include/linux/time.h:4,
                 from /usr/src/linux-3.7.10-1.1/include/linux/stat.h:18,
                 from /usr/src/linux-3.7.10-1.1/include/linux/module.h:10,
                 from /usr/src/kernel-modules/nvidia-310.32-default/nvidia.mod.c:1:
/usr/src/linux-3.7.10-1.1/include/linux/bitops.h: In function ‘hweight_long’:
/usr/src/linux-3.7.10-1.1/include/linux/bitops.h:66:41: warning: signed and unsigned type in conditional expression [-Wsign-compare]
  LD [M]  /usr/src/kernel-modules/nvidia-310.32-default/nvidia.ko
make: Leaving directory `/usr/src/linux-3.7.10-1.1-obj/x86_64/default'
/usr/src/kernel-modules/nvidia-310.32-default /
NVIDIA: calling KBUILD...
make[1]: Entering directory `/usr/src/linux-3.7.10-1.1'
make -C /usr/src/linux-obj/x86_64/default \
KBUILD_SRC=/usr/src/linux-3.7.10-1.1 \
KBUILD_EXTMOD="/usr/src/kernel-modules/nvidia-310.32-default" -f /usr/src/linux-3.7.10-1.1/Makefile \
modules
test -e include/generated/autoconf.h -a -e include/config/auto.conf || (                \
echo >&2;                                                       \
echo >&2 "  ERROR: Kernel configuration is invalid.";           \
echo >&2 "         include/generated/autoconf.h or include/config/auto.conf are missing.";\
echo >&2 "         Run 'make oldconfig && make prepare' on kernel src to fix it.";      \
echo >&2 ;                                                      \
/bin/false)
Comment 17 Stefan Dirsch 2013-03-18 18:34:29 UTC
I believe you can ignore this error message. Seems the build worked fine. My current guess is that you're using a custom kernel, which is not kABI compatible. Thus no weak-updates symlinks are created. Run the following commands and add the output as comment here:

  uname -r
  ls /lib/modules
  find /lib/modules -name nvidia.ko
  modprobe nvidia
  modinfo nvidia
Comment 18 Tony Su 2013-04-13 18:44:01 UTC
Re Comment 17

Have been running regular kernels from Factory (3.8) and OSS (3.7 Default, Desktop and Xen). The only time I have run one of these kernels in a modified manner is running the script that enables QEMU ARM emulation, but those changes are temporary and volatile, and no actual changes are made to the installed kernels (simply adds ARM extensions only).

The following results were retrieved with the nVidia G03 driver installed. Interestingly, when I now enable the nVidia repo, "zypper up" no longer automatically selects and installs an nVidia driver. For the following results I manually selected and installed the G03 driver.
 
uname -r
3.7.10-1.1-default

ls /lib/modules
3.7.10-1.1-default
3.7.10-1.1-desktop
3.7.10-1.1-xen
3.7.6-1.2-default
3.7.6-1.2-desktop
3.7.6-1.2-xen
3.7.9-1.1-default
3.7.9-1.1-desktop
3.8.2-1-default
3.8.2-1-desktop
3.8.2-1-xen

find /lib/modules -name nvidia.ko
/lib/modules/3.8.2-1-default/weak-updates/updates/nvidia.ko
/lib/modules/3.7.10-1.1-default/updates/nvidia.ko

modprobe nvidia

modinfo nvidia
filename:       /lib/modules/3.7.10-1.1-default/updates/nvidia.ko
alias:          char-major-195-*
version:        310.44
supported:      external
license:        NVIDIA
alias:          pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias:          pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        
vermagic:       3.8.2-1-default SMP mod_unload modversions 
parm:           NVreg_Mobile:int
parm:           NVreg_ResmanDebugLevel:int
parm:           NVreg_RmLogonRC:int
parm:           NVreg_ModifyDeviceFiles:int
parm:           NVreg_DeviceFileUID:int
parm:           NVreg_DeviceFileGID:int
parm:           NVreg_DeviceFileMode:int
parm:           NVreg_RemapLimit:int
parm:           NVreg_UpdateMemoryTypes:int
parm:           NVreg_InitializeSystemMemoryAllocations:int
parm:           NVreg_RMEdgeIntrCheck:int
parm:           NVreg_UsePageAttributeTable:int
parm:           NVreg_MapRegistersEarly:int
parm:           NVreg_RegisterForACPIEvents:int
parm:           NVreg_CheckPCIConfigSpace:int
parm:           NVreg_EnablePCIeGen3:int
parm:           NVreg_EnableMSI:int
parm:           NVreg_RegistryDwords:charp
parm:           NVreg_RmMsg:charp
Comment 19 Stefan Dirsch 2013-04-15 09:40:52 UTC
Ok. So weak-updates links have only been created for

3.8.2-1-default
3.7.10-1.1-default

So all the other kernels apparently aren't kABI compatible. But honestly I don't want to investigate the issue for 12 different kernels installed on the system. xen isn't supported by NVIDIA though. This I can tell you for sure. ;-)
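
With this many kernels installed, a per-kernel overview is easier to read than raw find output. The sketch below only checks the two locations that appear in this report (updates and weak-updates/updates); the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: nvidia_module_report is a hypothetical helper.
# Prints one line per installed kernel saying where nvidia.ko was found.
nvidia_module_report() {
    root=${1:-/lib/modules}
    for kdir in "$root"/*/; do
        [ -d "$kdir" ] || continue
        kver=$(basename "$kdir")
        if [ -e "$kdir/updates/nvidia.ko" ]; then
            echo "$kver: built (updates)"
        elif [ -e "$kdir/weak-updates/updates/nvidia.ko" ]; then
            echo "$kver: linked (weak-updates)"
        else
            echo "$kver: no nvidia.ko"
        fi
    done
}
```

Any kernel reported as "no nvidia.ko" would be one where the driver cannot be loaded after booting into it.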
Comment 20 Stefan Dirsch 2015-01-07 14:37:49 UTC
Product is no longer supported. In case the issue is still reproducible on a maintained product (at the moment: openSUSE 13.1 or later), feel free to reopen.