Bug 1090122

Summary: Intel Kabylake: Fallback from Wayland to Xorg fails
Product: [openSUSE] openSUSE Distribution Reporter: Vladimir FROMENT <tutux84>
Component: X.Org Assignee: E-mail List <xorg-maintainer-bugs>
Status: RESOLVED FIXED QA Contact: E-mail List <xorg-maintainer-bugs>
Severity: Normal    
Priority: P3 - Medium CC: msrb, mstaudt, sndirsch, tiwai, tutux84
Version: Leap 15.0   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: Journalctl logs
180426-KOTD boot failure journalctl logs

Description Vladimir FROMENT 2018-04-18 19:20:04 UTC
Hi,

In /etc/gdm/custom.conf, if WaylandEnable is set to false, then after rebooting and typing login and password into the GNOME login screen, the system hangs. I suspect a kernel panic, but I don't know what log I can provide to prove it, since it seems impossible to switch to a console TTY: Ctrl+Alt+F1 to F12 don't respond once the system is frozen.
The problem was also present in the previous build (197.1 I believe). Reproducibility: always
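
For reference, the setting in question normally sits in the [daemon] section of /etc/gdm/custom.conf. Below is a minimal sketch, not taken from the report, of a helper that tells whether an active (uncommented) WaylandEnable=false line is present. The function name and the optional path argument are hypothetical and only there for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if GDM is configured to skip Wayland.
# The optional argument overrides the config path, which is handy for testing.
wayland_disabled() {
    conf="${1:-/etc/gdm/custom.conf}"
    # Match only an active (uncommented) line; spaces around "=" are tolerated.
    grep -Eq '^[[:space:]]*WaylandEnable[[:space:]]*=[[:space:]]*false[[:space:]]*$' "$conf"
}
```

Possible usage: `wayland_disabled && echo "GDM is set to start on Xorg"`.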

For your info, I need to fall back to Xorg because Wayland doesn't detect my external screen (I haven't had time to look for troubleshooting tips yet, but I may open a bug report in a few days).

===========================
Some info about my system follows. In a nutshell: a 2017 Optimus laptop without any drivers installed apart from those provided by the installation process. I also use an encrypted /home.
uname -a: Linux linux-5udt 4.12.14-lp150.8-default #1 SMP Sat Apr 7 05:12:52 UTC 2018 (8719fc4) x86_64 x86_64 x86_64 GNU/Linux

inxi -F:
System:    Host: linux-5udt Kernel: 4.12.14-lp150.8-default x86_64 bits: 64
           Desktop: Gnome 3.26.2 Distro: openSUSE Leap 15.0 Beta
Machine:   Device: laptop System: GIGABYTE product: P64V7 serial: HH9006711A0002
           Mobo: GIGABYTE model: P64V7 serial: N/A
           UEFI: American Megatrends v: FB09 date: 07/28/2017
Battery    BAT1: charge: 78.4 Wh 83.2% condition: 94.2/94.2 Wh (100%)
CPU:       Quad core Intel Core i7-7700HQ (-HT-MCP-) cache: 6144 KB
           clock speeds: max: 3800 MHz 1: 2800 MHz 2: 2800 MHz 3: 2800 MHz
           4: 2800 MHz 5: 2800 MHz 6: 2800 MHz 7: 2800 MHz 8: 2800 MHz
Graphics:  Card-1: Intel Device 591b
           Card-2: NVIDIA GP106M [GeForce GTX 1060 Mobile]
           Display Server: wayland (X.org 1.19.6 ) driver: i915
           tty size: 80x24 Advanced Data: N/A for root
Audio:     Card Intel CM238 HD Audio Controller driver: snd_hda_intel
           Sound: ALSA v: k4.12.14-lp150.8-default
Network:   Card-1: Intel Wireless 8260 driver: iwlwifi
           IF: wlan1 state: down mac: 9a:0c:b4:38:1c:b1
           Card-2: Realtek RTL8153 Gigabit Ethernet Adapter driver: r8152
           IF: eth0 state: N/A speed: N/A duplex: N/A mac: N/A
Drives:    HDD Total Size: 525.1GB (1.6% used)
           ID-1: /dev/sda model: Crucial_CT525MX3 size: 525.1GB
Partition: ID-1: / size: 18G used: 6.0G (35%) fs: btrfs dev: /dev/sda4
           ID-2: /var size: 18G used: 6.0G (35%) fs: btrfs dev: /dev/sda4
           ID-3: /opt size: 18G used: 6.0G (35%) fs: btrfs dev: /dev/sda4
           ID-4: /tmp size: 18G used: 6.0G (35%) fs: btrfs dev: /dev/sda4
           ID-5: /home size: 3.0G used: 118M (4%) fs: xfs dev: /dev/dm-0
           ID-6: swap-1 size: 2.15GB used: 0.00GB (0%)
           fs: swap dev: /dev/sda6
Sensors:   None detected - is lm-sensors installed and configured?
Info:      Processes: 321 Uptime: 0:24 Memory: 1449.3/15918.5MB
           Init: systemd runlevel: 5 Client: Shell (bash) inxi: 2.3.40
Comment 1 Max Staudt 2018-04-19 09:45:17 UTC
Thank you for your bug report!

Comments below:


(In reply to Vladimir FROMENT from comment #0)
> In /etc/gdm/custom.conf, if WaylandEnable is set to false, after rebooting
> and typing login&password into the Gnome login screen, the system hang. I
> suspect a kernel panic but I don't know what log I can provide to prove it
> since it seems impossible to switch to a console TTY : ctrl+alt+F1 to F12
> doesn't respond once the system is frozen.

Anything to be found in the system journal? Maybe the system had time to write an error message to disk - please use journalctl to have a look. On the other hand, maybe the X server hung, but the system itself is still alive, and VT switching is dead because X and GDM (or maybe earlier: Plymouth and GDM) are fighting for VT_SETMODE and produce a deadlock in the VT subsystem. Been there before.


> For your info, I need to fallback to Xorg because Wayland doesn't detect my
> external screen (I haven't had time to look for troubleshoot tips yet but I
> may open a bug report in a few days).

Sigh. That's something for the desktop team, I guess.


> CPU:       Quad core Intel Core i7-7700HQ (-HT-MCP-) cache: 6144 KB

Kaby Lake - that's pretty darn new. I suspect all display outputs are connected to the Intel GPU. Once the Nvidia card is blocked, things should "just work".


> Graphics:  Card-1: Intel Device 591b
>            Card-2: NVIDIA GP106M [GeForce GTX 1060 Mobile]

Whoops. By default, the nouveau kernel driver is in use, which is known to do funky things with some cards.

Can you please blacklist the nouveau kernel module, rebuild the initramfs (by calling mkinitrd) and then reboot? Maybe that'll fix it... You can use lsinitrd to check that nouveau.ko is not contained in the resulting initramfs.
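
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names and the snippet file name 50-blacklist-<module>.conf are made up here, and the commands need root when pointed at /etc/modprobe.d.

```shell
#!/bin/sh
# Write a modprobe blacklist snippet for the given module.
# Assumption: the file name 50-blacklist-<module>.conf is just a convention
# chosen for this sketch; the optional second argument overrides the directory.
blacklist_module() {
    module="$1"
    conf_dir="${2:-/etc/modprobe.d}"
    printf 'blacklist %s\n' "$module" > "$conf_dir/50-blacklist-$module.conf"
}

# Read an initramfs file listing (for example the output of lsinitrd) on stdin
# and succeed if it still contains <module>.ko, with or without a compression
# suffix such as .xz.
listing_has_module() {
    grep -q "/$1\.ko"
}
```

A possible sequence, run as root: `blacklist_module nouveau`, then `mkinitrd`, then `lsinitrd | listing_has_module nouveau && echo "nouveau is still in the initramfs"`.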


Thanks!
Comment 2 Stefan Dirsch 2018-04-19 09:50:16 UTC
Ok. Please attach /var/log/gdm/greeter.log and /home/<user>/.local/share/xorg/Xorg.1.log first. You may end up disabling one of your two GPUs in order to get rid of these issues though.
Comment 3 Vladimir FROMENT 2018-04-19 16:02:15 UTC
Created attachment 767737 [details]
Journalctl logs

A bit of overview about the timestamps in these journalctl logs:

vlad@linux-5udt:~> grep -i "\-\- Reboot" -B1 Documents/journalctl-Xorg-fallback.txt 
avril 19 17:40:40 linux-5udt systemd-journald[423]: Journal stopped
-- Reboot --    ## This reboot happens after disabling Wayland in /etc/gdm/custom.conf
--
avril 19 17:41:48 linux-5udt systemd[1]: Startup finished in 5.623s (kernel) + 5.200s (initrd) + 47.118s (userspace) = 1min 5.494s.
-- Reboot --    ## At about 17:41 I started typing login/pass in GDM, I waited until 17:44 before hard rebooting due to frozen state. So the interesting traces are right before this point
--
avril 19 17:46:39 linux-5udt systemd-journald[461]: Journal stopped
-- Reboot --    ## Right after re-enabling Wayland
Comment 4 Vladimir FROMENT 2018-04-19 16:07:35 UTC
(In reply to Stefan Dirsch from comment #2)
> Ok. Please attach /var/log/gdm/greeter.log and
> /home/<user>/.local/share/xorg/Xorg.1.log first. You may end up disabling
> one of your two GPUs in order to get rid of these issues though.

/var/log/gdm is empty on my system.
And there is no ~/.local/share/xorg folder in my case.

I will try to disable nouveau module right now.
Comment 5 Vladimir FROMENT 2018-04-19 16:41:37 UTC
It appears that nouveau was already blacklisted, I forgot I had installed bumblebee some time ago... So I did various tests to make sure that the Xorg fallback failure wasn't related to bumblebee. It isn't, it keeps failing.
I also tried with nouveau enabled again, but it didn't change anything. FYI, I did run the initrd command after enabling it.
Now the system is in the situation where nouveau is blacklisted again and bumblebee is removed (I reinstalled bbswitch though).
Comment 6 Stefan Dirsch 2018-04-19 19:05:41 UTC
Hmm. Nothing obvious in the X logfile I found in the journalctl. Indeed nouveau is disabled. It's an Intel Kabylake GPU. No idea why it freezes.
Comment 7 Stefan Dirsch 2018-04-20 13:24:55 UTC
Another option would be to disable Intel graphics (in firmware), if possible, and then run NVIDIA's proprietary driver. But I'm not sure whether the hardware supports this (for all needed outputs).
Comment 8 Vladimir FROMENT 2018-04-21 22:03:05 UTC
(In reply to Stefan Dirsch from comment #7)
> Another option would be to disable Intel graphics (in Firmware) - if
> possible and then run NVIDIA's proprietary driver. But I'm not sure, whether
> the hardware supports this (for all needed outputs).

Do you mean disabling the Intel GPU in the BIOS? It is not possible with this laptop.
Eventually, by following [1] and [2], I could fix the fallback issue by setting the kernel parameter "i915.enable_guc=1". This option apparently enables advanced drivers for recent Intel chipsets. Following the advice in [2], I also added enable_rc6=1, enable_fbc=1, enable_psr=1, disable_power_well=0 and semaphores=1. That does not seem to have introduced any regression in my use cases, under either Wayland or Xorg.
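
After adding a parameter in the bootloader configuration and rebooting, it can be worth confirming that the running kernel actually received it. The following is a minimal sketch; the function name and the optional file argument are hypothetical, while /proc/cmdline is the standard place to read the running kernel's command line.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the exact parameter appears as a whole word
# on the kernel command line. The optional second argument overrides the file
# to read, which defaults to /proc/cmdline.
cmdline_has_param() {
    param="$1"
    file="${2:-/proc/cmdline}"
    # Put each space-separated word on its own line, then look for an exact,
    # fixed-string, whole-line match.
    tr ' ' '\n' < "$file" | grep -Fxq -- "$param"
}
```

Possible usage: `cmdline_has_param i915.enable_guc=1 && echo "parameter is active"`.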

So the bug report can be considered fixed from my point of view (although my external screen is still not detected, which is odd because Ubuntu 17.10 detects it, but that's another story). Unless you need more info/logs from me?

[1] https://wiki.archlinux.org/index.php/intel_graphics
[2] https://gist.github.com/Brainiarc7/aa43570f512906e882ad6cdd835efe57
Comment 9 Stefan Dirsch 2018-04-23 10:21:28 UTC
(In reply to Vladimir FROMENT from comment #8)
> (In reply to Stefan Dirsch from comment #7)
> > Another option would be to disable Intel graphics (in Firmware) - if
> > possible and then run NVIDIA's proprietary driver. But I'm not sure, whether
> > the hardware supports this (for all needed outputs).
> 
> Do you mean disabling the Intel GPU in the BIOS ? It is not possible with
> this laptop.

That's why I wrote *if possible*. ;-) Obviously, this is not an option on your system then.

> Eventually, by following [1] and [2], I could fix the fallback issue by
> setting the kernel parameter "i915.enable_guc=1". This option apparently
> enable advanced drivers for recent Intel chipsets. Following [2] advices, I
> also added enable_rc6=1, enable_fbc=1, enable_psr=1, disable_power_well=0
> and semaphores=1. That seems not to have introduced any regression in my use
> cases. Either under Wayland and Xorg.
> 
> So the bug report can be considered fixed from my point of view (although my
> external screen is still not detected, which is odd because Ubuntu 17.10
> does it, but that's another story). Unless you need more info/logs from me ?
> 
> [1] https://wiki.archlinux.org/index.php/intel_graphics
> [2] https://gist.github.com/Brainiarc7/aa43570f512906e882ad6cdd835efe57

Well, I would call this a workaround, not a fix. Seems option "i915.enable_guc=1" is enough to fix the issue for you, right?
Comment 10 Vladimir FROMENT 2018-04-23 16:29:05 UTC
Yes, this sole option was enough. The others I mentioned were added after validating that my fallback issue was solved.
Comment 11 Stefan Dirsch 2018-04-23 16:45:21 UTC
Thanks!
Comment 12 Max Staudt 2018-04-25 10:19:42 UTC
Takashi, FYI. Seems like some Kaby Lake chips have funky firmware loading. Do you know about this?
Comment 13 Takashi Iwai 2018-04-25 11:03:04 UTC
Yes, the i915 driver loads a few different kinds of firmware files (DMC, GuC and HuC in addition to CSR, VBT and GVT-d stuff).

Here the firmware in question is the second one, GuC, and I thought this should have been loaded / enabled automatically for CFL.
Currently the GuC firmware for CFL is identical to the one for KBL.

What does /sys/module/i915/parameters/enable_guc show if you don't pass the value -1?  After loading the driver, it'll be set to either 0, 1 or 2.
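
A minimal sketch for reading that value, assuming the i915 module is loaded and exposes the parameter at the sysfs path mentioned above. The function name and the optional path argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the current value of the i915 enable_guc
# parameter. The optional argument overrides the sysfs path for testing.
show_enable_guc() {
    path="${1:-/sys/module/i915/parameters/enable_guc}"
    if [ -r "$path" ]; then
        printf 'enable_guc=%s\n' "$(cat "$path")"
    else
        # The file is missing when i915 is not loaded or lacks this parameter.
        printf 'enable_guc: not readable (is i915 loaded?)\n' >&2
        return 1
    fi
}
```

Running `show_enable_guc` without arguments reads the live sysfs value.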
Comment 14 Stefan Dirsch 2018-04-25 12:26:04 UTC
> Graphics:  Card-1: Intel Device 591b

Takashi, are you sure this is CFL (Coffee Lake)?

#define INTEL_KBL_GT2_IDS(info) \
[...]
 INTEL_VGA_DEVICE(0x591B, info), /* Halo GT2 */ \

Coffee Lake has different IDs (0x3E??) according to the current linux/drm/i915_pciids.h.
Comment 15 Stefan Dirsch 2018-04-25 12:27:42 UTC
(In reply to Takashi Iwai from comment #13)
> What shows /sys/module/i915/parameters/enable_guc if you don't pass the
> value -1?  After loading the driver, it'll be set to either 0, 1 or 2.

According to 

https://wiki.archlinux.org/index.php/intel_graphics#Enable_GuC_.2F_HuC_firmware_loading

this came with Kernel 4.16.
Comment 16 Stefan Dirsch 2018-04-25 12:28:32 UTC
(In reply to Stefan Dirsch from comment #15)
> (In reply to Takashi Iwai from comment #13)
> > What shows /sys/module/i915/parameters/enable_guc if you don't pass the
> > value -1?  After loading the driver, it'll be set to either 0, 1 or 2.
> 
> According to 
> 
> https://wiki.archlinux.org/index.php/intel_graphics#Enable_GuC_.
> 2F_HuC_firmware_loading
> 
> this came with Kernel 4.16.

But maybe it's already in sle15/Leap 15 with our backports.
Comment 17 Takashi Iwai 2018-04-25 12:31:25 UTC
(In reply to Stefan Dirsch from comment #14)
> > Graphics:  Card-1: Intel Device 591b
> 
> Takashi, sure this is CFL (Coffelake)?

Sorry, I was confused.  The chip in question is Kaby Lake (KBL).

But the question still stands.  Both KBL and CFL use the same firmware, and guc loading should have been enabled without the extra option.
Comment 18 Takashi Iwai 2018-04-25 12:31:58 UTC
(In reply to Stefan Dirsch from comment #16)
> (In reply to Stefan Dirsch from comment #15)
> > (In reply to Takashi Iwai from comment #13)
> > > What shows /sys/module/i915/parameters/enable_guc if you don't pass the
> > > value -1?  After loading the driver, it'll be set to either 0, 1 or 2.
> > 
> > According to 
> > 
> > https://wiki.archlinux.org/index.php/intel_graphics#Enable_GuC_.
> > 2F_HuC_firmware_loading
> > 
> > this came with Kernel 4.16.
> 
> But maybe it's already in sle15/Leap 15 with our backports.

Yes.  SLE15 / openSUSE Leap 15.0 kernel already got tons of backports and i915 driver is almost equivalent with 4.16.
Comment 19 Stefan Dirsch 2018-04-25 14:28:02 UTC
(In reply to Takashi Iwai from comment #13)
> What shows /sys/module/i915/parameters/enable_guc if you don't pass the
> value -1?  After loading the driver, it'll be set to either 0, 1 or 2.

For this please test without option

  "i915.enable_guc=1"

(and all the other options). If needed, i.e. you're using an /etc/modprobe.d file snippet, recreate initrd afterwards via

  mkinitrd
Comment 20 Vladimir FROMENT 2018-04-25 18:41:26 UTC
(In reply to Stefan Dirsch from comment #19)
> (In reply to Takashi Iwai from comment #13)
> > What shows /sys/module/i915/parameters/enable_guc if you don't pass the
> > value -1?  After loading the driver, it'll be set to either 0, 1 or 2.
> 
> For this please test without option
> 
>   "i915.enable_guc=1"
> 
> (and all the other options). If needed, i.e. you're using an /etc/modprobe.d
> file snippet, recreate initrd afterwards via
> 
>   mkinitrd

So after disabling all above-mentioned options in YaST > Bootloader and rebooting, the value of /sys/module/i915/parameters/enable_guc is 0.
Comment 21 Takashi Iwai 2018-04-25 19:28:37 UTC
Thanks.  I checked the recent code, and indeed the default value is zero.
It was changed from -1 to 0 some time ago due to the latency issues and S4 resume problem, according to the git log.

If the enable_guc=1 option alone really helps, I believe it's worth reporting to the upstream devs.  It'd be great if you can double-check it.
Comment 22 Stefan Dirsch 2018-04-26 09:23:46 UTC
(In reply to Takashi Iwai from comment #21)
> Thanks.  I checked the recent code, and indeed the default value is zero.
> It was changed from -1 to 0 some time ago due to the latency issues and S4
> resume problem, according to the git log.

Ok. Interesting.

> If enable_guc=1 option alone really helps, I believe it's worth to report to
> upstream devs.  

Is that supposed to be done by us or by the reporter?

> It'd be great if you can double-check it.

That's what the reporter did before (comment #10). So should he really *double* check literally?
Comment 23 Takashi Iwai 2018-04-26 09:35:33 UTC
(In reply to Stefan Dirsch from comment #22)
> (In reply to Takashi Iwai from comment #21)
> > Thanks.  I checked the recent code, and indeed the default value is zero.
> > It was changed from -1 to 0 some time ago due to the latency issues and S4
> > resume problem, according to the git log.
> 
> Ok. Interesting.
> 
> > If enable_guc=1 option alone really helps, I believe it's worth to report to
> > upstream devs.  
> 
> Which is supposed to be done by us or the reporter?

Ideally someone who owns the hardware and can test, so the reporter would be the best option.  Most likely the upstream devs will ask for testing of the latest development version or some patch, so we should be in Cc, of course.

> > It'd be great if you can double-check it.
> 
> That's what the reporter did before (comment #10). So should he really
> *double* check literally?

Yes, we need to test with the latest upstream version before reporting to upstream, at least.
4.17-rc kernel is found in OBS Kernel:HEAD repo, and 4.16.x is in OBS Kernel:stable repo.

I *guess* the problem remains, but if these versions work, there is hope for a quicker fix.
Comment 24 Stefan Dirsch 2018-04-26 09:44:59 UTC
Thanks. Vladimir, could you please test our KOTD? (currently 4.17-rc)

https://en.opensuse.org/openSUSE:Kernel_of_the_day
Comment 25 Vladimir FROMENT 2018-04-26 17:35:01 UTC
(In reply to Stefan Dirsch from comment #24)
> Thanks. Vladimir, could you please test our KOTD? (currently 4.17-rc)
> 
> https://en.opensuse.org/openSUSE:Kernel_of_the_day

Installed KOTD with this command:
rpm -i --force http://download.opensuse.org/repositories/Kernel:/HEAD/standard/x86_64/kernel-default-4.17.rc2-2.1.g0fad7ab.x86_64.rpm

But the system fails to boot correctly. I get an error message at boot time saying "[FAILED] Failed to start Load Kernel Modules". The system doesn't get as far as loading gdm and ends up in maintenance mode. The journalctl logs will be attached right away.
It seems related to the encrypted /home. I can reinstall the latest Leap Beta with an unencrypted /home, but that would not represent my normal setup.

On the other hand, prior to installing KOTD, I upgraded my Leap Beta via "zypper dup" and the workaround doesn't work anymore, even with i915.enable_guc=1. Wayland loads but the fallback to Xorg doesn't (same symptoms as before). It was beta 206.1 before the upgrade.

Let me know what you would need from me to move forward. I should have some time this weekend to test multiple setups if needed.
Comment 26 Vladimir FROMENT 2018-04-26 17:35:50 UTC
Created attachment 768443 [details]
180426-KOTD boot failure journalctl logs
Comment 27 Stefan Dirsch 2018-04-26 19:06:46 UTC
OMG. :-(
Comment 28 Max Staudt 2018-04-27 14:03:34 UTC
Ummm...


  avril 26 19:06:32 linux-5udt systemd-cryptsetup[605]: Set cipher aes, mode xts-plain64, key size 256 bits for device /dev/disk/by-uuid/62d95b82-3e11-4713-85c2-8f7a9bd8b1d4.
  avril 26 19:06:34 linux-5udt kernel: device-mapper: table: 254:0: crypt: unknown target type
  avril 26 19:06:34 linux-5udt kernel: device-mapper: ioctl: error adding target to table
  avril 26 19:06:34 linux-5udt systemd-cryptsetup[605]: Failed to activate: Input/output error
  avril 26 19:06:34 linux-5udt systemd[1]: systemd-cryptsetup@cr_sda5.service: Main process exited, code=exited, status=1/FAILURE
  avril 26 19:06:34 linux-5udt systemd[1]: Failed to start Cryptography Setup for cr_sda5.
  avril 26 19:06:34 linux-5udt systemd[1]: Dependency failed for Encrypted Volumes.


Sounds like your system fails to unlock your encrypted home.

Also, there are no messages regarding the i915 driver in your log, so it seems that the KMS graphics driver isn't even loaded.

Looks like a kernel or base system bug to me. Totally unrelated to your graphics problems.
Comment 29 Stefan Dirsch 2018-04-27 14:03:57 UTC
Hope you installed the KOTD in addition to the existing one and can still boot the old one?
Comment 30 Vladimir FROMENT 2018-04-28 12:44:48 UTC
(In reply to Max Staudt from comment #28)
> Looks like a kernel or base system bug to me. Totally unrelated to your
> graphics problems.

Agreed.

(In reply to Stefan Dirsch from comment #29)
> Hope you installed the KOTD in addition to the existing one and can still
> boot the old one?

Yes, I had no problem booting the old kernel. All is working fine.
But that kind of messed up the troubleshooting path, hehe.
Comment 31 Vladimir FROMENT 2018-05-07 08:05:57 UTC
After a lot of tests, I installed beta 234.2 from scratch. I noticed an option on the GDM login screen that offers to load GNOME on Wayland or GNOME on Xorg.
I don't know if this option was present in the previous betas, but it does the trick. GNOME loads without the Intel GuC driver (/sys/module/i915/parameters/enable_guc is set to 0, with default bootloader parameters).

I still have some instabilities in one use case on Xorg, which I will report in detail when I have more time in a few days (the external screen is detected on Xorg, but after a reboot GDM does not show up anymore). But regarding the initial report, I would say the issue is solved or worked around ;)
Comment 32 Stefan Dirsch 2018-05-07 09:42:23 UTC
Hmm. So I guess this is again *without* "WaylandEnable=false" in /etc/gdm/custom.conf, right? Can you confirm this?

So running gdm itself on Wayland and then running the GNOME session on top of X.Org appears to work for you, for whatever reason.

I guess we can close the issue then, since this is what people will try if GNOME on Wayland does not work.
Comment 33 Vladimir FROMENT 2018-05-12 09:20:49 UTC
(In reply to Stefan Dirsch from comment #32)
> Hmm. So I guess this is again *without* "WaylandEnable=false" in
> /etc/gdm/custom.conf, right? Can you confirm this?

I confirm.
Comment 34 Stefan Dirsch 2018-05-12 09:52:55 UTC
Ok. Let's close this one then.

> I still have some instabilities in one use case on Xorg I will report in detail
> when I have more time in a few days (external screen detected on Xorg, but after
> reboot, GDM do not show up anymore). But regarding the initial report, I would
> say the issue is solved or workarounded ;)

Please use a separate bug report for this then, but you could refer to this bug report there. Thanks!
Comment 35 Stefan Dirsch 2018-05-12 09:53:18 UTC
Considered fixed.