|
Bugzilla – Full Text Bug Listing |
| Summary: | GPU hang | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE Distribution | Reporter: | Dave Plater <davejplater> |
| Component: | Kernel | Assignee: | E-mail List <kernel-maintainers> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P5 - None | CC: | astieger, carlos.e.r, davejplater, linuxkamarada, lv, martin.schlander, mmarek, nwr10cst-oslnx, opensuse, patrik.jakobsson, psychonaut, tiwai, wbauer |
| Version: | Leap 42.3 | ||
| Target Milestone: | --- | ||
| Hardware: | x86-64 | ||
| OS: | SUSE Other | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: |
gpu crash dump
dmesg log gpu hang
CER: Messages log
CER: gpu log
CER: hwinfo output |
||
(In reply to Dave Plater from comment #0)
> In plasma5 the work space freezes for a period intermittently and journalctl
> has this output:
> Jul 24 10:14:49 arbuthnot kernel: [drm] GPU HANG: ecode 6:0:0xbd73ffff, in
> plasmashell [3124], reason: Hang on render ring, action: reset
> Jul 24 10:14:49 arbuthnot kernel: [drm] GPU hangs can indicate a bug
> anywhere in the entire gfx stack, including userspace.
> Jul 24 10:14:49 arbuthnot kernel: [drm] Please file a _new_ bug report on
> bugs.freedesktop.org against DRI -> DRM/Intel
> Jul 24 10:14:49 arbuthnot kernel: [drm] drm/i915 developers can then
> reassign to the right component if it's not a kernel issue.
> Jul 24 10:14:49 arbuthnot kernel: [drm] The gpu crash dump is required to
> analyze gpu hangs, so please always attach it.
> Jul 24 10:14:49 arbuthnot kernel: [drm] GPU crash dump saved to
> /sys/class/drm/card0/error
> Jul 24 10:14:49 arbuthnot kernel: drm/i915: Resetting chip after gpu hang

This rather looks like a kernel/DRM or intel graphics driver problem to me... Reassigning to the Kernel for now.

As suggested on the factory ml, I've removed drm-kmp-default and nomodeset, and rebooted. So far I haven't had a gpu hang.

I think I can safely state that removing drm-kmp-default definitely fixes the gpu hangs. It shouldn't be pulled in for affected GPUs.

OK, could you check what is the typical way to reproduce the bug? I'll try to test a SandyBridge machine here, but I'd like to know the procedure.

To reproduce: install drm-kmp-default in a 42.3 installation and boot into plasma5. I normally have a saved session with firefox and konsole, both with multiple tabs, and one kwrite instance. Four desktops, firefox on 1 and konsole/kwrite on 4. The hanging starts when I open thunderbird on desktop 2 and go back to firefox. Using sddm window manager. I'll try xfce, which is my backup gui.

OK, thanks. I think I can see the issue reliably on the local machine. I just need to boot with smaller memory, e.g. mem=1G boot option, start KDE, then open Firefox. That triggers the GPU hang immediately. It implies an issue in the page handling.

BTW, are you using the intel X driver (i.e. is xf86-video-intel installed)? On a freshly installed Leap 42.3 system, I didn't have it but the modesetting driver is used instead. The problem happens no matter which X driver is used, so it doesn't matter much, but the devil is always in the details, hence I'd like to make sure.

Anyway, it'd be good to know whether this happens on XFCE, too (with or without compositor).

i+ | xf86-video-amdgpu | package | 1.3.0-1.1 | x86_64 | oss
i+ | xf86-video-amdgpu | package | 1.3.0-1.1 | x86_64 | Main Repository (OSS)
i+ | xf86-video-fbdev | package | 0.4.4-9.4 | x86_64 | oss
i+ | xf86-video-fbdev | package | 0.4.4-9.4 | x86_64 | Main Repository (OSS)
i+ | xf86-video-intel | package | 2.99.917.770_gcb6ba2da-1.3 | x86_64 | oss
i+ | xf86-video-intel | package | 2.99.917.770_gcb6ba2da-1.3 | x86_64 | Main Repository (OSS)
i+ | xf86-video-nouveau | package | 1.0.15-1.3 | x86_64 | oss
i+ | xf86-video-nouveau | package | 1.0.15-1.3 | x86_64 | Main Repository (OSS)
i+ | xf86-video-vesa | package | 2.3.4-9.4 | x86_64 | oss
i+ | xf86-video-vesa | package | 2.3.4-9.4 | x86_64 | Main Repository (OSS)

I just rebooted with drm-kmp-default reinstalled, but after deleting nomodeset I set runlevel 3, tailed journalctl on tty2 and ran init 5 on tty1, and I haven't had a hang in plasma5 yet. I've confirmed that "fbcon: inteldrmfb (fb0) is primary device" is in the journal; it only occurs when I boot without nomodeset and drm-kmp-default is installed. I'll try a reboot straight into runlevel 5.

It happened while I was in firefox and thunderbird's new mail notification popped up. I'm not confident that I can reproduce it in xfce because I don't think that its load will be enough. Now that I'm trying to reproduce it, even with a full screen video, it's hard to reproduce.

Could you try to test XFCE with the smaller memory size as I did?
In my case with KDE, mem=1G sufficed to trigger the problem quickly.
You can try a slightly smaller value, too.

(In reply to Takashi Iwai from comment #9)
> Could you try to test XFCE with the smaller memory size as I did?
> In my case with KDE, mem=1G sufficed to trigger the problem quickly.
> You can try a slightly smaller value, too.

You mean video memory? The lowest I can go is 32M framebuffer and 128M graphics. I'm using xfce now with those minimums.

No, I meant the whole RAM size. You can limit the size by passing the mem=XXX boot option, where XXX is the size (e.g. 1G, 512M, etc).

The problem of the i915 driver is tied to the RAM size. When user-space uses more memory and free pages become tight, the system tries to swap out, and the i915 driver tries to shrink its page lists. The problem seems to happen during that.

plasma5 had a hang immediately with mem=1G. After pressing ctrl-backspace twice I logged into xfce and, apart from it being very slow switching applications/desktops (I had to close kicad), I haven't had a gpu hang yet, even with the thunderbird pop up. Going back to my normal 4G, I've got work to do.

Created attachment 733736 [details]
gpu crash dump
Crash dump from the last hang with mem=1G
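For anyone who needs to attach the same kind of dump, the following is a minimal sketch of how the error state could be saved to a file before it is lost. The source path is the one printed in the kernel log quoted in this report; the function name `save_gpu_dump`, its arguments and the output file naming are assumptions made for illustration, and reading the sysfs file usually requires root.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): copy the i915 error state
# to a timestamped file so it can be attached to a bug report.
#   $1  source file (defaults to the path named in the kernel log)
#   $2  destination directory (defaults to the current directory)
save_gpu_dump() {
    src="${1:-/sys/class/drm/card0/error}"
    dest_dir="${2:-.}"

    # Give up early if the file is missing or not readable by this user.
    [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }

    out="$dest_dir/gpu-crash-$(date +%Y%m%d-%H%M%S).txt"

    # Read the file sequentially and write the copy.
    cat "$src" > "$out" || return 1

    # An empty copy means there was nothing to save; remove it.
    [ -s "$out" ] || { rm -f "$out"; echo "no data in $src" >&2; return 1; }

    # Print the name of the saved file for the caller.
    echo "$out"
}
```

After sourcing the function, `save_gpu_dump /sys/class/drm/card0/error /tmp` would print the name of the saved copy, which can then be attached here.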
This looks like a regression caused by the recent PM fix. I found a paper-over patch in the recent upstream, so I tried to backport it, and this seems to work.

The test drm-kmp package is being built in OBS home:tiwai:branches:openSUSE:Leap:42.3:Update/drm repo.

Retrieve the rpm via osc,
osc getbinaries home:tiwai:branches:openSUSE:Leap:42.3:Update/drm/standard/x86_64

Could you test this kmp?

(In reply to Takashi Iwai from comment #14)
> Retrieve the rpm via osc,
> osc getbinaries
> home:tiwai:branches:openSUSE:Leap:42.3:Update/drm/standard/x86_64

Now finally the package was published, too:
http://download.opensuse.org/repositories/home:/tiwai:/branches:/openSUSE:/Leap:/42.3:/Update/standard/

Installed via osc getbinaries and then booted into plasma5 with mem=1G, and no gpu hangs even with the thunderbird pop up. Looks like you fixed the bug. I'm now on normal 4G memory.

Thanks. I submitted the fix now.

This is an autogenerated message for OBS integration:
This bug (1050256) was mentioned in
https://build.opensuse.org/request/show/512623 42.3 / drm

Update does not install. The package provides multiversion(kernel), but the two versions conflict in /lib/modules/4.4.76-1-default/updates/drivers/gpu/drm/i915/i915.ko and we are not bumping the kernel version at the same time.

Rejecting for Leap 42.3 maintenance as is.

(In reply to Andreas Stieger from comment #20)
> Update does not install. The package provides multiversion(kernel), but the
> two versions conflict in
> /lib/modules/4.4.76-1-default/updates/drivers/gpu/drm/i915/i915.ko and we
> are not bumping the kernel version at the same time.
>
> Rejecting for Leap 42.3 maintenance as is.

Gah, we're still having that issue. So we need to fix the kernel package first, check in, then rebuild KMP based on it.

Michal, what is your take?

You can also submit the first update kernel and the drm fix together.

The drm update was already submitted :) So the missing piece is the fix on the kernel side (and the rebuild of KMP with it).

That bug also affected me, as I reported on the mailing list:
https://lists.opensuse.org/opensuse-factory/2017-07/msg00725.html

Removing both the xf86-video-intel and drm-kmp-default packages solved that problem for me. First I removed xf86-video-intel, which didn't solve the problem, then I removed drm-kmp-default and the crashes stopped. My system already has 2 days of uptime without crashes.

We are preparing a maintenance update for the 42.3 kernel. Is the submitted update ready to be re-built against it?

Yes, it's built fine against the new kernel. Basically we keep kABI, so KMP should always be buildable against newer kernels.

Created attachment 735685 [details]
dmesg log gpu hang

Gpu hang again :(
> uname -a
Linux linux-0mvy.suse 4.4.79-4-default #1 SMP Thu Aug 3 14:49:17 UTC 2017 (4dc78e3) x86_64 x86_64 x86_64 GNU/Linux

For openSUSE Leap 42.3, test update packages built against the current kernel can be found here:
http://download.opensuse.org/repositories/openSUSE:/Maintenance:/7039/openSUSE_Leap_42.3_Update/
http://download.opensuse.org/update/leap/42.3-test/

It installs easily with the 4.4.79 kernel already installed. It looks good now.
# uname -a
Linux linux-0mvy.suse 4.4.79-19-default #1 SMP Thu Aug 10 20:28:47 UTC 2017 (2dd03e8) x86_64 x86_64 x86_64 GNU/Linux

openSUSE-RU-2017:2194-1: An update that has two recommended fixes can now be installed.
Category: recommended (important)
Bug References: 1048155,1050256
CVE References:
Sources used:
openSUSE Leap 42.3 (src): drm-4.9.33-5.2

I have this issue after upgrading my laptop to 42.3 from 42.2, using the offline or DVD upgrade method.
CPU:
  Model: 6.23.10 "Pentium(R) Dual-Core CPU T4300 @ 2.10GHz"
Video:
  Model: "Intel Mobile 4 Series Chipset Integrated Graphics Controller"
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x2a42 "Mobile 4 Series Chipset Integrated Graphics Controller"
  SubVendor: pci 0x103c "Hewlett-Packard Company"
  SubDevice: pci 0x3069
  Revision: 0x07
  Driver: "i915"
  Driver Modules: "i915"
(hwinfo output will be attached)

Crash log:
<3.6> 2018-01-27 12:47:05 minas-tirith systemd 1 - - Started Postfix Mail Transport Agent.
<0.6> 2018-01-27 12:47:17 minas-tirith kernel - - - [ 1128.808879] [drm] GPU HANG: ecode 4:0:0xfdefffff, in X [2154], reason: Hang on render ring, action: reset
<0.6> 2018-01-27 12:47:17 minas-tirith kernel - - - [ 1128.808883] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
<0.6> 2018-01-27 12:47:17 minas-tirith kernel - - - [ 1128.808884] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
<0.6> 2018-01-27 12:47:17 minas-tirith kernel - - - [ 1128.808884] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
<0.6> 2018-01-27 12:47:17 minas-tirith kernel - - - [ 1128.808885] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
<0.6> 2018-01-27 12:47:17 minas-tirith kernel - - - [ 1128.808885] [drm] GPU crash dump saved to /sys/class/drm/card0/error
<0.5> 2018-01-27 12:47:17 minas-tirith kernel - - - [ 1128.808914] drm/i915: Resetting chip after gpu hang
<0.5> 2018-01-27 12:47:26 minas-tirith kernel - - - [ 1137.820965] drm/i915: Resetting chip after gpu hang
<0.5> 2018-01-27 12:47:36 minas-tirith kernel - - - [ 1147.820140] drm/i915: Resetting chip after gpu hang

I mentioned this on the openSUSE mail list, and Dave Plater suggested nomodeset. This works, but the video mode changes to something like 800*600, which is pretty bad. He also suggested reopening this Bugzilla.

At that moment I had kernel 4.4.104-39, and drm-kmp-default 4.9.33_k4.4.79_4-5.2. I updated to his version, drm-kmp-default-4.9.33_k4.4.104_39-7.24.x86_64.rpm; this is more stable, but in the end the X environment froze: the mouse moves, but there is no response. I could ctrl-alt-f1.

I see several entries like this in the log (different PID); I don't know if they are related:
<3.6> 2018-01-27 19:58:34 minas-tirith console-kit-daemon 3128 - - (process:10750): GLib-CRITICAL **: g_slice_set_config: assertion 'sys_page_size == 0' failed

I hibernated the machine and went back home. Restored (not restarted) and I see this in the log:
<3.6> 2018-01-27 21:16:36 minas-tirith systemd-sleep 10886 - - System resumed.
<3.6> 2018-01-27 21:16:36 minas-tirith systemd-sleep 10886 - - INFO: running /usr/lib/systemd/system-sleep/grub2.sleep for hibernate
<3.6> 2018-01-27 21:16:36 minas-tirith systemd-sleep 10886 - - INFO: Running grub-once-restore ..
<3.6> 2018-01-27 21:16:36 minas-tirith systemd-sleep 10886 - - 2018-01-27 21:16:36+01:00 - Thawing the system now...
<3.4> 2018-01-27 21:16:36 minas-tirith systemd-sh - - - Thawing the system now...
<3.6> 2018-01-27 21:16:37 minas-tirith systemd 1 - - Stopped Deferred execution scheduler.
<3.6> 2018-01-27 21:16:37 minas-tirith systemd 1 - - Started Deferred execution scheduler.
<3.6> 2018-01-27 21:16:37 minas-tirith laptop-mode - - - Laptop mode
<3.6> 2018-01-27 21:16:37 minas-tirith laptop-mode - - - enabled, not active [unchanged]
<3.6> 2018-01-27 21:16:37 minas-tirith systemd-sleep 10886 - - INFO: Done.
<3.6> 2018-01-27 21:16:37 minas-tirith laptop-mode - - - Laptop mode
<3.6> 2018-01-27 21:16:37 minas-tirith laptop-mode - - - enabled, not active [unchanged]
<3.6> 2018-01-27 21:16:37 minas-tirith systemd-sleep 10886 - - tput: No value for $TERM and no -T specified
<0.6> 2018-01-27 21:16:48 minas-tirith kernel - - - [13685.816731] [drm] GPU HANG: ecode 4:0:0xfdeffdfb, in X [2171], reason: Hang on render ring, action: reset
<0.6> 2018-01-27 21:16:48 minas-tirith kernel - - - [13685.816736] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
<0.6> 2018-01-27 21:16:48 minas-tirith kernel - - - [13685.816736] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
<0.6> 2018-01-27 21:16:48 minas-tirith kernel - - - [13685.816737] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
<0.6> 2018-01-27 21:16:48 minas-tirith kernel - - - [13685.816737] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
<0.6> 2018-01-27 21:16:48 minas-tirith kernel - - - [13685.816738] [drm] GPU crash dump saved to /sys/class/drm/card0/error
<0.5> 2018-01-27 21:16:48 minas-tirith kernel - - - [13685.816792] drm/i915: Resetting chip after gpu hang
<0.5> 2018-01-27 21:17:00 minas-tirith kernel - - - [13697.816112] drm/i915: Resetting chip after gpu hang

I will attach gpu.2.log, the messages log since the machine upgrade, and hwinfo --cpu and --gfxcard output.

My desktop is XFCE and I have 4 GiB of RAM.

Created attachment 757808 [details]
CER: Messages log
Created attachment 757809 [details]
CER: gpu log
Created attachment 757810 [details]
CER: hwinfo output
On suggestion from Felix Miata I add inxi output:
minas-tirith:/home/cer/Bugzilla/Bug_1050256 - GPU hang # inxi -c0 -G
Graphics: Card: Intel Mobile 4 Series Integrated Graphics Controller
Display Server: X.org 1.18.3 drivers: intel (unloaded: modesetting,fbdev,vesa)
tty size: 150x51 Advanced Data: N/A for root
minas-tirith:/home/cer/Bugzilla/Bug_1050256 - GPU hang #
Carlos, this is a completely different GPU family than in the initial report (GM45 vs. Sandybridge). Please open a separate bug report.

Also, the first thing to try with ancient Intel GPUs is uninstalling the drm-kmp-default package.

Ok, will do. Thanks.

Done, Bug 1077885 - GPU hang (Intel Mobile 4 Series Integrated Graphics Controller)

SUSE-SU-2018:0509-1: An update that solves one vulnerability and has 8 fixes is now available.
Category: security (moderate)
Bug References: 1041744,1046821,1047277,1047729,1048155,1050256,1055493,1066175,1077885
CVE References: CVE-2017-10810
Sources used:
SUSE Linux Enterprise Workstation Extension 12-SP3 (src): drm-4.9.33-4.11.1
SUSE Linux Enterprise Desktop 12-SP3 (src): drm-4.9.33-4.11.1 |
In plasma5 the work space freezes for a period intermittently and journalctl has this output:
Jul 24 10:14:49 arbuthnot kernel: [drm] GPU HANG: ecode 6:0:0xbd73ffff, in plasmashell [3124], reason: Hang on render ring, action: reset
Jul 24 10:14:49 arbuthnot kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jul 24 10:14:49 arbuthnot kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jul 24 10:14:49 arbuthnot kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jul 24 10:14:49 arbuthnot kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jul 24 10:14:49 arbuthnot kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Jul 24 10:14:49 arbuthnot kernel: drm/i915: Resetting chip after gpu hang

Sometimes it can occur continuously, but ctrl-alt-f1 to a console and init 3 is still possible. This is after a zypper dup --no-allow-vendor-change from 42.2, where this problem didn't occur. Setting nomodeset at boot makes the problem go away.

My graphics hardware is:
08: PCI 02.0: 0300 VGA compatible controller (VGA)
  [Created at pci.378]
  Unique ID: _Znp.Ek_1fzLhuA5
  SysFS ID: /devices/pci0000:00/0000:00:02.0
  SysFS BusID: 0000:00:02.0
  Hardware Class: graphics card
  Model: "Intel 2nd Generation Core Processor Family Integrated Graphics Controller"
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x0102 "2nd Generation Core Processor Family Integrated Graphics Controller"
  SubVendor: pci 0x105b "Foxconn International, Inc."
  SubDevice: pci 0x0d8d
  Revision: 0x09
  Memory Range: 0xf7800000-0xf7bfffff (rw,non-prefetchable)
  Memory Range: 0xe0000000-0xefffffff (ro,non-prefetchable)
  I/O Ports: 0xf000-0xf03f (rw)
  IRQ: 11 (no events)
  I/O Ports: 0x3c0-0x3df (rw)
  Module Alias: "pci:v00008086d00000102sv0000105Bsd00000D8Dbc03sc00i00"
  Driver Info #0:
    Driver Status: i915 is active
    Driver Activation Cmd: "modprobe i915"
  Config Status: cfg=no, avail=yes, need=no, active=unknown

Primary display adapter: #8

My cpu is an "Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz"
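
When comparing hangs across machines it can help to reduce each "GPU HANG" message to its ecode, process name and PID. The sketch below does that with sed; it assumes the message layout shown in the journal output of this report, and the function name `gpu_hang_summary` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print one summary
# line per "GPU HANG" message in the form "<ecode> <process> <pid>".
# Lines that do not match the expected layout are silently skipped.
gpu_hang_summary() {
    sed -n 's/.*GPU HANG: ecode \([^,]*\), in \(.*\) \[\([0-9]*\)\], reason: .*/\1 \2 \3/p'
}
```

It could be fed from the kernel messages in the journal, e.g. `journalctl -k | gpu_hang_summary`. For the hang reported above it would print `6:0:0xbd73ffff plasmashell 3124`.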