Bug 1050256

Summary: GPU hang
Product: [openSUSE] openSUSE Distribution Reporter: Dave Plater <davejplater>
Component: Kernel Assignee: E-mail List <kernel-maintainers>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: astieger, carlos.e.r, davejplater, linuxkamarada, lv, martin.schlander, mmarek, nwr10cst-oslnx, opensuse, patrik.jakobsson, psychonaut, tiwai, wbauer
Version: Leap 42.3   
Target Milestone: ---   
Hardware: x86-64   
OS: SUSE Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: gpu crash dump
dmesg log gpu hang
CER: Messages log
CER: gpu log
CER: hwinfo output

Description Dave Plater 2017-07-24 15:32:27 UTC
In plasma5 the work space freezes for a period intermittently and journalctl has this output:
Jul 24 10:14:49 arbuthnot kernel: [drm] GPU HANG: ecode 6:0:0xbd73ffff, in plasmashell [3124], reason: Hang on render ring, action: reset
Jul 24 10:14:49 arbuthnot kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jul 24 10:14:49 arbuthnot kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jul 24 10:14:49 arbuthnot kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jul 24 10:14:49 arbuthnot kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jul 24 10:14:49 arbuthnot kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Jul 24 10:14:49 arbuthnot kernel: drm/i915: Resetting chip after gpu hang
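
For anyone gathering the same evidence, here is a minimal sketch, assuming a POSIX shell, of counting hang reports in a kernel log and copying the crash dump that the kernel message points to. The function names and the default output file name are hypothetical; only the path /sys/class/drm/card0/error comes from the log above, and reading it may require root.

```shell
#!/bin/sh
# Count i915 hang reports in a kernel log read from standard input.
# Prints 0 when the log contains none.
count_gpu_hangs() {
    grep -c 'GPU HANG:' || true
}

# Copy the GPU error state to a regular file so it can be attached to a bug.
# $1: source file (defaults to the path named in the kernel message)
# $2: destination file (hypothetical default name)
save_gpu_error_state() {
    src=${1:-/sys/class/drm/card0/error}
    dst=${2:-gpu-error-state.txt}
    [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }
    cat "$src" > "$dst"
}

# Possible use on an affected machine:
#   journalctl -k | count_gpu_hangs
#   save_gpu_error_state
```

The copy is done with cat rather than cp because the source is a sysfs file whose content is generated when it is read.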

Sometimes it occurs continuously, but switching to a console with ctrl-alt-f1 and running init 3 is still possible.
This is after a zypper dup --no-allow-vendor-change from 42.2 where this problem didn't occur.
Setting nomodeset at boot makes the problem go away.
My graphics hardware is:
08: PCI 02.0: 0300 VGA compatible controller (VGA)              
  [Created at pci.378]
  Unique ID: _Znp.Ek_1fzLhuA5
  SysFS ID: /devices/pci0000:00/0000:00:02.0
  SysFS BusID: 0000:00:02.0
  Hardware Class: graphics card
  Model: "Intel 2nd Generation Core Processor Family Integrated Graphics Controller"
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x0102 "2nd Generation Core Processor Family Integrated Graphics Controller"
  SubVendor: pci 0x105b "Foxconn International, Inc."
  SubDevice: pci 0x0d8d 
  Revision: 0x09
  Memory Range: 0xf7800000-0xf7bfffff (rw,non-prefetchable)
  Memory Range: 0xe0000000-0xefffffff (ro,non-prefetchable)
  I/O Ports: 0xf000-0xf03f (rw)
  IRQ: 11 (no events)
  I/O Ports: 0x3c0-0x3df (rw)
  Module Alias: "pci:v00008086d00000102sv0000105Bsd00000D8Dbc03sc00i00"
  Driver Info #0:
    Driver Status: i915 is active
    Driver Activation Cmd: "modprobe i915"
  Config Status: cfg=no, avail=yes, need=no, active=unknown

Primary display adapter: #8

My cpu is an "Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz"
Comment 1 Wolfgang Bauer 2017-07-25 07:13:25 UTC
(In reply to Dave Plater from comment #0)
> In plasma5 the work space freezes for a period intermittently and journalctl
> has this output:
> Jul 24 10:14:49 arbuthnot kernel: [drm] GPU HANG: ecode 6:0:0xbd73ffff, in
> plasmashell [3124], reason: Hang on render ring, action: reset
> Jul 24 10:14:49 arbuthnot kernel: [drm] GPU hangs can indicate a bug
> anywhere in the entire gfx stack, including userspace.
> Jul 24 10:14:49 arbuthnot kernel: [drm] Please file a _new_ bug report on
> bugs.freedesktop.org against DRI -> DRM/Intel
> Jul 24 10:14:49 arbuthnot kernel: [drm] drm/i915 developers can then
> reassign to the right component if it's not a kernel issue.
> Jul 24 10:14:49 arbuthnot kernel: [drm] The gpu crash dump is required to
> analyze gpu hangs, so please always attach it.
> Jul 24 10:14:49 arbuthnot kernel: [drm] GPU crash dump saved to
> /sys/class/drm/card0/error
> Jul 24 10:14:49 arbuthnot kernel: drm/i915: Resetting chip after gpu hang

This rather looks like a kernel/DRM or intel graphics driver problem to me...

Reassigning to the Kernel for now.
Comment 2 Dave Plater 2017-07-25 07:21:12 UTC
As suggested on the factory ml, I've removed drm-kmp-default and the nomodeset option, then rebooted. So far I haven't had a gpu hang.
Comment 3 Dave Plater 2017-07-25 08:43:23 UTC
I think I can safely state that removing drm-kmp-default definitely fixes the gpu hangs. It shouldn't be pulled in for affected GPUs.
Comment 4 Takashi Iwai 2017-07-25 09:00:20 UTC
OK, could you check what is the typical way to reproduce the bug?
I'll try to test a SandyBridge machine here, but I'd like to know the procedure.
Comment 5 Dave Plater 2017-07-25 10:48:34 UTC
To reproduce: install drm-kmp-default in a 42.3 installation and boot into plasma5. I normally have a saved session with firefox and konsole, both with multiple tabs, and one kwrite instance. I have four desktops, with firefox on 1 and konsole/kwrite on 4. The hanging starts when I open thunderbird on desktop 2 and go back to firefox. I'm using the sddm display manager. I'll try xfce, which is my backup gui.
Comment 6 Takashi Iwai 2017-07-25 10:53:24 UTC
OK, thanks.  I think I can see the issue reliably on the local machine.
I just need to boot with smaller memory, e.g. mem=1G boot option, start KDE, then open Firefox.  That triggers the GPU hang immediately.
It implies an issue in the page handling.

BTW, are you using the intel X driver (i.e. is xf86-video-intel installed)?  On a freshly installed Leap 42.3 system, I didn't have it and the modesetting driver is used instead.

The problem happens no matter which X driver is used, so it doesn't matter much, but the devil is always in the details, hence I'd like to make sure.

Anyway, it'd be good to know whether this happens on XFCE, too (with or without compositor).
Comment 7 Dave Plater 2017-07-25 11:27:12 UTC
i+ | xf86-video-amdgpu  | package | 1.3.0-1.1                  | x86_64 | oss                  
i+ | xf86-video-amdgpu  | package | 1.3.0-1.1                  | x86_64 | Main Repository (OSS)
i+ | xf86-video-fbdev   | package | 0.4.4-9.4                  | x86_64 | oss                  
i+ | xf86-video-fbdev   | package | 0.4.4-9.4                  | x86_64 | Main Repository (OSS)
i+ | xf86-video-intel   | package | 2.99.917.770_gcb6ba2da-1.3 | x86_64 | oss                  
i+ | xf86-video-intel   | package | 2.99.917.770_gcb6ba2da-1.3 | x86_64 | Main Repository (OSS)
i+ | xf86-video-nouveau | package | 1.0.15-1.3                 | x86_64 | oss                  
i+ | xf86-video-nouveau | package | 1.0.15-1.3                 | x86_64 | Main Repository (OSS)
i+ | xf86-video-vesa    | package | 2.3.4-9.4                  | x86_64 | oss                  
i+ | xf86-video-vesa    | package | 2.3.4-9.4                  | x86_64 | Main Repository (OSS)
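
Each package shows up twice above because it is listed once per repository. A minimal sketch, assuming a POSIX shell with awk, of reducing a pipe-separated listing like this one to unique name and version pairs; the function name is hypothetical, and the guess that the listing comes from a `zypper search -s` style query is an assumption.

```shell
#!/bin/sh
# Read a pipe-separated package listing on standard input and print
# "name version" once per installed xf86-video-* package.
installed_x_drivers() {
    awk -F'|' '$1 ~ /^i/ && $2 ~ /xf86-video-/ {
        gsub(/[ \t]/, "", $2)   # strip column padding from the name
        gsub(/[ \t]/, "", $4)   # strip column padding from the version
        print $2, $4
    }' | sort -u
}

# Possible use (assumed command; adjust to however the listing was produced):
#   zypper search -s xf86-video | installed_x_drivers
```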
I just rebooted with drm-kmp-default reinstalled and nomodeset deleted. I set runlevel 3, tailed journalctl on tty2 and ran init 5 on tty1, and I haven't had a hang in plasma5 yet.
I've confirmed that "fbcon: inteldrmfb (fb0) is primary device" is in the journal; it only occurs when I boot without nomodeset and drm-kmp-default is installed. I'll try a reboot straight into runlevel 5.
Comment 8 Dave Plater 2017-07-25 12:03:56 UTC
It happened while I was in firefox and thunderbird's new mail notification popped up. I'm not confident that I can reproduce it in xfce because I don't think that its load will be enough. Now that I'm trying to reproduce it, I even tried a full screen video; it's hard to reproduce.
Comment 9 Takashi Iwai 2017-07-25 12:06:34 UTC
Could you try to test XFCE with the smaller memory size as I did?
In my case with KDE, mem=1G sufficed to trigger the problem quickly.
You can try a slightly smaller value, too.
Comment 10 Dave Plater 2017-07-25 12:22:08 UTC
(In reply to Takashi Iwai from comment #9)
> Could you try to test XFCE with the smaller memory size as I did?
> In my case with KDE, mem=1G sufficed to trigger the problem quickly.
> You can try a slightly smaller value, too.

You mean video memory? The lowest I can go is 32M framebuffer and 128M graphics. I'm using xfce now with those minimums.
Comment 11 Takashi Iwai 2017-07-25 12:30:55 UTC
No, I meant the whole RAM size.  You can limit the size by passing the mem=XXX boot option, where XXX is the size (e.g. 1G, 512M, etc).

The i915 driver problem is tied to the RAM size.  When user space uses more memory and free pages become scarce, the system tries to swap out, and the i915 driver tries to shrink its page lists.  The problem seems to happen during that.
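
To confirm that a test boot really ran with the reduced RAM size, the mem= value can be read back from the kernel command line. A minimal sketch, assuming a POSIX shell; the function name is hypothetical, and /proc/cmdline is used only as the default input.

```shell
#!/bin/sh
# Print the value of the mem= boot option, if present.
# $1: kernel command line (defaults to that of the running kernel)
cmdline_mem_limit() {
    cmdline=${1:-$(cat /proc/cmdline)}
    for word in $cmdline; do
        case $word in
            mem=*) echo "${word#mem=}" ;;
        esac
    done
}

# Possible use after booting with mem=1G:
#   cmdline_mem_limit
```

An empty result means no limit was passed and the machine is using all of its RAM.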
Comment 12 Dave Plater 2017-07-25 13:24:02 UTC
plasma5 hung immediately with mem=1G. After pressing ctrl-backspace twice I logged into xfce. Apart from it being very slow switching applications/desktops (I had to close kicad), I haven't had a gpu hang yet, even with the thunderbird pop up. I'm going back to my normal 4G; I've got work to do.
Comment 13 Dave Plater 2017-07-25 13:40:55 UTC
Created attachment 733736 [details]
gpu crash dump

Crash dump from the last hang with mem=1G
Comment 14 Takashi Iwai 2017-07-25 15:00:00 UTC
This looks like a regression caused by the recent PM fix.  I found a paper-over patch in the recent upstream, so I tried to backport it, and it seems to work.

The test drm-kmp package is being built in OBS home:tiwai:branches:openSUSE:Leap:42.3:Update/drm repo.

Retrieve the rpm via osc,
  osc getbinaries home:tiwai:branches:openSUSE:Leap:42.3:Update/drm/standard/x86_64

Could you test this kmp?
Comment 15 Takashi Iwai 2017-07-25 15:03:26 UTC
(In reply to Takashi Iwai from comment #14)
> Retrieve the rpm via osc,
>   osc getbinaries
> home:tiwai:branches:openSUSE:Leap:42.3:Update/drm/standard/x86_64

Now finally the package was published, too:
  http://download.opensuse.org/repositories/home:/tiwai:/branches:/openSUSE:/Leap:/42.3:/Update/standard/
Comment 16 Dave Plater 2017-07-26 09:11:56 UTC
Installed via osc getbinaries and then booted into plasma5 with mem=1G and no gpu hangs even with the thunderbird pop up. Looks like you fixed the bug. I'm now on normal 4G memory.
Comment 17 Takashi Iwai 2017-07-26 09:40:42 UTC
Thanks.  I submitted the fix now.
Comment 18 Bernhard Wiedemann 2017-07-26 10:00:38 UTC
This is an autogenerated message for OBS integration:
This bug (1050256) was mentioned in
https://build.opensuse.org/request/show/512623 42.3 / drm
Comment 20 Andreas Stieger 2017-07-26 16:40:32 UTC
Update does not install. The package provides multiversion(kernel), but the two versions conflict in /lib/modules/4.4.76-1-default/updates/drivers/gpu/drm/i915/i915.ko and we are not bumping the kernel version at the same time. 

Rejecting for Leap 42.3 maintenance as is.
Comment 21 Takashi Iwai 2017-07-26 18:47:59 UTC
(In reply to Andreas Stieger from comment #20)
> Update does not install. The package provides multiversion(kernel), but the
> two versions conflict in
> /lib/modules/4.4.76-1-default/updates/drivers/gpu/drm/i915/i915.ko and we
> are not bumping the kernel version at the same time. 
> 
> Rejecting for Leap 42.3 maintenance as is.

Gah, we're still having that issue.

So we need to fix the kernel package first, check it in, then rebuild the KMP based on it.

Michal, what is your take?
Comment 22 Michal Marek 2017-07-27 12:18:10 UTC
You can also submit the first update kernel and the drm fix together.
Comment 23 Takashi Iwai 2017-07-27 12:22:13 UTC
The drm update was already submitted :)
So the missing piece is the fix on the kernel side (and the rebuild of the KMP with it).
Comment 24 Projeto Linux Kamarada 2017-08-03 15:42:16 UTC
That bug also affected me, as I reported on the mailing list:

https://lists.opensuse.org/opensuse-factory/2017-07/msg00725.html

Removing both the xf86-video-intel and drm-kmp-default packages solved that problem for me. First I removed xf86-video-intel, which didn't solve the problem, then I removed drm-kmp-default and crashes stopped. My system has already 2 days uptime without crashes.
Comment 25 Andreas Stieger 2017-08-03 20:34:21 UTC
We are preparing a maintenance update for the 42.3 kernel. Is the submitted update ready to be re-built against it?
Comment 26 Takashi Iwai 2017-08-04 05:44:49 UTC
Yes, it builds fine against the new kernel.  Basically we keep kABI, so the KMP should always be buildable against newer kernels.
Comment 27 Lubomir Vrana 2017-08-08 11:18:15 UTC
Created attachment 735685 [details]
dmesg log gpu hang

Gpu hang again :(

> uname -a
Linux linux-0mvy.suse 4.4.79-4-default #1 SMP Thu Aug 3 14:49:17 UTC 2017 (4dc78e3) x86_64 x86_64 x86_64 GNU/Linux
Comment 28 Andreas Stieger 2017-08-09 18:07:50 UTC
For openSUSE Leap 42.3, test update packages built against the current kernel can be found here:

http://download.opensuse.org/repositories/openSUSE:/Maintenance:/7039/openSUSE_Leap_42.3_Update/
http://download.opensuse.org/update/leap/42.3-test/
Comment 29 Dave Plater 2017-08-10 05:00:17 UTC
It installs easily with the 4.4.79 kernel already installed.
Comment 30 Lubomir Vrana 2017-08-16 19:17:43 UTC
It looks good now
# uname -a
Linux linux-0mvy.suse 4.4.79-19-default #1 SMP Thu Aug 10 20:28:47 UTC 2017 (2dd03e8) x86_64 x86_64 x86_64 GNU/Linux
Comment 31 Swamp Workflow Management 2017-08-16 22:13:19 UTC
openSUSE-RU-2017:2194-1: An update that has two recommended fixes can now be installed.

Category: recommended (important)
Bug References: 1048155,1050256
CVE References: 
Sources used:
openSUSE Leap 42.3 (src):    drm-4.9.33-5.2
Comment 32 Carlos Robinson 2018-01-27 20:58:28 UTC
I have this issue after upgrading my laptop to 42.3 from 42.2, using the offline or DVD upgrade method.

CPU:
  Model: 6.23.10 "Pentium(R) Dual-Core CPU       T4300  @ 2.10GHz"
Video:
  Model: "Intel Mobile 4 Series Chipset Integrated Graphics Controller"
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x2a42 "Mobile 4 Series Chipset Integrated Graphics Controller"
  SubVendor: pci 0x103c "Hewlett-Packard Company"
  SubDevice: pci 0x3069 
  Revision: 0x07
  Driver: "i915"
  Driver Modules: "i915"

(hwinfo output will be attached)

Crash log:

<3.6> 2018-01-27 12:47:05 minas-tirith systemd 1 - -  Started Postfix Mail Transport Agent.
<0.6> 2018-01-27 12:47:17 minas-tirith kernel - - - [ 1128.808879] [drm] GPU HANG: ecode 4:0:0xfdefffff, in X [2154], reason: Hang on render ring, action: reset
<0.6> 2018-01-27 12:47:17 minas-tirith kernel - - - [ 1128.808883] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
<0.6> 2018-01-27 12:47:17 minas-tirith kernel - - - [ 1128.808884] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
<0.6> 2018-01-27 12:47:17 minas-tirith kernel - - - [ 1128.808884] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
<0.6> 2018-01-27 12:47:17 minas-tirith kernel - - - [ 1128.808885] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
<0.6> 2018-01-27 12:47:17 minas-tirith kernel - - - [ 1128.808885] [drm] GPU crash dump saved to /sys/class/drm/card0/error
<0.5> 2018-01-27 12:47:17 minas-tirith kernel - - - [ 1128.808914] drm/i915: Resetting chip after gpu hang
<0.5> 2018-01-27 12:47:26 minas-tirith kernel - - - [ 1137.820965] drm/i915: Resetting chip after gpu hang
<0.5> 2018-01-27 12:47:36 minas-tirith kernel - - - [ 1147.820140] drm/i915: Resetting chip after gpu hang


I mentioned this on the openSUSE mailing list, and Dave Plater suggested nomodeset. This works, but the video mode changes to something like 800*600, which is pretty bad. He also suggested reopening this bug.

At that moment I had kernel 4.4.104-39, and drm-kmp-default 4.9.33_k4.4.79_4-5.2. I updated to his version, drm-kmp-default-4.9.33_k4.4.104_39-7.24.x86_64.rpm; this is more stable, but in the end the X environment froze: mouse moves, but no response. I could ctrl-alt-f1. 
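
The drm-kmp-default version string carries the kernel it was built against after the "_k" marker. A minimal sketch, assuming a POSIX shell, of extracting that part and comparing it with a kernel release string. The function names are hypothetical, the mapping of "_" back to "-" is an assumption about the naming convention, and the release string 4.4.104-39-default in the usage comment is assumed from the kernel version quoted above. A mismatch is not necessarily an error, since comment 26 says kABI is kept; it only shows which kernel the package was built for.

```shell
#!/bin/sh
# Print the kernel version encoded in a KMP version or file name,
# e.g. 4.9.33_k4.4.79_4-5.2 -> 4.4.79-4
kmp_kernel_version() {
    printf '%s\n' "$1" |
        sed -n 's/^[^_]*_k\([0-9][0-9._]*\)-.*$/\1/p' |
        tr '_' '-'
}

# Succeed if the KMP in $1 was built against the kernel release in $2
# (a release string in the form printed by uname -r).
kmp_matches_kernel() {
    built_for=$(kmp_kernel_version "$1")
    case $2 in
        "$built_for"|"$built_for"-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use:
#   kmp_matches_kernel 4.9.33_k4.4.79_4-5.2 "$(uname -r)" || echo "built for another kernel"
```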


I see in the log several entries like this (different PID), don't know if related:

<3.6> 2018-01-27 19:58:34 minas-tirith console-kit-daemon 3128 - -  (process:10750): GLib-CRITICAL **: g_slice_set_config: assertion 'sys_page_size == 0' failed


I hibernated the machine and went back home. Restored (not restarted) and I see this in the log:


<3.6> 2018-01-27 21:16:36 minas-tirith systemd-sleep 10886 - -  System resumed.
<3.6> 2018-01-27 21:16:36 minas-tirith systemd-sleep 10886 - -  INFO: running /usr/lib/systemd/system-sleep/grub2.sleep for hibernate
<3.6> 2018-01-27 21:16:36 minas-tirith systemd-sleep 10886 - -  INFO: Running grub-once-restore ..
<3.6> 2018-01-27 21:16:36 minas-tirith systemd-sleep 10886 - -  2018-01-27 21:16:36+01:00 - Thawing the system now...
<3.4> 2018-01-27 21:16:36 minas-tirith systemd-sh - - -  Thawing the system now...
<3.6> 2018-01-27 21:16:37 minas-tirith systemd 1 - -  Stopped Deferred execution scheduler.
<3.6> 2018-01-27 21:16:37 minas-tirith systemd 1 - -  Started Deferred execution scheduler.
<3.6> 2018-01-27 21:16:37 minas-tirith laptop-mode - - -  Laptop mode
<3.6> 2018-01-27 21:16:37 minas-tirith laptop-mode - - -  enabled, not active [unchanged]
<3.6> 2018-01-27 21:16:37 minas-tirith systemd-sleep 10886 - -  INFO: Done.
<3.6> 2018-01-27 21:16:37 minas-tirith laptop-mode - - -  Laptop mode
<3.6> 2018-01-27 21:16:37 minas-tirith laptop-mode - - -  enabled, not active [unchanged]
<3.6> 2018-01-27 21:16:37 minas-tirith systemd-sleep 10886 - -  tput: No value for $TERM and no -T specified
<0.6> 2018-01-27 21:16:48 minas-tirith kernel - - - [13685.816731] [drm] GPU HANG: ecode 4:0:0xfdeffdfb, in X [2171], reason: Hang on render ring, action: reset
<0.6> 2018-01-27 21:16:48 minas-tirith kernel - - - [13685.816736] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
<0.6> 2018-01-27 21:16:48 minas-tirith kernel - - - [13685.816736] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
<0.6> 2018-01-27 21:16:48 minas-tirith kernel - - - [13685.816737] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
<0.6> 2018-01-27 21:16:48 minas-tirith kernel - - - [13685.816737] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
<0.6> 2018-01-27 21:16:48 minas-tirith kernel - - - [13685.816738] [drm] GPU crash dump saved to /sys/class/drm/card0/error
<0.5> 2018-01-27 21:16:48 minas-tirith kernel - - - [13685.816792] drm/i915: Resetting chip after gpu hang
<0.5> 2018-01-27 21:17:00 minas-tirith kernel - - - [13697.816112] drm/i915: Resetting chip after gpu hang


I will attach gpu.2.log, and messages log since machine upgrade, and hwinfo --cpu and --gfxcard


My desktop is XFCE and I have 4 GiB of RAM.
Comment 33 Carlos Robinson 2018-01-27 21:00:30 UTC
Created attachment 757808 [details]
CER: Messages log
Comment 34 Carlos Robinson 2018-01-27 21:01:17 UTC
Created attachment 757809 [details]
CER: gpu log
Comment 35 Carlos Robinson 2018-01-27 21:02:04 UTC
Created attachment 757810 [details]
CER: hwinfo output
Comment 36 Carlos Robinson 2018-01-27 21:09:47 UTC
On suggestion from Felix Miata I add inxi output:

minas-tirith:/home/cer/Bugzilla/Bug_1050256 - GPU hang # inxi -c0 -G
Graphics:  Card: Intel Mobile 4 Series Integrated Graphics Controller
           Display Server: X.org 1.18.3 drivers: intel (unloaded: modesetting,fbdev,vesa)
           tty size: 150x51 Advanced Data: N/A for root
minas-tirith:/home/cer/Bugzilla/Bug_1050256 - GPU hang #
Comment 37 Stefan Dirsch 2018-01-28 15:00:20 UTC
Carlos, this is a completely different GPU family than in the initial report (GM45 vs. Sandybridge). Please open a separate bug report. Also, the first thing to try with ancient Intel GPUs is uninstalling the drm-kmp-default package.
Comment 38 Carlos Robinson 2018-01-28 21:59:43 UTC
Ok, will do. 
Thanks.
Comment 39 Carlos Robinson 2018-01-28 22:10:14 UTC
Done, 
Bug 1077885 - GPU hang (Intel Mobile 4 Series Integrated Graphics Controller)
Comment 40 Swamp Workflow Management 2018-02-21 17:16:59 UTC
SUSE-SU-2018:0509-1: An update that solves one vulnerability and has 8 fixes is now available.

Category: security (moderate)
Bug References: 1041744,1046821,1047277,1047729,1048155,1050256,1055493,1066175,1077885
CVE References: CVE-2017-10810
Sources used:
SUSE Linux Enterprise Workstation Extension 12-SP3 (src):    drm-4.9.33-4.11.1
SUSE Linux Enterprise Desktop 12-SP3 (src):    drm-4.9.33-4.11.1