Bug 694018

Summary: intel [Sandybridge] graphics locks hard, reports hangcheck timer elapsed
Product: [openSUSE] openSUSE 11.4
Reporter: Daniel Morris <danielm>
Component: Kernel
Assignee: Egbert Eich <eich>
Status: RESOLVED WONTFIX
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
CC: bruno, forgotten_XhsrPdJAcI, jeffm, p.heinlein, philsinger, pmarques, tilo
Version: Final
Target Milestone: ---
Hardware: x86-64
OS: openSUSE 11.4
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Daniel Morris 2011-05-16 12:13:10 UTC
User-Agent:       Mozilla/5.0 (compatible; Konqueror/4.6; Linux) KHTML/4.6.0 (like Gecko) SUSE

I've just had three hard lock-ups in the space of a few hours on a new machine: an Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz using the on-chip graphics.

I'm using kernel-desktop-2.6.37.6-0.5.1.x86_64 & xorg-x11-driver-video-7.6-53.56.1.x86_64.

The problem triggered each time I closed mutt inside an Xterm, with the Xterm located just above the notifier area of the task bar with KDE4 (not sure if this is relevant).

I could still ssh into the machine, and processes seemed to still be running, but the screen no longer updated or responded (I first thought it was time to change the batteries again in the mouse & keyboard, but I now write the date on the last of the alkaline batteries and am switching to rechargeables with a paper log of charge/install date!).

I tried bringing the machine down to runlevel three and back up to five, but that didn't work. These were the only relevant messages in /var/log/messages, and I notice that the first time it was reported twice in succession:-

May 16 10:22:20 chunk kernel: [417202.246095] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ.
May 16 10:22:34 chunk kernel: [417215.750281] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ.

May 16 11:30:35 chunk kernel: [ 3333.709458] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ.

May 16 12:00:36 chunk kernel: [ 5134.409081] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ.

Nothing was recorded in the Xorg log at the time (Xorg.0.log.old was last written at 11:53).
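
For anyone triaging a similar log, here is a minimal Python sketch (not part of the original report) that pulls the hangcheck entries out of syslog-style lines and groups them by boot, using the kernel uptime stamp in square brackets. The regular expression is an assumption based only on the four lines quoted above, and `group_by_boot` and `SAMPLE` are illustrative names.

```python
import re

# Pattern assumed from the four sample lines quoted in this report.
HANGCHECK = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] \[drm:i915_hangcheck_elapsed\] \*ERROR\*"
)


def group_by_boot(lines):
    """Return a list of boots, each a list of (wall-clock stamp, uptime in seconds)."""
    boots = []
    last_uptime = None
    for line in lines:
        m = HANGCHECK.match(line)
        if not m:
            continue
        uptime = float(m.group("uptime"))
        # The kernel uptime stamp only goes backwards across a reboot.
        if last_uptime is None or uptime < last_uptime:
            boots.append([])
        boots[-1].append((m.group("stamp"), uptime))
        last_uptime = uptime
    return boots


# The four messages from /var/log/messages quoted above.
SAMPLE = [
    "May 16 10:22:20 chunk kernel: [417202.246095] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ.",
    "May 16 10:22:34 chunk kernel: [417215.750281] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ.",
    "May 16 11:30:35 chunk kernel: [ 3333.709458] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ.",
    "May 16 12:00:36 chunk kernel: [ 5134.409081] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ.",
]

if __name__ == "__main__":
    for n, events in enumerate(group_by_boot(SAMPLE), 1):
        print(f"boot {n}: {len(events)} hangcheck event(s)")
        for stamp, uptime in events:
            print(f"  {stamp}  uptime {uptime:.0f}s")
```

On the four quoted lines this gives two groups of two events each. The uptime stamp drops between the 10:22 and 11:30 entries, which suggests a reboot in between, while the 11:30 and 12:00 entries appear to belong to the same boot.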

I've searched around and can see various similar reports against later kernels on other distros, and references to userspace triggers, but I couldn't find a conclusive reason.

Reproducible: Didn't try

Comment 1 Daniel Morris 2011-05-16 13:03:45 UTC
Lockup #4, this time with my xterm on the left-hand side of the screen, well away from the notifier and taskbar. I was just closing the body of a mail in my 80x35 window to go back to the pager view.

Also, this time it recovered after a couple of minutes, as I'd ssh'd in and was poking around the log files. I happened to jiggle the mouse and must have pushed it to the top left-hand corner, and it brought up tiles of all the active applications. At the same time it also dropped the following into Xorg.0.log (definitely not there seconds before):-

[  6170.029] [mi] EQ overflowing. The server is probably stuck in an infinite loop.
[  6170.029]
Backtrace:
[  6170.029] 0: /usr/bin/Xorg (xorg_backtrace+0x28) [0x463678]
[  6170.029] 1: /usr/bin/Xorg (mieqEnqueue+0x1f4) [0x45e614]
[  6170.029] 2: /usr/bin/Xorg (xf86PostMotionEventP+0xc4) [0x4762a4]
[  6170.029] 3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7fbf5635e000+0x4e05) [0x7fbf56362e05]
[  6170.029] 4: /usr/bin/Xorg (0x400000+0x72d77) [0x472d77]
[  6170.029] 5: /usr/bin/Xorg (0x400000+0x1178e3) [0x5178e3]
[  6170.030] 6: /lib64/libc.so.6 (0x7fbf5a04d000+0x32b30) [0x7fbf5a07fb30]
[  6170.030] 7: /lib64/libc.so.6 (ioctl+0x7) [0x7fbf5a118ce7]
[  6170.030] 8: /usr/lib64/libdrm.so.2 (drmIoctl+0x28) [0x7fbf588c8918]
[  6170.030] 9: /usr/lib64/libdrm_intel.so.1 (drm_intel_gem_bo_map_gtt+0x7e) [0x7fbf5826895e]
[  6170.030] 10: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fbf5846e000+0x108e8) [0x7fbf5847e8e8]
[  6170.030] 11: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fbf5846e000+0x31bb6) [0x7fbf5849fbb6]
[  6170.030] 12: /usr/bin/Xorg (0x400000+0xdc9fb) [0x4dc9fb]
[  6170.030] 13: /usr/bin/Xorg (0x400000+0xe0818) [0x4e0818]
[  6170.030] 14: /usr/bin/Xorg (doImageText+0x223) [0x4313f3]
[  6170.030] 15: /usr/bin/Xorg (ImageText+0x5f) [0x43294f]
[  6170.030] 16: /usr/bin/Xorg (0x400000+0x2cbf0) [0x42cbf0]
[  6170.030] 17: /usr/bin/Xorg (0x400000+0x2f6b1) [0x42f6b1]
[  6170.030] 18: /usr/bin/Xorg (0x400000+0x25ace) [0x425ace]
[  6170.030] 19: /lib64/libc.so.6 (__libc_start_main+0xfd) [0x7fbf5a06bbfd]
[  6170.030] 20: /usr/bin/Xorg (0x400000+0x25679) [0x425679]

Not sure if that helps?
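
As a reading aid (not part of the original comment), the following Python sketch reduces backtrace lines like the ones above to frame number, object file and symbol, which makes it easier to see that frames 7 to 9 pass through `ioctl`, `drmIoctl` and `drm_intel_gem_bo_map_gtt`, so the server appears to have been waiting inside a DRM ioctl when the event queue overflowed. The frame format is assumed from this one excerpt, and `parse_frames` and `SAMPLE` are illustrative names.

```python
import os
import re

# Frame format assumed from the Xorg.0.log excerpt in this comment:
# "[  time] N: /path/to/object (symbol+0xoffset) [0xaddress]"
FRAME = re.compile(
    r"^\[\s*[\d.]+\]\s+(?P<index>\d+): (?P<path>\S+) "
    r"\((?P<where>[^)]*)\) \[0x[0-9a-f]+\]"
)


def parse_frames(lines):
    """Return (frame number, object file name, symbol or None) for each frame line."""
    frames = []
    for line in lines:
        m = FRAME.match(line)
        if not m:
            continue
        where = m.group("where")
        # Unresolved frames look like "0x400000+0x72d77"; resolved ones like "ioctl+0x7".
        symbol = None if where.startswith("0x") else where.split("+")[0]
        frames.append((int(m.group("index")), os.path.basename(m.group("path")), symbol))
    return frames


# Frames 7 to 10, copied from the backtrace above.
SAMPLE = [
    "[  6170.030] 7: /lib64/libc.so.6 (ioctl+0x7) [0x7fbf5a118ce7]",
    "[  6170.030] 8: /usr/lib64/libdrm.so.2 (drmIoctl+0x28) [0x7fbf588c8918]",
    "[  6170.030] 9: /usr/lib64/libdrm_intel.so.1 (drm_intel_gem_bo_map_gtt+0x7e) [0x7fbf5826895e]",
    "[  6170.030] 10: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7fbf5846e000+0x108e8) [0x7fbf5847e8e8]",
]

if __name__ == "__main__":
    for index, obj, symbol in parse_frames(SAMPLE):
        print(f"{index:>2}  {obj:<20} {symbol or '(unresolved)'}")
```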
Comment 2 Daniel Morris 2011-05-18 09:29:45 UTC
I've had at least half a dozen of these since. mutt in an xterm seems to be a trigger, but it could just be I'm spending most of my time sorting email at the moment :-(

I've corrected the component to X. I find it a little ironic that I intentionally sourced an Intel-based system with on-chip graphics to avoid my previous misery with ATI proprietary drivers.
Comment 3 Stefan Dirsch 2011-05-18 10:04:36 UTC
Still a kernel issue, I believe. See your initial comment:
Comment 4 Daniel Morris 2011-05-18 15:56:55 UTC
I also tried the 2.6.37.6-0.5-default kernel today and got the same problem. 

However, I did a kill -9 on the /usr/bin/Xorg process and to my surprise kdm restarted; so whilst this bug remains exceptionally annoying at least it means I don't have to reboot the machine every time.
Comment 5 Daniel Morris 2011-05-25 08:22:22 UTC
This sounds silly, but using konsole instead of xterm to run mutt seems to avoid the problem. 

I created an 80x35 yellow text on black background profile for konsole, so it even mimics my mutt-in-xterm setup, and I haven't seen it lock for five days (whereas I got three crashes in 45 mins before). This has been with the default kernel; I should probably reboot to the desktop one, as that seemed even more prone to locking.

My de facto terminal setup was:-

'xterm -geometry 80x35 -fg yellow -bg black -fn -b\&h-lucidatypewriter-medium-r-normal-sans-17-120-100-100-m-100-iso10646-1 -title "`uname -n` Yellow & Black xterm" -s -j -sb -sl 4000 -rw &'
Comment 6 Forgotten User XhsrPdJAcI 2011-05-25 10:40:06 UTC
I see the same issue with my i7-2600 processor. I mostly saw this error a short while after the (matrix) screen saver started.

As a workaround I edited the file /etc/X11/xorg.conf.d/50-device.conf, adding the line
"...
Driver "fbdev"
..."

to force loading of the standard framebuffer driver.

That seems to avoid this graphics hang, so I get the impression this is a graphics driver issue (http://intellinuxgraphics.org/).
Comment 7 Paulo Marques 2011-08-22 13:33:07 UTC
I just want to say that the problem with "[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ." also happens to me on an i7-2600K Sandybridge with integrated Intel graphics.

I use Compiz and this driver never worked very reliably. Sometimes, after using a 3D application (like Blender), I would just lose compositing or get strange artifacts.

However, only recently has it started hanging hard. Sometimes I can still switch to tty1 and do an init 3 -> init 5 sequence to get the machine back. Other times I have to ssh in from another computer to do that.

I think the start of the hangs coincided with the latest kernel update, but I can't be sure of that.

I've already compiled a vanilla 3.0.0 kernel for that machine, so I might try that to have an extra data point later, since with the current kernel I get at least one hang in a couple of hours working with the machine.

BTW, not working with the machine and just leaving the screensaver on is enough for it to hang, too.

I really think the severity of this bug is understated: for a regular user, this is a hard freeze of the machine, only solvable by a reset, losing all the current work.
Comment 8 Paulo Marques 2011-08-23 21:48:36 UTC
The kernel that I had already compiled was actually a 3.0.0-git20 (I think it was the latest available when I tried it before). Since it was simple to switch to that kernel I tried it anyway.

I've been running with that kernel since yesterday with no hangs (uptime is now almost 23 hours), so the problem is probably kernel related.
Comment 9 Bob Mueller 2011-08-25 16:21:01 UTC
I have had the same problem on openSUSE 11.4, kernel 2.6.37.6-0.7-desktop, since I ran the last fully automatic system update last week. My graphics card is an Intel / Nvidia Optimus combination. I use the default Intel driver. I don't think it has anything to do with the matrix screensaver (I have it too), because the system sometimes hangs directly after I log in. Your workaround:

===========
As workaround I edited the file /etc/X11/xorg.conf.d/50-device.conf with adding
the line
"...
Driver "fbdev"
..."
===========

works for me, but it disables all display effects. OK, I can live with that, but I hope for a fix soon.
Comment 10 Peer Heinlein 2011-10-15 08:39:10 UTC
Looks like this is the same problem I have on my ThinkPad 420.

There's a complete system freeze several times per day running an up-to-date openSUSE 11.4 on it.

I don't have real logfiles from this event, because the system hangs completely (even CapsLock doesn't work any more).

But I also noticed stuff like this in my logfiles:

Oct  7 22:28:35 flash kernel: [  343.419906] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ.

I haven't tried the fbdev workaround. Maybe this is related to Bug #686447?
Comment 11 Paulo Marques 2011-10-19 16:32:10 UTC
I can just add that I'm still running with that same 3.0.0-git20 kernel with no problems, and I'm using the default driver (not fbdev).
Comment 12 Phillip Singer 2011-10-30 21:08:46 UTC
I am one of those 'regular users' mentioned above (trying to avoid dual booting into WinDoze on the box I built last month). Contents of /proc/version: 2.6.37.1-1.2-desktop (geeko@buildhost) (gcc version 4.5.1 20101208 [gcc-4_5-branch revision 167585] (SUSE Linux) ) #1 SMP PREEMPT 2011-02-21 10:34:10 +0100

I get a complete freeze once a night, with only the mouse remaining functional (and not always fully). I cannot launch KRunner. I was blaming it on KDE4, and was just browsing /var/log/messages when I came across this error and traced it back to this bug report.

The only drivers on the box are those installed by the openSUSE installer. CPU: i3-2100. Motherboard: DH67CLB3. I'm going to see if that fbdev driver is an option.
Comment 13 Bob Mueller 2011-11-13 15:47:41 UTC
I have updated my Gnome 2.0 desktop to Gnome 3.0.2 ... and the problem is solved. lol.
Comment 14 Paulo Marques 2011-11-14 12:50:04 UTC
But was the window manager the only thing updated? No kernel, no drm, etc.?
Comment 15 Bob Mueller 2011-11-14 13:05:44 UTC
(In reply to comment #14)
> But was the window manager the only thing updated? No kernel, no drm, etc.?

At that time, only the window manager was updated, via one-click install from here:

http://en.opensuse.org/openSUSE:GNOME_3.0

After the successful installation, several other package updates were made. I'm sure there was no kernel update at that time (I didn't need to reboot). The system ran for 3 days without any problem before I updated the kernel to the current version, 2.6.37.6-0.9-desktop.

The error message

*ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ.

is no longer present.
Comment 16 Jeff Mahoney 2012-08-02 16:00:34 UTC
With the coming release of openSUSE 12.2, openSUSE kernel developers are focusing their efforts there. Reports against openSUSE 11.4 and prior will not get the attention needed to resolve them before openSUSE 12.2 is released and openSUSE 11.4 becomes unmaintained.

Please re-test with openSUSE 12.1 or openSUSE RC2+ and re-open with an updated Product if you still encounter your issue.

We apologize for this issue not getting the attention it deserves but we are focusing our resources in the area where they will have the most impact for our users.  We're working hard to make openSUSE 12.2 the best openSUSE release yet!