Bug 388793

Summary: quake110.suse.de: soft system hang, within one day of stress
Product: [SUSE Linux Enterprise Real Time Extension] SUSE Linux Enterprise Real Time 10 SP2 (SLERT 10 SP2)
Reporter: Daniel Gollub <dgollub>
Component: kernel
Assignee: Erik Hamera <erik.hamera>
Status: RESOLVED WONTFIX
QA Contact: Erik Hamera <erik.hamera>
Severity: Major    
Priority: P5 - None    
Version: RC1   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: screenlog of quake110.suse.de - SysRq with nmi_watchdog=1
quake110.suse.de - beta7 serial console log of sysrq

Description Daniel Gollub 2008-05-09 15:01:42 UTC
Host: quake110.suse.de
Kernel: SLERT 10 SP2 RC1
Testrun: SLERT 10 SP2 - longterm #1

Full system hang: no SysRq possible via serial console, not reachable via
network; even the caps-lock LED didn't respond.

http://w3.suse.de/~dgollub/SLERT/SLERT-testresults/longterm1/


Last life sign on serial console:
"Fri May  9 01:25:23 EDT 2008 tick!"

Approx. runtime till hang: 20 hours

The long-term test runs the following simultaneously:
- slert-dbench-disk with 30 processes for 1 hour (in loop)
- slert-hackbench (in loop)
- ltp-realtime unit (in loop)
- cyclictest with SCHED_FIFO 99 (in loop - no memlock or affinity patches)

Rerunning the test to see how reproducible this is.
Comment 1 Daniel Gollub 2008-05-13 11:07:29 UTC
Created attachment 214707 [details]
screenlog of quake110.suse.de - SysRq with nmi_watchdog=1

Host: quake110.suse.de
Kernel: SLERT 10 SP2 RC 1 (mbuild, not from media)
Kernel Build: /mounts/users-space/dgollub/SLES-10-SP2-RT-KERNEL-ARCHIVE/SLERT_SP2_RC1/sle10-sp2-rt-x86_64/kernel-rt-2.6.22.19-0.8.x86_64.rpm

http://w3.suse.de/~dgollub/SLERT/SLERT-testresults/longterm1/

Approx. runtime till soft hang with nmi_watchdog=1: 3 hours
SysRq got triggered approx. 40 hours later.

Serial console log of SysRq of showTasks, showBlockedTasks and others attached. Crashdump is also available:
/mounts/users-space/dgollub/crashdump/SLERT/10/SP2/RC1-mbuild/quake110.suse.de/bnc399793#1/2008-05-13-04:51/vmcore
Comment 2 Daniel Gollub 2008-05-14 08:22:54 UTC
Created attachment 215022 [details]
quake110.suse.de - beta7 serial console log of sysrq

Host: quake110.suse.de
Kernel: Beta7 (mbuild, not from media)
Kernel Build: /mounts/users-space/dgollub/SLES-10-SP2-RT-KERNEL-ARCHIVE/SLERT_SP2_BETA7_IGNORESPURIOUSIRQ_20080430/sle10-sp2-rt-x86_64/

Regression check with Beta 7 kernel.

http://w3.suse.de/~dgollub/SLERT/SLERT-testresults/longterm2.1/

Approx. runtime till soft hang (without nmi_watchdog): 14 hours

Serial console log of SysRq of showTasks, showBlockedTasks and others attached.
Crashdump is also available:
/mounts/users-space/dgollub/crashdump/SLERT/10/SP2/Beta7-mbuild/quake110.suse.de/bnc399793#2/2008-05-14-01\:49/vmcore
Comment 3 Daniel Gollub 2008-05-15 07:34:18 UTC
prio-preempt involved:

      KERNEL: vmlinux-2.6.22.19-0.8-rt          
    DUMPFILE: ../quake110.suse.de/bnc399793#1/2008-05-13-04:51/vmcore
        CPUS: 4
        DATE: Tue May 13 10:50:13 2008
      UPTIME: 1 days, 19:23:18
LOAD AVERAGE: 49.91, 48.23, 47.46
       TASKS: 235
    NODENAME: quake110
     RELEASE: 2.6.22.19-0.8-rt
     VERSION: #1 SMP PREEMPT RT 2008-05-06 20:07:24 +0200
     MACHINE: x86_64  (3200 Mhz)
      MEMORY: 3.9 GB
       PANIC: "SysRq : Trigger a crashdump"
         PID: 471
     COMMAND: "IRQ-4"
        TASK: ffff81011476e040  [THREAD_INFO: ffff81011171c000]
         CPU: 1
       STATE: TASK_INTERRUPTIBLE (SYSRQ)

crash> ps | grep ">"
>   471      2   1  ffff81011476e040  IN   0.0       0      0  [IRQ-4]
> 11239   9466   2  ffff8100a9239810  RU   0.1  264404   4880  prio-preempt
> 11240   9466   0  ffff8100a9239040  RU   0.1  264404   4880  prio-preempt
> 11241   9466   3  ffff8100a1c0e810  RU   0.1  264404   4880  prio-preempt
crash> task 11239 11240 11241 | grep rt_priori
  rt_priority = 81, 
  rt_priority = 81, 
  rt_priority = 81, 
crash> 
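
For anyone re-checking the dump: the listing above shows three of the four CPUs running prio-preempt at rt_priority 81, with the fourth in the IRQ-4 thread that served the SysRq. Below is a minimal sketch of how such a `ps | grep ">"` listing could be summarized per command. The column order (PID PPID CPU TASK ST %MEM VSZ RSS COMM) is assumed from the usual crash `ps` header, which is not shown in this report, and the function names are made up for illustration.

```python
# Sample input: the four active-task lines copied from the crash session above.
PS_OUTPUT = """\
>   471      2   1  ffff81011476e040  IN   0.0       0      0  [IRQ-4]
> 11239   9466   2  ffff8100a9239810  RU   0.1  264404   4880  prio-preempt
> 11240   9466   0  ffff8100a9239040  RU   0.1  264404   4880  prio-preempt
> 11241   9466   3  ffff8100a1c0e810  RU   0.1  264404   4880  prio-preempt
"""


def active_tasks(ps_text):
    """Return one dict per line that crash marks with '>' (task active on a CPU)."""
    tasks = []
    for line in ps_text.splitlines():
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        # Assumed column order: PID PPID CPU TASK ST %MEM VSZ RSS COMM
        pid, ppid, cpu, _task, state = fields[:5]
        tasks.append({
            "pid": int(pid),
            "ppid": int(ppid),
            "cpu": int(cpu),
            "state": state,
            "comm": " ".join(fields[8:]),
        })
    return tasks


def cpus_by_command(tasks):
    """Map each command name to the sorted list of CPUs it occupies."""
    by_comm = {}
    for task in tasks:
        by_comm.setdefault(task["comm"], []).append(task["cpu"])
    return {comm: sorted(cpus) for comm, cpus in by_comm.items()}


if __name__ == "__main__":
    for comm, cpus in cpus_by_command(active_tasks(PS_OUTPUT)).items():
        print(comm, cpus)
```
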
Comment 4 Daniel Gollub 2008-05-27 13:47:01 UTC
Renaming the bug from hard hang to soft hang: the initial report may have been a misdiagnosis, because the IRQ handler thread for the serial console wasn't bumped up to SCHED_FIFO 99. CPU hogs could starve that IRQ thread, which would explain why SysRq was unusable in the initial report.

Lowering severity to major, since this looks like a testcase issue (file write access deadlock with CPU hogs, in combination with CTCS2).
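
To rule out the starvation scenario in future runs, the serial console IRQ thread could be raised to SCHED_FIFO 99 before the test starts. The following is only a sketch, not what the test harness actually does: `boost_to_fifo` and its `setter` parameter are hypothetical names, PID 471 is simply the [IRQ-4] thread from the crash session in comment 3 (it will differ on every boot), and the real call needs root privileges.

```python
import os


def boost_to_fifo(pid, priority=99, setter=os.sched_setscheduler):
    """Put the thread with the given PID under SCHED_FIFO at `priority`.

    `setter` defaults to the real syscall wrapper; it is a parameter only so
    the logic can be exercised without root (hypothetical design choice).
    """
    lowest = os.sched_get_priority_min(os.SCHED_FIFO)
    highest = os.sched_get_priority_max(os.SCHED_FIFO)
    if not lowest <= priority <= highest:
        raise ValueError(
            f"SCHED_FIFO priority must be in {lowest}..{highest}, got {priority}"
        )
    setter(pid, os.SCHED_FIFO, os.sched_param(priority))


if __name__ == "__main__":
    # Example only: 471 was the [IRQ-4] PID in the dump, not a fixed value.
    try:
        boost_to_fifo(471)
    except (PermissionError, ProcessLookupError) as err:
        print("could not change scheduling policy:", err)
```

From a shell, `chrt -f -p 99 <pid>` should have the same effect.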
Comment 9 Martin Polster 2008-07-23 08:15:32 UTC
I talked to svollath, who is responsible for the QA-Lab relocation.
Quake110.suse.de will be available before the end of next week.
Comment 10 Ihno Krumreich 2008-11-04 14:49:28 UTC
Still valid?
Comment 11 Sven Dietrich 2008-11-13 21:35:54 UTC
Please re-confirm the issue against the Update 4 Kernel.
Comment 12 Sven Dietrich 2009-02-17 17:16:52 UTC
No Response.