Bug 368657

Summary: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x40000000 action 0x2 frozen
Product: [SUSE Linux Enterprise Real Time Extension] SUSE Linux Enterprise Real Time 10 SP2 (SLERT 10 SP2) Reporter: Daniel Gollub <dgollub>
Component: kernelAssignee: Stefan Assmann <sassmann>
Status: RESOLVED FIXED QA Contact: Erik Hamera <erik.hamera>
Severity: Critical    
Priority: P5 - None CC: hare, ihno
Version: BETA4   
Target Milestone: GM   
Hardware: x86-64   
OS: SLES 10   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: /proc/interrupts of humperdinck - running SLERT10 SP2 Beta 5 and 'disable_rt_pcix_apic'
/proc/interrups host: graffin
libata-irq-storm.patch
libata-irq-storm.patch
libata-irq-storm.patch
humperdinck.suse.de - /var/log/messages of easter-testrun1
screenlog of graffin.suse.de while running plain ltp without ltp-realtime
dmesg output of K45.suse.de - running ltp-realtime
boot.msg and boot.omsg

Description Daniel Gollub 2008-03-10 13:17:32 UTC
SLERT10 SP2 Beta4 (2.6.22.19-0.5-rt) - on "some heavy" I/O the SATA drive times out.

Host: graffin.suse.de

Very likely it's related to #367474

From serial line:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x40000000 action 0x2 frozen^M
ata1.00: cmd ca/00:08:80:60:3f/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out^M
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)^M 
ata1: soft resetting link^M
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)^M
ata1.00: qc timeout (cmd 0x27)^M
ata1.00: failed to read native max address (err_mask=0x4)^M
ata1.00: revalidation failed (errno=-5)^M
ata1: failed to recover some devices, retrying in 5 secs^M
ata1: hard resetting link^M
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)^M
ata1.00: qc timeout (cmd 0x27)^M 
ata1.00: failed to read native max address (err_mask=0x4)^M
ata1.00: revalidation failed (errno=-5)^M
ata1: limiting SATA link speed to 1.5 Gbps^M
ata1.00: limiting speed to UDMA/133:PIO3^M
ata1: failed to recover some devices, retrying in 5 secs^M
ata1: hard resetting link^M 
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)^M
ata1.00: qc timeout (cmd 0x27)^M 
ata1.00: failed to read native max address (err_mask=0x4)^M
ata1.00: revalidation failed (errno=-5)^M
ata1.00: disabled^M
ata1: EH pending after completion, repeating EH (cnt=4)^M
ata1: soft resetting link^M
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)^M
ata1: EH complete^M
sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK^M
end_request: I/O error, dev sda, sector 4153472^M 
Buffer I/O error on device sda5, logical block 3088^M
lost page write due to I/O error on sda5^M
sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK^M
end_request: I/O error, dev sda, sector 4153480^M 
Buffer I/O error on device sda5, logical block 3089^M

---

Going to run the host with noapic to check if this is the common interrupt routing problem we're facing on other machines (#337006, #367694)
Comment 1 Daniel Gollub 2008-03-10 21:02:39 UTC
Running SLERT10 SP2 Beta4 (2.6.22.19-0.5-rt) with noapic on graffin.suse.de. On stress with SLERT testsuite in a loop. Machine hangs, sysreq didn't work. So noapic doesn't have effect in this case.

Tejun, any idea? Do you need further information?

--- 

ata1.00: cmd ca/00:08:2c:ac:dc/00:00:00:00:00/ec tag 0 cdb 0x0 data 4096 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1: soft resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
nfs: server hilbert3 not responding, still trying
hdb: lost interrupt
nfs: server bartleby.suse.de not responding, still trying
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
nfs: server loki not responding, still trying
Comment 2 Daniel Gollub 2008-03-10 21:31:31 UTC
graffin:~ # lsscsi -H
[0]    sata_svw      
[1]    sata_svw      
[2]    sata_svw      
[3]    sata_svw      
[4]    mptsas  

[...]
01:0e.0 IDE interface: Broadcom BCM5785 [HT1000] SATA (PATA/IDE Mode)
01:0e.1 IDE interface: Broadcom BCM5785 [HT1000] SATA (PATA/IDE Mode)
06:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1064ET PCI-Express Fusion-MPT SAS (rev 02)
[...]
Comment 3 Daniel Gollub 2008-03-10 21:37:37 UTC
Similar/Same issue appeared on different host.

Host: martinu.suse.de
Kernel: SLERT10 SP2 Beta4 (2.6.22.19-0.5-rt)

martinu:~ # lsscsi -H
[0]    sata_nv       
[1]    sata_nv       
[2]    sata_nv       
[3]    sata_nv

[...]
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
[...]

---

-- 60:martinu -- time-stamp -- 2008-03-10 16:40:46 --
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen^M
ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 ^M
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)^M
ata1: soft resetting link^M
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)^M
ata1.00: configured for UDMA/100^M
ata1: EH complete^M

---

Trying to reproduce the issue with looping the SLERT Testsuite with noapic in meanwhile.
Comment 4 Tejun Heo 2008-03-11 02:50:01 UTC
Daniel, the last one is timeout on FLUSH, which usually indicates something is seriously wrong with the disk.  Or, as it's sata_nv, you need to turn off ADMA mode if you're using it.

In the first case, does 'irqpoll' help?
Comment 5 Daniel Gollub 2008-03-11 09:09:24 UTC
(JFYI, for some reason graffin.suse.de survied overnight the SLERT-testsuite in a loop and 'noapic')

graffin.suse.de is now running 2.6.22.19-0.5-rt just with 'irqpoll'.
Comment 6 Sven Dietrich 2008-03-11 17:16:40 UTC
Once you have finished this run, could you try to reproduce in GMC
Comment 7 Daniel Gollub 2008-03-11 22:03:37 UTC
graffin.suse.de is running with 'irqpoll' on 2.6.22.19-0.5-rt for 12h now, with the SLERT testsuite as a loop (managed to survied 4 rounds till now). Should I stop the test with 'irqpoll' and try to reproduce it with GMC?

martinu.suse.de is busy with bug #369024 (blocker)

I try to find another box in meanwhile.
Comment 8 Sven Dietrich 2008-03-11 22:10:40 UTC
Yes, it would be good to find out if this shows up in GMC, asap.
When you run GMC, please revert the rcuify-udp-hash-lock.patch before testing










Comment 9 Daniel Gollub 2008-03-12 00:42:27 UTC
Running custom (removed rcuify-udp-hash-lock.patch from series.conf) build GMC kernel (2.6.22.10-0.5-rt). Looping now SLERT-dbench test to hopefully create enough I/O to reproduce this issue more quickly (if present).
Comment 10 Daniel Gollub 2008-03-12 19:35:52 UTC
graffin.suse.de running custom GMC (2.6.22.10-0.5-rt MINUS rcuify-udp-hash-lock.patch) is still alive - slert-dbench-disk and slert-hackbench got looped 67 times till now.

Going to run the entire SLERT-testsuite in a loop, to be sure this issue isn't just about heavy I/O.

If the box will survive the next 10h this issue is NOT reproduceable in GMC.
Comment 11 Daniel Gollub 2008-03-13 00:07:00 UTC
Created attachment 201021 [details]
/proc/interrupts of humperdinck - running SLERT10 SP2 Beta 5 and 'disable_rt_pcix_apic'

host: humperdinck.suse.de (SLERT10 SP2 Beta5 with 'disable-rt-pcix-apic')

While running SLERT-testsuite same warnings appears on humperdinck.
Regarding /var/log/message of humperdinck (https://bugzilla.novell.com/attachment.cgi?id=201020) only ltp-realtime seems to trigger this issue.


---

Mar 13 00:04:02 humperdinck root: ltp-realtime
Mar 13 00:07:35 humperdinck kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Mar 13 00:07:35 humperdinck kernel: ata1.00: cmd a0/00:00:00:00:20/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
Mar 13 00:07:35 humperdinck kernel:          res 40/00:03:00:00:20/00:00:00:00:00/a0 Emask 0x4 (timeout)
Mar 13 00:07:35 humperdinck kernel: ata1: soft resetting link
Mar 13 00:07:35 humperdinck kernel: ata1.00: configured for UDMA/25
Mar 13 00:07:35 humperdinck kernel: ata1: EH complete

---

humperdinck:~ # lsscsi 
[0:0:0:0]    cd/dvd  TEAC     DW-224E-A        1.4A  /dev/sr0
[2:0:6:0]    process ESG-SHV  SCA HSBP M35     1.06  -       
[3:0:0:0]    disk    MAXTOR   ATLAS10K5_73SCA  JNZ3  /dev/sda
[3:0:6:0]    process ESG-SHV  SCA HSBP M35     1.06  -       
humperdinck:~ # lsscsi -H
[0]    ata_piix      
[1]    ata_piix      
[2]    mptspi        
[3]    mptspi
Comment 12 Tejun Heo 2008-03-13 00:44:29 UTC
Daniel, it's different one.  That's timeout on the cdrom.  The command is TEST_UNIT_READY which is probably issued from hald polling the device for media presence.  hal-disable-polling /dev/sr0 should work around the problem.
Comment 13 Daniel Gollub 2008-03-13 17:16:48 UTC
Tejun, maybe it's about that the interrupts gets preempted?
I try your suggestion about hal-disale-polling.

Sven. suggested on the call to change the priorities of the interrupts. Going to try this as well.
Comment 14 Tejun Heo 2008-03-14 02:58:34 UTC
Some controllers are susceptible to IRQ storms which usually take down the interrupt and thus timeouts and device disablement.  This was what I suspected first and couldn't confirm either wise because there was no log leading to the timeouts so I suggested irqpoll.  If you have full log of failure, posting it will be very helpful.  Logging serial console should do the job.

In principal, interrupt handling preemption should work okay but I wouldn't be too surprised if some ATA controllers lock up because of it.  But let's see if we have other explanations.
Comment 15 Stefan Assmann 2008-03-17 10:26:58 UTC
investigating which rt tests causes the timeouts on hosts graffin.suse.de and martinu.suse.de
Comment 16 Stefan Assmann 2008-03-18 09:27:49 UTC
graffin.suse.de:

Mar 17 10:25:52 graffin root: /root/slert-ltp-realtime/slert-3.tcf: 2
Mar 17 10:25:53 graffin kernel: irq 11: nobody cared (try booting with the "irqpoll" option)
Mar 17 10:25:53 graffin kernel:
Mar 17 10:25:53 graffin kernel: Call Trace:
Mar 17 10:25:53 graffin kernel:  [<ffffffff80278dde>] __report_bad_irq+0x1e/0x80
Mar 17 10:25:53 graffin kernel:  [<ffffffff802790ee>] note_interrupt+0x2ae/0x2e0
Mar 17 10:25:53 graffin kernel:  [<ffffffff8027813d>] thread_simple_irq+0x7d/0xa0
Mar 17 10:25:53 graffin kernel:  [<ffffffff80278c23>] do_irqd+0x233/0x3a0
Mar 17 10:25:53 graffin kernel:  [<ffffffff802789f0>] do_irqd+0x0/0x3a0
Mar 17 10:25:53 graffin kernel:  [<ffffffff802789f0>] do_irqd+0x0/0x3a0
Mar 17 10:25:53 graffin kernel:  [<ffffffff802521bb>] kthread+0x4b/0x80
Mar 17 10:25:53 graffin kernel:  [<ffffffff8020ae98>] child_rip+0xa/0x12
Mar 17 10:25:53 graffin kernel:  [<ffffffff80252170>] kthread+0x0/0x80
Mar 17 10:25:53 graffin kernel:  [<ffffffff8020ae8e>] child_rip+0x0/0x12
Mar 17 10:25:53 graffin kernel:
Mar 17 10:25:53 graffin kernel: handlers:
Mar 17 10:25:53 graffin kernel: [<ffffffff88088510>] (ata_interrupt+0x0/0x210 [libata])
Mar 17 10:25:53 graffin kernel: turning off IO-APIC fast mode.

martinu.suse.de:

Mar 17 15:06:18 martinu sassmann: /suse/dgollub/SLERT/slert-ltp-realtime-20080317/slert-4.tcf: 1
Mar 17 15:07:29 martinu kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Mar 17 15:07:29 martinu kernel: ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
Mar 17 15:07:29 martinu kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 17 15:07:30 martinu kernel: ata1: soft resetting link
Mar 17 17:53:30 martinu syslog-ng[4032]: STATS: dropped 0
Mar 17 15:07:30 martinu kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 17 15:07:30 martinu kernel: ata1.00: configured for UDMA/100
Mar 17 15:07:30 martinu kernel: ata1: EH complete
Mar 17 15:08:30 martinu kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Mar 17 15:08:30 martinu kernel: ata1.00: cmd ca/00:08:90:00:40/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
Mar 17 15:08:30 martinu kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 17 15:08:30 martinu kernel: ata1: soft resetting link
Mar 17 15:08:30 martinu kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Comment 17 Stefan Assmann 2008-03-18 12:18:22 UTC
slert-3.tcf - async_handler_tsc.c
#define HANDLER_PRIO 98
#define SIGNAL_PRIO 99
[...]
handler_id = create_fifo_thread(handler_thread, (void*)0, HANDLER_PRIO);
signal_id = create_fifo_thread(signal_thread, (void*)0, SIGNAL_PRIO);

slert-4.tcf - gtod_infinite.c
param.sched_priority = sched_get_priority_min(SCHED_FIFO) + 80;
[...]
while(1) {

I doubt these tests fit our needs... by pushing the thread priority above interrupt priority you might sooner or later miss irqs... which is what we observe here
Comment 18 Stefan Assmann 2008-03-18 12:30:41 UTC
will try to reproduce with prior
echo 0 > /proc/sys/kernel/rtlock_timeout
as suggested by sven
Comment 19 Tejun Heo 2008-03-18 12:31:13 UTC
Ah... so, it was IRQ storm after all.  Unfortunately, we don't have a generic way to avoid this kind of problems yet.  The problem is that ATA controllers with SFF compatible interface doesn't have proper IRQ pending bit, so the driver has to anticipate when IRQs will occur and clear IRQ preemptively.  Sometimes, for some reason, hardware decides to raise IRQ when the driver isn't expecting and boom, you get a nice IRQ storm followed by timeouts and offlining of attached devices.

I can hack a quick fix for the problem but I'm not comfortable with distributing such quick fix widely.  The quick fix is to try to clear ATA IRQ on every IRQ which usually works but certain controllers don't like its status register to be accessed at certain times and it has slight possibility of locking up the whole machine solid (controller hanging while holding PCI bus).

So, this is quite a minefield.  Is temporary fix activated with module parameter good enough?
Comment 20 Tejun Heo 2008-03-18 12:33:50 UTC
Stefan, delaying IRQ handling won't cause IRQ storm.  AFAIK, IRQ handlers still run when IRQs are raised, they make the IRQ and defer actual handling to IRQ thread.  Even if the IRQ thread experiences delay, it won't result in IRQ storm but there's no harm in double checking.  :-)
Comment 21 Sven Dietrich 2008-03-18 14:35:48 UTC
The key thing to note about the RT Kernel, is that IRQs are masked at the PIC, and left masked, until the IRQ thread has run.
This is problematic for *shared* edge IRQs, since we can miss them while the PIC has the IRQ line masked.
Assuming the IRQ was processed, and the thread was dispatched, but has not run. Is there a suggestion, that the device could raise more IRQs?
Should it not wait until the IRQ condition is cleared?
What, if any, is the hard-wired timeout for this IRQ condition to be cleared?
Comment 22 Sven Dietrich 2008-03-18 14:41:05 UTC
(In reply to comment #17 from Stefan Assmann)
> slert-3.tcf - async_handler_tsc.c
> #define HANDLER_PRIO 98
> #define SIGNAL_PRIO 99
> [...]
> handler_id = create_fifo_thread(handler_thread, (void*)0, HANDLER_PRIO);
> signal_id = create_fifo_thread(signal_thread, (void*)0, SIGNAL_PRIO);
> 
> slert-4.tcf - gtod_infinite.c
> param.sched_priority = sched_get_priority_min(SCHED_FIFO) + 80;
> [...]
> while(1) {
> 
> I doubt these tests fit our needs... by pushing the thread priority above
> interrupt priority you might sooner or later miss irqs... which is what we
> observe here
> 

I didn't look that the signal_thread does. If its polling for a signal at PRIO > 50, it may never receive the signal. If this is the case, can you also try to switch the SIGNAL and HANDLER prios to 50, and 49, respectively?

Ultimately, I am beginning to think we are dealing with 2 different issues here.

1. IRQ issues leading to device offlining
2. System HANG issues due to test design.

What troubles me, is that under no circumstances, should a set of CPU-bound RT tasks cause hardware-related failured.
Comment 23 Sven Dietrich 2008-03-18 14:49:19 UTC
(In reply to comment #16 from Stefan Assmann)
> graffin.suse.de:
> 
> Mar 17 10:25:52 graffin root: /root/slert-ltp-realtime/slert-3.tcf: 2
> Mar 17 10:25:53 graffin kernel: irq 11: nobody cared (try booting with the
> "irqpoll" option)

The nobody cared message usually indicates we have missed a level IRQ.
Can you show /proc/interrupts again? attachment 201021 [details] does not show IRQ 11.

I assume IRQ 11 is a SHIRQ for PCI - and edge. 

See also my comment #21
Comment 24 Stefan Assmann 2008-03-18 15:14:53 UTC
Created attachment 202779 [details]
/proc/interrups host: graffin

IRQ does not seem to be shared
Comment 25 Stefan Assmann 2008-03-18 15:26:54 UTC
(In reply to comment #19 from Tejun Heo)

Hi Tejun, sorry to ask but things are not that clear to me yet ;)

> Ah... so, it was IRQ storm after all.
What indicates that this is a irq storm? When thinking of irq storm, I assume that there are plenty of IRQs in a short time which can't be processed and thus "getting lost". Is that correct?

> The problem is that ATA controllers
> with SFF compatible interface doesn't have proper IRQ pending bit, so the
> driver has to anticipate when IRQs will occur and clear IRQ preemptively. 
What's SFF?


Comment 26 Stefan Assmann 2008-03-18 15:37:10 UTC
(In reply to comment #22 from Sven-Thorsten Dietrich)

> 
> What troubles me, is that under no circumstances, should a set of CPU-bound RT
> tasks cause hardware-related failured.
> 
Please correct me here.
Scenario: ThreadX very busy doing something with prio 99 (for ex.)
- Controller triggers Interrupt
- Interrupt handler kicks in (immediately?) and delegates Interrupt work to IRQ-Thread (running with prio 50)
- Interrupt handler stops and ThreadX continues
- IRQ-Thread can't workoff IRQ in time (controller timeout) and thus device gets offlined... really bad...

How can we prevent that? Confused...
Comment 27 Ihno Krumreich 2008-03-18 22:37:55 UTC
Resetting to Critical as it seems related to #367474.
(Reviewed by Ihno and Peter)
Comment 28 Tejun Heo 2008-03-19 03:11:27 UTC
Created attachment 202830 [details]
libata-irq-storm.patch

Can you please apply the attached patch and boot with libata.ata_irqstorm_workaround=1 (this can also be turned on and off via sysfs runtime) and see whether the problem goes away?
Comment 29 Tejun Heo 2008-03-19 03:17:34 UTC
Stefan, the problem is that the controller is raising interrupt when the driver is not expecting.  This probably isn't RT kernel's fault although it's possible that different timing of things may have contributed to the controller raising interrupt unexpectedly.

This basically stems from the inadequacy of SFF compliant controller interface which lacks reliable IRQ pending bit and relies on driver state machine correctly anticipating all IRQs.  Oh well, the interface is something like twenty years old now and definitely is showing the age.  :-(

Do we have access to chipset documentation for svw controllers?  Some controllers implement non-standard IRQ pending bit (e.g. ata_piix and sata_sil) so if svw has one, we can solve this problem nicely.
Comment 30 Stefan Assmann 2008-03-19 07:38:33 UTC
status: with echo 0 > /proc/sys/kernel/rtlock_timeout both graffin and martinu survived slert-3.tcf and slert-4.tcf for several hours, no probs.

I'll build a kernel with tejun's patch and run that.
Comment 31 Stefan Assmann 2008-03-19 10:40:13 UTC
Created attachment 202881 [details]
libata-irq-storm.patch

Hi Tejun, 

I replaced your patch with an updated one because you made it against sles10-sp2 source which different from sles10-sp2-rt.
Comment 32 Stefan Assmann 2008-03-20 17:56:26 UTC
graffin login: irq 11: nobody cared (try booting with the "irqpoll" option)

Call Trace:
 [<ffffffff80278d5e>] __report_bad_irq+0x1e/0x80
 [<ffffffff8027906e>] note_interrupt+0x2ae/0x2e0
 [<ffffffff802780bd>] thread_simple_irq+0x7d/0xa0
 [<ffffffff80278ba3>] do_irqd+0x233/0x3a0
 [<ffffffff80278970>] do_irqd+0x0/0x3a0
 [<ffffffff80278970>] do_irqd+0x0/0x3a0
 [<ffffffff8025213b>] kthread+0x4b/0x80
 [<ffffffff8020ae98>] child_rip+0xa/0x12
 [<ffffffff802520f0>] kthread+0x0/0x80
 [<ffffffff8020ae8e>] child_rip+0x0/0x12

this just happeneed with libata.ata_irqstorm_workaround=1
Comment 33 Tejun Heo 2008-03-21 00:38:29 UTC
Created attachment 203275 [details]
libata-irq-storm.patch

Hmmm... So, IRQ storm is happening while a command is still in flight.  How about the attached patch?
Comment 34 Daniel Gollub 2008-03-24 21:49:05 UTC
Created attachment 203620 [details]
humperdinck.suse.de - /var/log/messages of easter-testrun1

Attachted you can find /var/log/message from running the SLERT testsuite in a loop. This testsuite didn't contain any tests of the ltp-realtime unit.

Timeouts from the CD drive starts with the 3rd cycle of the testsuites.

Couldn't reproduce this on graffin.suse.de and martinu.suse.de with this testrun - since both boxes crashed very early in this testrun: #373020

Excluded test which are known to crash the boxes and restarted the testrun of graffin.suse.de to reproduce the initial reported issue, without ltp-realtime tests.
Comment 35 Tejun Heo 2008-03-25 01:24:16 UTC
Daniel, that's a different failure.  If you put a media in the cdrom and try to access it after such condition occurs, does it work?
Comment 36 Daniel Gollub 2008-03-26 13:58:19 UTC
Created attachment 204038 [details]
screenlog of graffin.suse.de while running plain ltp without ltp-realtime

Host: graffin.suse.de
Kernel:  2.6.22.19-SLERT10_BRANCH_20080320021717-rt


Initial issue still appears, even without test from ltp-realtime. Plain ltp triggered this issue as well.. try to narrow done to a single testcase and hopefully make this issue faster reproducible.


Tejun, will inset a CD into the cd drive of humperdinck.suse.de - should we continue the issue of humperdinck in a separated bug?
Comment 37 Tejun Heo 2008-03-26 14:07:38 UTC
AFAICS, the humperdinck one is what this bugzilla is about.  I think creating a new one for the cdrom one is the right thing to do.  Thanks.
Comment 38 Daniel Gollub 2008-03-26 14:11:41 UTC
(In reply to comment #36 from Daniel Gollub)
[...]
> Host: graffin.suse.de
> Kernel:  2.6.22.19-SLERT10_BRANCH_20080320021717-rt
> 
> Initial issue still appears, even without test from ltp-realtime. Plain ltp
> triggered this issue as well.. try to narrow done to a single testcase and
> hopefully make this issue faster reproducible.


CORRECTION: This issue appeared while starting the ltp-realtime testsuite, graffin.suse.de passed plain ltp.

Just rebooted graffin.suse.de and checked the log files before the disk issue appeared. 


--------
Mar 26 11:28:08 graffin logger: Happy easter testing 3
Mar 26 11:28:08 graffin logger: /suse/dgollub/SLERT/slert-ltp-realtime-20080324/slert-1.tcf: 1
Mar 26 11:28:09 graffin logger: /suse/dgollub/SLERT/slert-ltp-realtime-20080324/slert-1.tcf: 2
--- REBOOT ---
Mar 26 14:32:55 graffin syslog-ng[4678]: syslog-ng version 1.6.8 starting


But actually the the testcase "slert-1.tcf" didn't contain any test:
----
#fg 1 async_handler /usr/bin/async_handler
wait
----

Have to review what real happened.
Comment 39 Daniel Gollub 2008-04-02 16:19:33 UTC
Created attachment 205770 [details]
dmesg output of  K45.suse.de - running ltp-realtime

Host: K45.suse.de (Felix's testhost)
Kernel: 2.6.22.19-SLERT10_BRANCH_20080327204348-rt

While running ltp-realtime (without modification - ltp-realtime-20080229-1 ) disk got set offline. Last running test before disk hang up:

K45:/usr/lib/ctcs2/.slert-ltp-realtime.tcf.log.96556 # ls -1trlist | grep -v cyclic| tail -n 3
296917   4 -rw-r--r-- 1 root root    518 2008-04-02 17:13 prio-preempt
296919   4 -rw-r--r-- 1 root root    545 2008-04-02 17:13 prio-wake
296921   4 -rw-r--r-- 1 root root      3 2008-04-02 17:13 pthread_cond_latency


----

cat /proc/interrupts
           CPU0       CPU1       
  0:        243          0   IO-APIC-edge      timer
  1:       8448       2685   IO-APIC-edge      i8042
  4:         14          0   IO-APIC-edge      serial
  8:          1          0   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
 12:        274          4   IO-APIC-edge      i8042
 14:        449     805517   IO-APIC-edge      ide0
 16:          0          0   IO-APIC-fasteoi   uhci_hcd:usb5
 18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 19:    2851112          0   IO-APIC-fasteoi   libata, uhci_hcd:usb3
 22:        168          0   IO-APIC-fasteoi   HDA Intel
 23:          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
4347:        102    3758006   PCI-MSI-edge      eth0
NMI:          0          0   Non-maskable interrupts
LOC:   18200468   25818022   Local interrupts
RES:     556645     478270   Rescheduling interrupts
CAL:        287        386   function call interrupts
TLB:     396600     433155   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
SPU:          0          0   Spurious interrupts
ERR:          0

---

K45:/usr/lib/ctcs2 # lspci
00:00.0 Host bridge: Intel Corporation 82945G/GZ/P/PL Memory Controller Hub
00:01.0 PCI bridge: Intel Corporation 82945G/GZ/P/PL PCI Express Root Port
00:1b.0 Audio device: Intel Corporation 82801G (ICH7 Family) High Definition Audio Controller (rev 01)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.2 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 3 (rev 01)
00:1c.3 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 4 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) SATA IDE Controller (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
01:00.0 VGA compatible controller: ATI Technologies Inc RV370 5B60 [Radeon X300 (PCIE)]
01:00.1 Display controller: ATI Technologies Inc RV370 [Radeon X300SE]
02:00.0 Ethernet controller: Intel Corporation 82573V Gigabit Ethernet Controller (Copper) (rev 03)
Comment 40 Tejun Heo 2008-04-03 03:07:28 UTC
Daniel, is this from the patched kernel?  Can you please post full kernel log from boot and the result of "hdparm -I /dev/sda"?
Comment 41 Daniel Gollub 2008-04-03 09:29:28 UTC
Do you mean kernel patched with attachment 203275 [details]? No - it's a plain SLERT KOTD without any patch. Should we try this patch for the issue in comment 39?

(Box is currently unavailable due to office movement - I'll try get requested data within the next few hours.)
Comment 42 Tejun Heo 2008-04-03 09:39:27 UTC
Okay, this is getting confusing.  AFAICS, there are two issues here.

1. ATA controller going completely off-line after a while.  At least one of those cases is proved to be irq storm problem.  Not enough kernel log was posted for other ones.  The patch from comment #33 should work around this problem.  At least similar approach works on non-RT kernel.  (Please don't forget to actually turn it on by specifying module param).

2. Optical drive choking on media presence polling.  Telling hal to stop polling works around this problem at the cost of automatic media presence detection, of course.

Both #1 and #2 require too pervasive changes to properly fix for SLERT10, so we'll have to get by with those workarounds.
Comment 43 Felix Foerster 2008-04-03 12:27:28 UTC
Created attachment 205991 [details]
boot.msg and boot.omsg
Comment 44 Felix Foerster 2008-04-03 12:30:26 UTC
This is the requested info for K45.

-------------

boot.msg and boot.omsg are attached in Comment #43


-------------

hdparm -I /dev/sda

/dev/sda:
A
ATA device, with non-removable media
	Model Number:       WDC WD2500JS-41MVB1                     
	Serial Number:      WD-WCANY1570215
	Firmware Revision:  10.02E01
Standards:
	Supported: 7 6 5 4 
	Likely used: 7
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  488397168
	device size with M = 1024*1024:      238475 MBytes
	device size with M = 1000*1000:      250059 MBytes (250 GB)
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 1
	Standby timer values: spec'd by Standard, with device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	Recommended acoustic management value: 128, current value: 254
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	NOP cmd
	   *	READ BUFFER cmd
	   *	WRITE BUFFER cmd
	   *	Look-ahead
	   *	Write cache
	   *	Power Management feature set
		Security Mode feature set
	   *	SMART feature set
	   *	FLUSH CACHE EXT command
	   *	Mandatory FLUSH CACHE command 
	   *	Device Configuration Overlay feature set 
	   *	48-bit Address feature set 
		Automatic Acoustic Management feature set 
	   *	SET FEATURES subcommand required to spinup after power up
		Power-Up In Standby feature set
	   *	DOWNLOAD MICROCODE cmd
	   *	General Purpose Logging feature set
	   *	SMART self-test 
	   *	SMART error logging 
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
		frozen
	not	expired: security count
	not	supported: enhanced erase
Checksum: correct
Comment 45 Daniel Gollub 2008-04-11 15:22:56 UTC
I guess we're looking here two different bugs:

First: comment #11, comment #16 (at least martinu.suse.de), comment #39 got in contact with high priority spinners, which blocked the hardware interrupt handler (caused by CPU affinity for the hardware interrupt handler).


Second: Seems to be IRQ nobody cared Stefan is investigating on. Stefan should we close this bug as duplicate and reference this to the IRQ nobody cared bug?
Comment 46 Stefan Assmann 2008-04-14 09:03:05 UTC
Actually looking at the initial report, this is not related to the irq nobody cared issue. This just got pushed into that direction. I'd like to reassign to daniel for reproduction of the initial error message with latest beta 6 kernel and otherwise close it.
Comment 47 Daniel Gollub 2008-04-14 14:02:01 UTC
Ok, the initial report is also about the CPU affinity. I can reproduce the issue within 1-2 minutes when pinning the hardware interrupt kernel thread on one CPU and run a high priority spinner like gtod_infinite on the same CPU.

The results is:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x40000000 action 0x2 frozen
ata1.00: cmd ca/00:a8:40:2c:3f/00:00:00:00:00/e0 tag 0 cdb 0x0 data 86016 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1: soft resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
graffin:~ # ata1.00: qc timeout (cmd 0x27)
ata1.00: failed to read native max address (err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
ata1: hard resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

This doesn't appear on the SLERT 10 SP2 Beta 6 (#2) kernel. Relevant patch:

-------------------------------------------------------------------
Fri Apr 11 11:31:09 CEST 2008 - dgollub@suse.de

- patches.rt/only-run-softirqs-from-irq-thread-when-irq-affinity-is-set-MISSINGHUNK.patch:
  Only run softirqs from the irq thread if the irqaffinity is
  set to 1 CPU (bnc#367694, bnc#373822, bnc#376793, bnc#365997,
  bnc#376244).



The initial issue got fixed with the Beta 6 (#2) Kernel. The outstanding "irq nobody cares"-issue get discussed in bug 376671
Comment 48 Christian Hueller 2009-02-16 11:01:18 UTC
setting fixed in milestone to GM for correct metric numbers