Bug 332588

Summary: libata: pata_ali: exceptions and timeouts in log and extremely slow system
Product: [openSUSE] openSUSE 10.3 Reporter: Frank Seidel <fseidel>
Component: Kernel Assignee: Tejun Heo <teheo>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P5 - None CC: behlert, kailed, nitroushhh, novell, t.zell
Version: Final   
Target Milestone: ---   
Hardware: i386   
OS: openSUSE 10.3   
Whiteboard:
Found By: Development Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Bug Depends on:    
Bug Blocks: 332586    
Attachments: lspci -v of nx9005
hwinfo --storage-ctrl of nx9005
dmesg of boot with 10.3-GM default kernel with "time" bootparameter given
dmesg with libata.pata_dma=1 given on boot
bootlog with ali15x3 ide driver
dmesg output with kotd (just in case ...)
libata_scsi-Fix-ATAPI-transfer-lengths.patch
dmesg of boot with last patch and without libata.pata_dma=1
atapi-dma-for-rw-only.patch
dmesg of boot with both patches applied
dmesg output while inserting a CD into discdrive (having both patches applied to currently running kernel)
output of lspci -nnvvvxxx with libata loaded
output of lspci -nnvvvxxx with old IDE
bug332588-pata_ali-timing-update.patch
output of lspci -nnvvvxxx with libata/pata_ali loaded and this new patch
output of lspci -nnvvvxxx with alim15x4
output of dd with pata_ali (and last patch)
output of dd with alim15x4 (and same medium as from dd with pata_ali)
dmesg output while dd was run (with pata_ali and the last patch)
output of dd with pata_ali (and last patch) - 2nd try

Description Frank Seidel 2007-10-10 14:53:05 UTC
On an HP Compaq nx9005 the pata_ali driver is now used by default for the PATA devices (HDD and combo CD-ROM).
Besides the system now taking ages to boot and feeling _very_ slow in use, I see lots of 
ata2.00: exception ..... frozen
                         (timeout)
and 
ata2: soft resetting link
(when those are shown the system (e.g. the bootup) pauses)
Furthermore, only with pata_ali active, I get a warning after bootup that SMART reports 66 unreadable sectors. Without libata I don't get this error.

Disabling libata (with hwprobe=-modules.pata) makes all those symptoms disappear (it then boots up really fast without such errors).
Will attach more detailed info and logs soon.
Could we fix this or at least blacklist this controller?
Comment 1 Frank Seidel 2007-10-10 14:54:19 UTC
Created attachment 177452 [details]
lspci -v of nx9005
Comment 2 Frank Seidel 2007-10-10 14:54:49 UTC
Created attachment 177453 [details]
hwinfo --storage-ctrl of nx9005
Comment 3 Frank Seidel 2007-10-10 14:56:09 UTC
Created attachment 177454 [details]
dmesg of boot with 10.3-GM default kernel with "time" bootparameter given
Comment 4 Frank Seidel 2007-10-10 15:00:21 UTC
In the beginning those pauses are "just" about five seconds each:
---
[    2.604000] ata1.00: ATA-6: IC25N040ATMR04-0, MO2OAD5A, max UDMA/100
[    2.620000] ata1.00: 78140160 sectors, multi 16: LBA48 
[    2.652000] ata1.00: configured for UDMA/100
[    2.988000] ata2.00: ATAPI: HL-DT-STCD-RW/DVD DRIVE GCC-4241N, 0C29, max MWDMA2
[    3.176000] ata2.00: configured for MWDMA2
[    3.192000] scsi 0:0:0:0: Direct-Access     ATA      IC25N040ATMR04-0 MO2O PQ: 0 ANSI: 5
[    8.712000] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen

---
But this one is really nasty, especially as it happens on every bootup (while suspend/hibernate don't work yet), and it also seems to "pause" the system now and then during normal operation:
---
[   38.164000] ALSA sound/pci/ali5451/ali5451.c:1935: ali mixer 1 creating error.
[   63.800000] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[   63.816000] ata2.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x5a data 128 in
[   63.816000]          res 40/00:03:00:00:20/00:00:00:00:00/a0 Emask 0x4 (timeout)
[   63.856000] ata2: soft resetting link
[   64.368000] ata2.00: configured for MWDMA1
[   64.384000] ata2: EH complete
[   94.404000] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[   94.420000] ata2.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x5a data 128 in
[   94.420000]          res 40/00:03:00:00:20/00:00:00:00:00/a0 Emask 0x4 (timeout)
[   94.460000] ata2: soft resetting link
[   94.968000] ata2.00: configured for MWDMA1
[   94.984000] ata2: EH complete
[  125.004000] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
---
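The length of these pauses can be read off the timestamps. A minimal Python sketch, assuming the log lines are copied as shown in the excerpt above (the `stall_gaps` helper and its regular expression are illustrative and not part of any kernel tooling), measures the time between each "EH complete" and the next exception on the same port:

```python
import re

# Lines copied from the second dmesg excerpt in this comment.
LOG = """\
[   63.800000] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[   63.856000] ata2: soft resetting link
[   64.368000] ata2.00: configured for MWDMA1
[   64.384000] ata2: EH complete
[   94.404000] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[   94.460000] ata2: soft resetting link
[   94.968000] ata2.00: configured for MWDMA1
[   94.984000] ata2: EH complete
[  125.004000] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
"""

# "[ timestamp] ataN[.MM]: message"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(ata\d+(?:\.\d+)?): (.*)$")


def stall_gaps(log_text):
    """Seconds from each 'EH complete' to the next 'exception' on the same port."""
    gaps = []
    last_complete = {}  # port name -> timestamp of the last "EH complete"
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        ts, dev, msg = float(m.group(1)), m.group(2), m.group(3)
        port = dev.split(".")[0]  # "ata2.00" -> "ata2"
        if msg.startswith("EH complete"):
            last_complete[port] = ts
        elif msg.startswith("exception") and port in last_complete:
            gaps.append(round(ts - last_complete.pop(port), 3))
    return gaps


print(stall_gaps(LOG))
```

For the excerpt above both gaps come out at roughly 30 seconds, which suggests that each retried command waits for a full timeout before error handling runs again.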
Comment 5 Frank Seidel 2007-10-10 15:01:56 UTC
We now have this machine available here. Please let me know how i can help on this or if you'd need this machine so we could send it to you.
Comment 6 Tejun Heo 2007-10-11 02:07:51 UTC
Does kernel parameter "libata.pata_dma=1" help?
Comment 7 Frank Seidel 2007-10-11 08:51:52 UTC
Yes, absolutely. Booted with this parameter there are no more pauses, no ata exceptions and also the smart-warning is gone.
Comment 8 Frank Seidel 2007-10-11 08:53:19 UTC
Created attachment 177608 [details]
dmesg with libata.pata_dma=1 given on boot
Comment 9 Tejun Heo 2007-10-11 09:27:24 UTC
Thanks.  Can you post kernel boot log with IDE driver (brokenmodules=pata_ali or from SL102)?
Comment 10 Frank Seidel 2007-10-11 10:41:23 UTC
Created attachment 177638 [details]
bootlog with ali15x3 ide driver

A 10.2 installation wasn't on the system and brokenmodules=pata_ali didn't work on the 10.3 system for some reason (even with rebuilt initrd etc.),

but i got those boot messages out of a 10.3 rescue-system booted with brokenmodules=pata_ali
Comment 11 Tejun Heo 2007-10-11 11:45:13 UTC
And you can access ODD without any problem from the rescue boot, right?  libata currently has quite some problems with MWDMA devices on various drivers.  There definitely is something wrong with MWDMA handling but we don't know what yet.  I'll dig into the code.  Thanks.
Comment 12 Frank Seidel 2007-10-11 11:53:33 UTC
ODD means the dvd/cd-drive? Yes, I can access it even though it feels very slow.

Would it help to send you the machine?
Comment 13 Tejun Heo 2007-10-11 11:58:32 UTC
Yeah, ODD means dvd/cd-drives and it would be great if you can send the machine but as I probably am half-globe away from you, I'll look into the code a bit first.  I'll let you know if I get stuck.  Thanks.
Comment 14 Daniele Tombolini 2007-10-15 19:16:07 UTC
Hi, same problem on old ACER 210TER.
Need brokenmodules=pata_ali or installation hangs on loading pata module.
Extract from dmesg:

ata2: soft resetting link
ata2.01: failed to IDENTIFY (I/O error, err_mask=0x1)
ata2: failed to recover some devices, retrying in 5 secs
ata2: soft resetting link
ata2.01: failed to IDENTIFY (I/O error, err_mask=0x1)
ata2: failed to recover some devices, retrying in 5 secs
ata2: soft resetting link
ata2.01: failed to IDENTIFY (I/O error, err_mask=0x1)
ata2: failed to recover some devices, retrying in 5 secs
ata2: soft resetting link
ata2.00: configured for MWDMA1
ata2: EH complete
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata2.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x28 data 4096 in
         res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
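
The `cmd` lines in these reports can be decoded mechanically. The sketch below is a minimal Python example, assuming the fields follow the usual libata layout (command/feature first, device register last, then tag, CDB opcode, transfer length and direction). The lookup tables cover only the values that appear in this report, and `decode_cmd` is a hypothetical helper name. For the PACKET command, bit 0 of the feature register is taken to mean that the transfer uses DMA.

```python
import re

# Only the values seen in this bug report; extend as needed.
ATA_COMMANDS = {0xA0: "PACKET"}
SCSI_OPCODES = {0x28: "READ(10)", 0x5A: "MODE SENSE(10)"}

CMD_RE = re.compile(
    r"cmd (?P<command>[0-9a-f]{2})/(?P<feature>[0-9a-f]{2})(?::[0-9a-f]{2}){4}"
    r"/[0-9a-f]{2}(?::[0-9a-f]{2}){4}/(?P<device>[0-9a-f]{2})"
    r" tag (?P<tag>\d+) cdb (?P<cdb>0x[0-9a-f]+)"
    r" data (?P<length>\d+) (?P<direction>\w+)"
)


def decode_cmd(line):
    """Decode one libata 'cmd ...' error line into a small dict."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("not a libata cmd line")
    command = int(m["command"], 16)
    feature = int(m["feature"], 16)
    opcode = int(m["cdb"], 16)
    return {
        "ata_command": ATA_COMMANDS.get(command, hex(command)),
        # Assumption: feature bit 0 selects DMA for PACKET commands.
        "packet_dma": command == 0xA0 and bool(feature & 0x01),
        "scsi_opcode": SCSI_OPCODES.get(opcode, hex(opcode)),
        "length": int(m["length"]),
        "direction": m["direction"],
    }


line = "ata2.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x28 data 4096 in"
print(decode_cmd(line))
```

Read this way, the failing commands in both reports are ATAPI packet commands issued with the DMA bit set (a MODE SENSE(10) on the nx9005, a READ(10) here), which fits the observation that restricting DMA makes the timeouts go away.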


This notebook has run SUSE since 7.1. Starting with 10.0 it needs ACPI=off to install/boot.

After installation I can provide more info and log if needed..
Comment 15 Tejun Heo 2007-10-17 08:15:26 UTC
Frank, does installing KOTD fix the problem?  ie. if you install KOTD and remove libata.pata_dma=1, does the system keep working?

Daniele, you can install by adding the following to the boot parameters.

  options="libata=pata_dma=1"

After installation, update to KOTD (kernel of the day), remove libata.pata_dma=1 kernel parameter and test and report whether it works.

  ftp://ftp.suse.com/pub/projects/kernel/kotd/HEAD/

Thanks.
Comment 16 Frank Seidel 2007-10-17 09:21:50 UTC
Sorry, but the kotd (2.6.23.1-20071013190126-default) doesn't seem to solve this. I installed it, rebooted (without the libata.pata_dma=1) and again have the old behaviour (exceptions, timeouts etc.).
Comment 17 Frank Seidel 2007-10-17 09:28:02 UTC
Created attachment 178957 [details]
dmesg output with kotd (just in case ...)
Comment 18 Daniele Tombolini 2007-10-17 17:43:42 UTC
# 15 Sorry, too late, already installed with brokenmodules=pata_ali.

Let me know if I can do something (reinstalling no, please..) to help you.

Comment 19 Tejun Heo 2007-10-18 04:03:41 UTC
Created attachment 179154 [details]
libata_scsi-Fix-ATAPI-transfer-lengths.patch

Does the attached patch fix the problem?
Comment 20 Frank Seidel 2007-10-18 11:54:01 UTC
Created attachment 179213 [details]
dmesg of boot with last patch and without libata.pata_dma=1

I just built a testkernel with this patch and tried it on that machine, but it still shows the same behaviour with this patch (without using pata_dma=1).
Comment 21 Tejun Heo 2007-10-18 14:24:22 UTC
Created attachment 179247 [details]
atapi-dma-for-rw-only.patch

Please test this one.  Detection should work just fine.  Please also test reading, recording and ripping CDs.  Keep an eye on cpu usage and r/w speed and report any anomalies.  Thanks.
Comment 22 Frank Seidel 2007-10-19 11:14:58 UTC
Created attachment 179432 [details]
dmesg of boot with both patches applied

Well, after adding and rebuilding the kernel with this new patch (means having both applied) i can boot without errors and timeouts (while not using libata.pata_dma=1).
BUT, as soon as I put a CD in the disc drive I get those errors and pauses again (possibly also due to hal trying to immediately automount the medium).
It does not seem possible to actually read a disc this way, and I didn't dare to try writing to a medium.
Comment 23 Frank Seidel 2007-10-19 11:15:46 UTC
Created attachment 179433 [details]
dmesg output while inserting a CD into discdrive (having both patches applied to currently running kernel)
Comment 24 Frank Seidel 2007-10-19 11:19:27 UTC
Another note: while the machine tries to access the medium, the load and CPU usage climb rapidly (top shows near 100% (IO)wait and the load nearly reaches 2.0).
Comment 25 Tejun Heo 2007-10-20 00:44:13 UTC
Yeah, as IO is halted for quite some secs, those reactions are expected.  There are some differences in mwdma mode programming between IDE alim driver and pata_ali.  The IDE driver just uses BIOS programmed values while the libata one tries to configure by itself, apparently incorrectly.  Can you please post the result of "lspci -nnvvvxxx" from both the IDE and libata drivers?  Thanks.
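
Comparing the two register dumps by eye is error-prone. The following is a minimal Python sketch for diffing the hex part of two `lspci -xxx` dumps of the same device. It assumes each dump contains the hex rows of one device only, in the usual `offset: byte byte ...` layout. The function names are hypothetical and the sample rows are made up for illustration; they are not taken from the attachments.

```python
import re

# One hex row of `lspci -xxx` output: an offset, a colon, then up to 16 bytes.
HEX_ROW = re.compile(r"^([0-9a-f]{2,3}):((?: [0-9a-f]{2}){1,16})$")


def parse_config_space(dump):
    """Map config-space offset -> byte value for ONE device's hex dump."""
    regs = {}
    for line in dump.splitlines():
        m = HEX_ROW.match(line.strip())
        if not m:
            continue  # skip the device header and the verbose text lines
        base = int(m.group(1), 16)
        for i, byte in enumerate(m.group(2).split()):
            regs[base + i] = int(byte, 16)
    return regs


def diff_config_space(dump_a, dump_b):
    """Return (offset, value_a, value_b) for every byte that differs."""
    a, b = parse_config_space(dump_a), parse_config_space(dump_b)
    common = sorted(a.keys() & b.keys())
    return [(off, a[off], b[off]) for off in common if a[off] != b[off]]


# Made-up rows for illustration only; NOT taken from the attachments.
IDE_ROW = "50: 00 00 00 00 0a 00 00 00 00 00 00 00 00 00 00 00"
LIBATA_ROW = "50: 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00"

for off, old, new in diff_config_space(IDE_ROW, LIBATA_ROW):
    print(f"offset 0x{off:02x}: IDE=0x{old:02x} libata=0x{new:02x}")
```

Any offsets this reports would be candidates for timing registers that the two drivers program differently; which registers actually matter has to be checked against the controller documentation.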
Comment 26 Frank Seidel 2007-10-22 11:06:35 UTC
Created attachment 179646 [details]
output of lspci -nnvvvxxx with libata loaded
Comment 27 Frank Seidel 2007-10-22 11:07:15 UTC
Created attachment 179647 [details]
output of lspci -nnvvvxxx with old IDE
Comment 28 Daniele Tombolini 2007-11-18 11:44:36 UTC
Hi, any news here ? 
On live-cd there's no workaround.
hwprobe=-modules.pata doesn't help
brokenmodules=pata_ali doesn't help and it seems to be available only on the installation media
No live CD on my notebook :(
Comment 29 Tejun Heo 2007-11-19 15:46:34 UTC
Created attachment 183935 [details]
bug332588-pata_ali-timing-update.patch

Please apply the attached patch and report whether it fixes the problem && the result of "lspci -nnvvvxxx".  Also, with alim15x3 driver loaded, please do "dd if=/dev/sr0 of=/dev/null bs=1M count=32" with a data cd inserted and post the result of "dmesg" and the dd command (it will report how fast it transferred).

Thanks.
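
If the same measurement needs to be repeated in a script, the dd run can be approximated in Python. This is a rough sketch only: `timed_read` is a hypothetical helper, it reads up to `count` blocks of `block_size` bytes like the dd invocation above, and it does not bypass the page cache, so a repeated run on the same medium may report an inflated rate.

```python
import time


def timed_read(path, block_size=1024 * 1024, count=32):
    """Read up to `count` blocks from `path`; return (bytes, seconds, MB/s)."""
    total = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as dev:
        for _ in range(count):
            chunk = dev.read(block_size)
            if not chunk:
                break  # end of medium reached before `count` blocks
            total += len(chunk)
    elapsed = time.monotonic() - start
    # MB here means 10**6 bytes, the unit dd uses in its summary line.
    rate = total / elapsed / 1e6 if elapsed > 0 else float("inf")
    return total, elapsed, rate
```

Calling `timed_read("/dev/sr0")` with a data CD inserted corresponds to the command above; comparing the rate under alim15x3 and under pata_ali gives the same information as the two dd outputs.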
Comment 30 Daniele Tombolini 2007-11-19 18:23:37 UTC
A patch for the installed system? It works well (installed with brokenmodules..).
The big problem is with the live CD, where the workarounds do not work.
My notebook is very slow and rebuilding the whole kernel takes a lot of time, so I'll do it ASAP..
Comment 31 Tejun Heo 2007-11-20 00:59:57 UTC
We first need to determine what's wrong to properly fix the problem.  Doesn't "hwprobe=-modules.pata" work with live cd?
Comment 32 Frank Seidel 2007-11-20 13:32:48 UTC
Daniele: No, the patch was meant for me to test, not for an installed system ;-)

Tejun: I applied this patch (from comment #29) to a freshly installed 10.3 kernel source (2.6.22.12-2-default). Or was this patch meant to be applied on top of all the others (already posted by you here)?

With this patch applied I couldn't see an improvement.
I'll attach the logs in few moments.
Comment 33 Frank Seidel 2007-11-20 13:34:33 UTC
Created attachment 184062 [details]
output of lspci -nnvvvxxx with libata/pata_ali loaded and this new patch
Comment 34 Frank Seidel 2007-11-20 13:39:15 UTC
Created attachment 184064 [details]
output of lspci -nnvvvxxx with alim15x4

probably the same as in comment #27, but anyway ;-)
Comment 35 Frank Seidel 2007-11-20 13:40:20 UTC
Created attachment 184065 [details]
output of dd with pata_ali (and last patch)
Comment 36 Frank Seidel 2007-11-20 13:41:28 UTC
Created attachment 184066 [details]
output of dd with alim15x4 (and same medium as from dd with pata_ali)
Comment 37 Frank Seidel 2007-11-20 13:42:21 UTC
Created attachment 184067 [details]
dmesg output while dd was run (with pata_ali and the last patch)
Comment 38 Frank Seidel 2007-11-20 13:44:37 UTC
Comment on attachment 184065 [details]
output of dd with pata_ali (and last patch)

(arg, bugzilla seems to have a problem with this attachment... trying to repost)
Comment 39 Frank Seidel 2007-11-20 13:47:45 UTC
Created attachment 184069 [details]
output of dd with pata_ali (and last patch) - 2nd try
Comment 40 Tejun Heo 2007-11-28 00:30:48 UTC
Frank, can you please ship the hardware to me?  I'll send you my address via email.  Thanks.
Comment 43 Daniele Tombolini 2008-04-01 22:47:53 UTC
Hi guys, the bug is still present in openSUSE 11 (Factory Alpha 3). Any news about it?
Comment 44 Tejun Heo 2008-04-02 00:01:44 UTC
I've spent quite some time with the hardware but it looks like there's no reason pata_ali shouldn't work when the IDE driver does.  I also asked around other ATA developers but they were lost about it too.  I still have the hardware in my room.  I'll give it another shot and if it doesn't get resolved I'm afraid I'll have to disable DMA for pata_ali for openSUSE 11.
Comment 45 Tejun Heo 2008-04-03 04:55:31 UTC
Okay, did another round of testing with various modifications.  Still no luck.  It seems we'll have to disable pata_ali ATAPI DMA for SL110.  I'm also forwarding a patch to disable ATAPI DMA on pata_ali.  Resolving as FIXED.  If anyone has an idea about something to try, please lemme know.  :-(