Bug 431192

Summary: kernel oops related to unusual USB device
Product: [openSUSE] openSUSE 11.0
Reporter: Juergen Weigert <jw>
Component: Kernel
Assignee: Greg Kroah-Hartman <gregkh>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Minor
Priority: P5 - None
CC: jack
Version: Final
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Bug Depends on:
Bug Blocks: 357354
Attachments:
  full dmesg
  pm-suspend.log
  dmesg with another two oopses, may be unrelated, though...
  dmesg snippet with another oops.
  Patch fixing similarly looking problem in SLES10

Description Juergen Weigert 2008-10-01 09:45:11 UTC
This is a lenovo x60s running 11.0 kernel 2.6.25.16-0.1-pae

Today suspend-to-ram stopped working. 
In dmesg I found the following oops (shortened):

BUG: unable to handle kernel NULL pointer dereference at 00000000
IP: [<c017f6c5>] pipe_write+0xc1/0x43a
Pid: 25605, comm: hal-find-by-cap Tainted: G        N (2.6.25.16-0.1-pae #1)
EIP: 0060:[<c017f6c5>] EFLAGS: 00010286 CPU: 0
EIP is at pipe_write+0xc1/0x43a
Call Trace:
 [<c0179b58>] do_sync_write+0xab/0xe9
 [<c017a3ea>] vfs_write+0x8c/0x136
 [<c017a52d>] sys_write+0x3b/0x60
 [<c01059e4>] sysenter_past_esp+0x6d/0xa9
 [<ffffe430>] 0xffffe430
 =======================
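
For readers less used to the notation: each frame is printed as `symbol+0xOFFSET/0xSIZE`, where the offset is the position inside the function and the size is the length of the function. The following is only a small sketch of how such frames could be pulled out of a log for comparison between oopses; the names `FRAME_RE` and `parse_frames` are made up for this illustration, and the sample text is copied from the trace above.

```python
import re

# Matches frames such as "[<c0179b58>] do_sync_write+0xab/0xe9".
# Raw addresses without a symbol (e.g. "[<ffffe430>] 0xffffe430") are skipped.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(text):
    """Return (address, symbol, offset, size) for each symbolised frame."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append(
            (int(m["addr"], 16), m["sym"], int(m["off"], 16), int(m["size"], 16))
        )
    return frames


OOPS = """\
IP: [<c017f6c5>] pipe_write+0xc1/0x43a
Call Trace:
 [<c0179b58>] do_sync_write+0xab/0xe9
 [<c017a3ea>] vfs_write+0x8c/0x136
 [<c017a52d>] sys_write+0x3b/0x60
 [<c01059e4>] sysenter_past_esp+0x6d/0xa9
 [<ffffe430>] 0xffffe430
"""

if __name__ == "__main__":
    for addr, sym, off, size in parse_frames(OOPS):
        # addr - off is the start address of the function.
        print(f"{sym}: start=0x{addr - off:08x} offset=0x{off:x} size=0x{size:x}")
```

Comparing the symbol names across several oopses is a quick way to see whether they share a code path or only look similar.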
Comment 1 Juergen Weigert 2008-10-01 09:47:09 UTC
Created attachment 242744 [details]
full dmesg
Comment 2 Juergen Weigert 2008-10-01 10:35:20 UTC
Created attachment 242756 [details]
pm-suspend.log

suspend to ram reports error 11 now, and invites me to view the attached logfile.
Comment 3 Forgotten User ZhJd0F0L3x 2008-10-01 14:48:51 UTC
Error 11 is EAGAIN, which probably means that a process is in state D (probably hal-find-by-cap*) and cannot be stopped.

This is _NOT_ a suspend problem; the problem is the oops that occurred before.

This is a plain old kernel bug.
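
A rough sketch of how both points could be checked on a Linux box: looking up what errno 11 means, and listing processes that are stuck in state D by reading `/proc/<pid>/stat`. The helper names (`describe_errno`, `parse_stat`, `uninterruptible_tasks`) are invented for this example, and the numeric value of EAGAIN is platform-specific (it is 11 on Linux).

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def parse_stat(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so the split is made at the last ')'.
    """
    left, right = line.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    return int(pid), comm, right.split()[0]


def uninterruptible_tasks(proc="/proc"):
    """List (pid, comm) of processes currently in state D."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            path = os.path.join(proc, entry, "stat")
            with open(path, errors="replace") as fh:
                pid, comm, state = parse_stat(fh.read())
        except OSError:
            continue  # process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    print(describe_errno(11))
    print(uninterruptible_tasks())
```

If hal-find-by-cap shows up in that list after the oops, that would fit the explanation above.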
Comment 4 Juergen Weigert 2008-10-09 09:12:40 UTC
Created attachment 244539 [details]
dmesg with another two oopses, may be unrelated, though...
Comment 5 Pavel Machek 2008-10-14 23:12:30 UTC
...will suspend start working when you reboot?
Comment 6 Pavel Machek 2008-10-14 23:14:57 UTC
Are the preceding oopses repeatable?
Comment 7 Juergen Weigert 2008-10-15 09:10:55 UTC
Yes, after a reboot the system can suspend/resume again as normal.
Comment 8 Juergen Weigert 2008-10-15 09:40:45 UTC
Created attachment 245617 [details]
dmesg snippet with another oops.

Oopses are reproducible.
It took me two attempts to get one.
Comment 9 Juergen Weigert 2008-10-15 09:55:46 UTC
Reducing environment, disconnecting external monitor, disconnecting USB devices, one by one. 

With all the oopses, one particular USB device was present:
USBasp from http://www.fischl.de/usbasp/, latest firmware 2007-10-23
idVendor=16c0, idProduct=05dc
It is a programmer dongle for embedded systems, controlled via 
libusb calls.

I can drop such a device on seife's desk, if needed.
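
To confirm whether the device is actually enumerated at the time of an oops, sysfs can be checked for the IDs quoted above. This is only a sketch, assuming the usual `/sys/bus/usb/devices/*/idVendor` and `idProduct` files; the function name `find_usb_devices` is made up for the example.

```python
from pathlib import Path


def find_usb_devices(vendor, product, root="/sys/bus/usb/devices"):
    """Return sysfs directories whose idVendor/idProduct match (hex strings)."""
    wanted = (vendor.lower(), product.lower())
    matches = []
    for dev in sorted(Path(root).glob("*")):
        try:
            vid = (dev / "idVendor").read_text().strip()
            pid = (dev / "idProduct").read_text().strip()
        except OSError:
            continue  # interface entries do not carry these files
        if (vid, pid) == wanted:
            matches.append(dev)
    return matches


if __name__ == "__main__":
    # IDs of the USBasp dongle as reported above.
    for dev in find_usb_devices("16c0", "05dc"):
        print(dev)
```

An empty result means the device is not currently visible to the kernel, which helps separate "device present" from "device in use".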
Comment 10 Pavel Machek 2008-10-29 11:11:20 UTC
OK, if a particular device breaks your kernel, debug that particular device ;-). It seems like those oopses will happen even without suspend/resume, right?
Comment 11 Juergen Weigert 2008-10-29 11:43:18 UTC
Agreed. Seems to be independent of suspend.

I failed to identify a specific driver. My assumption is that generic usb is used via libusb.

As mentioned earlier, I can provide the hardware.

Reassigning to kernel-maintainers for advice.
Comment 12 Pavel Machek 2008-10-30 10:50:48 UTC
Okay, this is not related to suspend, and I would not know how to use the device.

If you can reliably crash the kernel using libusb and a non-root uid, that's probably important. Is that the case?

Can you reproduce the crash on latest 11.1beta and/or latest vanilla kernel?

Comment 13 Oliver Neukum 2008-10-30 11:32:20 UTC
The oopses are unrelated to USB. It must be verified that use of the device actually causes them. Mere presence is not proof.
Comment 14 Juergen Weigert 2008-10-30 13:55:11 UTC
Never tried as non-root.
Maybe I can chmod something under /dev/bus/usb to allow me non-root.
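
For reference, the node under /dev/bus/usb is named after the bus and device number, each zero-padded to three digits. A small sketch of building that path and checking whether the current user could open it read-write; the helper names are invented here, and the bus/device numbers in the example call are placeholders, not values from this machine.

```python
import os


def usb_devnode(busnum, devnum):
    """Build the node path; bus and device numbers are zero-padded to 3 digits."""
    return f"/dev/bus/usb/{busnum:03d}/{devnum:03d}"


def can_open_rw(path):
    """True if the current (possibly non-root) user may read and write the node."""
    return os.access(path, os.R_OK | os.W_OK)


if __name__ == "__main__":
    node = usb_devnode(1, 4)  # placeholder numbers
    print(node, "rw" if can_open_rw(node) else "no access")
```

The real numbers can be taken from lsusb output or from the `busnum` and `devnum` files of the device in sysfs.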
Comment 15 Pavel Machek 2008-11-03 08:45:32 UTC
...ok, I guess you need to find a reliable way to reproduce the oops, then we can decide if it is USB-related...
Comment 16 Pavel Machek 2009-02-10 08:56:15 UTC
Oliver, usb is currently our best hope :-).
Comment 17 Oliver Neukum 2009-02-10 09:38:10 UTC
Jan,

does this look like the weird memory corruption issue with usbfs you found?
Comment 18 Jan Kara 2009-02-10 14:20:26 UTC
Definitely not exactly the one I saw as that is already fixed in 2.6.25 used in OpenSUSE 11. But yes, symptoms look similar (i.e., random corruption when USB devices are discovered / removed).
Comment 19 Oliver Neukum 2009-02-12 09:43:29 UTC
Please test whether the mere presence of the device is sufficient or some kind of use is needed.
Comment 20 Oliver Neukum 2009-02-12 13:07:08 UTC
Jan,

as this is too similar to ignore, could you attach that fix?
Comment 21 Jan Kara 2009-02-12 14:09:05 UTC
Created attachment 272295 [details]
Patch fixing similarly looking problem in SLES10

OK, here you have the fix from SLES10 SP2
Comment 23 Jan Kara 2009-02-18 13:02:15 UTC
Looking into the logs, all oopses happen during USB device discovery (in particular, some Broadcom USB device always seems to be nearby). I don't think I'm the right person to debug this. Greg, can you have a look?
Comment 24 Greg Kroah-Hartman 2009-02-18 14:56:31 UTC
Big question: does this still happen on 11.1?
Comment 25 Juergen Weigert 2009-02-18 15:23:11 UTC
I'll have access to the relevant hardware again this weekend.
Is it sufficient to test against SLED11-RC3 ?
Comment 26 Greg Kroah-Hartman 2009-02-18 15:25:20 UTC
RC4 would be best of course :)
Comment 27 Juergen Weigert 2009-02-21 14:43:09 UTC
All is well with SLED11-RC3

Cannot provoke any oopses, not even when exercising suspend/resume.
Thanks for fixing, whoever it was and whatever it was :-)

I suggest RESOLVED FIXED.
Comment 28 Greg Kroah-Hartman 2009-02-21 18:24:28 UTC
ok, marking it as such, thanks for testing.