Bug 335676

Summary: kernel OOPS when starting nut UPS daemon
Product: [openSUSE] openSUSE 10.3 Reporter: Stanislav Brabec <sbrabec>
Component: KernelAssignee: Oliver Neukum <oneukum>
Status: RESOLVED NORESPONSE QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: suse+build
Version: Final   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: minicom.log
minicom.cap
lsusb.log
ups2.log

Description Stanislav Brabec 2007-10-22 14:51:06 UTC
I got kernel OOPS when starting nut UPS daemon to handle USB HID UPS APC SmartUPS 750.

I was able to reproduce this crash at least 3 times (sometimes keyboard started to blink, sometimes not, once the system except keyboard even continued to work for a minute), but when starting the ups daemon with strace, kernel did not oops and even consequent standard start did not oops.
Comment 1 Stanislav Brabec 2007-10-22 14:52:14 UTC
Created attachment 179712 [details]
minicom.log

Console capture of the crash.
Comment 2 Stanislav Brabec 2007-10-22 14:53:22 UTC
Comment on attachment 179712 [details]
minicom.log

------------[ cut here ]------------
kernel BUG at mm/slab.c:2980!
invalid opcode: 0000 [1] SMP 
last sysfs file: /firmware/edd/int13_dev81/extensions
CPU 0 
Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq nfs lockd nfs_acl sunrpc iptable_filter ip_tables ip6table_filter ip6_tables x_tables af_packet ipv6 apparmor reiserfs loop dm_mod rfcomm l2cap nvidia(P) i2c_viapro usb_storage i2c_core ide_core shpchp ohci1394 snd_via82xx gameport snd_ac97_codec ac97_bus snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd sr_mod k8temp hwmon cdrom hci_usb button bluetooth ieee1394 soundcore parport_pc parport rtc_cmos rtc_core rtc_lib pci_hotplug r8169 sg floppy usbhid hid ff_memless ehci_hcd sd_mod uhci_hcd usbcore edd ext3 mbcache jbd fan pata_via sata_via sata_promise libata scsi_mod thermal processor
Pid: 1051, comm: kjournald Tainted: P   M  N 2.6.22.9-0.4-default #1
RIP: 0010:[<ffffffff8028397b>]  [<ffffffff8028397b>] cache_alloc_refill+0xd4/0x1fe
RSP: 0018:ffff81003e109b00  EFLAGS: 00010082
RAX: 00000000ffffffff RBX: 0000000000000046 RCX: 0000000000000000
RDX: ffff81003eb3e1c0 RSI: 0000000000000000 RDI: ffff81003ea45c80
RBP: ffff810037ff0f80 R08: 0000000000000000 R09: ffff81003e25bb40
R10: 0000000000000001 R11: ffffffff880220ae R12: ffff81003ea45c40
R13: ffff81003eb3e1c0 R14: 000000000000000c R15: ffff81003ebb4580
FS:  00002b08151f4090(0000) GS:ffffffff804ff000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b44604c0e20 CR3: 0000000033630000 CR4: 00000000000006e0
Process kjournald (pid: 1051, threadinfo ffff81003e108000, task ffff81003e763750)
Stack:  000512203e109b28 0000000000000000 ffff81003ea45c50 ffff81003ea45c60
 0000000000010001 0000000000000046 0000000000011220 ffff81003ebb4580
 0000000000011220 ffff81003e109b90 ffff81003ea45bf0 ffffffff80283c77
Call Trace:
 [<ffffffff80283c77>] kmem_cache_alloc+0x6b/0x8c
 [<ffffffff80265e60>] mempool_alloc+0x50/0x104
 [<ffffffff8801ca4a>] :scsi_mod:__scsi_get_command+0x16/0x65
 [<ffffffff88021511>] :scsi_mod:scsi_init_io+0x25/0xe9
 [<ffffffff880217c1>] :scsi_mod:scsi_prep_fn+0x1ec/0x276
 [<ffffffff802e72d2>] elv_next_request+0xfb/0x1ba
 [<ffffffff802e6b6f>] elv_merged_request+0x30/0x39
 [<ffffffff802396ff>] lock_timer_base+0x26/0x4c
 [<ffffffff88022136>] :scsi_mod:scsi_request_fn+0x88/0x3b8
 [<ffffffff802a7e6a>] sync_buffer+0x0/0x3f
 [<ffffffff802eb3f9>] generic_unplug_device+0x18/0x28
 [<ffffffff802ec02b>] blk_backing_dev_unplug+0x5b/0x61
 [<ffffffff80265e60>] mempool_alloc+0x50/0x104
 [<ffffffff802a7ea0>] sync_buffer+0x36/0x3f
 [<ffffffff803f9a4a>] __wait_on_bit+0x40/0x6e
 [<ffffffff802a7e6a>] sync_buffer+0x0/0x3f
 [<ffffffff803f9ae4>] out_of_line_wait_on_bit+0x6c/0x78
 [<ffffffff80243381>] wake_bit_function+0x0/0x23
 [<ffffffff88098166>] :jbd:journal_commit_transaction+0xba4/0x1402
 [<ffffffff802396ff>] lock_timer_base+0x26/0x4c
 [<ffffffff8809c183>] :jbd:kjournald+0x15d/0x394
 [<ffffffff80243353>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8809c026>] :jbd:kjournald+0x0/0x394
 [<ffffffff8024322c>] kthread+0x47/0x73
 [<ffffffff8020aa48>] child_rip+0xa/0x12
 [<ffffffff802431e5>] kthread+0x0/0x73
 [<ffffffff8020aa3e>] child_rip+0x0/0x12


Code: 0f 0b eb fe 41 8b 5d 00 8b 54 24 24 48 89 ee 4c 89 ff e8 87 
RIP  [<ffffffff8028397b>] cache_alloc_refill+0xd4/0x1fe
 RSP <ffff81003e109b00>
Comment 3 Jeff Mahoney 2007-10-22 16:15:32 UTC
Can you reproduce this without the nvidia module loaded?
Comment 4 Arjen de Korte 2007-10-22 19:13:46 UTC
I don't know if this is the cause, but there is a bug in the 'drivers/hidparser.c' upto (and including) nut-2.2.0:

620	/* make a list of reports and their lengths */
621	memset(pDesc->replen, 0, 256);

If reports 64..255 are used by the UPS (and the subdriver), the above would lead to the use of uninitialized data. I have seen the driver segfault'ing because of that.

Since we're trying to clear an array of 256 int's, line 621 should really read:

621	memset(pDesc->replen, 0, 256 * sizeof(int));

Note that in nut-2.2.1 and higher this has been solved upstream, as there have been massive changes to the usbhid-ups driver. It now allocates memory through calloc() instead of malloc() if the memory needs to be cleared anyway.
Comment 5 Jeff Mahoney 2007-10-22 19:23:45 UTC
Arjen -

I'd be surprised if it were related at all. This looks more like something scribbled over the reference counting for the scsi mempool slab.

Stanislav -

Can you try booting with the -debug kernel to see if you catch any corruption?
Comment 6 Arjen de Korte 2007-10-23 10:46:53 UTC
Since the uninitialized value is used to determine how much memory is dynamically allocated, it does cause problems (but maybe not related to this bug), so fixing it is highly recommended anyway.

I have seen this driver attempting to allocate hundreds of megabytes of memory due to this bug, so it might easily exhaust all available memory. I don't know how well memory starvation is handled by other programs.
Comment 7 Stanislav Brabec 2007-10-23 11:41:11 UTC
To comment #3: Yes, I can reproduce this without the nvidia module loaded.
Comment 8 Stanislav Brabec 2007-10-24 11:50:06 UTC
Created attachment 180216 [details]
minicom.cap

Complete serial console capture of the session with debug kernel.
Comment 12 Oliver Neukum 2007-10-25 07:35:28 UTC
Stanislav,

you provided two totally different oopses. Is any of them reproducible?
Secondly, please provide the output of "lsusb -v"
Comment 13 Stanislav Brabec 2007-10-25 10:17:11 UTC
Totally different oopses, but exactly the same situation - "rcupsd start".

It is reproducible at about 90%. I did not verify, whether stacktraces of these crashes are identical. Maybe the bug itself causes only memory corruption and the crash occurs later.

To work-around this crash, you can start the daemon with strace, stop it and then start again without strace.

Bus 005 Device 002: ID 05e3:0701 Genesys Logic, Inc. USB 2.0 IDE Adapter
Bus 005 Device 001: ID 0000:0000  
Bus 004 Device 001: ID 0000:0000  
Bus 003 Device 001: ID 0000:0000  
Bus 002 Device 002: ID 0db0:6970 Micro Star International 
Bus 002 Device 001: ID 0000:0000  
Bus 001 Device 004: ID 051d:0002 American Power Conversion Back-UPS Pro 500/1000/1500
Bus 001 Device 001: ID 0000:0000  
Comment 14 Stanislav Brabec 2007-10-25 10:18:10 UTC
Created attachment 180490 [details]
lsusb.log
Comment 15 Arjen de Korte 2007-10-25 15:57:01 UTC
Just to be sure that your UPS doesn't send something weird, could you post the first few seconds worth of output when you start the driver with

    /usr/lib/ups/driver/usbhid-ups -DD -a <upsname>

(if memory serves, please correct path if wrong).

Also I think it is worth to try the patch I attached in bug 331749. Did you already try that? The work-around with strace might mask the use of uninitialized data space, since this will usually be cleared by it (like other debuggers do).

Note that the unitialized memory was not a problem in earlier versions of NUT, since they didn't use a (dynamically allocated) report buffer. We introduced that in nut-2.0.5, but unfortunately, this triggers a bug on some UPS'es (depending on what their report descriptors look like).
Comment 16 Stanislav Brabec 2007-10-25 17:14:28 UTC
After applying of your userspace patch I was able to reproduce the kernel crash in one attempts from three, and crash did not occur in a few seconds, but after about 30 sec.
Comment 17 Stanislav Brabec 2007-10-25 17:18:41 UTC
Created attachment 180616 [details]
ups2.log

Communication log of
/usr/lib/ups/driver/usbhid-ups -DD -a myups 2>ups2.log

Captured when kernel did not crash (at least not immetiately).
Comment 18 Arjen de Korte 2007-10-25 19:35:09 UTC
(In reply to comment #17 from Stanislav Brabec)

> Communication log of
> /usr/lib/ups/driver/usbhid-ups -DD -a myups 2>ups2.log
> 
> Captured when kernel did not crash (at least not immetiately).

Looking at the ReportID's, the usbhid-ups driver is indeed vulnerable with this UPS (for instance, with the HID Path 'UPS.Output.Current' having an ID of 0x40 (which is higher than the 'safe' range 0x00 - 0x3F). If you only run the driver, does it crash (not neccessarily in debug mode) after a while?

There are still some other bugs in 'usbhid-ups' in nut-2.2.0 that can be triggered depending on the UPS that is attached to it. These should be resolved by the nut-2.2.1 driver that will be released in a couple of weeks. If you're willing to try it out, the version we have in Testing is what it looks like (don't use nut-2.2.1-pre1, that's still the old one).
Comment 19 Stanislav Brabec 2007-10-26 09:38:10 UTC
I have not seen any crash of nut driver (even before patch), both standalone or nut. The crashing side is the kernel and memory corruption occurs inside kernel. Nothing from user space (even incorrectly written) should cause kernel OOPS.

If this crash is limited to this UPS model and x86_64, I would prefer to release fix for bug 331749 now. If anybody else is able to reproduce this crash, I'll postpone the (apparently correct) fix, which actually causes kernel to crash.

Note that this UPS also communicates somehow with GNOME Power Manager and it does not cause OOPS.

To Oliver Neukum: Is there something like "valgrind for kernel"?
Comment 20 Arjen de Korte 2007-10-26 11:30:44 UTC
(In reply to comment #19 from Stanislav Brabec)

> I have not seen any crash of nut driver (even before patch), both standalone or
> nut. The crashing side is the kernel and memory corruption occurs inside
> kernel. Nothing from user space (even incorrectly written) should cause kernel
> OOPS.

I agree.

> If this crash is limited to this UPS model and x86_64, I would prefer to
> release fix for bug 331749 now. If anybody else is able to reproduce this
> crash, I'll postpone the (apparently correct) fix, which actually causes kernel
> to crash.

For me the fixes (both to udev and hidparser.c) obviously work. But I don't have this UPS model and neither do I run x86_64. Does this also happen with the default (i586) kernel?

> Note that this UPS also communicates somehow with GNOME Power Manager and it
> does not cause OOPS.

You don't run the standard and HAL drivers at the same time, do you? That doesn't fly, since only one driver can claim an USB device at the same time. NUT 2.2.0 is still broken in this respect, after successfully claiming the device, it won't notice that another driver may have claimed it afterwards (fixed in nut-2.2.1). This leads to the kernel complaining about the driver using the interface without claiming it.
Comment 21 Stanislav Brabec 2007-10-26 12:03:01 UTC
(In reply to comment #20 from Arjen de Korte)
> (In reply to comment #19 from Stanislav Brabec)

> > Note that this UPS also communicates somehow with GNOME Power Manager and it
> > does not cause OOPS.
> 
> You don't run the standard and HAL drivers at the same time, do you? That
> doesn't fly, since only one driver can claim an USB device at the same time.
> NUT 2.2.0 is still broken in this respect, after successfully claiming the
> device, it won't notice that another driver may have claimed it afterwards
> (fixed in nut-2.2.1). This leads to the kernel complaining about the driver
> using the interface without claiming it.
> 

It happens automatically in 10.3 if you have an USB HID UPS. Yes, I have seen these kernel complains. But the crash occurs even without running X, so it should not be related to device re-claiming.
Comment 22 Arjen de Korte 2007-10-26 12:38:14 UTC
(In reply to comment #21 from Stanislav Brabec)

> It happens automatically in 10.3 if you have an USB HID UPS. Yes, I have seen
> these kernel complains. But the crash occurs even without running X, so it
> should not be related to device re-claiming.

Maybe it is not related to this bug, but this won't work reliably and it will break existing installations. I have filed bug 337059 for this.
Comment 23 Oliver Neukum 2007-10-26 13:12:30 UTC
@Stanislav in #19:

SLAB_DEBUG does valgrind's job in kernel space. The debug kernel has it.
Comment 24 Oliver Neukum 2008-01-07 11:30:34 UTC
Can you provide an strace of "rcupsd start" ?
Comment 25 Christoph Thiel 2008-04-25 12:30:33 UTC
Stanislav, please provide the information requested in comment #24.
Comment 26 Oliver Neukum 2008-07-21 06:29:28 UTC
No response for ages