|
Bugzilla – Full Text Bug Listing |
| Summary: | kernel OOPS when starting nut UPS daemon | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE 10.3 | Reporter: | Stanislav Brabec <sbrabec> |
| Component: | Kernel | Assignee: | Oliver Neukum <oneukum> |
| Status: | RESOLVED NORESPONSE | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P5 - None | CC: | suse+build |
| Version: | Final | ||
| Target Milestone: | --- | ||
| Hardware: | Other | ||
| OS: | Other | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: |
minicom.log
minicom.cap lsusb.log ups2.log |
||
|
Description
Stanislav Brabec
2007-10-22 14:51:06 UTC
Created attachment 179712 [details]
minicom.log
Console capture of the crash.
Comment on attachment 179712 [details]
minicom.log
------------[ cut here ]------------
kernel BUG at mm/slab.c:2980!
invalid opcode: 0000 [1] SMP
last sysfs file: /firmware/edd/int13_dev81/extensions
CPU 0
Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq nfs lockd nfs_acl sunrpc iptable_filter ip_tables ip6table_filter ip6_tables x_tables af_packet ipv6 apparmor reiserfs loop dm_mod rfcomm l2cap nvidia(P) i2c_viapro usb_storage i2c_core ide_core shpchp ohci1394 snd_via82xx gameport snd_ac97_codec ac97_bus snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd sr_mod k8temp hwmon cdrom hci_usb button bluetooth ieee1394 soundcore parport_pc parport rtc_cmos rtc_core rtc_lib pci_hotplug r8169 sg floppy usbhid hid ff_memless ehci_hcd sd_mod uhci_hcd usbcore edd ext3 mbcache jbd fan pata_via sata_via sata_promise libata scsi_mod thermal processor
Pid: 1051, comm: kjournald Tainted: P M N 2.6.22.9-0.4-default #1
RIP: 0010:[<ffffffff8028397b>] [<ffffffff8028397b>] cache_alloc_refill+0xd4/0x1fe
RSP: 0018:ffff81003e109b00 EFLAGS: 00010082
RAX: 00000000ffffffff RBX: 0000000000000046 RCX: 0000000000000000
RDX: ffff81003eb3e1c0 RSI: 0000000000000000 RDI: ffff81003ea45c80
RBP: ffff810037ff0f80 R08: 0000000000000000 R09: ffff81003e25bb40
R10: 0000000000000001 R11: ffffffff880220ae R12: ffff81003ea45c40
R13: ffff81003eb3e1c0 R14: 000000000000000c R15: ffff81003ebb4580
FS: 00002b08151f4090(0000) GS:ffffffff804ff000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b44604c0e20 CR3: 0000000033630000 CR4: 00000000000006e0
Process kjournald (pid: 1051, threadinfo ffff81003e108000, task ffff81003e763750)
Stack: 000512203e109b28 0000000000000000 ffff81003ea45c50 ffff81003ea45c60
0000000000010001 0000000000000046 0000000000011220 ffff81003ebb4580
0000000000011220 ffff81003e109b90 ffff81003ea45bf0 ffffffff80283c77
Call Trace:
[<ffffffff80283c77>] kmem_cache_alloc+0x6b/0x8c
[<ffffffff80265e60>] mempool_alloc+0x50/0x104
[<ffffffff8801ca4a>] :scsi_mod:__scsi_get_command+0x16/0x65
[<ffffffff88021511>] :scsi_mod:scsi_init_io+0x25/0xe9
[<ffffffff880217c1>] :scsi_mod:scsi_prep_fn+0x1ec/0x276
[<ffffffff802e72d2>] elv_next_request+0xfb/0x1ba
[<ffffffff802e6b6f>] elv_merged_request+0x30/0x39
[<ffffffff802396ff>] lock_timer_base+0x26/0x4c
[<ffffffff88022136>] :scsi_mod:scsi_request_fn+0x88/0x3b8
[<ffffffff802a7e6a>] sync_buffer+0x0/0x3f
[<ffffffff802eb3f9>] generic_unplug_device+0x18/0x28
[<ffffffff802ec02b>] blk_backing_dev_unplug+0x5b/0x61
[<ffffffff80265e60>] mempool_alloc+0x50/0x104
[<ffffffff802a7ea0>] sync_buffer+0x36/0x3f
[<ffffffff803f9a4a>] __wait_on_bit+0x40/0x6e
[<ffffffff802a7e6a>] sync_buffer+0x0/0x3f
[<ffffffff803f9ae4>] out_of_line_wait_on_bit+0x6c/0x78
[<ffffffff80243381>] wake_bit_function+0x0/0x23
[<ffffffff88098166>] :jbd:journal_commit_transaction+0xba4/0x1402
[<ffffffff802396ff>] lock_timer_base+0x26/0x4c
[<ffffffff8809c183>] :jbd:kjournald+0x15d/0x394
[<ffffffff80243353>] autoremove_wake_function+0x0/0x2e
[<ffffffff8809c026>] :jbd:kjournald+0x0/0x394
[<ffffffff8024322c>] kthread+0x47/0x73
[<ffffffff8020aa48>] child_rip+0xa/0x12
[<ffffffff802431e5>] kthread+0x0/0x73
[<ffffffff8020aa3e>] child_rip+0x0/0x12
Code: 0f 0b eb fe 41 8b 5d 00 8b 54 24 24 48 89 ee 4c 89 ff e8 87
RIP [<ffffffff8028397b>] cache_alloc_refill+0xd4/0x1fe
RSP <ffff81003e109b00>
Can you reproduce this without the nvidia module loaded? I don't know if this is the cause, but there is a bug in the 'drivers/hidparser.c' upto (and including) nut-2.2.0: 620 /* make a list of reports and their lengths */ 621 memset(pDesc->replen, 0, 256); If reports 64..255 are used by the UPS (and the subdriver), the above would lead to the use of uninitialized data. I have seen the driver segfault'ing because of that. Since we're trying to clear an array of 256 int's, line 621 should really read: 621 memset(pDesc->replen, 0, 256 * sizeof(int)); Note that in nut-2.2.1 and higher this has been solved upstream, as there have been massive changes to the usbhid-ups driver. It now allocates memory through calloc() instead of malloc() if the memory needs to be cleared anyway. Arjen - I'd be surprised if it were related at all. This looks more like something scribbled over the reference counting for the scsi mempool slab. Stanislav - Can you try booting with the -debug kernel to see if you catch any corruption? Since the uninitialized value is used to determine how much memory is dynamically allocated, it does cause problems (but maybe not related to this bug), so fixing it is highly recommended anyway. I have seen this driver attempting to allocate hundreds of megabytes of memory due to this bug, so it might easily exhaust all available memory. I don't know how well memory starvation is handled by other programs. To comment #3: Yes, I can reproduce this without the nvidia module loaded. Created attachment 180216 [details]
minicom.cap
Complete serial console capture of the session with debug kernel.
Stanislav, you provided two totally different oopses. Is any of them reproducible? Secondly, please provide the output of "lsusb -v" Totally different oopses, but exactly the same situation - "rcupsd start". It is reproducible at about 90%. I did not verify, whether stacktraces of these crashes are identical. Maybe the bug itself causes only memory corruption and the crash occurs later. To work-around this crash, you can start the daemon with strace, stop it and then start again without strace. Bus 005 Device 002: ID 05e3:0701 Genesys Logic, Inc. USB 2.0 IDE Adapter Bus 005 Device 001: ID 0000:0000 Bus 004 Device 001: ID 0000:0000 Bus 003 Device 001: ID 0000:0000 Bus 002 Device 002: ID 0db0:6970 Micro Star International Bus 002 Device 001: ID 0000:0000 Bus 001 Device 004: ID 051d:0002 American Power Conversion Back-UPS Pro 500/1000/1500 Bus 001 Device 001: ID 0000:0000 Created attachment 180490 [details]
lsusb.log
Just to be sure that your UPS doesn't send something weird, could you post the first few seconds worth of output when you start the driver with
/usr/lib/ups/driver/usbhid-ups -DD -a <upsname>
(if memory serves, please correct path if wrong).
Also I think it is worth to try the patch I attached in bug 331749. Did you already try that? The work-around with strace might mask the use of uninitialized data space, since this will usually be cleared by it (like other debuggers do).
Note that the unitialized memory was not a problem in earlier versions of NUT, since they didn't use a (dynamically allocated) report buffer. We introduced that in nut-2.0.5, but unfortunately, this triggers a bug on some UPS'es (depending on what their report descriptors look like).
After applying of your userspace patch I was able to reproduce the kernel crash in one attempts from three, and crash did not occur in a few seconds, but after about 30 sec. Created attachment 180616 [details]
ups2.log
Communication log of
/usr/lib/ups/driver/usbhid-ups -DD -a myups 2>ups2.log
Captured when kernel did not crash (at least not immetiately).
(In reply to comment #17 from Stanislav Brabec) > Communication log of > /usr/lib/ups/driver/usbhid-ups -DD -a myups 2>ups2.log > > Captured when kernel did not crash (at least not immetiately). Looking at the ReportID's, the usbhid-ups driver is indeed vulnerable with this UPS (for instance, with the HID Path 'UPS.Output.Current' having an ID of 0x40 (which is higher than the 'safe' range 0x00 - 0x3F). If you only run the driver, does it crash (not neccessarily in debug mode) after a while? There are still some other bugs in 'usbhid-ups' in nut-2.2.0 that can be triggered depending on the UPS that is attached to it. These should be resolved by the nut-2.2.1 driver that will be released in a couple of weeks. If you're willing to try it out, the version we have in Testing is what it looks like (don't use nut-2.2.1-pre1, that's still the old one). I have not seen any crash of nut driver (even before patch), both standalone or nut. The crashing side is the kernel and memory corruption occurs inside kernel. Nothing from user space (even incorrectly written) should cause kernel OOPS. If this crash is limited to this UPS model and x86_64, I would prefer to release fix for bug 331749 now. If anybody else is able to reproduce this crash, I'll postpone the (apparently correct) fix, which actually causes kernel to crash. Note that this UPS also communicates somehow with GNOME Power Manager and it does not cause OOPS. To Oliver Neukum: Is there something like "valgrind for kernel"? (In reply to comment #19 from Stanislav Brabec) > I have not seen any crash of nut driver (even before patch), both standalone or > nut. The crashing side is the kernel and memory corruption occurs inside > kernel. Nothing from user space (even incorrectly written) should cause kernel > OOPS. I agree. > If this crash is limited to this UPS model and x86_64, I would prefer to > release fix for bug 331749 now. If anybody else is able to reproduce this > crash, I'll postpone the (apparently correct) fix, which actually causes kernel > to crash. For me the fixes (both to udev and hidparser.c) obviously work. But I don't have this UPS model and neither do I run x86_64. Does this also happen with the default (i586) kernel? > Note that this UPS also communicates somehow with GNOME Power Manager and it > does not cause OOPS. You don't run the standard and HAL drivers at the same time, do you? That doesn't fly, since only one driver can claim an USB device at the same time. NUT 2.2.0 is still broken in this respect, after successfully claiming the device, it won't notice that another driver may have claimed it afterwards (fixed in nut-2.2.1). This leads to the kernel complaining about the driver using the interface without claiming it. (In reply to comment #20 from Arjen de Korte) > (In reply to comment #19 from Stanislav Brabec) > > Note that this UPS also communicates somehow with GNOME Power Manager and it > > does not cause OOPS. > > You don't run the standard and HAL drivers at the same time, do you? That > doesn't fly, since only one driver can claim an USB device at the same time. > NUT 2.2.0 is still broken in this respect, after successfully claiming the > device, it won't notice that another driver may have claimed it afterwards > (fixed in nut-2.2.1). This leads to the kernel complaining about the driver > using the interface without claiming it. > It happens automatically in 10.3 if you have an USB HID UPS. Yes, I have seen these kernel complains. But the crash occurs even without running X, so it should not be related to device re-claiming. (In reply to comment #21 from Stanislav Brabec) > It happens automatically in 10.3 if you have an USB HID UPS. Yes, I have seen > these kernel complains. But the crash occurs even without running X, so it > should not be related to device re-claiming. Maybe it is not related to this bug, but this won't work reliably and it will break existing installations. I have filed bug 337059 for this. @Stanislav in #19: SLAB_DEBUG does valgrind's job in kernel space. The debug kernel has it. Can you provide an strace of "rcupsd start" ? Stanislav, please provide the information requested in comment #24. No response for ages |