Bugzilla – Full Text Bug Listing

| Summary: | ext3 self-destruct on openSUSE 10.2 (kernel-default-2.6.18.2-34) | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE 10.2 | Reporter: | Matthias Andree <matthias.andree> |
| Component: | Kernel | Assignee: | Jan Kara <jack> |
| Status: | RESOLVED INVALID | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P5 - None | CC: | matthias.andree, mfreitas |
| Version: | Final | ||
| Target Milestone: | --- | ||
| Hardware: | i686 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Found By: | Other | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
|
Description
Matthias Andree
2007-02-20 15:23:16 UTC
Thanks for the report but it's hard to say anything here. I guess you don't have a filesystem image (metadata would be enough) before you ran e2fsck, do you? Obviously, something corrupted your filesystem and it seems the corruption was rather heavy. Given your disk had to remap a few sectors before, I would not trust it completely. Unless you have any more information, this is impossible to debug, sorry. So do you have the corrupted fs image or something like that?

I'm afraid I don't have a metadata image. My fault, I didn't think of that and I did not expect corruptions as bad as I've seen, since I have never had such massive data losses with ext2 or ext3 in 10 years.

WRT the disk drive, it passes S.M.A.R.T. self tests and has not reallocated more sectors since the original event (7 in total according to smartctl -a) or logged any I/O errors in the current situation. Even if it had, it should not have cost more than a few directories, but:

$ sudo find /lost+found/ -type d | wc -l
Password:
20366

That's 20366 directories in lost+found, and they're from all over the map, inode numbers from 18,000 to 2,000,000, with a bit more than 2 million available inodes total.

I'd suggest to keep this report around for a few weeks, just to see if any further similar reports come in -- or if this was a one-time event.

The related bits I found are http://www.ussg.iu.edu/hypermail/linux/kernel/0511.0/0193.html

Yes, I'll definitely keep your report in mind. I've actually collected several reports of ext3 corruption in vanilla kernels starting with a bit in a bitmap already cleared (usually it was a block bitmap though). But none of the reports reported a significant filesystem corruption - that differs from your case. So I agree we definitely have a bug somewhere (probably in vanilla kernels) it's just really hard to track it down... So I'll add your report to my collection and close the bug for now.

i think i have hit the same bug: my opensuse 10.2 ext3 partition trashed itself two times this week and i have absolutely no indications of hardware failure.

summary:
- system: AMD Athlon(tm) 64 X2 Dual Core Processor 3800+, HD ST3160811AS (sata), VT8237A SATA 2-Port Controller
- kernel: 2.6.18.8-0.3-default #1 SMP Tue Apr 17 08:42:35 UTC 2007 x86_64
- smartctl reports zero Reallocated_Sector_Ct, Extended offline test: Completed without error (the test was performed after the first corruption).

sequence of operations:
1) sunday morning: system was stuck on session lock. uptime was 7 days and i could not unlock it (reported failure to erase some file or something). ssh didn't work either. reboot. normal work day. system backup performed to a remote tar.gz.
2) tuesday: by the end of the day system was strange and i noticed the failure messages on dmesg (Aborting journal on device sda1, remounting read-only...). there was not a single message from block subsystem (no read or write problems from the sda device, for example).
4) tuesday night and wednesday: e2fsck reported thousands of errors. zillion files and directories appeared on lost+found. spent the rest of the day recovering my system from backup.
5) thursday: another ext3 panic. i could still work on it (readonly) but eventually running "e2fsck -y" caused it to freeze.
6) friday (today): no badblocks found with e2fsck -c. e2fsck -y caused a new zillion files to be moved to lost+found. restored from backup all over again. e2fsck -f again just to make sure everything is consistent.

so here we are. i'm writing this from a fully consistent ext3 filesystem but i have no reasons to believe the problem will go away. i'd like to ask exactly what do you want me to do next time it happens. how do i get the metadata? should i dump it before the corruption occurs again?

I remembered something that might be related to the problem: the day my computer crashed I tried reading a broken DVD.
Actually it is not really broken - a different computer can read it. but here i had messages like this:

end_request: I/O error, dev hdc, sector 5107488
Buffer I/O error on device hdc, logical block 1276872

and so on. note this is a plain IDE/PATA drive. my HD is SATA. in order to eject the dvd (it was completely stuck) i had to force "resetting" the drive, by means of `hdparm -w`. I don't know how/if a device reset on a different controller and drive can cause trouble to the EXT3 fs, but it is the only special thing i remember so it might provide some hint. I still have the dvd. perhaps i should try it again next week...

My computer hung today. No apparent ext3 self-destruction, BUT:

Aug 6 23:25:18 pitanga kernel: Unable to handle kernel paging request at ffffc3ffffffffff RIP:
Aug 6 23:25:18 pitanga kernel: [<ffffffff8810782f>] :ext3:ext3_clear_inode+0x4a/0x8a
Aug 6 23:25:18 pitanga kernel: PGD 0
Aug 6 23:25:18 pitanga kernel: Oops: 0002 [1] SMP
Aug 6 23:25:18 pitanga kernel: last sysfs file: /block/hdc/size
Aug 6 23:25:18 pitanga kernel: CPU 0
Aug 6 23:25:18 pitanga kernel: Modules linked in: vmnet vmmon snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device button battery ac apparmor aamatch_pcre loop dm_mod ftdi_sio usbserial snd_hda_intel snd_hda_codec snd_pcm snd_timer snd soundcore snd_page_alloc nvidia i2c_core ide_cd shpchp cdrom atl1 ehci_hcd uhci_hcd 8139too mii pci_hotplug floppy usbcore parport_pc lp parport ext3 mbcache jbd edd fan sg via82cxxx sata_via libata thermal processor sd_mod scsi_mod ide_disk ide_core
Aug 6 23:25:18 pitanga kernel: Pid: 194, comm: kswapd0 Tainted: P U 2.6.18.8-0.3-default #1
Aug 6 23:25:18 pitanga kernel: RIP: 0010:[<ffffffff8810782f>] [<ffffffff8810782f>] :ext3:ext3_clear_inode+0x4a/0x8a
Aug 6 23:25:18 pitanga kernel: RSP: 0018:ffff810037cedd30 EFLAGS: 00010287
Aug 6 23:25:18 pitanga kernel: RAX: ffffffff881077e5 RBX: ffff81007360dd50 RCX: 0000000000000000
Aug 6 23:25:18 pitanga kernel: RDX: ffff81007360da50 RSI: ffff810037cedd90 RDI: ffffc3ffffffffff
Aug 6 23:25:18 pitanga kernel: RBP: ffff81007360dc88 R08: 000000000000003c R09: 000000000007df36
Aug 6 23:25:18 pitanga kernel: R10: 0000000000000020 R11: ffffffff881077d2 R12: 0000000000000000
Aug 6 23:25:18 pitanga kernel: R13: ffff810037cedd90 R14: 0000000000000080 R15: 0000000000000080
Aug 6 23:25:18 pitanga kernel: FS: 00002ab21859cc60(0000) GS:ffffffff80520000(0000) knlGS:00000000f70d56d0
Aug 6 23:25:18 pitanga kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Aug 6 23:25:18 pitanga kernel: CR2: ffffc3ffffffffff CR3: 000000005c832000 CR4: 00000000000006e0
Aug 6 23:25:18 pitanga kernel: Process kswapd0 (pid: 194, threadinfo ffff810037cec000, task ffff810037ccc7f0)
Aug 6 23:25:18 pitanga kernel: Stack: ffff81007360dd50 ffff81007360dd50 000000000000003d ffffffff80220d37
Aug 6 23:25:18 pitanga kernel: ffff81007360dd60 ffffffff80232b4b 0000000000000000 ffff810032c08110
Aug 6 23:25:18 pitanga kernel: 0000000000000000 0000000000000080 0000000000000080 ffffffff8022ba93
Aug 6 23:25:18 pitanga kernel: Call Trace:
Aug 6 23:25:18 pitanga kernel: [<ffffffff80220d37>] clear_inode+0xd2/0x103
Aug 6 23:25:18 pitanga kernel: [<ffffffff80232b4b>] dispose_list+0x56/0xf6
Aug 6 23:25:18 pitanga kernel: [<ffffffff8022ba93>] shrink_icache_memory+0x1d4/0x203
Aug 6 23:25:18 pitanga kernel: [<ffffffff8023d3ea>] shrink_slab+0xe2/0x15a
Aug 6 23:25:18 pitanga kernel: [<ffffffff8025383e>] kswapd+0x35b/0x454
Aug 6 23:25:18 pitanga kernel: [<ffffffff802928a8>] autoremove_wake_function+0x0/0x2e
Aug 6 23:25:18 pitanga kernel: [<ffffffff802926e5>] keventd_create_kthread+0x0/0x61
Aug 6 23:25:18 pitanga kernel: [<ffffffff802534e3>] kswapd+0x0/0x454
Aug 6 23:25:18 pitanga kernel: [<ffffffff802926e5>] keventd_create_kthread+0x0/0x61
Aug 6 23:25:18 pitanga kernel: [<ffffffff80230838>] kthread+0xec/0x120
Aug 6 23:25:18 pitanga kernel: [<ffffffff80258ee0>] child_rip+0xa/0x12
Aug 6 23:25:18 pitanga kernel: [<ffffffff802926e5>] keventd_create_kthread+0x0/0x61
Aug 6 23:25:18 pitanga kernel: [<ffffffff8023074c>] kthread+0x0/0x120
Aug 6 23:25:18 pitanga kernel: [<ffffffff80258ed6>] child_rip+0x0/0x12
Aug 6 23:25:18 pitanga kernel:
Aug 6 23:25:18 pitanga kernel:
Aug 6 23:25:18 pitanga kernel:
Aug 6 23:25:18 pitanga kernel: Code: f0 ff 0f 0f 94 c0 84 c0 74 05 e8 0d 2f 10 f8 48 c7 85 88 00
Aug 6 23:25:18 pitanga kernel: RIP [<ffffffff8810782f>] :ext3:ext3_clear_inode+0x4a/0x8a
Aug 6 23:25:18 pitanga kernel: RSP <ffff810037cedd30>
Aug 6 23:25:18 pitanga kernel: CR2: ffffc3ffffffffff

anybody listening to this?

Thanks for the report. I was on holiday for a few weeks so I was not able to reply earlier. Sorry for that. Answers to your questions: You don't need to dump metadata before corruption. If the corruption happens again, please use e2image to dump filesystem metadata. I suggest running:

e2image - | bzip2 >corrupted-disk.bz2

The DVD seems to be unconnected to your problem. At least I don't see how it could cause the problem (but if you can verify the bad DVD doesn't trigger the problem it would be fine - it could still be some driver corrupting memory). That oops is more likely to be connected - it seems like a part of a stack being overwritten. It seems you use nvidia driver - are you able to reproduce the problem without this driver loaded?

i've just tried the DVD again but it now seems so badly broken i cannot even mount it... i got several "Buffer I/O error on device hdc" (which is the dvdrom, not related to my ext3 HD). let's see if it does cause any stability problem to my system like last time. my current uptime is 28 days - i'd like to avoid rebooting, changing drivers etc unless absolutely necessary. do you want me to try the 'hdparm -w /dev/hdc' too? as i told you, i did it once to be able to eject the disc.

what about the oops? shall we look into ext3_clear_inode function for anything suspicious? any bound which is not checked or something?

btw, just in case somebody finds the above call trace interesting (i would if it was my code ;-)
---
000000000000c7e5 <ext3_clear_inode>:
c7e5: 41 54 push %r12
c7e7: 55 push %rbp
c7e8: 48 8d af 38 ff ff ff lea 0xffffffffffffff38(%rdi),%rbp
c7ef: 53 push %rbx
c7f0: 48 89 fb mov %rdi,%rbx
c7f3: 4c 8b 67 90 mov 0xffffffffffffff90(%rdi),%r12
c7f7: 48 8b 7f b8 mov 0xffffffffffffffb8(%rdi),%rdi
c7fb: 48 85 ff test %rdi,%rdi
c7fe: 74 1d je c81d <ext3_clear_inode+0x38>
c800: 48 83 ff ff cmp $0xffffffffffffffff,%rdi
c804: 74 17 je c81d <ext3_clear_inode+0x38>
c806: f0 ff 0f lock decl (%rdi)
c809: 0f 94 c0 sete %al
c80c: 84 c0 test %al,%al
c80e: 74 05 je c815 <ext3_clear_inode+0x30>
c810: e8 00 00 00 00 callq c815 <ext3_clear_inode+0x30>
c815: 48 c7 43 b8 ff ff ff movq $0xffffffffffffffff,0xffffffffffffffb8(%rbx)
c81c: ff
c81d: 48 8b bd 88 00 00 00 mov 0x88(%rbp),%rdi
c824: 48 85 ff test %rdi,%rdi
c827: 74 20 je c849 <ext3_clear_inode+0x64>
c829: 48 83 ff ff cmp $0xffffffffffffffff,%rdi
c82d: 74 1a je c849 <ext3_clear_inode+0x64>
c82f: f0 ff 0f lock decl (%rdi)
c832: 0f 94 c0 sete %al
c835: 84 c0 test %al,%al
c837: 74 05 je c83e <ext3_clear_inode+0x59>
c839: e8 00 00 00 00 callq c83e <ext3_clear_inode+0x59>
c83e: 48 c7 85 88 00 00 00 movq $0xffffffffffffffff,0x88(%rbp)
c845: ff ff ff ff
c849: 48 89 df mov %rbx,%rdi
c84c: e8 00 00 00 00 callq c851 <ext3_clear_inode+0x6c>
c851: 4d 85 e4 test %r12,%r12
c854: 48 c7 45 58 00 00 00 movq $0x0,0x58(%rbp)
c85b: 00
c85c: 74 0c je c86a <ext3_clear_inode+0x85>
c85e: 5b pop %rbx
c85f: 5d pop %rbp
c860: 4c 89 e7 mov %r12,%rdi
c863: 41 5c pop %r12
c865: e9 00 00 00 00 jmpq c86a <ext3_clear_inode+0x85>
c86a: 5b pop %rbx
c86b: 5d pop %rbp
c86c: 41 5c pop %r12
c86e: c3 retq
---
the oops: [<ffffffff8810782f>] :ext3:ext3_clear_inode+0x4a/0x8a
that is: c82f lock decl (%rdi)
this looks like x86_64's atomic_dec_and_test, which i guess must be used by a spinlock. above code shows two of such spinlocks, one @0xffffffffffffffb8(%rdi) and another @0x88(%rbp).
however i've found only one spinlock on ext3_clear_inode code path, assuming that compiler has decided to inline ext3_discard_reservation:
void ext3_discard_reservation(struct inode *inode)
{
	struct ext3_inode_info *ei = EXT3_I(inode);
	struct ext3_block_alloc_info *block_i = ei->i_block_alloc_info;
	struct ext3_reserve_window_node *rsv;
	spinlock_t *rsv_lock = &EXT3_SB(inode->i_sb)->s_rsv_window_lock;

	if (!block_i)
		return;

	rsv = &block_i->rsv_window_node;
	if (!rsv_is_empty(&rsv->rsv_window)) {
		spin_lock(rsv_lock);
		if (!rsv_is_empty(&rsv->rsv_window))
			rsv_window_remove(inode->i_sb, rsv);
		spin_unlock(rsv_lock);
	}
}
it is interesting to note the inode itself looks sane, otherwise we would never be able to call inode->i_sb->s_op->clear_inode().
but somehow s_rsv_window_lock looks broken.
whatever.
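
For anyone following along, the step from the oops line to the listing above is plain offset arithmetic: the `+0x4a` from the oops is added to the symbol's address in the objdump listing, and the same offset subtracted from the faulting RIP gives the address the function was loaded at. A small sketch in C; the helper names are made up for illustration and are not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only, not kernel code. */

/* Address of the faulting instruction in an objdump listing:
 * symbol start in the listing plus the "+0x.." offset from the oops. */
uint64_t listing_address(uint64_t symbol_start, uint64_t offset)
{
    return symbol_start + offset;
}

/* Address the function was loaded at: faulting RIP minus the same offset. */
uint64_t runtime_symbol_start(uint64_t rip, uint64_t offset)
{
    assert(rip >= offset); /* an offset larger than RIP would be bogus input */
    return rip - offset;
}
```

With the numbers from this thread, `listing_address(0xc7e5, 0x4a)` gives `0xc82f`, the second `lock decl (%rdi)` in the listing, and `runtime_symbol_start(0xffffffff8810782f, 0x4a)` gives `0xffffffff881077e5`, which is also the value sitting in RAX in the register dump.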
Thanks for the disassembly. The Oops is not in ext3_discard_reservation(). It is in posix_acl_release(EXT3_I(inode)->i_default_acl). It seems i_default_acl should be -1 (0xffffffffffffffff) but it was 0xffffc3ffffffffff. So I really suspect some memory corruption. The usual suspects for such stuff are nvidia drivers, so unless you are able to reproduce the problem without them loaded, I'm afraid we can't help you much.

Thanks for looking into it. I thought ACL was disabled, so sorry for the misleading analysis. I will keep trying to reproduce. At least my backup is up-to-date ;-)

I'm cleaning up my bugzilla a bit :). I'll close this one as INVALID because of NVidia drivers. If you're able to reproduce the problem without NVidia drivers loaded, please reopen the bug again...

Sorry Jan, there's some misalignment here. The machine that trashed its filesystem doesn't have NVidia hardware (but a Matrox G550), and hadn't been running proprietary drivers when it crashed, and it's been rock solid with Ubuntu 7.04 (ext3fs BTW...) and FreeBSD 6.2-RELEASE since I reinstalled it. Just because Miguel (Comment #7) has nvidia hardware doesn't imply I do :-) Reopening bug.

It seems your and Miguel's problems are unrelated. I know you don't have nvidia drivers so your report is still valid (only that we don't know how to either reproduce or fix it). BTW: I'm not aware of any other corruption reports for ext3 in OpenSUSE. But you're right that INVALID is not a proper resolution. Would you like WORKSFORME more?

Hmm, I don't have any other reports of ext3 corruption (neither with 10.2 nor with 10.3). Since I don't think this is debuggable without a way to reproduce I'll close this one. Sorry, Matthias and thanks for the report anyway.

I'm trying to copy my HD to another one and my computer hung.
Linux pitanga 2.6.18.8-0.7-default #1 SMP Tue Oct 2 17:21:08 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux

Here are the last lines of the kernel messages:

(fs/jbd/recovery.c, 255): journal_recover: JBD: recovery, exit status 0, recovered transactions 2 to 15
(fs/jbd/recovery.c, 257): journal_recover: JBD: Replayed 3863 and revoked 0/0 blocks
kjournald starting. Commit interval 5 seconds
EXT3 FS on sdb1, internal journal
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
Unable to handle kernel paging request at 0000000000020028 RIP:
[<ffffffff880feabf>] :ext3:ext3_discard_reservation+0x28/0x69
PGD 45ea2067 PUD 45d0c067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /class/net/lo/address
CPU 1
Modules linked in: vmnet vmmon snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device button battery ac loop dm_mod ftdi_sio usbserial snd_hda_intel nvidia snd_hda_codec snd_pcm snd_timer snd soundcore snd_page_alloc i2c_core shpchp ehci_hcd ide_cd cdrom uhci_hcd pci_hotplug 8139too mii usbcore atl1 floppy parport_pc lp parport ext3 mbcache jbd edd fan via82cxxx sg sata_via libata thermal processor sd_mod scsi_mod ide_disk ide_core
Pid: 194, comm: kswapd0 Tainted: P U 2.6.18.8-0.7-default #1
RIP: 0010:[<ffffffff880feabf>] [<ffffffff880feabf>] :ext3:ext3_discard_reservation+0x28/0x69
RSP: 0018:ffff810037f6dd00 EFLAGS: 00010206
RAX: ffff81007cc48c00 RBX: 0000000000020000 RCX: 0000000000000000
RDX: ffff81004a449740 RSI: ffff810037f6dd90 RDI: ffff81004a63e110
RBP: ffff81004a63e110 R08: 0000000000000053 R09: 000000000007df36
R10: 0000000000000060 R11: ffffffff881097d2 R12: 0000000000020020
R13: ffff81007a450000 R14: 0000000000000080 R15: 0000000000000180
FS: 00002b04ab8ce260(0000) GS:ffff810037fc3bc0(0000) knlGS:00000000f7aab6d0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000020028 CR3: 00000000463a0000 CR4: 00000000000006e0
Process kswapd0 (pid: 194, threadinfo ffff810037f6c000, task ffff810037cc7040)
Stack: ffff810049161668 ffff81004a63e110 ffff81004a63e048 0000000000020000
ffff810037f6dd90 ffffffff88109851 ffff81004a63e110 ffff81004a63e110
0000000000000054 ffffffff80220d38 ffff81004a63e120 ffffffff80232b4c
Call Trace:
[<ffffffff88109851>] :ext3:ext3_clear_inode+0x6c/0x8a
[<ffffffff80220d38>] clear_inode+0xd2/0x103
[<ffffffff80232b4c>] dispose_list+0x56/0xf6
[<ffffffff8022ba94>] shrink_icache_memory+0x1d4/0x203
[<ffffffff8023d3eb>] shrink_slab+0xe2/0x15a
[<ffffffff80253840>] kswapd+0x35b/0x454
[<ffffffff802928a7>] autoremove_wake_function+0x0/0x2e
[<ffffffff802926e4>] keventd_create_kthread+0x0/0x61
[<ffffffff802534e5>] kswapd+0x0/0x454
[<ffffffff802926e4>] keventd_create_kthread+0x0/0x61
[<ffffffff80230839>] kthread+0xec/0x120
[<ffffffff80258ee0>] child_rip+0xa/0x12
[<ffffffff802926e4>] keventd_create_kthread+0x0/0x61
[<ffffffff8023074d>] kthread+0x0/0x120
[<ffffffff80258ed6>] child_rip+0x0/0x12
Code: 49 83 7c 24 08 00 74 2e 49 8d bd 00 41 00 00 e8 5a fa 15 f8
RIP [<ffffffff880feabf>] :ext3:ext3_discard_reservation+0x28/0x69
RSP <ffff810037f6dd00>
CR2: 0000000000020028

My current problem is that i can't copy data from this HD to another. I tried it twice today, no luck. I have a few other ideas to try:
- copying without binary nvidia
- copying in a different kernel
- copying in a different computer
- dumping metadata?

any suggestion?

This is a single bit error - pointer i_block_alloc_info has been set to 0x20000 instead of being NULL. So two things to try:
1) Boot without the nvidia driver loaded, try whether it fails as well.
2) Check your memory with memtest as this could be also buggy hardware.

Jan, i noticed that nvidia driver was not being used (xorg is configured with "nv"), so i just unloaded it and retried. Apparently it worked: cp -a completed without further ext3 problems. Now i'll check files for differences (they might indicate the buggy memory as you suggested). I will leave memtest for the night because i have work to finish here but i'll keep you informed. btw, thanks for the attention!

Just finished comparing the files, no corruption found.

Jan, memtest over the weekend revealed a memory error. Guess what: Err-bits = 00020000.
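
The match between the bad pointer and the memtest result can be checked by XOR-ing the value the field should have held with the value that was actually found. A minimal sketch in C; the helper names are hypothetical and only meant to show the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* Bits that differ between the expected and the observed value. */
uint64_t flipped_bits(uint64_t expected, uint64_t observed)
{
    return expected ^ observed;
}

/* Number of set bits (Kernighan's loop, no compiler builtins needed). */
int count_bits(uint64_t v)
{
    int n = 0;
    while (v != 0) {
        v &= v - 1; /* clears the lowest set bit */
        n++;
    }
    return n;
}

/* Index of the lowest set bit; the caller must pass a non-zero value. */
int lowest_bit(uint64_t v)
{
    int i = 0;
    assert(v != 0);
    while ((v & 1) == 0) {
        v >>= 1;
        i++;
    }
    return i;
}
```

For the second oops, `flipped_bits(0, 0x20000)` is `0x20000`: a single bit, bit 17, the same pattern as memtest's Err-bits = 00020000. For the first oops, `flipped_bits(0xffffffffffffffff, 0xffffc3ffffffffff)` is `0x00003c0000000000`, which is four adjacent bits rather than one.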