Bug 246959

Summary: ext3 self-destruct on openSUSE 10.2 (kernel-default-2.6.18.2-34)
Product: [openSUSE] openSUSE 10.2
Reporter: Matthias Andree <matthias.andree>
Component: Kernel
Assignee: Jan Kara <jack>
Status: RESOLVED INVALID
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
CC: matthias.andree, mfreitas
Version: Final
Target Milestone: ---
Hardware: i686
OS: Linux
Whiteboard:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Matthias Andree 2007-02-20 15:23:16 UTC
I've just had an i686 uniprocessor system (Athlon XP 1800+) effectively wipe itself out last Sunday.

Software: kernel-default-2.6.18.2-34. e2fsprogs-1.39-21.

Disk layout:

/dev/hda1  vfat    (unmounted and empty)
/dev/hda3  ufs     (unmounted, FreeBSD 6.2-PRERELEASE)
/dev/hda5  ext3 as /
/dev/hda6  ext3 as /home
/dev/hda7  xfs  as /musik

There was a preceding issue where the system lost a few files three weeks ago:
the hard disk drive (Seagate Barracuda ATA IV) remapped 5 sectors from /dev/hda5 and lost a few files. This was quickly repaired with e2fsck -fy /dev/hda5, rpm -Va and reinstalling the damaged packages.

The current issue, however, was violent. The system remounted the file system R/O after finding a bitmap mismatch (logging below); several commands could no longer be found on the system after that event and the system was unbootable, so I rebooted into the DVD-based (original openSUSE 10.2 box DVD) rescue system and ran e2fsck there. e2fsck -pf /dev/hda came up with lots of inconsistencies, so I ran e2fsck -fy /dev/hda5 (e2fsprogs-1.39-21), which then relocated nearly 107,000 (one hundred and seven thousand!) files to /lost+found on /dev/hda5. /boot/grub and several /etc files and executables were missing, rendering the system unbootable, so I moved the rest into OLD/ and installed Ubuntu 6.10, since I don't trust the openSUSE 10.2 kernel for the nonce, until it's clear what caused this.

I don't think it was e2fsck though, since before the reboot, some commands and /boot/grub had already gone missing, so I suspect kernel bugs here that systematically trashed the system.

/dev/hda6 (according to e2fsck -pf) and /dev/hda7 (according to xfs_check) are undamaged.

These are, in a sense, the final words the system uttered over the network to the loghost; I haven't found any other suspicious messages after the bootup at 21:14 that day.

Feb 18 21:29:59 rho su: (to beagleindex) root on none
Feb 18 21:30:51 rho su: (to beagleindex) root on none
Feb 18 21:33:49 rho syslogd: /var/log/messages: Read-only file system
Feb 18 21:33:49 rho syslogd: /var/log/warn: Read-only file system
Feb 18 21:33:49 rho syslogd: /var/log/warn: Read-only file system
Feb 18 21:33:49 rho kernel: EXT3-fs error (device hda5): ext3_free_inode: bit already cleared for inode 1522610
Feb 18 21:33:49 rho kernel: Aborting journal on device hda5.
Feb 18 21:33:49 rho kernel: ext3_abort called.
Feb 18 21:33:49 rho kernel: EXT3-fs error (device hda5): ext3_journal_start_sb: Detected aborted journal
Feb 18 21:33:49 rho kernel: Remounting filesystem read-only
Feb 18 21:33:49 rho kernel: EXT3-fs error (device hda5) in ext3_delete_inode: IO failure
Feb 18 21:33:49 rho kernel: __journal_remove_journal_head: freeing b_committed_data
Feb 18 21:33:49 rho kernel: __journal_remove_journal_head: freeing b_committed_data
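
For anyone collecting similar reports from a loghost, the EXT3-fs error lines above can be pulled out mechanically. This is only a minimal sketch: the regular expression is inferred from the two message shapes visible in this log ("...): function: message" and "...) in function: message"), and the helper name is made up for illustration.

```python
import re

# Inferred from the two message shapes in the log above (an assumption,
# not an exhaustive description of what the kernel can print):
#   EXT3-fs error (device hda5): ext3_free_inode: bit already cleared ...
#   EXT3-fs error (device hda5) in ext3_delete_inode: IO failure
EXT3_ERROR = re.compile(
    r"EXT3-fs error \(device (?P<device>\w+)\)(?::| in) "
    r"(?P<function>\w+): (?P<message>.*)"
)


def parse_ext3_error(line):
    """Return (device, function, message) for an EXT3-fs error line, else None."""
    match = EXT3_ERROR.search(line)
    if match is None:
        return None
    return match.group("device"), match.group("function"), match.group("message")


sample_lines = [
    "Feb 18 21:33:49 rho kernel: EXT3-fs error (device hda5): "
    "ext3_free_inode: bit already cleared for inode 1522610",
    "Feb 18 21:33:49 rho kernel: Aborting journal on device hda5.",
    "Feb 18 21:33:49 rho kernel: EXT3-fs error (device hda5) in "
    "ext3_delete_inode: IO failure",
]

for line in sample_lines:
    parsed = parse_ext3_error(line)
    if parsed is not None:
        print(parsed)
```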
Comment 2 Jan Kara 2007-02-21 09:51:34 UTC
Thanks for the report but it's hard to say anything here. I guess you don't have a filesystem image (metadata would be enough) before you ran e2fsck, do you? Obviously, something corrupted your filesystem and it seems the corruption was rather heavy. Given your disk had to remap a few sectors before, I would not trust it completely. Unless you have any more information, this is impossible to debug, sorry. So do you have the corrupted fs image or something like that?
Comment 3 Matthias Andree 2007-02-21 10:22:53 UTC
I'm afraid I don't have a metadata image. My fault, I didn't think of that and I did not expect corruptions as bad as I've seen, since I have never had such massive data losses with ext2 or ext3 in 10 years.

WRT the disk drive, it passes S.M.A.R.T. self tests and has not reallocated more sectors since the original event (7 in total according to smartctl -a) or logged any I/O errors in the current situation. Even if it had, it should not have cost more than a few directories, but:

$ sudo find /lost+found/ -type d  | wc -l
Password:
20366

That's 20366 directories in lost+found, and they're from all over the map, inode numbers from 18,000 to 2,000,000, with a bit more than 2 million available inodes total.

I'd suggest keeping this report around for a few weeks, just to see if any further similar reports come in -- or if this was a one-time event. The related bits I found are http://www.ussg.iu.edu/hypermail/linux/kernel/0511.0/0193.html
Comment 4 Jan Kara 2007-02-21 10:55:22 UTC
Yes, I'll definitely keep your report in mind. I've actually collected several reports of ext3 corruption in vanilla kernels starting with a bit in a bitmap already cleared (usually it was a block bitmap though). But none of the reports reported a significant filesystem corruption - that differs from your case. So I agree we definitely have a bug somewhere (probably in vanilla kernels); it's just really hard to track it down... So I'll add your report to my collection and close the bug for now.
Comment 5 Miguel Freitas 2007-07-20 18:18:16 UTC
i think i have hit the same bug: my opensuse 10.2 ext3 partition trashed itself two times this week and i have absolutely no indications of hardware failure.

summary:

- system: AMD Athlon(tm) 64 X2 Dual Core Processor 3800+, HD ST3160811AS (sata), VT8237A SATA 2-Port Controller

- kernel: 2.6.18.8-0.3-default #1 SMP Tue Apr 17 08:42:35 UTC 2007 x86_64

- smartctl reports zero Reallocated_Sector_Ct, Extended offline test: Completed without error (the test was performed after the first corruption).

sequence of operations:

1) sunday morning: system was stuck on session lock. it had been up for 7 days and i could not unlock it (reported failure to erase some file or something). ssh didn't work either. reboot. normal work day. system backup performed to a remote tar.gz.

2) tuesday: by the end of the day system was strange and i noticed the failure messages on dmesg (Aborting journal on device sda1, remounting read-only...). there was not a single message from block subsystem (no read or write problems from the sda device, for example).

4) tuesday night and wednesday: e2fsck reported thousands of errors. zillion files and directories appeared on lost+found. spent the rest of the day recovering my system from backup.

5) thursday: another ext3 panic. i could still work on it (readonly) but eventually running "e2fsck -y" caused it to freeze.

6) friday (today): no badblocks found with e2fsck -c. e2fsck -y caused a new zillion files to be moved to lost+found. restored from backup all over again. e2fsck -f again just to make sure everything is consistent.

so here we are.

i'm writing this from a fully consistent ext3 filesystem but i have no reasons to believe the problem will go away.

i'd like to ask what exactly you want me to do next time it happens.

how do i get the metadata?
should i dump it before the corruption occurs again?
Comment 6 Miguel Freitas 2007-07-25 18:26:09 UTC
I remembered something that might be related to the problem: the day my computer crashed I tried reading a broken DVD. Actually it is not really broken - a different computer can read it. but here i had messages like this:

end_request: I/O error, dev hdc, sector 5107488
Buffer I/O error on device hdc, logical block 1276872

and so on.

note this is a plain IDE/PATA drive. my HD is SATA.

in order to eject the dvd (it was completely stuck) i had to force "resetting" the drive, by means of `hdparm -w`.

I don't know how/if a device reset on a different controller and drive can cause trouble to the EXT3 fs, but it is the only special thing i remember so it might provide some hint.

I still have the dvd. perhaps i should try it again next week...
Comment 7 Miguel Freitas 2007-08-07 14:58:38 UTC
My computer hung today. No apparent ext3 self-destruction, BUT:

Aug  6 23:25:18 pitanga kernel: Unable to handle kernel paging request at ffffc3ffffffffff RIP:
Aug  6 23:25:18 pitanga kernel:  [<ffffffff8810782f>] :ext3:ext3_clear_inode+0x4a/0x8a
Aug  6 23:25:18 pitanga kernel: PGD 0
Aug  6 23:25:18 pitanga kernel: Oops: 0002 [1] SMP
Aug  6 23:25:18 pitanga kernel: last sysfs file: /block/hdc/size
Aug  6 23:25:18 pitanga kernel: CPU 0
Aug  6 23:25:18 pitanga kernel: Modules linked in: vmnet vmmon snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device button battery ac apparmor aamatch_pcre loop dm_mod ftdi_sio usbserial snd_hda_intel snd_hda_codec snd_pcm snd_timer snd soundcore snd_page_alloc nvidia i2c_core ide_cd shpchp cdrom atl1 ehci_hcd uhci_hcd 8139too mii pci_hotplug floppy usbcore parport_pc lp parport ext3 mbcache jbd edd fan sg via82cxxx sata_via libata thermal processor sd_mod scsi_mod ide_disk ide_core
Aug  6 23:25:18 pitanga kernel: Pid: 194, comm: kswapd0 Tainted: P     U 2.6.18.8-0.3-default #1
Aug  6 23:25:18 pitanga kernel: RIP: 0010:[<ffffffff8810782f>]  [<ffffffff8810782f>] :ext3:ext3_clear_inode+0x4a/0x8a
Aug  6 23:25:18 pitanga kernel: RSP: 0018:ffff810037cedd30  EFLAGS: 00010287
Aug  6 23:25:18 pitanga kernel: RAX: ffffffff881077e5 RBX: ffff81007360dd50 RCX: 0000000000000000
Aug  6 23:25:18 pitanga kernel: RDX: ffff81007360da50 RSI: ffff810037cedd90 RDI: ffffc3ffffffffff
Aug  6 23:25:18 pitanga kernel: RBP: ffff81007360dc88 R08: 000000000000003c R09: 000000000007df36
Aug  6 23:25:18 pitanga kernel: R10: 0000000000000020 R11: ffffffff881077d2 R12: 0000000000000000
Aug  6 23:25:18 pitanga kernel: R13: ffff810037cedd90 R14: 0000000000000080 R15: 0000000000000080
Aug  6 23:25:18 pitanga kernel: FS:  00002ab21859cc60(0000) GS:ffffffff80520000(0000) knlGS:00000000f70d56d0
Aug  6 23:25:18 pitanga kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Aug  6 23:25:18 pitanga kernel: CR2: ffffc3ffffffffff CR3: 000000005c832000 CR4: 00000000000006e0
Aug  6 23:25:18 pitanga kernel: Process kswapd0 (pid: 194, threadinfo ffff810037cec000, task ffff810037ccc7f0)
Aug  6 23:25:18 pitanga kernel: Stack:  ffff81007360dd50 ffff81007360dd50 000000000000003d ffffffff80220d37
Aug  6 23:25:18 pitanga kernel:  ffff81007360dd60 ffffffff80232b4b 0000000000000000 ffff810032c08110
Aug  6 23:25:18 pitanga kernel:  0000000000000000 0000000000000080 0000000000000080 ffffffff8022ba93
Aug  6 23:25:18 pitanga kernel: Call Trace:
Aug  6 23:25:18 pitanga kernel:  [<ffffffff80220d37>] clear_inode+0xd2/0x103
Aug  6 23:25:18 pitanga kernel:  [<ffffffff80232b4b>] dispose_list+0x56/0xf6
Aug  6 23:25:18 pitanga kernel:  [<ffffffff8022ba93>] shrink_icache_memory+0x1d4/0x203
Aug  6 23:25:18 pitanga kernel:  [<ffffffff8023d3ea>] shrink_slab+0xe2/0x15a
Aug  6 23:25:18 pitanga kernel:  [<ffffffff8025383e>] kswapd+0x35b/0x454
Aug  6 23:25:18 pitanga kernel:  [<ffffffff802928a8>] autoremove_wake_function+0x0/0x2e
Aug  6 23:25:18 pitanga kernel:  [<ffffffff802926e5>] keventd_create_kthread+0x0/0x61
Aug  6 23:25:18 pitanga kernel:  [<ffffffff802534e3>] kswapd+0x0/0x454
Aug  6 23:25:18 pitanga kernel:  [<ffffffff802926e5>] keventd_create_kthread+0x0/0x61
Aug  6 23:25:18 pitanga kernel:  [<ffffffff80230838>] kthread+0xec/0x120
Aug  6 23:25:18 pitanga kernel:  [<ffffffff80258ee0>] child_rip+0xa/0x12
Aug  6 23:25:18 pitanga kernel:  [<ffffffff802926e5>] keventd_create_kthread+0x0/0x61
Aug  6 23:25:18 pitanga kernel:  [<ffffffff8023074c>] kthread+0x0/0x120
Aug  6 23:25:18 pitanga kernel:  [<ffffffff80258ed6>] child_rip+0x0/0x12
Aug  6 23:25:18 pitanga kernel:
Aug  6 23:25:18 pitanga kernel:
Aug  6 23:25:18 pitanga kernel:
Aug  6 23:25:18 pitanga kernel: Code: f0 ff 0f 0f 94 c0 84 c0 74 05 e8 0d 2f 10 f8 48 c7 85 88 00
Aug  6 23:25:18 pitanga kernel: RIP  [<ffffffff8810782f>] :ext3:ext3_clear_inode+0x4a/0x8a
Aug  6 23:25:18 pitanga kernel:  RSP <ffff810037cedd30>
Aug  6 23:25:18 pitanga kernel: CR2: ffffc3ffffffffff


anybody listening to this?
Comment 8 Jan Kara 2007-08-14 13:32:51 UTC
Thanks for the report. I was on holiday for a few weeks so I was not able to reply earlier. Sorry for that.

Answers to your questions: You don't need to dump metadata before corruption.
If the corruption happens again, please use e2image to dump filesystem metadata.
I suggest running: e2image - | bzip2 >corrupted-disk.bz2

The DVD seems to be unconnected to your problem. At least I don't see how it could cause the problem (but if you can verify the bad DVD doesn't trigger the problem it would be fine - it could still be some driver corrupting memory). That oops is more likely to be connected - it seems like a part of a stack being overwritten. It seems you use nvidia driver - are you able to reproduce the problem without this driver loaded?
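
A note for later readers: the e2image command line above appears to have lost its device argument. The sketch below shows one way the metadata dump could be scripted, assuming the raw-image form documented in the e2image man page (`e2image -r <device> -`, which writes to standard output). The device path and both helper names are hypothetical.

```python
import bz2
import subprocess


def e2image_command(device):
    """Build the e2image argv that writes a raw metadata image to stdout."""
    # -r selects a raw image; the trailing "-" sends it to standard output.
    return ["e2image", "-r", device, "-"]


def dump_metadata(device, out_path, chunk_size=1 << 20):
    """Stream e2image output through bzip2 compression into out_path."""
    proc = subprocess.Popen(e2image_command(device), stdout=subprocess.PIPE)
    with bz2.open(out_path, "wb") as out:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
    proc.stdout.close()
    if proc.wait() != 0:
        raise RuntimeError("e2image exited with status %d" % proc.returncode)


if __name__ == "__main__":
    # Hypothetical device name; only the command line is printed here.
    # To take a real dump (as root, before running e2fsck), call e.g.
    # dump_metadata("/dev/sda1", "corrupted-disk.bz2").
    print(" ".join(e2image_command("/dev/sda1")))
```

The important part is the ordering: take the image before e2fsck modifies anything, since the pre-repair metadata is what is needed for debugging.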
Comment 9 Miguel Freitas 2007-09-17 19:17:55 UTC
i've just tried the DVD again but it now seems so badly broken i cannot even mount it... i got several "Buffer I/O error on device hdc" (which is the dvdrom, not related to my ext3 HD)

let's see if it does cause any stability problem to my system like last time. my current uptime is 28 days - i'd like to avoid rebooting, changing drivers etc unless absolutely necessary.

do you want me to try the 'hdparm -w /dev/hdc' too? as i told you, i did it once to be able to eject the disc.

what about the oops? shall we look into ext3_clear_inode function for anything suspicious? any bound which is not checked or something?
Comment 10 Miguel Freitas 2007-09-17 23:27:15 UTC
btw, just in case somebody finds the above call trace interesting (i would if it was my code ;-)

---

000000000000c7e5 <ext3_clear_inode>:
    c7e5:       41 54                   push   %r12
    c7e7:       55                      push   %rbp
    c7e8:       48 8d af 38 ff ff ff    lea    0xffffffffffffff38(%rdi),%rbp
    c7ef:       53                      push   %rbx
    c7f0:       48 89 fb                mov    %rdi,%rbx
    c7f3:       4c 8b 67 90             mov    0xffffffffffffff90(%rdi),%r12
    c7f7:       48 8b 7f b8             mov    0xffffffffffffffb8(%rdi),%rdi
    c7fb:       48 85 ff                test   %rdi,%rdi
    c7fe:       74 1d                   je     c81d <ext3_clear_inode+0x38>
    c800:       48 83 ff ff             cmp    $0xffffffffffffffff,%rdi
    c804:       74 17                   je     c81d <ext3_clear_inode+0x38>
    c806:       f0 ff 0f                lock decl (%rdi)
    c809:       0f 94 c0                sete   %al
    c80c:       84 c0                   test   %al,%al
    c80e:       74 05                   je     c815 <ext3_clear_inode+0x30>
    c810:       e8 00 00 00 00          callq  c815 <ext3_clear_inode+0x30>
    c815:       48 c7 43 b8 ff ff ff    movq   $0xffffffffffffffff,0xffffffffffffffb8(%rbx)
    c81c:       ff
    c81d:       48 8b bd 88 00 00 00    mov    0x88(%rbp),%rdi
    c824:       48 85 ff                test   %rdi,%rdi
    c827:       74 20                   je     c849 <ext3_clear_inode+0x64>
    c829:       48 83 ff ff             cmp    $0xffffffffffffffff,%rdi
    c82d:       74 1a                   je     c849 <ext3_clear_inode+0x64>
    c82f:       f0 ff 0f                lock decl (%rdi)
    c832:       0f 94 c0                sete   %al
    c835:       84 c0                   test   %al,%al
    c837:       74 05                   je     c83e <ext3_clear_inode+0x59>
    c839:       e8 00 00 00 00          callq  c83e <ext3_clear_inode+0x59>
    c83e:       48 c7 85 88 00 00 00    movq   $0xffffffffffffffff,0x88(%rbp)
    c845:       ff ff ff ff
    c849:       48 89 df                mov    %rbx,%rdi
    c84c:       e8 00 00 00 00          callq  c851 <ext3_clear_inode+0x6c>
    c851:       4d 85 e4                test   %r12,%r12
    c854:       48 c7 45 58 00 00 00    movq   $0x0,0x58(%rbp)
    c85b:       00
    c85c:       74 0c                   je     c86a <ext3_clear_inode+0x85>
    c85e:       5b                      pop    %rbx
    c85f:       5d                      pop    %rbp
    c860:       4c 89 e7                mov    %r12,%rdi
    c863:       41 5c                   pop    %r12
    c865:       e9 00 00 00 00          jmpq   c86a <ext3_clear_inode+0x85>
    c86a:       5b                      pop    %rbx
    c86b:       5d                      pop    %rbp
    c86c:       41 5c                   pop    %r12
    c86e:       c3                      retq

---

the oops: [<ffffffff8810782f>] :ext3:ext3_clear_inode+0x4a/0x8a

that is: c82f   lock decl (%rdi)
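
The mapping from the oops line to the listing is plain address arithmetic. A small sketch, using only numbers copied from the oops in comment 7 and the disassembly above (the variable names are just for illustration):

```python
# Values copied from the oops and the objdump listing above.
FUNC_OBJ_START = 0xc7e5        # <ext3_clear_inode> in the disassembly
OOPS_OFFSET = 0x4a             # from "ext3_clear_inode+0x4a/0x8a"
FUNC_SIZE = 0x8a               # from "ext3_clear_inode+0x4a/0x8a"
OOPS_RIP = 0xffffffff8810782f  # faulting address in the running kernel

# Offset of the faulting instruction inside the object file listing.
fault_in_listing = FUNC_OBJ_START + OOPS_OFFSET
# First address past the end of the function in the listing.
func_end = FUNC_OBJ_START + FUNC_SIZE
# Where the function starts in the running kernel.
func_runtime_start = OOPS_RIP - OOPS_OFFSET

print(hex(fault_in_listing))    # 0xc82f
print(hex(func_end))            # 0xc86f
print(hex(func_runtime_start))  # 0xffffffff881077e5
```

The computed end, 0xc86f, is one byte past the final retq at c86e, and the runtime start equals the RAX value in the register dump, so the listing and the oops do seem to describe the same build of the function.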

this looks like x86_64's atomic_dec_and_test, which i guess must be used by a spinlock. above code shows two of such spinlocks, one @0xffffffffffffffb8(%rdi) and another @0x88(%rbp).

however i've found only one spinlock on ext3_clear_inode code path, assuming that compiler has decided to inline ext3_discard_reservation:


void ext3_discard_reservation(struct inode *inode)
{
        struct ext3_inode_info *ei = EXT3_I(inode);
        struct ext3_block_alloc_info *block_i = ei->i_block_alloc_info;
        struct ext3_reserve_window_node *rsv;
        spinlock_t *rsv_lock = &EXT3_SB(inode->i_sb)->s_rsv_window_lock;

        if (!block_i)
                return;

        rsv = &block_i->rsv_window_node;
        if (!rsv_is_empty(&rsv->rsv_window)) {
                spin_lock(rsv_lock);
                if (!rsv_is_empty(&rsv->rsv_window))
                        rsv_window_remove(inode->i_sb, rsv);
                spin_unlock(rsv_lock);
        }
}

it is interesting to note the inode itself looks sane, otherwise we would never be able to call inode->i_sb->s_op->clear_inode().

but somehow s_rsv_window_lock looks broken.

whatever.
Comment 11 Jan Kara 2007-09-19 09:24:18 UTC
Thanks for the disassembly. The Oops is not in ext3_discard_reservation(). It is in posix_acl_release(EXT3_I(inode)->i_default_acl). It seems i_default_acl should be -1 (0xffffffffffffffff) but it was (0xffffc3ffffffffff). So I really suspect some memory corruption. Usual suspect for such stuff are nvidia drivers so unless you are able to reproduce the problem without them loaded, I'm afraid we can't help you much.
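
To illustrate why a damaged sentinel leads straight to the oops: the disassembly skips the decrement only when the pointer is zero or all-ones, so any other value is dereferenced. The following is a rough Python model of those two guards, not kernel code; the constant and function names are invented for the sketch.

```python
MASK64 = (1 << 64) - 1
NULL = 0
NOT_CACHED = MASK64  # the -1 sentinel, as a 64-bit pattern (hypothetical name)


def would_dereference(acl_ptr):
    """Mirror the two guards in the listing: skip NULL and the -1 sentinel."""
    return acl_ptr != NULL and acl_ptr != NOT_CACHED


def differing_bits(a, b):
    """Bit positions (0 = least significant) where two 64-bit values differ."""
    diff = (a ^ b) & MASK64
    return [i for i in range(64) if (diff >> i) & 1]


corrupted = 0xffffc3ffffffffff  # RDI and CR2 in the oops from comment 7

print(would_dereference(NOT_CACHED))          # False: sentinel is skipped
print(would_dereference(corrupted))           # True: "lock decl (%rdi)" runs
print(differing_bits(NOT_CACHED, corrupted))  # [42, 43, 44, 45]
```

The corrupted value differs from the sentinel in four adjacent bits, which fits the description of a stored value being damaged rather than a legitimate pointer having been assigned.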
Comment 12 Miguel Freitas 2007-09-19 10:42:56 UTC
Thanks for looking into it. I thought ACL was disabled, so sorry for the misleading analysis.

I will keep trying to reproduce. At least my backup is up-to-date ;-)
Comment 13 Jan Kara 2007-10-01 16:32:11 UTC
I'm cleaning up my bugzilla a bit :). I'll close this one as INVALID because of NVidia drivers. If you're able to reproduce the problem without NVidia drivers loaded, please reopen the bug again...
Comment 14 Matthias Andree 2007-10-01 16:49:24 UTC
Sorry Jan, there's some misalignment here.

The machine that trashed its filesystem doesn't have NVidia hardware (but a Matrox G550), and hadn't been running proprietary drivers when it crashed, and it's been rock solid with Ubuntu 7.04 (ext3fs BTW...) and FreeBSD 6.2-RELEASE since I reinstalled it.

Just because Miguel (Comment #7) has nvidia hardware doesn't imply I do :-)

Reopening bug.
Comment 15 Jan Kara 2007-10-01 17:09:06 UTC
It seems your and Miguel's problems are unrelated. I know you don't have nvidia drivers so your report is still valid (only that we don't know how to either reproduce or fix it). BTW: I'm not aware of any other corruption reports for ext3 in OpenSUSE. But you're right that INVALID is not a proper resolution. Would you like WORKSFORME more?
Comment 16 Jan Kara 2007-10-24 13:24:21 UTC
Hmm, I don't have any other reports of ext3 corruption (neither with 10.2 nor with 10.3). Since I don't think this is debuggable without a way to reproduce, I'll close this one. Sorry, Matthias, and thanks for the report anyway.
Comment 17 Miguel Freitas 2008-02-07 13:31:18 UTC
I'm trying to copy my HD to another one and my computer hung.

Linux pitanga 2.6.18.8-0.7-default #1 SMP Tue Oct 2 17:21:08 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux

Here are the last lines of the kernel messages:

(fs/jbd/recovery.c, 255): journal_recover: JBD: recovery, exit status 0, recovered transactions 2 to 15
(fs/jbd/recovery.c, 257): journal_recover: JBD: Replayed 3863 and revoked 0/0 blocks
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sdb1, internal journal
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
Unable to handle kernel paging request at 0000000000020028 RIP:
 [<ffffffff880feabf>] :ext3:ext3_discard_reservation+0x28/0x69
PGD 45ea2067 PUD 45d0c067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /class/net/lo/address
CPU 1
Modules linked in: vmnet vmmon snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device button battery ac loop dm_mod ftdi_sio usbserial snd_hda_intel nvidia snd_hda_codec snd_pcm snd_timer snd soundcore snd_page_alloc i2c_core shpchp ehci_hcd ide_cd cdrom uhci_hcd pci_hotplug 8139too mii usbcore atl1 floppy parport_pc lp parport ext3 mbcache jbd edd fan via82cxxx sg sata_via libata thermal processor sd_mod scsi_mod ide_disk ide_core
Pid: 194, comm: kswapd0 Tainted: P     U 2.6.18.8-0.7-default #1
RIP: 0010:[<ffffffff880feabf>]  [<ffffffff880feabf>] :ext3:ext3_discard_reservation+0x28/0x69
RSP: 0018:ffff810037f6dd00  EFLAGS: 00010206
RAX: ffff81007cc48c00 RBX: 0000000000020000 RCX: 0000000000000000
RDX: ffff81004a449740 RSI: ffff810037f6dd90 RDI: ffff81004a63e110
RBP: ffff81004a63e110 R08: 0000000000000053 R09: 000000000007df36
R10: 0000000000000060 R11: ffffffff881097d2 R12: 0000000000020020
R13: ffff81007a450000 R14: 0000000000000080 R15: 0000000000000180
FS:  00002b04ab8ce260(0000) GS:ffff810037fc3bc0(0000) knlGS:00000000f7aab6d0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000020028 CR3: 00000000463a0000 CR4: 00000000000006e0
Process kswapd0 (pid: 194, threadinfo ffff810037f6c000, task ffff810037cc7040)
Stack:  ffff810049161668 ffff81004a63e110 ffff81004a63e048 0000000000020000
 ffff810037f6dd90 ffffffff88109851 ffff81004a63e110 ffff81004a63e110
 0000000000000054 ffffffff80220d38 ffff81004a63e120 ffffffff80232b4c
Call Trace:
 [<ffffffff88109851>] :ext3:ext3_clear_inode+0x6c/0x8a
 [<ffffffff80220d38>] clear_inode+0xd2/0x103
 [<ffffffff80232b4c>] dispose_list+0x56/0xf6
 [<ffffffff8022ba94>] shrink_icache_memory+0x1d4/0x203
 [<ffffffff8023d3eb>] shrink_slab+0xe2/0x15a
 [<ffffffff80253840>] kswapd+0x35b/0x454
 [<ffffffff802928a7>] autoremove_wake_function+0x0/0x2e
 [<ffffffff802926e4>] keventd_create_kthread+0x0/0x61
 [<ffffffff802534e5>] kswapd+0x0/0x454
 [<ffffffff802926e4>] keventd_create_kthread+0x0/0x61
 [<ffffffff80230839>] kthread+0xec/0x120
 [<ffffffff80258ee0>] child_rip+0xa/0x12
 [<ffffffff802926e4>] keventd_create_kthread+0x0/0x61
 [<ffffffff8023074d>] kthread+0x0/0x120
 [<ffffffff80258ed6>] child_rip+0x0/0x12


Code: 49 83 7c 24 08 00 74 2e 49 8d bd 00 41 00 00 e8 5a fa 15 f8
RIP  [<ffffffff880feabf>] :ext3:ext3_discard_reservation+0x28/0x69
 RSP <ffff810037f6dd00>
CR2: 0000000000020028

Comment 18 Miguel Freitas 2008-02-07 13:37:32 UTC
My current problem is that i can't copy data from this HD to another. I tried it twice today, no luck.

I have a few other ideas to try:

- copying without binary nvidia
- copying in a different kernel
- copying in a different computer
- dumping metadata?

any suggestion?
Comment 19 Jan Kara 2008-02-07 16:00:19 UTC
This is a single bit error - pointer i_block_alloc_info has been set to 0x20000 instead of being NULL. So two things to try:
 1) Boot without the nvidia driver loaded, try whether it fails as well.
 2) Check your memory with memtest as this could be also buggy hardware.
Comment 20 Miguel Freitas 2008-02-07 16:24:36 UTC
Jan, i noticed that nvidia driver was not being used (xorg is configured with "nv"), so i just unloaded it and retried. Apparently it worked: cp -a completed without further ext3 problems.

Now i'll check files for differences (they might indicate the buggy memory as you suggested). I will leave memtest for the night because i have work to finish here but i'll keep you informed.

btw, thanks for the attention!
Comment 21 Miguel Freitas 2008-02-07 21:20:00 UTC
Just finished comparing the files, no corruption found.
Comment 22 Miguel Freitas 2008-02-11 11:09:10 UTC
Jan, memtest over the weekend revealed a memory error.

Guess what: Err-bits = 00020000.
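
The match can be checked with a few lines of arithmetic. This sketch uses only values quoted in this thread (the bad pointer from comment 19, CR2 from the oops in comment 17, and the memtest Err-bits above); the helper name is made up.

```python
def single_bit_position(value):
    """Return the index of the only set bit, or None if not exactly one bit is set."""
    if value > 0 and (value & (value - 1)) == 0:
        return value.bit_length() - 1
    return None


bad_pointer = 0x20000          # i_block_alloc_info per comment 19 (RBX in the oops)
expected_pointer = 0x0         # NULL, what it should have been
cr2 = 0x20028                  # faulting address from the oops in comment 17
memtest_err_bits = 0x00020000  # Err-bits reported by memtest

flipped = bad_pointer ^ expected_pointer

print(single_bit_position(flipped))  # 17
print(flipped == memtest_err_bits)   # True
print(hex(cr2 - bad_pointer))        # 0x28
```

So the pointer differs from NULL in bit 17 only, the same bit memtest flags, and the faulting address sits 0x28 bytes past that bogus pointer, consistent with a field access through it.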