Bugzilla – Bug 389656
Machine hard hang using reiserfs
Last modified: 2018-07-03 19:39:42 UTC
Fresh install from Factory today on an IBM Thinkpad T43P. If packagekitd is running, it will hard hang the computer. Keyboard and mouse do not respond, and I can't switch to a tty. The only way out is to press and hold the power button. Just prior to the hang, I get a notification from the applet saying that it couldn't establish trust with the repo. Attaching pk_backend_zypp. Note: if I rename packagekitd to prevent it from running, everything works fine.
There is nothing strange in /var/log/messages and I switched to tty10 to see if anything showed up there prior to it hanging but nothing.
The exact message from gpk-update-icon is:
A security trust relationsh is not present
Repo signature verification failed
[ Do not show this notification again ]
Note: "relationsh" as written above is what it says. It looks like the notification window is not big enough to fit the text.
Created attachment 214600 [details] strace packagekitd Attaching to packagekitd or gpk-update-icon shows nothing prior to the hang. When hooking up strace to gpk-update-icon and packagekitd, it does not hang. Attaching the output anyway.
Created attachment 214601 [details] strace pkg-update-icon
Is this repeatable? I'm finding it hard to imagine that PackageKit could hard hang the computer, usually this only occurs with kernel or X issues.
I got a hang this morning, but I've had the new PacK since yesterday so I'm not sure it's related (nothing in /var/log/messages either).
Created attachment 214932 [details] pk_backend_zypp This happens within minutes of logging in. And it happens every time if packagekitd is not renamed.
Ok, now I got the hang even after renaming packagekitd. Assigning to kernel
Changing summary. Also, I still have another install on the same machine (different kernel) that does not hang
Magnus, do the usual Sysrq work? Do you get anything out of it?
I tried; Sysrq, Alt+Sysrq etc but seems that I don't know the right combination :-/ What is the right combination and is it enabled by default or do I have to manually enable it?
I also tried the following but it doesn't reboot; # Hold down the Alt and SysRq (Print Screen) keys. # While holding those down, type the following in order. Nothing will appear to happen until the last letter is pressed: REISUB
Booted with sysrq=1 but it still doesn't do anything
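On the question of whether SysRq is enabled by default: on Linux the current setting is exposed in /proc/sys/kernel/sysrq, where 0 means disabled, 1 means all functions are enabled, and larger values are a bitmask of permitted functions. A minimal sketch for checking it follows; the helper names are made up for illustration, and a machine that is hung hard enough may ignore the keyboard regardless of this setting.

```python
def describe_sysrq(value):
    """Interpret the integer found in /proc/sys/kernel/sysrq."""
    if value == 0:
        return "disabled"
    if value == 1:
        return "all functions enabled"
    # Any other value is a bitmask of the permitted SysRq functions.
    return "restricted to bitmask 0x%x" % value


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current setting (Linux only)."""
    with open(path) as handle:
        return int(handle.read().strip())


if __name__ == "__main__":
    print(describe_sysrq(read_sysrq()))
```

If this reports "disabled", writing 1 to the same file as root before reproducing the hang is the usual way to turn it on.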
This seems to be related to reiserfs. I installed Beta3 fresh from the DVD with all defaults. No hangs, lockups or anything. I then reinstalled from the same source with ReiserFS and machine hung after a couple of minutes.
But you seem to be the only one with this problem? I'm lowering the priority, as reiserfs is no longer our default and we have no idea what the problem is at the moment.
I can reproduce this issue by installing a fresh machine in VirtualBox with the GNOME pattern. So as long as you select ReiserFS instead of ext3, the machine will hard hang after you login. If I move beagle out of /etc/xdg/autostart I don't get the hang. If I then start beagle manually, it hangs.
I can't reproduce this with a fresh install on reiserfs either.
*** Bug 396166 has been marked as a duplicate of this bug. ***
I could not initially reproduce off the DVD, however I was able to after installing GNOME and then launching firefox, when it started to index something specific (indexing had run before but not locked anything up). Something to do with user_xattr's (beagle makes heavy use of those)?
(In reply to comment #19 from JP Rosevear)
> installing GNOME and then launching firefox, when it started to index something
> specific (indexing had run before but not locked anything up). Something to do
> with user_xattr's (beagle makes heavy use of those)?
Indexing Firefox history does not use xattrs (AFAIR). But it could be related to xattrs, since reiserfs has them turned on by default. It could also be some other bug coincidentally uncovered by packagekit or beagle. A good test would be to add "BEAGLE_DISABLE_XATTR=1" in the /usr/bin/beagled script (scroll to the end near the exec lines) and test if beagle+xattr is causing it.
I added the line here:
//--
PROCESS_NAME="beagled"
BEAGLE_DISABLE_XATTR=1
if [ $fg -eq 1 ]; then
    exec -a $PROCESS_NAME $CMDLINE
    exit 1
else
    exec -a $PROCESS_NAME $CMDLINE &
fi
//--
I started beagled manually and all was working fine, but to be sure, I rebooted. After login, the process "beagled /usr/lib/beagle/BeagleDaemon.exe --replace --bg" started and the system hung again.
I had no problems reproducing this issue (using jpr's trick of browsing the internet with FF for a while) on one physical (64bit) and one virtual machine (VirtualBox, 32bit) using latest Factory.
mboman@linux-2ztj:~> rpm -qa|grep -i reiser
reiserfs-3.6.19-132
libreiserfs-0.3.0.5-116
mboman@linux-2ztj:~> uname -a
Linux linux-2ztj 2.6.25.4-10-default #1 SMP 2008-05-28 16:25:04 +0200 x86_64 x86_64 x86_64 GNU/Linux
It does not hang if I change fstab from "acl,user_xattr" to "noatime,noacl".
Changing fstab seems to work. I tried a search and got "Search service not running", though I see beagled running. If I click Start search service, it opens more beagled processes but the "not running" message persists.
[Bug # 394329] might be a duplicate of this one
*** Bug 394329 has been marked as a duplicate of this bug. ***
I added noatime, noacl to my fstab, but the system still hangs.
/dev/sda7 / reiserfs defaults,noatime,noacl 1 1
I am not sure what these options are doing. Do they turn xattr off? I removed most parts of beagle, except libbeagle. But the hang is still there.
# rpm -qa | grep reiser
reiserfs-3.6.19-131
libreiserfs-0.3.0.5-115
# rpm -qa | grep beagle
libbeagle1-0.3.5.1-3
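Since several comments compare fstab option strings, here is a small sketch that pulls the ACL/xattr-related options out of an fstab line. The function name is hypothetical, and the sketch only reports what the line says: whether "noacl" alone also switches off user xattrs on reiserfs is not established in this thread, so a missing "user_xattr"/"nouser_xattr" entry is left as "not stated" instead of being guessed.

```python
# Mount options that control ACLs and user extended attributes.
XATTR_OPTIONS = {"acl", "noacl", "user_xattr", "nouser_xattr"}


def xattr_related_options(fstab_line):
    """Return the ACL/xattr options explicitly present in one fstab line."""
    fields = fstab_line.split()
    if len(fields) < 4:
        # Not a complete fstab entry (device, mount point, type, options).
        return set()
    return set(fields[3].split(",")) & XATTR_OPTIONS


if __name__ == "__main__":
    for line in ("/dev/sda7 / reiserfs defaults,noatime,noacl 1 1",
                 "/dev/sda7 / reiserfs acl,user_xattr 1 1"):
        print(line, "->", sorted(xattr_related_options(line)))
```

An empty result means the line relies on the filesystem defaults for these features.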
Jochen, Try to remove "defaults" from that fstab line to make it identical to what works here and for Sergio. Also, the Firefox beagle plugin might still work with only libbeagle installed (have not looked in to it) so perhaps try to disable that addon in FF?
Removed defaults from fstab and disabled beagle plugin. But the hang is still there.
Can you boot using the -debug flavor? If there aren't any messages on tty10, then there might be a mutex deadlock occurring. I thought I had hunted all those down in the reiserfs xattr code. The -debug flavor enables lockdep, which should dump messages on the console if it encounters a deadlock.
*** Bug 398113 has been marked as a duplicate of this bug. ***
Hi Jeff, do you mean this with -debug flavor:
$ dmesg
[..]
Kernel command line: root=/dev/disk/by-id/scsi-SATA_HDS722516VLAT20_VNR4GEC4G5KSYK-part7 resume=/dev/sda6 splash=silent vga=0x317 -debug
[..]
How do I get the console output when the system has deadlocked?
*** Bug 400449 has been marked as a duplicate of this bug. ***
Created attachment 222236 [details] sysrq output
Just a quick note that my system hangs in fsync as it lock_kernel.
Pid: 3485, comm: beagled Tainted: G N (2.6.25.5-1.1-debug #1)
EIP: 0060:[<f8c92c3f>] EFLAGS: 00200293 CPU: 0
EIP is at __discard_prealloc+0x4/0xa5 [reiserfs]
EAX: f5c61f34 EBX: f5c61f34 ECX: f711d524 EDX: f711d500
ESI: f8c18000 EDI: f8c28148 EBP: f5c61e64 ESP: f5c61e60
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: b60b7ad0 CR3: 35c4a000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<f8c92d0d>] reiserfs_discard_all_prealloc+0x2d/0x3c [reiserfs]
[<f8caecbf>] do_journal_end+0x405/0xc16 [reiserfs]
[<f8caf52b>] journal_end_sync+0x5b/0x63 [reiserfs]
[<f8cafc4c>] reiserfs_commit_for_inode+0x101/0x183 [reiserfs]
[<f8c9d6dd>] reiserfs_sync_file+0x36/0x74 [reiserfs]
[<c0193028>] do_fsync+0x4a/0x79
[<c0193076>] __do_fsync+0x1f/0x2f
[<c01930a5>] sys_fsync+0xd/0xf
[<c0105906>] sysenter_past_esp+0x5f/0x89
[<ffffe430>] 0xffffe430
=======================
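For anyone comparing traces across the duplicates, a rough sketch for reducing a pasted call trace to (module, function) pairs may help. The regular expression is an assumption based only on the two trace styles that appear in this report (a "[reiserfs]" suffix on i386 and a ":reiserfs:" prefix on x86_64), not a general oops parser, and the names are invented for illustration.

```python
import re

# One stack frame: "[<address>] symbol+0xoff/0xlen", with the module given
# either as a ":module:" prefix or as a trailing "[module]".
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+"
    r"(?::(?P<mod_prefix>\w+):)?"
    r"(?P<symbol>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod_suffix>\w+)\])?"
)


def frames(trace):
    """Return (module, function) pairs in the order they appear."""
    result = []
    for match in FRAME.finditer(trace):
        module = match.group("mod_prefix") or match.group("mod_suffix") or "kernel"
        result.append((module, match.group("symbol")))
    return result


if __name__ == "__main__":
    sample = ("[<f8c9d6dd>] reiserfs_sync_file+0x36/0x74 [reiserfs] "
              "[<c0193028>] do_fsync+0x4a/0x79 [<ffffe430>] 0xffffe430")
    for module, symbol in frames(sample):
        print(module, symbol)
```

Frames without a symbol and offset, such as the final bare address, are skipped.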
Ok, I'm able to reproduce this pretty easily using postmark to generate the fsync load. It appears to only occur when ACLs/xattrs are enabled.
(In reply to comment #35 from Jeff Mahoney) > Ok, I'm able to reproduce this pretty easily using postmark to generate the > fsync load. It appears to only occur when ACLs/xattrs are enabled. Sure, but they are enabled by default.
Yeah, I know. I was just making a note so that I can limit the search scope.
My fstab looks like this:
joe@joe:~> cat /etc/fstab
/dev/sda7 / reiserfs noatime,noacl 1 1
/dev/sda5
joe@joe:~> mount
/dev/sda7 on / type reiserfs (rw,noatime,noacl)
proc on /var/lib/ntp/proc type proc (ro)
Afaik this disables acls, but I still have those hangs.
I inserted some printk calls and saw these messages:
===========================
init_list_head: f5077a9c
list_add before:
f8c47118<-f8c47118->f8c47118
f8c47118<-f8c47118->f8c47118
f8c47118<-f8c47118->f8c47118
list_add after:
f5077a9c<-f8c47118->f5077a9c
f8c47118<-f5077a9c->f8c47118
f5077a9c<-f8c47118->f5077a9c
init_list_head: f5077a9c
[.......]
list_add before:
f5077a9c<-f8c47118->f5077a9c
f5077a9c<-f5077a9c->f5077a9c
f5077a9c<-f5077a9c->f5077a9c
list_add after:
f5077a9c<-f8c47118->f507ea9c
f8c47118<-f507ea9c->f5077a9c
f507ea9c<-f5077a9c->f5077a9c
__discard_prealloc:
f8c47118<-f507ea9c->f5077a9c
f507ea9c<-f5077a9c->f5077a9c
f507ea9c<-f5077a9c->f5077a9c
===========================
It seems that some inodes were deleted before discard_prealloc. And the issue _seems_ to be fixed by the following patch:
==================
--- inode.c.orig 2008-06-17 16:34:42.000000000 +0800
+++ inode.c 2008-06-17 16:35:19.000000000 +0800
@@ -43,6 +43,10 @@
 	if (journal_begin(&th, inode->i_sb, jbegin_count))
 		goto out;
+
+	if (REISERFS_I(inode)->i_prealloc_count > 0)
+		reiserfs_discard_prealloc(&th, inode);
+
 	reiserfs_update_inode_transaction(inode);
 	err = reiserfs_delete_object(&th, inode);
=================
Just FYI.
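One reading of the printk output above is that an inode's list node was re-initialised while it was still linked into the prealloc list, after which draining the list can never finish. The following is a simplified user-space model of that reading, not the kernel code: the Node class and helper functions mimic the usual doubly linked list operations, the drain loop and its step limit are assumptions added so the demonstration terminates, and the scenario() function is purely illustrative.

```python
class Node:
    """Minimal stand-in for a kernel list node; a new node points at itself."""

    def __init__(self, name):
        self.name = name
        self.next = self
        self.prev = self


def init_list_head(node):
    node.next = node
    node.prev = node


def list_add(new, head):
    """Insert 'new' directly after 'head'."""
    nxt = head.next
    nxt.prev = new
    new.next = nxt
    new.prev = head
    head.next = new


def list_del_init(entry):
    """Unlink 'entry' from its neighbours and re-initialise it."""
    entry.prev.next = entry.next
    entry.next.prev = entry.prev
    init_list_head(entry)


def drain(head, max_steps=1000):
    """Remove entries until the list is empty; give up after max_steps."""
    steps = 0
    while head.next is not head:
        if steps >= max_steps:
            return False  # the list never empties: the modelled hang
        list_del_init(head.next)
        steps += 1
    return True


def scenario(discard_before_reuse):
    head = Node("prealloc list head")
    a = Node("inode A")
    list_add(a, head)
    if discard_before_reuse:
        list_del_init(a)   # what the proposed patch arranges, in this model
    init_list_head(a)      # inode deleted and its node initialised again
    b = Node("inode B")
    list_add(b, head)
    return drain(head)


if __name__ == "__main__":
    print("without discard:", scenario(False))
    print("with discard:   ", scenario(True))
```

In the unpatched scenario the node for inode A ends up pointing at itself while the head still points at it, which matches the last three lines of the printk output; unlinking it before reuse leaves a list that drains normally.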
Good catch. Thanks! I wonder if comment #38 might be something different. I eliminated the use of struct file in the xattr code, so reiserfs_file_release never gets called. Committing the fix now.
BTW, I think this also explains the earlier bug report regarding the prealloc list.
This patch is in the latest KOTD. Can you verify the problem has been solved?
*** Bug 401627 has been marked as a duplicate of this bug. ***
This solution worked fine for my INSPIRON 1501:
1. Disable the search options (beagle, index, etc.)
2. Edit /etc/fstab for the reiserfs partition, changing
>> acl,user_xattr
to
>> noatime,noacl
Thanxs :)
tried kernel: 2.6.25.7-SL110_BRANCH_20080617184901-pae and 2.6.25.7-SL110_BRANCH_20080618144016-default The system still hangs.
2.6.25.7-SL110_BRANCH_20080619122426-pae works fine for me.
Hey, my laptop still hangs :( It now takes a while to hang, but it does :( This is /var/log/messages:
Jun 19 15:29:33 tux gconfd (alex-3091): starting (version 2.22.0), pid 3091 user 'alex'
Jun 19 15:29:33 tux gconfd (alex-3091): Resolved address "xml:readonly:/etc/gconf/gconf.xml.mandatory" to a read-only configuration source at position 0
Jun 19 15:29:33 tux gconfd (alex-3091): Resolved address "xml:readwrite:/home/alex/.gconf" to a writable configuration source at position 1
Jun 19 15:29:33 tux gconfd (alex-3091): Resolved address "xml:readonly:/etc/gconf/gconf.xml.defaults" to a read-only configuration source at position 2
Jun 19 15:29:33 tux gconfd (alex-3091): Resolved address "xml:readonly:/etc/gconf/gconf.xml.vendor" to a read-only configuration source at position 3
Jun 19 15:29:33 tux gconfd (alex-3091): Resolved address "xml:readonly:/etc/gconf/gconf.xml.schemas" to a read-only configuration source at position 4
Jun 19 15:29:36 tux pulseaudio[3182]: core-util.c: setpriority(): Permiso denegado
Jun 19 15:29:36 tux pulseaudio[3182]: pid.c: Stale PID file, overwriting.
Jun 19 15:29:36 tux pulseaudio[3182]: main.c: setrlimit(RLIMIT_NICE, (31, 31)) failed: Operación no permitida
Jun 19 15:29:36 tux pulseaudio[3182]: main.c: setrlimit(RLIMIT_RTPRIO, (9, 9)) failed: Operación no permitida
Jun 19 15:29:36 tux pulseaudio[3183]: core-util.c: setpriority(): Permiso denegado
Jun 19 15:29:36 tux pulseaudio[3183]: pid.c: Daemon already running.
Jun 19 15:29:36 tux pulseaudio[3183]: main.c: pa_pid_file_create() failed.
Jun 19 15:29:36 tux pulseaudio[3185]: core-util.c: setpriority(): Permiso denegado
Jun 19 15:29:36 tux pulseaudio[3185]: pid.c: Daemon already running.
Jun 19 15:29:36 tux pulseaudio[3185]: main.c: pa_pid_file_create() failed.
Jun 19 15:29:36 tux pulseaudio[3187]: core-util.c: setpriority(): Permiso denegado
Jun 19 15:29:36 tux pulseaudio[3187]: pid.c: Daemon already running.
Jun 19 15:29:36 tux pulseaudio[3187]: main.c: pa_pid_file_create() failed.
Jun 19 15:29:36 tux pulseaudio[3189]: core-util.c: setpriority(): Permiso denegado
Jun 19 15:29:36 tux pulseaudio[3189]: pid.c: Daemon already running.
Jun 19 15:29:36 tux pulseaudio[3189]: main.c: pa_pid_file_create() failed.
Jun 19 15:29:36 tux pulseaudio[3191]: core-util.c: setpriority(): Permiso denegado
Jun 19 15:29:36 tux pulseaudio[3191]: pid.c: Daemon already running.
Jun 19 15:29:36 tux pulseaudio[3191]: main.c: pa_pid_file_create() failed.
Jun 19 15:29:36 tux pulseaudio[3193]: core-util.c: setpriority(): Permiso denegado
Jun 19 15:29:36 tux pulseaudio[3193]: pid.c: Daemon already running.
Jun 19 15:29:36 tux pulseaudio[3193]: main.c: pa_pid_file_create() failed.
Jun 19 15:29:36 tux kernel: hda-intel: Invalid position buffer, using LPIB read method instead.
Jun 19 15:29:41 tux gconfd (alex-3091): Resolved address "xml:readwrite:/home/alex/.gconf" to a writable configuration source at position 0
Jun 19 15:29:53 tux kernel: ISO 9660 Extensions: Microsoft Joliet Level 3
Jun 19 15:29:53 tux kernel: ISO 9660 Extensions: RRIP_1991A
Jun 19 15:29:53 tux hald: mounted /dev/sr0 on behalf of uid 1000
Jun 19 15:29:53 tux gnome-keyring-daemon[3093]: adding removable location: volume_label_SU1100_001 at /media/SU1100.001
Jun 19 15:29:54 tux su: (to alex) root on none
Jun 19 15:29:54 tux kernel: CPU0 attaching NULL sched-domain.
Jun 19 15:29:54 tux kernel: CPU1 attaching NULL sched-domain.
Jun 19 15:29:54 tux kernel: CPU0 attaching sched-domain:
Jun 19 15:29:54 tux kernel: domain 0: span 00000000,00000000,00000000,00000003
Jun 19 15:29:54 tux kernel: groups: 00000000,00000000,00000000,00000001 00000000,00000000,00000000,00000002
Jun 19 15:29:54 tux kernel: CPU1 attaching sched-domain:
Jun 19 15:29:54 tux kernel: domain 0: span 00000000,00000000,00000000,00000003
Jun 19 15:29:54 tux kernel: groups: 00000000,00000000,00000000,00000002 00000000,00000000,00000000,00000001
Jun 19 15:30:03 tux kernel: process `skype' is using obsolete setsockopt SO_BSDCOMPAT
Jun 19 15:30:14 tux su: (to root) alex on /dev/pts/1
Alex, Have you tried the kernel from http://ftp.suse.com/pub/projects/kernel/kotd/SL110_BRANCH/i386/ ? (In reply to comment #47 from Alex Rodriguez) > hey ,,,,my laptop still hangs up :( > now takes a while to hangs up ... but it does :( >
I think this bug should be altered from version factory to released. This exact bug just wasted a day of my life. I have installed, reinstalled, and reseated all hardware in my PC because openSUSE11_GM froze on me at random intervals, but always within 5 minutes. I did a default package selection on REISERFS, and running beagle / firefox / etc. always hung the system. Only now that I by accident reinstalled on ext3 could I keep my PC running long enough to find this bug. IMHO 11GM should not have been released with this issue, or a warning should have been posted not to install on REISERFS. stefan
Updating information accordingly. In the meantime added to Most Annoying Bugs.
Hi guys, it's not clear to me what is triggering this bug; I have two PCs with reiserfs and no hang at all here. So what do I have to do to prevent the hang?
(In reply to comment #51 from Daniele Tombolini) > Hi guys, it's no clear for me what is triggering this bug, two pc with reiserfs > and not hang at all here.. > So what I have to do to prevent hang ? From the earlier reports, any program accessing the reiserfs extended attributes was causing the hang. You can use "getfattr/setfattr" to confirm this behaviour. Beagle by default uses extended attributes to store some book-keeping information, so a lot of people experienced this problem with beagle enabled. You can force beagle to not use xattr by setting the environment variable "BEAGLE_DISABLE_XATTR=1".
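As a scripted counterpart to the getfattr/setfattr check mentioned above, here is a minimal sketch that writes and reads back one user.* attribute on a scratch file. The function name and the attribute name "user.probe" are made up for illustration; os.setxattr and os.getxattr are only available on Linux. Note that on an affected kernel, exercising xattrs this way might itself provoke the hang, so it is best run on a machine you can afford to power-cycle.

```python
import os
import tempfile


def probe_user_xattr(directory):
    """Return True if a user.* xattr round-trips on a scratch file in
    'directory', False if the filesystem refuses it, and None if this
    platform has no xattr API."""
    if not hasattr(os, "setxattr"):
        return None
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        os.setxattr(path, "user.probe", b"1")
        return os.getxattr(path, "user.probe") == b"1"
    except OSError:
        # Typically "operation not supported" when user xattrs are off.
        return False
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(probe_user_xattr(tempfile.gettempdir()))
```

Pointing it at a directory on the reiserfs partition in question (for example the home directory) shows whether user xattrs are actually usable there with the current mount options.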
Same problem here with opensuse-11.0 just upgraded from 10.3 using DVD. I also use reiserfs :-/
So why is this bug still in state NEEDINFO? What else is needed that can be provided? From testing myself and reading back, it is about REISERFS freezing on get/setattr calls, mainly noticeable with beagled, right? stefan
(In reply to comment #54 from stefan van ruiten)
> so why is this bug still in state NEEDINFO? What else is needed that can be
> provided? From testing myself and reading back it is about REISERFS here
> freezing with get/setattr calls mainly noticable beagled, right?
>
> stefan
>
On my machine xattr is disabled (noatime, noacl), but it still hangs. Do I have a different bug?
It was waiting on a response for the KOTD. The reply was received, but the commenter didn't lift the NEEDINFO flag.
*subscribes* I want to install openSUSE 11.0 on at least 2 - 4 boxes so I need to know when this is fixed (add to CC) thanks
I'm currently able to reproduce in qemu. I'm not convinced that it's actually a reiserfs problem, but is just made very obvious by it.

1) A previous comment mentions beta 3 didn't exhibit this problem, and no reiserfs changes outside of one introduced as part of this bug report (or a related one) were made after that date.

2) The stack dump that I'm looking at shows everything waiting on the big kernel lock (BKL). The two reiserfs traces that would hold the BKL are in io_schedule(), and the BKL is special in that it is automatically dropped and reacquired across schedule(). One is waiting for I/O to complete (with the lock dropped), and the other has just completed I/O and is waiting to reacquire it.

What I suspect may be happening is that an unrelated change is missing an unlock_kernel() call. ext3 doesn't use the BKL except during fs mount and during ioctls. ReiserFS uses it *everywhere* as a global lock so it would be much more susceptible to this issue.

I was hoping to catch something like this with the -debug kernel request I made earlier but I had forgotten that lockdep isn't enabled in our debug kernels. It wreaks havoc on large systems, limiting its usefulness in debugging. I've built a kernel with lockdep to see if I can get any more information that way.

The level of uncertainty is that I'm syncing against our KOTD. Unfortunately, I haven't been able to reproduce since then. Does the KOTD end up avoiding this crash for other users? I don't mean to handwave, but prior to building this kernel, I *could* reproduce.
Hi Jeff, i can test your KOTD. Which should i download? Please provide a download link. Any other settings i should use? (noatime, noacl, etc...)
I've added a lockdep flavor to our kernel CVS, so it will be published at: http://ftp.suse.com/pub/projects/kernel/kotd/SL110_BRANCH/x86_64/kernel-lockdep.rpm and http://ftp.suse.com/pub/projects/kernel/kotd/SL110_BRANCH/i386/kernel-lockdep.rpm ... but that doesn't sync out until later today. In the meantime, I've published the RPMs at: http://ftp.suse.com/pub/people/jeffm/suse/testpkgs/389656/ There's only x86_64 there for now, but once the i386 build completes, I'll add that as well.
Well, I need an i386 build. Short question about lockdep usage: as far as I understand, it prevents BAD locking and logs the bad usage somewhere. So which logs do you need to track the bug?
Ok, I've posted the i386 build. It will dump the logs to the console, so you'll need a way to capture them (like a serial line to another machine). It's possible it will detect the possibility of the deadlock before the problem actually occurs. In that case, it will be logged to the console and the syslog.
Had two hangs while connected with a serial line. Unfortunately there was nothing special in the logs I could capture. Is there something wrong in my setup? This is the kernel command line I used:
Kernel command line: root=/dev/disk/by-id/scsi-SATA_HDS722516VLAT20_VNR4GEC4G5KSYK-part7 resume=/dev/sda6 splash=silent vga=0x317 console=tty0 console=ttyS0,9600n8
On the other machine I used putty to connect to the console and logged in as root.
five days without a hang up :)
I have another one: My pc locks up randomly when it's under load. The same pc, the same load, but with openSUSE 10.3 running does not lock up. Everything is formatted as ext3 (and mounted with acl,user_xattr) and xfs.
o.O The hang occurs on ext3 file system ???
It would appear one of my systems, the one using Reiserfs, is afflicted with the same malady. From my posting on the opensuse mailing list:

A brief tale of two systems.

System 1: Barebones "server" install running basic OpenSuse with SSLExplorer and nothing else. System uses EXT3. Installed initially with OpenSuSE 10.0, and upgraded successfully with zero issues to 10.1, 10.2, 10.3, and now 11.0. System runs great. System is a VMWare-Server guest running under VMWare Server on a host running Windows 2003. Machine CPU is Intel Pentium III.

System 2: Full Gnome desktop install. Uses ReiserFS. Initially installed with 9.3 (I think!). Successfully upgraded to 10.0, 10.1, 10.2, 10.3 with zero issues. System runs by itself (it is not a guest nor does it host any guest OSes) on an AMD Athlon XP 3000+. Recently upgraded to 11.0, which brought on the onset of the problem.

The problem manifests itself with random system hard-freezes (no keyboard response etc ... have to power-cycle). Originally it appeared that the system would freeze when running Gnome as a non-root user (about 4 freezes within 30 mins of coming up). Then that seemed to go away after removing powersaved, but that was a fluke. System stayed up for about two or so days and then froze up again, requiring a hard start. On startup, system ran ok for about 30 or so minutes until put under load (DVD recode), when it would freeze up again. Rebooted system to command line (had to run fsck a few times to recover from the hard starts). Logged in as root, put system under load (rpmbuild -ba --clean xorg-....server). System froze up again within 2 minutes.

At this time I am at a complete loss as to where the issue lies - do I blame ReiserFS and convert (backup/restore) the system? Do I downgrade (backup/restore) to 10.3?
It sure does. I have been using this specific pc without a hardware change for about 2 years now. It simply never "just locked up". After installing openSUSE 11.0 (dual boot with 10.3) it did that twice within a day. The first time I was transcoding some video file and the second time I was compiling mythtv. Both are things I do regularly in openSUSE 10.3. There is absolutely nothing in /var/log/messages that would indicate the source of the problem. Of course it might still be hardware related, it just seems quite unlikely. The filesystems get mounted like this:
/dev/disk/by-id/scsi-SATA_Maxtor_6Y080M0_Y2KGSTNE-part3 / ext3 acl,user_xattr 1 1
/dev/disk/by-id/scsi-SATA_WDC_WD2000JD-00WD-WMAEH2179009-part2 swap swap defaults 0 0
/dev/disk/by-id/scsi-SATA_Maxtor_6Y080M0_Y2KGSTNE-part1 /backup ext3 acl,user_xattr 1 2
/dev/system/multimedia /hyper xfs defaults 1 2
/dev/disk/by-id/scsi-SATA_Maxtor_6Y080M0_Y2KGSTNE-part4 /suse103 ext3 acl,user_xattr 1 2
I have openSUSE 11.0 installed with all the standard packages, including (lib)beagle.
Finally recovered machine (it had to be hard started when it froze up during rpmbuild -ba xorg...server...). ReiserFS / was corrupted, had to run fsck. fsck showed clean, put kernel 2.6.25.7-SL110_BRANCH_20080624060341 on it. As I type this (on the machine itself), it is running fine. Gnome is working, the same rpmbuild is running. / is mounted the usual way (with xattr etc) - I wanted to give this kernel a test run so I did not try to disable xattrs. I will keep this bug updated with the results of the test.

In the meantime, after the last hard crash during the rpmbuild, I had to run fsck. fsck reports clean now (ran it one more time just to be sure), but I am stuck with /usr/src/packages/BUILD/xxx directories that I cannot delete. Attempts to delete those directory trees give the following error:
cannot remove `xorg-server-1.4.2/GL': Input/output error
Attempts to do ls -il on a directory to get the inode result in the same error. However, I was able to mv the above-mentioned BUILD to BUILD.000. So now it looks like the / reiserfs has corruption caused by the last hard crash that is not recoverable with fsck.reiserfs. *MAJOR SIGH* Looks like I am going to have to take the backup/restore route after all, and go with ext3 this time.
kernel 2.6.25.7-SL110_BRANCH_20080624060341 is still running stable on the machine. I was able to delete those "undeletable" directories.
Luiz, unless you're actually planning on fixing this bug, please don't assign it away from me.
(In reply to comment #68 from Andreas S) please disregard the comment about hard locking on an ext3 filesystem. After I removed the fglrx video driver my system is as stable as it used to be. I guess I have to look into filing a bug report with ati :(
Ok, that's good feedback from both Andreas and Mobeen. How many other reporters are also running the fglrx driver? I see people keep CC'ing into this bug, so I know the effects are wide - but how many get fixed when trying the KOTD kernel? It doesn't even have to be the -lockdep flavor, though it would be more helpful if it were and logs were provided. I know it may appear as if this bug is stagnant, but I'm trying hard to reproduce it locally. I haven't been able to since the last post saying so, and that makes it difficult to proceed. I've audited the reiserfs code for BKL usage and haven't run into any hiccups. Everywhere the lock is taken, it's released properly.
I've been getting the lockup consistently on my laptop. As with many folk, turning beagle off (just 'beagle-shutdown' first thing after logging in, I haven't uninstalled it) makes it vastly more stable. I'm using intel embedded-graphics on it, so I'm not running the fglrx driver. I'm perfectly willing to run a KOTD kernel on it for trouble-shooting. I'll do that tonight.
I do not use any third-party video drivers, just whatever comes with the kernel. I have not experienced a lockup since I started running with KOTD 2.6.25.7-SL110_BRANCH_20080624060341. I can still boot back to the 2.6.25.5-1.1 kernel, put the filesystem under load (just from the console, no GUI), and experience the freeze.
What workload are you using? You might be hitting the prealloc issue that was fixed Jun 17, but hasn't been released in an update. This bug needs to be confirmed fixed before we release an update, though.
I've seen random hangs, sometimes of the system, more often though my SeaMonkey browser process going into "dead"/"uninterruptible sleep" state ("D" in ps) when opening some panels. My fellow SeaMonkey developers have told me such a process state is most likely something going wrong (i.e. being locked) at kernel level, and after a reboot I always see reiserfsck replaying journals on at least the /home partition, so I guessed the filesystem might be involved. I saw this on two computers running Nvidia and intel graphic drivers; the desktop has had that I guess since Factory went to 2.6.25, the laptop after I went from 10.3 to 11.0. I didn't yet test the KOTD, as I'm always a bit reluctant to install packages that aren't in a YaST-manageable repository, but I might give that a try.
Jeff, to replicate the problem, I boot to the "problem" kernel, and do an rpmbuild -ba --clean on the xorg-x11-server src rpm.
I'm also using an ATI video driver. It hangs when I use some video software. For example, fretsonfire runs fine, but if a video option is changed (the game restarts), my laptop hangs :s
(In reply to comment #79 from Alex Rodriguez) > I'am also using an ATI video DRIVER > > It hangs out, when i use somo video sofware > > example .. >> fretsonfire runs fine, but if a video option is changed, (the > game restarts) and my laptops hangs :s > Ok, that very much sounds like it's fglrx related then. Out of curiosity, is your hardware supported by radeonhd?
After using the kernel-lockdep-2.6.25.9-2 kernel extensively last night I didn't have a single hard-lock of my laptop. I lost keyboard and mouse once, but I suspect that's an unrelated problem as the system was still visibly working. Beagle ran the whole time and it didn't hiccup.
Great, thanks for the feedback. This is all very promising. Can you check your system log to ensure there hasn't been any output indicating a locking issue?
The only log-entries that mention locking are some lockdep entries during resume-from-suspend, which I suspect are expected. Jun 26 19:27:27 mark kernel: lockdep: fixing up alternatives. Everything else looks clean.
Had 3 hangs with kernel-lockdep-2.6.25.9-2. The system was reachable neither through the network nor through the serial line. So it wasn't just a loss of keyboard and mouse like in comment #81.
Ok, since none of the compartmentalized tests I've tried doing have been getting any results, I moved my /home over to reiserfs. Sure enough, I'm seeing hangs very quickly after boot. Updating to the KOTD seems to have made them a bit less frequent, but I did run into the following oops nearly immediately. Now that I can actually trigger it, I hope to have a fix soonish.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000057
IP: [<ffffffff8812ba16>] :reiserfs:sprintf_le_key+0x1a/0x3b0
PGD 6f44c067 PUD 6f438067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
CPU 0
Modules linked in: nfs lockd nfs_acl sunrpc iptable_filter ip_tables ip6table_filter ip6_tables x_tables af_packet autofs4 ipv6 bridge bnep cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device binfmt_misc microcode netconsole configfs fuse sha256_generic aes_x86_64 aes_generic cbc ohci_hcd dm_crypt loop arc4 pcmcia ecb crypto_blkcipher usbhid ppdev rfcomm l2cap hid parport_pc rtc_cmos iwl3945 nsc_ircc sr_mod i2c_i801 rtc_core yenta_socket iTCO_wdt joydev i2c_core rtc_lib irda rsrc_nonstatic ff_memless thinkpad_acpi firmware_class parport sg crc_ccitt pcmcia_core video iTCO_vendor_support battery ac cdrom hci_usb output snd_hda_intel button snd_pcm uinput mac80211 snd_timer snd_page_alloc snd_hwdep intel_agp bluetooth e1000e snd soundcore cfg80211 linear sd_mod ehci_hcd uhci_hcd usbcore dm_snapshot reiserfs edd dm_mod ext3 mbcache jbd fan ata_piix ahci libata scsi_mod dock thermal processor
Pid: 3766, comm: beagled Tainted: G N 2.6.25.9-SL110_BRANCH_20080702130425-lockdep #1
RIP: 0010:[<ffffffff8812ba16>] [<ffffffff8812ba16>] :reiserfs:sprintf_le_key+0x1a/0x3b0
RSP: 0018:ffff81006f5d1a18 EFLAGS: 00010202
RAX: ffff81006f5d1c30 RBX: 000000000000004f RCX: 000000000000000b
RDX: ffff81006f5d1c28 RSI: 000000000000004f RDI: ffffffff88158e56
RBP: ffff81006f5d1a38 R08: ffff81006f5d1d78 R09: 00000000ffffffff
R10: ffffffff8815924d R11: ffffffffffffffff R12: ffffffff88159220
R13: ffffffff88158e20 R14: ffffffff88158e56 R15: ffff81006f5d1b28
FS: 0000000040fb2950(0063) GS:ffffffff80651000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000057 CR3: 000000006f585000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process beagled (pid: 3766, threadinfo ffff81006f5d0000, task ffff81007a341140)
Stack: ffffffff8814456b ffffffff88159220 ffffffff88158e20 ffffffff88158e56
ffff81006f5d1b18 ffffffff8812b37c 0000000000000001 ffff81007a341178
ffff81007a341140 0000000000000000 ffff81006f5d1a98 ffffffff802572f2
Call Trace:
[<ffffffff8812b37c>] :reiserfs:prepare_error_buf+0xe2/0x560
[<ffffffff8812b97d>] :reiserfs:__reiserfs_warning+0x63/0xe2
[<ffffffff8813c362>] :reiserfs:reiserfs_xattr_get+0x246/0x269
[<ffffffff8813cf98>] :reiserfs:user_get+0x55/0x5b
[<ffffffff8813c07b>] :reiserfs:reiserfs_getxattr+0x80/0xa6
[<ffffffff802c3a64>] vfs_getxattr+0xb6/0xcb
[<ffffffff802c3b30>] getxattr+0xb7/0x11a
[<ffffffff802c3c79>] sys_lgetxattr+0x5d/0x7d
[<ffffffff8020c01a>] system_call_after_swapgs+0x8a/0x8f
DWARF2 unwinder stuck at system_call_after_swapgs+0x8a/0x8f
Leftover inexact backtrace:
Code: 50 33 f8 48 81 c4 d8 00 00 00 5b 41 5c 41 5d c9 c3 55 48 85 f6 48 89 e5 41 56 49 89 fe 41 55 41 54 53 48 89 f3 0f 84 59 03 00 00 <48> 8b 46 08 48 89 c2 b8 0f 00 00 00 48 c1 ea 3c 80 fa 03 77 03
RIP [<ffffffff8812ba16>] :reiserfs:sprintf_le_key+0x1a/0x3b0
RSP <ffff81006f5d1a18>
CR2: 0000000000000057
---[ end trace 603720e8ee3f5a37 ]---
*** Bug 400419 has been marked as a duplicate of this bug. ***
*** Bug 396200 has been marked as a duplicate of this bug. ***
I've been running 2.6.25.6-lockdep for 4 days and haven't been able to reproduce the problem. I've just updated to 2.6.25.10-lockdep and will be using that, but things are looking good.
I had two hangs with 2.6.25.10-SL110_BRANCH_20080703042026-lockdep this morning. Both hangs occurred while upgrading to the latest yast version (e.g. yast-2.17.4-4 plus 120 other yast related packages). I finally managed to upgrade yast from the console, with no X server running during the upgrade. Jeff, you have set only your /home to reiserfs, so you probably won't see whether package installation triggers a kernel lockup. My machine uses reiserfs for / and ext2 for /boot.
11.0 update kernel released, version-release is 2.6.25.9-0.2
Ok, since the update is released, I'd like some input. I'm now running the update kernel as listed in comment #90. In response to comment #89, I actually converted my entire system (except /boot since I use LVM) to reiserfs. I didn't do a fresh install, though - I just copied my ext3-based system to a different LV and called that /. Package installations, regular user interactions, etc are all on reiserfs. I even re-enabled Beagle, and then wiped out all the extended attributes (find /home/jeffm -exec setfattr -x user.Beagle {} \;) and restarted it just to ensure that the xattr load typically seen on a new installation with Beagle running would be reproduced. Since then, the user.Beagle xattrs have re-appeared on those files. I'm not using any proprietary modules on my system.
Yesterday, I installed a BIOS update (ASRock A780FullDisplayPort board, ref. 1.50 - I updated in the hope that this would fix the incomplete ACPI PSB table, but it didn't). Before the update, the problem didn't seem to appear very often, but afterwards it took only a few minutes after starting beagled for it to appear. Thanks for releasing the kernel update just in time; now it seems to be stable again (keeping my fingers crossed). I have several other older Linux machines with ReiserFS partitions which are running fine (ok, some of them have the user home directory mounted via NFS, so those don't count), but Beagle has been disabled there for quite a while (it was quite annoying in the past), so the problem didn't show up. Beagle has now finished indexing this user, so it seems to be ok.
Running the updated kernel here since it was released. Tried heavy disk I/O, beagle is active, the machine has been stable so far.
I am running the updated kernel. I don't know if it is related, but I now get crashes when selecting "Open in new Window" on a webpage link. This happens in Konqueror and Firefox. I am running Compiz KDE.
Was this a system crash or a windowing system crash? ie: Did you need to reboot entirely?
System crash. The system freezes and the display is either frozen or starts to go white slowly. Same as the earlier crashes with reiserfs running. I did some further tests. It happens only with Compiz enabled, and with certain effects. When I choose "Low effects" it is stable again. I will slowly re-enable my original effects to see if I can narrow it down.
If the level of effects changes the frequency of the crash, I'd speculate that you're not hitting this bug any longer.
The new kernel is working like a champ here. Just to test, I booted back into the "stock" kernel, and lo and behold, before too long I had the hang. Running this kernel for 2+ days now without any issues. I am convinced the issue is resolved by this kernel update.
No freezes with the new kernel for 4 days.
Good to see that this bug is getting fixed. I made a trip to forums.opensuse.org and saw an abundance of fear, uncertainty and doubt, literally :-), regarding beagle and OpenSUSE. I don't know what OpenSUSE's policy is about 'community postings', but an authoritative posting that this problem is fixed by a kernel update would help in removing some of the confusion.
No freezes with the new kernel. Installed beagle and had it crawl a few gigs of a reiserfs partition mounted with noatime, acl, user_xattr
+1 5 days without freezes :D Beagle, CompizFusion, ATI driver ON .... All workin fine FINALLY :)
Great. Between the feedback in this report and in the forums, I consider this bug fixed. Thanks for the help!
2.6.25.10-SL110_BRANCH_20080703042026 surely has made the situation much better for me, though I sometimes still see similar problems that sometimes very much look like being related to the kernel, but this may be some other bug in some other module, maybe the proprietary (argh) NVidia module, some hardware problem or whatever. My real question is though: Is this fixed in the current stock Factory kernel?
This particular problem is fixed in Factory. There is another much more rare issue described in 399966 that I'm still actively trying to reproduce.
I've had no system crashes since the new kernel (with beagle enabled) Thanks
I've received a few requests to explain why I consider the problem fixed, outside of the feedback from users indicating that they no longer observe the hangs.

-------------------------------------------------------------------
Tue Jun 17 20:39:37 CEST 2008 - jeffm@suse.de
- patches.fixes/reiserfs-discard-xattr-prealloc: reiserfs: discard prealloc in reiserfs_delete_inode (bnc#389656).

I believe this patch fixed the problem. Comments #45 and #85 indicate that the problem existed with the patch applied, but I think it could be a different issue there (like bug 399966).

Here's what happened: In 2.6.25, I submitted a patch (originally proposed by Christoph Hellwig and Dave Hansen) to remove the use of struct file from the reiserfs xattr implementation. This had an unintended side effect.

In order to expedite the write process and reduce fragmentation, reiserfs will attempt to allocate several blocks at once (preallocation) and then use those blocks when future writes require them. These blocks are reserved in the generic allocator and freed by reiserfs_discard_prealloc(), which is usually called by reiserfs_file_release(). The list is attached directly to the in-memory reiserfs inode, and indirectly to the in-memory journal structure via a list of inodes. When a transaction is completed, the journal code discards all open reservations from any inode by traversing the list and discarding the reservations.

This works perfectly for regular files, but since xattrs stopped using struct file, reiserfs_file_release() doesn't get called anymore. The list is left attached to the inode and the inode is left in the list associated with the in-memory journal. If a journal transaction completes after the inode is freed, we'd hit a seg fault with a number of locks held, which would cause the system to hang. Alternatively, under load, the memory previously used by the inode could have been reallocated, and we could follow a bad pointer, also resulting in a hang. The third possibility is that the list pointers are "valid" and then get followed, and we end up with memory (and possibly disk bitmap) corruption. This third case should be extremely rare.

So, at the end of the day, this is kind of a classic structure lifetime issue. The structure has gone away while things still reference it, and bad things happen. I've added a call to reiserfs_discard_prealloc() to reiserfs_delete_inode() and this seems to have eliminated the problem for most users. I'm still auditing the code to ensure that the non-delete case is covered as well.
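For readers who want to see the shape of the lifetime problem described above, here is a minimal userspace sketch in C. It is not the reiserfs code: every name in it (toy_inode, toy_journal, toy_prealloc, toy_discard_prealloc, toy_delete_inode, toy_end_transaction) is hypothetical, and the model assumes nothing more than a circular doubly linked list owned by the journal. It only illustrates why the discard has to happen before the inode memory is released.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for an in-memory inode; all names here are hypothetical. */
struct toy_inode {
    int prealloc_blocks;            /* blocks reserved but not yet used */
    int on_list;                    /* 1 while linked into the journal list */
    struct toy_inode *prev, *next;  /* links in the journal's list */
};

/* Toy stand-in for the in-memory journal. */
struct toy_journal {
    struct toy_inode head;  /* sentinel of a circular doubly linked list */
    int reserved;           /* total blocks currently reserved */
};

static void toy_journal_init(struct toy_journal *j)
{
    j->head.prev = j->head.next = &j->head;
    j->head.prealloc_blocks = 0;
    j->head.on_list = 0;
    j->reserved = 0;
}

static struct toy_inode *toy_inode_new(void)
{
    struct toy_inode *ino = calloc(1, sizeof(*ino));
    assert(ino != NULL);
    ino->prev = ino->next = ino;
    return ino;
}

/* Reserve blocks for an inode and make sure the journal can find it. */
static void toy_prealloc(struct toy_journal *j, struct toy_inode *ino, int n)
{
    ino->prealloc_blocks += n;
    j->reserved += n;
    if (!ino->on_list) {
        ino->prev = j->head.prev;
        ino->next = &j->head;
        j->head.prev->next = ino;
        j->head.prev = ino;
        ino->on_list = 1;
    }
}

/* Give the reservation back and unlink the inode from the journal list. */
static void toy_discard_prealloc(struct toy_journal *j, struct toy_inode *ino)
{
    if (!ino->on_list)
        return;
    j->reserved -= ino->prealloc_blocks;
    ino->prealloc_blocks = 0;
    ino->prev->next = ino->next;
    ino->next->prev = ino->prev;
    ino->prev = ino->next = ino;
    ino->on_list = 0;
}

/* The fix in miniature: discard before the memory goes away.
 * Without the discard, the journal list would keep a dangling pointer. */
static void toy_delete_inode(struct toy_journal *j, struct toy_inode *ino)
{
    toy_discard_prealloc(j, ino);
    free(ino);
}

/* At transaction end, discard every reservation still on the list. */
static int toy_end_transaction(struct toy_journal *j)
{
    int discarded = 0;
    while (j->head.next != &j->head) {
        toy_discard_prealloc(j, j->head.next);
        discarded++;
    }
    return discarded;
}
```

In this toy model, deleting an inode and then ending the transaction is safe because toy_delete_inode() unlinks the inode first. If that discard call were removed, toy_end_transaction() would walk into freed memory, which corresponds to the three failure modes listed above (fault, bad pointer, or silent corruption).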
I have been running kernel 2.6.26-3 from the HEAD branch in August. I didn't have a single hang for a month. So I believe this hang and/or the other hang I had with previous kernel versions has been fixed. Thanks
*** Bug 385715 has been marked as a duplicate of this bug. ***