Bug 235818

Summary: Kernel Bug in kernel-default-2.6.18.2-34 on x86-64 SMP machine
Product: [openSUSE] openSUSE 10.2
Reporter: Andreas Vetter <vetter>
Component: Kernel
Assignee: Nick Piggin <npiggin>
Status: RESOLVED WONTFIX
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
CC: aj, asklein, auxsvr, jeffm
Version: Final
Target Milestone: ---
Hardware: x86-64
OS: Other
Whiteboard:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: hwinfo

Description Andreas Vetter 2007-01-17 15:51:31 UTC
Kernel Bug in kernel-default-2.6.18.2-34 on x86-64 SMP machine:

Jan 12 16:13:55 wpyc009 kernel: BUG: warning at fs/inotify.c:171/set_dentry_child_flags()
Jan 12 16:13:55 wpyc009 kernel:
Jan 12 16:13:55 wpyc009 kernel: Call Trace:
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff802d54dc>] set_dentry_child_flags+0x66/0x132
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff802d560f>] remove_watch_no_event+0x67/0x76
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff802d5a7d>] inotify_destroy+0x92/0xbf
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff802d5b9a>] inotify_release+0x1a/0x73
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff80210559>] __fput+0xae/0x182
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff80221b2a>] filp_close+0x5c/0x64
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff80236bdd>] put_files_struct+0x6c/0xc3
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff802131c3>] do_exit+0x2b0/0x8fc
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff802450b9>] cpuset_exit+0x0/0x6c
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff802295f6>] get_signal_to_deliver+0x46e/0x49d
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff80227fdd>] do_signal+0x55/0x74a
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff8025c8e8>] thread_return+0x0/0xef
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff8029463d>] __remove_hrtimer+0x27/0x39
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff80255f03>] hrtimer_cancel+0xc/0x16
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff8025d736>] do_nanosleep+0x47/0x70
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff80255df0>] hrtimer_nanosleep+0x58/0x118
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff80226034>] do_wait+0x9c1/0xa44
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff80294821>] hrtimer_wakeup+0x0/0x22
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff80258097>] sysret_signal+0x1c/0x27
Jan 12 16:13:55 wpyc009 kernel:  [<ffffffff8025831b>] ptregscall_common+0x67/0xac
Jan 12 16:13:55 wpyc009 kernel:
Comment 1 Andreas Vetter 2007-01-17 15:53:14 UTC
Created attachment 113411: hwinfo
Comment 2 Greg Kroah-Hartman 2007-01-17 17:48:10 UTC
This is just a "warning" that something bad might have happened; the kernel caught it and carried on.

Did the system continue to work just fine, or did other things go wrong?

Is it easy to trigger this warning? What were you doing at the time?
Comment 3 Andreas Vetter 2007-01-18 09:05:13 UTC
The machine works fine after that. I have no idea how to trigger it: this is a machine in a pool for students, and process accounting (acct) had accidentally not been started.

The entry just before the bug is:
Jan 12 16:13:53 wpyc009 sshd[23810]: Accepted publickey for ferfurth from 132.187.42.39 port 59835 ssh2

Since this seems to be filesystem dependent:
The machine has / and /tmp on a reiserfs. Nobody can insert floppies, CDs, USB devices. 
It has /home and /usr/local on NFS, sometimes the NFS server responds very slowly. 
Comment 4 Andreas Vetter 2007-01-19 15:56:02 UTC
Hmm, we have several machines (same hardware) that freeze when the X server is killed with CTRL-ALT-Backspace. We have to power-cycle them. Unfortunately, it is not reproducible. I hope the new Xorg update helps with this issue.
Comment 5 Greg Kroah-Hartman 2007-01-20 00:40:05 UTC
Can you provide the output of 'hwinfo' attached to this bug?
Comment 6 Andreas Vetter 2007-01-21 14:41:04 UTC
(In reply to comment #1)
> Created an attachment (id=113411)
> hwinfo

Already done; the hwinfo output is attached in comment #1.

Comment 7 Lars Marowsky-Bree 2007-01-23 10:38:41 UTC
Does the new Xorg update help as you hope in comment #4?
Comment 8 Andreas Vetter 2007-01-23 12:08:44 UTC
Looks good until now.
Comment 9 Lars Marowsky-Bree 2007-01-23 12:48:52 UTC
Perfect! ;-)
Comment 10 Andreas Vetter 2007-01-23 14:01:46 UTC
Too early :-(
One of the machines was completely frozen again.
Nothing in the logs.
The user says they started bzflag, and then the machine froze.
Unfortunately I can't find it with "lastcomm". Apparently "lastcomm" only lists commands that have already finished. How can I log all commands?
Comment 11 Andreas Vetter 2007-02-05 13:06:45 UTC
Different machine, similar bug:

Feb  1 10:24:29 wpyc007 kernel: BUG: warning at fs/inotify.c:171/set_dentry_child_flags()
Feb  1 10:24:29 wpyc007 kernel:
Feb  1 10:24:29 wpyc007 kernel: Call Trace:
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff802d54dc>] set_dentry_child_flags+0x66/0x132
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff802d560f>] remove_watch_no_event+0x67/0x76
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff88072fdd>] :reiserfs:reiserfs_delete_inode+0x0/0xf6
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff802d5a7d>] inotify_destroy+0x92/0xbf
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff802d5b9a>] inotify_release+0x1a/0x73
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff80210559>] __fput+0xae/0x182
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff80221b2a>] filp_close+0x5c/0x64
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff80236bdd>] put_files_struct+0x6c/0xc3
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff802131c3>] do_exit+0x2b0/0x8fc
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff802450b9>] cpuset_exit+0x0/0x6c
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff802295f6>] get_signal_to_deliver+0x46e/0x49d
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff80227fdd>] do_signal+0x55/0x74a
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff80229cab>] sys_recvfrom+0x11d/0x137
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff80258097>] sysret_signal+0x1c/0x27
Feb  1 10:24:29 wpyc007 kernel:  [<ffffffff8025831b>] ptregscall_common+0x67/0xac
Feb  1 10:24:29 wpyc007 kernel:
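
The trace above and the one in the description can be compared mechanically. The following is a minimal Python sketch, not part of the original report: the regular expression and the helper names (FRAME_RE, frames, compare) are assumptions made for illustration, and the trace excerpts are copied from this bug with the syslog prefix removed.

```python
import re

# One call-trace frame as printed in this report, for example
#   [<ffffffff802d54dc>] set_dentry_child_flags+0x66/0x132
# An optional ":module:" prefix (":reiserfs:") is captured separately.
# This pattern is an assumption based only on the lines quoted in this bug.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?::(?P<module>\w+):)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def frames(log_text):
    """Return the symbol names of all frames found in log_text, in order."""
    return [m.group("symbol") for m in FRAME_RE.finditer(log_text)]


def compare(first, second):
    """Split two symbol lists into shared, only-in-first, only-in-second."""
    in_first, in_second = set(first), set(second)
    shared = [s for s in first if s in in_second]
    only_first = [s for s in first if s not in in_second]
    only_second = [s for s in second if s not in in_first]
    return shared, only_first, only_second


# Excerpt of the trace from the description (wpyc009, Jan 12).
TRACE_WPYC009 = """
 [<ffffffff802d54dc>] set_dentry_child_flags+0x66/0x132
 [<ffffffff802d560f>] remove_watch_no_event+0x67/0x76
 [<ffffffff802d5a7d>] inotify_destroy+0x92/0xbf
 [<ffffffff802d5b9a>] inotify_release+0x1a/0x73
 [<ffffffff80210559>] __fput+0xae/0x182
 [<ffffffff80221b2a>] filp_close+0x5c/0x64
 [<ffffffff80236bdd>] put_files_struct+0x6c/0xc3
 [<ffffffff802131c3>] do_exit+0x2b0/0x8fc
 [<ffffffff802450b9>] cpuset_exit+0x0/0x6c
 [<ffffffff802295f6>] get_signal_to_deliver+0x46e/0x49d
 [<ffffffff80227fdd>] do_signal+0x55/0x74a
 [<ffffffff8025c8e8>] thread_return+0x0/0xef
"""

# Excerpt of the trace from comment #11 (wpyc007, Feb 1).
TRACE_WPYC007 = """
 [<ffffffff802d54dc>] set_dentry_child_flags+0x66/0x132
 [<ffffffff802d560f>] remove_watch_no_event+0x67/0x76
 [<ffffffff88072fdd>] :reiserfs:reiserfs_delete_inode+0x0/0xf6
 [<ffffffff802d5a7d>] inotify_destroy+0x92/0xbf
 [<ffffffff802d5b9a>] inotify_release+0x1a/0x73
 [<ffffffff80210559>] __fput+0xae/0x182
 [<ffffffff80221b2a>] filp_close+0x5c/0x64
 [<ffffffff80236bdd>] put_files_struct+0x6c/0xc3
 [<ffffffff802131c3>] do_exit+0x2b0/0x8fc
 [<ffffffff802450b9>] cpuset_exit+0x0/0x6c
 [<ffffffff802295f6>] get_signal_to_deliver+0x46e/0x49d
 [<ffffffff80227fdd>] do_signal+0x55/0x74a
 [<ffffffff80229cab>] sys_recvfrom+0x11d/0x137
"""

shared, only_009, only_007 = compare(frames(TRACE_WPYC009), frames(TRACE_WPYC007))
print("shared:", shared)
print("only wpyc009:", only_009)
print("only wpyc007:", only_007)
```

On these excerpts, the frames unique to each trace (thread_return, reiserfs_delete_inode, sys_recvfrom) are not inotify functions, while the inotify teardown frames reached from do_exit appear in both.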
Comment 12 Andreas Vetter 2007-02-05 13:09:28 UTC
Machine from comment #11 is still working correctly without reboot. Maybe the lockups and this bug are two different things.
Comment 14 Jan Kara 2007-02-08 09:07:28 UTC
I've been investigating the inotify problem - actually, it does not seem to be rare (I've found several bug reports with a similar warning). But no one else complains about the hang, so that one is probably unrelated. I'm reassigning to Nick, who is trying to track down the inotify problem in mainline. He may be glad of further debugging input ;)
Comment 15 Nick Piggin 2007-02-20 15:50:04 UTC
Sorry, still working on this in the upstream kernel.

Andreas: I'm pretty sure it is harmless. Actually the flag is only used to
indicate whether there is an inotify watch on the parent directory without
taking a lock. The warning just means we've found the flag set when it should
not have been, so we'll just have been doing a bit of extra locking in that
case.
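
To illustrate why a stale flag costs only extra locking, here is a toy Python model. It is a sketch under assumptions, not the kernel code: the class Dentry, the attribute parent_watched and the function notify_parent are hypothetical names invented for this example. The flag acts as a lock-free hint; when it is set, the slow path takes the lock and consults the real watch list, so a flag that is wrongly set leads to a wasted lock acquisition but no spurious event.

```python
import threading


class Dentry:
    """Toy stand-in for a directory entry (hypothetical, not struct dentry)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.parent_watched = False    # hint: "my parent directory is watched"
        self.lock = threading.Lock()
        self.lock_acquisitions = 0     # instrumentation for this example only
        self.watchers = []             # callbacks watching this directory


def notify_parent(dentry, event):
    """Deliver event to the parent's watchers; return how many were notified."""
    # Fast path: the hint says nobody watches the parent, so skip the lock.
    if not dentry.parent_watched:
        return 0
    # Slow path: take the lock and look at the authoritative watch list.
    with dentry.lock:
        dentry.lock_acquisitions += 1
        watchers = list(dentry.parent.watchers)
    for callback in watchers:
        callback(dentry.name, event)
    return len(watchers)


parent = Dentry("dir")
child = Dentry("file", parent)

# Correct state: no watch, flag clear. No lock is taken.
clean_delivered = notify_parent(child, "modify")
clean_locks = child.lock_acquisitions

# Stale flag: no watch, but the flag is set. One wasted lock, no event.
child.parent_watched = True
stale_delivered = notify_parent(child, "modify")
stale_locks = child.lock_acquisitions

# Real watch: flag set and a watcher registered. The event is delivered.
seen = []
parent.watchers.append(lambda name, event: seen.append((name, event)))
watched_delivered = notify_parent(child, "modify")
```

The model covers only the case described above (flag set without a watch); it says nothing about the opposite case or about the race fixed later in this bug.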
Comment 16 Peter B 2007-08-30 01:11:56 UTC
A similar warning in my system:

BUG: warning at fs/inotify.c:181/set_dentry_child_flags()
 [<c01872af>] set_dentry_child_flags+0xcf/0x11e
 [<c0187351>] remove_watch_no_event+0x53/0x5f
 [<c0187a68>] inotify_destroy+0x77/0x9f
 [<c0187b52>] inotify_release+0xc/0x57
 [<c016560f>] __fput+0xac/0x16a
 [<c0162f2f>] filp_close+0x52/0x59
 [<c0121efd>] put_files_struct+0x65/0xa7
 [<c0122f34>] do_exit+0x224/0x791
 [<c02a6ed5>] do_page_fault+0x27d/0x507
 [<c0123517>] sys_exit_group+0x0/0xd
 [<c0103d5d>] sysenter_past_esp+0x56/0x79

This is on a reiserfs filesystem: no crash and no problems as far as I'm aware. It occurred only once, on Linux 2.6.18.8-0.5-default.
Comment 17 Jan Kara 2007-09-18 12:18:27 UTC
*** Bug 308585 has been marked as a duplicate of this bug. ***
Comment 18 Jan Kara 2007-09-18 12:22:06 UTC
*** Bug 309752 has been marked as a duplicate of this bug. ***
Comment 19 Nick Piggin 2007-12-03 06:54:15 UTC
OK, I have taken another look at this problem (sorry it has taken so long)
and come up with two patches: one fix to close a real race, and another to
remove the debugging code, which wasn't actually much help in tracking down
any problem (the race was found by inspection) and is itself a bit racy.

I have posted them to linux-fsdevel for public review, and we will go with
that solution if no objections are raised in the meantime.

Thanks,
Nick
Comment 20 Jeff Mahoney 2008-01-08 21:13:34 UTC
*** Bug 352290 has been marked as a duplicate of this bug. ***
Comment 21 Nick Piggin 2008-01-08 22:24:06 UTC
I have had patches for this in -mm for a few releases. No problems so far.

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc6/2.6.24-rc6-mm1/broken-out/inotify-fix-race.patch
http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc6/2.6.24-rc6-mm1/broken-out/inotify-remove-debug-code.patch

I'm wondering whether I should put these into the openSUSE kernel, or wait
for them to go upstream first?
Comment 22 Andreas Jaeger 2008-01-09 09:36:54 UTC
I suggest submitting this to kernel CVS *HEAD* so that it gets tested in Factory, and then moving it to the 10.3 kernel.

I also suggest pushing for upstream inclusion.

Thanks!
Comment 23 Nick Piggin 2008-07-09 05:24:04 UTC
Closing this as wontfix. The warnings are rather rare, and by all accounts they are false positives anyway. We found that KDE4 actually triggers them more often; however, I have fixed the problem in recent kernels, so 10.3 is probably OK to stay unpatched.