Bugzilla – Bug 319657
some sort of race can crash the runtime at start-up
Last modified: 2007-09-15 21:24:46 UTC
---- Reported by trow@ximian.com 2005-11-28 17:33:22 MST ----

I am using mono 1.1.10 on suse 10.0.

In bludgeon (a testing tool for Beagle), I spawn off a beagled process and then immediately start sending 'ping' messages to it every 50ms. We do this until the daemon responds, and then we know that it is ready to accept other messages.

About one time out of three, the beagled's helper process runtime segfaults:

Native stacktrace:

        mono-beagled-helper(mono_handle_native_sigsegv+0xba) [0x81471da]
        mono-beagled-helper [0x81354cf]
        [0xffffe440]
        mono-beagled-helper [0x80fc552]
        mono-beagled-helper(mono_once+0xc6) [0x8102a66]
        mono-beagled-helper [0x80fc68c]
        mono-beagled-helper(mono_runtime_init+0xc9) [0x80daec9]
        mono-beagled-helper [0x8136431]
        mono-beagled-helper(mono_main+0x194) [0x805ce94]
        mono-beagled-helper(__fxstat64+0x12b) [0x805bf3b]
        /lib/tls/libc.so.6(__libc_start_main+0xd0) [0x40135ea0]
        mono-beagled-helper(sinh+0x41) [0x805be91]

This is very mysterious: I do not know why sending messages to the beagled would cause the helper process to crash. We might be doing something stupid in Beagle (so this might be our bug), but regardless --- we still shouldn't be able to crash the runtime.

Sleeping for 1s before we start pinging allows us to avoid the crash, so this must be some sort of race in start-up. Beagle's message-passing code uses async I/O, so the bug might be related to that.

---- Additional Comments From dick@ximian.com 2005-11-28 20:04:49 MST ----

Does https://bugzilla.novell.com/show_bug.cgi?id=MONO76731 (reported by Joe a couple of weeks ago) look like the same thing to you? If so, now that you've described a way to reproduce it I'll be able to take a look.

Could you look up the symbols between mono_once and the mono_handle_native_sigsegv please? The comment in the other bug about mono_once not having changed in years still stands, though.

---- Additional Comments From joeshaw@novell.com 2005-11-29 12:02:48 MST ----

I am pretty certain it is a duplicate, yes.
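
The reproduction pattern from the original report (spawn the daemon, then ping it every 50ms until it answers) can be sketched roughly as below. This is only an illustration in Python, not Beagle's or bludgeon's actual code: `wait_until_ready`, its parameters, and the `ping` callable are hypothetical names. The `initial_delay` parameter stands in for the "sleep 1s before pinging" workaround mentioned in the report.

```python
import time
from typing import Callable


def wait_until_ready(ping: Callable[[], bool],
                     interval: float = 0.05,
                     timeout: float = 10.0,
                     initial_delay: float = 0.0) -> int:
    """Call ping() every `interval` seconds until it returns True.

    `ping` is a hypothetical callable that sends one 'ping' message to the
    daemon and returns True once the daemon has responded.
    Returns the number of pings that were sent.
    Raises TimeoutError if no response arrives within `timeout` seconds.
    """
    if initial_delay > 0:
        # Workaround from the report: wait before the first ping so the
        # freshly spawned process gets through runtime start-up first.
        time.sleep(initial_delay)

    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            if ping():
                return attempts
        except OSError:
            # Assumption: a daemon that is not listening yet surfaces as an
            # OSError (e.g. connection refused); treat that as "not ready".
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no response after {attempts} pings")
        time.sleep(interval)
```

With `initial_delay=0.0` the first ping goes out immediately after the spawn, which is the timing under which the crash was reported; `initial_delay=1.0` corresponds to the workaround.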

---- Additional Comments From trow@ximian.com 2005-11-29 16:56:50 MST ----

I agree that this does look suspiciously like a dup of https://bugzilla.novell.com/show_bug.cgi?id=MONO76731.

Dick: Any suggestions on how to look up those symbols? Since the thing that is crashing is a process spawned from another process spawned from another process, debugging is a bit awkward...

---- Additional Comments From gonzalo@ximian.com 2005-11-29 17:56:34 MST ----

Maybe I'm stating something obvious, but what about 'ulimit -c blah' and then using the core file?

---- Additional Comments From trow@ximian.com 2005-12-05 20:38:27 MST ----

If there is something I could do to produce a more useful backtrace, just let me know.

(gdb) thread apply all bt

Thread 2 (process 7160):
#0  0x00210202 in ?? ()
Cannot access memory at address 0x20696863

Thread 1 (process 7158):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x40148541 in raise () from /lib/tls/libc.so.6
#2  0x40149dbb in abort () from /lib/tls/libc.so.6
#3  0x0814722b in mono_handle_native_sigsegv ()
#4  0x081354cf in mono_codegen ()
#5  <signal handler called>
#6  0x080fe00e in mono_pthread_key_for_tls ()
#7  0x080fc552 in mono_pthread_key_for_tls ()
#8  0x08102a66 in mono_once ()
#9  0x080fc68c in mono_pthread_key_for_tls ()
#10 0x080daec9 in mono_runtime_init ()
#11 0x08136431 in mono_codegen ()
#12 0x0805ce94 in mono_main ()
#13 0x0805bf3b in ?? ()
#14 0x00000008 in ?? ()
#15 0xbfde2764 in ?? ()
#16 0xbfde2738 in ?? ()
#17 0x40135ea0 in __libc_start_main () from /lib/tls/libc.so.6
#18 0x40135ea0 in __libc_start_main () from /lib/tls/libc.so.6
#19 0x0805be91 in ?? ()

---- Additional Comments From dick@ximian.com 2005-12-09 12:25:50 MST ----

Are you running this on PPC? I've been unable to reproduce the crash on x86 (tried with smp hardware too) so I've been examining the code to try and spot the smoking gun. mono_pthread_key_for_tls () is only called from mini-ppc.c.

When you produced the backtrace, were there any other threads running?
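
The 'ulimit -c' suggestion above works even when the crashing process is several spawns deep, because resource limits are inherited across fork and exec. A minimal sketch of applying it from the top-level spawner follows, in Python for illustration only (the test tool in this report is not written in Python). The helper names `enable_core_dumps` and `spawn_with_core_dumps` are hypothetical; the sketch assumes a POSIX system and only raises the soft limit as far as the existing hard limit allows.

```python
import resource
import subprocess


def enable_core_dumps() -> None:
    """Raise the soft core-file size limit to the hard limit.

    This is the programmatic counterpart of `ulimit -c` in a shell.
    If the hard limit is 0, core dumps stay disabled.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def spawn_with_core_dumps(argv, **popen_kwargs) -> subprocess.Popen:
    """Start `argv` with core dumps enabled for it and its descendants.

    preexec_fn runs in the child after fork() and before exec(), so the
    raised limit applies to the spawned process and is inherited by any
    process it spawns in turn.
    """
    return subprocess.Popen(argv, preexec_fn=enable_core_dumps,
                            **popen_kwargs)
```

Where the core file ends up depends on the system's core-pattern configuration. Once one exists, it can be loaded with `gdb <path-to-binary> <core-file>` to get a backtrace of the helper without attaching to it live.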

---- Additional Comments From trow@ximian.com 2005-12-09 12:58:15 MST ----

No, this is on x86, on a single CPU system (my thinkpad). Those were the only two threads in my backtrace.

This crash is very sensitive to timing -- trivial perturbations of my code (i.e. adding and removing Console.WriteLines) cause it to appear and disappear, and change its frequency.

I'll try to put together a simple example that reliably reproduces it.

---- Additional Comments From dick@ximian.com 2005-12-14 12:08:19 MST ----

*** https://bugzilla.novell.com/show_bug.cgi?id=MONO76731 has been marked as a duplicate of this bug. ***

---- Additional Comments From joeshaw@novell.com 2006-02-09 15:25:33 MST ----

FWIW, I haven't seen this in quite some time. I am fine with closing it at this point.

---- Additional Comments From dick@ximian.com 2006-02-10 09:26:22 MST ----

Someone posted a new bug recently that might help track this down: https://bugzilla.novell.com/show_bug.cgi?id=MONO77393

I've been busy with other things, but I'll get to it soon!

---- Additional Comments From joeshaw@novell.com 2006-02-28 16:53:45 MST ----

Hmm, not so fast on closing this one, I'm afraid. A Beagle user just hit this with 1.1.13.2. Its appearance has definitely declined in frequency; I haven't seen it in a while.

Anyway, hopefully this is a dup of 77393 as you mentioned.

---- Additional Comments From dick@ximian.com 2006-03-03 11:17:14 MST ----

Setting as a duplicate of 77393 (which is currently NEEDINFO as it doesn't happen for me on SVN head.)

*** This bug has been marked as a duplicate of https://bugzilla.novell.com/show_bug.cgi?id=MONO77393 ***

Unknown operating system unknown. Setting to default OS "Other".

This bug was marked DUPLICATE in the database it was moved from. Changing resolution to "MOVED"