Bug 319657 (MONO76841) - some sort of race can crash the runtime at start-up
Summary: some sort of race can crash the runtime at start-up
Status: RESOLVED MOVED
Alias: MONO76841
Product: Mono: Runtime
Classification: Mono
Component: misc
Version: 1.1
Hardware: Other, OS: Other
Priority: P3 - Medium, Severity: Normal
Target Milestone: ---
Assignee: Dick Porter
QA Contact: Mono Bugs
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-11-29 00:33 UTC by trow
Modified: 2007-09-15 21:24 UTC
0 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Thomas Wiest 2007-09-15 19:39:31 UTC


---- Reported by trow@ximian.com 2005-11-28 17:33:22 MST ----

I am using mono 1.1.10 on suse 10.0.

In bludgeon (a testing tool for Beagle), I spawn off a beagled process and
then immediately start sending 'ping' messages to it every 50ms.  We do
this until the daemon responds, and then we know that it is ready to accept
other messages.
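
A minimal sketch of the polling pattern described above, written in shell rather than in the C# that Beagle itself uses. The `ping_daemon` helper and the `READY_FILE` marker are hypothetical stand-ins for Beagle's real 'ping' message, so that the loop can run on its own; the 50ms interval is the one from the report.

```shell
#!/bin/sh
# Hypothetical readiness marker; stands in for "the daemon answered the ping".
READY_FILE="${READY_FILE:-/tmp/daemon-ready.$$}"

# Hypothetical stand-in for sending a 'ping' message to the daemon.
# Succeeds (exit 0) once the daemon is considered ready.
ping_daemon() {
    [ -e "$READY_FILE" ]
}

# Ping every 50ms until the daemon responds or max_attempts is reached.
# Returns 0 if the daemon answered, 1 if we gave up.
wait_for_daemon() {
    max_attempts="$1"
    attempt=0
    while [ "$attempt" -lt "$max_attempts" ]; do
        if ping_daemon; then
            return 0
        fi
        # Fractional sleep is not required by POSIX, but GNU sleep accepts it.
        sleep 0.05
        attempt=$((attempt + 1))
    done
    return 1
}
```

Calling `wait_for_daemon 200` would poll for roughly ten seconds before giving up. The first ping is sent with no initial delay, which is the timing that exposes the crash in this report.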

About one time in three, beagled's helper process segfaults inside the runtime:

Native stacktrace:

        mono-beagled-helper(mono_handle_native_sigsegv+0xba) [0x81471da]
        mono-beagled-helper [0x81354cf]
        [0xffffe440]
        mono-beagled-helper [0x80fc552]
        mono-beagled-helper(mono_once+0xc6) [0x8102a66]
        mono-beagled-helper [0x80fc68c]
        mono-beagled-helper(mono_runtime_init+0xc9) [0x80daec9]
        mono-beagled-helper [0x8136431]
        mono-beagled-helper(mono_main+0x194) [0x805ce94]
        mono-beagled-helper(__fxstat64+0x12b) [0x805bf3b]
        /lib/tls/libc.so.6(__libc_start_main+0xd0) [0x40135ea0]
        mono-beagled-helper(sinh+0x41) [0x805be91]

This is very mysterious: I do not know why sending messages to the beagled
would cause the helper process to crash.  We might be doing something
stupid in Beagle (so this might be our bug), but regardless --- we still
shouldn't be able to crash the runtime.

Sleeping for 1s before we start pinging allows us to avoid the crash, so
this must be some sort of race in start-up.  Beagle's message-passing code
uses async I/O, so the bug might be related to that.



---- Additional Comments From dick@ximian.com 2005-11-28 20:04:49 MST ----

Does https://bugzilla.novell.com/show_bug.cgi?id=MONO76731 (reported by Joe a couple of weeks ago) look like the
same thing to you?

If so, now that you've described a way to reproduce it I'll be able to
take a look.  Could you look up the symbols between mono_once and the
mono_handle_native_sigsegv please?

The comment in the other bug about mono_once not having changed in
years still stands, though.



---- Additional Comments From joeshaw@novell.com 2005-11-29 12:02:48 MST ----

I am pretty certain it is a duplicate, yes.



---- Additional Comments From trow@ximian.com 2005-11-29 16:56:50 MST ----

I agree that this does look suspiciously like a dup of https://bugzilla.novell.com/show_bug.cgi?id=MONO76731.

Dick: Any suggestions on how to look up those symbols?  Since the
thing that is crashing is a process spawned from another process
spawned from another process, debugging is a bit awkward...



---- Additional Comments From gonzalo@ximian.com 2005-11-29 17:56:34 MST ----

Maybe I'm stating the obvious, but what about 'ulimit -c blah'
and then using the core file?
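
A rough sketch of that approach, assuming a Bourne-style shell. The value `unlimited` and the paths in the commented gdb invocation are assumptions for illustration, not values taken from this report. Resource limits are inherited by child processes, so raising the limit in the shell that launches the top-level process also covers a helper spawned several levels down.

```shell
#!/bin/sh
# Raise the core file size limit for this shell and everything it spawns.
# 'unlimited' is an assumed value; a hard limit may forbid raising it.
ulimit -c unlimited 2>/dev/null || echo "could not raise the core file limit" >&2

# Show the limit now in effect (a block count, or "unlimited").
ulimit -c

# Next, start the test from this same shell and wait for the crash.
# The core file can then be inspected with gdb (hypothetical paths):
#
#   gdb /path/to/mono-beagled-helper /path/to/core
#   (gdb) thread apply all bt
```

Where the core file is written, and what it is called, depends on the system's configuration.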



---- Additional Comments From trow@ximian.com 2005-12-05 20:38:27 MST ----

If there is something I could do to produce a more useful backtrace, just
let me know.

(gdb) thread apply all bt

Thread 2 (process 7160):
#0  0x00210202 in ?? ()
Cannot access memory at address 0x20696863

Thread 1 (process 7158):
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x40148541 in raise () from /lib/tls/libc.so.6
#2  0x40149dbb in abort () from /lib/tls/libc.so.6
#3  0x0814722b in mono_handle_native_sigsegv ()
#4  0x081354cf in mono_codegen ()
#5  <signal handler called>
#6  0x080fe00e in mono_pthread_key_for_tls ()
#7  0x080fc552 in mono_pthread_key_for_tls ()
#8  0x08102a66 in mono_once ()
#9  0x080fc68c in mono_pthread_key_for_tls ()
#10 0x080daec9 in mono_runtime_init ()
#11 0x08136431 in mono_codegen ()
#12 0x0805ce94 in mono_main ()
#13 0x0805bf3b in ?? ()
#14 0x00000008 in ?? ()
#15 0xbfde2764 in ?? ()
#16 0xbfde2738 in ?? ()
#17 0x40135ea0 in __libc_start_main () from /lib/tls/libc.so.6
#18 0x40135ea0 in __libc_start_main () from /lib/tls/libc.so.6
#19 0x0805be91 in ?? ()




---- Additional Comments From dick@ximian.com 2005-12-09 12:25:50 MST ----

Are you running this on PPC?

I've been unable to reproduce the crash on x86 (tried with smp
hardware too) so I've been examining the code to try and spot the
smoking gun.  mono_pthread_key_for_tls () is only called from mini-ppc.c.

When you produced the backtrace, were there any other threads running?




---- Additional Comments From trow@ximian.com 2005-12-09 12:58:15 MST ----

No, this is on x86, on a single CPU system (my thinkpad).  Those were
the only two threads in my backtrace.

This crash is very sensitive to timing -- trivial perturbations of my
code (e.g. adding and removing Console.WriteLines) cause it to appear
and disappear, and change its frequency.  I'll try to put together a
simple example that reliably reproduces it.



---- Additional Comments From dick@ximian.com 2005-12-14 12:08:19 MST ----

*** https://bugzilla.novell.com/show_bug.cgi?id=MONO76731 has been marked as a duplicate of this bug. ***



---- Additional Comments From joeshaw@novell.com 2006-02-09 15:25:33 MST ----

FWIW, I haven't seen this in quite some time.  I am fine with closing
it at this point.



---- Additional Comments From dick@ximian.com 2006-02-10 09:26:22 MST ----

Someone posted a new bug recently that might help track this down: https://bugzilla.novell.com/show_bug.cgi?id=MONO77393

I've been busy with other things, but I'll get to it soon!



---- Additional Comments From joeshaw@novell.com 2006-02-28 16:53:45 MST ----

Hmm, not so fast on closing this one, I'm afraid.  A Beagle user just
hit this with 1.1.13.2.  Its appearance has definitely declined in
frequency; I haven't seen it in a while.  Anyway, hopefully this is a
dup of 77393 as you mentioned.



---- Additional Comments From dick@ximian.com 2006-03-03 11:17:14 MST ----

Setting as a duplicate of 77393 (which is currently NEEDINFO as it
doesn't happen for me on SVN head.)

*** This bug has been marked as a duplicate of https://bugzilla.novell.com/show_bug.cgi?id=MONO77393 ***


Unknown operating system unknown. Setting to default OS "Other".
This bug was marked DUPLICATE in the database it was moved from.
    Changing resolution to "MOVED"