Bug 318041 (MONO75007) - [PATCH] Deadlock with --debug
Summary: [PATCH] Deadlock with --debug
Status: RESOLVED FIXED
Alias: MONO75007
Product: Mono: Runtime
Classification: Mono
Component: debug
Version: 1.1
Hardware: Other Other
Importance: P3 - Medium : Major
Target Milestone: ---
Assignee: Paolo Molaro
QA Contact: Mono Bugs
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-05-20 20:19 UTC by Joe Shaw
Modified: 2007-09-15 21:24 UTC

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
stack trace (6.93 KB, text/plain)
2005-05-20 20:19 UTC, Thomas Wiest
stack trace w/ symbols (13.18 KB, text/plain)
2005-05-23 22:28 UTC, Thomas Wiest
Trace from my box (13.59 KB, text/plain)
2005-06-11 01:50 UTC, Thomas Wiest
Patch from gonz (1.44 KB, patch)
2005-06-11 03:07 UTC, Thomas Wiest
stack traces (36.01 KB, text/plain)
2005-06-22 22:13 UTC, Thomas Wiest
fix up this issue (1.34 KB, patch)
2005-06-22 22:49 UTC, Thomas Wiest
Stack trace (23.60 KB, text/plain)
2005-06-23 00:56 UTC, Thomas Wiest
stack trace (4.55 KB, text/plain)
2005-07-06 21:58 UTC, Thomas Wiest
Patch. (2.74 KB, patch)
2005-07-20 06:42 UTC, Thomas Wiest
A patch that actually compiles and runs ;-) (4.93 KB, patch)
2005-07-21 20:58 UTC, Thomas Wiest

Description Thomas Wiest 2007-09-15 19:18:05 UTC


---- Reported by joeshaw@novell.com 2005-05-20 13:19:05 MST ----

Sorry, I'm not sure if this is a JIT bug or an IO layer bug.

With the new messaging system we're using in Beagle, we are frequently
getting hangs in a constructor for a class which handles socket connections
on a Unix domain socket.  The constructor itself is very simple, it just
sets one field based on its parameters -- one line.  We've tested 1.1.7 and
all the 1.1.7.x releases.  It happens on all of them, but not on 1.1.4 or
1.1.6 (but does on 1.1.6.2, apparently).  It's serious enough that we feel
we can't release the next version of Beagle without a resolution.

Here's the interesting bit: after it hangs I attach gdb to it.  I try
calling mono_print_method_from_ip() on some of the unknown stack frames, and
it hangs on all of them.  So maybe this is an internal locking problem?

I'll attach the stack traces.  For some reason 'thread apply all bt' didn't
quite work.



---- Additional Comments From joeshaw@novell.com 2005-05-20 13:19:36 MST ----

Created an attachment (id=167976)
stack trace




---- Additional Comments From vargaz@gmail.com 2005-05-20 13:59:05 MST ----

This stack trace, like the others, is not very usable since it was made
from an executable without debug info (and probably stripped too).




---- Additional Comments From joeshaw@novell.com 2005-05-20 14:06:14 MST ----

OK, I'll try to dup with svn HEAD



---- Additional Comments From joeshaw@novell.com 2005-05-23 15:27:30 MST ----

Ok, duplicated it, attaching the trace



---- Additional Comments From joeshaw@novell.com 2005-05-23 15:28:02 MST ----

Created an attachment (id=167977)
stack trace w/ symbols




---- Additional Comments From joeshaw@novell.com 2005-05-23 16:56:34 MST ----

I'm able to duplicate this fairly reliably now, and the top frames of
the stack traces are all the same for all the threads in the 3 times
I've attached gdb to it so far.




---- Additional Comments From joeshaw@novell.com 2005-05-24 14:46:23 MST ----

Some additional info: this only appears to happen on SMP.  My SUSE 9.3
box is SMP and it hangs every time.  When I went to duplicate this on
Ubuntu, I couldn't at first and I noticed that they ship an i386 UP
kernel by default.  After I installed the i686 SMP kernel it started
happening there too.



---- Additional Comments From bmaurer@users.sf.net 2005-06-10 18:50:49 MST ----

Created an attachment (id=167978)
Trace from my box




---- Additional Comments From bmaurer@users.sf.net 2005-06-10 18:52:50 MST ----

I got a very similar stack trace. Note threads 1 and 2:

Thread 1:
mono_class_init             -- Holds loader lock
mono_domain_assembly_search -- Holds domain lock

Thread 2:
mono_method_get_object      -- Holds domain lock
mono_class_from_name        -- Holds loader lock
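
The pattern in these two threads can be reproduced in isolation. The following is a minimal sketch using POSIX threads, not Mono code: `loader_lock` and `domain_lock` are hypothetical stand-ins for the runtime's locks, and which lock each thread takes first is an assumption made for illustration. `pthread_mutex_trylock` is used in place of a blocking lock so that the program reports the inversion instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins for the runtime's loader and domain locks. */
static pthread_mutex_t loader_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t domain_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t rendezvous;

struct worker {
    pthread_mutex_t *first;   /* lock taken first */
    pthread_mutex_t *second;  /* lock wanted while holding the first */
    int second_rc;            /* result of trying to take the second lock */
};

static void *take_two_locks(void *arg)
{
    struct worker *w = arg;

    pthread_mutex_lock(w->first);
    /* Wait until both threads hold their first lock. */
    pthread_barrier_wait(&rendezvous);

    /* A blocking lock here would hang forever; trylock just reports it. */
    w->second_rc = pthread_mutex_trylock(w->second);

    /* Keep holding the first lock until both threads have tried. */
    pthread_barrier_wait(&rendezvous);

    if (w->second_rc == 0)
        pthread_mutex_unlock(w->second);
    pthread_mutex_unlock(w->first);
    return NULL;
}

/* Returns 1 if each thread was refused the lock the other one holds. */
int lock_order_inversion_detected(void)
{
    struct worker t1 = { &loader_lock, &domain_lock, 0 };
    struct worker t2 = { &domain_lock, &loader_lock, 0 };
    pthread_t a, b;

    pthread_barrier_init(&rendezvous, NULL, 2);
    pthread_create(&a, NULL, take_two_locks, &t1);
    pthread_create(&b, NULL, take_two_locks, &t2);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    pthread_barrier_destroy(&rendezvous);

    return t1.second_rc == EBUSY && t2.second_rc == EBUSY;
}
```

Calling `lock_order_inversion_detected()` from a test program shows both threads stuck on their second lock. With blocking `pthread_mutex_lock` calls, that same state is the hang.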




---- Additional Comments From bmaurer@users.sf.net 2005-06-10 20:07:20 MST ----

Created an attachment (id=167979)
Patch from gonz




---- Additional Comments From bmaurer@users.sf.net 2005-06-10 20:07:53 MST ----

This is a patch from gonz that seems to work here, but since it's a
race, it might be that I haven't triggered it yet...



---- Additional Comments From gonzalo@ximian.com 2005-06-11 01:12:06 MST ----

Btw, that patch is not meant to get into svn, but just to show that
the problem is that the hook functions called while holding the
assemblies_mutex also want to lock the appdomain, leading to this problem.



---- Additional Comments From bmaurer@users.sf.net 2005-06-17 00:33:19 MST ----

*** https://bugzilla.novell.com/show_bug.cgi?id=MONO74455 has been marked as a duplicate of this bug. ***



---- Additional Comments From bmaurer@users.sf.net 2005-06-17 00:42:46 MST ----

*** https://bugzilla.novell.com/show_bug.cgi?id=MONO75050 has been marked as a duplicate of this bug. ***



---- Additional Comments From bmaurer@users.sf.net 2005-06-17 00:57:48 MST ----

*** https://bugzilla.novell.com/show_bug.cgi?id=MONO74938 has been marked as a duplicate of this bug. ***



---- Additional Comments From bmaurer@users.sf.net 2005-06-20 16:24:40 MST ----

This is fixed in HEAD, 1.1.7.4, and 1.1.8.1.



---- Additional Comments From james@ximian.com 2005-06-22 15:12:52 MST ----

I'm still seeing this in 1.1.7.6.  Either that or it's a new deadlock,
I'm not sure, will attach stack traces.



---- Additional Comments From james@ximian.com 2005-06-22 15:13:17 MST ----

Created an attachment (id=167980)
stack traces




---- Additional Comments From bmaurer@users.sf.net 2005-06-22 15:27:24 MST ----

Thread 6 (Thread 1109838768 (LWP 25017)):

#12 0x081124f2 in EnterCriticalSection (section=0x883f290) at critical-sections.c:151
#13 0x081124f2 in EnterCriticalSection (section=0x883f28c) at critical-sections.c:151
#14 0x080c8e92 in mono_loader_lock () at loader.c:1331
#15 0x0809a903 in mono_class_from_name (image=0x8858350, name_space=0x87ed831 "System.Reflection", name=0x61b9 <Address 0x61b9 out of bounds>) at class.c:3272
#16 0x080f8077 in mono_method_get_object (domain=0x8896f00, method=0x41480120, refclass=0x41471e60) at reflection.c:5430
#17 0x080d1426 in ves_icall_Type_GetConstructors_internal (type=0x0, bflags=20, reftype=0xfffffffc) at icall.c:2935

Thread 4 (Thread 1112140720 (LWP 25045)):

#12 0x081124f2 in EnterCriticalSection (section=0x8896f08) at critical-sections.c:151
#13 0x081124f2 in EnterCriticalSection (section=0x8896f04) at critical-sections.c:151
#14 0x080f7ec3 in mono_type_get_object (domain=0x8896f00, type=0x61d5) at reflection.c:5354
#15 0x080ef582 in reflection_methodbuilder_from_ctor_builder (rmb=0x4249dc70, mb=0x91b1880) at reflection.c:1312
#16 0x080fd857 in ctorbuilder_to_mono_method (klass=0x9373908, mb=0x91b1880) at reflection.c:7944
#17 0x080fefa3 in ensure_runtime_vtable (klass=0x9373908) at reflection.c:8495
#18 0x080ff931 in mono_reflection_create_runtime_class (tb=0x8951d90) at reflection.c:8736

These two are racing. the second one is doing domain then loader.



---- Additional Comments From bmaurer@users.sf.net 2005-06-22 15:48:46 MST ----

>These two are racing. the second one is doing domain then loader.
I meant loader then domain.



---- Additional Comments From bmaurer@users.sf.net 2005-06-22 15:49:26 MST ----

Created an attachment (id=167981)
fix up this issue




---- Additional Comments From bmaurer@users.sf.net 2005-06-22 15:50:16 MST ----

This patch takes the domain lock in the
mono_reflection_create_runtime_class method. This ensures that when
the domain is locked inside the method, it will be a standard
recursive lock and not deadlock.
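
The idea behind this patch can be sketched with POSIX threads. This is an illustration under stated assumptions, not the patch itself: `loader_lock`, `init_recursive_lock`, `inner_needs_domain` and `create_class_with_domain_held` are hypothetical names, and the domain lock is modelled as a `PTHREAD_MUTEX_RECURSIVE` mutex. The outer function takes the domain lock before the loader lock, so the nested domain lock is only a recursive re-acquisition by the same thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the loader lock. */
static pthread_mutex_t loader_lock = PTHREAD_MUTEX_INITIALIZER;

/* Initialise *lock as a recursive mutex. Returns 0 on success. */
int init_recursive_lock(pthread_mutex_t *lock)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0)
        rc = pthread_mutex_init(lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Inner helper that locks the domain on its own, like the nested call. */
static int inner_needs_domain(pthread_mutex_t *domain_lock)
{
    int rc = pthread_mutex_lock(domain_lock);  /* recursive re-acquire */

    if (rc != 0)
        return rc;
    return pthread_mutex_unlock(domain_lock);
}

/* Outer function: domain first, then loader, matching the other thread. */
int create_class_with_domain_held(pthread_mutex_t *domain_lock)
{
    int rc = pthread_mutex_lock(domain_lock);

    if (rc != 0)
        return rc;
    pthread_mutex_lock(&loader_lock);

    /* Already the owner, so this cannot block behind another thread. */
    rc = inner_needs_domain(domain_lock);

    pthread_mutex_unlock(&loader_lock);
    pthread_mutex_unlock(domain_lock);
    return rc;
}
```

A caller would initialise the domain lock once with `init_recursive_lock` and then call `create_class_with_domain_held` from any thread. Every thread then acquires domain before loader, which removes the inversion.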



---- Additional Comments From thunder@ximian.com 2005-06-22 17:56:11 MST ----

Created an attachment (id=167982)
Stack trace




---- Additional Comments From thunder@ximian.com 2005-06-22 17:57:00 MST ----

I'm still seeing a deadlock.  Trace attached.



---- Additional Comments From bmaurer@users.sf.net 2005-06-22 18:07:24 MST ----

That's a completely different deadlock; it is somewhere in managed code.



---- Additional Comments From miguel@ximian.com 2005-06-22 19:01:48 MST ----

Dan,

The trace you pasted indicates that this is not a runtime bug, this is
another kind of problem.

This seems to be an issue happening on managed-land, we need to get
our hands on the machine, hopefully without bundles so we can add
printfs and so forth.  

Could we get: ip address of the machine, accounts to login and fix and
a recipe to reproduce the bug?





---- Additional Comments From miguel@ximian.com 2005-06-22 19:10:11 MST ----

Gonzalo inspected the trace further. Dan Mill's latest trace is not a
deadlock; it is again a child process whose output is being read, and
it's blocking on the child.

Could you please debug that, and make sure that the problem is on your
end?  Am setting the bug to `fixed' for now. 

Let's open a new bug if we have further information.



---- Additional Comments From naresh@novell.com 2005-06-29 13:53:42 MST ----

Note:  ZLM testing is verifying this in the Mono 1.1.7.7 build.



---- Additional Comments From joeshaw@novell.com 2005-07-06 14:57:09 MST ----

I am seeing another deadlock.  I am not sure it's exactly the same
one, but it hangs at the same point in my app as the domain/loader
lock.  I couldn't get mono_print_method_from_ip() to give me anything
for most of the topmost ?? stack frames.



---- Additional Comments From joeshaw@novell.com 2005-07-06 14:58:36 MST ----

Created an attachment (id=167983)
stack trace




---- Additional Comments From joeshaw@novell.com 2005-07-06 14:59:37 MST ----

oh, yeah, this is on 1.1.8.2 on an SMP box on SUSE 9.3.



---- Additional Comments From bmaurer@users.sf.net 2005-07-06 21:24:52 MST ----

This only happens with --debug



---- Additional Comments From martin@ximian.com 2005-07-07 11:44:28 MST ----

Why did you assign that to me?
Unless someone gives me an SMP machine to test this on, all I can do
about this is close it as WONTFIX.



---- Additional Comments From bmaurer@users.sf.net 2005-07-07 11:48:46 MST ----

You don't need an smp box to test this issue. It is a classic locking
order issue. One thread acquires the locks in order a b, the other in
order b a. You can fix this without being able to reproduce. You just
need to make sure that the debugger takes locks in the correct order.
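
One common way to enforce a single locking order is to give every lock a rank and always take the lower rank first. The sketch below is a generic illustration with POSIX threads, not the debugger code: `ranked_lock`, `lock_both`, `unlock_both` and `run_two_workers` are hypothetical names, and the ranks are assumed to be distinct. The two workers ask for the locks in opposite orders, yet both end up acquiring them in rank order.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/* A mutex with a fixed rank; ranks must be distinct across locks. */
struct ranked_lock {
    pthread_mutex_t mutex;
    int rank;
};

static struct ranked_lock lock_a = { PTHREAD_MUTEX_INITIALIZER, 1 };
static struct ranked_lock lock_b = { PTHREAD_MUTEX_INITIALIZER, 2 };
static long shared_counter;

/* Take both locks, lower rank first, whatever order the caller passed. */
void lock_both(struct ranked_lock *x, struct ranked_lock *y)
{
    struct ranked_lock *first = (x->rank <= y->rank) ? x : y;
    struct ranked_lock *second = (first == x) ? y : x;

    pthread_mutex_lock(&first->mutex);
    pthread_mutex_lock(&second->mutex);
}

/* Release order does not affect deadlock freedom. */
void unlock_both(struct ranked_lock *x, struct ranked_lock *y)
{
    pthread_mutex_unlock(&x->mutex);
    pthread_mutex_unlock(&y->mutex);
}

struct request {
    struct ranked_lock *x;
    struct ranked_lock *y;
};

static void *worker(void *arg)
{
    struct request *r = arg;
    int i;

    for (i = 0; i < 100000; i++) {
        lock_both(r->x, r->y);
        shared_counter++;          /* protected by both locks */
        unlock_both(r->x, r->y);
    }
    return NULL;
}

/* Runs one worker asking for (a, b) and one asking for (b, a). */
long run_two_workers(void)
{
    struct request ab = { &lock_a, &lock_b };
    struct request ba = { &lock_b, &lock_a };
    pthread_t t1, t2;

    shared_counter = 0;
    pthread_create(&t1, NULL, worker, &ab);
    pthread_create(&t2, NULL, worker, &ba);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared_counter;
}
```

If `lock_both` locked `x` then `y` as passed, the same two workers could deadlock in the a b / b a pattern described above.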

Also, there are tons of smp boxes inside the firewall: you can use
hardhat and x86-mono.



---- Additional Comments From martin@ximian.com 2005-07-07 12:28:04 MST ----

OK, so how exactly do I reproduce this?



---- Additional Comments From joeshaw@novell.com 2005-07-07 12:49:23 MST ----

I'd recommend building beagle and running it some number of times.  As
is the nature with a locking race, it's difficult to get it to happen
reliably or distill into a test case.

BenM has built it and run it in the past; he might have a machine for
you to use.  Otherwise, I can help you build it if necessary.



---- Additional Comments From bmaurer@users.sf.net 2005-07-19 23:42:51 MST ----

Created an attachment (id=167984)
Patch.




---- Additional Comments From bmaurer@users.sf.net 2005-07-19 23:44:41 MST ----

This patch changes locking so that we use the debugger lock rather
than the loader lock. The loader lock just can't be used inside the
debugger lock.



---- Additional Comments From bmaurer@users.sf.net 2005-07-21 13:58:06 MST ----

Created an attachment (id=167985)
A patch that actually compiles and runs ;-)




---- Additional Comments From bmaurer@users.sf.net 2005-07-21 13:58:25 MST ----

Sigh, I am really good at submitting the oldest version of the patch
on my hard drive.



---- Additional Comments From joeshaw@novell.com 2005-07-22 15:30:36 MST ----

I've been indexing like crazy for the last three hours, with several
IndexHelper restarts and haven't gotten a single deadlock with this
patch.  Typically by now I would have at least gotten a few.



---- Additional Comments From bmaurer@users.sf.net 2005-07-23 02:38:45 MST ----

Joe reported to me that his beagle process has lasted 13 hours now
without deadlocking. He states that it usually takes less than an hour
to deadlock. So I am pretty sure this patch fixes the issue.

Paolo, can you please review?



---- Additional Comments From joeshaw@novell.com 2005-07-25 10:51:00 MST ----

Woo hoo!  Beagle has been running all weekend without a deadlock,
which is unheard of on this SMP box.



---- Additional Comments From lupus@ximian.com 2005-07-26 12:02:23 MST ----

Ben, please commit with a changelog entry.



---- Additional Comments From bmaurer@users.sf.net 2005-07-26 15:00:27 MST ----

Fixed in HEAD and in 1.1.8.x. *not* in 1.1.7.x, by Miguel's request.

NOTE ABOUT REOPENING THIS BUG
-----------------------------

Please do *NOT* reopen this bug if you find another deadlock. A new
bug should be opened.

