Bugzilla – Bug 318041
[PATCH] Deadlock with --debug
Last modified: 2007-09-15 21:24:46 UTC
---- Reported by joeshaw@novell.com 2005-05-20 13:19:05 MST ----
Sorry, I'm not sure if this is a JIT bug or an IO layer bug. With the new messaging system we're using in Beagle, we are frequently getting hangs in a constructor for a class which handles socket connections on a Unix domain socket. The constructor itself is very simple: it just sets one field based on its parameters, one line. We've tested 1.1.7 and all the 1.1.7.x releases. It happens on all of them, but not on 1.1.4 or 1.1.6 (but does on 1.1.6.2, apparently). It's serious enough that we feel we can't release the next version of Beagle without a resolution.

Here's the interesting bit: after it hangs I attach gdb to it. I try calling mono_print_method_from_ip() on some of the unknown stack frames, and it hangs on all of them. So maybe this is an internal locking problem? I'll attach the stack traces. For some reason 'thread apply all bt' didn't quite work.

---- Additional Comments From joeshaw@novell.com 2005-05-20 13:19:36 MST ----
Created an attachment (id=167976)
stack trace

---- Additional Comments From vargaz@gmail.com 2005-05-20 13:59:05 MST ----
This stack trace, like the others, is not very usable since it was made from an executable without debug info (and probably stripped too).

---- Additional Comments From joeshaw@novell.com 2005-05-20 14:06:14 MST ----
OK, I'll try to dup with svn HEAD.

---- Additional Comments From joeshaw@novell.com 2005-05-23 15:27:30 MST ----
OK, duplicated it, attaching the trace.

---- Additional Comments From joeshaw@novell.com 2005-05-23 15:28:02 MST ----
Created an attachment (id=167977)
stack trace w/ symbols

---- Additional Comments From joeshaw@novell.com 2005-05-23 16:56:34 MST ----
I'm able to duplicate this fairly reliably now, and the top frames of the stack traces are all the same for all the threads in the 3 times I've attached gdb to it so far.

---- Additional Comments From joeshaw@novell.com 2005-05-24 14:46:23 MST ----
Some additional info: this only appears to happen on SMP. My SUSE 9.3 box is SMP and it hangs every time. When I went to duplicate this on Ubuntu, I couldn't at first, and I noticed that they ship an i386 UP kernel by default. After I installed the i686 SMP kernel it started happening there too.

---- Additional Comments From bmaurer@users.sf.net 2005-06-10 18:50:49 MST ----
Created an attachment (id=167978)
Trace from my box

---- Additional Comments From bmaurer@users.sf.net 2005-06-10 18:52:50 MST ----
I got a very similar stack trace. Note threads 1 and 2:

Thread 1:
  mono_class_init -- Holds loader lock
  mono_domain_assembly_search -- Holds domain lock
Thread 2:
  mono_method_get_object -- Holds domain lock
  mono_class_from_name -- Holds loader lock

---- Additional Comments From bmaurer@users.sf.net 2005-06-10 20:07:20 MST ----
Created an attachment (id=167979)
Patch from gonz

---- Additional Comments From bmaurer@users.sf.net 2005-06-10 20:07:53 MST ----
This is a patch from gonz that seems to work here, but since it's a race, it might be that I haven't triggered it yet...

---- Additional Comments From gonzalo@ximian.com 2005-06-11 01:12:06 MST ----
Btw, that patch is not meant to get into svn, but just to show that the problem is that the hook functions called while holding the assemblies_mutex also want to lock the appdomain, leading to this problem.

---- Additional Comments From bmaurer@users.sf.net 2005-06-17 00:33:19 MST ----
*** https://bugzilla.novell.com/show_bug.cgi?id=MONO74455 has been marked as a duplicate of this bug. ***

---- Additional Comments From bmaurer@users.sf.net 2005-06-17 00:42:46 MST ----
*** https://bugzilla.novell.com/show_bug.cgi?id=MONO75050 has been marked as a duplicate of this bug.
***

---- Additional Comments From bmaurer@users.sf.net 2005-06-17 00:57:48 MST ----
*** https://bugzilla.novell.com/show_bug.cgi?id=MONO74938 has been marked as a duplicate of this bug. ***

---- Additional Comments From bmaurer@users.sf.net 2005-06-20 16:24:40 MST ----
This is fixed in HEAD, 1.1.7.4, and 1.1.8.1.

---- Additional Comments From james@ximian.com 2005-06-22 15:12:52 MST ----
I'm still seeing this in 1.1.7.6. Either that or it's a new deadlock, I'm not sure, will attach stack traces.

---- Additional Comments From james@ximian.com 2005-06-22 15:13:17 MST ----
Created an attachment (id=167980)
stack traces

---- Additional Comments From bmaurer@users.sf.net 2005-06-22 15:27:24 MST ----
Thread 6 (Thread 1109838768 (LWP 25017)):
#12 0x081124f2 in EnterCriticalSection (section=0x883f290) at critical-sections.c:151
#13 0x081124f2 in EnterCriticalSection (section=0x883f28c) at critical-sections.c:151
#14 0x080c8e92 in mono_loader_lock () at loader.c:1331
#15 0x0809a903 in mono_class_from_name (image=0x8858350, name_space=0x87ed831 "System.Reflection", name=0x61b9 <Address 0x61b9 out of bounds>) at class.c:3272
#16 0x080f8077 in mono_method_get_object (domain=0x8896f00, method=0x41480120, refclass=0x41471e60) at reflection.c:5430
#17 0x080d1426 in ves_icall_Type_GetConstructors_internal (type=0x0, bflags=20, reftype=0xfffffffc) at icall.c:2935

Thread 4 (Thread 1112140720 (LWP 25045)):
#12 0x081124f2 in EnterCriticalSection (section=0x8896f08) at critical-sections.c:151
#13 0x081124f2 in EnterCriticalSection (section=0x8896f04) at critical-sections.c:151
#14 0x080f7ec3 in mono_type_get_object (domain=0x8896f00, type=0x61d5) at reflection.c:5354
#15 0x080ef582 in reflection_methodbuilder_from_ctor_builder (rmb=0x4249dc70, mb=0x91b1880) at reflection.c:1312
#16 0x080fd857 in ctorbuilder_to_mono_method (klass=0x9373908, mb=0x91b1880) at reflection.c:7944
#17 0x080fefa3 in ensure_runtime_vtable (klass=0x9373908)
at reflection.c:8495
#18 0x080ff931 in mono_reflection_create_runtime_class (tb=0x8951d90) at reflection.c:8736

These two are racing. The second one is doing domain then loader.

---- Additional Comments From bmaurer@users.sf.net 2005-06-22 15:48:46 MST ----
> These two are racing. The second one is doing domain then loader.

I meant loader then domain.

---- Additional Comments From bmaurer@users.sf.net 2005-06-22 15:49:26 MST ----
Created an attachment (id=167981)
fix up this issue

---- Additional Comments From bmaurer@users.sf.net 2005-06-22 15:50:16 MST ----
This patch takes the domain lock in the mono_reflection_create_runtime_class method. This ensures that when the domain is locked inside the method, it will be a standard recursive lock and not deadlock.

---- Additional Comments From thunder@ximian.com 2005-06-22 17:56:11 MST ----
Created an attachment (id=167982)
Stack trace

---- Additional Comments From thunder@ximian.com 2005-06-22 17:57:00 MST ----
I'm still seeing a deadlock. Trace attached.

---- Additional Comments From bmaurer@users.sf.net 2005-06-22 18:07:24 MST ----
That's a completely different deadlock; it is somewhere in managed code.

---- Additional Comments From miguel@ximian.com 2005-06-22 19:01:48 MST ----
Dan,

The trace you pasted indicates that this is not a runtime bug; this is another kind of problem. This seems to be an issue happening in managed-land. We need to get our hands on the machine, hopefully without bundles, so we can add printfs and so forth.

Could we get: the IP address of the machine, accounts to log in and fix, and a recipe to reproduce the bug?

---- Additional Comments From miguel@ximian.com 2005-06-22 19:10:11 MST ----
Gonzalo inspected the trace further. Dan Mills' latest trace is not a deadlock; it is again a child process whose output is being read, and it's blocking on the child.

Could you please debug that, and make sure that the problem is on your end?

Am setting the bug to `fixed' for now.
Let's open a new bug if we have further information.

---- Additional Comments From naresh@novell.com 2005-06-29 13:53:42 MST ----
Note: ZLM testing is verifying this in the Mono 1.1.7.7 build.

---- Additional Comments From joeshaw@novell.com 2005-07-06 14:57:09 MST ----
I am seeing another deadlock. I am not sure it's exactly the same one, but it hangs at the same point in my app as the domain/loader lock. I couldn't get mono_print_method_from_ip() to give me anything for most of the topmost ?? stack frames.

---- Additional Comments From joeshaw@novell.com 2005-07-06 14:58:36 MST ----
Created an attachment (id=167983)
stack trace

---- Additional Comments From joeshaw@novell.com 2005-07-06 14:59:37 MST ----
Oh, yeah, this is on 1.1.8.2 on an SMP box on SUSE 9.3.

---- Additional Comments From bmaurer@users.sf.net 2005-07-06 21:24:52 MST ----
This only happens with --debug

---- Additional Comments From martin@ximian.com 2005-07-07 11:44:28 MST ----
Why did you assign that to me? Unless someone gives me an SMP machine to test this on, all I can do about this is close it as WONTFIX.

---- Additional Comments From bmaurer@users.sf.net 2005-07-07 11:48:46 MST ----
You don't need an SMP box to test this issue. It is a classic locking order issue. One thread acquires the locks in order a b, the other in order b a. You can fix this without being able to reproduce. You just need to make sure that the debugger takes locks in the correct order.

Also, there are tons of SMP boxes inside the firewall: you can use hardhat and x86-mono.

---- Additional Comments From martin@ximian.com 2005-07-07 12:28:04 MST ----
OK, so how exactly do I reproduce this?

---- Additional Comments From joeshaw@novell.com 2005-07-07 12:49:23 MST ----
I'd recommend building beagle and running it some number of times. As is the nature of a locking race, it's difficult to get it to happen reliably or to distill it into a test case. BenM has built it and run it in the past; he might have a machine for you to use.
Otherwise, I can help you build it if necessary.

---- Additional Comments From bmaurer@users.sf.net 2005-07-19 23:42:51 MST ----
Created an attachment (id=167984)
Patch.

---- Additional Comments From bmaurer@users.sf.net 2005-07-19 23:44:41 MST ----
This patch changes locking so that we use the debugger lock rather than the loader lock. The loader lock just can't be used inside the debugger lock.

---- Additional Comments From bmaurer@users.sf.net 2005-07-21 13:58:06 MST ----
Created an attachment (id=167985)
A patch that actually compiles and runs ;-)

---- Additional Comments From bmaurer@users.sf.net 2005-07-21 13:58:25 MST ----
Sigh, I am really good at submitting the oldest version of the patch on my hard drive.

---- Additional Comments From joeshaw@novell.com 2005-07-22 15:30:36 MST ----
I've been indexing like crazy for the last three hours, with several IndexHelper restarts, and haven't gotten a single deadlock with this patch. Typically by now I would have at least gotten a few.

---- Additional Comments From bmaurer@users.sf.net 2005-07-23 02:38:45 MST ----
Joe reported to me that his beagle process has lasted 13 hours now without deadlocking. He states that it usually takes less than an hour to deadlock. So I am pretty sure this patch fixes the issue.

Paolo, can you please review?

---- Additional Comments From joeshaw@novell.com 2005-07-25 10:51:00 MST ----
Woo hoo! Beagle has been running all weekend without a deadlock, which is unheard of on this SMP box.

---- Additional Comments From lupus@ximian.com 2005-07-26 12:02:23 MST ----
Ben, please commit with a changelog entry.

---- Additional Comments From bmaurer@users.sf.net 2005-07-26 15:00:27 MST ----
Fixed in HEAD and in 1.1.8.x. *not* in 1.1.7.x, by Miguel's request.

NOTE ABOUT REOPENING THIS BUG
-----------------------------
Please do *NOT* reopen this bug if you find another deadlock. A new bug should be opened.
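
For readers unfamiliar with the lock-order problem discussed in this thread (one thread taking domain then loader, the other loader then domain), here is a minimal sketch in C with POSIX threads. It is not the Mono patch: domain_lock, loader_lock, and every function name below are hypothetical stand-ins. It only illustrates the idea described in the 2005-06-22 comment, where the outer function takes the domain lock up front so that the later acquisition inside the helper is a recursive re-entry and both threads agree on the order domain, then loader.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>

#define ITERATIONS 100000

/* Hypothetical stand-ins for the runtime's two locks. */
static pthread_mutex_t domain_lock;                             /* recursive, set up below */
static pthread_mutex_t loader_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;                                     /* guarded by domain_lock */

/* The domain lock must be recursive so a holder can re-enter it. */
static void init_domain_lock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&domain_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

/* Helper that always takes the domain lock itself. */
static void helper_needing_domain(void)
{
    pthread_mutex_lock(&domain_lock);   /* recursive if the caller already holds it */
    shared_counter++;
    pthread_mutex_unlock(&domain_lock);
}

/* Path A: domain first, then loader. */
static void *domain_then_loader(void *unused)
{
    (void)unused;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&domain_lock);
        pthread_mutex_lock(&loader_lock);
        shared_counter++;
        pthread_mutex_unlock(&loader_lock);
        pthread_mutex_unlock(&domain_lock);
    }
    return NULL;
}

/* Path B: without the first lock call below, this path would take the
 * loader lock and then the domain lock (inside the helper), the reverse
 * of path A. Taking the domain lock up front restores a single order. */
static void *loader_path_fixed(void *unused)
{
    (void)unused;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&domain_lock);   /* added: same order as path A */
        pthread_mutex_lock(&loader_lock);
        helper_needing_domain();
        pthread_mutex_unlock(&loader_lock);
        pthread_mutex_unlock(&domain_lock);
    }
    return NULL;
}

/* Runs both paths concurrently and returns the final counter value. */
long run_lock_order_demo(void)
{
    pthread_t a, b;
    int rc;

    init_domain_lock();
    shared_counter = 0;

    rc = pthread_create(&a, NULL, domain_then_loader, NULL);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, loader_path_fixed, NULL);
    assert(rc == 0);

    pthread_join(a, NULL);
    pthread_join(b, NULL);

    long result = shared_counter;
    pthread_mutex_destroy(&domain_lock);
    return result;
}
```

Calling run_lock_order_demo() from a main() and building with -pthread should complete without hanging. Removing the first pthread_mutex_lock(&domain_lock) call (and its matching unlock) in loader_path_fixed brings back the a-b / b-a ordering described in the thread, which can deadlock on a multiprocessor machine.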