Bugzilla – Bug 324531
Dead Lock with thread abort/locks in signal handlers
Last modified: 2009-01-27 18:44:06 UTC
---- Reported by james@stev.org 2007-06-11 11:25:39 MST ---- Please fill in this template when reporting a bug, unless you know what you are doing. Description of Problem: Steps to reproduce the problem: 1. Run The Code Below Actual Results: This locks up somewhere in the mysql libs and never returns from strace i beleave it was happening on either the creation or the destruction of a thread. It does not occur in MS .NET on windows with the same mysql server and driver. Expected Results: To run Indefinatly How often does this happen? Randomly but within 5 - 10 minutes for me Additional Information: Code using System; using System.Collections.Generic; using System.Text; using MySql.Data.MySqlClient; namespace Test { class Program { static void Main(string[] args) { try { MySqlConnection mysql; mysql = new MySqlConnection("host=etc,,,"); int i = 0; mysql.Open(); while (true) { MySqlCommand sql = new MySqlCommand("select id from alias where userid = ?userid and gametypeid = ?gametypeid and centerid = ?centerid and name = ?name"); MySqlDataReader rec; sql.Parameters.AddWithValue("?userid", 1409); sql.Parameters.AddWithValue("?gametypeid", 1); sql.Parameters.AddWithValue("?centerid", 1); sql.Parameters.AddWithValue("?name", "jcjfdjal"); sql.Connection = mysql; //sql.Prepare(); rec = sql.ExecuteReader(); rec.Close(); Console.WriteLine(" Count = " + i); i++; } } catch (Exception e) { Console.WriteLine(e.Message); Console.WriteLine(e.StackTrace); } } } } ---- Additional Comments From lupus@ximian.com 2007-07-04 12:14:08 MST ---- What kind of box are you using? x86 smp? If you could also include a sample db that can be used to reproduce with this program it would be great. ---- Additional Comments From james@stev.org 2007-07-05 05:15:00 MST ---- The table i am querying looks like this CREATE TABLE `alias` ( `id` int(10) unsigned NOT NULL auto_increment, `userid` int(10) unsigned NOT NULL, `gametypeid` int(11) NOT NULL, `centerid` int(11) NOT NULL, `name` varchar(128) NOT NULL, `count` int(10) unsigned NOT NULL, `lastseen` datetime NOT NULL, PRIMARY KEY (`id`), UNIQUE KEY `userid` (`userid`,`gametypeid`,`name`,`centerid`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ; The machine is processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 8 model name : AMD Athlon(tm) XP 2600+ stepping : 1 cpu MHz : 2088.059 cache size : 256 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow ts bogomips : 4179.16 Single Processor The same bug does not happen when running on MS .Net 2.0 in windows using the database remotely to the linux machine nor on its local machine running a database. When running on either a local or a remote database the problem will occur when running in mono. I cannot get the bug to occur with tracing enabled on mono which would suggest to me its going to be hard to reproduce on a different system and is possibly some kind of race. ---- Additional Comments From james@stev.org 2007-07-16 07:44:56 MST ---- More information Somebody on irc asked me to append the following After sending a QUIT signal to mono the following is dumped. Full thread dump: "" tid=0x0xb6e99bb0 this=0x0x60d10: "" tid=0x0xb7cd8aa0 this=0x0x21e00: I am able to trigger this bug on various version of mono / mysql libs. Mysql Lib tested with are 5.0.3.0 through 5.0.7.0 All of which seem to cause the same deadlock. I have also tested this against mono 1.2.4 and various svn HEAD trees in the last 2/3 weeks. Last tested against revision 82024 I am pritty sure this is not a mysql lib issue as the mono runtime always locks up. I have tried to use the info on http://www.mono-project.com/Debugging to attempt to get a stack trace to look into this issue further but most of the internal functions stop on mono_domain_lock I cannot trigger the deadlock with tracing enabled. The deadlock does not trigger on windows. Under MS .NET Where next ? ---- Additional Comments From james@stev.org 2007-07-16 09:11:07 MST ---- Grendel on irc suggested that i turn on the thread debugging and try to retrigger the bug. This is the last of the output it produced ** Message: _wapi_thread_handle_from_id: Returning 0x127631 for self thread -1365808208 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127631 for self thread -1365808208 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127631 for self thread -1365808208 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127631 for self thread -1365808208 from TLS ** Message: _wapi_thread_set_termination_details: Thread 0x127631 terminating ** Message: thread_start_routine: started thread id -1236284496 ** Message: CreateThread: Started thread handle 0x127635 ID -1236284496 ** Message: _wapi_thread_handle_from_id: Returning 0x403 for self thread -1211114848 from TLS ** Message: _wapi_thread_abandon_mutexes: Thread 0x127631 abandoning held mutexes ** Message: _wapi_thread_set_termination_details: Recording thread handle 0x127631 id -1365808208 status as -1 ** Message: _wapi_thread_handle_from_id: Returning 0x127635 for self thread -1236284496 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127635 for self thread -1236284496 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127635 for self thread -1236284496 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127635 for self thread -1236284496 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127635 for self thread -1236284496 from TLS Count = 603918 ** Message: _wapi_thread_handle_from_id: Returning 0x127635 for self thread -1236284496 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127635 for self thread -1236284496 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127635 for self thread -1236284496 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127635 for self thread -1236284496 from TLS ** Message: _wapi_thread_set_termination_details: Thread 0x127635 terminating ** Message: thread_start_routine: started thread id -1365808208 ** Message: CreateThread: Started thread handle 0x127639 ID -1365808208 ** Message: _wapi_thread_handle_from_id: Returning 0x403 for self thread -1211114848 from TLS ** Message: _wapi_thread_abandon_mutexes: Thread 0x127635 abandoning held mutexes ** Message: _wapi_thread_set_termination_details: Recording thread handle 0x127635 id -1236284496 status as -1 ** Message: _wapi_thread_handle_from_id: Returning 0x127639 for self thread -1365808208 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127639 for self thread -1365808208 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127639 for self thread -1365808208 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127639 for self thread -1365808208 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127639 for self thread -1365808208 from TLS Count = 603919 ** Message: _wapi_thread_handle_from_id: Returning 0x127639 for self thread -1365808208 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127639 for self thread -1365808208 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127639 for self thread -1365808208 from TLS ** Message: _wapi_thread_handle_from_id: Returning 0x127639 for self thread -1365808208 from TLS ** Message: _wapi_thread_set_termination_details: Thread 0x127639 terminating ** Message: thread_start_routine: started thread id -1236284496 ** Message: CreateThread: Started thread handle 0x12763d ID -1236284496 ** Message: _wapi_thread_handle_from_id: Returning 0x403 for self thread -1211114848 from TLS ** Message: _wapi_thread_abandon_mutexes: Thread 0x127639 abandoning held mutexes ** Message: _wapi_thread_set_termination_details: Recording thread handle 0x127639 id -1365808208 status as -1 ---- Additional Comments From lupus@ximian.com 2007-07-16 10:48:13 MST ---- gdb is not very useful for me, because it doesn't produce backtraces. I decode a couple of stack dumps and one seemd stuck in the mono_domain_lock() in object_register_finalizer() (seemigly as the result of SuppressFinalizer in the Timer.cs Dispose method). Another thread was inside a sigusr1 signal handler and stuck in the mono_domain_lock call in mono_jit_info_table_find(). Using a lock-free mono_jit_info_table_find() allows the program to continue for hundreds of thousands of iterations at least (I just stopped it after a few minutes). ---- Additional Comments From james@stev.org 2007-07-16 11:08:12 MST ---- This too will also cause a deadlock in the same way for me but without the mysql libs being involved. class Temp { public void Run() { while (true) Thread.Sleep(1000); } } class Program { static void Main(string[] args) { try { Thread Thd; Temp Tmp = new Temp(); int i = 0; while (true) { Thd = new Thread(new ThreadStart(Tmp.Run)); Thd.Start(); Thd.Abort(); Console.WriteLine(" Count = " + i); i++; } } catch (Exception e) { Console.WriteLine(e.Message); } Console.WriteLine("Exiting"); } ---- Additional Comments From lupus@ximian.com 2007-07-16 12:03:27 MST ---- The new test case just confirms the same pattern: it produces 4 threads with the following manually decoded traces at the time of the deadlock: thread1: (main thread) mono_domain_lock () mono_jit_find_compiled_method () mono_create_jump_trampoline () mono_ldftn() thread2: (started thread) mono_domain_lock() mono_jit_info_table_find () sigusr1_signal_handler () thread3: (finalizer thread) timedwait_signal_poll_cond () _wapi_handle_timedwait_signal_handle () _wapi_handle_wait_signal_handle () WaitForSingleObjectEx () thread4: (io-layer thread) start_thread() There should be no deadlock with just the above calls, since one thread should block while the other takes the domain lock and proceeeds. The fact that both blocked may mean that a thread has been aborted with the domain lock held: which looks like this: lock = {depth = 0, mutex = {__data = {__lock = 2, __count = 0, __owner = 0, __kind = 1, __nusers = 0, {__spins = 0, __list = {__next = 0x0}}}, __size = "\002", '\0' <repeats 11 times>, "\001\000\000\000\000\000\000\000\000\000\000", __align = 2}} ---- Additional Comments From lupus@ximian.com 2007-08-30 13:27:13 MST ---- The issue with this test was that inside the signal handler used to notify the abort request we took a lock. The same lock was being taken by some other thread. Indeed, pthread_mutex_lock is not one of the functions that are allowed in a signal handler. The fix required quite a bit of work, though luckily Mark did it all:) We now can lookup the runtime info we need to know about in the signal handler without taking locks. There are still a few more cases where we currently take a lock in signal handlers (for example by way of QueueUserAPC), but the chances of hitting a deadlock in those case is pretty low, still we'll need to fix them eventually, so I'm just downgrading the priority of this bug. Unknown bug field "cf_op_sys_details" encountered while moving bug <cf_op_sys_details>Etch with mono 1.2.4</cf_op_sys_details> Unknown bug field "cf_version_details" encountered while moving bug <cf_version_details>svn HEAD</cf_version_details>
Since FogBugz uses Mono for its search service, and since it is inconvenient for us to ask our customers to upgrade to Mono 1.2.6, we created a version of the MySql Net/Connector that appears to successfully work around this bug by using socket timeouts rather than timers. The .dll, a tar of the complete code, and a diff file of the changes are available here, if it's useful to anyone else: http://www.fogcreek.com/FogBugz/kb/setup/ModifyingMySQL.Data.html
These issues were fixed a long time ago in 2.0.
It looks like the last test case still crashes on HEAD.
The testcase crash is now fixed.