Bugzilla – Full Text Bug Listing
| Summary: | RC4: Installation bails out with "cannot start novell-zmd" even though zmd is up and running | | |
|---|---|---|---|
| Product: | [openSUSE] openSUSE 10.2 | Reporter: | Stefan Horn <shorn> |
| Component: | Installation | Assignee: | Guruprasad Sathyamurthy <guruprasad.s> |
| Status: | RESOLVED WONTFIX | QA Contact: | Jiri Srain <jsrain> |
| Severity: | Critical | | |
| Priority: | P5 - None | CC: | aj, guruprasad.s, kkaempf, mt |
| Version: | RC 4 | | |
| Target Milestone: | --- | | |
| Hardware: | Other | | |
| OS: | Other | | |
| Whiteboard: | | | |
| Found By: | Other | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Bug Depends on: | | | |
| Bug Blocks: | 225131 | | |
| Attachments: | y2log, zmd-messages.log, y2log - another SEGV in zmd | | |
Description
Stefan Horn
2006-11-30 09:26:00 UTC
Created attachment 107541 [details]
y2log
Created attachment 107542 [details]
zmd-messages.log
Actually re-doing the following fails, too:

    for ((i=0; i<100; i++)); do echo $i; rug ping > /dev/null; done

This is the error message which appeared in attempt number 80:

    **ERROR**: file marshal.c: line 7649 (mono_marshal_get_native_wrapper): assertion failed: (pinvoke)
    aborting...
    **ERROR**: file marshal.c: line 7649 (mono_marshal_get_native_wrapper): assertion failed: (pinvoke)
    aborting...
    ==============================================
    Got a SIGABRT while executing native code. This usually indicates
    a fatal error in the mono runtime or one of the native libraries
    used by your application.
    ==============================================

And redoing it again leads to another failure message:

    rug ping > /dev/null
    ** ERROR **: file loader.c: line 1699 (mono_method_get_wrapper_data): assertion failed: (id <= GPOINTER_TO_UINT (*data))
    aborting...
    ** ERROR **: file loader.c: line 1699 (mono_method_get_wrapper_data): assertion failed: (id <= GPOINTER_TO_UINT (*data))
    aborting...
    ==============================================
    Got a SIGABRT while executing native code. This usually indicates
    a fatal error in the mono runtime or one of the native libraries
    used by your application.
    ==============================================

And redoing it again leads to a third failure message:

    ERROR: Invalid IL code in Novell.Zenworks.Zmd.Public.UnixMessageIO:SendString (System.IO.Stream,string,byte[]): IL_0033: callvirt 0x0a0000f7

The follow-ups might be a result of the first error. I would do a "rczmd restart; rug refresh" to see whether it can be reproduced.

*** Bug 224849 has been marked as a duplicate of this bug. ***

Created attachment 107573 [details]
y2log - another SEGV in zmd
At first glance it looks like a mono issue. We tried replicating the issue on our setup (RC 2) and could not hit it.

Did anything change in mono from RC 2 (mono-core-1.1.18.1-9) to RC 4 (mono-core-1.1.18.1-10)?

Last 2 changes in mono-core, so nothing since RC1:

    -------------------------------------------------------------------
    Tue Nov 14 16:58:40 CET 2006 - meissner@suse.de
    - Disable executable stack option. #65536
    -------------------------------------------------------------------
    Sat Oct 21 01:54:52 CEST 2006 - wberrier@suse.de
    - Remove glib2-devel from mono-nunit, not sure why it was ever there (bnc #210224)
    - Updated to 1.1.18.1
      - removed upstream patches
      - C# Generics fixes
      - IO Layer changes to ease windows porting migration
      - Security updates: major speed improvements
      - Lots of Winforms fixes and updates
      - Merged source for mcs and gmcs
      - Performance tuning

Failure message with NCC: click OK and the installation finishes without problems.

Does zen-updater have a valid update service? Michael?

If zen-updater does not have a valid update service, this is a BLOCKER.

Hello, I would be glad to assist you guys if we can get information on how to reproduce this bug (I do not know what RC4 means; is that RC4 for SP1, or for openSUSE 10.2?). Where could we get images, and once we get the images, how do we reproduce it?

That being said, when mono reports a "Got a SIGSEGV" it means that a segfault happened in unmanaged code (standard C/C++ libraries). When the SIGSEGV happens in managed code (code JITed by Mono), the SIGSEGV is turned into a NullReferenceException.

An option would be to run valgrind with the mono.supp file (the suppressions file; otherwise you get a lot of bogus reports from the GC) and see whether an unmanaged library is corrupting memory.

I just checked the system. There was an update source in YaST (can be seen via yast2 inst_source) but not in zmd.

comment #17: Ok, just as expected. So the user will not be able to see updates via the ZEN stack.
Raising to BLOCKER again.

Ok, we found out what the problem was. It took a while, but we found it. ZMD is broken for the following reasons:

The use of signal handlers is not supported in Mono, but we provided a way of doing it for those that knew what they were doing; for details see http://www.mono-project.com/FAQ:_Technical#Can_I_use_signal_handlers_with_Mono.3F

ZMD does not follow that practice (which is extremely tricky) and instead created its own. The problem is that signal handlers can be invoked at any time, and they would trigger a JIT compilation of the code, and this would break the locking mechanism inside the JIT. So our process documents how to get this going (and we still strongly encourage people not to use signal handlers with their applications, because it is error prone).

ZMD went beyond the only supported practice, did not follow the advice, and got clever with its use of signals. In particular, they created a "SigAction structure" that they intended to map onto the underlying sigaction structure of the operating system (which is highly OS dependent). Their code has a definition that they probably copied from somewhere, and that somewhere is probably not Linux; we did not document that nor provide code similar to it. The definition they are using is:

    internal struct SigActionData {
        public SignalHandler  handler;
        public UInt64         flags;
        public SignalRestorer restorer;
        public UInt64         mask;
        // The actual struct has one long[2] type here but this works.
        public UInt64         mask2;
    }

(That is from rug's UnixSignal.cs file.)

The above does not match the definition of sigaction on my SLED 10; I do not know where they copied it from. They then call "sigaction" and hope that Mono will translate the structures, which it does, but those structures do not match the OS, so garbage goes to the OS, and garbage is copied back, likely overwriting internal structures back and forth.
In fact, the structure on my machine is smaller, so there is guaranteed corruption. To explain this in C terms, this is like someone not using #include <signal.h> and instead typing the definition *incorrectly* into the C file and expecting that things will work. The crasher happened exactly because of that problem.

Solution: do not even compile against the UnixSignal.cs from rug, and remove any references to it from rug. Here is a patch:
--- rug-7.1.100.0/src/CommandLineParser.cs.old 2006-11-30 17:39:41.000000000 -0500
+++ rug-7.1.100.0/src/CommandLineParser.cs 2006-11-30 17:39:52.000000000 -0500
@@ -311,7 +311,6 @@
}
if (command != null) {
- UnixSignal.RegisterHandler (Signum.SIGINT, new SignalHandler (command.Command.Terminate));
command.Command.Execute ((string[]) extra.ToArray (typeof (string)));
}
}
Yeah, as Miguel says, this is pretty broken. That struct evidently changed size at one point; I think I was using SUSE 9.0 when that was created. In all fairness, we didn't go out of our way to avoid using the established guidelines -- they simply did not exist when this was written. There is a much better signal handling solution in zmd itself, and that code should probably just be copied to rug. It does the signal handling in unmanaged code, and signals the managed pieces through a pipe.

Paolo found another problem with UnixSignal: the code does not keep a reference to the handler passed to the kernel. Without keeping a reference, the Garbage Collector would dispose of the "trampoline" code that is used to call back into C#, so effectively the kernel could at any point call back into something random.

Test package with a patch equivalent to #21 at http://w3.suse.de/~jpr/

New packages in the same place, with the patch applied correctly.

JP, where's the source RPM? Miguel, what is the patch for #23?

I'm testing now with the mono-core from Bug #221277. It looks much better on the machine with the problems, but now I seem to run into a dead-lock:

    K29:~ # ps -l 4951 8345
    F S UID  PID PPID C PRI NI ADDR    SZ WCHAN TTY   TIME CMD
    0 S   0 4951    1 0  94 19    - 19899 stext ?     0:08 zmd /usr/lib/zmd/zmd.exe
    0 S   0 8345 4401 0  83  0    -  4934 stext pts/1 0:00 /usr/bin/mono /usr/lib/rug/rug.exe ping

This was iteration 475 - normally it would bail out far earlier...

Ran successfully for another 840 iterations and then hung again. Why was the signal handler needed at all? Does it explain the hang, or is this a result of the mono-core change?

Note: We see the hang and the earlier mono failures on a two-CPU machine; I do not see them on my two one-CPU machines. So, let's test this on SMP systems!

The signal handler's only purpose in rug was to kill the process. It can be removed, but it does not seem to be responsible for the hang. The only possible cause seems to be the changes made to mono-core.
Looking at Miguel's patch (bug 221277), the new locking which was introduced could be the issue. Can someone who knows mono look at this snippet of the patch? The unlock seems to happen only in the conditional and not outside; was that intentional?

diff -ru /tmp/mono-1.1.18/mono/metadata/metadata.c metadata/metadata.c
--- /tmp/mono-1.1.18/mono/metadata/metadata.c 2006-10-12 20:16:07.000000000 +0200
+++ metadata/metadata.c 2006-12-01 01:14:19.000000000 +0100
@@ -1490,6 +1490,7 @@
 int count = 0;
 gboolean found;
+ mono_loader_lock ();
 /*
  * According to the spec, custom modifiers should come before the byref
  * flag, but the IL produced by ilasm from the following signature:
@@ -1561,6 +1562,7 @@
 if (!do_mono_metadata_parse_type (type, m, container, ptr, &ptr)) {
 if (type != &stype)
 g_free (type);
+ mono_loader_unlock ();
 return NULL;
 }

The full patch seems to unlock properly everywhere.

comment #20, #22: I just verified repeated 'rug ping' calls on a SLED10 SMP machine. The bug did _not_ appear. So apparently it was introduced between SLED10 GA and openSUSE 10.2?!

The current situation seems to be that rug hangs on an SMP box when run in a loop (and even then only after 400+ iterations). Is this a practical test case? It should be fixed, but I believe that it is not a blocker. Apart from this, isn't the original issue about YaST complaining about zmd not being alive even when it was? Has this been observed again, and does the updated mono-core fix it?

comment #37: YaST complains because it runs "rug ping", which exits with an error. Apparently this error can easily be triggered (on openSUSE 10.2) by running "rug ping" in a loop.

The broken signal handler code was not the cause of the issue: it could corrupt data on the stack, but I think this happened to stomp only on data that the jit would use if it had to do a stack unwind of that stack frame, which I didn't see actually happen.
I guess it was just a bit frustrating to find that code there after it was pointed out to be faulty months ago.

Yesterday evening I did about 1500 runs and they were all successful, with no hang, on this same K29 box. The patch does more locking than needed, but the lock used is recursive and it is placed in places where it was supposed to be held anyway (yesterday night I didn't have the time to review all the code and find the spot which didn't take it). It would be good to have more data points from executing rug ping in a loop on other SMP boxes with the fixed mono.

(In reply to comment #37)
> The current situation seems to be that rug hangs on a SMP box when run in a
> loop (that too after 400+ iterations). Is this a practical test case? It should
> be fixed but I believe that it is not a blocker.
>
> Apart from this, isn't the original issue about yast complaining about zmd not
> being alive even when it was? Has this been observed again and does the updated
> mono-core fix it?

As explained to Guru, the test case is only a test case which reflects the likelihood of the bug occurring. That means, e.g., that out of 400 users calling rug once, statistically one will be hit by the bug. Using the loop script we can at least kind of reproduce it. Also, rug ping is not the only issue; I guess the whole rug setup has some flaws. We could equally use rug sl to run into the bug, but that would take 10 times as much test time.

> As explained to Guru, the test case is only a test case which reflects the
> likelihood of the bug occurring. That means, e.g., that out of 400 users calling
> rug once, statistically one will be hit by the bug. Using the loop script we
> can at least kind of reproduce it.

The distro being openSUSE 10.2, I expect it to be used as a desktop box (with 1 or 2 users) rather than a server with 400 users. Keeping that in mind, the test is impractical and does not reflect actual usage.
The question is whether 1 out of 400 users on SMP machines will run into this during installation. I think the actual reported bug is fixed, but we now have the hang. Since we call rug ping during installation, this could affect users. I'm doing a test install now.

Just tested with RC5: I do not see a problem during installation (3 different SMP machines). AJ, please run the loop script, too.

Worked fine for the first 100 iterations... I'm sure we'll run into it again, since it was tested earlier with the same packages we have in RC5.

Much better with RC5 now, but still occurring. Not a shipment blocker anymore, but it needs to be fixed ASAP to deliver a patch for 10.2 and have it working for SP1.

We are not going to fix this for openSUSE.