Bugzilla – Bug 318429
[PATCH] [GMCS] bad generic method instantations performance
Last modified: 2007-09-15 21:24:46 UTC
---- Reported by malekith@pld-linux.org 2005-07-02 08:01:45 MST ---- Description of Problem: We are experiencing something like 10x slowdowns after switching from type erasure compilation method to generics. That is code using generics is 10x slower (generation of this code is also slower but only 2x, and we will investigate this). Anyway this seems to be caused (at least partially) by some problems with instantiated generic method lookup. We have found mono to perform very badly on the following example: #v+ class C1 <T> { } class C2 <T> { } class C3 <T> { } class C4 <T> { } class C5 <T> { } class C6 <T> { } class C7 <T> { } class C8 <T> { } class M { static int dupl<T> (int cnt, T unused) { if (cnt < 0) return 1; else return dupl (cnt - 1, new C1<T> ()) + dupl (cnt - 1, new C2<T> ()) + dupl (cnt - 1, new C3<T> ()) + dupl (cnt - 1, new C4<T> ()) + dupl (cnt - 1, new C5<T> ()) + dupl (cnt - 1, new C6<T> ()) + dupl (cnt - 1, new C7<T> ()) + dupl (cnt - 1, new C8<T> ()); } public static void Main () { string a = System.Environment.GetCommandLineArgs () [1]; System.Console.WriteLine (dupl (int.Parse (a), "foo")); } } #v- It essentially does nothing, but when run with parameter N, it creates 8^(N+2) generic instances of the method dupl. Running it with N=2 (so something about 1000 instances is created) takes 35s on 1.8Ghz amd64 box (under ms.net the time is not noticeable). The statistics are: Mono Jit statistics Compiled methods: 2159 Methods from AOT: 0 Methods cache lookup: 1621 Method trampolines: 60344 Basic blocks: 12428 Max basic blocks: 91 Allocated vars: 23886 Analyze stack repeat: 0 Compiled CIL code size: 108064 Native code size: 339081 Max code size ratio: 22.00 (Default::ReorderArgumentArray) Biggest method: 3158 (MonoType::InvokeMember) Code reallocs: 1 Allocated code size: 339081 Inlineable methods: 0 Inlined methods: 4 Created object count: 5822 Initialized classes: 5096 Used classes: 4824 Static data size: 752 VTable data size: 456888 Generic instances: 69 Initialized classes: 4759 Inflated methods: 33746 / 33874 Inflated types: 44348 Generics metadata size: 3028792 Which seems mostly OK, given that mono does not share code of different instantiations of dupl. However generating 300k of native code should be nowhere near 30s. Moreover, with slightly different version of this program, when increasing N we have found that the slowdown was at least quadratic, not linear, to number of type instantiations. If I were to guess it either hashtable collisions or plain lookup in list which makes it so slow. I were however unable to get mono compiled with -pg running :/ Of course nemerle compiler doesn't use the code like the above, to get zillions of instantiations, it has real types there. I'm using mono svn. ---- Additional Comments From malekith@pld-linux.org 2005-07-02 09:40:28 MST ---- Note that this doesn't seem to be only about sharing code, as the JIT time is HUGE, much bigger then I would expect, and moreover it is quadratic, not linear. ---- Additional Comments From malekith@pld-linux.org 2005-07-02 19:28:51 MST ---- Created an attachment (id=168213) improve hashing of generic types ---- Additional Comments From malekith@pld-linux.org 2005-07-02 19:32:28 MST ---- The attached patch changes hashing of generic instances to take into account generic parameters. While this adds overhead to hashing functions (so I don't know if it suitable for mono trunk), it reduces number of conflicts, and thus almost doubles Nemerle compiler performance (405s -> 224s, 50s -> 28s). It's still slow though... ---- Additional Comments From malekith@pld-linux.org 2005-07-02 19:50:55 MST ---- Top of oprofile report: CPU: AMD64 processors, speed 1800.1 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 357595 19.8571 GC_mark_from 190626 10.5854 mono_type_stack_size 115331 6.4043 mono_generic_inst_equal 101756 5.6505 mono_generic_inst_hash 76959 4.2735 GC_set_mark_bit 68493 3.8034 mono_metadata_class_equal 63561 3.5295 mymono_metadata_type_equal 60756 3.3738 GC_finalize 34754 1.9299 GC_is_marked 32029 1.7786 GC_register_finalizer_inner 30518 1.6946 mono_module_file_get_object 21430 1.1900 do_mono_metadata_type_equal 20747 1.1521 GC_base 20644 1.1464 run_finalize 19538 1.0849 mono_metadata_generic_class_is_valuetype 19454 1.0803 _mono_metadata_generic_class_equal 18118 1.0061 object_register_finalizer 16538 0.9183 g_hash_nodes_destroy 16102 0.8941 mono_metadata_generic_param_equal 16013 0.8892 GC_gcj_malloc_ignore_off_page 14656 0.8138 mono_class_vtable 12777 0.7095 GC_signal_mark_stack_overflow 12528 0.6957 InitializeCriticalSectionAndSpinCount 11543 0.6410 GC_reclaim_clear 11185 0.6211 GC_local_malloc_atomic 10939 0.6074 GC_block_nearly_full 10854 0.6027 mono_jit_compile_method_with_opt 10381 0.5765 mono_metadata_type_hash 10347 0.5746 mono_arch_local_regalloc 8719 0.4842 mono_object_new_alloc_specific 8347 0.4635 mono_metadata_signature_equal 7870 0.4370 mono_method_to_ir 7358 0.4086 GC_generic_malloc_inner 7261 0.4032 .plt 7127 0.3958 mono_jit_compile_method_inner 7077 0.3930 mono_metadata_decode_row 7076 0.3929 mono_burg_label_priv 7037 0.3908 mono_marshal_get_runtime_invoke 7033 0.3905 GC_find_header 5855 0.3251 mono_runtime_class_init 5807 0.3225 EnterCriticalSection ---- Additional Comments From malekith@pld-linux.org 2005-07-03 12:02:27 MST ---- The output above was after the change, for nemerle compiler compiling the standard library. I have however found the results to be somewhat bogus, for instance it reported mono to sit in g_error in mono_type_stack_size(), which clearly wasn't the case. Lupus suggested I use --profile=default:stat and it seems to work. One interesting thing I noticed is finalization. We do not have any Finalize methods, and I don't think we use much of objects with Finalize from the BCL. Indeed in the nongeneric compiler version first entry about finalization is at 0.12%. However in the generic version: prof counts: total/unmanaged: 18792/17154 5158 27.45 % /home/users/malekith/mono-svn/bin/mono(GC_mark_from 1483 7.89 % /home/users/malekith/mono-svn/bin/mono(GC_finalize 587 3.12 % /usr/lib64/libglib-2.0.so.0(g_hash_table_insert 486 2.59 % /home/users/malekith/mono-svn/bin/mono(GC_find_header 367 1.95 % /usr/lib64/libglib-2.0.so.0 351 1.87 % /home/users/malekith/mono-svn/bin/mono(GC_register_finalizer_inner 336 1.79 % /usr/lib64/libglib-2.0.so.0(g_hash_table_lookup 331 1.76 % /lib64/tls/libc.so.6 268 1.43 % /lib64/tls/libpthread.so.0 266 1.42 % /home/users/malekith/mono-svn/bin/mono(GC_reclaim_clear 239 1.27 % /lib64/tls/libc.so.6(calloc 232 1.23 % /home/users/malekith/mono-svn/bin/mono(mono_arch_local_regalloc 221 1.18 % /home/users/malekith/mono-svn/bin/mono(mono_generic_inst_equal 215 1.14 % /lib64/tls/libc.so.6(memset 214 1.14 % /lib64/tls/libpthread.so.0(__pthread_mutex_lock 211 1.12 % /home/users/malekith/mono-svn/bin/mono(do_mono_metadata_type_equal 200 1.06 % /home/users/malekith/mono-svn/bin/mono(GC_base 176 0.94 % /home/users/malekith/mono-svn/bin/mono(mono_burg_label_priv 156 0.83 % /home/users/malekith/mono-svn/bin/mono(GC_is_marked 152 0.81 % /home/users/malekith/mono-svn/bin/mono(_mono_metadata_generic_class_equal 136 0.72 % /home/users/malekith/mono-svn/bin/mono(mono_method_to_ir 135 0.72 % /home/users/malekith/mono-svn/bin/mono(GC_local_gcj_malloc 131 0.70 % /home/users/malekith/mono-svn/bin/mono(mono_metadata_generic_param_equal 120 0.64 % /home/users/malekith/mono-svn/bin/mono(mono_object_new_alloc_specific 109 0.58 % /home/users/malekith/mono-svn/bin/mono(mono_metadata_decode_row 105 0.56 % /home/users/malekith/mono-svn/bin/mono(mono_arch_output_basic_block 93 0.49 % /lib64/tls/libc.so.6(__libc_free (this is all after my change). GC_mark_from was 24% in the non generic version, but it is 5x faster, therefore it seems to take more time. Maybe because of finalizers. The generic version allocates 170M (non generic 140M, this is strange, the generic version doesn't box anything, so it should allocate less, maybe it's some runtime issue). ---- Additional Comments From malekith@pld-linux.org 2005-07-03 13:28:22 MST ---- The full benchmark is here (I failed to attach it to bugzilla): http://nemerle.org/~malekith/temp/benchmark.tar.gz It contains the generic and nongeneric version, so one can compare them. ---- Additional Comments From malekith@pld-linux.org 2005-07-03 16:03:51 MST ---- Ha, got it! Mono did always set has_finalize for generic types. This: Index: metadata/class.c =================================================================== --- metadata/class.c (revision 46875) +++ metadata/class.c (working copy) @@ -1982,7 +1982,7 @@ /* Object::Finalize should have empty implemenatation */ class->has_finalize = 0; if (class->parent) { - if (class->vtable [finalize_slot] != default_finalize) + if (class->vtable [finalize_slot]->token != default_finalize->token) class->has_finalize = 1; } fixes it, but I think that to be sure, we should also compare dynamic modules, not only tokens, so this shouldn't go into main mono tree I guess. Anyway the problem is that we have inflated Finalize copy. With this patch it takes only 10s for go2 to run, so it's now only 2x slower than non-generic version. ---- Additional Comments From malekith@pld-linux.org 2005-07-04 08:46:43 MST ---- I have extended the benchmark with somewhat longer run (another set of sources), the extended version is here: http://nemerle.org/~malekith/temp/benchmark-extended.tar.gz timing is now 30s vs 58s (with all my changes). ---- Additional Comments From vargaz@gmail.com 2005-07-12 06:37:15 MST ---- The finalize issue is now fixed in SVN. ---- Additional Comments From martin@ximian.com 2005-07-12 09:19:21 MST ---- Committed your patch. I'll also try to implement shared code as soon as possible. Imported an attachment (id=168213) Unknown bug field "cf_op_sys_details" encountered while moving bug <cf_op_sys_details>amd64 pld linux</cf_op_sys_details> Unknown operating system unknown. Setting to default OS "Other".