Bug 318429 (MONO75444) - [PATCH] [GMCS] bad generic method instantations performance
Summary: [PATCH] [GMCS] bad generic method instantations performance
Status: RESOLVED FIXED
Alias: MONO75444
Product: Mono: Runtime
Classification: Mono
Component: JIT (show other bugs)
Version: 1.1
Hardware: Other Other
: P3 - Medium : Normal
Target Milestone: ---
Assignee: Martin Baulig
QA Contact: Mono Bugs
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-07-02 15:01 UTC by Michal Moskal
Modified: 2007-09-15 21:24 UTC (History)
0 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
improve hashing of generic types (2.49 KB, patch)
2005-07-03 02:28 UTC, Thomas Wiest
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Thomas Wiest 2007-09-15 19:24:05 UTC


---- Reported by malekith@pld-linux.org 2005-07-02 08:01:45 MST ----

Description of Problem:

We are experiencing something like 10x slowdowns after switching from type
erasure compilation method to generics. That is code using generics is
10x slower (generation of this code is also slower but only 2x, and we
will investigate this).

Anyway this seems to be caused (at least partially) by some problems with
instantiated generic method lookup.

We have found mono to perform very badly on the following example:

#v+
class C1 <T> { }
class C2 <T> { }
class C3 <T> { }
class C4 <T> { }

class C5 <T> { }
class C6 <T> { }
class C7 <T> { }
class C8 <T> { }

class M {
  static int dupl<T> (int cnt, T unused)
  {
    if (cnt < 0)
      return 1;
    else
          return
      dupl (cnt - 1, new C1<T> ()) +
      dupl (cnt - 1, new C2<T> ()) +
      dupl (cnt - 1, new C3<T> ()) +
      dupl (cnt - 1, new C4<T> ()) +
      dupl (cnt - 1, new C5<T> ()) +
      dupl (cnt - 1, new C6<T> ()) +
      dupl (cnt - 1, new C7<T> ()) +
      dupl (cnt - 1, new C8<T> ());
  }

  public static void Main ()
  {
    string a = System.Environment.GetCommandLineArgs () [1];
    System.Console.WriteLine (dupl (int.Parse (a), "foo"));
  }
}
#v-

It essentially does nothing, but when run with parameter N, it creates
8^(N+2) generic instances of the method dupl. Running it with N=2 (so
something about 1000 instances is created) takes 35s on 1.8Ghz amd64 box
(under ms.net the time is not noticeable).

The statistics are:

Mono Jit statistics
Compiled methods:       2159
Methods from AOT:       0
Methods cache lookup:   1621
Method trampolines:     60344
Basic blocks:           12428
Max basic blocks:       91
Allocated vars:         23886
Analyze stack repeat:   0
Compiled CIL code size: 108064
Native code size:       339081

Max code size ratio:    22.00 (Default::ReorderArgumentArray)
Biggest method:         3158 (MonoType::InvokeMember)
Code reallocs:          1
Allocated code size:    339081
Inlineable methods:     0
Inlined methods:        4

Created object count:   5822
Initialized classes:    5096
Used classes:           4824
Static data size:       752
VTable data size:       456888

Generic instances:      69
Initialized classes:    4759
Inflated methods:       33746 / 33874
Inflated types:         44348
Generics metadata size: 3028792

Which seems mostly OK, given that mono does not share code of different
instantiations of dupl. However generating 300k of native code should
be nowhere near 30s.

Moreover, with slightly different version of this program, when increasing
N we have found that the slowdown was at least quadratic, not linear,
to number of type instantiations.

If I were to guess it either hashtable collisions or plain lookup in list
which makes it so slow.

I were however unable to get mono compiled with -pg running :/

Of course nemerle compiler doesn't use the code like the above, to get
zillions of instantiations, it has real types there.

I'm using mono svn.



---- Additional Comments From malekith@pld-linux.org 2005-07-02 09:40:28 MST ----

Note that this doesn't seem to be only about sharing code, as the JIT
time is HUGE, much bigger then I would expect, and moreover it is
quadratic, not linear.



---- Additional Comments From malekith@pld-linux.org 2005-07-02 19:28:51 MST ----

Created an attachment (id=168213)
improve hashing of generic types




---- Additional Comments From malekith@pld-linux.org 2005-07-02 19:32:28 MST ----

The attached patch changes hashing of generic instances to take into
account generic parameters. While this adds overhead to hashing
functions (so I don't know if it suitable for mono trunk), it reduces
number of conflicts, and thus almost doubles Nemerle compiler
performance (405s -> 224s, 50s -> 28s). It's still slow though...



---- Additional Comments From malekith@pld-linux.org 2005-07-02 19:50:55 MST ----

Top of oprofile report:

CPU: AMD64 processors, speed 1800.1 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
357595   19.8571  GC_mark_from
190626   10.5854  mono_type_stack_size
115331    6.4043  mono_generic_inst_equal
101756    5.6505  mono_generic_inst_hash
76959     4.2735  GC_set_mark_bit
68493     3.8034  mono_metadata_class_equal
63561     3.5295  mymono_metadata_type_equal
60756     3.3738  GC_finalize
34754     1.9299  GC_is_marked
32029     1.7786  GC_register_finalizer_inner
30518     1.6946  mono_module_file_get_object
21430     1.1900  do_mono_metadata_type_equal
20747     1.1521  GC_base
20644     1.1464  run_finalize
19538     1.0849  mono_metadata_generic_class_is_valuetype
19454     1.0803  _mono_metadata_generic_class_equal
18118     1.0061  object_register_finalizer
16538     0.9183  g_hash_nodes_destroy
16102     0.8941  mono_metadata_generic_param_equal
16013     0.8892  GC_gcj_malloc_ignore_off_page
14656     0.8138  mono_class_vtable
12777     0.7095  GC_signal_mark_stack_overflow
12528     0.6957  InitializeCriticalSectionAndSpinCount
11543     0.6410  GC_reclaim_clear
11185     0.6211  GC_local_malloc_atomic
10939     0.6074  GC_block_nearly_full
10854     0.6027  mono_jit_compile_method_with_opt
10381     0.5765  mono_metadata_type_hash
10347     0.5746  mono_arch_local_regalloc
8719      0.4842  mono_object_new_alloc_specific
8347      0.4635  mono_metadata_signature_equal
7870      0.4370  mono_method_to_ir
7358      0.4086  GC_generic_malloc_inner
7261      0.4032  .plt
7127      0.3958  mono_jit_compile_method_inner
7077      0.3930  mono_metadata_decode_row
7076      0.3929  mono_burg_label_priv
7037      0.3908  mono_marshal_get_runtime_invoke
7033      0.3905  GC_find_header
5855      0.3251  mono_runtime_class_init
5807      0.3225  EnterCriticalSection




---- Additional Comments From malekith@pld-linux.org 2005-07-03 12:02:27 MST ----

The output above was after the change, for nemerle compiler compiling
the standard library. I have however found the results to be somewhat
bogus, for instance it reported mono to sit in g_error in
mono_type_stack_size(), which clearly wasn't the case.

Lupus suggested I use --profile=default:stat and it seems to work. One
interesting thing I noticed is finalization. We do not have any
Finalize methods, and I don't think we use much of objects with
Finalize from the BCL. Indeed in the nongeneric compiler version first
entry about finalization is at 0.12%. However in the generic version:

prof counts: total/unmanaged: 18792/17154
   5158 27.45 % /home/users/malekith/mono-svn/bin/mono(GC_mark_from
   1483  7.89 % /home/users/malekith/mono-svn/bin/mono(GC_finalize
    587  3.12 % /usr/lib64/libglib-2.0.so.0(g_hash_table_insert
    486  2.59 % /home/users/malekith/mono-svn/bin/mono(GC_find_header
    367  1.95 % /usr/lib64/libglib-2.0.so.0
    351  1.87 %
/home/users/malekith/mono-svn/bin/mono(GC_register_finalizer_inner
    336  1.79 % /usr/lib64/libglib-2.0.so.0(g_hash_table_lookup
    331  1.76 % /lib64/tls/libc.so.6
    268  1.43 % /lib64/tls/libpthread.so.0
    266  1.42 % /home/users/malekith/mono-svn/bin/mono(GC_reclaim_clear
    239  1.27 % /lib64/tls/libc.so.6(calloc
    232  1.23 %
/home/users/malekith/mono-svn/bin/mono(mono_arch_local_regalloc
    221  1.18 %
/home/users/malekith/mono-svn/bin/mono(mono_generic_inst_equal
    215  1.14 % /lib64/tls/libc.so.6(memset
    214  1.14 % /lib64/tls/libpthread.so.0(__pthread_mutex_lock
    211  1.12 %
/home/users/malekith/mono-svn/bin/mono(do_mono_metadata_type_equal
    200  1.06 % /home/users/malekith/mono-svn/bin/mono(GC_base
    176  0.94 %
/home/users/malekith/mono-svn/bin/mono(mono_burg_label_priv
    156  0.83 % /home/users/malekith/mono-svn/bin/mono(GC_is_marked
    152  0.81 %
/home/users/malekith/mono-svn/bin/mono(_mono_metadata_generic_class_equal
    136  0.72 % /home/users/malekith/mono-svn/bin/mono(mono_method_to_ir
    135  0.72 % /home/users/malekith/mono-svn/bin/mono(GC_local_gcj_malloc
    131  0.70 %
/home/users/malekith/mono-svn/bin/mono(mono_metadata_generic_param_equal
    120  0.64 %
/home/users/malekith/mono-svn/bin/mono(mono_object_new_alloc_specific
    109  0.58 %
/home/users/malekith/mono-svn/bin/mono(mono_metadata_decode_row
    105  0.56 %
/home/users/malekith/mono-svn/bin/mono(mono_arch_output_basic_block
     93  0.49 % /lib64/tls/libc.so.6(__libc_free

(this is all after my change).

GC_mark_from was 24% in the non generic version, but it is 5x faster,
therefore it seems to take more time. Maybe because of finalizers.
The generic version allocates 170M (non generic 140M, this is strange,
the generic version doesn't box anything, so it should allocate less,
maybe it's some runtime issue).




---- Additional Comments From malekith@pld-linux.org 2005-07-03 13:28:22 MST ----

The full benchmark is here (I failed to attach it to bugzilla):

  http://nemerle.org/~malekith/temp/benchmark.tar.gz

It contains the generic and nongeneric version, so one can compare them.



---- Additional Comments From malekith@pld-linux.org 2005-07-03 16:03:51 MST ----

Ha, got it! Mono did always set has_finalize for generic types.

This:

Index: metadata/class.c
===================================================================
--- metadata/class.c    (revision 46875)
+++ metadata/class.c    (working copy)
@@ -1982,7 +1982,7 @@
                /* Object::Finalize should have empty implemenatation */
                class->has_finalize = 0;
                if (class->parent) {
-                       if (class->vtable [finalize_slot] !=
default_finalize)
+                       if (class->vtable [finalize_slot]->token !=
default_finalize->token)
                                class->has_finalize = 1;
                }

fixes it, but I think that to be sure, we should also compare dynamic
modules, not only tokens, so this shouldn't go into main mono tree I
guess.

Anyway the problem is that we have inflated Finalize copy.

With this patch it takes only 10s for go2 to run, so it's now only 2x
slower than non-generic version.




---- Additional Comments From malekith@pld-linux.org 2005-07-04 08:46:43 MST ----

I have extended the benchmark with somewhat longer run (another set of
sources), the extended version is here:

  http://nemerle.org/~malekith/temp/benchmark-extended.tar.gz

timing is now 30s vs 58s (with all my changes).




---- Additional Comments From vargaz@gmail.com 2005-07-12 06:37:15 MST ----

The finalize issue is now fixed in SVN.




---- Additional Comments From martin@ximian.com 2005-07-12 09:19:21 MST ----

Committed your patch.

I'll also try to implement shared code as soon as possible.

Imported an attachment (id=168213)

Unknown bug field "cf_op_sys_details" encountered while moving bug
   <cf_op_sys_details>amd64 pld linux</cf_op_sys_details>
Unknown operating system unknown. Setting to default OS "Other".