Bug 324339 (MONO81663) - [PATCH] Performance: Delegate optimization, DLR and IronPython
Summary: [PATCH] Performance: Delegate optimization, DLR and IronPython
Status: RESOLVED FIXED
Alias: MONO81663
Product: Mono: Runtime
Classification: Mono
Component: JIT
Version: 1.2
Hardware: Other Other
Importance: P3 - Medium, Enhancement
Target Milestone: ---
Assignee: Paolo Molaro
QA Contact: Mono Bugs
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-05-17 21:24 UTC by Miguel de Icaza
Modified: 2007-09-15 21:24 UTC
CC: 1 user

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
patch (14.56 KB, patch)
2007-05-24 01:08 UTC, Thomas Wiest
Details | Diff
microbenchmark (408 bytes, text/plain)
2007-05-24 01:08 UTC, Thomas Wiest
Details
patch (14.92 KB, patch)
2007-05-26 20:04 UTC, Thomas Wiest
Details | Diff
updated patch with x86 support (20.55 KB, patch)
2007-05-29 17:51 UTC, Thomas Wiest
Details | Diff
Patch for x86 static case (2.68 KB, patch)
2007-07-20 07:10 UTC, Thomas Wiest
Details | Diff
Static Implementation (2.82 KB, patch)
2007-08-31 05:23 UTC, Thomas Wiest
Details | Diff
Architecture Specific Delegate Caching Patch (7.68 KB, patch)
2007-09-01 07:31 UTC, Thomas Wiest
Details | Diff
Architecture Delegate Impl with InterlockedCompareExchange (8.06 KB, patch)
2007-09-02 04:50 UTC, Thomas Wiest
Details | Diff
Architecture Specific Delegate Caching #2 (17.57 KB, patch)
2007-09-10 22:22 UTC, Thomas Wiest
Details | Diff

Description Thomas Wiest 2007-09-15 20:39:03 UTC


---- Reported by miguel@ximian.com 2007-05-17 14:24:04 MST ----

From this post:

http://lists.ironpython.com/pipermail/users-ironpython.com/2007-May/004916.html

There are some observations about new optimizations in the Orcas CLR: an
"improvement to the performance of dispatching to a delegate closed over a
parameter."

It would be interesting to find out what this is.



---- Additional Comments From sanxiyn@gmail.com 2007-05-18 01:28:33 MST ----

Dino says, "The change was to no longer wrap LCG delegates inside a
MultiCast delegate removing the overhead of calling through the
multicast delegate. The multicast delegate was just a convenience for
tracking the lifetime of the delegate and now the GC does that
instead. I don't know if that'll apply to Mono or not but maybe it
gives you some ideas on where to look for own improvements".



---- Additional Comments From miguel@ximian.com 2007-05-18 12:29:32 MST ----

We should create a small test case to measure the problem.



---- Additional Comments From lupus@ximian.com 2007-05-23 10:34:56 MST ----

On my system (Pentium M 1.6 GHz) there is a small degradation in
pystone performance when going from 1.0.1/1.1 to 2.0 (54662.5 to
53477.5). But note that pystone should be run with optimizations
enabled (-O pystone.py), and in that case there is a small gain in 2.0
vs 1.1 (55157.9 vs. basically no change in earlier IronPythons).

Also note that delegate invocation performance has nothing to do
with the results of pystone (in a profile run the most expensive
delegate invocation wrapper has a 0.6% impact on the total performance).
It would be good for people experiencing big slowdowns to get a
profile for the two runs and attach them to this bug for my review.
The commands to run are:
mono --profile=default:stat IPCE-r6/ipy.exe -O
/usr/lib/python2.5/test/pystone.py 500000
and
mono --profile=default:stat IronPython-1.0.1/ipy.exe -O
/usr/lib/python2.5/test/pystone.py 500000

As for delegate invocation performance, it currently takes about 3
times as long as a virtual call. There are a few possible optimizations.
The first thing to remove is a check added in r27776 by lluis, but we
need to investigate why it was added (the wrapper should not be called
in unmanaged->managed transitions as the comment says; if it is, that is
a bug and it should be fixed some other way). This should be the
simplest change.

The most important overhead is reloading the arguments: we can avoid
it in the most common scenario, a delegate that is not chained with
other delegates. There are cases where removing this overhead is cheap
and cases where it is hard to implement, depending on the calling
convention. On x86, for example, for delegates that invoke instance
methods we can just load the target object, place it on the stack
where the delegate object was pushed, and jump to the address. Static
methods can't be handled the same way because the stack would end up
imbalanced. Other architectures might be able to handle static methods
as well for some signatures: sliding a few arguments in registers is
very cheap. If we changed the internal call convention to pass 'this'
in %ecx, handling static methods would be cheap as well.

I had the above change prototyped a while ago, by prepending to the
delegate-invoke wrapper a simple check+jump. In the production
implementation we could do things a little differently by changing the
call to delegate invoke from:

  call delegate_invoke_impl
to
  call *delegate_object [offset]
where offset is the offset of a field in the delegate object in which
we store a delegate-instance-specific invoker.
The default would be the address of the current delegate_invoke_impl,
but in the cases where we can optimize we would use the specific one.
This change would result in a nice speedup for the common case and a
tiny slowdown when invoking multicast delegates.
The above is possible because we control in corlib where different
delegates are chained and of course we control the delegate constructor.
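The proposed per-instance invoker can be sketched in plain C rather than JIT-emitted code. The MiniDelegate type, its field layout, and the function names below are invented for illustration; the real MonoDelegate object and its wrappers differ:

```c
#include <assert.h>

/* Hypothetical, much-simplified model of a delegate object. */
typedef struct MiniDelegate MiniDelegate;
struct MiniDelegate {
    void *target;                             /* receiver, if any */
    int (*method_ptr)(void *target, int arg); /* the wrapped method */
    MiniDelegate *prev;                       /* multicast chain */
    /* The proposed per-instance invoker field: the call site does
       "call *delegate_object [offset]" through this field instead
       of a direct call to a shared delegate_invoke_impl. */
    int (*invoke_impl)(MiniDelegate *d, int arg);
};

/* Default invoker: handles the general (possibly chained) case. */
static int invoke_default(MiniDelegate *d, int arg) {
    int res = 0;
    for (; d; d = d->prev)        /* walk the multicast chain */
        res = d->method_ptr(d->target, arg);
    return res;
}

/* Specialized invoker installed when the delegate is known not to
   be chained: no loop, no extra work before the real call. */
static int invoke_single(MiniDelegate *d, int arg) {
    return d->method_ptr(d->target, arg);
}

/* What the JIT-ed call site reduces to: one indirect call. */
static int call_delegate(MiniDelegate *d, int arg) {
    return d->invoke_impl(d, arg);
}

/* Example method for demonstration (also invented). */
static int times_two(void *target, int arg) {
    (void)target;
    return arg * 2;
}
```

Because corlib controls delegate chaining and construction, the runtime can swap invoke_impl between the default and the specialized version at well-defined points, which is what makes the scheme safe.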






---- Additional Comments From lupus@ximian.com 2007-05-23 13:22:44 MST ----

Some of my comments above are incorrect, as I assumed IPCE-r6 was
based on IronPython 2.0. So I recompiled IronPython 2.0a1 with the fix
from seo, and indeed the performance drop is significant (down to 40070.2).
In the new profile, delegate-invoke methods have more weight (there are
two, taking 1.3% and 1.2% of total time, for a total of about 6%).
The main difference, though, is the time spent in the GC: 16+% vs. the
6% it had in the 1.x versions.



---- Additional Comments From vargaz@gmail.com 2007-05-23 18:07:52 MST ----

Here is a patch to implement parts of Paolo's ideas. It creates three
copies of the Invoke method: non-multicast-no-target,
non-multicast-with-target and multicast. Strangely, it actually
decreases performance on a micro benchmark for me on amd64.




---- Additional Comments From vargaz@gmail.com 2007-05-23 18:08:21 MST ----

Created an attachment (id=171992)
patch




---- Additional Comments From vargaz@gmail.com 2007-05-23 18:08:53 MST ----

Created an attachment (id=171993)
microbenchmark




---- Additional Comments From lupus@ximian.com 2007-05-24 07:24:59 MST ----

Sadly it's not possible to implement the idea in arch-independent code,
as it is highly specific to the calling convention ABI; otherwise you
just pay the additional overhead of an indirect call vs. a direct call,
as you found out. So the patch is no indication of the expected
performance.

As an example of how this would look, the
delegate_invoke_with_target would be implemented in mini-amd64.c with:

  amd64_mov_reg_membase (code, AMD64_RAX, AMD64_RDI,
          G_STRUCT_OFFSET (MonoDelegate, method_ptr), 8);
  amd64_mov_reg_membase (code, AMD64_RDI, AMD64_RDI,
          G_STRUCT_OFFSET (MonoDelegate, target), 8);
  amd64_jump_reg (code, AMD64_RAX);

(changing rdi to account for struct return types). This code is much
faster than the one produced by your patch, especially with many
arguments.

This may require changes to the mono_delegate_trampoline() handling
to remove assumptions about the current way of invoking delegates.
We should likely have a new delegate magic trampoline which is
installed in the invoke_impl you added (this would remove the huge
overhead your change adds to the delegate ctor). This magic trampoline
would have easy access to the delegate object, so it is completely
generic (one instance of the trampoline in the system vs. one
trampoline per method/delegate type), and it can replace invoke_impl
according to the target and prev fields in the delegate.




---- Additional Comments From vargaz@gmail.com 2007-05-24 09:03:33 MST ----

The original Invoke call was virtual, so the patch adds no overhead in
theory.



---- Additional Comments From lupus@ximian.com 2007-05-24 13:55:44 MST ----

The code before was doing:
  static call to delegate_invoke
  -> indirect call to method

The patch does:
  indirect call to delegate invoke_impl
  -> indirect call to method

My proposal instead would end up with:
  indirect call to delegate invoke_impl
  -> indirect jump (no need for prolog/epilog and argument reload)

So it's no surprise that the patch causes a slowdown, because what
saves time is the removal of the prolog/epilog and of the argument
reload code.




---- Additional Comments From vargaz@gmail.com 2007-05-24 14:24:12 MST ----

I agree with your approach; I just don't understand why my patch was
causing a slowdown, when the original call to delegate_invoke was also
virtual and thus indirect, and the patch removed lots of loads and
compares.




---- Additional Comments From kumpera@gmail.com 2007-05-24 22:42:01 MST ----

Wouldn't a guard on the call site make the whole thing a lot easier?
That way it would optimize for the non-multicast case using a branch
with a not-taken hint, which is a lot cheaper than an indirect jump.

So, instead of having all these problems with generating/selecting the
right delegate method and fixing the calling convention, a simple
patch at the invoke call site might fix it. I just don't know if the
code-size overhead would be significant.
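The guard described above can be sketched in plain C. Here `__builtin_expect` is the GCC/Clang hint standing in for whatever branch hint the JIT would emit, and the MiniDelegate type is invented for illustration:

```c
#include <assert.h>

/* Invented, simplified delegate model. */
typedef struct MiniDelegate MiniDelegate;
struct MiniDelegate {
    void *target;
    int (*method_ptr)(void *target, int arg);
    MiniDelegate *prev;            /* non-NULL => multicast chain */
};

/* Slow path: walk the whole multicast chain. */
static int invoke_multicast(MiniDelegate *d, int arg) {
    int res = 0;
    for (; d; d = d->prev)
        res = d->method_ptr(d->target, arg);
    return res;
}

/* Guarded call site: a well-predicted compare-and-branch (hinted
   not-taken) guards the common non-multicast fast path. */
static int call_delegate(MiniDelegate *d, int arg) {
    if (__builtin_expect(d->prev != 0, 0))
        return invoke_multicast(d, arg);
    return d->method_ptr(d->target, arg);  /* common fast path */
}

/* Example method for demonstration (also invented). */
static int negate(void *target, int arg) {
    (void)target;
    return -arg;
}
```

The trade-off being raised: the guard is duplicated at every call site (code-size cost), in exchange for avoiding an extra indirect jump on the hot path.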




---- Additional Comments From lupus@ximian.com 2007-05-25 09:45:23 MST ----

Rodrigo made me check the generated code: we call the delegate invoke
method virtually. In my discussion above I assumed it was a
non-virtual call, because that is what should happen: the Invoke method
is virtual, but the type is statically known, so we should use a direct
call. This optimization should also benefit all the other such cases.
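What the direct call buys can be sketched in C (types and names invented; the point is the two dependent loads plus indirect call that devirtualization removes):

```c
#include <assert.h>

/* Invented model of a vtable-dispatched object. */
typedef struct { int (*invoke)(int); } VTable;
typedef struct { const VTable *vtable; } Obj;

static int impl(int x) { return x + 1; }
static const VTable the_vtable = { impl };

/* Virtual dispatch: load the vtable, load the slot, indirect call. */
static int call_virtual(const Obj *o, int x) {
    return o->vtable->invoke(x);
}

/* Devirtualized: the receiver's exact type is statically known
   (e.g. a sealed class), so a direct call can be emitted. */
static int call_devirtualized(const Obj *o, int x) {
    (void)o;
    return impl(x);
}
```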



---- Additional Comments From marek.safar@seznam.cz 2007-05-25 10:39:10 MST ----

I have always wondered how hard it would be, and how effective, to
implement the sealed class { virtual method } optimization.



---- Additional Comments From kumpera@gmail.com 2007-05-25 17:57:46 MST ----

I posted a patch to mono-dev with the sealed class/method virtual
dispatch optimization. This resulted in a very small improvement for
ipy 2.0, only 4%.

The GC overhead is justifiable since ipy 2.0 allocates a lot more
memory than 1.1.

Another solution to optimize delegate dispatch is to allow the runtime
to violate the sealed flag and generate two specialized subclasses of
the delegate: for instance methods, just put the target method in the
vtable; for static methods, fix the stack and statically dispatch and
return.

For the simple case, the cost of calling a delegate would be the same
as that of a virtual method call.

This would require a very intrusive patch, since there are quite a few
places that check (method->klass->parent ==
mono_defaults.multicastdelegate_class).





---- Additional Comments From vargaz@gmail.com 2007-05-26 13:04:17 MST ----

The attached patch implements Paolo's ideas on amd64. It is not yet
complete (some FIXMEs remain). It improves pystone performance by
about 25% on both ip 1.1 and 2.0. In the best case, the overhead of
delegate invocation is reduced to two machine instructions: an indirect
call and an indirect jump.





---- Additional Comments From vargaz@gmail.com 2007-05-26 13:04:53 MST ----

Created an attachment (id=171994)
patch




---- Additional Comments From vargaz@gmail.com 2007-05-29 10:51:14 MST ----

Created an attachment (id=171995)
updated patch with x86 support




---- Additional Comments From vargaz@gmail.com 2007-05-31 10:12:37 MST ----

The patch is now in SVN. The x86 static case could be optimized further
if somebody wants to work on it.




---- Additional Comments From joncham@gmail.com 2007-07-20 00:09:14 MST ----

Attached a patch for the x86 static case. I ran the runtime tests, plus
my own tests in a large loop and with exceptions. Please let me know if
the patch is acceptable, and if so, whether I should merge the two
static cases (with/without parameters) or leave them separate for
clarity. Also, I wasn't sure how much code space to reserve.

The microbenchmark goes from about 17 seconds to 11 seconds with the
patch on my machine.



---- Additional Comments From joncham@gmail.com 2007-07-20 00:10:19 MST ----

Created an attachment (id=171996)
Patch for x86 static case




---- Additional Comments From vargaz@gmail.com 2007-07-20 12:54:08 MST ----

This looks ok. You can avoid guessing how much space to reserve by
allocating code using g_malloc, emitting the code, then allocating from
the code manager once the length of the emitted code is known.




---- Additional Comments From lupus@ximian.com 2007-08-01 08:11:57 MST ----

For these cases we should delegate caching to the arch-specific code.
In most cases, for eligible signatures, the exact type doesn't matter;
only the number of params does. With this in place we could allocate
a lot less memory, and we could do it in a non-appdomain-specific way.
For x86, not taking struct returns into account, we could have a simple
array as the cache, with the number of arguments as the index into it.
The code memory would be allocated with mono_global_codeman_reserve().
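The cache described above can be sketched as follows. The names, the eligibility bound, and the `emit` callback are invented for illustration; `emit` stands in for the arch-specific emitter that would be backed by mono_global_codeman_reserve():

```c
#include <assert.h>

/* One global slot per eligible argument count (bound is invented). */
#define MAX_CACHED_PARAMS 8

static void *thunk_cache[MAX_CACHED_PARAMS + 1];

/* Return cached invoke code for `nparams` arguments, emitting it on
   first use. Note: not thread-safe as written. */
static void *get_delegate_thunk(int nparams, void *(*emit)(int)) {
    if (nparams < 0 || nparams > MAX_CACHED_PARAMS)
        return emit(nparams);          /* ineligible: no caching */
    if (!thunk_cache[nparams])
        thunk_cache[nparams] = emit(nparams);
    return thunk_cache[nparams];
}

/* Test emitter: hands out a distinct fake "code" address per call
   and counts how often it runs (purely for demonstration). */
static int emit_calls;
static char fake_code[16];
static void *fake_emit(int nparams) {
    (void)nparams;
    return &fake_code[emit_calls++];
}
```

Keying on argument count alone works because, for eligible signatures, the generated thunk is identical regardless of the argument types.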



---- Additional Comments From joncham@gmail.com 2007-08-30 22:23:48 MST ----

Created an attachment (id=171997)
Static Implementation




---- Additional Comments From joncham@gmail.com 2007-08-30 22:26:27 MST ----

I attached a patch that allocates the correct space using
mono_code_manager_reserve. The x86_jump_membase call only seems to
require 3 bytes, but I just used 4. If I should only allocate 3 bytes
for the x86_jump_membase call, let me know. Thanks.



---- Additional Comments From joncham@gmail.com 2007-09-01 00:31:23 MST ----

Created an attachment (id=171998)
Architecture Specific Delegate Caching Patch




---- Additional Comments From joncham@gmail.com 2007-09-01 00:32:16 MST ----

Attached a first attempt at what Paolo described, caching the delegate
in the architecture specific code in an array based on parameter count.



---- Additional Comments From vargaz@gmail.com 2007-09-01 12:37:36 MST ----

It looks fine except for one thing: it should be made thread-safe,
either by using a lock or by setting the cache[..] entry at the end
using InterlockedCompareExchange.
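The lock-free variant can be sketched as follows; `__sync_val_compare_and_swap` is the GCC/Clang builtin standing in for Mono's interlocked compare-exchange primitive, and the names are invented:

```c
#include <assert.h>

/* Publish `code` into an initially-NULL cache slot. If another
   thread won the race, return its value instead; the loser's
   freshly emitted code is simply dropped (a small leak unless it
   is freed, which is why a lock may be preferable). */
static void *install_thunk(void **slot, void *code) {
    void *prev = __sync_val_compare_and_swap(slot, (void *)0, code);
    return prev ? prev : code;
}
```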




---- Additional Comments From joncham@gmail.com 2007-09-01 21:50:32 MST ----

Created an attachment (id=171999)
Architecture Delegate Impl with InterlockedCompareExchange




---- Additional Comments From vargaz@gmail.com 2007-09-02 06:05:52 MST ----

This looks ok to check in.




---- Additional Comments From lupus@ximian.com 2007-09-06 11:10:31 MST ----

There are a few issues:
*) the has_target case doesn't need a cache based on the number of
arguments: there is going to be just one version regardless of the
arguments.
*) delegate_invoke_impl_with_target_hash also needs to be removed from
domain.c and the header files
*) you need a space before any open [
*) even if the possible leak is minimal, it's so easy to avoid it
using a lock that it's best to handle it properly



---- Additional Comments From joncham@gmail.com 2007-09-10 15:22:10 MST ----

Created an attachment (id=172000)
Architecture Specific Delegate Caching #2




---- Additional Comments From lupus@ximian.com 2007-09-11 13:59:36 MST ----

Looks good; please commit the arch_init/cleanup changes first, and the
x86 and amd64 implementations in separate commits.
Thanks!



---- Additional Comments From joncham@gmail.com 2007-09-14 23:57:43 MST ----

Committed r85696/r85697. Closing unless there are any objections.

