Bug 324655 (MONO81981) - [PERF] Up to 4x slower unsafe code
Status: RESOLVED FIXED
Alias: MONO81981
Product: Mono: Runtime
Classification: Mono
Component: JIT
Version: 1.2
Hardware: Other (OS: Other)
Priority: P3 - Medium; Severity: Normal
Target Milestone: ---
Assignee: Paolo Molaro
QA Contact: Mono Bugs
Reported: 2007-06-30 13:44 UTC by Marek Safar
Modified: 2007-09-15 21:24 UTC (History)
Attachments
Sure, here it is. (1.53 KB, application/octet-stream)
2007-07-04 15:03 UTC, Thomas Wiest
Description Thomas Wiest 2007-09-15 20:42:49 UTC


---- Reported by marek.safar@seznam.cz 2007-06-30 06:44:48 MST ----

Description of Problem:


Steps to reproduce the problem:
1. Compile the following program with mcs and with csc, then run each binary on Mono and on the MS runtime:

using System;

class C
{
	public static void Main()
	{
		Test_Orig ();
	}

	public static unsafe void Test_Orig()
	{
		byte[] a = new byte[10000];

		long start = Environment.TickCount;
		byte red, green, blue;

		fixed (byte* fixed_array = &a[0])
		{
			for (int runs = 0; runs < 100000; runs++)
			{
				byte* pointer = fixed_array;
				byte* end = fixed_array + 999;

				// Convert each 3-byte pixel to grayscale in place.
				while (pointer < end)
				{
					blue = pointer[0];
					green = pointer[1];
					red = pointer[2];

					pointer[0] = pointer[1] = pointer[2] = (byte)(red * 0.3 + green * 0.59 + blue * 0.11);
					pointer += 3;
				}
			}
		}

		// Elapsed time in milliseconds.
		long end_c = Environment.TickCount - start;
		Console.WriteLine ("End Orig = " + end_c);
	}
}

Actual Results (elapsed time in milliseconds, as printed by the test):

Compiled by mcs

Mono: 657
MS: 297

Compiled by csc

Mono: 960
MS: 250

Expected Results:

Similar performance

Additional Information:

I am wondering whether I should change mcs to produce different code too.



---- Additional Comments From vargaz@gmail.com 2007-06-30 16:01:31 MST ----

What platform is this? It might be that our x86 JIT uses the x87
instruction set, while MS might be using the SSE instruction set.




---- Additional Comments From marek.safar@seznam.cz 2007-07-01 06:55:49 MST ----

I ran it on 32-bit x86, with SIMD support. However, when I checked
what the MS JIT produces, it uses only two SSE2 instructions (MOVSD + CVTTSD2SI).

Here is the important part of the JITed code.

			for (int runs = 0; runs < 100000; runs++) {
000000a4  xor         edx,edx 
000000a6  mov         dword ptr [ebp-28h],edx 
000000a9  nop              
000000aa  jmp         0000012C 
				byte* pointer = fixed_array;
000000af  mov         edi,dword ptr [ebp-24h] 
000000b2  mov         esi,edi 
				byte* end = fixed_array + 999;
000000b4  mov         edi,dword ptr [ebp-24h] 
000000b7  add         edi,3E7h 
000000bd  mov         dword ptr [ebp-2Ch],edi 
000000c0  nop              
000000c1  jmp         00000124 
					blue = pointer[0];
000000c3  movzx       eax,byte ptr [esi] 
000000c6  mov         dword ptr [ebp-20h],eax 
					green = pointer[1];
000000c9  movzx       eax,byte ptr [esi+1] 
000000cd  mov         dword ptr [ebp-1Ch],eax 
					red = pointer[2];
000000d0  movzx       eax,byte ptr [esi+2] 
000000d4  mov         dword ptr [ebp-18h],eax 
					pointer[0] = pointer[1] = pointer[2] = (byte) (red * 0.3 + green * 0.59 + blue * 0.11);
000000d7  fild        dword ptr [ebp-18h] 
000000da  fmul        qword ptr ds:[010C0228h] 
000000e0  fild        dword ptr [ebp-1Ch] 
000000e3  fmul        qword ptr ds:[010C0230h] 
000000e9  faddp       st(1),st 
000000eb  fild        dword ptr [ebp-20h] 
000000ee  fmul        qword ptr ds:[010C0238h] 
000000f4  faddp       st(1),st 
000000f6  fstp        qword ptr [ebp-48h] 
000000f9  movsd       xmm0,mmword ptr [ebp-48h] 
000000fe  cvttsd2si   eax,xmm0 
00000102  and         eax,0FFh 
00000107  mov         dword ptr [ebp-38h],eax 
0000010a  mov         eax,dword ptr [ebp-38h] 
0000010d  mov         byte ptr [esi+2],al 
00000110  mov         eax,dword ptr [ebp-38h] 
00000113  mov         dword ptr [ebp-3Ch],eax 
00000116  mov         eax,dword ptr [ebp-3Ch] 
00000119  mov         byte ptr [esi+1],al 
0000011c  mov         eax,dword ptr [ebp-3Ch] 
0000011f  mov         byte ptr [esi],al 
					pointer += 3;
00000121  add         esi,3 

				while (pointer < end) {
00000124  cmp         esi,dword ptr [ebp-2Ch] 
00000127  jb          000000C3 
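
As a reading aid for the conversion step in the listing (cvttsd2si followed by and eax,0FFh): the double result is truncated toward zero and only the low 8 bits are kept, which matches what the C# (byte) cast does for in-range values. A minimal sketch of that behavior follows; the class and the helper name ToGray are hypothetical and do not come from the report.

```csharp
using System;

class ConversionSketch
{
	// Hypothetical helper (not from the report): same expression as the test loop.
	public static byte ToGray (byte red, byte green, byte blue)
	{
		// The weighted sum is a double; the cast truncates toward zero.
		return (byte)(red * 0.3 + green * 0.59 + blue * 0.11);
	}

	public static void Main ()
	{
		// 255 * 0.3 is about 76.5, so the cast yields 76 rather than rounding up.
		Console.WriteLine (ToGray (255, 0, 0));

		double d = 76.9;
		Console.WriteLine ((byte) d);
	}
}
```

Results that land very close to an integer boundary may differ between an x87 code path and an SSE2 code path because of intermediate precision, so the sketch deliberately uses values well away from such boundaries.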




---- Additional Comments From lupus@ximian.com 2007-07-04 06:19:19 MST ----

Please add the MS-generated exe binary, thanks.



---- Additional Comments From marek.safar@seznam.cz 2007-07-04 06:53:44 MST ----

I am not sure what you mean. Do you want the ngen-ed output or the
normal MSIL assembly?



---- Additional Comments From lupus@ximian.com 2007-07-04 07:40:37 MST ----

The IL code, to see why we generate worse JIT code from their IL;
we don't care about their JIT-generated code.
Anyway, I added support in the JIT for using SSE2 instructions to
convert to int, and this makes the test go from 1850 to 800 on my
system. This should cover pretty much all of the difference with the
mcs-compiled IL code.



---- Additional Comments From marek.safar@seznam.cz 2007-07-04 08:03:20 MST ----

Created an attachment (id=172225)
Sure, here it is.




---- Additional Comments From lupus@ximian.com 2007-07-04 10:49:40 MST ----

The fix for the major issue (float->int conversion) is in svn.
The csc-compiled code slowdown is, I think, related to the use of dup,
and it is well known that the current JIT doesn't deal very well with it.
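
For context on the dup remark: a chained assignment such as pointer[0] = pointer[1] = pointer[2] = value is the kind of source that a compiler can lower with the IL dup instruction in order to reuse the computed value. One possible source-level variation, offered here as an untested assumption and not something proposed in this report, is to compute the value once into a local and store it three times. The sketch below uses hypothetical names (DupSketch, GrayChained, GrayWithLocal, SameResult) and only checks that both forms write the same bytes; whether it changes the IL or the JIT output would have to be inspected and measured.

```csharp
using System;

// Compile with unsafe code enabled (mcs -unsafe or csc /unsafe).
class DupSketch
{
	// Chained form, as in the original test loop.
	public static unsafe void GrayChained (byte* pixel)
	{
		byte blue = pixel[0], green = pixel[1], red = pixel[2];
		pixel[0] = pixel[1] = pixel[2] = (byte)(red * 0.3 + green * 0.59 + blue * 0.11);
	}

	// Hypothetical alternative: compute once into a local, then three plain stores.
	public static unsafe void GrayWithLocal (byte* pixel)
	{
		byte blue = pixel[0], green = pixel[1], red = pixel[2];
		byte gray = (byte)(red * 0.3 + green * 0.59 + blue * 0.11);
		pixel[0] = gray;
		pixel[1] = gray;
		pixel[2] = gray;
	}

	// Runs both forms on copies of the same pixel and reports whether they agree.
	public static unsafe bool SameResult (byte blue, byte green, byte red)
	{
		byte[] a = { blue, green, red };
		byte[] b = { blue, green, red };
		fixed (byte* pa = a, pb = b)
		{
			GrayChained (pa);
			GrayWithLocal (pb);
		}
		return a[0] == b[0] && a[1] == b[1] && a[2] == b[2];
	}

	public static void Main ()
	{
		Console.WriteLine (SameResult (10, 200, 30));
	}
}
```

Comparing the IL of the two methods (for example with a disassembler such as monodis or ildasm) would show whether the second form actually avoids dup for a given compiler.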
