Bugzilla – Bug 324655
[PERF] Up to 4x slower unsafe code
Last modified: 2007-09-15 21:24:46 UTC
---- Reported by marek.safar@seznam.cz 2007-06-30 06:44:48 MST ----

Description of Problem:

Steps to reproduce the problem:
1.

using System;

class C {
    public static void Main()
    {
        Test_Orig ();
    }

    public static unsafe void Test_Orig()
    {
        byte[] a = new byte[10000];
        long start = Environment.TickCount;
        byte red, green, blue;
        fixed (byte* fixed_array = &a[0]) {
            for (int runs = 0; runs < 100000; runs++) {
                byte* pointer = fixed_array;
                byte* end = fixed_array + 999;
                while (pointer < end) {
                    blue = pointer[0];
                    green = pointer[1];
                    red = pointer[2];
                    pointer[0] = pointer[1] = pointer[2] =
                        (byte)(red * 0.3 + green * 0.59 + blue * 0.11);
                    pointer += 3;
                }
            }
        }
        long end_c = Environment.TickCount - start;
        Console.WriteLine ("End Orig = " + end_c);
    }
}

Actual Results:

    Compiled by mcs    Mono: 657    MS: 297
    Compiled by csc    Mono: 960    MS: 250

Expected Results:

Similar performance

Additional Information:

I am wondering whether I should change mcs to produce different code too.

---- Additional Comments From vargaz@gmail.com 2007-06-30 16:01:31 MST ----

What platform is this? It might be the fact that our x86 JIT uses the x87 instruction set, while MS might be using the SSE instruction set.

---- Additional Comments From marek.safar@seznam.cz 2007-07-01 06:55:49 MST ----

I ran it on 32-bit x86, with SIMD support. However, when I checked what the MS JIT produces, it uses only 2 SSE2 instructions (MOVSD + CVTTSD2SI). Here is the important part of the JITed code.
for (int runs = 0; runs < 100000; runs++) {
000000a4  xor   edx,edx
000000a6  mov   dword ptr [ebp-28h],edx
000000a9  nop
000000aa  jmp   0000012C
byte* pointer = fixed_array;
000000af  mov   edi,dword ptr [ebp-24h]
000000b2  mov   esi,edi
byte* end = fixed_array + 999;
000000b4  mov   edi,dword ptr [ebp-24h]
000000b7  add   edi,3E7h
000000bd  mov   dword ptr [ebp-2Ch],edi
000000c0  nop
000000c1  jmp   00000124
blue = pointer[0];
000000c3  movzx eax,byte ptr [esi]
000000c6  mov   dword ptr [ebp-20h],eax
green = pointer[1];
000000c9  movzx eax,byte ptr [esi+1]
000000cd  mov   dword ptr [ebp-1Ch],eax
red = pointer[2];
000000d0  movzx eax,byte ptr [esi+2]
000000d4  mov   dword ptr [ebp-18h],eax
pointer[0] = pointer[1] = pointer[2] = (byte) (red * 0.3 + green * 0.59 + blue * 0.11);
000000d7  fild  dword ptr [ebp-18h]
000000da  fmul  qword ptr ds:[010C0228h]
000000e0  fild  dword ptr [ebp-1Ch]
000000e3  fmul  qword ptr ds:[010C0230h]
000000e9  faddp st(1),st
000000eb  fild  dword ptr [ebp-20h]
000000ee  fmul  qword ptr ds:[010C0238h]
000000f4  faddp st(1),st
000000f6  fstp  qword ptr [ebp-48h]
000000f9  movsd xmm0,mmword ptr [ebp-48h]
000000fe  cvttsd2si eax,xmm0
00000102  and   eax,0FFh
00000107  mov   dword ptr [ebp-38h],eax
0000010a  mov   eax,dword ptr [ebp-38h]
0000010d  mov   byte ptr [esi+2],al
00000110  mov   eax,dword ptr [ebp-38h]
00000113  mov   dword ptr [ebp-3Ch],eax
00000116  mov   eax,dword ptr [ebp-3Ch]
00000119  mov   byte ptr [esi+1],al
0000011c  mov   eax,dword ptr [ebp-3Ch]
0000011f  mov   byte ptr [esi],al
pointer += 3;
00000121  add   esi,3
while (pointer < end) {
00000124  cmp   esi,dword ptr [ebp-2Ch]
00000127  jb    000000C3

---- Additional Comments From lupus@ximian.com 2007-07-04 06:19:19 MST ----

Please add the MS-generated exe binary, thanks.

---- Additional Comments From marek.safar@seznam.cz 2007-07-04 06:53:44 MST ----

I am not sure what you mean. Do you want the ngen-ed output or the normal MSIL assembly?
---- Additional Comments From lupus@ximian.com 2007-07-04 07:40:37 MST ----

The IL code, so we can see why we generate worse JIT code from their IL; we don't care about their JIT-generated code.
Anyway, I added support in the JIT for using SSE2 instructions to convert to int, and this makes the test go from 1850 to 800 on my system. This should cover pretty much all of the difference with the mcs-compiled IL code.

---- Additional Comments From marek.safar@seznam.cz 2007-07-04 08:03:20 MST ----

Created an attachment (id=172225)
Sure, here it is.

---- Additional Comments From lupus@ximian.com 2007-07-04 10:49:40 MST ----

The fix for the major issue (float->int conversion) is in svn.
The slowdown of the csc-compiled code is, I think, related to the use of dup, and it is well known that the current JIT doesn't deal very well with it.
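
The dup remark above concerns the chained assignment pointer[0] = pointer[1] = pointer[2] = ... in the test case. As a minimal sketch, not part of the original report, the loop body below stores the converted value in an explicit local before the three writes. The names GrayscaleSketch and ToGray are hypothetical, and the assumption that a compiler emits plain local loads rather than dup for this form has not been checked against csc or mcs output here.

```csharp
using System;

class GrayscaleSketch
{
    // Hypothetical helper, not from the bug report: same arithmetic as
    // Test_Orig, but the converted value is kept in a local so the three
    // stores do not rely on a chained assignment.
    static unsafe void ToGray(byte* pointer, byte* end)
    {
        while (pointer < end)
        {
            byte blue = pointer[0];
            byte green = pointer[1];
            byte red = pointer[2];

            // Single float->int conversion, stored in an explicit local.
            byte gray = (byte)(red * 0.3 + green * 0.59 + blue * 0.11);

            pointer[0] = gray;
            pointer[1] = gray;
            pointer[2] = gray;
            pointer += 3;
        }
    }

    public static unsafe void Main()
    {
        byte[] a = new byte[10000];
        a[0] = 10; // blue
        a[1] = 20; // green
        a[2] = 30; // red

        // Same range as the original test: the first 999 bytes of the array.
        fixed (byte* fixedArray = &a[0])
        {
            ToGray(fixedArray, fixedArray + 999);
        }

        // The three channels of the first pixel now hold the same value.
        Console.WriteLine("{0} {1} {2}", a[0], a[1], a[2]);
    }
}
```

Like the original test case, this needs unsafe code enabled at compile time (for example csc /unsafe or mcs -unsafe). Timing it against Test_Orig under both compilers would show whether the explicit local changes anything on a given runtime.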