Bug 317324 (MONO73086) - The UTF-8 decoding problems
Summary: The UTF-8 decoding problems
Status: RESOLVED FIXED
Alias: MONO73086
Product: Mono: Compilers
Classification: Mono
Component: C#
Version: unspecified
Hardware: Other Other
: P3 - Medium : Normal
Target Milestone: ---
Assignee: Atsushi Enomoto
QA Contact: Mono Bugs
Reported: 2005-02-27 14:05 UTC by Svetlana Zholkovsky
Modified: 2007-09-15 21:24 UTC


Attachments
treat decoding of the \uFEFF character and correct decoding of the surrogate pair (5.08 KB, patch)
2005-03-02 13:12 UTC, Thomas Wiest
New patch (4.59 KB, patch)
2005-03-04 00:44 UTC, Thomas Wiest

Description Thomas Wiest 2007-09-15 19:09:06 UTC


---- Reported by svetlanaz@mainsoft.com 2005-02-27 07:05:20 MST ----

Description of Problem:
The UTF-8 decoder does not return the original characters. The 
character "\uFEFF" (bytes EF BB BF) is not returned at all.


Steps to reproduce the problem:
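using System;
using System.Text;

class Test
{
	public static void Main( String[] args )
	{
		Encoding utf = Encoding.UTF8;
		char[] testChars = {'\uFEFF','A'};

		// Round-trip the characters through the UTF-8 encoder and decoder.
		byte[] bytes = utf.GetBytes(testChars);
		char[] chars = utf.GetChars(bytes);

		// Print each decoded character as a 4-digit hex code unit.
		foreach (char c in chars)
		{
			Console.Write("[{0:x4}] ", (int)c);
		}

		Console.WriteLine();
		Console.WriteLine( "Press any key ...");
		Console.ReadLine();
	}
}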

Actual Results:
[0041]

Expected Results:
[feff] [0041]

How often does this happen? 
Always

Additional Information:



---- Additional Comments From rafaelteixeirabr@hotmail.com 2005-02-28 15:18:41 MST ----

"\uFEFF" is the BOM (Byte Order Mark) when it is the first character
in a stream/string/buffer. We need to discuss its preservation...
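
For reference, a minimal sketch of the relationship described above: U+FEFF encodes to the same three bytes that Encoding.UTF8 advertises as its preamble. The class and method names (BomBytes, EncodedBom, Preamble) are made up for illustration and are not part of the attached patch.

```csharp
using System;
using System.Text;

public static class BomBytes
{
    // Hex dump of U+FEFF as produced by the UTF-8 encoder.
    public static string EncodedBom()
    {
        byte[] bytes = Encoding.UTF8.GetBytes(new char[] { '\uFEFF' });
        return BitConverter.ToString(bytes);
    }

    // Hex dump of the preamble (BOM) that Encoding.UTF8 reports for the start of a stream.
    public static string Preamble()
    {
        return BitConverter.ToString(Encoding.UTF8.GetPreamble());
    }
}
```

Both methods should return "EF-BB-BF", which is why a decoder cannot tell a BOM from an ordinary U+FEFF character without knowing whether it is at the start of the data.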



---- Additional Comments From svetlanaz@mainsoft.com 2005-03-01 06:19:09 MST ----

Hi,
The BOM should be returned from the decoder for compliance with .NET.
I have a patch to solve the problem in the UTF-8 and Unicode 
encodings.  Please, let me know when I can send the patch.
Svetlana.



---- Additional Comments From gonzalo@ximian.com 2005-03-01 20:23:18 MST ----

You can attach the patch here ('Create new attachment' link)



---- Additional Comments From svetlanaz@mainsoft.com 2005-03-02 06:12:44 MST ----

Created an attachment (id=167502)
treat decoding of the \uFEFF character and correct decoding of the surrogate pair




---- Additional Comments From gonzalo@ximian.com 2005-03-03 17:44:44 MST ----

Created an attachment (id=167503)
New patch




---- Additional Comments From gonzalo@ximian.com 2005-03-03 17:49:03 MST ----

The patch I attached is the same as yours but without removing the
5- and 6-byte cases from those 2 'switch' statements.

The test works. Is there any reason to remove those cases, or do I
commit the patch I attached?



---- Additional Comments From svetlanaz@mainsoft.com 2005-03-06 05:39:48 MST ----

Hi,
I don't see a reason to handle 5- and 6-byte decoding if the 
encoder does not encode such cases (the UTF-8 encoder implementation 
can encode only up to 4 bytes per character). But it does not bother 
me, and you can commit the patch.
Thanks.
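
The 4-byte upper bound mentioned above can be checked with a short sketch like the following. Utf8Lengths and ByteCount are hypothetical names used only for this illustration; the calls themselves (char.ConvertFromUtf32, Encoding.GetByteCount) are standard .NET 2.0 APIs.

```csharp
using System;
using System.Text;

public static class Utf8Lengths
{
    // Number of UTF-8 bytes the encoder produces for one Unicode code point.
    public static int ByteCount(int codePoint)
    {
        // Yields one char, or a surrogate pair for code points above U+FFFF.
        string s = char.ConvertFromUtf32(codePoint);
        return Encoding.UTF8.GetByteCount(s);
    }
}
```

Under this sketch, U+0041 takes 1 byte, U+FEFF takes 3, and the highest code point, U+10FFFF, takes 4, so the encoder never emits a 5- or 6-byte sequence.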



---- Additional Comments From gonzalo@ximian.com 2005-04-21 02:15:35 MST ----

Applying this patch breaks mcs.



---- Additional Comments From svetlanaz@mainsoft.com 2005-04-21 07:28:20 MST ----

Hi,
In .NET, the UTF-8 decoder returns the '\uFEFF' character.
In Mono before my patch, the character was swallowed.
The patch corrects the problem.
I think that the Decoder is a low-level API and should return all 
encoded characters, and that it is the responsibility of its users to 
decide how to treat each character. So the problem is not with the 
patch but with mcs itself, which uses the decoder incorrectly. mcs 
should handle the logic for special characters such as the '\uFEFF' 
character.
Thanks,
Svetlana
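
A rough sketch of the division of responsibility argued for above: the decoder returns everything, and the caller applies its own policy for U+FEFF. BomHandling, DecodeRaw and StripLeadingBom are hypothetical names for illustration, not code from mcs or from the patch.

```csharp
using System;
using System.Text;

public static class BomHandling
{
    // Decodes UTF-8 bytes and leaves every character, including U+FEFF, in the result.
    public static string DecodeRaw(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Caller-side policy: drop a single leading U+FEFF and keep any later ones.
    public static string StripLeadingBom(string text)
    {
        if (text.Length > 0 && text[0] == '\uFEFF')
            return text.Substring(1);
        return text;
    }
}
```

A consumer such as a compiler front end would call StripLeadingBom(DecodeRaw(bytes)) on source files, while code that needs a byte-exact round trip would use DecodeRaw alone.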




---- Additional Comments From gonzalo@ximian.com 2005-04-21 07:58:27 MST ----

Yes, that's why mcs needs to be fixed before applying this patch (I
moved the component of the bug to the C# compiler)



---- Additional Comments From miguel@ximian.com 2005-04-23 15:12:25 MST ----

I will take care of the mcs side of things.



---- Additional Comments From miguel@ximian.com 2005-05-12 18:22:51 MST ----

Am re-assigning to Lluis.

I thought that this had broken the encoder-autodetection code in
StreamReader, but a sample program shows that this is working.

The problem seems to be that it broke the computation of the preamble
size in mcs/support.cs's SeekableStreamReader in the compiler. 

I wonder: what if we do not use corlib's auto-detection of the
encoder, and instead "peek" at the results ourselves in
SeekableStreamReader.  We only auto-detect 3 kinds of files anyway
(the three Unicode variants).
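
The "peek" idea could look roughly like the sketch below. It is not the actual SeekableStreamReader code: BomPeek and Detect are hypothetical names, and the choice of UTF-8, UTF-16 little-endian and UTF-16 big-endian as the three variants is an assumption, since the comment does not name them.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomPeek
{
    // Hypothetical helper: inspects the first bytes of a seekable stream, returns the
    // matching encoding, and leaves the stream positioned just after the BOM.
    // Returns the supplied fallback (and preambleSize 0) when no BOM is present.
    // Assumes Read fills the buffer when enough data exists (true for file and memory streams).
    public static Encoding Detect(Stream stream, Encoding fallback, out int preambleSize)
    {
        long start = stream.Position;
        byte[] head = new byte[3];
        int n = stream.Read(head, 0, head.Length);

        Encoding detected = fallback;
        preambleSize = 0;
        if (n >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF) {
            detected = Encoding.UTF8;             // UTF-8 BOM
            preambleSize = 3;
        } else if (n >= 2 && head[0] == 0xFF && head[1] == 0xFE) {
            detected = Encoding.Unicode;          // UTF-16 little-endian BOM
            preambleSize = 2;
        } else if (n >= 2 && head[0] == 0xFE && head[1] == 0xFF) {
            detected = Encoding.BigEndianUnicode; // UTF-16 big-endian BOM
            preambleSize = 2;
        }

        // Skip exactly the BOM bytes so the preamble size is known to the caller.
        stream.Position = start + preambleSize;
        return detected;
    }
}
```

Because the helper computes the preamble size itself, the result no longer depends on whether the decoder returns or swallows U+FEFF. One limitation of this sketch is that a UTF-32 little-endian BOM (FF FE 00 00) would be reported as UTF-16 little-endian.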



---- Additional Comments From atsushi@ximian.com 2005-12-07 07:43:00 MST ----

Actually this bug had already been fixed (yeah I remember I fixed it
during mcs bugfixing).

Imported an attachment (id=167502)
Imported an attachment (id=167503)

Unknown bug field "cf_op_sys_details" encountered while moving bug
   <cf_op_sys_details>Windows XP Professional Service Pack 2</cf_op_sys_details>
Unknown operating system unknown. Setting to default OS "Other".