Bugzilla – Bug 317324
The UTF-8 decoding problems
Last modified: 2007-09-15 21:24:23 UTC
---- Reported by svetlanaz@mainsoft.com 2005-02-27 07:05:20 MST ----

Description of Problem:
The UTF-8 decoder does not return the original characters. The character "\uFEFF" (bytes EF BB BF) is not returned at all.

Steps to reproduce the problem:

    using System;
    using System.Text;

    class Test
    {
        public static void Main( String[] args )
        {
            Encoding utf = Encoding.UTF8;
            char[] testChars = {'\uFEFF','A'};
            // Round-trip the characters through the UTF-8 encoder and decoder.
            byte[] bytes = utf.GetBytes(testChars);
            char[] chars = utf.GetChars(bytes);
            foreach (char c in chars) {
                Console.Write("[{0:x4}] ", (int)c);
            }
            Console.WriteLine();
            Console.WriteLine( "Press any key ...");
            Console.ReadLine();
        }
    }

Actual Results:
[0041]

Expected Results:
[feff] [0041]

How often does this happen?
Always

Additional Information:

---- Additional Comments From rafaelteixeirabr@hotmail.com 2005-02-28 15:18:41 MST ----

"\uFEFF" is the BOM (Byte Order Mark) when it is the first character in a stream/string/buffer. We need to discuss its preservation...

---- Additional Comments From svetlanaz@mainsoft.com 2005-03-01 06:19:09 MST ----

Hi,
The BOM should be returned from the decoder for compliance with .NET. I have a patch to solve the problem in the UTF-8 and Unicode encodings. Please let me know when I can send the patch.
Svetlana.

---- Additional Comments From gonzalo@ximian.com 2005-03-01 20:23:18 MST ----

You can attach the patch here ('Create new attachment' link).

---- Additional Comments From svetlanaz@mainsoft.com 2005-03-02 06:12:44 MST ----

Created an attachment (id=167502)
Handle decoding of the \uFEFF character and correct decoding of surrogate pairs

---- Additional Comments From gonzalo@ximian.com 2005-03-03 17:44:44 MST ----

Created an attachment (id=167503)
New patch

---- Additional Comments From gonzalo@ximian.com 2005-03-03 17:49:03 MST ----

The patch I attached is the same as yours but without removing the 5- and 6-byte cases from those two 'switch' statements. The test works.
Any reason to remove those cases, or should I commit the patch I attached?

---- Additional Comments From svetlanaz@mainsoft.com 2005-03-06 05:39:48 MST ----

Hi,
I don't see a reason to handle 5- and 6-byte decoding if the encoder does not encode such cases (the UTF-8 encoder implementation can encode only up to 4 bytes per character). But it does not bother me, and you can commit the patch.
Thanks.

---- Additional Comments From gonzalo@ximian.com 2005-04-21 02:15:35 MST ----

Applying this patch breaks mcs.

---- Additional Comments From svetlanaz@mainsoft.com 2005-04-21 07:28:20 MST ----

Hi,
In .NET, the UTF-8 decoder returns the '\uFEFF' character. In Mono before my patch, the character was eaten. The patch corrects the problem. I think that the Decoder is a low-level API and should return all encoded characters, and that it is the responsibility of the users to decide how to treat each character. So the problem is not with the patch but with mcs itself, which uses the decoder incorrectly. mcs should handle the logic for special characters such as '\uFEFF'.
Thanks,
Svetlana

---- Additional Comments From gonzalo@ximian.com 2005-04-21 07:58:27 MST ----

Yes, that's why mcs needs to be fixed before applying this patch (I moved the component of the bug to the C# compiler).

---- Additional Comments From miguel@ximian.com 2005-04-23 15:12:25 MST ----

I will take care of the mcs side of things.

---- Additional Comments From miguel@ximian.com 2005-05-12 18:22:51 MST ----

Am re-assigning to Lluis. I thought that this had broken the encoder-autodetection code in StreamReader, but a sample program shows that this is working. The problem seems to be that it broke the computation of the preamble size in mcs/support.cs's SeekableStreamReader in the compiler. I wonder: what if we do not use corlib's auto-detection of the encoder, and instead "peek" at the results ourselves in SeekableStreamReader?
We only auto-detect 3 kinds of files anyway (the three Unicode variants).

---- Additional Comments From atsushi@ximian.com 2005-12-07 07:43:00 MST ----

Actually this bug had already been fixed (yeah, I remember I fixed it during mcs bugfixing).

Imported an attachment (id=167502)
Imported an attachment (id=167503)

Operating system details: Windows XP Professional Service Pack 2
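
The "peek at the results ourselves" idea from the 2005-05-12 comment could look roughly like the sketch below. This is not the actual mcs/support.cs code. `BomPeek` and `Detect` are hypothetical names, and the sketch assumes that "the three Unicode variants" means UTF-8, UTF-16 little-endian and UTF-16 big-endian, and that the stream is seekable. It reads up to three bytes, restores the stream position, and reports the preamble size directly instead of deriving it from the decoder's output.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only; not part of mcs or corlib.
public static class BomPeek
{
    // Looks at the first bytes of a seekable stream without consuming them.
    // Returns the encoding implied by the byte order mark (or 'fallback' if
    // there is none) and sets 'preambleSize' to the BOM length in bytes.
    public static Encoding Detect(Stream stream, Encoding fallback, out int preambleSize)
    {
        long start = stream.Position;
        byte[] head = new byte[3];
        int read = 0;
        while (read < head.Length)
        {
            int n = stream.Read(head, read, head.Length - read);
            if (n <= 0)
                break;              // end of stream: fewer than 3 bytes available
            read += n;
        }
        stream.Position = start;    // peek only: put the position back

        // UTF-8 BOM: EF BB BF
        if (read >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        {
            preambleSize = 3;
            return new UTF8Encoding(true);
        }
        // UTF-16 little-endian BOM: FF FE
        // (assumption: UTF-32, whose LE BOM also starts FF FE, is not handled)
        if (read >= 2 && head[0] == 0xFF && head[1] == 0xFE)
        {
            preambleSize = 2;
            return new UnicodeEncoding(false, true);
        }
        // UTF-16 big-endian BOM: FE FF
        if (read >= 2 && head[0] == 0xFE && head[1] == 0xFF)
        {
            preambleSize = 2;
            return new UnicodeEncoding(true, true);
        }

        preambleSize = 0;
        return fallback;
    }
}
```

A caller could then advance the stream by `preambleSize` before handing it to a decoder, so that the position arithmetic no longer depends on whether the decoder returns or swallows '\uFEFF'.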