Bugzilla – Bug 399959
BinaryReader skips invalid characters by default, MSFT 2.0 SP1 doesn't
Last modified: 2008-08-05 11:17:18 UTC
I am using Mono 1.9.1. The problem is with the ReadChars function: all of a sudden it skips more than it is supposed to. When I use ReadBytes, it works. After ReadChars(64) the stream position is supposed to be 128, but it is 136. Here are the file and some test code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            int nrOfColumns = 0;
            Stream memoryStream = File.OpenRead("test.mix");
            BinaryReader reader = new BinaryReader(memoryStream);

            reader.BaseStream.Position = 0;
            Console.Out.WriteLine(reader.BaseStream.Position);
            char[] chars = reader.ReadChars(4);
            Console.Out.WriteLine(reader.BaseStream.Position);
            string identification = new string(chars);
            Console.Out.WriteLine("identification = " + identification);

            reader.BaseStream.Position = 40;
            nrOfColumns = reader.ReadInt32();
            Console.Out.WriteLine(reader.BaseStream.Position);

            reader.BaseStream.Position = 60;
            for (int i = 0; i < nrOfColumns; i++)
            {
                reader.ReadInt32();
                Console.Out.WriteLine(reader.BaseStream.Position);
                char[] mixName = reader.ReadChars(64);
                Console.Out.WriteLine(reader.BaseStream.Position);
                //Console.Out.WriteLine(" mixName= " + new string(mixName));
            }
        }
    }
}
Created attachment 221984: test file for the test code
The text in the file contains eight invalid characters. MSFT's UTF8Encoding, as used by default by BinaryReader, converts each of them to U+FFFD, i.e. REPLACEMENT CHARACTER, and so gets 64 chars from 64 bytes. Mono (etc.) just skips them, so it needs to read onward to get eight more. Hence the difference.

The U+FFFD behaviour is new in .NET 2.0 SP1 [1], so if you run your app on MSFT FX 1.1, or on the original FX2, you'll see the same problem. I've tested the first at least!

The text in the file appears to contain fifteen bytes of (ASCII/Latin) null-terminated text, with the remaining 48 bytes being uninitialised data, hence the invalid bytes.

If the specification says 64 _bytes_ of text, then the best solution would be to just use ReadBytes(64) and then encoding.GetString; that will work in all situations. Initializing BinaryReader with Encoding.ASCII (or a Latin one) is another possibility. If you're also writing the file, then remember to null those buffers!

[1] See
http://support.microsoft.com/kb/940521/
http://blogs.msdn.com/michkap/archive/2007/09/17/4950277.aspx
http://blogs.msdn.com/shawnste/archive/2007/07/23/utf-16-utf-8-utf-32-update-to-conform-with-unicode-5-0-s-security-concerns.aspx
etc.
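A minimal sketch of the ReadBytes approach suggested above, assuming each name field is a fixed-width buffer of NUL-terminated single-byte text. The class and method names (FixedFieldReader, ReadFixedString) are hypothetical and are not part of the reporter's test code.

```csharp
using System;
using System.IO;
using System.Text;

public static class FixedFieldReader
{
    // Reads exactly `width` bytes from the reader and decodes only the part
    // before the first NUL byte, so uninitialised trailing bytes never reach
    // the decoder. The stream always advances by `width` bytes.
    public static string ReadFixedString(BinaryReader reader, int width, Encoding encoding)
    {
        byte[] raw = reader.ReadBytes(width);
        if (raw.Length < width)
        {
            throw new EndOfStreamException(
                "Field truncated: expected " + width + " bytes, got " + raw.Length);
        }

        // Assumption: the text is NUL-terminated; if no NUL is found,
        // the whole buffer is treated as text.
        int end = Array.IndexOf(raw, (byte)0);
        if (end < 0)
        {
            end = raw.Length;
        }

        return encoding.GetString(raw, 0, end);
    }
}
```

In the test code, `reader.ReadChars(64)` could then be replaced by something like `FixedFieldReader.ReadFixedString(reader, 64, Encoding.ASCII)`. Because the byte count is fixed, the stream position no longer depends on how the decoder treats invalid data.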
Mono's FX2 BinaryReader defaults to a UTF8Encoding object which inserts nothing when invalid data is read. The current MSFT behaviour is obtained by passing a standard UTF8 encoding into the two-parameter constructor:

BinaryReader reader = new BinaryReader(memoryStream, Encoding.UTF8);

To be compatible with current MSFT 2.0 rather than original 2.0, the single-parameter constructor should use a 'standard' UTF8Encoding.

The MSFT change was actually made before SP1, in MS07-040. See http://blogs.msdn.com/dougste/archive/2007/09/06/version-history-of-the-clr-2-0.aspx and note that MS07-040=KB931212=>KB928365=2.0.50727.832
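The difference between the two decoding behaviours can be sketched in isolation, without BinaryReader. This is only an illustration: the class name Utf8FallbackDemo is hypothetical, and the "dropping" encoding approximates the skip-invalid-data behaviour described in this report by using an empty replacement string. It is not Mono's actual implementation.

```csharp
using System;
using System.Text;

public static class Utf8FallbackDemo
{
    // Standard UTF-8 decoding: each invalid byte sequence is replaced
    // by U+FFFD (REPLACEMENT CHARACTER).
    public static string DecodeWithReplacement(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }

    // Assumption: mimics a decoder that emits nothing for invalid data,
    // by replacing invalid sequences with the empty string.
    public static string DecodeDroppingInvalid(byte[] data)
    {
        Encoding dropping = Encoding.GetEncoding(
            "utf-8",
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback(string.Empty));
        return dropping.GetString(data);
    }
}
```

For the six bytes 'A', 'B', 0xFF, 0xFF, 'C', 'D', the first method yields six chars (two of them U+FFFD) and the second yields the four chars "ABCD". A ReadChars(n) call backed by the second kind of decoder therefore has to consume extra bytes to deliver n chars, which matches the position drift reported here.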
Atsushi, this looks like an area you deal with. Should we change BinaryReader to use the strict UTF-8 decoding, as MSFT's does now?
It isn't. I'll have a look when I think it is appropriate to work on it, though.