Bug 399959 - BinaryReader skips invalid characters by default, MSFT 2.0 SP1 doesn't
Summary: BinaryReader skips invalid characters by default, MSFT 2.0 SP1 doesn't
Status: NEW
Alias: None
Product: Mono: Class Libraries
Classification: Mono
Component: System
Version: 1.9
Hardware: Other Other
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: Mono Bugs
QA Contact: Mono Bugs
URL:
Whiteboard:
Keywords: Code
Depends on:
Blocks:
 
Reported: 2008-06-13 08:36 UTC by Mario De Clippeleir
Modified: 2008-08-05 11:17 UTC

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
test file for the test code (10.42 KB, application/octet-stream)
2008-06-13 09:16 UTC, Mario De Clippeleir

Description Mario De Clippeleir 2008-06-13 08:36:26 UTC
I am using Mono 1.9.1.
The problem is with the ReadChars function. All of a sudden it reads
more bytes than it is supposed to. When I use ReadBytes, it works.
After ReadChars(64) the stream position should be 128, but it is 136.

Here is the file and some test code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace ConsoleApplication1
{
   class Program
   {

       static void Main(string[] args)
       {
           int nrOfColumns = 0;
           Stream memoryStream = File.OpenRead("test.mix");
           BinaryReader reader = new BinaryReader(memoryStream);
           reader.BaseStream.Position = 0;
           Console.Out.WriteLine(reader.BaseStream.Position);
           char[] chars = reader.ReadChars(4);
           Console.Out.WriteLine(reader.BaseStream.Position);
           string identification = new string(chars);
           Console.Out.WriteLine("identification = " + identification);

           reader.BaseStream.Position = 40;
           nrOfColumns = reader.ReadInt32();
           Console.Out.WriteLine(reader.BaseStream.Position);
           reader.BaseStream.Position = 60;
           for (int i = 0; i < nrOfColumns; i++)
           {
               reader.ReadInt32();
               Console.Out.WriteLine(reader.BaseStream.Position);
               char[] mixName = reader.ReadChars(64);
               Console.Out.WriteLine(reader.BaseStream.Position);
               //Console.Out.WriteLine(" mixName= " + new string(mixName));

           }
       }
   }
}
Comment 1 Mario De Clippeleir 2008-06-13 09:16:14 UTC
Created attachment 221984 [details]
test file for the test code
Comment 2 Andy Hume 2008-06-13 21:27:44 UTC
The text in the file contains eight invalid characters. MSFT's UTF8Encoding, as used by default by BinaryReader, converts each of them to U+FFFD, i.e. REPLACEMENT CHARACTER, so it gets 64 chars from 64 bytes.  Mono (like the older MSFT versions) just skips them, so it needs to read onward to get eight more chars.  Hence the difference.  The U+FFFD behaviour is new in .NET 2.0 SP1 [1], so if you run your app on MSFT FX 1.1, or on the original FX2, you'll see the same problem.  I've tested the first at least!
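
To check which behaviour a given runtime has, a minimal sketch along these lines may help. The class name, the Decode helper and the three sample bytes are made up for illustration; 0xFF is used because it can never occur in valid UTF-8.

```csharp
using System;
using System.Text;

static class ReplacementDemo
{
    // Decodes raw bytes with the runtime's standard UTF-8 encoding.
    public static string Decode(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }

    static void Main()
    {
        // 'A', one invalid byte, 'B' (hypothetical sample data).
        byte[] data = { 0x41, 0xFF, 0x42 };
        string decoded = Decode(data);

        // A runtime that substitutes U+FFFD yields 3 chars here;
        // one that silently skips the invalid byte yields 2.
        Console.WriteLine("chars decoded: " + decoded.Length);
        foreach (char c in decoded)
            Console.WriteLine("U+{0:X4}", (int)c);
    }
}
```

If the second code point printed is U+FFFD, the runtime substitutes rather than skips, and char counts will line up with byte counts for input like this.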

The text in the file appears to contain fifteen bytes of (ASCII/Latin) null terminated text, with the remaining 48 bytes being uninitialised data -- thus the invalid bytes.  If the specification says 64 _bytes_ of text then the best solution would be to just use ReadBytes(64) and then encoding.GetString; that'll work in all situations.  Initializing BinaryReader with Encoding.ASCII (or a Latin one) is another possibility.
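
A rough sketch of the ReadBytes approach follows. The ReadFixedString helper, the 64-byte synthetic field and the choice of Encoding.ASCII are assumptions for illustration; the real file format may call for a different encoding.

```csharp
using System;
using System.IO;
using System.Text;

static class FixedField
{
    // Reads exactly 'length' bytes, then decodes only the part before the
    // first NUL byte, so trailing junk never reaches the decoder.
    public static string ReadFixedString(BinaryReader reader, int length, Encoding encoding)
    {
        byte[] raw = reader.ReadBytes(length);
        if (raw.Length < length)
            throw new EndOfStreamException("field is shorter than expected");

        int end = Array.IndexOf(raw, (byte)0);
        if (end < 0)
            end = raw.Length; // no terminator: use the whole field

        return encoding.GetString(raw, 0, end);
    }

    static void Main()
    {
        // Synthetic 64-byte field: "NAME", a NUL, then bytes that are invalid UTF-8.
        byte[] field = new byte[64];
        Encoding.ASCII.GetBytes("NAME").CopyTo(field, 0);
        for (int i = 5; i < field.Length; i++)
            field[i] = 0xFF;

        using (var reader = new BinaryReader(new MemoryStream(field)))
        {
            string name = ReadFixedString(reader, 64, Encoding.ASCII);
            Console.WriteLine(name);                       // NAME
            Console.WriteLine(reader.BaseStream.Position); // 64
        }
    }
}
```

Because the position advances by a fixed byte count, the result does not depend on how the runtime's decoder treats invalid input.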

If you're also writing the file then remember to null those buffers!


[1] See 
http://support.microsoft.com/kb/940521/
http://blogs.msdn.com/michkap/archive/2007/09/17/4950277.aspx
http://blogs.msdn.com/shawnste/archive/2007/07/23/utf-16-utf-8-utf-32-update-to-conform-with-unicode-5-0-s-security-concerns.aspx
etc
Comment 3 Andy Hume 2008-06-15 12:28:33 UTC
Mono's FX2 BinaryReader defaults to a UTF8Encoding object which inserts nothing when invalid data is read.  The MSFT current behaviour is obtained by passing a standard UTF8 encoding into the two-parameter constructor:
    BinaryReader reader = new BinaryReader(memoryStream, Encoding.UTF8);


To be compatible with current MSFT 2.0 rather than original 2.0, the single-parameter constructor should use a 'standard' UTF8Encoding.
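
The effect of the encoding on the stream position can be observed with a small sketch like the one below. The 72-byte sample buffer and the helper names are invented for illustration and are not taken from the attached test file. With a single-byte encoding the position after ReadChars(64) should be 64; with UTF-8 it depends on whether the runtime's decoder substitutes or skips invalid bytes.

```csharp
using System;
using System.IO;
using System.Text;

static class ReaderEncodingDemo
{
    // Builds a synthetic buffer: a 64-byte field containing eight invalid
    // bytes (0xFF), followed by eight bytes belonging to the next field.
    public static byte[] BuildSample()
    {
        byte[] data = new byte[72];
        for (int i = 0; i < 64; i++) data[i] = (byte)'A';
        for (int i = 16; i < 24; i++) data[i] = 0xFF;
        for (int i = 64; i < 72; i++) data[i] = (byte)'B';
        return data;
    }

    // Returns the stream position after ReadChars(64) with the given encoding.
    public static long PositionAfterRead(Encoding encoding)
    {
        using (var reader = new BinaryReader(new MemoryStream(BuildSample()), encoding))
        {
            reader.ReadChars(64);
            return reader.BaseStream.Position;
        }
    }

    static void Main()
    {
        Console.WriteLine("UTF-8:   " + PositionAfterRead(Encoding.UTF8));
        Console.WriteLine("Latin-1: " + PositionAfterRead(Encoding.GetEncoding("iso-8859-1")));
    }
}
```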

The MSFT change was actually pre-SP1, in MS07-040.  See http://blogs.msdn.com/dougste/archive/2007/09/06/version-history-of-the-clr-2-0.aspx and note that MS07-040=KB931212=>KB928365=2.0.50727.832
Comment 4 Andy Hume 2008-08-05 09:17:53 UTC
Atsushi, this looks like an area you deal with.  Should we change the BinaryReader to use the strict UTF-8 decoding like MSFT's does now?
Comment 5 Atsushi Enomoto 2008-08-05 11:17:18 UTC
It isn't. I'll have a look when I think it is appropriate to work on it, though.