Bug 311050 (MONO23541) - mcs needs to deal with the encoding of source files
Summary: mcs needs to deal with the encoding of source files
Status: RESOLVED FIXED
Alias: MONO23541
Product: Mono: Compilers
Classification: Mono
Component: C#
Version: unspecified
Hardware: Other Other
Importance: P3 - Medium : Enhancement
Target Milestone: ---
Assignee: Miguel de Icaza
QA Contact: Mono Bugs
URL:
Whiteboard:
Keywords: Built
Depends on:
Blocks:
 
Reported: 2002-04-17 03:13 UTC by Martin Adoue
Modified: 2010-03-11 13:14 UTC

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Thomas Wiest 2007-09-15 17:56:21 UTC


---- Reported by martin@cwanet.com 2002-04-16 20:13:01 MST ----

Description of Problem:
mcs fails to compile sources with long comment lines. csc has no problem.

Steps to reproduce the problem:
1. copy the lines below to test2.cs
2. try to compile with mcs --target exe test2.cs

Actual Results:
(process:3348): ** ERROR **: file unicode.c: line 288 (iconv_get_length): should not be reached
aborting...

Expected Results:
A clean compile.

How often does this happen? 
Always

Additional Information:
CVS snapshot of 2002-04-12, running on Windows


using System;
namespace test2
{
	class test2
	{
		/// <param name="NumDigitsAfterDecimal">Optional. Numeric value indicating how many places are displayed to the right of the decimal. Default value is –1, which indicates that the computer's regional settings are used.</param>
		static int Main(string[] args)
		{
			return 0;
		}
	}
}



---- Additional Comments From miguel@ximian.com 2002-04-19 17:04:18 MST ----

This is a bug in the runtime.



---- Additional Comments From lupus@ximian.com 2002-04-20 08:26:30 MST ----

It works on Linux; this may be due to the different behaviour libiconv
has on some platforms (this bug may also be related to https://bugzilla.novell.com/show_bug.cgi?id=MONO23116).



---- Additional Comments From gonzalo@ximian.com 2002-04-23 12:03:49 MST ----

I cut and pasted the sample code and it failed because there
was a "û" instead of a "-" (minus sign) just before the 1 (in "Default
value is -1") (cut&paste between mozilla and gvim).

The error returned by iconv () in the first case is EILSEQ.
This g_warning () showed "û1, which indicates..." as the source of the
EILSEQ.
---
case EILSEQ:
	g_warning ("Iconv error at or near: %s\n", p);
	have_error = TRUE;
---

When I changed the "û" to "-" it compiles ok in cygwin.




---- Additional Comments From gonzalo@ximian.com 2002-04-27 21:43:44 MST ----

*** https://bugzilla.novell.com/show_bug.cgi?id=MONO23951 has been marked as a duplicate of this bug. ***



---- Additional Comments From martin@gnome.org 2002-04-28 07:42:20 MST ----

See also my comment on https://bugzilla.novell.com/show_bug.cgi?id=MONO23951 - this is because iconv() tries
to convert the string from UTF-8 to UTF-16le and the byte shown as
`û' is not a valid UTF-8 sequence.

So IMO this is not a bug in the runtime, but in the class 
libraries/MCS - iconv_open() is called with "UTF-8" as source and 
"UTF-16le" as target - so it's correct to throw an error (of course 
it should throw an exception and not g_assert_not_reached ()).

MCS/the class library must determine the encoding of the input file
and then make sure that iconv_open() gets the correct source 
encoding (iso-8859-1 in this case).
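
The effect of the source encoding passed to iconv_open() can be sketched in C as follows. This is only an illustration, not the Mono runtime code: the helper name to_utf16le and its signature are hypothetical, and the behaviour described assumes an iconv implementation (such as glibc's) that reports EILSEQ for invalid input.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not Mono's code): convert `in` from `from_charset`
 * to UTF-16LE. Returns the number of bytes written to `out`, or -1 with
 * errno set (EILSEQ when the input is invalid in `from_charset`). */
long to_utf16le(const char *from_charset, const char *in, size_t in_len,
                char *out, size_t out_cap)
{
    /* iconv_open() takes the target charset first, then the source. */
    iconv_t cd = iconv_open("UTF-16LE", from_charset);
    if (cd == (iconv_t)-1)
        return -1;                      /* charset name not supported */

    char *src = (char *)in;             /* iconv() takes a non-const pointer */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    int saved_errno = errno;            /* iconv_close() may change errno */
    iconv_close(cd);

    if (rc == (size_t)-1) {
        errno = saved_errno;            /* report the error to the caller */
        return -1;
    }
    return (long)(out_cap - dst_left);  /* bytes produced */
}
```

Called with "UTF-8" as the source charset on input containing the stray byte 0x96, this helper is expected to fail with EILSEQ; called with "ISO-8859-1" on the same bytes it converts them, because every byte value is a valid ISO-8859-1 character.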



---- Additional Comments From lupus@ximian.com 2002-08-27 13:05:23 MST ----

Moving to mcs: mcs needs to validate the source files for the correct
encoding (maybe starting with utf8 and then falling back to latin1 or
the default encoding for the locale...).
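
A minimal C sketch of the "try UTF-8, then fall back to Latin-1" idea is shown here. It is an assumption about how such a check could look, not what mcs or the class library actually does; the names looks_like_utf8 and guess_charset are hypothetical, and the validation is deliberately simplified.

```c
#include <stddef.h>

/* Returns 1 if buf[0..len) has valid UTF-8 byte structure, 0 otherwise.
 * Simplified: it checks lead and continuation bytes and rejects the
 * lead bytes 0xC0, 0xC1 and 0xF5..0xFF, but it does not reject every
 * overlong form or encoded surrogate. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;                   /* continuation bytes expected */

        if (c < 0x80)
            extra = 0;
        else if (c >= 0xC2 && c <= 0xDF)
            extra = 1;
        else if (c >= 0xE0 && c <= 0xEF)
            extra = 2;
        else if (c >= 0xF0 && c <= 0xF4)
            extra = 3;
        else
            return 0;                   /* stray continuation or bad lead byte */

        if (extra > len - i - 1)
            return 0;                   /* sequence truncated at end of buffer */
        for (size_t k = 1; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;               /* not a 10xxxxxx continuation byte */
        i += extra + 1;
    }
    return 1;
}

/* Hypothetical policy: prefer UTF-8, otherwise assume Latin-1. */
const char *guess_charset(const unsigned char *buf, size_t len)
{
    return looks_like_utf8(buf, len) ? "UTF-8" : "ISO-8859-1";
}
```

With this sketch, a buffer containing the lone byte 0x96 from the test case is classified as ISO-8859-1, while plain ASCII and a properly encoded en dash (0xE2 0x80 0x93) are classified as UTF-8. A fallback to the locale's default encoding would need an extra step that is not shown.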



---- Additional Comments From lupus@ximian.com 2002-08-27 13:24:21 MST ----

*** https://bugzilla.novell.com/show_bug.cgi?id=MONO27655 has been marked as a duplicate of this bug. ***



---- Additional Comments From martin@gnome.org 2002-08-27 20:29:43 MST ----

Hmm, how should MCS do this? It's using a StreamReader() to read the file.

At the moment, there's nothing I can do about this in MCS:

* the runtime doesn't report any exception, it just blindly ignores the characters so MCS won't even see that something's wrong.

* according to the documentation it's the runtime's job to autodetect the encoding of a file
  See ms-help://MS.NETFrameworkSDK/cpref/html/frlrfsystemiostreamreaderclasscurrentencodingtopic.htm,
  It says that StreamReader.CurrentEncoding is set after the first Read() since the encoding is autodetected.

So IMHO this must be done either by the runtime or by our StreamReader implementation.

I don't want to "force" this bug back into the runtime, so keeping it here and setting priority to wishlist.





---- Additional Comments From miguel@ximian.com 2002-09-04 12:10:00 MST ----

CSC has an option called /codepage:XXX which is used for specifying
the codepage for the input file. 

The documentation for codepage claims that if the source code is in
either the default code page, or Unicode or UTF-8 the compiler will be
able to figure things out on its own.

I am assuming they mean `UTF-16' when they say Unicode.  

Distinguishing what Microsoft calls "Unicode" and "Unicode big endian"
on Windows is easy: the first two bytes are 0xff 0xfe (unicode) or
0xfe 0xff (unicode big endian).

On *windows* they use 0xef 0xbb 0xbf for UTF-8 encoded files.

So I assume the rest is supposed to be encoded in the current
"codepage", an interesting concept, because I do not know how code
pages map to character sets on Unix or how to tell what the current
codepage is.
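
The byte-order-mark check described in this comment can be sketched in C. Little-endian UTF-16 (what Windows calls "Unicode") starts with 0xFF 0xFE, big-endian UTF-16 with 0xFE 0xFF, and the UTF-8 signature is 0xEF 0xBB 0xBF. The enum and function names below are hypothetical, and the sketch does not claim to match what StreamReader or mcs does; it also ignores UTF-32, whose little-endian mark begins with the same two bytes as UTF-16LE.

```c
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE };

/* Inspect the first bytes of a file buffer and report which byte order
 * mark, if any, it starts with. BOM_NONE means the caller has to fall
 * back to some default, such as the current code page. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;            /* "Unicode" on Windows */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;            /* "Unicode big endian" */
    return BOM_NONE;
}
```

A file without any of these marks returns BOM_NONE, which corresponds to the "current codepage" case discussed above.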



---- Additional Comments From miguel@ximian.com 2002-09-06 20:10:06 MST ----

Fixed in both StreamReader and the compiler.



---- Additional Comments From miguel@ximian.com 2002-09-06 20:47:09 MST ----

Forgot to close the bug

