Bugzilla – Bug 311050
mcs needs to deal with the encoding of source files
Last modified: 2010-03-11 13:14:01 UTC
---- Reported by martin@cwanet.com 2002-04-16 20:13:01 MST ----

Description of Problem:
mcs fails to compile sources with long comment lines. csc has no problem.

Steps to reproduce the problem:
1. copy the lines below to test2.cs
2. try to compile with mcs --target exe test2.cs

Actual Results:
(process:3348): ** ERROR **: file unicode.c: line 288 (iconv_get_length): should not be reached
aborting...

Expected Results:
A clean compile.

How often does this happen?
Always

Additional Information:
CVS snapshot of 2002-04-12, running on Windows

using System;

namespace test2
{
    class test2
    {
        /// <param name="NumDigitsAfterDecimal">Optional. Numeric value indicating how many places are displayed to the right of the decimal. Default value is –1, which indicates that the computer's regional settings are used.</param>
        static int Main(string[] args)
        {
            return 0;
        }
    }
}

---- Additional Comments From miguel@ximian.com 2002-04-19 17:04:18 MST ----

This is a bug in the runtime.

---- Additional Comments From lupus@ximian.com 2002-04-20 08:26:30 MST ----

It works on linux: this may be due to the different behaviour libiconv has on some platforms (this bug may also be related to https://bugzilla.novell.com/show_bug.cgi?id=MONO23116).

---- Additional Comments From gonzalo@ximian.com 2002-04-23 12:03:49 MST ----

I did cut & paste with the sample code and it failed because there was a "û" instead of a "-" (minus sign) just before the 1 (in "Default value is -1") (cut&paste between mozilla and gvim).

The error returned by iconv () in the first case is EILSEQ. This g_warning () showed "û1, which indicates..." as the source of the EILSEQ.

---
case EILSEQ:
    g_warning ("Iconv error at or near: %s\n", p);
    have_error = TRUE;
---

When I changed the "û" to "-" it compiles ok in cygwin.

---- Additional Comments From gonzalo@ximian.com 2002-04-27 21:43:44 MST ----

*** https://bugzilla.novell.com/show_bug.cgi?id=MONO23951 has been marked as a duplicate of this bug.
***

---- Additional Comments From martin@gnome.org 2002-04-28 07:42:20 MST ----

See also my comment on https://bugzilla.novell.com/show_bug.cgi?id=MONO23951 - this is because iconv() tries to convert the string from UTF-8 to UTF-16le and the `u-umlaut' is not a valid UTF-8 character.

So IMO this is not a bug in the runtime, but in the class libraries/MCS - iconv_open() is called with "UTF-8" as source and "UTF-16le" as target - so it's correct to throw an error (of course it should throw an exception and not g_assert_not_reached ()).

MCS/the class library must determine the encoding of the input file and then make sure that iconv_open() gets the correct source encoding (iso-8859-1 in this case).

---- Additional Comments From lupus@ximian.com 2002-08-27 13:05:23 MST ----

Moving to mcs: mcs needs to validate the source files for the correct encoding (maybe starting with utf8 and then falling back to latin1 or the default encoding for the locale...).

---- Additional Comments From lupus@ximian.com 2002-08-27 13:24:21 MST ----

*** https://bugzilla.novell.com/show_bug.cgi?id=MONO27655 has been marked as a duplicate of this bug. ***

---- Additional Comments From martin@gnome.org 2002-08-27 20:29:43 MST ----

Hmm, how should MCS do this, it's using a StreamReader() to read the file ?

At the moment, there's nothing I can do against this in MCS:

* the runtime doesn't report any exception, it just blindly ignores the characters so MCS won't even see that something's wrong.

* according to the documentation it's the runtime's job to autodetect the encoding of a file

See ms-help://MS.NETFrameworkSDK/cpref/html/frlrfsystemiostreamreaderclasscurrentencodingtopic.htm, it says that StreamReader.CurrentEncoding is set after the first Read() since the encoding is autodetected.

So IMHO this must be done either by the runtime or by our StreamReader implementation. I don't want to "force" this bug back into the runtime, so keeping it here and setting priority to wishlist.
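
Editorial sketch, not part of the original thread: the "start with utf8, then fall back to latin1" idea suggested above could look roughly like the C below. The function names `is_valid_utf8` and `guess_source_encoding` are hypothetical and are not taken from the Mono runtime or mcs. The returned string is only meant as a candidate for the source-encoding argument of iconv_open(). Note that the ISO-8859-1 fallback accepts any byte sequence, so this is a heuristic and not real detection.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8,
 * 0 otherwise. Rejects overlong forms, UTF-16 surrogates and code
 * points above U+10FFFF. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);

    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;       /* continuation bytes expected after c */
        size_t k;
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal for this length */

        if (c < 0x80) {                 /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;   /* stray continuation byte, or 0xF8..0xFF */
        }

        if (len - i <= extra)
            return 0;   /* sequence truncated at end of buffer */

        for (k = 1; k <= extra; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        i += extra + 1;
    }
    return 1;
}

/* Hypothetical helper: choose a source encoding name for iconv_open().
 * Assumption: anything that is not valid UTF-8 is treated as Latin-1. */
const char *guess_source_encoding(const unsigned char *buf, size_t len)
{
    return is_valid_utf8(buf, len) ? "UTF-8" : "ISO-8859-1";
}
```

With this sketch, a buffer containing the single byte 0xFB (the "û" from the report, if the file is Latin-1) is rejected as UTF-8 and reported as "ISO-8859-1", while the three bytes 0xE2 0x80 0x93 (a UTF-8 en dash) are accepted as "UTF-8".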
---- Additional Comments From miguel@ximian.com 2002-09-04 12:10:00 MST ----

CSC has an option called /codepage:XXX which is used for specifying the codepage for the input file.

The documentation for codepage claims that if the source code is in either the default code page, or Unicode or UTF-8 the compiler will be able to figure things out on its own. I am assuming they mean `UTF-16' when they say Unicode.

Distinguishing what microsoft calls "Unicode" and "Unicode big endian" on Windows is easy, the first two bytes are 0xff 0xfe (unicode) or 0xfe 0xff (unicode big endian).

On *windows* they use 0xef 0xbb 0xbf for Utf-8 encoded files.

So I assume the rest is supposed to be encoded in the current "codepage", an interesting concept, because I do not know how code pages map to character sets on Unix or how to tell what the current codepage is.

---- Additional Comments From miguel@ximian.com 2002-09-06 20:10:06 MST ----

Fixed: both StreamReader and the compiler.

---- Additional Comments From miguel@ximian.com 2002-09-06 20:47:09 MST ----

Forgot to close the bug
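
Editorial sketch, not part of the original thread and not the fix that was committed: the byte order mark check discussed in this comment can be written in a few lines of C. The enum and the function name `detect_bom` are hypothetical. The sketch assumes the standard BOM byte values: 0xFF 0xFE for little-endian UTF-16 (what Windows calls "Unicode"), 0xFE 0xFF for big-endian UTF-16, and 0xEF 0xBB 0xBF for UTF-8. It deliberately ignores UTF-32, whose little-endian BOM also starts with 0xFF 0xFE.

```c
#include <assert.h>
#include <stddef.h>

/* Encodings recognisable from a byte order mark (illustrative names). */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE };

/* Hypothetical helper: inspect the first bytes of a file.
 * On return *bom_len holds the number of BOM bytes to skip (0 if none).
 * BOM_NONE means the caller must fall back to another rule, for
 * example the current code page or an explicit /codepage option. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    assert(bom_len != NULL);

    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return BOM_UTF16_LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return BOM_UTF16_BE;
    }

    *bom_len = 0;
    return BOM_NONE;
}
```

A file that starts with no recognised mark returns BOM_NONE, which is the case the comment above leaves to the current code page.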