Bug 755414

Summary: YCP string builtins like the substring() function are not UTF-8 safe
Product: [openSUSE] openSUSE 12.2 Reporter: Ladislav Slezák <lslezak>
Component: YaST2 Assignee: Martin Vidner <mvidner>
Status: RESOLVED FIXED QA Contact: Jiri Srain <jsrain>
Severity: Major    
Priority: P5 - None CC: aschnell, jsmeix
Version: Factory   
Target Milestone: ---   
Hardware: All   
OS: All   
Whiteboard:
Found By: Development Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Ladislav Slezák 2012-04-03 09:50:16 UTC
When fixing bug bnc#728588 it turned out that the YCP substring() function uses byte units instead of UTF-8 characters as expected.

This causes buggy behavior when iterating over a string in YCP code.


Example:

  size("áa") => 2

but

  substring("áa", 1, 1) => "\0xF1"
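
The mismatch can be reproduced outside YCP. The following is a minimal Python sketch, not the yast2-core implementation: the helper names byte_substring and char_substring are hypothetical and only contrast byte-offset slicing with character-offset slicing of the same UTF-8 string.

```python
# Illustrative sketch only; these helpers are hypothetical and do not
# mirror the real YCP builtins.

def byte_substring(s: str, start: int, length: int) -> bytes:
    """Slice using byte offsets into the UTF-8 encoding of s."""
    return s.encode("utf-8")[start:start + length]


def char_substring(s: str, start: int, length: int) -> str:
    """Slice using character (code point) offsets."""
    return s[start:start + length]


text = "\u00e1a"  # the string "áa" from the example above

print(len(text))                   # number of characters
print(len(text.encode("utf-8")))   # number of bytes; "á" needs two bytes
print(byte_substring(text, 1, 1))  # second byte of "á", not a full character
print(char_substring(text, 1, 1))  # the second character, "a"
```

In this sketch the byte-based slice returns a lone continuation byte of "á", which is not valid UTF-8 on its own; how such a stray byte is displayed depends on the environment. The character-based slice returns "a", which is what code iterating up to size() expects.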
Comment 2 Arvin Schnell 2012-04-03 13:35:22 UTC
YCP has a function lsubstring for that, see bug #446996.
Comment 3 Johannes Meixner 2012-04-04 10:04:13 UTC
Extra functions for UTF-8 encoded strings should not be needed,
because the YCP data type string has always
consisted of Unicode characters encoded in UTF-8.

See the documentation, e.g. the old one for SLE10
http://doc.opensuse.org/projects/YaST/SLES10/tdg/id_ycp_data_string.html
and the same in a newer one, e.g. for openSUSE 11.3
http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/id_ycp_data_string.html

Therefore all "YCP String Builtins" as listed in
http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/Book-YaSTReference.html
must work with UTF8 encoded strings.

In this context
https://bugzilla.novell.com/show_bug.cgi?id=446996#c16
  "Multibyte strings are an area of the YCP language
   that missed the train when SUSE switched to UTF8 around SL 8"
looks really surprising - at least from my point of view - or
I completely misunderstand something here...
Comment 4 Arvin Schnell 2012-06-13 11:49:34 UTC
Fixed in yast2-core for openSUSE 12.2.

Functions like substring work on Unicode characters now. That does
not mean that splitting a string at any position does what you might
expect, e.g. splitting a string between combining characters can
still give strange results.
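
The combining-character caveat can be sketched in Python (an illustration only, not YCP and not the yast2-core code). It assumes a decomposed input string in which "é" is stored as "e" followed by U+0301 COMBINING ACUTE ACCENT.

```python
import unicodedata

# "é" in decomposed form: base letter plus combining mark, then "a".
decomposed = "e\u0301a"

# Splitting after the first code point is correct per character,
# but it separates the accent from its base letter.
head, tail = decomposed[:1], decomposed[1:]

print(len(decomposed))                       # code points, including the accent
print(repr(head))                            # the bare base letter
print(unicodedata.combining(tail[0]) != 0)   # tail starts with a combining mark

# Normalizing to NFC first merges the pair into one precomposed code point.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed))
```

Normalizing to NFC before splitting helps in this case because a precomposed "é" exists, but not every base and mark sequence has a precomposed form, so it is not a general remedy.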