Posts: 53
Threads: 21
Joined: Jul 2021
I'm trying to write macros that deal with Unicode characters, and noticed that Unicode is lost very quickly:
str s.getclip;
paste s;
str b=s;b=b.left(b 1);
paste b;
The first paste is fine, but b hasn't survived. Do I need a different declaration to enable Unicode for b, or what's the problem?
Posts: 12,072
Threads: 140
Joined: Dec 2002
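QM string encoding is UTF-8, which means variable-length characters. Some characters are 1 byte, some are 2 or 3 bytes, and a few (rarely) are 4 bytes. Functions such as left count bytes, not characters, so cutting at 1 byte can split a character in half.
When you know the character's length in bytes, simply use it in the code.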
Macro
Macro3017
str s="ąbc" ;;first character is 2 bytes
str bad.left(s 1) ;;gets half of character
out bad
str good.left(s 2)
out good
In other cases you usually use find, findrx or a similar function to locate a substring, and the offset it returns always falls on a character boundary, so the result is correct.
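To make the byte-versus-character point concrete, here is a rough sketch in Python rather than QM. The helper name `left_bytes` is made up for illustration; it only mimics a byte-counting "left" on UTF-8 data and is not how QM implements it.

```python
def left_bytes(text: str, nbytes: int) -> bytes:
    """Return the first nbytes bytes of the UTF-8 encoding (a byte-based 'left')."""
    return text.encode("utf-8")[:nbytes]

s = "ąbc"
encoded = s.encode("utf-8")   # 'ą' takes 2 bytes, 'b' and 'c' take 1 byte each

bad = left_bytes(s, 1)        # only the first half of 'ą'
good = left_bytes(s, 2)       # both bytes of 'ą'

print(len(s), len(encoded))                    # characters vs bytes
print(bad.decode("utf-8", errors="replace"))   # half a character cannot be decoded
print(good.decode("utf-8"))                    # the complete first character
```

The half character decodes to the Unicode replacement character, which matches the "b hasn't survived" symptom in the question.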
Posts: 1,336
Threads: 61
Joined: Jul 2006
A couple more options:
#1 You can use this member function:
Member function
str.getU2
function$ $sinp from nc
;Unicode (UTF-16) version of "get"; with from 0 it also works like "left".
;from and nc are counted in UTF-16 characters, not bytes.
;Error if from is invalid.
;If nc < 0 or too big, gets all of the right part.
str s.unicode(sinp) ;;convert to UTF-16
from*2; nc*2
if(from<0 or from>s.len) end ERR_BADARG
if(nc<0 or from+nc>s.len) nc=s.len-from
this.ansi(s+from -1 nc/2)
ret this
For your example you would use it like so:
Function
UnicodeTest1
str s.getclip;
paste s
str b.getU2(s 0 1);; gets first character
paste b
#ret;;for testing place cursor on line below and run
#2 You can use paste with format fields:
Function
UnicodeTest2
str s.getclip;
paste s;
paste("%#.1s" s);; paste first character of string
#ret;;for testing place cursor on line below and run
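For readers who want to see the idea behind option #1 outside QM, here is a minimal Python sketch of the same approach: convert to UTF-16, slice by 2-byte units, convert back. The name `get_u16` is hypothetical and this is not a port of the QM function, only an illustration of the logic under the assumption that every character fits in one UTF-16 unit.

```python
def get_u16(text: str, start: int, count: int) -> str:
    """Return count UTF-16 units of text, starting at unit index start.

    A negative or too-large count returns everything from start to the end.
    """
    units = text.encode("utf-16-le")        # 2 bytes per UTF-16 unit, no BOM
    start_b, count_b = start * 2, count * 2
    if start_b < 0 or start_b > len(units):
        raise ValueError("start is out of range")
    if count_b < 0 or start_b + count_b > len(units):
        count_b = len(units) - start_b
    # errors="replace" guards against a slice that splits a surrogate pair
    return units[start_b:start_b + count_b].decode("utf-16-le", errors="replace")

print(get_u16("ąbc", 0, 1))    # first character
print(get_u16("ąbc", 1, -1))   # everything after the first character
```

Because the index is counted in 2-byte units, the caller no longer needs to know how many UTF-8 bytes each character takes.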
Posts: 53
Threads: 21
Joined: Jul 2021
Thanks Kevin, very helpful!
Now, just for my understanding: your code seems to assume that the string consists of UTF-16 characters. What if it were UTF-32? I guess it wouldn't work then. Is there no way to determine the number of bytes a character uses?
Posts: 12,072
Threads: 140
Joined: Dec 2002
As in UTF-8, UTF-16 characters are not all the same size: some consist of two 16-bit units (a surrogate pair). But they are very rare, used for ancient scripts and the like. When working with ordinary text, it is safe to ignore them.
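On the question of determining how many bytes a character uses: in UTF-8 the first byte of each character encodes its length. Below is a small Python sketch of that rule, not QM code. The names `utf8_char_len` and `first_char` are made up for illustration, and the check only looks at the lead byte without validating the continuation bytes.

```python
def utf8_char_len(lead: int) -> int:
    """Return the byte length of a UTF-8 character, given its first byte."""
    if lead & 0x80 == 0x00:
        return 1                      # 0xxxxxxx: ASCII
    if lead & 0xE0 == 0xC0:
        return 2                      # 110xxxxx
    if lead & 0xF0 == 0xE0:
        return 3                      # 1110xxxx
    if lead & 0xF8 == 0xF0:
        return 4                      # 11110xxx
    raise ValueError("not the first byte of a UTF-8 character")

def first_char(data: bytes) -> bytes:
    """Return the bytes of the first character in UTF-8 data."""
    if not data:
        return b""
    return data[:utf8_char_len(data[0])]

for ch in ("a", "ą", "€", "😀"):
    utf8 = ch.encode("utf-8")
    utf16_units = len(ch.encode("utf-16-le")) // 2
    print(ch, utf8_char_len(utf8[0]), utf16_units)

print(first_char("ąbc".encode("utf-8")).decode("utf-8"))
```

The last column of the loop shows the rare case mentioned above: a character outside the basic plane, such as an emoji, takes two UTF-16 units.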