
MBaas · 08-04-2021, 08:08 AM

I'm trying to write macros that deal with Unicode characters, and noticed that Unicode is lost very quickly:

Code:

str s.getclip;
paste s;
str b=s;b=b.left(b 1);
paste b;

The first paste is fine, but b hasn't survived. Do I need a different declaration to enable Unicode for b, or what's the problem?

***Gintaras*** · 08-04-2021, 06:39 PM

QM string encoding is UTF-8, which means variable-length characters: some characters are 1 byte, some 2, 3 or (rarely) 4.
When you know the character's length in bytes, simply use it in code.

Macro Macro3017

Code:
str s="ąbc" ;;first character is 2 bytes
str bad.left(s 1) ;;gets half of character
out bad
str good.left(s 2)
out good

In other cases you usually use find, findrx or a similar function to find a substring, and it gives the correct result.
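To make the byte counts above concrete, here is a small sketch in Python rather than QM. It is only an illustration: Python `bytes` stand in for a QM `str`, on the assumption stated above that QM stores text as UTF-8.

```python
# Illustration only: Python bytes stand in for a UTF-8 encoded QM str.
s = "ąbc".encode("utf-8")   # b'\xc4\x85bc': 'ą' takes 2 bytes, 'b' and 'c' take 1 each

bad = s[:1]                 # first byte only, i.e. half of 'ą'
good = s[:2]                # both bytes of 'ą'

print(len(s))                                  # 4 bytes for 3 characters
print(bad.decode("utf-8", errors="replace"))   # shows the replacement character
print(good.decode("utf-8"))                    # shows ą
```

Cutting at a byte offset works only when the offset falls on a character boundary, which is why taking 1 byte from the left fails here and taking 2 succeeds.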

Kevin · 08-05-2021, 02:40 PM

A couple more options.
#1: you can use this member function.

Member function str.getU2

Code:
function$ $sinp from nc

;Unicode version of "get" macro, it also serves left
;Error if from invalid.
;If nc < 0 or too big, gets all right part.

str s.unicode(sinp) ;;convert to UTF-16
from*2; nc*2
if(from<0 or from>s.len) end ERR_BADARG
if(nc<0 or from+nc>s.len) nc=s.len-from
this.ansi(s+from -1 nc/2)
ret this

For your example, you would use it like so:
Function UnicodeTest1

Code:
str s.getclip;
paste s
str b.getU2(s 0 1);; gets first character
paste b

#ret;;for testing place cursor on line below and run

#2: you can use paste with format fields.
Function UnicodeTest2

Code:
str s.getclip;
paste s;
paste("%#.1s" s);; paste first character of string

#ret;;for testing place cursor on line below and run
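The idea behind getU2 (convert to UTF-16, cut by 2-byte units, convert back) can be sketched outside QM as well. The Python function below is a hypothetical analogue, not the QM code: the name `get_u2` and its error handling are assumptions made for illustration.

```python
def get_u2(text_utf8: bytes, start: int, count: int) -> bytes:
    """Hypothetical analogue of getU2: take `count` UTF-16 units from `start`.

    A negative or too large `count` takes everything to the right of `start`.
    """
    units = text_utf8.decode("utf-8").encode("utf-16-le")  # 2 bytes per UTF-16 unit
    total = len(units) // 2
    if start < 0 or start > total:
        raise ValueError("start is out of range")
    if count < 0 or start + count > total:
        count = total - start
    piece = units[start * 2:(start + count) * 2]
    # Decoding raises UnicodeDecodeError if the cut splits a surrogate pair.
    return piece.decode("utf-16-le").encode("utf-8")


text = "ąbc".encode("utf-8")
first = get_u2(text, 0, 1)    # first character, 2 bytes in UTF-8
rest = get_u2(text, 1, -1)    # everything after the first character
print(first.decode("utf-8"), rest.decode("utf-8"))
```

Counting in UTF-16 units instead of UTF-8 bytes means that 1 unit is 1 character for most everyday text, so the caller no longer needs to know how many bytes each character occupies.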

MBaas · 08-06-2021, 07:12 AM

Thanks Kevin, very helpful!

Now... just for my understanding: your code seems to assume that the string has UTF-16 chars. What if it was UTF-32? I guess then it wouldn't work. Is there no way to determine the number of bytes a character uses?

***Gintaras*** · 08-06-2021, 07:41 AM

Like in UTF-8, some UTF-16 characters consist of 2 normal characters (a surrogate pair of two 16-bit units). But they are very rare, used for ancient scripts etc. When working with ordinary text, it is safe to ignore them.
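On the question of determining how many bytes a character uses: in UTF-8 the first byte of each character encodes the length of the sequence, and in UTF-16 any character above U+FFFF takes 2 units. The Python sketch below illustrates both rules. The helper names `utf8_char_len` and `utf16_units` are made up for this example and are not QM functions.

```python
def utf8_char_len(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with byte `lead`."""
    if lead & 0x80 == 0x00:
        return 1    # 0xxxxxxx
    if lead & 0xE0 == 0xC0:
        return 2    # 110xxxxx
    if lead & 0xF0 == 0xE0:
        return 3    # 1110xxxx
    if lead & 0xF8 == 0xF0:
        return 4    # 11110xxx
    raise ValueError("not the first byte of a UTF-8 character")


def utf16_units(ch: str) -> int:
    """Number of 16-bit units one character needs in UTF-16."""
    return 2 if ord(ch) > 0xFFFF else 1


for ch in ("a", "ą", "€", "😀"):
    data = ch.encode("utf-8")
    # The length predicted from the lead byte matches the actual encoded length.
    print(ch, utf8_char_len(data[0]), len(data), utf16_units(ch))
```

A byte of the form 10xxxxxx is a continuation byte, so the function rejects it: it can only appear inside a character, never at its start.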
