#1 2017-12-22 11:02:48

gothbert
Member
Registered: 2017-11-08
Posts: 12

[Solved] Extracting and comparing single characters in RawUTF8 strings

Hi,

I would like to transform a RawUTF8 string character by character: remove the underscores and move all leading lowercase characters to the end (de_la_Rue -> Rue!dela). To do so, I walk through the RawUTF8 string N and its LowerCaseUnicode() copy L in parallel:

while (i <= len) and (N[i] = L[i]) do ...

This does not work for characters outside 7-bit ASCII, such as é and ä. This is probably because UTF-8 encodes them as several bytes, so N[i] returns a single byte instead of a whole character.

What is the correct way of accessing the ith character in a RawUTF8 string?
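
A minimal Free Pascal sketch of the suspected cause (illustration only, no mORMot code involved; the constant names are made up). It spells out the UTF-8 bytes of É and é and compares them byte by byte, the way N[i] = L[i] does:

```pascal
program Utf8Bytes;
{$mode objfpc}{$H+}

const
  // UTF-8 byte sequences written out as raw bytes, so the result does
  // not depend on the encoding of the source file.
  UpperE: array[0..1] of Byte = ($C3, $89); // U+00C9, uppercase E acute
  LowerE: array[0..1] of Byte = ($C3, $A9); // U+00E9, lowercase e acute

var
  i: Integer;
begin
  // The lead byte ($C3) is identical, only the second byte differs.
  // A byte-wise loop therefore treats the first half of the uppercase
  // character as "unchanged" and stops in the middle of it.
  for i := 0 to 1 do
    WriteLn('byte ', i, ' equal: ', UpperE[i] = LowerE[i]);
end.
```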

Best regards
Boris

Last edited by gothbert (2017-12-22 18:21:51)


#2 2017-12-22 15:09:24

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,544

Re: [Solved] Extracting and comparing single characters in RawUTF8 strings

See SynCommons.NextUTF8UCS4
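
A rough sketch of how such a loop could look, assuming the mORMot 1.18 SynCommons unit is on the unit search path and that NextUTF8UCS4 takes a var PUTF8Char, returns the decoded code point and advances the pointer. The program name and variables are made up for the example, and malformed UTF-8 input is not handled:

```pascal
program WalkUtf8;
{$mode objfpc}{$H+}

uses
  SynCommons; // mORMot unit, assumed to be available

var
  S: RawUTF8;
  P: PUTF8Char;
  C: Cardinal;
  Count: Integer;
begin
  // Build 'Ru' + U+00E9 byte by byte, so the content does not depend
  // on the encoding of the source file.
  SetLength(S, 4);
  S[1] := 'R';
  S[2] := 'u';
  S[3] := AnsiChar($C3);
  S[4] := AnsiChar($A9);

  Count := 0;
  P := pointer(S);            // nil if S is empty
  if P <> nil then
    while P^ <> #0 do
    begin
      C := NextUTF8UCS4(P);   // decodes one code point, advances P past it
      Inc(Count);
      WriteLn('code point ', C);
    end;
  WriteLn(Length(S), ' bytes, ', Count, ' characters');
end.
```

For this input the loop should see three characters in four bytes, the last one being 233 (U+00E9). Comparing the decoded values from N and L, instead of N[i] and L[i], keeps multi-byte characters intact.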


#3 2017-12-22 18:21:37

gothbert
Member
Registered: 2017-11-08
Posts: 12

Re: [Solved] Extracting and comparing single characters in RawUTF8 strings

Thank you, mpv, that does the job.

There is a lot of pointer arithmetic involved in getting the job done, plus some pitfalls. Here is what I learned:

RawUTF8 strings behave like pointers.
p := pointer(S) makes a PUTF8Char point to the first byte of the RawUTF8 string S. Do not use p := @S, which yields the address of the variable S itself rather than of its characters.
K := S makes K point to the same memory as S. Even K := Trim(S) does so when there is nothing to trim. I needed to make an explicit copy with K := Copy(S, 1, MaxInt).
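
The sharing can be made visible with a small Free Pascal sketch. It uses a plain AnsiString as a stand-in, on the assumption that RawUTF8 behaves the same way (SynCommons declares it as an AnsiString with the UTF-8 code page). UniqueString is the RTL routine and is shown here as an alternative to Copy(S, 1, MaxInt); the program and variable names are made up:

```pascal
program SharedBuffer;
{$mode objfpc}{$H+}

var
  S, K: AnsiString;
  P: PAnsiChar;
begin
  S := 'de_la_Rue';
  UniqueString(S);   // move the literal into a writable heap buffer
  K := S;            // no copy: K and S share one reference-counted buffer
  WriteLn('same buffer: ', pointer(K) = pointer(S));

  P := pointer(K);   // address of the first character (not @K)
  P[2] := '-';       // raw write, bypasses copy-on-write
  WriteLn(S);        // S has changed as well

  K := S;
  UniqueString(K);   // force a private copy of the buffer
  WriteLn('same buffer: ', pointer(K) = pointer(S));
  P := pointer(K);
  P[2] := '_';       // now only K is modified
  WriteLn(K);
  WriteLn(S);
end.
```

The aliasing only matters when writing through a raw pointer. Ordinary assignments such as K[3] := '-' go through copy-on-write and leave S untouched.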

