[Solved] Extracting and comparing single characters in RawUTF8 strings

gothbert · 2017-12-22 11:02:48

Hi,

I would like to transform a RawUTF8 string character by character: remove the underscores and move all leading lowercase characters to the end (de_la_Rue -> Rue!dela). To do so, I walk over the characters of the RawUTF8 string N and its LowerCaseUnicode() copy L like this:

while (i <= len) and (N[i] = L[i]) do ...

This does not work for non-7-bit characters like é and ä. This is probably due to the UTF-8 encoding and N[] returning a single byte instead of a single char.
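The byte-versus-character mismatch described here can be reproduced with a small console program. This is only a sketch: it uses RawByteString as a stand-in for RawUTF8 so that it compiles without the mORMot units, and it assumes Free Pascal or a Unicode-era Delphi.

```pascal
program Utf8Bytes;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}

var
  S: RawByteString; // stand-in for RawUTF8, so no mORMot unit is needed
  i: Integer;
begin
  // "café" as raw UTF-8 bytes; é (U+00E9) is written out as $C3 $A9 so the
  // example does not depend on the encoding of the source file
  SetLength(S, 5);
  S[1] := 'c';
  S[2] := 'a';
  S[3] := 'f';
  S[4] := AnsiChar($C3);
  S[5] := AnsiChar($A9);

  WriteLn(Length(S)); // 5: the byte count, although there are only 4 characters
  for i := 1 to Length(S) do
    WriteLn(i, ': ', Ord(S[i])); // S[4] and S[5] are the two bytes of é
end.
```

Indexing with S[i] therefore compares bytes, which only matches characters while the text stays within 7-bit ASCII.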

What is the correct way of accessing the ith character in a RawUTF8 string?

Best regards
Boris

Last edited by gothbert (2017-12-22 18:21:51)

mpv · 2017-12-22 15:09:24

See SynCommons.NextUTF8UCS4
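To illustrate the idea of stepping through a string one code point at a time, here is a minimal self-contained sketch. NextCodePoint is a hypothetical hand-written stand-in, not the SynCommons routine; it assumes well-formed UTF-8 and does no validation. With mORMot available, SynCommons.NextUTF8UCS4 would take its place.

```pascal
program WalkUtf8;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}

// Hypothetical helper: returns the code point starting at P^ and advances P
// past it. Assumes valid UTF-8; malformed input is not detected.
function NextCodePoint(var P: PAnsiChar): Cardinal;
var
  B: Byte;
  Extra: Integer;
begin
  B := Ord(P^);
  Inc(P);
  if B < $80 then
  begin
    Result := B;            // 1-byte sequence (7-bit ASCII)
    Extra := 0;
  end
  else if (B and $E0) = $C0 then
  begin
    Result := B and $1F;    // lead byte of a 2-byte sequence
    Extra := 1;
  end
  else if (B and $F0) = $E0 then
  begin
    Result := B and $0F;    // lead byte of a 3-byte sequence
    Extra := 2;
  end
  else
  begin
    Result := B and $07;    // lead byte of a 4-byte sequence
    Extra := 3;
  end;
  // fold in the continuation bytes, 6 payload bits each
  while (Extra > 0) and (P^ <> #0) do
  begin
    Result := (Result shl 6) or Cardinal(Ord(P^) and $3F);
    Inc(P);
    Dec(Extra);
  end;
end;

var
  S: RawByteString; // stand-in for RawUTF8
  P: PAnsiChar;     // stand-in for PUTF8Char
begin
  // "dé" as raw UTF-8 bytes: $64, $C3 $A9
  SetLength(S, 3);
  S[1] := 'd';
  S[2] := AnsiChar($C3);
  S[3] := AnsiChar($A9);

  P := Pointer(S);
  while (P <> nil) and (P^ <> #0) do
    WriteLn(NextCodePoint(P)); // prints 100, then 233 (U+00E9)
end.
```

The loop advances by whole code points, so two strings can be compared character by character by decoding one code point from each per iteration instead of indexing bytes.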

gothbert · 2017-12-22 18:21:37

Thank you, mpv, that does the job.

There is a lot of pointer arithmetic involved in getting the job done, plus some pitfalls. Here is what I learned:

RawUTF8 strings behave like pointers.
p := pointer(S) makes a PUTF8Char p point to the beginning of the RawUTF8 string S. Do not use p := @S, which is the address of the variable S itself, not of its characters.
K := S makes K point to the same memory as S. Even K := Trim(S) does so. I needed to make an explicit copy with K := Copy(S, 1, MaxInt).
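The two pitfalls can be seen in a short sketch. It uses AnsiString and PAnsiChar as stand-ins for RawUTF8 and PUTF8Char so that it compiles without mORMot; that substitution is an assumption of the example, not something taken from the library.

```pascal
program SharedBuffer;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}

var
  S, K: AnsiString; // stand-in for RawUTF8
  P: PAnsiChar;     // stand-in for PUTF8Char
begin
  S := 'de_la_Rue';
  UniqueString(S);  // move the literal into a private, writable heap buffer

  // Pointer(S) is the address of the first character;
  // @S is the address of the variable S, which is something else
  WriteLn(Pointer(@S) = Pointer(S)); // FALSE

  // Plain assignment only adds a reference: both names share one buffer
  K := S;
  WriteLn(Pointer(K) = Pointer(S));  // TRUE

  // Writing through a raw pointer bypasses copy-on-write, so K changes too
  P := Pointer(S);
  P^ := 'D';
  WriteLn(K);                        // De_la_Rue

  // An explicit copy gives K its own buffer
  K := Copy(S, 1, MaxInt);
  WriteLn(Pointer(K) = Pointer(S));  // FALSE
end.
```

Calling UniqueString(K) after the assignment is another way to obtain a private buffer before writing through a pointer.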
