#1 2020-01-15 14:47:13

macfly
Member
From: Brasil
Registered: 2016-08-20
Posts: 374

StringReplaceAll + Unicode

var
  S : RawUTF8;
begin
  S := 'aaaâââ';
  S := StringReplaceAll(S, 'â', 'a');
  Writeln(S); //Write: aaaâââ   Expected: aaaaaa

  S := StringReplaceAll(S, 'a', 'â');
  Writeln(S); //Write: ???âââ   Expected: ââââââ
end;

Now with StringToUTF8 conversion:

  S := 'aaaâââ';
  S := StringReplaceAll(S, StringToUTF8('â'), 'a');
  Writeln(S); //Write: aaaaaa  Expected: aaaaaa

  S := StringReplaceAll(S, 'a', StringToUTF8('â'));
  Writeln(S); //Write: ââââââ   Expected: ââââââ
end;

I know this is not a framework problem, but what would be the explanation for the following:
- Why does the compiler not give me an implicit conversion warning if it treats the string literal as non-UTF-8?

- Why is this conversion necessary if the default string type in newer Delphi versions is Unicode?


#2 2020-01-15 16:50:51

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182
Website

Re: StringReplaceAll + Unicode

I guess the reason Delphi is buggy with UTF-8 constants is that they didn't take UTF-8 support seriously enough.
Why use UTF-8 if you have UTF-16?
They even deprecated UTF-8 strings... then came back to reason.
But the bugs remain. And I guess Embarcadero is very unlikely to fix them.

So what I do in such context is:
1. for UI: use English text in the source code, e.g. as resourcestring, then put the translation in some resource or external file.
2. for processing logic: hardcode constants using explicit StringToUTF8() conversions in the initialization section of the unit, storing the proper UTF-8 content in a global RawUTF8 variable.
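
A minimal sketch of point 2, for illustration only. The unit name and variable name are hypothetical; RawUTF8 and StringToUTF8() come from SynCommons, as used earlier in this thread. It still assumes the compiler reads the 'â' literal correctly from the source file.

```pascal
unit MyTextConsts; // hypothetical unit name

interface

uses
  SynCommons; // provides RawUTF8 and StringToUTF8()

var
  // hypothetical global holding the UTF-8 content, filled once at startup
  ACircumflexUtf8: RawUTF8;

implementation

initialization
  // explicit conversion from the native string type to UTF-8
  ACircumflexUtf8 := StringToUTF8('â');
end.
```

With such a variable, the call from the first post would read S := StringReplaceAll(S, ACircumflexUtf8, 'a'); with no conversion at the call site.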


#3 2020-01-15 17:25:41

macfly
Member
From: Brasil
Registered: 2016-08-20
Posts: 374

Re: StringReplaceAll + Unicode

Thanks for the explanation.

Unfortunately, compiling with FPC has proven to be more reliable than with Delphi.

This is a situation that can produce an error that is hard to notice.

It is really frustrating to have to work around basic things like this one.


The idea of converting constants is great. I will adopt this procedure.


#4 2020-01-16 14:25:42

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182
Website

Re: StringReplaceAll + Unicode

Side note.
The benefit of global RawUTF8 variables is that they will allow reference counting, whereas plain const strings have a reference counter set to -1, so in some cases the compiler will allocate and copy it into a temp variable if this constant is assigned to another variable.
So global RawUTF8 variables may also slightly help performance.


#5 2020-01-16 14:43:05

macfly
Member
From: Brasil
Registered: 2016-08-20
Posts: 374

Re: StringReplaceAll + Unicode

I'm using records to manage these strings.

Does this benefit also apply to RawUTF8 fields in records?


type
  TMyRecord = record
    MyStr : RawUTF8;
    ...
  end;

const 
  MYCONST = 'âââ';

var
 MyRecord : TMyRecord;

initialization
  MyRecord.MyStr := StringToUTF8(MYCONST);


#6 2020-01-16 16:00:30

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182
Website

Re: StringReplaceAll + Unicode

Yes, it is exactly the same.


#7 2020-01-23 17:49:32

macfly
Member
From: Brasil
Registered: 2016-08-20
Posts: 374

Re: StringReplaceAll + Unicode

I'm playing around with this.

If I set the code page to 65001 (UTF-8), Delphi recognizes the constants as UTF-8 correctly.

Then these conversions are not necessary.


One question: all mORMot source is encoded in ANSI (1252), correct?

Delphi shows it as ANSI, but Lazarus and Notepad++ show it as UTF-8.

Last edited by macfly (2020-01-23 17:50:10)


#8 2020-01-23 19:50:12

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182
Website

Re: StringReplaceAll + Unicode

We tried to keep the mORMot source plain 7-bit ASCII.
Any accented or special character is expected to be written as a #... constant.

There is no (and there won't be any) "BOM" marker, so UTF-8 or Ansi depends on the IDE, not on the file itself.


#9 2020-01-23 20:12:57

macfly
Member
From: Brasil
Registered: 2016-08-20
Posts: 374

Re: StringReplaceAll + Unicode

Thanks @ab.


A note for anyone who has the same problem.

After changing my source file to UTF-8 I had a problem with encoding in Lazarus (not in Delphi).

The unit is in UTF-8 and SynCommons.pas in ANSI.

 Writeln(Utf8ToConsole(
    UrlDecode(UrlEncode('aaaããã'))
  ));       
 //Write :aaaããã

If I change the unit to ANSI to match the SynCommons encoding, the result is as expected.

The solution is to add the {$CODEPAGE UTF8} directive to the unit or, better yet, include Synopse.inc, which defines it.
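
For anyone applying this, a minimal sketch of what the unit header could look like. The unit name is hypothetical, and the FPC guard is my own assumption, added so the same file can still be compiled with Delphi, since {$CODEPAGE} is an FPC directive.

```pascal
unit MyUtf8Unit; // hypothetical unit name, file saved as UTF-8

// Option 1: declare the source encoding directly (FPC/Lazarus only)
{$ifdef FPC}
  {$codepage utf8}
{$endif}

// Option 2: include the mORMot settings file instead, which defines it
// {$I Synopse.inc}

interface

implementation

end.
```

Only one of the two options is needed; the include has the advantage of keeping the unit's settings in line with the rest of the framework.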

