UTF8 encoded strings in FPC and Lazarus

merlin352 · 2016-09-08 06:28:35

Good Morning

CurrentAnsiConvert is always initialized with a converter for the current Windows code page. For FPC and Lazarus this is not correct, as Lazarus (and therefore all visible components such as grids) uses UTF8 encoded strings. As CurrentAnsiConvert is used all over the code, I propose the following changes.

1. Insert a new variable SystemAnsiConvert in SynCommons

  /// global TSynAnsiConvert instance to handle current system encoding
  // - this is the encoding as used by the Delphi AnsiString type, so will be used
  // before Delphi 2009 to speed-up VCL string handling (especially for UTF-8)
  // - as FPC and Lazarus use UTF8 encoding this is initialized with TSynAnsiUTF8
  // - this instance is global and instantiated during the whole program life time
  CurrentAnsiConvert: TSynAnsiConvert;

  /// global TSynAnsiConvert instance to handle current system encoding
  // - this is the encoding as used by the System
  // - this instance is global and instantiated during the whole program life time
  SystemAnsiConvert: TSynAnsiConvert;

2. Changes in TSynAnsiConvert.Engine

class function TSynAnsiConvert.Engine(aCodePage: cardinal): TSynAnsiConvert;
var i: integer;
begin
  if SynAnsiConvertList=nil then begin
    GarbageCollectorFreeAndNil(SynAnsiConvertList,TObjectList.Create);
    SystemAnsiConvert := TSynAnsiConvert.Engine(GetACP);
    {$ifdef FPC}
    CurrentAnsiConvert := TSynAnsiConvert.Engine(CP_UTF8) as TSynAnsiUTF8;
    {$else}
    CurrentAnsiConvert := TSynAnsiConvert.Engine(GetACP);
    {$endif}
    WinAnsiConvert := TSynAnsiConvert.Engine(CODEPAGE_US) as TSynAnsiFixedWidth;
    UTF8AnsiConvert := TSynAnsiConvert.Engine(CP_UTF8) as TSynAnsiUTF8;
  end;

If CurrentAnsiConvert is used somewhere in the code where the system code page is actually meant, this should make it much easier to change the source.

Greetings

Last edited by merlin352 (2016-09-08 06:31:18)

ab · 2016-09-08 11:54:53

AFAIR it depends on the FPC compiler version (2.x or 3.x), and Lazarus revision...

Are you 100% sure it is a safe patch?

merlin352 · 2016-09-08 18:20:41

Well, for Delphi the patch is safe, as it changes nothing.

For Lazarus the situation is also clear: it is ALWAYS UTF8. All visible components of the LCL expect a UTF8 string, independent of the version. But I do not know the mORMot code well enough to say whether this is true for all the places where code page conversion is used. With my proposal it should be easy to change the call from CurrentAnsiConvert to SystemAnsiConvert.

You are right that for FPC things are a little more complicated, as the RTL recently switched from Ansi-encoded to UTF8.

On the other hand, the existing code is not safe for anything other than Windows, as it always assumes for non-Windows OSes that the code page is 1252, which is rarely the case. The problem is that only Windows knows about ACP codes. The standard procedure to find the system encoding under Unix-like OSes (such as Linux, Android, OS X) is:

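// requires SysUtils in the uses clause (GetEnvironmentVariable)
function GetSystemEncoding: string;
{$IFDEF Unix}
var
  Lang: string;
  i: integer;
begin
  // locale variables look like 'de_DE.UTF-8': the encoding follows the dot
  Lang := GetEnvironmentVariable('LC_ALL');
  if Length(Lang) = 0 then
  begin
    Lang := GetEnvironmentVariable('LC_MESSAGES');
    if Length(Lang) = 0 then
      Lang := GetEnvironmentVariable('LANG');
  end;
  i := Pos('.', Lang);
  // fall back to UTF-8 when there is no dot or nothing follows it
  if (i > 0) and (i < Length(Lang)) then
    Result := Copy(Lang, i + 1, Length(Lang) - i)
  else
    Result := 'UTF-8';
end;
{$ELSE}
begin
  Result := 'UTF-8';
end;
{$ENDIF}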

But then you have a string which describes the character encoding, not an ACP code. This can be solved, but it can lead to a code page that is not supported by mORMot.
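As a minimal sketch of how such an encoding name could be mapped to a code page number, a small lookup table is enough for the common cases. The helper name EncodingNameToCodePage and the table are hypothetical and not part of mORMot or the RTL; the numbers are the usual Windows code page identifiers, and the table is only an illustrative subset. A result of 0 stands for "no known code page", which is the unsupported case mentioned above.

```pascal
program EncodingLookupDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TEncodingEntry = record
    Name: string;
    CodePage: Cardinal;
  end;

const
  // Illustrative subset only; values are Windows code page identifiers
  KnownEncodings: array[0..4] of TEncodingEntry = (
    (Name: 'UTF-8';       CodePage: 65001),
    (Name: 'ISO-8859-1';  CodePage: 28591),
    (Name: 'ISO-8859-15'; CodePage: 28605),
    (Name: 'CP1252';      CodePage: 1252),
    (Name: 'KOI8-R';      CodePage: 20866)
  );

// Hypothetical helper: case-insensitive lookup, returns 0 for unknown names
function EncodingNameToCodePage(const aName: string): Cardinal;
var
  i: Integer;
begin
  Result := 0;
  for i := Low(KnownEncodings) to High(KnownEncodings) do
    if SameText(aName, KnownEncodings[i].Name) then
    begin
      Result := KnownEncodings[i].CodePage;
      Exit;
    end;
end;

begin
  WriteLn(EncodingNameToCodePage('utf-8'));       // 65001
  WriteLn(EncodingNameToCodePage('ISO-8859-15')); // 28605
  WriteLn(EncodingNameToCodePage('EBCDIC-US'));   // 0: not in the table
end.
```

A caller would have to decide what to do with the 0 result, for example falling back to UTF-8 as GetSystemEncoding does.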

Some years ago I wrote a set of units to support ALL code pages for which a Unicode description exists (somewhat more than Windows knows). It can do code page conversion internally (direct conversion between different code pages, multibyte code page support for Asian languages and so on, EBCDIC support, upper- and lowercase support), but can also fall back to system calls (iconvenc under *nix). It also comes with a tool that generates Pascal source code from a Unicode description file, which can then be integrated into the project.

If you are interested in more global support for code pages, I could update these units. But there is some work to do, especially adapting them to strings that carry code page information in their header, and making the whole thing compile under Delphi.

George · 2020-08-10 11:57:08

I have a question about RawUTF8.
I am trying to build a JSON object and send it to an external API service, but I get an unexpected character encoding.
As stated, RawUTF8 is AnsiString with code page CP_UTF8.

This example works; the external service receives the correct bytes ("€" = HEX: E2 82 AC).

  // Each code compiled with
  {$mode delphi}

  function BuildJsonObject1(): RawUTF8;
  var
    JSONObj: variant;
  begin
    with TDocVariantData(JSONObj) do
     begin
      AddValue('c', '€');
      Result := ToJSON();
     end;
  end;

// From another place:
HTTPClient.Request('/API/v1/test', 'POST', KeepAlive, RequestHeaders, BuildJsonObject1(), RequestDataType, ResponseHeaders, ResponseData);

This code produces an unexpected character encoding ("€" = HEX: C3 A2 C2 82 C2 AC):

  function testUtf8Char(): RawUTF8;
  begin
    Result := '€';
  end;

  function BuildJsonObject2(): RawUTF8;
  var
    JSONObj: variant;
  begin
    with TDocVariantData(JSONObj) do
     begin
      AddValue('c', testUtf8Char());
      Result := ToJSON();
     end;
  end;

The same happens when I use AddValueFromText ("€" = HEX: C3 A2 C2 82 C2 AC):

  function BuildJsonObject3(): RawUTF8;
  var
    JSONObj: variant;
  begin
    with TDocVariantData(JSONObj) do
     begin
      AddValueFromText('c', '€');
      Result := ToJSON();
     end;
  end;

One more question: how does AnsiString(CP_UTF8) store its data internally?

  strRawUtf8 := '€'; // EXPECTED UTF-8 bytes = HEX: E2 82 AC
  WriteLn('strRawUtf8 codepage: ' + StringCodePage(strRawUtf8).ToString); // 65001 (UTF-8)
  WriteLn('strRawUtf8 hex: ' + BinToHex(strRawUtf8)); // C3 A2 C2 82 C2 AC

Should I use {$codepage utf8}?

Last edited by George (2020-08-10 13:11:06)

ab · 2020-08-10 13:02:09

This is mainly a problem with the encoding of the source code itself under FPC.
Please read https://wiki.freepascal.org/FPC_Unicode … e_codepage

FPC Unicode support also does some odd things with constants, which are not consistent with Delphi, for instance.

So for such RawUTF8 constants, you may try the hex constant form #$e2#$82#$ac instead of '€', used via a constant like _EUROSIGNUTF8.
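The unexpected bytes reported above (C3 A2 C2 82 C2 AC) are exactly what results when each of the three UTF-8 bytes of "€" is taken as a separate character (U+00E2, U+0082, U+00AC) and encoded to UTF-8 a second time. The following stand-alone sketch only reproduces that arithmetic; it does not depend on the source file encoding and does not claim to show what the compiler does internally. The names BytesToHex and ReencodeBytesAsUtf8 are made up for this illustration.

```pascal
program DoubleEncodingDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hex dump of a byte array, e.g. 'E2 82 AC'
function BytesToHex(const b: TBytes): string;
var
  i: Integer;
begin
  Result := '';
  for i := 0 to High(b) do
  begin
    if i > 0 then
      Result := Result + ' ';
    Result := Result + IntToHex(b[i], 2);
  end;
end;

// Treats every input byte as a code point U+0000..U+00FF and encodes it as UTF-8
function ReencodeBytesAsUtf8(const src: TBytes): TBytes;
var
  i, n: Integer;
begin
  SetLength(Result, Length(src) * 2); // worst case: two bytes per input byte
  n := 0;
  for i := 0 to High(src) do
    if src[i] < $80 then
    begin
      Result[n] := src[i];
      Inc(n);
    end
    else
    begin
      Result[n] := $C0 or (src[i] shr 6);       // lead byte: C2 or C3
      Result[n + 1] := $80 or (src[i] and $3F); // continuation byte
      Inc(n, 2);
    end;
  SetLength(Result, n);
end;

var
  euro: TBytes;
begin
  SetLength(euro, 3);
  euro[0] := $E2; euro[1] := $82; euro[2] := $AC; // UTF-8 bytes of the euro sign
  WriteLn(BytesToHex(euro));                      // E2 82 AC
  WriteLn(BytesToHex(ReencodeBytesAsUtf8(euro))); // C3 A2 C2 82 C2 AC
end.
```

Seeing this six-byte pattern is therefore a strong hint that a string which already held UTF-8 was converted to UTF-8 once more somewhere along the way.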

George · 2020-08-10 13:28:44

I found a table that describes which assignments are allowed.
In that table, they have UTF8String, which is the same thing as RawUTF8. Both = type AnsiString(CP_UTF8).

Based on that table, direct value assignments from source code to RawUTF8 are not allowed.
{$codepage utf8} directive may help here.

PS: I use UTF-8 as the file encoding.

George · 2020-08-10 15:03:14

ab wrote:

you may try to use the hexa constant variation #$e2#$82#$ac instead of '€'

No, that does not help. It works only with {$codepage utf8}.

Leslie7 · 2020-08-12 20:53:01

I have run into the same problem recently. As mentioned earlier in this topic, RawUTF8 is the same as the FPC String, so one would expect that assigning a string constant should work seamlessly. Unfortunately it does not, but this solved it for me:

aRawUTF8 := StringToUTF8(aConstantString);

I was wondering about the effects of changing the definition of RawUTF8 for FPC like this: RawUTF8 = String;

It might mess up some parts of mORMot. Even if all issues were solved, code written against this definition would be FPC only.

George · 2020-08-12 21:05:50

As mentioned earlier in this topic RawUTF8 is the same as the FPC String

It is the same as UTF8String, not a plain string.

And the table "Assign string literals to different string types" says that, by default, we cannot assign a constant string directly to UTF8String (or to RawUTF8).
Instead of converting, you can add {$codepage utf8} at the beginning of the module, which tells the compiler that all constant strings in the module should be interpreted as UTF-8.
If you would like to apply this mode to the entire project, you can add the compilation flag "-FcUTF8" (project options -> custom options -> Add -FcUTF8).
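A minimal sketch of the per-module variant, assuming FPC 3.x and a source file that is itself saved as UTF-8 (if the file is saved in another encoding, the directive would mislead the compiler). The ToHex helper is written here only to avoid a dependency on SynCommons.BinToHex. In this thread, this configuration is reported to give the expected E2 82 AC bytes.

```pascal
program CodepageDirectiveDemo;
{$mode delphi}
{$codepage utf8} // assumption: this source file is saved as UTF-8

uses
  SysUtils;

// Hex dump of the raw bytes of a string, without any code page conversion
function ToHex(const s: RawByteString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
    Result := Result + IntToHex(Ord(s[i]), 2);
end;

var
  s: UTF8String; // same declaration as RawUTF8: type AnsiString(CP_UTF8)
begin
  s := '€';
  WriteLn('codepage: ', StringCodePage(s));
  WriteLn('hex: ', ToHex(s));
end.
```

Compiling the same file with and without the directive (or with and without -FcUTF8) and comparing the hex line is a quick way to check what a given FPC version does with the literal.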

Last edited by George (2020-08-12 21:37:13)

macfly · 2020-08-12 21:09:15

I had similar problems with Delphi.

As suggested by @ab in this topic, I am using a global variable to store the converted constant at initialization.

const 
  MY_CONST = 'âââ';

var
  MyConst: RawUTF8;

initialization
  MyConst := StringToUTF8(MY_CONST);

macfly · 2020-08-12 21:13:03

....you can add {$codepage utf8} at module begin...

Or include Synopse.inc, which defines this.
