Issue about CP_54936 converting

zen010101 · 2025-01-27 14:54:46

CP_54936 is the GB18030 character set and encoding standard, an extension of CP_936 (GBK). It is a mandatory national standard in China.

In the code below, if you use the TSynAnsiConvert object to convert it, you will not get the correct result, but if you use the FPC SetCodePage method, there is no problem.

https://gist.github.com/zen010101/091b6 … p54936-pas

Suggest adding support for this Code Page, just like supporting CP_936.

More information about CP_54936: https://en.wikipedia.org/wiki/GB_18030
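To illustrate how the two code pages relate, here is a minimal Python sketch. Python is used only because its standard codecs include both gbk and gb18030. The sample strings and the helper name encodes_with are arbitrary choices for illustration, not part of mORMot or FPC. It checks that common Chinese text has identical bytes in both encodings, while a character outside GBK can only be encoded as GB18030.

```python
# Compare GBK (code page 936) with GB18030 (code page 54936)
# using Python's built-in codecs. Sample strings are arbitrary.

def encodes_with(text: str, codec: str) -> bool:
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

common = "中文"        # present in both GBK and GB18030
emoji = "\U0001F600"   # outside the BMP, not part of GBK

same_bytes = common.encode("gbk") == common.encode("gb18030")
gbk_ok = encodes_with(emoji, "gbk")
gb18030_ok = encodes_with(emoji, "gb18030")
emoji_len = len(emoji.encode("gb18030"))  # GB18030 uses 4 bytes here

print(same_bytes, gbk_ok, gb18030_ok, emoji_len)
```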

ab · 2025-01-27 16:18:03

There was indeed a problem with the GB18030 code page name, which was not supplied correctly to the ICU library.
Should be fixed now with
https://github.com/synopse/mORMot2/commit/56313670f

Thanks a lot for the detailed feedback.

zen010101 · 2025-01-27 17:12:44

I forgot to provide some information earlier: the issue I encountered is on Windows, and upon checking, it seems that the commit only involves POSIX systems and not Windows.

ab · 2025-01-27 19:12:35

On Windows, this seems to be a WinAPI limitation.
The API returns INVALID_PARAMETER for codepage = 54936.

Even SetCodePage() with convert=true returns ''.

zen010101 · 2025-01-28 00:32:58

ab wrote:

On Windows, this seems to be a WinAPI limitation.
The API returns INVALID_PARAMETER for codepage = 54936.

Which API do you use? If it is WideCharToMultiByte, Microsoft's documentation includes the following note:

Note For the code page 65001 (UTF-8) or the code page 54936 (GB18030, Windows Vista and later), dwFlags must be set to either 0 or WC_ERR_INVALID_CHARS. Otherwise, the function fails with ERROR_INVALID_FLAGS.

If the call still fails after adjusting the flags accordingly, I guess the cause may be a missing language pack.

Follow the guidelines below to install the Simplified Chinese language pack:

https://support.microsoft.com/en-us/win … fc697cfca8

ab wrote:

Even SetCodePage() with convert=true returns ''.

FPC doesn't seem to call WinAPI, and on my Windows, I get the correct result with SetCodePage. I traced the FPC code a bit, and it seems that the final conversion code is this:

procedure Unicode2AnsiMove(source:punicodechar;var dest:RawByteString;cp : TSystemCodePage;len:SizeInt);

It is located in the "...\fpcsrc\rtl\objpas\fpwidestring.pp" unit file.

Last edited by zen010101 (2025-01-28 00:35:15)

ab · 2025-01-28 09:15:25

There is no embedded charset support for GB18030 AFAICT.
FPC does eventually call the Windows API.

I have modified the API calls:
https://github.com/synopse/mORMot2/commit/b7eec74aa

And now my tests do pass on Windows systems with the proper Chinese support:
https://github.com/synopse/mORMot2/commit/3a049a16c

zen010101 · 2025-01-28 11:25:16

Now it passed my test on both Windows and Linux.

Thank you again. :-)

ab · 2025-01-28 17:59:50

I have added your tests to the main regressions (with some fix + completion).
https://github.com/synopse/mORMot2/commit/b7548a86b
The more test, the better.

Thanks to you for your feedback.
I don't know anything about Chinese and its encoding!

danielkuettner · 2025-01-28 18:16:18

Just a question from today when working under linux (system codepage utf8), unit.pas is set to utf8 using fpc 3.2.2:

When I make a function Example(const s: RawJSON/RawUtf8/UTF8String) call with:

Example('äöü') -> it's not working any more. I have to call
Example(StringToUtf8('äöü')) to get the right string.

Is this a new behavior from this update? I can't remember having such issues in the past.

ab · 2025-01-28 18:42:09

@danielkuettner
(perhaps worth a new thread; I don't think this is a regression)
Are you sure it is FPC 3.2.2?
Are you sure you did not change the file encoding? (look if Delphi did not create a BOM)

danielkuettner · 2025-01-28 18:51:36

Sorry (3.2.0), Free Pascal Compiler version 3.2.0-r45643 [2021/04/13] for x86_64
../source/VendorInterface/ImportVendor/Luxottica/uLuxottica.pas: UTF-8 Unicode (with BOM) text

Just a simple writeln('ü') output a ?.
But a writeln(StringTo('ü')) outputs correct ü.

Last edited by danielkuettner (2025-01-28 18:53:22)

ab · 2025-01-28 18:53:44

AFAIR the BOM is not a good idea with FPC.
It is a Windows/Delphi specific weird behavior.

Ensure you have

  {$CODEPAGE UTF8}

in your unit, e.g. by adding

interface

{$I ..\mormot.defines.inc}

uses ..

danielkuettner · 2025-01-28 19:01:43

All my units include an options.inc, which contains your
{$I mormot.defines.inc}

danielkuettner · 2025-01-28 19:07:02

But it is not a regression, you are right.

With mORMot2 054d2568 and Free Pascal Compiler version 3.2.2 [2023/11/14] for x86_64 under FreeBSD it has the same behavior.

ab · 2025-01-29 09:17:10

Check the actual file encoding.
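One way to check this outside the IDE is to look at the raw bytes of the source file. The following Python sketch is only an illustration: the function name describe_encoding and its return strings are made up here, and it only distinguishes a UTF-8 BOM, a UTF-16 BOM, and BOM-less data that is or is not valid UTF-8.

```python
import codecs

def describe_encoding(data: bytes) -> str:
    """Classify raw file bytes by BOM and UTF-8 validity (rough check)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16 with BOM"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid utf-8"
    # Pure ASCII content also ends up here, since ASCII is valid UTF-8.
    return "utf-8 without BOM"

# In-memory samples; for a real file, read it with open(path, "rb").read().
with_bom = codecs.BOM_UTF8 + "writeln('ü')".encode("utf-8")
without_bom = "writeln('ü')".encode("utf-8")
legacy = "writeln('ü')".encode("latin-1")

print(describe_encoding(with_bom))
print(describe_encoding(without_bom))
print(describe_encoding(legacy))
```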

danielkuettner · 2025-01-29 09:56:43

I don't want to waste your time. It's a fpc issue I don't understand.

locale:
LANG=C.UTF-8
LC_CTYPE="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_TIME="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_ALL=

fpc:
compiled with -FcUTF-8
mormot.defines.inc included

file SOneSrv2.dpr:
Unicode text, UTF-8 (with BOM) text (also tested without BOM)

in .dpr just a writeln('ü') -> output: '?'

Thank you,
Daniel

zen010101 · 2025-01-29 11:25:56

It's the FPC "feature" :

If the source code is saved as UTF-8, e.g. using {$codepage utf8} or saving in UTF-8 with BOM, FPC will store the string constants that contain non-ASCII characters as UTF-16 in the executable.

So, on your FreeBSD, writeln('ü') actually outputs a UnicodeString, but the console output code page is UTF-8, so you get a '?' as the result.

But if you don't use UTF-8 as the source file encoding, FPC uses CP_ACP (which is UTF-8 on Linux, for example) to store string constants.
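The following Python sketch does not reproduce FPC itself. It only illustrates the byte-level difference described above: the same character has different representations in UTF-8 and UTF-16, and a '?' is the typical result when a conversion step targets a charset that cannot represent the character. Using ASCII as that target charset is an assumption made purely for the demonstration.

```python
# The same character in the two encodings discussed above.
char = "ü"

utf8_bytes = char.encode("utf-8")        # b'\xc3\xbc' (2 bytes)
utf16_bytes = char.encode("utf-16-le")   # b'\xfc\x00' (1 code unit)

# A lossy conversion to a charset without 'ü' substitutes '?'.
# ASCII is an assumed stand-in for "a charset that lacks the character".
lossy = char.encode("ascii", errors="replace")  # b'?'

print(utf8_bytes, utf16_bytes, lossy)
```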

danielkuettner · 2025-01-29 13:33:23

@zen010101
Thanks.
I've also read about it but I wasn't able to believe or understand why it was done this way. When I set a codepage or save a file in utf8, then I expect utf8 and not utf16.

zen010101 · 2025-01-29 17:30:51

ab wrote:

The more test, the better.

I submitted a PR to add the BIG5 charset to the Unicode_CodePageName function.

I tested the new code in linux with Python 3.12 cjkencoding files and all passed.

Here is my test code: https://gist.github.com/zen010101/4d667 … 970eb20495

On Windows, some test cases failed because certain language packs are not installed on my machine. Strangely, though, one test case, code page 52936, passed with FPC but failed with mORMot.

I wonder why this happens if mORMot and FPC both call the same Windows API?

ab · 2025-01-29 17:43:38

Weird.
From what I can see, FPC is calling the Windows API for its conversion.
SetCodePage() calls fpc_AnsiStr_To_AnsiStr() calls widestringmanager.Unicode2AnsiMoveProc() calls Win32Unicode2AnsiMove() calls WideCharToMultiByte().

The only difference is that WideCharToMultiByte() is called twice: once with dest = nil to get the length, then a second time to make the actual conversion.
What does GetLastError return with the mORMot way?

function Unicode_WideToAnsi(W: PWideChar; A: PAnsiChar; LW, LA, CodePage: PtrInt): integer;
...
  result := WideCharToMultiByte(CodePage, 0, W, LW, A, LA, defchar, nil);
  if result = 0 then writeln(GetLastError);

zen010101 · 2025-01-30 02:42:37

GetLastError returns 87: INVALID_PARAMETER

I compared the implementation code of FPC and mORMot with the WideCharToMultiByte API document, and found that:

The case statement in FullCodePage uses the same values as the API documentation, which states that the dwFlags parameter must be set to 0 when the code page equals one of the listed values. However, mORMot actually uses the FullCodePage return value to decide whether to set lpDefaultChar to nil. That parameter is optional according to the API documentation, and FPC has always set it to nil.

In the end, I forced the lpDefaultChar parameter to nil, and the error disappeared.

ab · 2025-01-30 08:31:22

Now I understand: there was some missing CodePage in FullCodePage().
52936 was not part of the WideCharToMultiByte() API documentation, so we can assume this documentation is not fully accurate, and the supplied list is better not trusted.

I have made it wide, wild and broad:
https://github.com/synopse/mORMot2/commit/9eeceddce

zen010101 · 2025-01-30 09:37:17

OK, It works now. :-)

ab · 2025-01-31 16:39:52

A lot of fixes and refactoring about the cross-platform Unicode process today.

And some new regression tests:
https://github.com/synopse/mORMot2/commit/59e01274e
I still don't understand why the python reference material for code pages 932, 949, 951, 20932, 50222, 51949 do fail (on both Linux and Windows).
Some wrong codepage equivalency?

Any help is welcome.
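One possible source of such mismatches, offered only as an illustration and not as a diagnosis of the failing tests, is that similarly named CJK encodings are not byte-for-byte equivalent. The Python sketch below uses the well-known wave dash case: the same two bytes decode to different code points with Python's shift_jis and cp932 codecs.

```python
# The same byte sequence decoded with two "almost identical" codecs.
raw = b"\x81\x60"

as_shift_jis = raw.decode("shift_jis")  # U+301C WAVE DASH
as_cp932 = raw.decode("cp932")          # U+FF5E FULLWIDTH TILDE

print(f"shift_jis: U+{ord(as_shift_jis):04X}")
print(f"cp932:     U+{ord(as_cp932):04X}")
print("equivalent:", as_shift_jis == as_cp932)
```

If a reference file was produced with one variant and the converter under test implements the other, a comparison against the UTF-8 twin file fails even though both sides are "correct" for their own mapping table.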

zen010101 · 2025-02-01 12:57:52

I did some experiments, and the conclusion is as follows:
None of these three approaches seems to fully cover the test cases for the CJK encoding set of Python 3.12.

1. uconv (ICU lib) on debian

Match found: big5-utf8.txt with encoding ibm-1373
Match found: cp949-utf8.txt with encoding ibm-1363
Match found: euc_jp-utf8.txt with encoding ibm-33722
Match found: gb18030-utf8.txt with encoding ibm-1392
Match found: gb2312-utf8.txt with encoding ibm-1392
Match found: gbk-utf8.txt with encoding ibm-1392
Match found: iso2022_jp-utf8.txt with encoding ISO-2022-JP
Match found: iso2022_kr-utf8.txt with encoding ISO-2022-KR
Match found: shift_jis-utf8.txt with encoding ibm-942

Or you can use another set of aliases:

Match found: big5-utf8.txt with encoding ibm-1373_P100-2002
Match found: cp949-utf8.txt with encoding ibm-1363_P11B-1998
Match found: euc_jp-utf8.txt with encoding ibm-33722_P12A_P12A-2009_U2
Match found: gb18030-utf8.txt with encoding gb18030
Match found: gb2312-utf8.txt with encoding gb18030
Match found: gbk-utf8.txt with encoding gb18030
Match found: iso2022_jp-utf8.txt with encoding ISO_2022,locale=ja,version=0
Match found: iso2022_kr-utf8.txt with encoding ISO_2022,locale=ko,version=0
Match found: shift_jis-utf8.txt with encoding ibm-942_P12A-1999

The interesting thing about the HZ-GB2312 encoding is:

pi@NanoPi103:~/cjkencoding_test$ cat hz-HZ.out
~}This sentence is in ASCII.
The next sentence is in GB.~{<:Ky2;S{#,NpJ)l6HK!#~}Bye.
pi@NanoPi103:~/cjkencoding_test$ cat hz.txt
This sentence is in ASCII.
The next sentence is in GB.~{<:Ky2;S{#,NpJ)l6HK!#~}Bye.

Although uconv claims to support this encoding, compared to hz.txt, there are two additional characters at the beginning of hz.out: "~}"

2. iconv on debian

Match found: big5-utf8.txt with encoding BIG-5
Match found: big5hkscs-utf8.txt with encoding BIG5-HKSCS
Match found: cp949-utf8.txt with encoding CP949
Match found: euc_jisx0213-utf8.txt with encoding EUC-JISX0213
Match found: euc_jp-utf8.txt with encoding CSEUCPKDFMTJAPANESE
Match found: gb18030-utf8.txt with encoding GB18030
Match found: gb2312-utf8.txt with encoding CN-GB
Match found: gbk-utf8.txt with encoding CP936
Match found: iso2022_jp-utf8.txt with encoding CSISO2022JP
Match found: iso2022_kr-utf8.txt with encoding CSISO2022KR
Match found: johab-utf8.txt with encoding CP1361
Match found: shift_jis-utf8.txt with encoding CP932

3. Windows API

If we use the official list provided by Microsoft, the results are as follows:

hz.txt 52936
hz.txt 54936
gb18030.txt 52936
gb18030.txt 54936
big5.txt 932
big5.txt 936
big5.txt 950
big5.txt 10001
big5.txt 10002
big5.txt 20000
big5.txt 20001
big5.txt 20002
big5.txt 20003
big5.txt 20005
big5.txt 20932
big5.txt 50220
big5.txt 50221
big5.txt 50222
big5.txt 50229
big5.txt 65000
big5.txt 65001
cp949.txt 949
cp949.txt 65000
cp949.txt 65001
euc_jp.txt 932
euc_jp.txt 936
euc_jp.txt 10001
euc_jp.txt 20932
euc_jp.txt 50220
euc_jp.txt 50221
euc_jp.txt 50222
euc_jp.txt 65000
euc_jp.txt 65001
gb2312.txt 936
gb2312.txt 10008
gb2312.txt 20936
gb2312.txt 50227
gb2312.txt 65000
gb2312.txt 65001
gbk.txt 936
gbk.txt 65000
gbk.txt 65001
iso2022_jp.txt 932
iso2022_jp.txt 936
iso2022_jp.txt 10001
iso2022_jp.txt 20932
iso2022_jp.txt 50220
iso2022_jp.txt 50221
iso2022_jp.txt 50222
iso2022_jp.txt 65000
iso2022_jp.txt 65001
iso2022_kr.txt 949
iso2022_kr.txt 1361
iso2022_kr.txt 10003
iso2022_kr.txt 20949
iso2022_kr.txt 50225
iso2022_kr.txt 51949
iso2022_kr.txt 65000
iso2022_kr.txt 65001
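For reference, a scan of this kind can be sketched in a few lines of Python. This is not the script used for the results above. The function name matching_codecs, the candidate list and the in-memory sample are assumptions for illustration. The idea is to decode the raw bytes with each candidate and keep those whose result equals the expected UTF-8 text.

```python
from typing import Iterable, List

def matching_codecs(raw: bytes, expected: str,
                    candidates: Iterable[str]) -> List[str]:
    """Return the candidate codecs that decode raw to exactly expected."""
    matches = []
    for name in candidates:
        try:
            decoded = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Invalid bytes for this codec, or codec unknown to Python.
            continue
        if decoded == expected:
            matches.append(name)
    return matches

# In-memory stand-in for a pair such as gbk.txt / gbk-utf8.txt.
sample_text = "中文"
sample_raw = sample_text.encode("gbk")

found = matching_codecs(sample_raw, sample_text,
                        ["gbk", "gb18030", "big5", "utf-8"])
print(found)
```

Note that a short sample can match several codecs at once, as the repeated big5.txt and euc_jp.txt rows above suggest, so a match on its own does not prove that two code pages are equivalent.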

ab · 2025-02-01 17:53:12

Note that for the HZ encoding, I had already identified these extra "~}" characters, and mormot.core.os.posix.inc deletes them when they appear.
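As a rough illustration of that kind of cleanup (a sketch in Python, not the code in mormot.core.os.posix.inc), a redundant leading "~}" escape can be dropped before comparing with the reference file. The helper name strip_leading_hz_reset is hypothetical, and the sample bytes are the ones quoted from hz.txt earlier in the thread.

```python
HZ_ASCII_RESET = b"~}"  # HZ escape that switches back to ASCII mode

def strip_leading_hz_reset(data: bytes) -> bytes:
    """Drop a redundant '~}' at the very start of HZ-encoded data."""
    if data.startswith(HZ_ASCII_RESET):
        return data[len(HZ_ASCII_RESET):]
    return data

reference = (b"This sentence is in ASCII.\n"
             b"The next sentence is in GB.~{<:Ky2;S{#,NpJ)l6HK!#~}Bye.\n")
converter_output = HZ_ASCII_RESET + reference  # as seen in hz-HZ.out

cleaned = strip_leading_hz_reset(converter_output)
decoded = cleaned.decode("hz")  # Python's built-in HZ codec

print(cleaned == reference)
print(decoded)
```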

zen010101 · 2025-02-02 16:43:25

I committed a PR to fix some errors regarding the code page based on the results tested yesterday.

ab · 2025-02-03 10:08:36

This PR fails on my Debian machine.
https://gist.github.com/synopse/4ac3b59 … 9b61a46475

ab · 2025-02-03 13:57:20

I have made another huge refactoring.
Sounds better now, on both Linux and Windows.
