#1 2024-10-31 09:41:25

Gigo
Member
From: Split, Croatia
Registered: 2012-01-27
Posts: 16

Regression test fails on Utf8CompareIOS

Regression tests return errors in

test.core.base.pas

procedure TTestCoreBase._UTF8;
...
    W := WinAnsiString(RandomString(len));
    U := WinAnsiToUtf8(W);
...
    Up := mormot.core.unicode.UpperCase(U);
...
    CheckEqual(Utf8CompareIOS(pointer(U), pointer(Up)), 0);      // fails here
...
end

Running on Windows 10, Croatian locale (ANSI code page 1250, OEM code page 852), both 32-bit and 64-bit, using mORMot2 commit 2.3.8840.

When comparing strings, CompareStringW() takes into account diacritics (šđčćž ŠĐČĆŽ) as well as digraphs (nj NJ or lj LJ).
On the other hand, we use mormot.core.unicode.UpperCase(), which uppercases only invariant chars < #127.
We need to force CompareStringW() to use LOCALE_INVARIANT; otherwise it returns <> 0 for "nJ" = "NJ" on randomly generated text.

On Linux, everything runs fine with current source.


My suggestion would be:

mormot.core.test.pas

class function TSynTestCase.RandomString(CharCount: integer): WinAnsiString;
-   PByteArray(result)[i] := 32 + R[i] and 127;                                   // can get over #127
+   PByteArray(result)[i] :=  $20 + R[i] mod 95;

mormot.core.test.pas

class function TSynTestCase.RandomAnsi7(CharCount: integer): RawByteString;
-    PByteArray(result)[i] := 32 + R[i] mod 94;
+    PByteArray(result)[i] := 32 + R[i] mod 95;                                   // tilde #$7E should be included (not related to the test errors)

mormot.core.os.pas

function Unicode_CompareString(PW1, PW2: PWideChar; L1, L2: PtrInt;
-  result := CompareStringW(LOCALE_USER_DEFAULT, _CASEFLAG[IgnoreCase], PW1, L1, PW2, L2);
+  result := CompareStringW(LOCALE_INVARIANT, _CASEFLAG[IgnoreCase], PW1, L1, PW2, L2);
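
The byte ranges produced by the expressions in the RandomString() and RandomAnsi7() changes above can be checked by enumerating every possible random byte. This is only a minimal sketch: the program name and the Track helper are hypothetical, and it assumes R[i] is a byte in 0..255; only the three arithmetic expressions are taken from the diff.

```pascal
program RandomByteRanges;

{$ifdef MSWINDOWS}{$apptype console}{$endif}

// Hypothetical helper: widens the [vMin..vMax] interval so it contains v.
procedure Track(v: integer; var vMin, vMax: integer);
begin
  if v < vMin then
    vMin := v;
  if v > vMax then
    vMax := v;
end;

var
  b: integer; // stands in for one random byte R[i], assumed to be 0..255
  min1, max1, min2, max2, min3, max3: integer;
begin
  min1 := MaxInt; max1 := -1;
  min2 := MaxInt; max2 := -1;
  min3 := MaxInt; max3 := -1;
  for b := 0 to 255 do
  begin
    // "and" binds tighter than "+", so this is 32 + (b and 127)
    Track(32 + b and 127, min1, max1);
    Track(32 + b mod 94, min2, max2);
    Track(32 + b mod 95, min3, max3);
  end;
  WriteLn('32 + b and 127: ', min1, '..', max1); // 32..159, i.e. $20..$9F
  WriteLn('32 + b mod 94 : ', min2, '..', max2); // 32..125, i.e. $20..$7D
  WriteLn('32 + b mod 95 : ', min3, '..', max3); // 32..126, i.e. $20..$7E
end.
```

So the first expression can exceed #127, and "mod 94" stops one short of the tilde (#$7E), which is what the two comments in the diff refer to.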

#2 2024-10-31 16:02:04

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,684

Re: Regression test fails on Utf8CompareIOS

Thanks a lot for the investigation.

I am okay with the LOCALE_INVARIANT flag usage, and RandomAnsi7.

I still have a doubt about the original request behind this Utf8CompareIOS() function, though.
It was meant to deal with Chinese characters and sorting, and I am not sure whether LOCALE_INVARIANT would break the search...
https://learn.microsoft.com/en-us/windo … -invariant


I am more concerned about RandomString(): this function should work without any tweak, because it returns WinAnsiString content, which can go over #127, as expected.
The tests use these WinAnsi chars to validate the WinAnsi case conversion of the framework, i.e. UpperCaseU().

#3 2024-11-01 22:17:40

ab

Re: Regression test fails on Utf8CompareIOS

Please try with the latest trunk,
especially https://github.com/synopse/mORMot2/commit/bc047ba6

Using LOCALE_INVARIANT would break existing code relying on the current user locale.

#4 2024-11-05 13:13:39

Gigo

Re: Regression test fails on Utf8CompareIOS

Regression tests now pass without failed assertions.

ab wrote:

I still have a doubt about the original request behind this Utf8CompareIOS() function, though.
It was meant to deal with Chinese characters and sorting, and I am not sure whether LOCALE_INVARIANT would break the search...

Utf8CompareIOS() should be used as rarely as possible (i.e. only for sorting).

https://learn.microsoft.com/en-us/windo … plications

On Windows it uses CompareStringW(), which will fail on some computer-generated text (random, base64-encoded, etc.) because of digraphs in some languages, but will be fine for natural-language text.
For example, with Utf8CompareIOS() in my locale: "Anja" = "ANJA" and "ANJa" = "ANJA", but "AnJa" <> "ANJA".
It's a mess.
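
The digraph behaviour can be tried directly against the Win32 API. The following Windows-only sketch rests on several assumptions that are not from this thread: it uses LCID $041A (hr-HR) to stand in for my locale, it assumes that the ignore-case flag in use is NORM_IGNORECASE, and the program, constant and procedure names are made up for illustration. It deliberately does not state expected output, since the results depend on the Windows version and its collation data.

```pascal
program DigraphCompare;

{$apptype console}

uses
  Windows; // Winapi.Windows on Delphi versions that use unit scope names

const
  // Declared locally so the sketch does not depend on which constants
  // a given Windows unit happens to expose.
  LCID_INVARIANT   = $007F; // value of LOCALE_INVARIANT
  LCID_CROATIAN    = $041A; // hr-HR, assumed to match the locale above
  FLAG_IGNORE_CASE = 1;     // value of NORM_IGNORECASE
  RESULT_EQUAL     = 2;     // value of CSTR_EQUAL

// Hypothetical helper: compares a and b ignoring case under the given
// locale and prints whether CompareStringW() reports them as equal.
procedure Show(const localeName: string; localeId: cardinal;
  const a, b: WideString);
var
  res: integer;
begin
  res := CompareStringW(localeId, FLAG_IGNORE_CASE,
    PWideChar(a), Length(a), PWideChar(b), Length(b));
  if res = RESULT_EQUAL then
    WriteLn(localeName, ': ', a, ' = ', b)
  else
    WriteLn(localeName, ': ', a, ' <> ', b, ' (CompareStringW returned ', res, ')');
end;

begin
  Show('hr-HR', LCID_CROATIAN, 'Anja', 'ANJA');
  Show('hr-HR', LCID_CROATIAN, 'ANJa', 'ANJA');
  Show('hr-HR', LCID_CROATIAN, 'AnJa', 'ANJA');
  Show('invariant', LCID_INVARIANT, 'AnJa', 'ANJA');
end.
```

If the third comparison reports a difference while the fourth does not, that would be consistent with "nJ" not being treated as a casing of the "nj" digraph under the Croatian collation, which is my reading of the failures on random text.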

ab wrote:

I am more concerned about RandomString(): this function should work without any tweak, because it returns WinAnsiString content, which can go over #127, as expected.
The tests use these WinAnsi chars to validate the WinAnsi case conversion of the framework, i.e. UpperCaseU().

Actually, it works as expected (it returns chars in the range $20 - $9F).
My bad: in the _UTF8 test I saw UpperCase()/LowerCase() instead of UpperCaseU()/LowerCaseU() and figured (wrongly) that the chars should be in the 7-bit range ($20 - $7F).

Cheers
