SynCommons.UTF8ToString() does not handle surrogate pair properly

akirabbq · 2013-02-04 04:14:57

For example:
var
c : ansistring;

....

setlength(c, 4);
c[1] := #$F0;
c[2] := #$A8;
c[3] := #$B3;
c[4] := #$92;

System.UTF8ToString(c) and SynCommons.UTF8ToString(c) give different results. The specific character on unicode.org:

http://www.unicode.org/cgi-bin/GetUniha … 0%A8%B3%92

System's UTF8ToString() returns D863 DCD2, which is correct, but SynCommons.UTF8ToString() gives 85C2 003F.
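To show where D863 DCD2 comes from, here is a minimal sketch of the arithmetic, assuming a well-formed 4-byte sequence. Decode4ByteUTF8 is a hypothetical helper written for this post; it is not part of System or SynCommons and does no validation.

```pascal
program SurrogateDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: decodes one 4-byte UTF-8 sequence into a code point.
// Assumes the bytes are well formed; no validation is done.
function Decode4ByteUTF8(b1, b2, b3, b4: Byte): Cardinal;
begin
  Result := (Cardinal(b1 and $07) shl 18) or
            (Cardinal(b2 and $3F) shl 12) or
            (Cardinal(b3 and $3F) shl 6) or
             Cardinal(b4 and $3F);
end;

var
  cp, v: Cardinal;
  HighSur, LowSur: Word;
begin
  cp := Decode4ByteUTF8($F0, $A8, $B3, $92);  // U+28CD2
  v := cp - $10000;                           // 20-bit value split across the pair
  HighSur := Word($D800 or (v shr 10));       // high (lead) surrogate
  LowSur := Word($DC00 or (v and $3FF));      // low (trail) surrogate
  Writeln('U+', IntToHex(Integer(cp), 5), ' -> ',
    IntToHex(Integer(HighSur), 4), ' ', IntToHex(Integer(LowSur), 4));
end.
```

For the bytes F0 A8 B3 92 this gives code point U+28CD2 and the pair D863 DCD2, which matches the System result.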

Using CharLength to test the above character's length:
LStr := UTF8ToString(c);
CharLength(LStr, 1)

System: 4 (correct, 4 bytes in total for a surrogate pair character)
SynCommons: 2

I have also tried other surrogate pairs and got similar results.

Thanks.

Last edited by akirabbq (2013-02-04 04:17:41)

ab · 2013-02-04 07:53:18

I suspect you use Delphi 2009 or later.

Writing

var
  c : ansistring;

is incorrect: it will force the compiler to generate a hidden AnsiString -> RawUTF8 conversion at:

LStr := UTF8ToString(c);

I'm quite sure there is a "warning" emitted at compilation.

First of all, we should write, as expected:

var c: RawUTF8;

System.UTF8ToString() is lazier: I suspect it expects a RawByteString parameter, so it does not trigger the conversion.
When working with a Delphi 2009+ version of the compiler, it is mandatory to get rid of every warning about implicit string type conversion.
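As a sketch of the declaration fix only (it does not address the surrogate problem), the test case could be written like this. It assumes Delphi 2009+ and that the SynCommons unit is in the search path; the program name is made up for illustration.

```pascal
program RawUTF8Demo;

{$APPTYPE CONSOLE}

uses
  SynCommons; // mORMot unit declaring RawUTF8 and UTF8ToString()

var
  c: RawUTF8; // UTF-8 typed string, so the call below needs no implicit conversion
  s: string;
begin
  SetLength(c, 4);
  c[1] := AnsiChar($F0);
  c[2] := AnsiChar($A8);
  c[3] := AnsiChar($B3);
  c[4] := AnsiChar($92);
  s := SynCommons.UTF8ToString(c); // parameter type matches the declaration
  Writeln(Length(s));              // number of UTF-16 code units in the result
end.
```

Declared this way, the compiler should no longer report an implicit string conversion for the UTF8ToString() call.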

BUT this is not enough here.
System.UTF8ToString() calls the MultiByteToWideChar() API, which is slow but handles surrogate pairs as expected.
Our optimized version does not support surrogates, in fact. See how UTF8ToWideChar() is implemented: it converts each UTF-8 character to a single WideChar...

I just checked the SQLite3 source code:

...
**  *  This routine never allows a UTF16 surrogate value to be encoded.
**     If a multi-byte character attempts to encode a value between
**     0xd800 and 0xe000 then it is rendered as 0xfffd. 
#define READ_UTF8(zIn, zTerm, c)
...

So SQLite3 does not support surrogates either.
I guess it affects only searching (surrogates are ignored) and not storage itself, as long as we work with the UTF-8 API for reading and writing (which is what mORMot does).
The SQLite3 limitation is perhaps not too problematic.

I suspect SynCommons cannot support surrogates without a big performance penalty for 99% of users, so it may not be worth it.
We may add a conditional define and call the MultiByteToWideChar() API instead of our optimized versions, or explicitly handle such surrogates... I've created a ticket for this. I may be able to handle surrogates without too much trouble...
But it won't work with SQLite3 either! Or we would have to switch to the UTF-16 APIs of SQLite3.

I think you have discovered a design limitation of both SQLite3 and SynCommons!

ab · 2013-02-05 16:18:37

That's it.

In mORMot, UTF-8 processing now handles UTF-16 surrogates, as expected.
I've used some nice tricks, similar to http://floodyberry.wordpress.com/2007/0 … ion-tricks
By the way, some other potential issues have been fixed, and some regression tests added, for instance about the http://www.unicode.org/cgi-bin/GetUniha … 0%A8%B3%92 glyph.

UnicodeCharToUTF8() has been renamed to WideCharToUTF8(), and a new UTF16CharToUTF8() function has been introduced.
See http://synopse.info/fossil/info/b18d47257d
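For readers wondering what the UTF-16 to UTF-8 direction involves, here is a minimal sketch of encoding one surrogate pair as 4 UTF-8 bytes. It is not the code from the commit above: SurrogatePairToUTF8 and TUTF8Bytes are hypothetical names, there is no validation, and it does not use the optimization tricks mentioned.

```pascal
program PairToUTF8Demo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TUTF8Bytes = array[0..3] of Byte;

// Hypothetical helper: combines a high/low surrogate pair into a code point
// and encodes it as a 4-byte UTF-8 sequence. Assumes a valid pair.
procedure SurrogatePairToUTF8(HighSur, LowSur: Word; var Bytes: TUTF8Bytes);
var
  cp: Cardinal;
begin
  cp := $10000 + ((Cardinal(HighSur) - $D800) shl 10)
               + (Cardinal(LowSur) - $DC00);
  Bytes[0] := Byte($F0 or (cp shr 18));            // 11110xxx
  Bytes[1] := Byte($80 or ((cp shr 12) and $3F));  // 10xxxxxx
  Bytes[2] := Byte($80 or ((cp shr 6) and $3F));   // 10xxxxxx
  Bytes[3] := Byte($80 or (cp and $3F));           // 10xxxxxx
end;

var
  b: TUTF8Bytes;
  i: Integer;
begin
  SurrogatePairToUTF8($D863, $DCD2, b);
  for i := 0 to 3 do
    Write(IntToHex(b[i], 2), ' ');
  Writeln;
end.
```

With the pair D863 DCD2 from the first post, the bytes come out as F0 A8 B3 92, i.e. the original UTF-8 input.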

For SQLite3, I do not expect any problem with storage, just with searching in some cases.
