For example:
var
c : ansistring;
....
setlength(c, 4);
c[1] := #$F0;
c[2] := #$A8;
c[3] := #$B3;
c[4] := #$92;
System.UTF8ToString(c) and SynCommons.UTF8ToString(c) give different results. The specific character on unicode.org:
http://www.unicode.org/cgi-bin/GetUniha … 0%A8%B3%92
System.UTF8ToString() returns D863 DCD2, which is correct, but SynCommons.UTF8ToString() returns 85C2 003F.
Using CharLength to test the above character's length:
LStr := UTF8ToString(c);
CharLength(LStr, 1)
System: 4 (correct, 4 bytes in total for a surrogate pair character)
SynCommons: 2
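To double-check the expected values by hand, here is a minimal sketch of the surrogate arithmetic. It assumes the four bytes F0 A8 B3 92 decode to code point U+28CD2. HighSurrogate() and LowSurrogate() are hypothetical helper names made up for this example; they are not part of System or SynCommons.

```pascal
program SurrogateArithmetic;
{$ifdef FPC}{$mode delphi}{$endif}
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helpers for illustration only (not part of System or SynCommons).
// Both assume cp is in the range $10000..$10FFFF.
function HighSurrogate(cp: Integer): Integer;
begin
  // The upper 10 bits of (cp - $10000) go into the high surrogate.
  Result := $D800 + ((cp - $10000) shr 10);
end;

function LowSurrogate(cp: Integer): Integer;
begin
  // The lower 10 bits of (cp - $10000) go into the low surrogate.
  Result := $DC00 + ((cp - $10000) and $3FF);
end;

const
  CodePoint = $28CD2; // what the bytes F0 A8 B3 92 decode to
begin
  // Prints: D863 DCD2
  WriteLn(IntToHex(HighSurrogate(CodePoint), 4), ' ',
          IntToHex(LowSurrogate(CodePoint), 4));
  // Two UTF-16 code units of 2 bytes each, prints: 4
  WriteLn(2 * SizeOf(WideChar));
end.
```

This matches the System results quoted above: D863 DCD2, and 4 bytes for the pair.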
I have also tried other surrogate unicode pairs and have similar results.
Thanks.
Last edited by akirabbq (2013-02-04 04:17:41)
I suspect you are using Delphi 2009 or later.
Writing
var
c : ansistring;
is incorrect: it will force the compiler to generate a hidden AnsiString -> RawUTF8 conversion at:
LStr := UTF8ToString(c);
I'm quite sure there is a "warning" emitted at compilation.
First of all, we should write, as expected:
var c: RawUTF8;
System.UTF8ToString() is lazier: I suspect it expects a RawByteString parameter, so it does not perform the conversion.
When working with a Delphi 2009+ version of the compiler, it is mandatory to get rid of every warning about implicit string type conversion.
But I'm afraid that is not enough here.
System.UTF8ToString() calls the MultiByteToWideChar() API, which is slow but handles surrogate pairs as expected.
Our optimized version does not support surrogates, in fact. See how UTF8ToWideChar() is implemented: it converts the UTF-8 to a WideChar...
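To illustrate why a single WideChar is not enough for such glyphs, here is a rough sketch of a UTF-8 to UTF-16 loop that emits a surrogate pair for 4-byte sequences. This is not the SynCommons code: Utf8ToUtf16() is a hypothetical name, the input is assumed to be well-formed UTF-8, and no validation is done.

```pascal
program Utf8DecodeSketch;
{$ifdef FPC}{$mode delphi}{$endif}
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper, NOT the SynCommons implementation.
// Assumes well-formed UTF-8 input; no validation is performed.
function Utf8ToUtf16(const s: AnsiString): WideString;
var
  i, n, extra, k, cp: Integer;
  b: Byte;
begin
  // UTF-16 never needs more code units than UTF-8 needs bytes.
  SetLength(Result, Length(s));
  n := 0;
  i := 1;
  while i <= Length(s) do
  begin
    b := Ord(s[i]);
    Inc(i);
    // The lead byte gives the payload bits and the continuation byte count.
    if b < $80 then begin cp := b; extra := 0; end
    else if b < $E0 then begin cp := b and $1F; extra := 1; end
    else if b < $F0 then begin cp := b and $0F; extra := 2; end
    else begin cp := b and $07; extra := 3; end;
    for k := 1 to extra do
    begin
      if i > Length(s) then
        Break; // truncated input: stop early
      cp := (cp shl 6) or (Ord(s[i]) and $3F);
      Inc(i);
    end;
    if cp < $10000 then
    begin
      // Inside the BMP: one WideChar is enough.
      Inc(n);
      Result[n] := WideChar(cp);
    end
    else
    begin
      // Outside the BMP: emit a high and a low surrogate.
      Dec(cp, $10000);
      Inc(n);
      Result[n] := WideChar($D800 + (cp shr 10));
      Inc(n);
      Result[n] := WideChar($DC00 + (cp and $3FF));
    end;
  end;
  SetLength(Result, n);
end;

var
  c: AnsiString;
  w: WideString;
  j: Integer;
begin
  SetLength(c, 4);
  c[1] := AnsiChar($F0);
  c[2] := AnsiChar($A8);
  c[3] := AnsiChar($B3);
  c[4] := AnsiChar($92);
  w := Utf8ToUtf16(c);
  // For the bytes above this prints the two code units D863 and DCD2.
  for j := 1 to Length(w) do
    Write(IntToHex(Ord(w[j]), 4), ' ');
  WriteLn;
end.
```

The extra branch for code points above $FFFF is the only part a BMP-only converter lacks; what it would cost in an optimized loop is not measured here.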
I just checked the SQLite3 source code:
...
** * This routine never allows a UTF16 surrogate value to be encoded.
** If a multi-byte character attempts to encode a value between
** 0xd800 and 0xe000 then it is rendered as 0xfffd.
#define READ_UTF8(zIn, zTerm, c)
...
So SQLite3 does not support surrogates either.
I guess it only affects searching (surrogates are ignored) and not storage itself, as long as we work with the UTF-8 API for reading and writing (which is what mORMot does).
The SQLite3 limitation is perhaps not too problematic.
I suspect it is not worth making SynCommons support surrogates if that means a big performance penalty for 99% of users.
We may add a conditional define and call the MultiByteToWideChar() API instead of our optimized versions, or explicitly handle such surrogates... I've created a ticket for this. I may be able to handle surrogates without too much trouble...
But it won't work with SQLite3 either! Or we would have to switch to the UTF-16 APIs of SQLite3.
I think you have discovered a design limitation of both SQLite3 and SynCommons!
That's it.
In mORMot, UTF-8 processing now handles UTF-16 surrogates, as expected.
I've used some nice tricks, similar to http://floodyberry.wordpress.com/2007/0 … ion-tricks
By the way, some other potential issues have been fixed, and some regression tests added, for instance about the http://www.unicode.org/cgi-bin/GetUniha … 0%A8%B3%92 glyph.
UnicodeCharToUTF8() has been renamed WideCharToUTF8(), and a new UTF16CharToUTF8() function has been introduced.
See http://synopse.info/fossil/info/b18d47257d
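For the opposite direction (UTF-16 surrogate pair back to UTF-8), the idea can be sketched as below. This is only an illustration of the bit layout, not the actual UTF16CharToUTF8() code or signature: PairToUtf8() is a hypothetical name, and the parameters are assumed to be a valid high/low surrogate pair.

```pascal
program SurrogatePairToUtf8;
{$ifdef FPC}{$mode delphi}{$endif}
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper, NOT the real UTF16CharToUTF8() from SynCommons.
// Assumes hiSur is in $D800..$DBFF and loSur is in $DC00..$DFFF.
function PairToUtf8(hiSur, loSur: Word): AnsiString;
var
  cp: Integer;
begin
  // Rebuild the code point from the two 10-bit halves.
  cp := $10000 + ((Integer(hiSur) - $D800) shl 10) + (Integer(loSur) - $DC00);
  // Spread its 21 bits over one lead byte and three continuation bytes.
  SetLength(Result, 4);
  Result[1] := AnsiChar($F0 or (cp shr 18));
  Result[2] := AnsiChar($80 or ((cp shr 12) and $3F));
  Result[3] := AnsiChar($80 or ((cp shr 6) and $3F));
  Result[4] := AnsiChar($80 or (cp and $3F));
end;

var
  u: AnsiString;
  j: Integer;
begin
  u := PairToUtf8($D863, $DCD2);
  // Prints the four bytes of the glyph discussed above: F0 A8 B3 92
  for j := 1 to Length(u) do
    Write(IntToHex(Ord(u[j]), 2), ' ');
  WriteLn;
end.
```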
For SQLite3, I do not expect any problem with storage, just with searching in some cases.