#1 2013-01-31 21:50:47

hovadur
Member
Registered: 2013-01-31
Posts: 2

Why UTF16, not UTF8?

The default internal string encoding in Delphi XE2 and XE3 is UTF-16. Why is UTF-16 more popular than UTF-8?
Your framework, on the other hand, uses UTF-8 everywhere. Why?

#2 2013-02-01 07:19:47

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,660

Re: Why UTF16, not UTF8?

It is explained in the doc.

#3 2013-02-01 08:05:06

hovadur
Member
Registered: 2013-01-31
Posts: 2

Re: Why UTF16, not UTF8?

Well, in the document "Synopse mORMot Framework SAD 1.17.pdf", section "1.4.6.1. Unicode and UTF-8", I learned that you use UTF-8 for speed.
So if UTF-8 is so fast, why isn't it used in Delphi XE2? XE2, XE3 and Python 3 switched to UTF-16. Why is UTF-16 so popular?

#4 2013-02-01 08:43:50

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,660

Re: Why UTF16, not UTF8?

Documentation wrote:
Unicode and UTF-8

Our mORMot Framework has 100% Unicode compatibility, that is, it compiles under Delphi 2009/2010/XE/XE2/XE3. The code has been deeply rewritten and tested in order to be compatible with the String=UnicodeString paradigm of these compilers. The code also handles Unicode safely on older versions, i.e. from Delphi 6 up to Delphi 2007.

Since our framework is natively UTF-8 (the best-suited character encoding for fast text - JSON - streaming/parsing, and one natively supported by the SQLite3 engine), we had to establish a safe way for the framework to use strings, in order to handle all versions of Delphi (even pre-Unicode versions, especially the Delphi 7 version we like so much), and provide compatibility with the Free Pascal Compiler.

(...)

Note that RawUTF8 is the preferred string type to be used in our framework when defining textual properties in a TSQLRecord and for all internal data processing. It's only when you're reaching the User Interface layer that you may explicitly convert the RawUTF8 content into the generic VCL string type, using either the Language.UTF8ToString method (from the mORMoti18n.pas unit) or the following function from SynCommons.pas:

/// convert any UTF-8 encoded String into a generic VCL Text
// - it's preferred to use TLanguageFile.UTF8ToString() in mORMoti18n.pas,
// which will handle full i18n of your application
// - it will work as is with Delphi 2009+ (direct unicode conversion)
// - under older version of Delphi (no unicode), it will use the
// current RTL codepage, as with WideString conversion (but without slow
// WideString usage)
function UTF8ToString(const Text: RawUTF8): string;

Of course, the StringToUTF8 method or function is available to send some text back to the ORM layer.
A lot of dedicated conversion functions (including to/from numerical values) are included in SynCommons.pas. Those were optimized for speed and multi-thread capabilities, and to avoid implicit conversions involving a temporary string variable.
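
The round trip described above can be sketched as follows. This is only a minimal illustration, assuming a Unicode version of Delphi (2009+) and SynCommons.pas from mORMot 1.x on the search path; the program name and the variable names are made up for the example.

```pascal
program Utf8RoundTrip;

{$APPTYPE CONSOLE}

uses
  SynCommons; // declares RawUTF8, StringToUTF8() and UTF8ToString()

var
  FromUI: string;   // generic VCL string (UnicodeString on Delphi 2009+)
  Stored: RawUTF8;  // what the ORM / business logic layer works with
  BackToUI: string;
begin
  // assumption: Unicode compiler, so this literal is UTF-16 'Caf' + U+00E9
  FromUI := 'Caf'#$00E9;

  // UI -> business layer: explicit conversion, no implicit temporary string
  Stored := StringToUTF8(FromUI);

  // business layer -> UI: explicit conversion back to the VCL string type
  BackToUI := UTF8ToString(Stored);

  Writeln('UTF-16 code units in the VCL string: ', Length(FromUI));
  Writeln('UTF-8 bytes in the RawUTF8 value:    ', Length(Stored));
  Writeln('Round trip preserved the text:       ', BackToUI = FromUI);
end.
```

Keeping both conversions explicit at the UI boundary is what lets the rest of the code stay on RawUTF8 without compiler warnings about implicit string casts.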

Warnings during compilation are not allowed, especially under Unicode versions of Delphi (e.g. Delphi 2010): all string conversions between the types above are made explicitly in the framework's code, to avoid any unintended data loss.

To summarize some points about encoding:
- UnicodeString is used mainly by Delphi 2009+ to directly map the Windows API;
- Neither UTF-8 nor UTF-16 has a direct mapping between glyph and character (even UTF-32 does not, due to diacritics);
- In separation of layers we trust, so it is safe, efficient and worth it to use a dedicated string type at business logic level (with all our optimized functions and classes in SynCommons.pas);
- UTF-8 brings speed to a JSON-based framework like ours, since JSON is mainly used with UTF-8 encoding;
- Direct use of the SQLite3 UTF-8 API;
- It enables older versions of Delphi (prior to Delphi 2009) to have a Unicode-ready kernel, with 100% compatibility with Delphi 2009+ (our RawUTF8 is a fast Unicode-ready cross-Delphi type);
- It also provides compatibility with the Free Pascal Compiler.
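
To illustrate the glyph versus character point from the list above, here is a small sketch that uses only the Delphi RTL. It assumes Delphi 2009+ (UnicodeString and the UTF8Encode overload that takes one); the Show procedure is a hypothetical helper written for this example, not part of mORMot.

```pascal
program GlyphVsCodeUnit;

{$APPTYPE CONSOLE}

// Illustrative helper: prints how many code units one user-visible
// text needs in UTF-16 and how many bytes it needs in UTF-8.
procedure Show(const Caption: string; const Text: UnicodeString);
var
  Utf8: UTF8String;
begin
  Utf8 := UTF8Encode(Text);
  Writeln(Caption, ': UTF-16 units=', Length(Text),
    ', UTF-8 bytes=', Length(Utf8));
end;

begin
  // one glyph, one code point (U+00E9)
  Show('precomposed e-acute', #$00E9);
  // the same glyph as two code points: U+0065 + U+0301 combining accent
  Show('e + combining acute', 'e'#$0301);
  // one glyph outside the BMP (U+1F600), stored as a surrogate pair
  Show('emoji U+1F600', #$D83D#$DE00);
end.
```

In each case Length() counts code units, not what the user sees as one character, whichever encoding is chosen.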

Those ideas/choices are about mORMot, not about the design of other platforms such as Delphi, Python, or whatever.
UTF-8 versus UTF-16 versus UTF-32 is a subject that attracts trolling. Just remember that the mapping between glyphs and UTF-16 chars is not perfect, either.
http://stackoverflow.com/questions/9818 … 8-or-utf16

See also
http://blog.synopse.info/post/2011/06/0 … ON-parsing
http://blog.synopse.info/post/2012/12/2 … ke-it-fast
http://blog.synopse.info/post/2012/02/14/ORM-cache
http://blog.synopse.info/post/2010/07/0 … pplication
