#1 2016-06-10 19:41:23

hnb
Member
Registered: 2015-06-15
Posts: 291

Improper SynCrossPlatformJSON detail for \u escape sequences

Hi,

During a discussion with Benjamin Rosseaux (aka BeRo), he raised an important point about "\u" escapes in JSON:

Most JSON parsers (including SynCrossPlatformJSON, as it seems) handle \uXXXX escape sequences incorrectly. When they meet a surrogate pair of two UTF-16 code units, they write each parsed code unit straight to the output as its own UTF-8 sequence, without first combining the two surrogates into a full Unicode code point and then converting that code point to the destination encoding (for example UTF-8). In these cases the result is often only CESU-8 (or Java's Modified UTF-8), not valid UTF-8; see https://en.wikipedia.org/wiki/CESU-8 .
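To make the difference concrete, here is a minimal Python sketch (not code from SynCommons.pas or SynCrossPlatformJSON; the helper names `combine_surrogates` and `encode_utf8` are made up for illustration). It uses the U+10400 example from the Wikipedia CESU-8 page, which is written in JSON as \uD801\uDC00:

```python
def combine_surrogates(high: int, low: int) -> int:
    """Merge a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encode_utf8(cp: int) -> bytes:
    """Encode one value as UTF-8 by hand (1 to 4 bytes), without validation."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


high, low = 0xD801, 0xDC00  # the two code units of \uD801\uDC00

# Wrong: encoding each surrogate on its own yields CESU-8 (6 bytes).
cesu8 = encode_utf8(high) + encode_utf8(low)

# Right: combine the pair first, then encode the full code point (4 bytes).
utf8 = encode_utf8(combine_surrogates(high, low))

print(cesu8.hex(" "))  # ed a0 81 ed b0 80
print(utf8.hex(" "))   # f0 90 90 80
```

Only the second byte sequence is valid UTF-8; a strict UTF-8 decoder should reject the first one.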

Anyway, our core GetJSONField in SynCommons.pas for mORMot looks correct. Is there a special reason why SynCrossPlatformJSON implements only the CESU-8 / "Java's Modified UTF-8" form of the \u escape sequence?


Best regards,
Maciej Izak


#2 2016-06-11 08:54:42

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182

Re: Improper SynCrossPlatformJSON detail for \u escape sequences

I did a code review of GetJSONField.
I very much doubt anyone would encode UTF-16 surrogates as \u####\u#### pairs...
Such text should be supplied directly as UTF-8 content IMHO.
But I've ensured that GetJSONField() now handles UTF-16 surrogate pairs arriving as \u####\u#### escapes.
I included a basic regression test based on https://en.wikipedia.org/wiki/CESU-8
See http://synopse.info/fossil/info/f7705237a4
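For readers who want to see the idea without opening the commit, the following is a rough Python sketch of pairing surrogates while unescaping. It is not the GetJSONField() code: `decode_u_escapes` is a hypothetical helper, it only looks at \uXXXX escapes, it does not rigorously validate the hex digits, and replacing an unpaired surrogate with U+FFFD is an assumed policy.

```python
import json


def decode_u_escapes(text: str) -> bytes:
    """Decode \\uXXXX escapes in `text` to UTF-8, pairing surrogates first."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith("\\u", i) and i + 6 <= len(text):
            unit = int(text[i + 2:i + 6], 16)
            i += 6
            # A high surrogate must be followed by an escaped low surrogate.
            if (0xD800 <= unit <= 0xDBFF
                    and text.startswith("\\u", i) and i + 6 <= len(text)):
                low = int(text[i + 2:i + 6], 16)
                if 0xDC00 <= low <= 0xDFFF:
                    unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                    i += 6
            if 0xD800 <= unit <= 0xDFFF:
                unit = 0xFFFD  # lone surrogate: cannot be encoded as UTF-8
            out.append(chr(unit))
        else:
            out.append(text[i])
            i += 1
    return "".join(out).encode("utf-8")


escaped = r"\uD801\uDC00"  # U+10400 from the Wikipedia CESU-8 example
result = decode_u_escapes(escaped)
reference = json.loads('"' + escaped + '"').encode("utf-8")
print(result.hex(" "))      # f0 90 90 80
print(result == reference)  # True
```

Comparing against Python's own json module gives a cheap regression check in the same spirit as the one mentioned above.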

In SynCrossPlatformJSON, there is only limited support for surrogates, for now...
But AFAIR FireMonkey handles all this surrogate stuff very poorly, since it expects one WideChar = one glyph, which is plain wrong.
This is why I didn't go much deeper into surrogate support for SynCrossPlatformJson.
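A quick way to see why the "one WideChar = one glyph" assumption breaks, shown in Python purely as an illustration (it says nothing about FireMonkey internals):

```python
text = "A\U00010400"  # 'A' followed by U+10400, which lies outside the BMP

code_points = len(text)                           # Python counts code points
utf16_units = len(text.encode("utf-16-le")) // 2  # 2 bytes per UTF-16 code unit

print(code_points)  # 2
print(utf16_units)  # 3
```

The second character needs two UTF-16 code units, so any code that treats each 16-bit unit as a separate character will split it in half.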
