#1 2015-01-10 00:26:23

louis_riviera
Member
Registered: 2013-09-23
Posts: 61

UTF8 bug

Hi!

This seems to be a bug in SynCommons.pas?

program Project3;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SysUtils, SynCommons;

var
  Str: RawUTF8;
  SubStr: RawUTF8;
  I: Integer;
begin
  Str := 'ийскf';
  SubStr := 'к';
  I := SynCommons.PosEx(SubStr, Str); // I := 7 ??? It should be 3
end.

EDIT:

Ah! It seems I need to div SizeOf(Char); :)

Last edited by louis_riviera (2015-01-10 00:37:04)


#2 2015-01-10 07:17:06

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182

Re: UTF8 bug

No, this is not a bug.
PosEx returns the position in UTF-8 bytes, so 7 is the expected result.
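
To illustrate the byte-based result, here is a minimal sketch that does not use SynCommons at all. It assumes a Unicode Delphi compiler (2009 or later), where assigning a string constant to a UTF8String converts it to UTF-8. BytePos is a hypothetical helper written for this example, not a SynCommons function. The text is written with #$ escapes so that the result does not depend on how the source file is encoded.

```pascal
program Utf8BytePos;

{$APPTYPE CONSOLE}

// Hypothetical helper (not part of SynCommons): naive byte-wise search.
// Returns the 1-based byte index of Sub in S, or 0 when not found.
function BytePos(const Sub, S: UTF8String): Integer;
var
  i, j: Integer;
begin
  Result := 0;
  if Sub = '' then
    Exit;
  for i := 1 to Length(S) - Length(Sub) + 1 do
  begin
    j := 1;
    // Compare byte by byte; S[i] on a UTF8String is one byte, not one glyph.
    while (j <= Length(Sub)) and (S[i + j - 1] = Sub[j]) do
      Inc(j);
    if j > Length(Sub) then
    begin
      Result := i;
      Exit;
    end;
  end;
end;

var
  Str, SubStr: UTF8String;
begin
  Str := #$0438#$0439#$0441#$043A'f'; // 'ийскf'
  SubStr := #$043A;                   // 'к'
  WriteLn(Length(Str));               // 9: four 2-byte letters plus 'f'
  WriteLn(BytePos(SubStr, Str));      // 7
end.
```

Each of the three Cyrillic letters before 'к' takes two bytes in UTF-8, so 'к' starts at byte 7, which matches what PosEx reports.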


#3 2015-01-10 07:34:29

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182

Re: UTF8 bug

louis_riviera wrote:

So how do you get the real position with Ansi POS?

This is the "real position", in bytes.
In Unicode, there is no AnsiPos() "real position", since you may use e.g. diacritics.

Confusing encoded bytes with glyphs is a common misunderstanding.

In UTF-8, glyph=char holds only for plain English (ASCII) characters.
In UTF-16, glyph=char holds for most characters, unless you use diacritics, or you mix languages (e.g. Hebrew and English).
In normalized UTF-32 content, you may have glyph=char.

Why do you need this glyph position?
From the computer's point of view, the byte position is enough for any processing.
If you need the glyph position, you need something more complex, like Uniscribe under Windows.
There is a logical order (i.e. the byte order, in either UTF-8 or UTF-16), and a visual order.
Such confusion between glyphs and encodings is, BTW, why FireMonkey does not work as expected outside the Latin text range.
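
If a code point index is really wanted, it can be derived from the byte index by skipping UTF-8 continuation bytes. The following is only a sketch, assuming a Unicode Delphi compiler (2009 or later) and valid UTF-8 input; ByteIndexToCodePointIndex is a hypothetical name, not a SynCommons function. Note that a code point index is still not a glyph position, as the second test string shows.

```pascal
program Utf8CodePointIndex;

{$APPTYPE CONSOLE}

// Hypothetical helper: converts a 1-based byte index into a 1-based
// code point index by counting every byte that is not a UTF-8
// continuation byte (bit pattern 10xxxxxx). Assumes S is valid UTF-8.
function ByteIndexToCodePointIndex(const S: UTF8String; ByteIndex: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  if ByteIndex > Length(S) then
    ByteIndex := Length(S);
  for i := 1 to ByteIndex do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Cyrillic, Accented: UTF8String;
begin
  Cyrillic := #$0438#$0439#$0441#$043A'f'; // 'ийскf'
  Accented := 'e'#$0301'x';                // 'e' + combining acute accent + 'x'
  WriteLn(ByteIndexToCodePointIndex(Cyrillic, 7)); // 4
  WriteLn(ByteIndexToCodePointIndex(Accented, 4)); // 3
end.
```

In the second string, 'x' is the third code point but is displayed as the second glyph, because the combining accent is drawn on top of the 'e'. Getting the displayed position right is the part that needs a shaping engine such as Uniscribe.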


#4 2015-01-10 07:35:32

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182

Re: UTF8 bug

louis_riviera wrote:

Ah! It seems I need to div SizeOf(Char); :)

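NO! This is not correct!
UTF-8 is a variable-length encoding (1 to 4 bytes per code point), so dividing a byte index by a fixed size does not give a character index.

Please read up on Unicode and encodings before you break your code.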
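
A small sketch of why the division fails, assuming a Unicode Delphi compiler (2009 or later) where SizeOf(Char) is 2. GuessedCharIndex is a hypothetical name that simply reproduces the approach from the quoted post, so that its results can be compared with the actual positions.

```pascal
program DivSizeOfCharIsWrong;

{$APPTYPE CONSOLE}

// The approach from the quoted post, kept only to show why it fails.
function GuessedCharIndex(ByteIndex: Integer): Integer;
begin
  Result := ByteIndex div SizeOf(Char);
end;

var
  Latin, Cyrillic, Han: UTF8String;
begin
  Latin := 'f';
  Cyrillic := #$043A; // 'к'
  Han := #$4E2D;      // '中'
  // UTF-8 uses a different number of bytes for each of these characters:
  WriteLn(Length(Latin), ' ', Length(Cyrillic), ' ', Length(Han)); // 1 2 3

  // 'к' starts at byte 7 of 'ийскf' and is its 4th code point:
  WriteLn(GuessedCharIndex(7)); // 3 when SizeOf(Char) = 2
  // 'к' starts at byte 3 of 'abк' and is its 3rd code point:
  WriteLn(GuessedCharIndex(3)); // 1 when SizeOf(Char) = 2
end.
```

SizeOf(Char) describes the UTF-16 Char type of the compiler and says nothing about how many bytes a given character takes in a UTF-8 string.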


#5 2015-01-10 07:37:50

louis_riviera
Member
Registered: 2013-09-23
Posts: 61

Re: UTF8 bug

But how else do you find the true position of a Chinese substring in a large text? :)


#6 2015-01-10 07:39:56

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182
Website

Re: UTF8 bug

You just use the byte index!

There is no such "true" position in computer-encoded text.
For a single language, you may use normalized UTF-32 text.
The "naive" UTF-16 index, as returned by Pos() over a UnicodeString (called "AnsiPos" in older versions of Delphi), is very misleading, and is just as good (or as wrong) as the byte index for UTF-8.
But if you expect the order as displayed, you need to take a look at http://msdn.microsoft.com/en-us/library/dd374091

Make sure you read http://www.joelonsoftware.com/articles/Unicode.html
and http://stackoverflow.com/a/222424/458259


#7 2015-01-10 08:26:20

louis_riviera
Member
Registered: 2013-09-23
Posts: 61

Re: UTF8 bug

Very interesting and confusing. Thanks! :)

EDIT:

Why do I get these warnings?

var
  Str: RawUTF8;
  A: RawUTF8;

function ExtractBetween(const Value, A, B: RawUTF8): RawUTF8;
var
  aPos, bPos: Integer;
begin
  result := '';
  aPos := SynCommons.Pos(A, Value);
  if aPos > 0 then
  begin
    aPos := aPos + Length(A);
    bPos := SynCommons.PosEx(B, Value, aPos);
    if bPos > 0 then
    begin
      result := Copy(Value, aPos, bPos - aPos);
    end;
  end;
end;

begin
  A := '中'; //[dcc32 Warning] Project3.dpr(32): W1062 Narrowing given wide string constant lost information
  Str := '中國哲學書電子化計劃';
  A := ExtractBetween(Str, '中國', '計劃');
  if A = '哲學書電子化' then
    WriteLn('Parsed Correctly.');
  ReadLn;
end.

Last edited by louis_riviera (2015-01-10 09:40:02)


#8 2015-01-10 12:36:57

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182
Website

Re: UTF8 bug

Is your source code file encoded as UTF-8?


#9 2015-01-10 12:49:17

louis_riviera
Member
Registered: 2013-09-23
Posts: 61

Re: UTF8 bug

Ah, that was it :)

