Hi!
Is this a bug in SynCommons.pas?
program Project3;
{$APPTYPE CONSOLE}
{$R *.res}
uses
  System.SysUtils, SynCommons;
var
  Str: RawUTF8;
  SubStr: RawUTF8;
  I: Integer;
begin
  Str := 'ийскf';
  SubStr := 'к';
  I := SynCommons.PosEx(SubStr, Str); // I := 7 ??? It should be 3
end.
EDIT:
Ah! It seems I need to div SizeOf(Char);
Last edited by louis_riviera (2015-01-10 00:37:04)
So how do you get the real position with Ansi POS?
This is the "real position", in bytes.
In Unicode, there is no AnsiPos() "real position", since you may use e.g. diacritics.
Confusing encoded bytes with glyphs is a common misunderstanding.
In UTF-8, glyph=char holds only for plain ASCII text, such as unaccented English.
In UTF-16, glyph=char works for most characters, unless you use diacritics or mix languages (e.g. Hebrew and English).
In normalized UTF-32 content, you may have glyph=char.
Why do you need this glyph position?
From the computer's point of view, the byte position is enough for any processing.
If you need the glyph position, you would need something more complex, like Uniscribe under Windows.
There is a logical order (i.e. the byte order, in either UTF-8 or UTF-16), and a view order.
Such confusion between glyphs and encodings is, BTW, why FireMonkey is not able to work as expected outside the Latin text range.
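To make the byte-versus-character distinction concrete for the string from the first post, here is a minimal sketch. It uses only the Delphi RTL and assumes a compiler version whose Pos() has a byte-string (RawByteString) overload, so that the search runs over UTF-8 bytes. Utf8CodePointCount is a hypothetical helper written for this illustration, not a SynCommons function. In UTF-8 each of the four Cyrillic letters takes two bytes, so 'к' starts at byte 7 and is the fourth code point. Dividing the byte index by SizeOf(Char) therefore does not give a character index in general.

```pascal
program BytePosDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

// Hypothetical helper: counts the code points contained in the first
// ByteCount bytes of a UTF-8 string. Every byte that is not a continuation
// byte (bit pattern 10xxxxxx) starts a new code point.
function Utf8CodePointCount(const S: UTF8String; ByteCount: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  if ByteCount > Length(S) then
    ByteCount := Length(S);
  for i := 1 to ByteCount do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Str, SubStr: UTF8String;
  BytePos: Integer;
begin
  // #$xxxx escapes avoid any dependency on the source file encoding
  Str := UTF8Encode(#$0438#$0439#$0441#$043A'f'); // 'ийскf'
  SubStr := UTF8Encode(#$043A);                   // 'к'
  // Assumption: this resolves to the byte-string overload of Pos()
  BytePos := Pos(SubStr, Str);
  WriteLn('byte index:       ', BytePos);
  WriteLn('code point index: ', Utf8CodePointCount(Str, BytePos));
end.
```

Note that a code point index is still not a glyph index: combining marks and similar sequences make several code points render as one glyph.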
But how else do you find the true position of some Chinese substring in a large text?
You just use the byte index!
There is no such "true" position in computer-encoded text.
For a single language, you may use UTF-32 normalized text.
The "naive" UTF-16 order, as you get with Pos() over UnicodeString (called "AnsiPos" in older versions of Delphi), is very misleading, and is as good (or as wrong) as the byte index for UTF-8.
But if you expect the "order" as displayed, you need to take a look at http://msdn.microsoft.com/en-us/library/dd374091
Ensure you read http://www.joelonsoftware.com/articles/Unicode.html
and http://stackoverflow.com/a/222424/458259
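As an illustration of why the UTF-16 index from Pos() over UnicodeString is no more "true" than the UTF-8 byte index, here is a small sketch with two made-up test strings (they are assumptions for this example, not taken from the thread). In both cases a reader sees two characters, yet Pos() reports 3, because it counts UTF-16 code units.

```pascal
program Utf16PosDemo;
{$APPTYPE CONSOLE}
var
  S: UnicodeString;
begin
  // U+1F600 lies outside the BMP, so UTF-16 stores it as a surrogate
  // pair (two code units); the letter 'a' follows it.
  S := #$D83D#$DE00'a';
  WriteLn(Pos('a', S)); // index in UTF-16 code units, not in characters

  // 'e' followed by U+0301 COMBINING ACUTE ACCENT is displayed as a
  // single glyph; the letter 'x' follows it.
  S := 'e'#$0301'x';
  WriteLn(Pos('x', S)); // again counts code units, not glyphs
end.
```

Either index is fine for Copy() and similar calls on the same string type, which is the only thing most parsing code needs.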
Very interesting and confusing. Thanks!
EDIT:
Why do I get these warnings?
var
  Str: RawUTF8;
  A: RawUTF8;

function ExtractBetween(const Value, A, B: RawUTF8): RawUTF8;
var
  aPos, bPos: Integer;
begin
  result := '';
  aPos := SynCommons.Pos(A, Value);
  if aPos > 0 then
  begin
    aPos := aPos + Length(A);
    bPos := SynCommons.PosEx(B, Value, aPos);
    if bPos > 0 then
    begin
      result := Copy(Value, aPos, bPos - aPos);
    end;
  end;
end;

begin
  A := '中'; //[dcc32 Warning] Project3.dpr(32): W1062 Narrowing given wide string constant lost information
  Str := '中國哲學書電子化計劃';
  A := ExtractBetween(Str, '中國', '計劃');
  if A = '哲學書電子化' then
    WriteLn('Parsed Correctly.');
  ReadLn;
end.
Last edited by louis_riviera (2015-01-10 09:40:02)
Ah that was it