#1 Re: SynProject » Problem with TDiff » 2022-10-15 12:20:42

Sha

Yes, result is correct, but is not shortest.

#2 SynProject » Problem with TDiff » 2022-10-15 07:04:09

Sha
Replies: 2

Hello AB,

using TDiff with files of string ['1','2','5'] and ['1','2','3','4','5']
I get valid result (2 add operations)

but with files of string ['1','2','3','4','5'] and ['1','2','5']
I get invalid result (1 modify and 2 delete operations)

#3 Re: mORMot Framework » Fast MM5 » 2020-05-02 20:13:23

Sha

Same, Delphi 7: 20% + 1

and take your attention to FastMM4-AVX
https://github.com/maximmasiutin/FastMM4-AVX

#4 Re: mORMot Framework » How to check correctness of a custom unserialization of record? » 2013-02-28 17:16:38

Sha

In my source code I didn't use the predetermined types to show possibility of switching of data presentation.

Post is changed according to SynCommons changes.

For the best translation from/to Russian you can use   http://www.online-translator.com/Default.aspx/Text

#5 Re: mORMot Framework » How to check correctness of a custom unserialization of record? » 2013-02-27 19:46:28

Sha

Sorry.

Here some serialization examples   http://guildalfa.ru/alsha/sites/default … Proj_0.zip

and here annotation (in Russian)   http://guildalfa.ru/alsha/node/28

Thanks for the link. Very interesting. I will close read it and check UTF-updates in week.

#6 Re: mORMot Framework » How to check correctness of a custom unserialization of record? » 2013-02-26 20:57:59

Sha

I decided to use the simplified check on the basis of a trick with NULCHAR as in SynCommons.

The project below contains different variants of work with JSON.

http://guildalfa.ru/alsha/sites/default … onProj.zip

#7 mORMot Framework » How to check correctness of a custom unserialization of record? » 2013-02-25 19:25:23

Sha
Replies: 8

I try to play with custom serialization of record as array of unnamed values.
If I correctly understand, for this purpose it is necessary to use RecordSaveJson.
The result, for example, is [365,-365,36.5,"2014-02-25T23:18:38","day\/3"]
For a unserialization I use RecordLoadJson. The result is expected.
The problem consists that it isn't clear how to check correctness of the received result.

#9 Re: mORMot Framework » Why TDynArrayJSONCustomWriter & Reader are the methods? » 2013-02-24 16:48:09

Sha

If two different classes have the same DynArrayTI and RecordTI,
we can use only one reader/writer pair to process these classes.
Class TI is useless.

OK, let's leave everything as is.
Anyway it isn't especially important.

#10 Re: mORMot Framework » Why TDynArrayJSONCustomWriter & Reader are the methods? » 2013-02-22 22:02:56

Sha

Framework uses DynArrayTI and RecordTI and not method data to find and replace custom reader/writer in JSONCustomParsers array.

#11 Re: mORMot Framework » Why TDynArrayJSONCustomWriter & Reader are the methods? » 2013-02-22 20:18:41

Sha

Yes, I see.
Then I will use this feature as follows.

type
  TDynArrayWriter= procedure(aUserInfo: pointer; const aWriter: TTextWriter; const aValue);
  TDynArrayReader= function(aUserInfo: pointer; p: PUTF8Char; var aValue; out aValid: Boolean): PUTF8Char;

procedure RegisterDynArrayRW(aTypeInfo: pointer; aReader: TDynArrayReader; aWriter: TDynArrayWriter; aUserInfo: pointer= nil);

implementation

procedure RegisterDynArrayRW(aTypeInfo: pointer; aReader: TDynArrayReader; aWriter: TDynArrayWriter; aUserInfo: pointer= nil);
var
  Reader, Writer: TMethod;
begin;
  Reader.Code:=@aReader; Reader.Data:=aUserInfo;
  Writer.Code:=@aWriter; Writer.Data:=aUserInfo;
  TTextWriter.RegisterCustomJSONSerializer(aTypeInfo,
    TDynArrayJSONCustomReader(Reader),
    TDynArrayJSONCustomWriter(Writer));
  end;

procedure Int64ArrayWriter(aUserInfo: pointer; const aWriter: TTextWriter; const aValue);
var
  v: int64 absolute aValue;
begin;
  aWriter.Add(v);
  end;

function Int64ArrayReader(aUserInfo: pointer; p: PUTF8Char; var aValue; out aValid: Boolean): PUTF8Char;
var
  v: int64 absolute aValue;
  wasString: boolean;
  delim: AnsiChar;
begin;
  aValid:=false;
  Result:=nil;
  if p=nil then exit;
  v:=GetInt64(GetJSONField(p,p,@wasString,@delim));
  if (p=nil) or wasString then exit;
  dec(p); p^:=delim;
  aValid:=true;
  Result:=p; // ',' or ']' for last item of array
  end;

But the problem consists that I can't register some serializers with different UserData for the same type.

#12 mORMot Framework » 2 bugs in TTextWriter.RegisterCustomJSONSerializer » 2013-02-22 18:04:22

Sha
Replies: 2

To reproduce this bug register/unregister/register custom serializer.
You can see that default serializer is used. Not the last registered.

Another bug is that length of JSONCustomParsers will increase
for each registration of previously unregistered serializer.

Here is correct code.

class procedure TTextWriter.RegisterCustomJSONSerializer(aTypeInfo: pointer;
  aReader: TDynArrayJSONCustomReader; aWriter: TDynArrayJSONCustomWriter);
var i: integer;
    DynArrayTI, RecordTI: pointer;
begin
  i := -MaxInt; //Sha: //i := -2;
  RecordTI := nil;
  DynArrayTI := nil;
  if aTypeInfo<>nil then
  case PDynArrayTypeInfo(aTypeInfo)^.kind of
  tkDynArray: begin
    DynArrayTI := aTypeInfo;
    RecordTI := TypeInfoToRecordInfo(aTypeInfo);
    i := JSONCustomParsersDynArrayIndex(aTypeInfo);
    if (i<0) and (RecordTI<>nil) then
      i := JSONCustomParsersRecordIndex(RecordTI);
    end;
  tkRecord: begin
    i := JSONCustomParsersRecordIndex(aTypeInfo);
    RecordTI := aTypeInfo;
    DynArrayTI := nil;
  end;
  end;
  if i=-MaxInt then //Sha: //if i=-2 then
    raise ESynException.Create('Invalid TTextWriter.RegisterCustomJSONSerializer call');
  if i<0 then
    if Assigned(aWriter) or Assigned(aReader) then
    if i<-1 then i := -(i+2) else //Sha: line added
    begin
      i := length(JSONCustomParsers);
      SetLength(JSONCustomParsers,i+1);
    end else
    exit;
  with JSONCustomParsers[i] do begin
    DynArrayTypeInfo := DynArrayTI;
    RecordTypeInfo := RecordTI;
    Writer := aWriter;
    Reader := aReader;
  end;
end;

//Sha
function JSONCustomParsersDynArrayIndex(aTypeInfo: pointer): integer; {$ifdef HASINLINE}inline;{$endif}
begin
  result := length(JSONCustomParsers) - 1;
  while result >= 0 do begin
    with JSONCustomParsers[result] do if DynArrayTypeInfo=aTypeInfo then begin
      if not Assigned(Reader) and not Assigned(Writer) then result := -(result+2);
      exit;
    end;
    dec(result);
  end;
end;

//Sha
function JSONCustomParsersRecordIndex(aTypeInfo: pointer): integer; {$ifdef HASINLINE}inline;{$endif}
begin
  result := length(JSONCustomParsers) - 1;
  while result >= 0 do begin
    with JSONCustomParsers[result] do if RecordTypeInfo=aTypeInfo then begin
      if not Assigned(Reader) and not Assigned(Writer) then result := -(result+2);
      exit;
    end;
    dec(result);
  end;
end;

#13 mORMot Framework » Why TDynArrayJSONCustomWriter & Reader are the methods? » 2013-02-21 20:21:14

Sha
Replies: 6

I think the use of regular procedures can save a lot of code
because we will need only one pair of them
for each array of ordinals or array of records
and not for each class used such arrays.

Of course, I can continue to write something like this.

procedure IntArrWriter(aObject: TObject; const aWriter: TTextWriter; const aValue);
var
  v: integer absolute aValue;
begin;
  aWriter.Add('%', [v], twJSONEscape);
  end;

function IntArrReader(aObject: TObject; p: PUTF8Char; var aValue; out aValid: Boolean): PUTF8Char;
var
  v: integer absolute aValue;
  delim: AnsiChar;
begin;
  aValid:=false;
  Result:=nil;
  if p=nil then exit;
  v:=GetInteger(GetJSONField(p,p,nil,@delim)); 
  if p=nil then exit;
  dec(p); p^:=delim; 
  aValid:=true;
  Result:=p; // ',' or ']' for last item of array
  end;

var
  reader: TDynArrayJSONCustomReader;
  writer: TDynArrayJSONCustomWriter;

initialization
  @reader:=@IntArrReader;
  @writer:=@IntArrWriter;
  TTextWriter.RegisterCustomJSONSerializer(TypeInfo(TIntArr), reader, writer);

#14 Re: mORMot Framework » Fast Ansi/Unicode conversion » 2012-02-11 11:56:40

Sha

Ok. Then let's try my code.

I have tested functons from current version of ShaUnicode
using 1000 random CP and 10000 random strings.

No error. You can repeat this. All code you need is below. You need also Button and Memo on the Form.

procedure FillRandomConvTable(pTable: PConvTable);
const
  DefaultConvChar = 32;
var
  i, n: NInt;
begin;
  pTable.CodePage:=0;

  for i:=0 to 127 do begin;
    pTable.WideToAnsi[i]:=i;
    pTable.AnsiToWide[i]:=i;
    end;

  for i:=128 to $FFFF do pTable.WideToAnsi[i]:=DefaultConvChar;

  for i:=128 to 255 do begin;
    repeat;
      if i<128+64 then n:=Random($07FF -   255) +   256
                  else n:=Random($FFFF - $07FF) + $0800;
      until pTable.WideToAnsi[n]=DefaultConvChar;
    pTable.WideToAnsi[n]:=i;
    pTable.AnsiToWide[i]:=n;
    end;

  for i:=0 to 255 do begin;
    n:=pTable.AnsiToWide[i];
    if n<=127 then n:=0
    else if n<=$7FF then n:=1
    else n:=2;
    pTable.AnsiToUTF8Len[i]:=n;
    end;
  end;

function TSynTestCaseRandomString(CharCount: Integer): RawByteString;
var V: cardinal;
    P: PAnsiChar;
begin
  SetString(result,nil,CharCount);
  P := pointer(Result);
  while CharCount>0 do begin
    if CharCount>5 then begin
      V := Random(maxInt); // fast: one random compute per 5 chars
      P[0] := AnsiChar(32+V and 127); V := V shr 7;
      P[1] := AnsiChar(32+V and 127); V := V shr 7;
      P[2] := AnsiChar(32+V and 127); V := V shr 7;
      P[3] := AnsiChar(32+V and 127); V := V shr 7;
      P[4] := AnsiChar(65+V);
      Inc(P,5);
      dec(CharCount,5);
    end else begin
      P^ := AnsiChar(32+Random(224));
      inc(P);
      dec(CharCount);
    end;
  end;
end;

procedure TestCP;
var
  ST: array[0..10000] of RawByteString;
  CP, i: integer;
  Table: TConvTable;
begin
  for i := 0 to high(ST) do ST[i] := TSynTestCaseRandomString(i shr 3);

  for CP:=0 to 999 do begin;
    FillRandomConvTable(@Table);

    for i := 0 to high(ST) do
    if ShaUTF8ToAnsi(ShaAnsiToUTF8(ST[i],@Table),@Table)<>ST[i] then begin;
      Form1.Memo1.Lines.Add('ShaAnsiToUTF8 '+IntToStr(CP));
      Form1.Memo1.Lines.Add(ST[i]);
      break;
      end;

    for i := 0 to high(ST) do
    if ShaUnicodeToAnsi(ShaAnsiToUnicode(ST[i],@Table),@Table)<>ST[i] then begin;
      Form1.Memo1.Lines.Add('ShaAnsiToUnicode '+IntToStr(CP));
      Form1.Memo1.Lines.Add(ST[i]);
      break;
      end;
    end;
  end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  TestCP;
  Form1.Memo1.Lines.Add('TestCP done');
  end;

#15 Re: mORMot Framework » Fast Ansi/Unicode conversion » 2012-02-11 09:05:11

Sha
ab wrote:

Just use my above benchmark code, and you'll find out that it it works only with (CP=1250) or (CP=1251) or (CP=1254) or (CP=1256).

So may be problem in your above benchmark code? ;-)
Or you tested old my functions?

I haven't such function

Sha := GetShaConvTable(CP);

as well as conversion tables for all these CPs.

Could you provide minimal code that I could reproduce AV?

#16 Re: mORMot Framework » Fast Ansi/Unicode conversion » 2012-02-10 18:26:00

Sha
ab wrote:

I noticed that your routines are a bit faster than mine.
But also that it works only with (CP=1250) or (CP=1251) or (CP=1254) or (CP=1256).
I've got access violation otherwise. I suspect there are some issues in your UTF8 conversion code.

I have validated all my functions for CP 1251 using code

function RandomString(MaxCharCount: Integer): RawByteString;
var
  CharCount, CharRange: integer;
  P: PAnsiChar;
begin;
//  CharRange:=96 +  32; //ASCII + #128..159
  CharRange:=96 + 128; //ASCII + #128..255
  CharCount:=1+Random(MaxCharCount);
  SetString(Result, nil, CharCount);
  P:=pointer(Result);
  while CharCount>0 do begin;
    dec(CharCount); P[CharCount]:=AnsiChar(Random(CharRange)+32); //starting from #32
    end;
  end;

function Validate(const s: RawByteString; Status: integer=0): integer;
var
  s2: RawByteString;
  u, u2: RawUnicode;
  t, t2: RawUTF8;
  Len, Len2, Len8, TestNo: integer;
begin;
  u:='';
  t:='';
  Len8:=0;

  Len:=Length(s);
  if Len>0 then begin;
    SetLength(u,Len*2); u[Len+2]:=#0;
    MultiByteToWideChar(GetACP, 0, pointer(s), Len, pointer(u), Len);
    Len8:=WideCharToMultiByte(CP_UTF8, 0, pointer(u), Len, nil, 0, nil, nil);
    SetLength(t, Len8);
    WideCharToMultiByte(CP_UTF8, 0, pointer(u), Len, pointer(t), Len8, nil, nil);
    end;

  TestNo:=1;
  if TestNo and Status=0 then begin;
    u2:=ShaAnsiToUnicode(s, pConvDefault);
    if u2=u then Status:=Status or TestNo;
    end;

  TestNo:=TestNo*2;
  if TestNo and Status=0 then begin;
    t2:=ShaUnicodeToUTF8(u);
    if t2=t then Status:=Status or TestNo;
    end;

  TestNo:=TestNo*2;
  if TestNo and Status=0 then begin;
    s2:=ShaUnicodeToAnsi(u, pConvDefault);
    if s2=s then Status:=Status or TestNo;
    end;

  TestNo:=TestNo*2;
  if TestNo and Status=0 then begin;
    u2:=ShaUTF8ToUnicode(t);
    if u2=u then Status:=Status or TestNo;
    end;

  TestNo:=TestNo*2;
  if TestNo and Status=0 then begin;
    t2:=ShaAnsiToUTF8(s, pConvDefault);
    if t2=t then Status:=Status or TestNo;
    end;

  TestNo:=TestNo*2;
  if TestNo and Status=0 then begin;
    s2:=ShaUTF8ToAnsi(t, pConvDefault);
    if s2=s then Status:=Status or TestNo;
    end;

  TestNo:=TestNo*2;
  if TestNo and Status=0 then begin;
    Len2:=ShaUnicodeToUTF8(pointer(u), Len);
    if Len2=Len8 then Status:=Status or TestNo;
    end;

  TestNo:=TestNo*2;
  if TestNo and Status=0 then begin;
    Len2:=ShaAnsiToUTF8(pointer(s), Len, pConvDefault);
    if Len2=Len8 then Status:=Status or TestNo;
    end;

  TestNo:=TestNo*2;
  if TestNo and Status=0 then begin;
    Len2:=ShaUTF8ToAnsi(pointer(t), Len8);
    if Len2=Len then Status:=Status or TestNo;
    end;

  while TestNo<>0 do begin;
    TestNo:=TestNo*2;
    Status:=Status or TestNo;
    end;
  Result:=Status;
  end;

procedure TForm1.bValidateClick(Sender: TObject);
var
  s: RawByteString;
  i, len, ErrorCount, Status: integer;
begin;
  ErrorCount:=0;
  i:=0;
  len:=0;
  repeat;
    s:=RandomString(len);
    Status:=Validate(s);
    if Status<>-1 then begin;
      inc(ErrorCount);
      Validate(s,Status);
      end;
    len:=32;
    if i>100000 then len:=8*1024;
    inc(i);
    until i>150000;
  Memo1.Lines.Add(Format('Validation done, %d errors',[ErrorCount]));
  end;

Also I have tested them for CP 1251 and 1252 as you do

function TSynTestCaseRandomString(CharCount: Integer): RawByteString;
var V: cardinal;
    P: PAnsiChar;
begin
  SetString(result,nil,CharCount);
  P := pointer(Result);
  while CharCount>0 do begin
    if CharCount>5 then begin
      V := Random(maxInt); // fast: one random compute per 5 chars
      P[0] := AnsiChar(32+V and 127); V := V shr 7;
      P[1] := AnsiChar(32+V and 127); V := V shr 7;
      P[2] := AnsiChar(32+V and 127); V := V shr 7;
      P[3] := AnsiChar(32+V and 127); V := V shr 7;
      P[4] := AnsiChar(65+V);
      Inc(P,5);
      dec(CharCount,5);
    end else begin
      P^ := AnsiChar(32+Random(224));
      inc(P);
      dec(CharCount);
    end;
  end;
end;

procedure TestCP;
var
  CP, i: Integer;
  ST: array[0..10000] of RawByteString;
  Sha: PConvTable;
begin
  for i := 0 to high(ST) do ST[i] := TSynTestCaseRandomString(i shr 3);

  for CP:=1251 to 1252 do begin;
    if CP=1251 then Sha:=pConvDefault else Sha:=pConvLatin;

    for i := 0 to high(ST) do
    if ShaUTF8ToAnsi(ShaAnsiToUTF8(ST[i],Sha),Sha)<>ST[i] then begin;
      Form1.Memo1.Lines.Add('ShaAnsiToUTF8'+IntToStr(CP));
      Form1.Memo1.Lines.Add(ST[i]);
      break;
      end;

    for i := 0 to high(ST) do
    if ShaUnicodeToAnsi(ShaAnsiToUnicode(ST[i],Sha),Sha)<>ST[i] then begin;
      Form1.Memo1.Lines.Add('ShaAnsiToUnicode'+IntToStr(CP));
      Form1.Memo1.Lines.Add(ST[i]);
      break;
      end;
    end;
  end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  TestCP;
  Form1.Memo1.Lines.Add('TestCP done');
  end;

No errors found in all tests.
Can you please point more details about AV for CP 1252.

#17 Re: mORMot Framework » Fast Ansi/Unicode conversion » 2012-02-10 07:19:47

Sha

I use text of same length in both languages (Russian and English). Measure differs in 15 times.

So in real applications for long texts it is better to use win1251 coding but not utf8. It makes difficult using of JSON in my case.

#18 Re: mORMot Framework » Fast Ansi/Unicode conversion » 2012-02-09 20:46:34

Sha

My default code page is 1251. It can be used for test with all russian and english (ASCII) texts.

I think 15 ms resolution of GetTickCount is sufficient in our case.

Changing to CurrentAnsiConvert.AnsiToUTF8()/UTF8ToAnsi() shows the same speed.

   AnsiToUtf8 time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha        2547   140   594
WinAnsi       0   203     0
Syn        3422   203   313
Sha        2547   140   594
WinAnsi       0   203     0
Syn        3422   203   313

   Utf8ToAnsi time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha        2297   141   609
WinAnsi       0   219     0
Syn        3047   203   344
Sha        2234   156   610
WinAnsi       0   203     0
Syn        3062   203   329

#19 Re: mORMot Framework » Fast Ansi/Unicode conversion » 2012-02-09 20:20:19

Sha

I have used russian and english texts. Very simple tests. Some of them are here. 

var
  //137 chars
  s1251: RawByteString='Лишь годные дятлы собираются в стаи, юникодом пугая мозги января. Их песни не стихнут, они не устанут. А елка как кактус беспокоит меня. ';
  s1252: RawByteString='Only woodpeckers gather in flights, they frighten with unicode the brains of January. They were not tired, their songs will not abate.   ';
  a1251, a1252, a255: RawByteString;
  u1251, u1252, u255: RawUnicode;
  t1251, t1252, t255: RawUTF8;

  Ticks: array[0..18] of cardinal;
  TicksLast: integer;
  Iterations: integer;

function NextTimer: integer;
begin;
  inc(TicksLast);
  if TicksLast<=High(Ticks) then Ticks[TicksLast]:=GetTickCount;
  Result:=Iterations;
  end;

function FirstTimer: integer;
begin;
  TicksLast:=-1;
  Result:=NextTimer;
  end;

//--------------------------------------------------------------------------------------------------
procedure TForm1.FormCreate(Sender: TObject);
var
  i: integer;
begin;
  Iterations:=10000;
  a1251:=s1251;
  a1252:=s1252;
{}
  for i:=1 to 8 do begin;
    a1251:=a1251 + a1251;
    a1252:=a1252 + a1252;
    end;
  //35209 chars
  a1251:=a1251 + s1251;
  a1252:=a1252 + s1252;
{}
  a255:=a1252; a255[1]:=#255;

  u1251:=ShaAnsiToUnicode(a1251,pConvDefault);
  u1252:=ShaAnsiToUnicode(a1252,pConvDefault);
  u255 :=ShaAnsiToUnicode(a255, pConvDefault);

  t1251:=ShaUnicodeToUTF8(u1251);
  t1252:=ShaUnicodeToUTF8(u1252);
  t255 :=ShaUnicodeToUTF8(u255);

  if (ShaUnicodeToAnsi(u1251,pConvDefault)<>a1251)
  or (ShaUTF8ToUnicode(t1251)<>u1251)
  or (ShaAnsiToUTF8(a1251,pConvDefault)<>t1251)
  or (ShaUTF8ToAnsi(t1251,pConvDefault)<>a1251)
  then ShowMessage('Error in FormCreate');
  end;

//--------------------------------------------------------------------------------------------------
procedure TForm1.ShowResult(const Title: string);
begin;
  Memo1.Lines.Add       ('');
  Memo1.Lines.Add       ('   ' + Title + ' time, ms');
  Memo1.Lines.Add       ('============================');
  Memo1.Lines.Add       ('                Charset');
  Memo1.Lines.Add       ('Functions  1251  1252  1252*');
  Memo1.Lines.Add       ('----------------------------');
  Memo1.Lines.Add(Format('Sha       %5d %5d %5d',[Ticks[01]-Ticks[00],
                                                  Ticks[02]-Ticks[01],
                                                  Ticks[03]-Ticks[02]]));
  Memo1.Lines.Add(Format('WinAnsi   %5d %5d %5d',[Ticks[04]-Ticks[03],
                                                  Ticks[05]-Ticks[04],
                                                  Ticks[06]-Ticks[05]]));
  Memo1.Lines.Add(Format('Syn       %5d %5d %5d',[Ticks[07]-Ticks[06],
                                                  Ticks[08]-Ticks[07],
                                                  Ticks[09]-Ticks[08]]));
  Memo1.Lines.Add(Format('Sha       %5d %5d %5d',[Ticks[10]-Ticks[09],
                                                  Ticks[11]-Ticks[10],
                                                  Ticks[12]-Ticks[11]]));
  Memo1.Lines.Add(Format('WinAnsi   %5d %5d %5d',[Ticks[13]-Ticks[12],
                                                  Ticks[14]-Ticks[13],
                                                  Ticks[15]-Ticks[14]]));
  Memo1.Lines.Add(Format('Syn       %5d %5d %5d',[Ticks[16]-Ticks[15],
                                                  Ticks[17]-Ticks[16],
                                                  Ticks[18]-Ticks[17]]));
  end;

//--------------------------------------------------------------------------------------------------
procedure TForm1.bAnsiToUtf8Click(Sender: TObject);
var
  t: RawUTF8;
  i: integer;
begin;
  i:=FirstTimer;
  repeat;

    repeat;
      t:=ShaAnsiToUtf8(a1251,pConvDefault);
      t:=ShaAnsiToUtf8(a1251,pConvDefault);
      dec(i); until i=0; i:=NextTimer;
    repeat;
      t:=ShaAnsiToUtf8(a1252,pConvDefault);
      t:=ShaAnsiToUtf8(a1252,pConvDefault);
      dec(i); until i=0; i:=NextTimer;
    repeat;
      t:=ShaAnsiToUtf8(a255,pConvDefault);
      t:=ShaAnsiToUtf8(a255,pConvDefault);
      dec(i); until i=0; i:=NextTimer;

    repeat;
      dec(i); until i=0; i:=NextTimer;
    repeat;
      t:=WinAnsiToUtf8(a1252);
      t:=WinAnsiToUtf8(a1252);
      dec(i); until i=0; i:=NextTimer;
    repeat;
      dec(i); until i=0; i:=NextTimer;

    repeat;
      AnsiCharToUTF8(pointer(a1251),length(a1251),t,1251);
      AnsiCharToUTF8(pointer(a1251),length(a1251),t,1251);
      dec(i); until i=0; i:=NextTimer;
    repeat;
      AnsiCharToUTF8(pointer(a1252),length(a1252),t,1251);
      AnsiCharToUTF8(pointer(a1252),length(a1252),t,1251);
      dec(i); until i=0; i:=NextTimer;
    repeat;
      AnsiCharToUTF8(pointer(a255),length(a255),t,1251);
      AnsiCharToUTF8(pointer(a255),length(a255),t,1251);
      dec(i); until i=0; i:=NextTimer;

    until TicksLast>9;
  ShowResult('AnsiToUtf8');
  end;

//--------------------------------------------------------------------------------------------------
procedure TForm1.bUTF8ToAnsiClick(Sender: TObject);
var
  t: AnsiString;
  i: integer;
begin;
  i:=FirstTimer;
  repeat;

    repeat;
      t:=ShaUTF8ToAnsi(t1251,pConvDefault);
      t:=ShaUTF8ToAnsi(t1251,pConvDefault);
      dec(i); until i=0; i:=NextTimer;
    repeat;
      t:=ShaUTF8ToAnsi(t1252,pConvDefault);
      t:=ShaUTF8ToAnsi(t1252,pConvDefault);
      dec(i); until i=0; i:=NextTimer;
    repeat;
      t:=ShaUTF8ToAnsi(t255,pConvDefault);
      t:=ShaUTF8ToAnsi(t255,pConvDefault);
      dec(i); until i=0; i:=NextTimer;

    repeat;
      dec(i); until i=0; i:=NextTimer;
    repeat;
      t:=Utf8ToWinAnsi(t1252);
      t:=Utf8ToWinAnsi(t1252);
      dec(i); until i=0; i:=NextTimer;
    repeat;
      dec(i); until i=0; i:=NextTimer;

    repeat;
      t:=Utf8ToString(t1251);
      t:=Utf8ToString(t1251);
      dec(i); until i=0; i:=NextTimer;
    repeat;
      t:=Utf8ToString(t1252);
      t:=Utf8ToString(t1252);
      dec(i); until i=0; i:=NextTimer;
    repeat;
      t:=Utf8ToString(t255);
      t:=Utf8ToString(t255);
      dec(i); until i=0; i:=NextTimer;

    until TicksLast>9;
  ShowResult('Utf8ToAnsi');
  end;

#20 Re: mORMot Framework » Fast Ansi/Unicode conversion » 2012-02-09 18:41:51

Sha

Speed tests (2 times each test)
of new ShaUnicode functions (unbuffered version 2012-02-09)
and new framework functions at E6850

   AnsiToUnicode time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha         562   563   562
WinAnsi           735
Syn         734   735   734
Sha         562   563   547
WinAnsi           734
Syn         750   735   734

   UnicodeToAnsi time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha         500   531   531
WinAnsi           813
Syn        1109  1109  1110
Sha         500   531   531
WinAnsi           813
Syn        1109  1109  1110

   UnicodeToUtf8 time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha        2266   359   485
Syn        2734   360   718
Sha        2266   359   469
Syn        2734   375   704

   AnsiToUtf8 time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha        2547   140   485
WinAnsi           218
Syn        3422   203   329
Sha        2546   141   484
WinAnsi           219
Syn        3422   203   328

   Utf8ToUnicode time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha        2109   453   406
Syn        2500   516   687
Sha        2110   453   406
Syn        2500   516   687

   Utf8ToAnsi time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha        2297   141   609
WinAnsi           219
Syn        3047   218   329
Sha        2250   156   609
WinAnsi           203
Syn        3063   219   328

Code page 1252 is very fast, but 1251 is not.

#21 Re: mORMot Framework » Fast Ansi/Unicode conversion » 2012-02-09 11:14:35

Sha

Yes, I see :-)
But the question is
why sometimes the framework calculates data in the buffer and then copies data to result string,
and sometimes it calculates data in the result string and then reallocates it?

#22 Re: mORMot Framework » Fast Ansi/Unicode conversion » 2012-02-09 08:46:12

Sha

Yes, I see.
The question is why the framework uses SetString (buffered result) in one case and SetLength (unbuffered result) in another case?

#23 Re: mORMot Framework » Fast Ansi/Unicode conversion » 2012-02-08 19:57:32

Sha

Thanks!
I will test it tomorrow.

One question.
Why TSynAnsiConvert.AnsiBufferToRawUTF8 calls Getmem, but RawUnicodeToUtf8 doesn't ?

#24 Re: mORMot Framework » Fast Ansi/Unicode conversion » 2012-02-06 10:32:39

Sha
ab wrote:

Thanks a lot for sharing your code!

Some remarks:
- ShaAnsiToUnicode() will be correct only for 7 bits ascii - so I guess this is about Ansi7ToString() which is not often called so I did not modified it;
- I've updated RawUnicodeToUtf8(), WinAnsiBufferToUtf8(), UTF8ToWideChar(), UTF8ToWinPChar() to handle any trailing 7 bit ASCII AnsiChars, by pairs - this is a very nice trick in practice;
- I did not introduce CP 1252 specific optimization yet, since I'd like to implement a clean class-driven approach here - I've added it to the mORMot roadmap.

See http://synopse.info/fossil/info/4be9156a97


- ShaAnsiToUnicode() uses pointer to the translation table as well as all Ansi-functions from ShaUnicode.pas. So it trasforms correctly any Ansi-characters. You just need to call the function with pConvLatin or pConvDefault as second parameter.

- I think it is simple to use new unicode functions from ShaUnicode.pas changing *all* calls in SynCommons.pas:
WinAnsiToRawUnicode/StringToRawUnicode/... --> ShaAnsiToUnicode
RawUnicodeToWinAnsi/RawUnicodeToString/... --> ShaUnicodeToAnsi
WinAnsiToUTF8/AnsiCharToUTF8/... --> ShaAnsiToUTF8
RawUnicodeToUTF8/... --> ShaUnicodeToUTF8
UTF8DecodeToRawUnicode/... --> ShaUTF8ToUnicode
UTF8ToString/... --> ShaUTF8ToAnsi

- Main optimization for win1252 and user default code page is done in ShaUnicode.pas. Of course, we always may do more.

#25 Re: mORMot Framework » Fast Ansi/Unicode conversion » 2012-02-04 18:45:44

Sha

Hi

New unicode functions are here
Some coments (russian) are here

Timings at E6850, table headers:
   Charset 1251 - russian text
   Charset 1252 - english (ASCII) text
   Charset 1252* - english (ASCII) text, but first char is russian
   Sha - functions from ShaUnicode.pas
   Syn - functions from SynCommons.pas
   WinAnsi - special functions from SynCommons.pas for WinAnsiString

   AnsiToUnicode time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha         563   562   547
WinAnsi           766
Syn         734   734   735

   UnicodeToAnsi time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha         500   531   515
WinAnsi           938
Syn         828  1703   813

   UnicodeToUtf8 time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha         563   422   547
Syn         734   703   719

   AnsiToUtf8 time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha         594   219   594
WinAnsi           593
Syn        1469  1438  1453

   Utf8ToUnicode time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha         547   594   563
Syn         671   641   672

   Utf8ToAnsi time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha         687   219   703
WinAnsi           594
Syn        1047  1015  1032

Results at i5-2300

   AnsiToUnicode time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha         468   452   468
WinAnsi       0   718     0
Syn         686   702   702

   UnicodeToAnsi time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha         390   375   390
WinAnsi       0   936     0
Syn         982  1685   983

   UnicodeToUtf8 time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha         530   406   530
Syn         702   702   702

   AnsiToUtf8 time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha         562   171   562
WinAnsi       0   546     0
Syn        1404  1388  1404

   Utf8ToUnicode time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha         499   530   499
Syn         609   608   609

   Utf8ToAnsi time, ms
============================
                Charset
Functions  1251  1252  1252*
----------------------------
Sha         655   218   640
WinAnsi       0   562     0
Syn        1201  1185  1186

Some bugs fixed in ShaUnicode.pas 2012-02-05

#26 Re: mORMot Framework » Fast Ansi/Unicode conversion » 2012-01-23 17:56:46

Sha

I have tested some new functions. They are faster than framework's ones even on WinAnsiStrings.
I hope I will write full set of Ansi/Unicode/UTF8 conversions in a week.

#28 mORMot Framework » Fast Ansi/Unicode conversion » 2012-01-10 21:27:52

Sha
Replies: 26

Hi, Arnaud.
Here is my module for fast conversion

//Sha 2012
unit SynConversionTables;

interface

uses
  Windows,  // GetACP
  SysUtils; // PWordArray

type
  PConversionTable= ^TConversionTable;
  TConversionTable= packed record
    WideToAnsiW: packed array[0..127] of word;
    WideToAnsiA: packed array[0..127] of byte;
    WideToAnsiCount: integer;
    CodePage: integer;                     //table code page
    pDataExt: pWordArray;
    DataCount: integer;
    AnsiToWide:  packed array[0..255] of word;
    end;

var //READ ONLY VARS
  pSynTableDefault: PConversionTable= nil; //table for default code page
  pSynTable1252:    PConversionTable= nil; //win1252 Latin table
  pSynTable1251:    PConversionTable= nil; //win1251 Cyrillic table

type
  PtrInt= {$ifdef UNICODE} NativeInt {$else} integer {$endif};

//extended version of WinAnsiTableSortedFind
function FindAnsiChar(wc: cardinal; pTable: PConversionTable): PtrInt;

//get table by index
function GetSynTable(i: integer): PConversionTable;

//test of conversion tables
function TestSynTables: boolean;

implementation

const
  Win1252Ext: packed array[0..31] of word = (
    8364,  129, 8218,  402, 8222, 8230, 8224, 8225,  710, 8240,  352, 8249,  338,  141,  381,  143,
     144, 8216, 8217, 8220, 8221, 8226, 8211, 8212,  732, 8482,  353, 8250,  339,  157,  382,  376);
  Win1251Ext: packed array[0..127] of word = (
    1026, 1027, 8218, 1107, 8222, 8230, 8224, 8225, 8364, 8240, 1033, 8249, 1034, 1036, 1035, 1039,
    1106, 8216, 8217, 8220, 8221, 8226, 8211, 8212,  152, 8482, 1113, 8250, 1114, 1116, 1115, 1119,
     160, 1038, 1118, 1032,  164, 1168,  166,  167, 1025,  169, 1028,  171,  172,  173,  174, 1031,
     176,  177, 1030, 1110, 1169,  181,  182,  183, 1105, 8470, 1108,  187, 1112, 1029, 1109, 1111,
    1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055,
    1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071,
    1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087,
    1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103);

var
  SynDefaultExt: packed array[0..127] of word;
  SynDefaultCP: integer= 0; //default code page
  SynTables: array of array of word;

function FindAnsiChar(wc: cardinal; pTable: PConversionTable): PtrInt;
var
  Cur, Left, Right: PtrInt;
begin;
  //fast search of main 32 letters for win1251
  if (pTable=pSynTable1251) and (cardinal(wc-1040)<32) then begin;
    Result:=wc-848;
    exit;
    end;
  Right:=pTable.WideToAnsiCount;
  Left:=0;
  while Left<Right do begin;
    Cur:=(Left + Right) shr 1;
    if wc>pTable.WideToAnsiW[Cur] then Right:=Cur else Left:=Cur + 1;
    end;
  dec(Right);
  if (Right>=0) and (wc=pTable.WideToAnsiW[Right])
  then Result:=pTable.WideToAnsiA[Right]
  else Result:=-1;
  end;

function GetSynTable(i: integer): PConversionTable;
begin;
  if (i>=0) and (i<Length(SynTables))
  then Result:=@SynTables[i,0]
  else Result:=nil;
  end;

function TestFindChars(pTable: PConversionTable): integer;
var
  i: integer;
begin;
  Result:=0;
  for i:=0 to pTable.DataCount-1 do if pTable.pDataExt[i]>255 then begin;
    dec(Result);
    if FindAnsiChar(pTable.pDataExt[i],pTable)<>i+128 then exit;
    end;
  Result:=-Result;
  end;

function TestCountChars(pTable: PConversionTable): integer;
var
  i: integer;
begin;
  Result:=0;
  for i:=$100 to $FFFF do if FindAnsiChar(i,pTable)>=0 then inc(Result);
  end;

function TestSynTables: boolean;
var
  i, FoundAll, CountAll: integer;
  pTable: PConversionTable;
begin;
  Result:=true;
  for i:=0 to Length(SynTables)-1 do begin;
    pTable:=@SynTables[i,0];
    FoundAll:=TestFindChars(pTable);
    CountAll:=TestCountChars(pTable);
    Result:=Result and (FoundAll>0) and (CountAll=FoundAll);
    end;
  end;

procedure AddConversionTable(CodePage: integer; pDataExt: PWordArray; DataCount: integer);
var
  save: array[0..127] of cardinal;
  tmp: cardinal;
  i, len, max: integer;
  pTable: PConversionTable;
begin;
  if (CodePage<=0) or (DataCount<=0) or (DataCount>128) then exit;

  len:=Length(SynTables);
  for i:=0 to len-1 do begin;
    pTable:=@SynTables[i,0];
    if pTable.CodePage=CodePage then exit;
    end;
  SetLength(SynTables,len+1);
  SetLength(SynTables[len], SizeOf(TConversionTable) div SizeOf(word));
  pTable:=@SynTables[len,0];
  pTable.CodePage:=CodePage;
  pTable.pDataExt:=pDataExt;
  pTable.DataCount:=DataCount;

  for i:=0 to 255 do pTable.AnsiToWide[i]:=i;
  len:=0;
  for i:=0 to DataCount-1 do if pDataExt[i]>255 then inc(len);
  pTable.WideToAnsiCount:=len;
  len:=0;
  max:=0;
  for i:=DataCount-1 downto 0 do begin;
    pTable.AnsiToWide[i+128]:=pDataExt[i];
    if pDataExt[i]>255 then begin;
      save[len]:=integer(pDataExt[i]) shl 8 or (i+128);
      if save[max]<save[len] then max:=len;
      inc(len);
      end;
    end;
  dec(len); // last index

  // insertion sort of save[0..len]
  tmp:=save[0]; save[0]:=save[max]; save[max]:=tmp;
  i:=1;
  while i<len do begin;
    inc(i);
    tmp:=save[i];
    if tmp>save[i-1] then begin;
      max:=i;
      repeat;
        save[max]:=save[max-1];
        dec(max);
        until tmp<=save[max-1];
      save[max]:=tmp;
      end;
    end;

  for i:=0 to len do begin;
    pTable.WideToAnsiW[i]:=save[i] shr 8;
    pTable.WideToAnsiA[i]:=byte(save[i]);
    end;
  if CodePage=1252 then pSynTable1252:=pTable;
  if CodePage=1251 then pSynTable1251:=pTable;
  if CodePage=SynDefaultCP then pSynTableDefault:=pTable;
  end;

function InitConversionTables: boolean;
var
  c: array[0..127] of byte;
  i: integer;
begin;
  SynDefaultCP:=GetACP;
  AddConversionTable(1252, @Win1252Ext[0], Length(Win1252Ext)); //Latin
  AddConversionTable(1251, @Win1251Ext[0], Length(Win1251Ext)); //Cyrillic
  if pSynTableDefault=nil then begin;
    for i:=0 to 127 do c[i]:=i+128;
    MultiByteToWideChar(SynDefaultCP,0,@c[0],128,@SynDefaultExt[0],128);
    AddConversionTable(SynDefaultCP, @SynDefaultExt[0], 128);
    end;
  Result:=(pSynTableDefault<>nil) and TestSynTables;
  end;

procedure FinalConversionTables;
begin;
  SynDefaultCP:=0;
  pSynTableDefault:=nil;
  pSynTable1252:=nil;
  pSynTable1251:=nil;
  SynTables:=nil;
  end;

initialization

  InitConversionTables;

finalization

  FinalConversionTables;

end.

I suggest to replace WinAnsiTableSortedFind(wc) with FindAnsiChar(wc,pSynTable1252),
for example:

//Sha: new version
function WideCharToWinAnsiChar(wc: cardinal): AnsiChar;
begin // code generated for this function is very fast
  if wc<256 then
    //if WinAnsiTable[wc]<256 then begin
    if pSynTable1252.AnsiToWide[wc]<256 then begin //Sha: use new table
      result := AnsiChar(wc);
      exit;
    end else begin
      result := ' '; // invalid ansi char for this code page (e.g. #128)
      exit;
    end else begin // wc>255:
      //wc := WinAnsiTableSortedFind(wc);
      wc:=FindAnsiChar(wc, pSynTable1252); //Sha: use new version of search
      if integer(wc)>=0 then
        result := AnsiChar(byte(wc)) else
        result := ' '; // space for invalid wide char
    exit;
  end;
end;

//Sha: new version
function WideCharToWinAnsi(wc: cardinal): integer;
begin
  if wc<256 then
    //if WinAnsiTable[wc]<256 then
    if pSynTable1252.AnsiToWide[wc]<256 then //Sha: use new table
      result := wc else
      result := -1 else // invalid ansi char for this code page (e.g. #128)
      //result := WinAnsiTableSortedFind(wc);
      result := FindAnsiChar(wc, pSynTable1252); //Sha: use new version of search
end;

//Sha: new version
function IsWinAnsiU(UTF8Text: PUTF8Char): boolean;
var c: Cardinal;
begin
  result := false;
  if UTF8Text<>nil then
    repeat
      c := byte(UTF8Text^); inc(UTF8Text);
      if c=0 then break else
      if c and $80=0 then
        continue else begin
        if UTF8Text^=#0 then break;
        if c and $20=0 then begin
          c := c shl 6+byte(UTF8Text^)-$00003080; inc(UTF8Text);
        end else begin
          c := c shl 6+byte(UTF8Text^); inc(UTF8Text);
          if UTF8Text^=#0 then break;
          c := c shl 6+byte(UTF8Text^)-$000E2080; inc(UTF8Text);
        end;
        if c>255 then begin
          //if WinAnsiTableSortedFind(c)<0 then
          if FindAnsiChar(c, pSynTable1252)<0 then //Sha: use new version of search
            exit; // invalid char in the WinAnsi code page
        end else
        //if WinAnsiTable[c]>255 then
        if pSynTable1252.AnsiToWide[c]>255 then //Sha: use new table
          exit; // invalid char in the WinAnsi code page
      end;
    until false;
  result := true;
end;

//Sha: new version
function UTF8ToWinPChar(dest: PAnsiChar; source: PUTF8Char; count: integer): integer;
var c: cardinal;
    begd: PAnsiChar;
    endSource: PUTF8Char;
begin
  result := 0;
  if source=nil then exit;
  begd := dest;
  endSource := source+count;
  repeat
    c := byte(source^); inc(source);
    if byte(c) and $80=0 then begin
      dest^ := AnsiChar(byte(c)); inc(dest);
      if source<endsource then continue else break;
    end else begin
      if source>=endsource then break;
      if c and $20=0 then begin
        c := c shl 6+byte(source^)-$00003080; inc(source);
        if c and $ffffff00=0 then begin
          //if WinAnsiTable[c]>255 then
          if pSynTable1252.AnsiToWide[c]>255 then //Sha: use new table
            dest^ := ' ' else // invalid char in the WinAnsi code page
            dest^ := AnsiChar(c);
          inc(dest);  // #128..#255 -> direct copy
          if source<endsource then continue else break;
        end;
      end else begin
        c := c shl 6+byte(source^); inc(source);
        if source>=endsource then break;
        c := c shl 6+byte(source^)-$000E2080; inc(source);
      end;
      // #256.. -> slower but accurate conversion
      //c := WinAnsiTableSortedFind(c);
      c := FindAnsiChar(c, pSynTable1252); //Sha: use new version of search
      if integer(c)>=0 then begin
        dest^ := AnsiChar(Byte(c)); // don't add invalid wide char
        inc(dest);
      end;
      if source>=endsource then break;
    end;
  until false;
  result := dest-begd;
end;

//Sha: new version
procedure UTF8ToShortString(var dest: shortstring; source: PUTF8Char);
var c: cardinal;
    len: integer;
begin
  len := 0;
  if source<>nil then
  repeat
    c := byte(source^); inc(source);
    if c=0 then break else
    if c and $80=0 then begin
      inc(len); dest[len] := AnsiChar(c);
      if len<255 then continue else break;
    end else begin
      if source^=#0 then break;
      if c and $20=0 then begin
        c := c shl 6+byte(source^)-$00003080; inc(source);
      end else begin
        c := c shl 6+byte(source^); inc(source);
        if source^=#0 then break;
        c := c shl 6+byte(source^)-$000E2080; inc(source);
      end;
      // #256.. -> slower but accurate conversion
      inc(len);
      //c := WinAnsiTableSortedFind(c);
      c := FindAnsiChar(c, pSynTable1252); //Sha: use new version of search
      if integer(c)<0 then
        c := ord('?');
      dest[len] := AnsiChar(byte(c)); // #128..#255 -> direct copy
      if len<255 then continue else break;
    end;
  until false;
  dest[0] := AnsiChar(len);
end;

//Sha: new version
procedure RawUnicodeToWinPChar(dest: PAnsiChar; source: PWideChar; WideCharCount: Integer);
var i: integer;
    wc: integer;
begin
  for i := 0 to WideCharCount-1 do begin
    wc := integer(source[i]);
    if wc<256 then
    //if WinAnsiTable[wc]<256 then
    if pSynTable1252.AnsiToWide[wc]<256 then //Sha: use new table
      dest[i] := AnsiChar(wc) else
      dest[i] := ' ' else begin
      //wc := WinAnsiTableSortedFind(wc);
      wc := FindAnsiChar(wc, pSynTable1252); //Sha: use new version of search
      if integer(wc)>=0 then
        dest[i] := AnsiChar(byte(wc)) else
        dest[i] := ' '; // space for invalid wide char
    end;
  end;
end;

It is easy to create new fast common functions w/o Windows API
by adding parameter, for example

//Sha: new function
function WideCharToSynAnsiChar(wc: cardinal; pSynTable: PConversionTable): AnsiChar;
begin // code generated for this function is very fast
  if wc<256 then
    if pSynTable.AnsiToWide[wc]<256 then begin
      result := AnsiChar(wc);
      exit;
    end else begin
      result := ' '; // invalid ansi char for this code page (e.g. #128)
      exit;
    end else begin // wc>255:
      wc:=FindAnsiChar(wc, pSynTable);
      if integer(wc)>=0 then
        result := AnsiChar(byte(wc)) else
        result := ' '; // space for invalid wide char
    exit;
  end;
end;

Unit has internal full self-test:

//how to validate all tables in use
procedure TForm1.bValidateClick(Sender: TObject);
const
  msg: array[boolean] of string= ('failed', 'passed');
begin;
  Memo1.Lines.Add('Test of conversion tables ' + msg[TestSynTables]);
  end;

It is easy to add support for other code pages.
Just copy/paste data from TMemo to the unit.

//how to fill your default table
procedure TForm1.bShowWideClick(Sender: TObject);
var
  c: array[0..127] of byte;
  w: array[0..127] of word;
  i: integer;
begin;
  for i:=0 to 127 do c[i]:=i+128;
  MultiByteToWideChar(GetACP,0,@c[0],128,@w[0],128);
  i:=0;
  while i<=128-16 do begin;
    Memo1.Lines.Add(Format('{%d:} %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, ',
                          [i+128, w[i+0],w[i+1],w[i+2], w[i+3], w[i+4], w[i+5], w[i+6], w[i+7],
                                  w[i+8],w[i+9],w[i+10],w[i+11],w[i+12],w[i+13],w[i+14],w[i+15]]));
    i:=i+16;
    end;
  end;

#29 Re: mORMot Framework » Low level common UTF-8 test failed for code page 1251 » 2012-01-09 21:37:43

Sha

Yes.
Here is code for demonstration the idea.

const
  Ansi1252: packed array[128..159] of word = (
    8364,  129, 8218,  402, 8222, 8230, 8224, 8225,  710, 8240,  352, 8249,  338,  141,  381,  143,
     144, 8216, 8217, 8220, 8221, 8226, 8211, 8212,  732, 8482,  353, 8250,  339,  157,  382,  376);
  Ansi1251: packed array[128..255] of word = (
    1026, 1027, 8218, 1107, 8222, 8230, 8224, 8225, 8364, 8240, 1033, 8249, 1034, 1036, 1035, 1039,
    1106, 8216, 8217, 8220, 8221, 8226, 8211, 8212,  152, 8482, 1113, 8250, 1114, 1116, 1115, 1119,
     160, 1038, 1118, 1032,  164, 1168,  166,  167, 1025,  169, 1028,  171,  172,  173,  174, 1031,
     176,  177, 1030, 1110, 1169,  181,  182,  183, 1105, 8470, 1108,  187, 1112, 1029, 1109, 1111,
    1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055,
    1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071,
    1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087,
    1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103);

type
  TConversionTable= record
    CodePage: integer;
    WideToAnsiLast: integer;
    WideToAnsi: array of cardinal; //0..WideToAnsiLast
    AnsiToWide: array of word;     //0..255
    end;
  PConversionTable= ^TConversionTable;

var
  ConversionTable: array of TConversionTable;
  ConversionTableLast: integer= -1;

procedure AddConversionTable(CodePage: integer; PWA: PWordArray; Count: integer);
var
  i, len, min: integer;
  tmp: cardinal;
  pct: PConversionTable;
begin;
  if CodePage=0 then CodePage:=GetACP;

  for i:=0 to ConversionTableLast do if ConversionTable[i].CodePage=CodePage then exit;
  inc(ConversionTableLast);
  SetLength(ConversionTable, ConversionTableLast+1);
  pct:=@ConversionTable[ConversionTableLast];
  pct.CodePage:=CodePage;

  SetLength(pct.AnsiToWide,256);
  for i:=0 to 255 do pct.AnsiToWide[i]:=i;
  len:=0;
  for i:=0 to Count-1 do if PWA[i]>255 then inc(len);
  SetLength(pct.WideToAnsi,len);
  pct.WideToAnsiLast:=len-1;
  len:=0;
  min:=0;
  for i:=0 to Count-1 do begin;
    pct.AnsiToWide[i+128]:=PWA[i];
    if PWA[i]>255 then begin;
      pct.WideToAnsi[len]:=integer(PWA[i]) shl 8 or (i+128);
      if pct.WideToAnsi[min]>pct.WideToAnsi[len] then min:=len;
      inc(len);
      end;
    end;

  // insertion sort of pct.WideToAnsi
  if min>0 then begin;
    tmp:=pct.WideToAnsi[0];
    pct.WideToAnsi[0]:=pct.WideToAnsi[min];
    pct.WideToAnsi[min]:=tmp;
    end;
  dec(len); // last index
  i:=1;
  while i<len do begin;
    inc(i);
    if pct.WideToAnsi[i]<pct.WideToAnsi[i-1] then begin;
      tmp:=pct.WideToAnsi[i];
      min:=i;
      repeat;
        pct.WideToAnsi[min]:=pct.WideToAnsi[min-1];
        dec(min);
        until tmp>=pct.WideToAnsi[min-1];
      pct.WideToAnsi[min]:=tmp;
      end;
    end;
  end;

procedure InitConversionTables;
begin;
  AddConversionTable(1252, @Ansi1252[Low(Ansi1252)], High(Ansi1252)-Low(Ansi1252)+1);
  AddConversionTable(1251, @Ansi1251[Low(Ansi1251)], High(Ansi1251)-Low(Ansi1251)+1);
  end;

function FindAnsiChar(wc: cardinal; CP: integer= 1252): integer;
var
  i, left, right: PtrInt;
  pct: PConversionTable;
begin;
  i:=ConversionTableLast;
  while (i>=0) and (ConversionTable[i].CodePage<>CP) do dec(i);
  if i>=0 then begin;
    pct:=@ConversionTable[i];
    right:=pct.WideToAnsiLast;
    left:=-1;
    wc:=wc shl 8;
    while left<right do begin;
      i:=(left + right + 1) shr 1;
      if pct.WideToAnsi[i]<wc then left:=i else right:=i - 1;
      end;
    inc(left);
    if left<=pct.WideToAnsiLast then begin;
      wc:=wc xor pct.WideToAnsi[left];
      if wc<256 then begin;
        Result:=wc;
        exit;
        end;
      end;
    end;
  Result:=-1; // invalid wide char or CP not found
  end;

function TestFindChars(CodePage: integer; PWA: PWordArray; Count: integer): boolean;
var
  i: integer;
begin;
  Result:=false;
  for i:=0 to Count-1 do if (PWA[i]>255) and (FindAnsiChar(PWA[i],CodePage)<>i+128) then exit;
  Result:=true;
  end;

function TestCountChars(CodePage: integer): integer;
var
  i: integer;
begin;
  Result:=0;
  for i:=$100 to $FFFF do if FindAnsiChar(i,CodePage)>=0 then inc(Result);
  end;

procedure TForm1.Button4Click(Sender: TObject);
const
  msg: array[boolean] of string= ('failed', 'OK');
var
  FoundAll: boolean;
  CountAll: integer;
begin;
  InitConversionTables;

  FoundAll:=TestFindChars(1252, @Ansi1252[Low(Ansi1252)], High(Ansi1252)-Low(Ansi1252)+1);
  CountAll:=TestCountChars(1252);
  Memo1.Lines.Add(Format('CP1252: test1 %s, test2 %s',[msg[FoundAll], msg[CountAll=27]]));

  FoundAll:=TestFindChars(1251, @Ansi1251[Low(Ansi1251)], High(Ansi1251)-Low(Ansi1251)+1);
  CountAll:=TestCountChars(1251);
  Memo1.Lines.Add(Format('CP1251: test1 %s, test2 %s',[msg[FoundAll], msg[CountAll=112]]));
  end;

procedure TForm1.Button2Click(Sender: TObject);
var
  c: array[0..255] of byte;
  w: array[0..255] of word;
  i: integer;
begin;
  for i:=0 to 255 do c[i]:=i;
  MultiByteToWideChar(1251,0,@c[0],256,@w[0],256);
  i:=128;
  while i<=256-16 do begin;
    Memo1.Lines.Add(Format('%d:   %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, ',
                            [i, w[i+0],w[i+1],w[i+2], w[i+3], w[i+4], w[i+5], w[i+6], w[i+7],
                                w[i+8],w[i+9],w[i+10],w[i+11],w[i+12],w[i+13],w[i+14],w[i+15]]));
    i:=i+16;
    end;
  end;

#30 Re: mORMot Framework » Low level common UTF-8 test failed for code page 1251 » 2012-01-08 16:59:18

Sha

OK.
1 using found in SynPdf. (Note: pdf test also failed on CP 1251).

function TPdfWrite.ToWideChar(const Ansi: PDFString; out DLen: Integer): PWideChar;
var L: integer;
begin
  L := Length(Ansi)*2+2; // maximum possible length
  getmem(result,L);
  if FCodePage=CODEPAGE_US then begin // use our internal fast conversion
    DLen := Length(Ansi);
    WinAnsiToUnicodeBuffer(WinAnsiString(Ansi), pointer(result), DLen+1);
  end else begin
    {$IFDEF MSWINDOWS}
    DLen := MultiByteToWideChar(FCodePage, 0, Pointer(Ansi), length(Ansi), result, L);
    result[DLen] := #0;
    {$ENDIF}
    {$IFDEF LINUX}
    StringToWideChar(Ansi, result, L); // only work with current system CharSet
    DLen := 0; while result[DLen]<>#0 do inc(DLen);
    {$ENDIF}
  end;
end;

WinAnsiString is used here for acceleration of work with CP 1252.
No specificity. So WinAnsiString = Win1252String.

I suggest acceleration for all Ansi-strings. Why not?

Certainly the code guarantees correct transformation for CP 1252.
It is good to do the same for CP 1251 and for others.

#31 Re: mORMot Framework » Low level common UTF-8 test failed for code page 1251 » 2012-01-08 10:11:51

Sha
ab wrote:

WinAnsiString are CP 1252 exactly, by definition - and some units in the framework expect this behavior.
I do not want to make it Code Page independent.

Сan I learn, what for it is necessary?

#32 Re: mORMot Framework » Low level common UTF-8 test failed for code page 1251 » 2012-01-07 22:47:25

Sha

Details are here.
Procedure TTestLowLevelCommon._UTF8 in statement U := WinAnsiToUtf8(W);
calls WinAnsiToUtf8 -> WinAnsiBufferToUtf8 -> c := WinAnsiTable[c];
and assertion failed on Check(StringToUTF8(UTF8ToString(U))=U);

I am thinking about version for user defined CP (1251/1252 or other supported page)
to work with any Ansi-strings including Cyrillic strings. 
I am going to move all unicode support to different unit for easy replacement with framework unit.
Is it possible?

The idea is filling or switching tables of fast transformations (for example, WinAnsiTable) dynamically.
Framework fills tables when the user globally set value of his CP (1251/1252 or other).
Calls of GetACP become unnecessary.

#33 mORMot Framework » Low level common UTF-8 test failed for code page 1251 » 2012-01-07 14:54:41

Sha
Replies: 10

I found that function WinAnsiBufferToUtf8 uses WinAnsiTable (for code page 1252) only.

Second question. How to speed up working with my default code page (1251)?

Board footer

Powered by FluxBB