Low level common UTF-8 test failed for code page 1251

Sha · 2012-01-07 14:54:41

I found that the function WinAnsiBufferToUtf8 uses only WinAnsiTable (for code page 1252).

Second question: how can I speed up work with my default code page (1251)?

ab · 2012-01-07 20:17:20

WinAnsiBufferToUtf8 is indeed for code page 1252 only.

The only special field encoding is WinAnsiString.
You'll have to use RawUTF8 for other code pages.

You could create your own version for CP 1251, but without touching the current code.
It is possible to add a CyrillicAnsiString kind of field to the framework.

But for real code, I'm not sure it will be much faster than the current UTF-8 implementation.

If you find some error in the current code test, please be more specific about the issue.

Sha · 2012-01-07 22:47:25

Details are here.
In procedure TTestLowLevelCommon._UTF8, the statement U := WinAnsiToUtf8(W);
calls WinAnsiToUtf8 -> WinAnsiBufferToUtf8 -> c := WinAnsiTable[c];
and the assertion then fails on Check(StringToUTF8(UTF8ToString(U))=U);

I am thinking about a version for a user-defined CP (1251, 1252 or any other supported page)
that works with any Ansi strings, including Cyrillic strings.
I am going to move all Unicode support into a separate unit so that it can easily be swapped with the framework unit.
Is that possible?

The idea is to fill or switch the fast conversion tables (for example, WinAnsiTable) dynamically.
The framework fills the tables when the user globally sets the value of his CP (1251, 1252 or other).
Calls to GetACP then become unnecessary.

Last edited by Sha (2012-01-08 07:20:05)

ab · 2012-01-08 09:39:45

Check(StringToUTF8(UTF8ToString(U))=U) fails because characters in U are not part of CP 1251.

So the TTestLowLevelCommon._UTF8 tests have an issue: they work only when the system code page is CP 1252.
I'll need to fix it.
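
A minimal sketch of why such a round trip fails, assuming a non-Unicode Delphi where string is an AnsiString in the system code page, so that UTF8ToString has to squeeze every character into CP 1251. The helpers UnicodeToCp1251, Cp1251ToUnicode and RoundTrip are hypothetical and deliberately partial (ASCII plus U+0410..U+044F only); they are not framework functions.

```pascal
program RoundTripSketch;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}

const
  ReplacementByte = ord('?'); // typical substitute for an unmappable character

// Hypothetical, partial CP 1251 encoder: only ASCII and U+0410..U+044F.
// Returns -1 when the code point has no byte in this reduced mapping.
function UnicodeToCp1251(cp: cardinal): integer;
begin
  if cp < $80 then
    Result := integer(cp)
  else if (cp >= $0410) and (cp <= $044F) then
    Result := integer(cp - $0410 + $C0)
  else
    Result := -1;
end;

// Inverse of the partial mapping above; $80..$BF is not covered here.
function Cp1251ToUnicode(b: byte): cardinal;
begin
  if b < $80 then
    Result := b
  else if b >= $C0 then
    Result := $0410 + (b - $C0)
  else
    Result := $FFFD;
end;

// Unicode -> CP 1251 -> Unicode, substituting '?' for unmappable input.
function RoundTrip(cp: cardinal): cardinal;
var
  b: integer;
begin
  b := UnicodeToCp1251(cp);
  if b < 0 then
    b := ReplacementByte;
  Result := Cp1251ToUnicode(byte(b));
end;

begin
  writeln(RoundTrip($0416) = $0416); // Cyrillic capital Zhe survives
  writeln(RoundTrip($00E9) = $00E9); // e-acute (a CP 1252 character) is lost
end.
```

The second comparison is false because U+00E9 has no CP 1251 byte, which is the kind of character the WinAnsi test data contains.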

WinAnsiString is exactly CP 1252, by definition - and some units in the framework expect this behavior.
I do not want to make it code page independent.
There is already the AnsiString kind of string for this purpose.

Sha · 2012-01-08 10:11:51

ab wrote:

WinAnsiString is exactly CP 1252, by definition - and some units in the framework expect this behavior.
I do not want to make it code page independent.

May I ask why this is necessary?

ab · 2012-01-08 16:33:03

Just search for WinAnsiString kind of string.

You'll find some corresponding uses which expect it.

For instance, SynPdf has some optimizations specific to CP 1252 (this is the native code page of pdf content).

Sha · 2012-01-08 16:59:18

OK.
I found one use in SynPdf. (Note: the pdf test also failed on CP 1251.)

function TPdfWrite.ToWideChar(const Ansi: PDFString; out DLen: Integer): PWideChar;
var L: integer;
begin
  L := Length(Ansi)*2+2; // maximum possible length
  getmem(result,L);
  if FCodePage=CODEPAGE_US then begin // use our internal fast conversion
    DLen := Length(Ansi);
    WinAnsiToUnicodeBuffer(WinAnsiString(Ansi), pointer(result), DLen+1);
  end else begin
    {$IFDEF MSWINDOWS}
    DLen := MultiByteToWideChar(FCodePage, 0, Pointer(Ansi), length(Ansi), result, L);
    result[DLen] := #0;
    {$ENDIF}
    {$IFDEF LINUX}
    StringToWideChar(Ansi, result, L); // only work with current system CharSet
    DLen := 0; while result[DLen]<>#0 do inc(DLen);
    {$ENDIF}
  end;
end;

WinAnsiString is used here only to speed up work with CP 1252.
Nothing else is specific to it. So WinAnsiString = Win1252String.

I suggest the same acceleration for all Ansi strings. Why not?

Certainly the code guarantees a correct conversion for CP 1252.
It would be good to do the same for CP 1251 and for other code pages.

Last edited by Sha (2012-01-08 20:27:19)

ab · 2012-01-09 06:51:44

SynPDF also uses WinAnsi encoding at the lowest level of its implementation (e.g. for TrueType fonts: in fact, pdf expects two fonts to be declared, one as WinAnsi - CP 1252 - then one as Unicode).

Acceleration for all Ansi-strings can be implemented.
For non MBCS code pages, of course.
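
To illustrate why single-byte (non MBCS) code pages are the easy case: each input byte can index a 256-entry table of UTF-16 code units, and the result is then encoded as 1 to 3 UTF-8 bytes. The following is only a sketch under assumptions. AnsiToUtf8ViaTable, Cp1251ToWide and InitCp1251Sketch are hypothetical names, not SynCommons functions, and the table is filled correctly only for ASCII and the $C0..$FF Cyrillic block.

```pascal
program AnsiToUtf8Sketch;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}

uses
  SysUtils;

type
  TAnsiToWide = array[byte] of word; // one UTF-16 code unit per input byte

var
  Cp1251ToWide: TAnsiToWide; // hypothetical, partially filled table

procedure InitCp1251Sketch;
var
  i: integer;
begin
  // Identity first; $80..$BF stays identity only to keep the sketch short,
  // a real table needs the full CP 1251 mapping for that range.
  for i := 0 to 255 do
    Cp1251ToWide[i] := i;
  for i := $C0 to $FF do
    Cp1251ToWide[i] := $0410 + (i - $C0);
end;

// Converts a single-byte encoded string to UTF-8 bytes via a lookup table.
// Assumes the table holds no surrogates, so 3 bytes per character is enough.
function AnsiToUtf8ViaTable(const s: AnsiString;
  const table: TAnsiToWide): AnsiString;
var
  i, n: integer;
  w: word;
begin
  SetLength(Result, Length(s) * 3); // worst case
  n := 0;
  for i := 1 to Length(s) do begin
    w := table[ord(s[i])];
    if w < $80 then begin
      inc(n); Result[n] := AnsiChar(w);
    end else if w < $800 then begin
      inc(n); Result[n] := AnsiChar($C0 or (w shr 6));
      inc(n); Result[n] := AnsiChar($80 or (w and $3F));
    end else begin
      inc(n); Result[n] := AnsiChar($E0 or (w shr 12));
      inc(n); Result[n] := AnsiChar($80 or ((w shr 6) and $3F));
      inc(n); Result[n] := AnsiChar($80 or (w and $3F));
    end;
  end;
  SetLength(Result, n);
end;

var
  s, u: AnsiString;
  i: integer;
begin
  InitCp1251Sketch;
  SetLength(s, 3);      // three CP 1251 bytes, set by index to avoid
  s[1] := AnsiChar($CF); // any compiler-side literal conversion
  s[2] := AnsiChar($F0);
  s[3] := AnsiChar($E8);
  u := AnsiToUtf8ViaTable(s, Cp1251ToWide);
  for i := 1 to Length(u) do
    write(IntToHex(ord(u[i]), 2), ' '); // prints: D0 9F D1 80 D0 B8
  writeln;
end.
```

A multi-byte code page cannot use this scheme directly, because a single input byte does not determine the character.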

Sha · 2012-01-09 21:37:43

Yes.
Here is some code to demonstrate the idea.

const
  Ansi1252: packed array[128..159] of word = (
    8364,  129, 8218,  402, 8222, 8230, 8224, 8225,  710, 8240,  352, 8249,  338,  141,  381,  143,
     144, 8216, 8217, 8220, 8221, 8226, 8211, 8212,  732, 8482,  353, 8250,  339,  157,  382,  376);
  Ansi1251: packed array[128..255] of word = (
    1026, 1027, 8218, 1107, 8222, 8230, 8224, 8225, 8364, 8240, 1033, 8249, 1034, 1036, 1035, 1039,
    1106, 8216, 8217, 8220, 8221, 8226, 8211, 8212,  152, 8482, 1113, 8250, 1114, 1116, 1115, 1119,
     160, 1038, 1118, 1032,  164, 1168,  166,  167, 1025,  169, 1028,  171,  172,  173,  174, 1031,
     176,  177, 1030, 1110, 1169,  181,  182,  183, 1105, 8470, 1108,  187, 1112, 1029, 1109, 1111,
    1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055,
    1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071,
    1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087,
    1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103);

type
  TConversionTable= record
    CodePage: integer;
    WideToAnsiLast: integer;
    WideToAnsi: array of cardinal; //0..WideToAnsiLast
    AnsiToWide: array of word;     //0..255
    end;
  PConversionTable= ^TConversionTable;

var
  ConversionTable: array of TConversionTable;
  ConversionTableLast: integer= -1;

procedure AddConversionTable(CodePage: integer; PWA: PWordArray; Count: integer);
var
  i, len, min: integer;
  tmp: cardinal;
  pct: PConversionTable;
begin;
  if CodePage=0 then CodePage:=GetACP;

  for i:=0 to ConversionTableLast do if ConversionTable[i].CodePage=CodePage then exit;
  inc(ConversionTableLast);
  SetLength(ConversionTable, ConversionTableLast+1);
  pct:=@ConversionTable[ConversionTableLast];
  pct.CodePage:=CodePage;

  SetLength(pct.AnsiToWide,256);
  for i:=0 to 255 do pct.AnsiToWide[i]:=i;
  len:=0;
  for i:=0 to Count-1 do if PWA[i]>255 then inc(len);
  SetLength(pct.WideToAnsi,len);
  pct.WideToAnsiLast:=len-1;
  len:=0;
  min:=0;
  for i:=0 to Count-1 do begin;
    pct.AnsiToWide[i+128]:=PWA[i];
    if PWA[i]>255 then begin;
      pct.WideToAnsi[len]:=integer(PWA[i]) shl 8 or (i+128);
      if pct.WideToAnsi[min]>pct.WideToAnsi[len] then min:=len;
      inc(len);
      end;
    end;

  // insertion sort of pct.WideToAnsi
  if min>0 then begin;
    tmp:=pct.WideToAnsi[0];
    pct.WideToAnsi[0]:=pct.WideToAnsi[min];
    pct.WideToAnsi[min]:=tmp;
    end;
  dec(len); // last index
  i:=1;
  while i<len do begin;
    inc(i);
    if pct.WideToAnsi[i]<pct.WideToAnsi[i-1] then begin;
      tmp:=pct.WideToAnsi[i];
      min:=i;
      repeat;
        pct.WideToAnsi[min]:=pct.WideToAnsi[min-1];
        dec(min);
        until tmp>=pct.WideToAnsi[min-1];
      pct.WideToAnsi[min]:=tmp;
      end;
    end;
  end;

procedure InitConversionTables;
begin;
  AddConversionTable(1252, @Ansi1252[Low(Ansi1252)], High(Ansi1252)-Low(Ansi1252)+1);
  AddConversionTable(1251, @Ansi1251[Low(Ansi1251)], High(Ansi1251)-Low(Ansi1251)+1);
  end;

function FindAnsiChar(wc: cardinal; CP: integer= 1252): integer;
var
  i, left, right: PtrInt;
  pct: PConversionTable;
begin;
  i:=ConversionTableLast;
  while (i>=0) and (ConversionTable[i].CodePage<>CP) do dec(i);
  if i>=0 then begin;
    pct:=@ConversionTable[i];
    right:=pct.WideToAnsiLast;
    left:=-1;
    wc:=wc shl 8;
    while left<right do begin;
      i:=(left + right + 1) shr 1;
      if pct.WideToAnsi[i]<wc then left:=i else right:=i - 1;
      end;
    inc(left);
    if left<=pct.WideToAnsiLast then begin;
      wc:=wc xor pct.WideToAnsi[left];
      if wc<256 then begin;
        Result:=wc;
        exit;
        end;
      end;
    end;
  Result:=-1; // invalid wide char or CP not found
  end;

function TestFindChars(CodePage: integer; PWA: PWordArray; Count: integer): boolean;
var
  i: integer;
begin;
  Result:=false;
  for i:=0 to Count-1 do if (PWA[i]>255) and (FindAnsiChar(PWA[i],CodePage)<>i+128) then exit;
  Result:=true;
  end;

function TestCountChars(CodePage: integer): integer;
var
  i: integer;
begin;
  Result:=0;
  for i:=$100 to $FFFF do if FindAnsiChar(i,CodePage)>=0 then inc(Result);
  end;

procedure TForm1.Button4Click(Sender: TObject);
const
  msg: array[boolean] of string= ('failed', 'OK');
var
  FoundAll: boolean;
  CountAll: integer;
begin;
  InitConversionTables;

  FoundAll:=TestFindChars(1252, @Ansi1252[Low(Ansi1252)], High(Ansi1252)-Low(Ansi1252)+1);
  CountAll:=TestCountChars(1252);
  Memo1.Lines.Add(Format('CP1252: test1 %s, test2 %s',[msg[FoundAll], msg[CountAll=27]]));

  FoundAll:=TestFindChars(1251, @Ansi1251[Low(Ansi1251)], High(Ansi1251)-Low(Ansi1251)+1);
  CountAll:=TestCountChars(1251);
  Memo1.Lines.Add(Format('CP1251: test1 %s, test2 %s',[msg[FoundAll], msg[CountAll=112]]));
  end;

procedure TForm1.Button2Click(Sender: TObject);
var
  c: array[0..255] of byte;
  w: array[0..255] of word;
  i: integer;
begin;
  for i:=0 to 255 do c[i]:=i;
  MultiByteToWideChar(1251,0,@c[0],256,@w[0],256);
  i:=128;
  while i<=256-16 do begin;
    Memo1.Lines.Add(Format('%d:   %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, ',
                            [i, w[i+0],w[i+1],w[i+2], w[i+3], w[i+4], w[i+5], w[i+6], w[i+7],
                                w[i+8],w[i+9],w[i+10],w[i+11],w[i+12],w[i+13],w[i+14],w[i+15]]));
    i:=i+16;
    end;
  end;

Sha · 2012-01-10 21:33:18

My unit for fast Ansi/Unicode conversion is here

ab · 2012-01-11 12:54:13

I've fixed the test failure in SynCommons.pas.

But I'll wait a little before adding fast conversions for other code pages.
I suspect that the current implementation is fast enough (*MultiByte* Windows APIs are not so slow here).

mORMot Open Source
