WinAnsiBufferToUtf8 is indeed for code page 1252 only.
The only special field encoding is WinAnsiString.
You'll have to use RawUTF8 for other code pages.
You could create your own version for CP 1251, without touching the current code.
It would be possible to add a CyrillicAnsiString kind of field to the framework.
But in real code, I'm not sure it would be much faster than the current UTF-8 implementation.
If you found an error in the current test code, please be more specific about the issue.
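The "own version for CP 1251" suggested above could look roughly like the sketch below. It is only an illustration: the name Cp1251ToUtf8Sketch and the table name Cp1251Wide are hypothetical and do not exist in the framework, and plain AnsiString is used as a byte container. The table holds the usual Unicode code points for CP 1251 bytes 128..255.

```pascal
program Cp1251Utf8Demo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
uses
  SysUtils;

const
  // Unicode code points for CP 1251 bytes 128..255 (byte 152 is unassigned)
  Cp1251Wide: array[128..255] of Word = (
    1026, 1027, 8218, 1107, 8222, 8230, 8224, 8225, 8364, 8240, 1033, 8249, 1034, 1036, 1035, 1039,
    1106, 8216, 8217, 8220, 8221, 8226, 8211, 8212, 152, 8482, 1113, 8250, 1114, 1116, 1115, 1119,
    160, 1038, 1118, 1032, 164, 1168, 166, 167, 1025, 169, 1028, 171, 172, 173, 174, 1031,
    176, 177, 1030, 1110, 1169, 181, 182, 183, 1105, 8470, 1108, 187, 1112, 1029, 1109, 1111,
    1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055,
    1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071,
    1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087,
    1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103);

// Hypothetical helper (not a framework function): CP 1251 bytes -> UTF-8 bytes
function Cp1251ToUtf8Sketch(const Source: AnsiString): AnsiString;
var
  i, n: Integer;
  c: Cardinal;
begin
  SetLength(Result, Length(Source)*3); // worst case: 3 UTF-8 bytes per input byte
  n := 0;
  for i := 1 to Length(Source) do begin
    c := Ord(Source[i]);
    if c >= 128 then
      c := Cp1251Wide[c]; // table lookup replaces the system API call
    if c < $80 then begin
      Inc(n); Result[n] := AnsiChar(c);
    end else if c < $800 then begin
      Inc(n); Result[n] := AnsiChar($C0 or (c shr 6));
      Inc(n); Result[n] := AnsiChar($80 or (c and $3F));
    end else begin
      Inc(n); Result[n] := AnsiChar($E0 or (c shr 12));
      Inc(n); Result[n] := AnsiChar($80 or ((c shr 6) and $3F));
      Inc(n); Result[n] := AnsiChar($80 or (c and $3F));
    end;
  end;
  SetLength(Result, n);
end;

var
  s, u: AnsiString;
  i: Integer;
begin
  SetLength(s, 2);
  s[1] := 'A';
  s[2] := AnsiChar($DF); // Cyrillic capital Ya in CP 1251
  u := Cp1251ToUtf8Sketch(s);
  for i := 1 to Length(u) do
    Write(IntToHex(Ord(u[i]), 2), ' '); // dump the UTF-8 bytes as hex
  Writeln;
end.
```

For the two-byte input above, the dump should read 41 D0 AF, since byte $DF maps to U+042F. Whether such a loop beats the existing UTF-8 path in real code would still have to be measured.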
Online
Details are here.
In procedure TTestLowLevelCommon._UTF8, the statement U := WinAnsiToUtf8(W);
calls WinAnsiToUtf8 -> WinAnsiBufferToUtf8 -> c := WinAnsiTable[c];
and the assertion then fails on Check(StringToUTF8(UTF8ToString(U))=U);
I am thinking about a version for a user-defined CP (1251, 1252 or any other supported page)
that works with any Ansi strings, including Cyrillic ones.
I am going to move all Unicode support to a separate unit, so that it can easily be replaced with the framework unit.
Is it possible?
The idea is to fill or switch the fast conversion tables (for example, WinAnsiTable) dynamically.
The framework would fill the tables when the user globally sets his CP (1251, 1252 or other).
Calls to GetACP then become unnecessary.
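One way to read the "fill the tables when the user sets his CP" idea is sketched below. This is only an assumption about how it might be done: SelectCodePage, ActiveCodePage and ActiveAnsiToWide are hypothetical names, the code is Windows-only, and it is meaningful for single-byte code pages only. It lets MultiByteToWideChar fill a 256-entry table once, so that later conversions need neither GetACP nor the API.

```pascal
program FillTableSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
uses
  Windows;

type
  TAnsiToWide = array[0..255] of WideChar;

var
  ActiveCodePage: Cardinal = 0;   // hypothetical global: 0 means "not set yet"
  ActiveAnsiToWide: TAnsiToWide;  // hypothetical global lookup table

// Fills the global table for a single-byte code page; returns false on failure.
function SelectCodePage(CodePage: Cardinal): Boolean;
var
  bytes: array[0..255] of AnsiChar;
  wide: TAnsiToWide;
  i: Integer;
begin
  for i := 0 to 255 do
    bytes[i] := AnsiChar(i);
  // convert all 256 byte values in one call; a single-byte page yields 256 wide chars
  Result := MultiByteToWideChar(CodePage, 0, @bytes[0], 256, @wide[0], 256) = 256;
  if Result then begin
    ActiveAnsiToWide := wide;     // switch tables only after a successful fill
    ActiveCodePage := CodePage;
  end;
end;

begin
  if SelectCodePage(1251) then
    Writeln('byte 192 maps to code point ', Ord(ActiveAnsiToWide[192]))
  else
    Writeln('CP 1251 is not available on this system');
end.
```

The table is filled into a local copy first, so a failed call leaves the previously selected code page untouched.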
Last edited by Sha (2012-01-08 07:20:05)
Offline
Check(StringToUTF8(UTF8ToString(U))=U) fails because some characters in U are not part of CP 1251.
So the TTestLowLevelCommon._UTF8 tests have an issue: they only work when the system code page is CP 1252.
I'll need to fix it.
WinAnsiString is exactly CP 1252, by definition, and some units in the framework expect this behavior.
I do not want to make it code page independent.
There is already the AnsiString kind of string for this purpose.
Online
"WinAnsiString are CP 1252 exactly, by definition - and some units in the framework expect this behavior.
I do not want to make it Code Page independent."
Can I ask why this is necessary?
Offline
Just search for the WinAnsiString kind of string.
You'll find several uses which expect it.
For instance, SynPdf has some optimizations specific to CP 1252 (this is the native code page of PDF content).
Online
OK.
I found one use in SynPdf. (Note: the PDF test also fails on CP 1251.)
function TPdfWrite.ToWideChar(const Ansi: PDFString; out DLen: Integer): PWideChar;
var L: integer;
begin
  L := Length(Ansi)*2+2; // maximum possible length
  getmem(result,L);
  if FCodePage=CODEPAGE_US then begin // use our internal fast conversion
    DLen := Length(Ansi);
    WinAnsiToUnicodeBuffer(WinAnsiString(Ansi), pointer(result), DLen+1);
  end else begin
{$IFDEF MSWINDOWS}
    DLen := MultiByteToWideChar(FCodePage, 0, Pointer(Ansi), length(Ansi), result, L);
    result[DLen] := #0;
{$ENDIF}
{$IFDEF LINUX}
    StringToWideChar(Ansi, result, L); // only work with current system CharSet
    DLen := 0; while result[DLen]<>#0 do inc(DLen);
{$ENDIF}
  end;
end;
WinAnsiString is used here only to speed up work with CP 1252.
Nothing else is specific to it. So WinAnsiString = Win1252String.
I suggest the same acceleration for all Ansi strings. Why not?
Certainly, the code guarantees a correct conversion for CP 1252.
It would be good to do the same for CP 1251 and for the others.
Last edited by Sha (2012-01-08 20:27:19)
Offline
SynPDF also uses WinAnsi encoding at the lowest level of its implementation (e.g. for TrueType fonts: in fact, PDF expects two fonts to be declared, one as WinAnsi, i.e. CP 1252, and one as Unicode).
Acceleration for all Ansi strings can be implemented.
For non-MBCS code pages, of course.
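The restriction to non-MBCS code pages follows from the table approach itself: a 256-entry lookup only works when every character is exactly one byte. A minimal, Windows-only sketch of how such a check might be done is shown below; IsSingleByteCP is a hypothetical helper name, while GetCPInfo and its MaxCharSize field are the standard Windows API.

```pascal
program SingleByteCheckSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
uses
  Windows;

// Hypothetical helper: true if the code page is installed and uses one byte per char
function IsSingleByteCP(CodePage: Cardinal): Boolean;
var
  Info: TCPInfo;
begin
  Result := GetCPInfo(CodePage, Info) and (Info.MaxCharSize = 1);
end;

begin
  // 1251 and 1252 are single-byte pages; 932 (Shift-JIS) is a double-byte page
  Writeln('1251 single-byte: ', IsSingleByteCP(1251));
  Writeln('1252 single-byte: ', IsSingleByteCP(1252));
  Writeln('932 single-byte: ', IsSingleByteCP(932));
end.
```

A table-based fast path could be enabled only when this check succeeds, with the system API kept as the fallback for everything else.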
Online
Yes.
Here is some code to demonstrate the idea.
const
  Ansi1252: packed array[128..159] of word = (
    8364, 129, 8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249, 338, 141, 381, 143,
    144, 8216, 8217, 8220, 8221, 8226, 8211, 8212, 732, 8482, 353, 8250, 339, 157, 382, 376);
  Ansi1251: packed array[128..255] of word = (
    1026, 1027, 8218, 1107, 8222, 8230, 8224, 8225, 8364, 8240, 1033, 8249, 1034, 1036, 1035, 1039,
    1106, 8216, 8217, 8220, 8221, 8226, 8211, 8212, 152, 8482, 1113, 8250, 1114, 1116, 1115, 1119,
    160, 1038, 1118, 1032, 164, 1168, 166, 167, 1025, 169, 1028, 171, 172, 173, 174, 1031,
    176, 177, 1030, 1110, 1169, 181, 182, 183, 1105, 8470, 1108, 187, 1112, 1029, 1109, 1111,
    1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055,
    1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071,
    1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087,
    1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103);

type
  TConversionTable= record
    CodePage: integer;
    WideToAnsiLast: integer;
    WideToAnsi: array of cardinal; //0..WideToAnsiLast
    AnsiToWide: array of word; //0..255
  end;
  PConversionTable= ^TConversionTable;

var
  ConversionTable: array of TConversionTable;
  ConversionTableLast: integer= -1;

procedure AddConversionTable(CodePage: integer; PWA: PWordArray; Count: integer);
var
  i, len, min: integer;
  tmp: cardinal;
  pct: PConversionTable;
begin;
  if CodePage=0 then CodePage:=GetACP;
  for i:=0 to ConversionTableLast do if ConversionTable[i].CodePage=CodePage then exit;
  inc(ConversionTableLast);
  SetLength(ConversionTable, ConversionTableLast+1);
  pct:=@ConversionTable[ConversionTableLast];
  pct.CodePage:=CodePage;
  SetLength(pct.AnsiToWide,256);
  for i:=0 to 255 do pct.AnsiToWide[i]:=i;
  len:=0;
  for i:=0 to Count-1 do if PWA[i]>255 then inc(len);
  SetLength(pct.WideToAnsi,len);
  pct.WideToAnsiLast:=len-1;
  len:=0;
  min:=0;
  for i:=0 to Count-1 do begin;
    pct.AnsiToWide[i+128]:=PWA[i];
    if PWA[i]>255 then begin;
      pct.WideToAnsi[len]:=integer(PWA[i]) shl 8 or (i+128);
      if pct.WideToAnsi[min]>pct.WideToAnsi[len] then min:=len;
      inc(len);
    end;
  end;
  // insertion sort of pct.WideToAnsi
  if min>0 then begin;
    tmp:=pct.WideToAnsi[0];
    pct.WideToAnsi[0]:=pct.WideToAnsi[min];
    pct.WideToAnsi[min]:=tmp;
  end;
  dec(len); // last index
  i:=1;
  while i<len do begin;
    inc(i);
    if pct.WideToAnsi[i]<pct.WideToAnsi[i-1] then begin;
      tmp:=pct.WideToAnsi[i];
      min:=i;
      repeat;
        pct.WideToAnsi[min]:=pct.WideToAnsi[min-1];
        dec(min);
      until tmp>=pct.WideToAnsi[min-1];
      pct.WideToAnsi[min]:=tmp;
    end;
  end;
end;

procedure InitConversionTables;
begin;
  AddConversionTable(1252, @Ansi1252[Low(Ansi1252)], High(Ansi1252)-Low(Ansi1252)+1);
  AddConversionTable(1251, @Ansi1251[Low(Ansi1251)], High(Ansi1251)-Low(Ansi1251)+1);
end;

function FindAnsiChar(wc: cardinal; CP: integer= 1252): integer;
var
  i, left, right: PtrInt;
  pct: PConversionTable;
begin;
  i:=ConversionTableLast;
  while (i>=0) and (ConversionTable[i].CodePage<>CP) do dec(i);
  if i>=0 then begin;
    pct:=@ConversionTable[i];
    right:=pct.WideToAnsiLast;
    left:=-1;
    wc:=wc shl 8;
    while left<right do begin;
      i:=(left + right + 1) shr 1;
      if pct.WideToAnsi[i]<wc then left:=i else right:=i - 1;
    end;
    inc(left);
    if left<=pct.WideToAnsiLast then begin;
      wc:=wc xor pct.WideToAnsi[left];
      if wc<256 then begin;
        Result:=wc;
        exit;
      end;
    end;
  end;
  Result:=-1; // invalid wide char or CP not found
end;

function TestFindChars(CodePage: integer; PWA: PWordArray; Count: integer): boolean;
var
  i: integer;
begin;
  Result:=false;
  for i:=0 to Count-1 do if (PWA[i]>255) and (FindAnsiChar(PWA[i],CodePage)<>i+128) then exit;
  Result:=true;
end;

function TestCountChars(CodePage: integer): integer;
var
  i: integer;
begin;
  Result:=0;
  for i:=$100 to $FFFF do if FindAnsiChar(i,CodePage)>=0 then inc(Result);
end;

procedure TForm1.Button4Click(Sender: TObject);
const
  msg: array[boolean] of string= ('failed', 'OK');
var
  FoundAll: boolean;
  CountAll: integer;
begin;
  InitConversionTables;
  FoundAll:=TestFindChars(1252, @Ansi1252[Low(Ansi1252)], High(Ansi1252)-Low(Ansi1252)+1);
  CountAll:=TestCountChars(1252);
  Memo1.Lines.Add(Format('CP1252: test1 %s, test2 %s',[msg[FoundAll], msg[CountAll=27]]));
  FoundAll:=TestFindChars(1251, @Ansi1251[Low(Ansi1251)], High(Ansi1251)-Low(Ansi1251)+1);
  CountAll:=TestCountChars(1251);
  Memo1.Lines.Add(Format('CP1251: test1 %s, test2 %s',[msg[FoundAll], msg[CountAll=112]]));
end;

procedure TForm1.Button2Click(Sender: TObject);
var
  c: array[0..255] of byte;
  w: array[0..255] of word;
  i: integer;
begin;
  for i:=0 to 255 do c[i]:=i;
  MultiByteToWideChar(1251,0,@c[0],256,@w[0],256);
  i:=128;
  while i<=256-16 do begin;
    Memo1.Lines.Add(Format('%d: %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, ',
      [i, w[i+0],w[i+1],w[i+2], w[i+3], w[i+4], w[i+5], w[i+6], w[i+7],
      w[i+8],w[i+9],w[i+10],w[i+11],w[i+12],w[i+13],w[i+14],w[i+15]]));
    i:=i+16;
  end;
end;
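The reverse lookup in the code above rests on one packing trick: each WideToAnsi entry stores the wide character in the upper bits and its Ansi byte in the low 8 bits. Sorting the entries therefore sorts them by wide character, and a xor against the shifted probe leaves just the Ansi byte on a match. The sketch below isolates that trick. PackKey and MatchKey are hypothetical names used for illustration only.

```pascal
program PackedKeySketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: combine a wide char and its Ansi byte into one sortable key
function PackKey(Wide: Word; Ansi: Byte): Cardinal;
begin
  Result := (Cardinal(Wide) shl 8) or Ansi;
end;

// Hypothetical helper: returns the Ansi byte if Key belongs to Wide, else -1
function MatchKey(Key: Cardinal; Wide: Word): Integer;
var
  diff: Cardinal;
begin
  // equal upper bits cancel out, leaving only the low 8 bits (the Ansi byte)
  diff := Key xor (Cardinal(Wide) shl 8);
  if diff < 256 then
    Result := Integer(diff)
  else
    Result := -1;
end;

var
  k: Cardinal;
begin
  k := PackKey(8364, 128);     // Euro sign U+20AC is byte 128 in CP 1252
  Writeln(MatchKey(k, 8364));  // prints 128: the key matches this wide char
  Writeln(MatchKey(k, 8218));  // prints -1: a different wide char
end.
```

Because the probe wc shl 8 has zero low bits, it is never greater than the matching key, so a lower-bound binary search lands on that key if it exists.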
Offline
I've fixed the test failure in SynCommons.pas.
But I'll wait a little before adding fast conversions for other code pages.
I suspect that the current implementation is fast enough (the *MultiByte* Windows APIs are not so slow here).
Online