You are not logged in.
Pages: 1
Yes, result is correct, but is not shortest.
Hello AB,
using TDiff with files of string ['1','2','5'] and ['1','2','3','4','5']
I get valid result (2 add operations)
but with files of string ['1','2','3','4','5'] and ['1','2','5']
I get invalid result (1 modify and 2 delete operations)
Same, Delphi 7: 20% + 1
and take your attention to FastMM4-AVX
https://github.com/maximmasiutin/FastMM4-AVX
In my source code I didn't use the predetermined types to show possibility of switching of data presentation.
Post is changed according to SynCommons changes.
For the best translation from/to Russian you can use http://www.online-translator.com/Default.aspx/Text
Sorry.
Here some serialization examples http://guildalfa.ru/alsha/sites/default … Proj_0.zip
and here annotation (in Russian) http://guildalfa.ru/alsha/node/28
Thanks for the link. Very interesting. I will close read it and check UTF-updates in week.
I decided to use the simplified check on the basis of a trick with NULCHAR as in SynCommons.
The project below contains different variants of work with JSON.
I try to play with custom serialization of record as array of unnamed values.
If I correctly understand, for this purpose it is necessary to use RecordSaveJson.
The result, for example, is [365,-365,36.5,"2014-02-25T23:18:38","day\/3"]
For a unserialization I use RecordLoadJson. The result is expected.
The problem consists that it isn't clear how to check correctness of the received result.
Thanks. It works.
If two different classes have the same DynArrayTI and RecordTI,
we can use only one reader/writer pair to process these classes.
Class TI is useless.
OK, let's leave everything as is.
Anyway it isn't especially important.
Framework uses DynArrayTI and RecordTI and not method data to find and replace custom reader/writer in JSONCustomParsers array.
Yes, I see.
Then I will use this feature as follows.
type
TDynArrayWriter= procedure(aUserInfo: pointer; const aWriter: TTextWriter; const aValue);
TDynArrayReader= function(aUserInfo: pointer; p: PUTF8Char; var aValue; out aValid: Boolean): PUTF8Char;
procedure RegisterDynArrayRW(aTypeInfo: pointer; aReader: TDynArrayReader; aWriter: TDynArrayWriter; aUserInfo: pointer= nil);
implementation
procedure RegisterDynArrayRW(aTypeInfo: pointer; aReader: TDynArrayReader; aWriter: TDynArrayWriter; aUserInfo: pointer= nil);
var
Reader, Writer: TMethod;
begin;
Reader.Code:=@aReader; Reader.Data:=aUserInfo;
Writer.Code:=@aWriter; Writer.Data:=aUserInfo;
TTextWriter.RegisterCustomJSONSerializer(aTypeInfo,
TDynArrayJSONCustomReader(Reader),
TDynArrayJSONCustomWriter(Writer));
end;
procedure Int64ArrayWriter(aUserInfo: pointer; const aWriter: TTextWriter; const aValue);
var
v: int64 absolute aValue;
begin;
aWriter.Add(v);
end;
function Int64ArrayReader(aUserInfo: pointer; p: PUTF8Char; var aValue; out aValid: Boolean): PUTF8Char;
var
v: int64 absolute aValue;
wasString: boolean;
delim: AnsiChar;
begin;
aValid:=false;
Result:=nil;
if p=nil then exit;
v:=GetInt64(GetJSONField(p,p,@wasString,@delim));
if (p=nil) or wasString then exit;
dec(p); p^:=delim;
aValid:=true;
Result:=p; // ',' or ']' for last item of array
end;
But the problem consists that I can't register some serializers with different UserData for the same type.
To reproduce this bug register/unregister/register custom serializer.
You can see that default serializer is used. Not the last registered.
Another bug is that length of JSONCustomParsers will increase
for each registration of previously unregistered serializer.
Here is correct code.
class procedure TTextWriter.RegisterCustomJSONSerializer(aTypeInfo: pointer;
aReader: TDynArrayJSONCustomReader; aWriter: TDynArrayJSONCustomWriter);
var i: integer;
DynArrayTI, RecordTI: pointer;
begin
i := -MaxInt; //Sha: //i := -2;
RecordTI := nil;
DynArrayTI := nil;
if aTypeInfo<>nil then
case PDynArrayTypeInfo(aTypeInfo)^.kind of
tkDynArray: begin
DynArrayTI := aTypeInfo;
RecordTI := TypeInfoToRecordInfo(aTypeInfo);
i := JSONCustomParsersDynArrayIndex(aTypeInfo);
if (i<0) and (RecordTI<>nil) then
i := JSONCustomParsersRecordIndex(RecordTI);
end;
tkRecord: begin
i := JSONCustomParsersRecordIndex(aTypeInfo);
RecordTI := aTypeInfo;
DynArrayTI := nil;
end;
end;
if i=-MaxInt then //Sha: //if i=-2 then
raise ESynException.Create('Invalid TTextWriter.RegisterCustomJSONSerializer call');
if i<0 then
if Assigned(aWriter) or Assigned(aReader) then
if i<-1 then i := -(i+2) else //Sha: line added
begin
i := length(JSONCustomParsers);
SetLength(JSONCustomParsers,i+1);
end else
exit;
with JSONCustomParsers[i] do begin
DynArrayTypeInfo := DynArrayTI;
RecordTypeInfo := RecordTI;
Writer := aWriter;
Reader := aReader;
end;
end;
//Sha
function JSONCustomParsersDynArrayIndex(aTypeInfo: pointer): integer; {$ifdef HASINLINE}inline;{$endif}
begin
result := length(JSONCustomParsers) - 1;
while result >= 0 do begin
with JSONCustomParsers[result] do if DynArrayTypeInfo=aTypeInfo then begin
if not Assigned(Reader) and not Assigned(Writer) then result := -(result+2);
exit;
end;
dec(result);
end;
end;
//Sha
function JSONCustomParsersRecordIndex(aTypeInfo: pointer): integer; {$ifdef HASINLINE}inline;{$endif}
begin
result := length(JSONCustomParsers) - 1;
while result >= 0 do begin
with JSONCustomParsers[result] do if RecordTypeInfo=aTypeInfo then begin
if not Assigned(Reader) and not Assigned(Writer) then result := -(result+2);
exit;
end;
dec(result);
end;
end;
I think the use of regular procedures can save a lot of code
because we will need only one pair of them
for each array of ordinals or array of records
and not for each class used such arrays.
Of course, I can continue to write something like this.
procedure IntArrWriter(aObject: TObject; const aWriter: TTextWriter; const aValue);
var
v: integer absolute aValue;
begin;
aWriter.Add('%', [v], twJSONEscape);
end;
function IntArrReader(aObject: TObject; p: PUTF8Char; var aValue; out aValid: Boolean): PUTF8Char;
var
v: integer absolute aValue;
delim: AnsiChar;
begin;
aValid:=false;
Result:=nil;
if p=nil then exit;
v:=GetInteger(GetJSONField(p,p,nil,@delim));
if p=nil then exit;
dec(p); p^:=delim;
aValid:=true;
Result:=p; // ',' or ']' for last item of array
end;
var
reader: TDynArrayJSONCustomReader;
writer: TDynArrayJSONCustomWriter;
initialization
@reader:=@IntArrReader;
@writer:=@IntArrWriter;
TTextWriter.RegisterCustomJSONSerializer(TypeInfo(TIntArr), reader, writer);
Ok. Then let's try my code.
I have tested functons from current version of ShaUnicode
using 1000 random CP and 10000 random strings.
No error. You can repeat this. All code you need is below. You need also Button and Memo on the Form.
procedure FillRandomConvTable(pTable: PConvTable);
const
DefaultConvChar = 32;
var
i, n: NInt;
begin;
pTable.CodePage:=0;
for i:=0 to 127 do begin;
pTable.WideToAnsi[i]:=i;
pTable.AnsiToWide[i]:=i;
end;
for i:=128 to $FFFF do pTable.WideToAnsi[i]:=DefaultConvChar;
for i:=128 to 255 do begin;
repeat;
if i<128+64 then n:=Random($07FF - 255) + 256
else n:=Random($FFFF - $07FF) + $0800;
until pTable.WideToAnsi[n]=DefaultConvChar;
pTable.WideToAnsi[n]:=i;
pTable.AnsiToWide[i]:=n;
end;
for i:=0 to 255 do begin;
n:=pTable.AnsiToWide[i];
if n<=127 then n:=0
else if n<=$7FF then n:=1
else n:=2;
pTable.AnsiToUTF8Len[i]:=n;
end;
end;
function TSynTestCaseRandomString(CharCount: Integer): RawByteString;
var V: cardinal;
P: PAnsiChar;
begin
SetString(result,nil,CharCount);
P := pointer(Result);
while CharCount>0 do begin
if CharCount>5 then begin
V := Random(maxInt); // fast: one random compute per 5 chars
P[0] := AnsiChar(32+V and 127); V := V shr 7;
P[1] := AnsiChar(32+V and 127); V := V shr 7;
P[2] := AnsiChar(32+V and 127); V := V shr 7;
P[3] := AnsiChar(32+V and 127); V := V shr 7;
P[4] := AnsiChar(65+V);
Inc(P,5);
dec(CharCount,5);
end else begin
P^ := AnsiChar(32+Random(224));
inc(P);
dec(CharCount);
end;
end;
end;
procedure TestCP;
var
ST: array[0..10000] of RawByteString;
CP, i: integer;
Table: TConvTable;
begin
for i := 0 to high(ST) do ST[i] := TSynTestCaseRandomString(i shr 3);
for CP:=0 to 999 do begin;
FillRandomConvTable(@Table);
for i := 0 to high(ST) do
if ShaUTF8ToAnsi(ShaAnsiToUTF8(ST[i],@Table),@Table)<>ST[i] then begin;
Form1.Memo1.Lines.Add('ShaAnsiToUTF8 '+IntToStr(CP));
Form1.Memo1.Lines.Add(ST[i]);
break;
end;
for i := 0 to high(ST) do
if ShaUnicodeToAnsi(ShaAnsiToUnicode(ST[i],@Table),@Table)<>ST[i] then begin;
Form1.Memo1.Lines.Add('ShaAnsiToUnicode '+IntToStr(CP));
Form1.Memo1.Lines.Add(ST[i]);
break;
end;
end;
end;
procedure TForm1.Button1Click(Sender: TObject);
begin
TestCP;
Form1.Memo1.Lines.Add('TestCP done');
end;
Just use my above benchmark code, and you'll find out that it it works only with (CP=1250) or (CP=1251) or (CP=1254) or (CP=1256).
So may be problem in your above benchmark code? ;-)
Or you tested old my functions?
I haven't such function
Sha := GetShaConvTable(CP);
as well as conversion tables for all these CPs.
Could you provide minimal code that I could reproduce AV?
I noticed that your routines are a bit faster than mine.
But also that it works only with (CP=1250) or (CP=1251) or (CP=1254) or (CP=1256).
I've got access violation otherwise. I suspect there are some issues in your UTF8 conversion code.
I have validated all my functions for CP 1251 using code
function RandomString(MaxCharCount: Integer): RawByteString;
var
CharCount, CharRange: integer;
P: PAnsiChar;
begin;
// CharRange:=96 + 32; //ASCII + #128..159
CharRange:=96 + 128; //ASCII + #128..255
CharCount:=1+Random(MaxCharCount);
SetString(Result, nil, CharCount);
P:=pointer(Result);
while CharCount>0 do begin;
dec(CharCount); P[CharCount]:=AnsiChar(Random(CharRange)+32); //starting from #32
end;
end;
function Validate(const s: RawByteString; Status: integer=0): integer;
var
s2: RawByteString;
u, u2: RawUnicode;
t, t2: RawUTF8;
Len, Len2, Len8, TestNo: integer;
begin;
u:='';
t:='';
Len8:=0;
Len:=Length(s);
if Len>0 then begin;
SetLength(u,Len*2); u[Len+2]:=#0;
MultiByteToWideChar(GetACP, 0, pointer(s), Len, pointer(u), Len);
Len8:=WideCharToMultiByte(CP_UTF8, 0, pointer(u), Len, nil, 0, nil, nil);
SetLength(t, Len8);
WideCharToMultiByte(CP_UTF8, 0, pointer(u), Len, pointer(t), Len8, nil, nil);
end;
TestNo:=1;
if TestNo and Status=0 then begin;
u2:=ShaAnsiToUnicode(s, pConvDefault);
if u2=u then Status:=Status or TestNo;
end;
TestNo:=TestNo*2;
if TestNo and Status=0 then begin;
t2:=ShaUnicodeToUTF8(u);
if t2=t then Status:=Status or TestNo;
end;
TestNo:=TestNo*2;
if TestNo and Status=0 then begin;
s2:=ShaUnicodeToAnsi(u, pConvDefault);
if s2=s then Status:=Status or TestNo;
end;
TestNo:=TestNo*2;
if TestNo and Status=0 then begin;
u2:=ShaUTF8ToUnicode(t);
if u2=u then Status:=Status or TestNo;
end;
TestNo:=TestNo*2;
if TestNo and Status=0 then begin;
t2:=ShaAnsiToUTF8(s, pConvDefault);
if t2=t then Status:=Status or TestNo;
end;
TestNo:=TestNo*2;
if TestNo and Status=0 then begin;
s2:=ShaUTF8ToAnsi(t, pConvDefault);
if s2=s then Status:=Status or TestNo;
end;
TestNo:=TestNo*2;
if TestNo and Status=0 then begin;
Len2:=ShaUnicodeToUTF8(pointer(u), Len);
if Len2=Len8 then Status:=Status or TestNo;
end;
TestNo:=TestNo*2;
if TestNo and Status=0 then begin;
Len2:=ShaAnsiToUTF8(pointer(s), Len, pConvDefault);
if Len2=Len8 then Status:=Status or TestNo;
end;
TestNo:=TestNo*2;
if TestNo and Status=0 then begin;
Len2:=ShaUTF8ToAnsi(pointer(t), Len8);
if Len2=Len then Status:=Status or TestNo;
end;
while TestNo<>0 do begin;
TestNo:=TestNo*2;
Status:=Status or TestNo;
end;
Result:=Status;
end;
procedure TForm1.bValidateClick(Sender: TObject);
var
s: RawByteString;
i, len, ErrorCount, Status: integer;
begin;
ErrorCount:=0;
i:=0;
len:=0;
repeat;
s:=RandomString(len);
Status:=Validate(s);
if Status<>-1 then begin;
inc(ErrorCount);
Validate(s,Status);
end;
len:=32;
if i>100000 then len:=8*1024;
inc(i);
until i>150000;
Memo1.Lines.Add(Format('Validation done, %d errors',[ErrorCount]));
end;
Also I have tested them for CP 1251 and 1252 as you do
function TSynTestCaseRandomString(CharCount: Integer): RawByteString;
var V: cardinal;
P: PAnsiChar;
begin
SetString(result,nil,CharCount);
P := pointer(Result);
while CharCount>0 do begin
if CharCount>5 then begin
V := Random(maxInt); // fast: one random compute per 5 chars
P[0] := AnsiChar(32+V and 127); V := V shr 7;
P[1] := AnsiChar(32+V and 127); V := V shr 7;
P[2] := AnsiChar(32+V and 127); V := V shr 7;
P[3] := AnsiChar(32+V and 127); V := V shr 7;
P[4] := AnsiChar(65+V);
Inc(P,5);
dec(CharCount,5);
end else begin
P^ := AnsiChar(32+Random(224));
inc(P);
dec(CharCount);
end;
end;
end;
procedure TestCP;
var
CP, i: Integer;
ST: array[0..10000] of RawByteString;
Sha: PConvTable;
begin
for i := 0 to high(ST) do ST[i] := TSynTestCaseRandomString(i shr 3);
for CP:=1251 to 1252 do begin;
if CP=1251 then Sha:=pConvDefault else Sha:=pConvLatin;
for i := 0 to high(ST) do
if ShaUTF8ToAnsi(ShaAnsiToUTF8(ST[i],Sha),Sha)<>ST[i] then begin;
Form1.Memo1.Lines.Add('ShaAnsiToUTF8'+IntToStr(CP));
Form1.Memo1.Lines.Add(ST[i]);
break;
end;
for i := 0 to high(ST) do
if ShaUnicodeToAnsi(ShaAnsiToUnicode(ST[i],Sha),Sha)<>ST[i] then begin;
Form1.Memo1.Lines.Add('ShaAnsiToUnicode'+IntToStr(CP));
Form1.Memo1.Lines.Add(ST[i]);
break;
end;
end;
end;
procedure TForm1.Button1Click(Sender: TObject);
begin
TestCP;
Form1.Memo1.Lines.Add('TestCP done');
end;
No errors found in all tests.
Can you please point more details about AV for CP 1252.
I use text of same length in both languages (Russian and English). Measure differs in 15 times.
So in real applications for long texts it is better to use win1251 coding but not utf8. It makes difficult using of JSON in my case.
My default code page is 1251. It can be used for test with all russian and english (ASCII) texts.
I think 15 ms resolution of GetTickCount is sufficient in our case.
Changing to CurrentAnsiConvert.AnsiToUTF8()/UTF8ToAnsi() shows the same speed.
AnsiToUtf8 time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 2547 140 594
WinAnsi 0 203 0
Syn 3422 203 313
Sha 2547 140 594
WinAnsi 0 203 0
Syn 3422 203 313
Utf8ToAnsi time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 2297 141 609
WinAnsi 0 219 0
Syn 3047 203 344
Sha 2234 156 610
WinAnsi 0 203 0
Syn 3062 203 329
I have used russian and english texts. Very simple tests. Some of them are here.
var
//137 chars
s1251: RawByteString='Лишь годные дятлы собираются в стаи, юникодом пугая мозги января. Их песни не стихнут, они не устанут. А елка как кактус беспокоит меня. ';
s1252: RawByteString='Only woodpeckers gather in flights, they frighten with unicode the brains of January. They were not tired, their songs will not abate. ';
a1251, a1252, a255: RawByteString;
u1251, u1252, u255: RawUnicode;
t1251, t1252, t255: RawUTF8;
Ticks: array[0..18] of cardinal;
TicksLast: integer;
Iterations: integer;
function NextTimer: integer;
begin;
inc(TicksLast);
if TicksLast<=High(Ticks) then Ticks[TicksLast]:=GetTickCount;
Result:=Iterations;
end;
function FirstTimer: integer;
begin;
TicksLast:=-1;
Result:=NextTimer;
end;
//--------------------------------------------------------------------------------------------------
procedure TForm1.FormCreate(Sender: TObject);
var
i: integer;
begin;
Iterations:=10000;
a1251:=s1251;
a1252:=s1252;
{}
for i:=1 to 8 do begin;
a1251:=a1251 + a1251;
a1252:=a1252 + a1252;
end;
//35209 chars
a1251:=a1251 + s1251;
a1252:=a1252 + s1252;
{}
a255:=a1252; a255[1]:=#255;
u1251:=ShaAnsiToUnicode(a1251,pConvDefault);
u1252:=ShaAnsiToUnicode(a1252,pConvDefault);
u255 :=ShaAnsiToUnicode(a255, pConvDefault);
t1251:=ShaUnicodeToUTF8(u1251);
t1252:=ShaUnicodeToUTF8(u1252);
t255 :=ShaUnicodeToUTF8(u255);
if (ShaUnicodeToAnsi(u1251,pConvDefault)<>a1251)
or (ShaUTF8ToUnicode(t1251)<>u1251)
or (ShaAnsiToUTF8(a1251,pConvDefault)<>t1251)
or (ShaUTF8ToAnsi(t1251,pConvDefault)<>a1251)
then ShowMessage('Error in FormCreate');
end;
//--------------------------------------------------------------------------------------------------
procedure TForm1.ShowResult(const Title: string);
begin;
Memo1.Lines.Add ('');
Memo1.Lines.Add (' ' + Title + ' time, ms');
Memo1.Lines.Add ('============================');
Memo1.Lines.Add (' Charset');
Memo1.Lines.Add ('Functions 1251 1252 1252*');
Memo1.Lines.Add ('----------------------------');
Memo1.Lines.Add(Format('Sha %5d %5d %5d',[Ticks[01]-Ticks[00],
Ticks[02]-Ticks[01],
Ticks[03]-Ticks[02]]));
Memo1.Lines.Add(Format('WinAnsi %5d %5d %5d',[Ticks[04]-Ticks[03],
Ticks[05]-Ticks[04],
Ticks[06]-Ticks[05]]));
Memo1.Lines.Add(Format('Syn %5d %5d %5d',[Ticks[07]-Ticks[06],
Ticks[08]-Ticks[07],
Ticks[09]-Ticks[08]]));
Memo1.Lines.Add(Format('Sha %5d %5d %5d',[Ticks[10]-Ticks[09],
Ticks[11]-Ticks[10],
Ticks[12]-Ticks[11]]));
Memo1.Lines.Add(Format('WinAnsi %5d %5d %5d',[Ticks[13]-Ticks[12],
Ticks[14]-Ticks[13],
Ticks[15]-Ticks[14]]));
Memo1.Lines.Add(Format('Syn %5d %5d %5d',[Ticks[16]-Ticks[15],
Ticks[17]-Ticks[16],
Ticks[18]-Ticks[17]]));
end;
//--------------------------------------------------------------------------------------------------
procedure TForm1.bAnsiToUtf8Click(Sender: TObject);
var
t: RawUTF8;
i: integer;
begin;
i:=FirstTimer;
repeat;
repeat;
t:=ShaAnsiToUtf8(a1251,pConvDefault);
t:=ShaAnsiToUtf8(a1251,pConvDefault);
dec(i); until i=0; i:=NextTimer;
repeat;
t:=ShaAnsiToUtf8(a1252,pConvDefault);
t:=ShaAnsiToUtf8(a1252,pConvDefault);
dec(i); until i=0; i:=NextTimer;
repeat;
t:=ShaAnsiToUtf8(a255,pConvDefault);
t:=ShaAnsiToUtf8(a255,pConvDefault);
dec(i); until i=0; i:=NextTimer;
repeat;
dec(i); until i=0; i:=NextTimer;
repeat;
t:=WinAnsiToUtf8(a1252);
t:=WinAnsiToUtf8(a1252);
dec(i); until i=0; i:=NextTimer;
repeat;
dec(i); until i=0; i:=NextTimer;
repeat;
AnsiCharToUTF8(pointer(a1251),length(a1251),t,1251);
AnsiCharToUTF8(pointer(a1251),length(a1251),t,1251);
dec(i); until i=0; i:=NextTimer;
repeat;
AnsiCharToUTF8(pointer(a1252),length(a1252),t,1251);
AnsiCharToUTF8(pointer(a1252),length(a1252),t,1251);
dec(i); until i=0; i:=NextTimer;
repeat;
AnsiCharToUTF8(pointer(a255),length(a255),t,1251);
AnsiCharToUTF8(pointer(a255),length(a255),t,1251);
dec(i); until i=0; i:=NextTimer;
until TicksLast>9;
ShowResult('AnsiToUtf8');
end;
//--------------------------------------------------------------------------------------------------
procedure TForm1.bUTF8ToAnsiClick(Sender: TObject);
var
t: AnsiString;
i: integer;
begin;
i:=FirstTimer;
repeat;
repeat;
t:=ShaUTF8ToAnsi(t1251,pConvDefault);
t:=ShaUTF8ToAnsi(t1251,pConvDefault);
dec(i); until i=0; i:=NextTimer;
repeat;
t:=ShaUTF8ToAnsi(t1252,pConvDefault);
t:=ShaUTF8ToAnsi(t1252,pConvDefault);
dec(i); until i=0; i:=NextTimer;
repeat;
t:=ShaUTF8ToAnsi(t255,pConvDefault);
t:=ShaUTF8ToAnsi(t255,pConvDefault);
dec(i); until i=0; i:=NextTimer;
repeat;
dec(i); until i=0; i:=NextTimer;
repeat;
t:=Utf8ToWinAnsi(t1252);
t:=Utf8ToWinAnsi(t1252);
dec(i); until i=0; i:=NextTimer;
repeat;
dec(i); until i=0; i:=NextTimer;
repeat;
t:=Utf8ToString(t1251);
t:=Utf8ToString(t1251);
dec(i); until i=0; i:=NextTimer;
repeat;
t:=Utf8ToString(t1252);
t:=Utf8ToString(t1252);
dec(i); until i=0; i:=NextTimer;
repeat;
t:=Utf8ToString(t255);
t:=Utf8ToString(t255);
dec(i); until i=0; i:=NextTimer;
until TicksLast>9;
ShowResult('Utf8ToAnsi');
end;
Speed tests (2 times each test)
of new ShaUnicode functions (unbuffered version 2012-02-09)
and new framework functions at E6850
AnsiToUnicode time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 562 563 562
WinAnsi 735
Syn 734 735 734
Sha 562 563 547
WinAnsi 734
Syn 750 735 734
UnicodeToAnsi time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 500 531 531
WinAnsi 813
Syn 1109 1109 1110
Sha 500 531 531
WinAnsi 813
Syn 1109 1109 1110
UnicodeToUtf8 time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 2266 359 485
Syn 2734 360 718
Sha 2266 359 469
Syn 2734 375 704
AnsiToUtf8 time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 2547 140 485
WinAnsi 218
Syn 3422 203 329
Sha 2546 141 484
WinAnsi 219
Syn 3422 203 328
Utf8ToUnicode time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 2109 453 406
Syn 2500 516 687
Sha 2110 453 406
Syn 2500 516 687
Utf8ToAnsi time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 2297 141 609
WinAnsi 219
Syn 3047 218 329
Sha 2250 156 609
WinAnsi 203
Syn 3063 219 328
Code page 1252 is very fast, but 1251 is not.
Yes, I see :-)
But the question is
why sometimes the framework calculates data in the buffer and then copies data to result string,
and sometimes it calculates data in the result string and then reallocates it?
Yes, I see.
The question is why the framework uses SetString (buffered result) in one case and SetLength (unbuffered result) in another case?
Thanks!
I will test it tomorrow.
One question.
Why TSynAnsiConvert.AnsiBufferToRawUTF8 calls Getmem, but RawUnicodeToUtf8 doesn't ?
Thanks a lot for sharing your code!
Some remarks:
- ShaAnsiToUnicode() will be correct only for 7 bits ascii - so I guess this is about Ansi7ToString() which is not often called so I did not modified it;
- I've updated RawUnicodeToUtf8(), WinAnsiBufferToUtf8(), UTF8ToWideChar(), UTF8ToWinPChar() to handle any trailing 7 bit ASCII AnsiChars, by pairs - this is a very nice trick in practice;
- I did not introduce CP 1252 specific optimization yet, since I'd like to implement a clean class-driven approach here - I've added it to the mORMot roadmap.
- ShaAnsiToUnicode() uses pointer to the translation table as well as all Ansi-functions from ShaUnicode.pas. So it trasforms correctly any Ansi-characters. You just need to call the function with pConvLatin or pConvDefault as second parameter.
- I think it is simple to use new unicode functions from ShaUnicode.pas changing *all* calls in SynCommons.pas:
WinAnsiToRawUnicode/StringToRawUnicode/... --> ShaAnsiToUnicode
RawUnicodeToWinAnsi/RawUnicodeToString/... --> ShaUnicodeToAnsi
WinAnsiToUTF8/AnsiCharToUTF8/... --> ShaAnsiToUTF8
RawUnicodeToUTF8/... --> ShaUnicodeToUTF8
UTF8DecodeToRawUnicode/... --> ShaUTF8ToUnicode
UTF8ToString/... --> ShaUTF8ToAnsi
- Main optimization for win1252 and user default code page is done in ShaUnicode.pas. Of course, we always may do more.
Hi
New unicode functions are here
Some coments (russian) are here
Timings at E6850, table headers:
Charset 1251 - russian text
Charset 1252 - english (ASCII) text
Charset 1252* - english (ASCII) text, but first char is russian
Sha - functions from ShaUnicode.pas
Syn - functions from SynCommons.pas
WinAnsi - special functions from SynCommons.pas for WinAnsiString
AnsiToUnicode time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 563 562 547
WinAnsi 766
Syn 734 734 735
UnicodeToAnsi time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 500 531 515
WinAnsi 938
Syn 828 1703 813
UnicodeToUtf8 time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 563 422 547
Syn 734 703 719
AnsiToUtf8 time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 594 219 594
WinAnsi 593
Syn 1469 1438 1453
Utf8ToUnicode time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 547 594 563
Syn 671 641 672
Utf8ToAnsi time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 687 219 703
WinAnsi 594
Syn 1047 1015 1032
Results at i5-2300
AnsiToUnicode time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 468 452 468
WinAnsi 0 718 0
Syn 686 702 702
UnicodeToAnsi time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 390 375 390
WinAnsi 0 936 0
Syn 982 1685 983
UnicodeToUtf8 time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 530 406 530
Syn 702 702 702
AnsiToUtf8 time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 562 171 562
WinAnsi 0 546 0
Syn 1404 1388 1404
Utf8ToUnicode time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 499 530 499
Syn 609 608 609
Utf8ToAnsi time, ms
============================
Charset
Functions 1251 1252 1252*
----------------------------
Sha 655 218 640
WinAnsi 0 562 0
Syn 1201 1185 1186
Some bugs fixed in ShaUnicode.pas 2012-02-05
I have tested some new functions. They are faster than framework's ones even on WinAnsiStrings.
I hope I will write full set of Ansi/Unicode/UTF8 conversions in a week.
Hi, Arnaud.
Here is my module for fast conversion
//Sha 2012
unit SynConversionTables;
interface
uses
Windows, // GetACP
SysUtils; // PWordArray
type
PConversionTable= ^TConversionTable;
TConversionTable= packed record
WideToAnsiW: packed array[0..127] of word;
WideToAnsiA: packed array[0..127] of byte;
WideToAnsiCount: integer;
CodePage: integer; //table code page
pDataExt: pWordArray;
DataCount: integer;
AnsiToWide: packed array[0..255] of word;
end;
var //READ ONLY VARS
pSynTableDefault: PConversionTable= nil; //table for default code page
pSynTable1252: PConversionTable= nil; //win1252 Latin table
pSynTable1251: PConversionTable= nil; //win1251 Cyrillic table
type
PtrInt= {$ifdef UNICODE} NativeInt {$else} integer {$endif};
//extended version of WinAnsiTableSortedFind
function FindAnsiChar(wc: cardinal; pTable: PConversionTable): PtrInt;
//get table by index
function GetSynTable(i: integer): PConversionTable;
//test of conversion tables
function TestSynTables: boolean;
implementation
const
Win1252Ext: packed array[0..31] of word = (
8364, 129, 8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249, 338, 141, 381, 143,
144, 8216, 8217, 8220, 8221, 8226, 8211, 8212, 732, 8482, 353, 8250, 339, 157, 382, 376);
Win1251Ext: packed array[0..127] of word = (
1026, 1027, 8218, 1107, 8222, 8230, 8224, 8225, 8364, 8240, 1033, 8249, 1034, 1036, 1035, 1039,
1106, 8216, 8217, 8220, 8221, 8226, 8211, 8212, 152, 8482, 1113, 8250, 1114, 1116, 1115, 1119,
160, 1038, 1118, 1032, 164, 1168, 166, 167, 1025, 169, 1028, 171, 172, 173, 174, 1031,
176, 177, 1030, 1110, 1169, 181, 182, 183, 1105, 8470, 1108, 187, 1112, 1029, 1109, 1111,
1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055,
1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071,
1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087,
1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103);
var
SynDefaultExt: packed array[0..127] of word;
SynDefaultCP: integer= 0; //default code page
SynTables: array of array of word;
function FindAnsiChar(wc: cardinal; pTable: PConversionTable): PtrInt;
var
Cur, Left, Right: PtrInt;
begin;
//fast search of main 32 letters for win1251
if (pTable=pSynTable1251) and (cardinal(wc-1040)<32) then begin;
Result:=wc-848;
exit;
end;
Right:=pTable.WideToAnsiCount;
Left:=0;
while Left<Right do begin;
Cur:=(Left + Right) shr 1;
if wc>pTable.WideToAnsiW[Cur] then Right:=Cur else Left:=Cur + 1;
end;
dec(Right);
if (Right>=0) and (wc=pTable.WideToAnsiW[Right])
then Result:=pTable.WideToAnsiA[Right]
else Result:=-1;
end;
function GetSynTable(i: integer): PConversionTable;
begin;
if (i>=0) and (i<Length(SynTables))
then Result:=@SynTables[i,0]
else Result:=nil;
end;
function TestFindChars(pTable: PConversionTable): integer;
var
i: integer;
begin;
Result:=0;
for i:=0 to pTable.DataCount-1 do if pTable.pDataExt[i]>255 then begin;
dec(Result);
if FindAnsiChar(pTable.pDataExt[i],pTable)<>i+128 then exit;
end;
Result:=-Result;
end;
function TestCountChars(pTable: PConversionTable): integer;
var
i: integer;
begin;
Result:=0;
for i:=$100 to $FFFF do if FindAnsiChar(i,pTable)>=0 then inc(Result);
end;
function TestSynTables: boolean;
var
i, FoundAll, CountAll: integer;
pTable: PConversionTable;
begin;
Result:=true;
for i:=0 to Length(SynTables)-1 do begin;
pTable:=@SynTables[i,0];
FoundAll:=TestFindChars(pTable);
CountAll:=TestCountChars(pTable);
Result:=Result and (FoundAll>0) and (CountAll=FoundAll);
end;
end;
procedure AddConversionTable(CodePage: integer; pDataExt: PWordArray; DataCount: integer);
var
save: array[0..127] of cardinal;
tmp: cardinal;
i, len, max: integer;
pTable: PConversionTable;
begin;
if (CodePage<=0) or (DataCount<=0) or (DataCount>128) then exit;
len:=Length(SynTables);
for i:=0 to len-1 do begin;
pTable:=@SynTables[i,0];
if pTable.CodePage=CodePage then exit;
end;
SetLength(SynTables,len+1);
SetLength(SynTables[len], SizeOf(TConversionTable) div SizeOf(word));
pTable:=@SynTables[len,0];
pTable.CodePage:=CodePage;
pTable.pDataExt:=pDataExt;
pTable.DataCount:=DataCount;
for i:=0 to 255 do pTable.AnsiToWide[i]:=i;
len:=0;
for i:=0 to DataCount-1 do if pDataExt[i]>255 then inc(len);
pTable.WideToAnsiCount:=len;
len:=0;
max:=0;
for i:=DataCount-1 downto 0 do begin;
pTable.AnsiToWide[i+128]:=pDataExt[i];
if pDataExt[i]>255 then begin;
save[len]:=integer(pDataExt[i]) shl 8 or (i+128);
if save[max]<save[len] then max:=len;
inc(len);
end;
end;
dec(len); // last index
// insertion sort of save[0..len]
tmp:=save[0]; save[0]:=save[max]; save[max]:=tmp;
i:=1;
while i<len do begin;
inc(i);
tmp:=save[i];
if tmp>save[i-1] then begin;
max:=i;
repeat;
save[max]:=save[max-1];
dec(max);
until tmp<=save[max-1];
save[max]:=tmp;
end;
end;
for i:=0 to len do begin;
pTable.WideToAnsiW[i]:=save[i] shr 8;
pTable.WideToAnsiA[i]:=byte(save[i]);
end;
if CodePage=1252 then pSynTable1252:=pTable;
if CodePage=1251 then pSynTable1251:=pTable;
if CodePage=SynDefaultCP then pSynTableDefault:=pTable;
end;
function InitConversionTables: boolean;
var
c: array[0..127] of byte;
i: integer;
begin;
SynDefaultCP:=GetACP;
AddConversionTable(1252, @Win1252Ext[0], Length(Win1252Ext)); //Latin
AddConversionTable(1251, @Win1251Ext[0], Length(Win1251Ext)); //Cyrillic
if pSynTableDefault=nil then begin;
for i:=0 to 127 do c[i]:=i+128;
MultiByteToWideChar(SynDefaultCP,0,@c[0],128,@SynDefaultExt[0],128);
AddConversionTable(SynDefaultCP, @SynDefaultExt[0], 128);
end;
Result:=(pSynTableDefault<>nil) and TestSynTables;
end;
procedure FinalConversionTables;
begin;
SynDefaultCP:=0;
pSynTableDefault:=nil;
pSynTable1252:=nil;
pSynTable1251:=nil;
SynTables:=nil;
end;
initialization
InitConversionTables;
finalization
FinalConversionTables;
end.
I suggest to replace WinAnsiTableSortedFind(wc) with FindAnsiChar(wc,pSynTable1252),
for example:
//Sha: new version
function WideCharToWinAnsiChar(wc: cardinal): AnsiChar;
begin // code generated for this function is very fast
if wc<256 then
//if WinAnsiTable[wc]<256 then begin
if pSynTable1252.AnsiToWide[wc]<256 then begin //Sha: use new table
result := AnsiChar(wc);
exit;
end else begin
result := ' '; // invalid ansi char for this code page (e.g. #128)
exit;
end else begin // wc>255:
//wc := WinAnsiTableSortedFind(wc);
wc:=FindAnsiChar(wc, pSynTable1252); //Sha: use new version of search
if integer(wc)>=0 then
result := AnsiChar(byte(wc)) else
result := ' '; // space for invalid wide char
exit;
end;
end;
//Sha: new version
function WideCharToWinAnsi(wc: cardinal): integer;
begin
if wc<256 then
//if WinAnsiTable[wc]<256 then
if pSynTable1252.AnsiToWide[wc]<256 then //Sha: use new table
result := wc else
result := -1 else // invalid ansi char for this code page (e.g. #128)
//result := WinAnsiTableSortedFind(wc);
result := FindAnsiChar(wc, pSynTable1252); //Sha: use new version of search
end;
//Sha: new version
function IsWinAnsiU(UTF8Text: PUTF8Char): boolean;
var c: Cardinal;
begin
result := false;
if UTF8Text<>nil then
repeat
c := byte(UTF8Text^); inc(UTF8Text);
if c=0 then break else
if c and $80=0 then
continue else begin
if UTF8Text^=#0 then break;
if c and $20=0 then begin
c := c shl 6+byte(UTF8Text^)-$00003080; inc(UTF8Text);
end else begin
c := c shl 6+byte(UTF8Text^); inc(UTF8Text);
if UTF8Text^=#0 then break;
c := c shl 6+byte(UTF8Text^)-$000E2080; inc(UTF8Text);
end;
if c>255 then begin
//if WinAnsiTableSortedFind(c)<0 then
if FindAnsiChar(c, pSynTable1252)<0 then //Sha: use new version of search
exit; // invalid char in the WinAnsi code page
end else
//if WinAnsiTable[c]>255 then
if pSynTable1252.AnsiToWide[c]>255 then //Sha: use new table
exit; // invalid char in the WinAnsi code page
end;
until false;
result := true;
end;
//Sha: new version
function UTF8ToWinPChar(dest: PAnsiChar; source: PUTF8Char; count: integer): integer;
var c: cardinal;
begd: PAnsiChar;
endSource: PUTF8Char;
begin
result := 0;
if source=nil then exit;
begd := dest;
endSource := source+count;
repeat
c := byte(source^); inc(source);
if byte(c) and $80=0 then begin
dest^ := AnsiChar(byte(c)); inc(dest);
if source<endsource then continue else break;
end else begin
if source>=endsource then break;
if c and $20=0 then begin
c := c shl 6+byte(source^)-$00003080; inc(source);
if c and $ffffff00=0 then begin
//if WinAnsiTable[c]>255 then
if pSynTable1252.AnsiToWide[c]>255 then //Sha: use new table
dest^ := ' ' else // invalid char in the WinAnsi code page
dest^ := AnsiChar(c);
inc(dest); // #128..#255 -> direct copy
if source<endsource then continue else break;
end;
end else begin
c := c shl 6+byte(source^); inc(source);
if source>=endsource then break;
c := c shl 6+byte(source^)-$000E2080; inc(source);
end;
// #256.. -> slower but accurate conversion
//c := WinAnsiTableSortedFind(c);
c := FindAnsiChar(c, pSynTable1252); //Sha: use new version of search
if integer(c)>=0 then begin
dest^ := AnsiChar(Byte(c)); // don't add invalid wide char
inc(dest);
end;
if source>=endsource then break;
end;
until false;
result := dest-begd;
end;
//Sha: new version
procedure UTF8ToShortString(var dest: shortstring; source: PUTF8Char);
var c: cardinal;
len: integer;
begin
len := 0;
if source<>nil then
repeat
c := byte(source^); inc(source);
if c=0 then break else
if c and $80=0 then begin
inc(len); dest[len] := AnsiChar(c);
if len<255 then continue else break;
end else begin
if source^=#0 then break;
if c and $20=0 then begin
c := c shl 6+byte(source^)-$00003080; inc(source);
end else begin
c := c shl 6+byte(source^); inc(source);
if source^=#0 then break;
c := c shl 6+byte(source^)-$000E2080; inc(source);
end;
// #256.. -> slower but accurate conversion
inc(len);
//c := WinAnsiTableSortedFind(c);
c := FindAnsiChar(c, pSynTable1252); //Sha: use new version of search
if integer(c)<0 then
c := ord('?');
dest[len] := AnsiChar(byte(c)); // #128..#255 -> direct copy
if len<255 then continue else break;
end;
until false;
dest[0] := AnsiChar(len);
end;
//Sha: new version
procedure RawUnicodeToWinPChar(dest: PAnsiChar; source: PWideChar; WideCharCount: Integer);
var i: integer;
wc: integer;
begin
for i := 0 to WideCharCount-1 do begin
wc := integer(source[i]);
if wc<256 then
//if WinAnsiTable[wc]<256 then
if pSynTable1252.AnsiToWide[wc]<256 then //Sha: use new table
dest[i] := AnsiChar(wc) else
dest[i] := ' ' else begin
//wc := WinAnsiTableSortedFind(wc);
wc := FindAnsiChar(wc, pSynTable1252); //Sha: use new version of search
if integer(wc)>=0 then
dest[i] := AnsiChar(byte(wc)) else
dest[i] := ' '; // space for invalid wide char
end;
end;
end;
It is easy to create new fast common functions w/o Windows API
by adding parameter, for example
//Sha: new function
function WideCharToSynAnsiChar(wc: cardinal; pSynTable: PConversionTable): AnsiChar;
begin // code generated for this function is very fast
if wc<256 then
if pSynTable.AnsiToWide[wc]<256 then begin
result := AnsiChar(wc);
exit;
end else begin
result := ' '; // invalid ansi char for this code page (e.g. #128)
exit;
end else begin // wc>255:
wc:=FindAnsiChar(wc, pSynTable);
if integer(wc)>=0 then
result := AnsiChar(byte(wc)) else
result := ' '; // space for invalid wide char
exit;
end;
end;
Unit has internal full self-test:
//how to validate all tables in use
procedure TForm1.bValidateClick(Sender: TObject);
const
msg: array[boolean] of string= ('failed', 'passed');
begin;
Memo1.Lines.Add('Test of conversion tables ' + msg[TestSynTables]);
end;
It is easy to add support for other code pages.
Just copy/paste data from TMemo to the unit.
//how to fill your default table
procedure TForm1.bShowWideClick(Sender: TObject);
var
c: array[0..127] of byte;
w: array[0..127] of word;
i: integer;
begin;
for i:=0 to 127 do c[i]:=i+128;
MultiByteToWideChar(GetACP,0,@c[0],128,@w[0],128);
i:=0;
while i<=128-16 do begin;
Memo1.Lines.Add(Format('{%d:} %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, ',
[i+128, w[i+0],w[i+1],w[i+2], w[i+3], w[i+4], w[i+5], w[i+6], w[i+7],
w[i+8],w[i+9],w[i+10],w[i+11],w[i+12],w[i+13],w[i+14],w[i+15]]));
i:=i+16;
end;
end;
Yes.
Here is code for demonstration the idea.
const
Ansi1252: packed array[128..159] of word = (
8364, 129, 8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249, 338, 141, 381, 143,
144, 8216, 8217, 8220, 8221, 8226, 8211, 8212, 732, 8482, 353, 8250, 339, 157, 382, 376);
Ansi1251: packed array[128..255] of word = (
1026, 1027, 8218, 1107, 8222, 8230, 8224, 8225, 8364, 8240, 1033, 8249, 1034, 1036, 1035, 1039,
1106, 8216, 8217, 8220, 8221, 8226, 8211, 8212, 152, 8482, 1113, 8250, 1114, 1116, 1115, 1119,
160, 1038, 1118, 1032, 164, 1168, 166, 167, 1025, 169, 1028, 171, 172, 173, 174, 1031,
176, 177, 1030, 1110, 1169, 181, 182, 183, 1105, 8470, 1108, 187, 1112, 1029, 1109, 1111,
1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055,
1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071,
1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081, 1082, 1083, 1084, 1085, 1086, 1087,
1088, 1089, 1090, 1091, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101, 1102, 1103);
type
TConversionTable= record
CodePage: integer;
WideToAnsiLast: integer;
WideToAnsi: array of cardinal; //0..WideToAnsiLast
AnsiToWide: array of word; //0..255
end;
PConversionTable= ^TConversionTable;
var
ConversionTable: array of TConversionTable;
ConversionTableLast: integer= -1;
procedure AddConversionTable(CodePage: integer; PWA: PWordArray; Count: integer);
var
i, len, min: integer;
tmp: cardinal;
pct: PConversionTable;
begin;
if CodePage=0 then CodePage:=GetACP;
for i:=0 to ConversionTableLast do if ConversionTable[i].CodePage=CodePage then exit;
inc(ConversionTableLast);
SetLength(ConversionTable, ConversionTableLast+1);
pct:=@ConversionTable[ConversionTableLast];
pct.CodePage:=CodePage;
SetLength(pct.AnsiToWide,256);
for i:=0 to 255 do pct.AnsiToWide[i]:=i;
len:=0;
for i:=0 to Count-1 do if PWA[i]>255 then inc(len);
SetLength(pct.WideToAnsi,len);
pct.WideToAnsiLast:=len-1;
len:=0;
min:=0;
for i:=0 to Count-1 do begin;
pct.AnsiToWide[i+128]:=PWA[i];
if PWA[i]>255 then begin;
pct.WideToAnsi[len]:=integer(PWA[i]) shl 8 or (i+128);
if pct.WideToAnsi[min]>pct.WideToAnsi[len] then min:=len;
inc(len);
end;
end;
// insertion sort of pct.WideToAnsi
if min>0 then begin;
tmp:=pct.WideToAnsi[0];
pct.WideToAnsi[0]:=pct.WideToAnsi[min];
pct.WideToAnsi[min]:=tmp;
end;
dec(len); // last index
i:=1;
while i<len do begin;
inc(i);
if pct.WideToAnsi[i]<pct.WideToAnsi[i-1] then begin;
tmp:=pct.WideToAnsi[i];
min:=i;
repeat;
pct.WideToAnsi[min]:=pct.WideToAnsi[min-1];
dec(min);
until tmp>=pct.WideToAnsi[min-1];
pct.WideToAnsi[min]:=tmp;
end;
end;
end;
procedure InitConversionTables;
begin;
AddConversionTable(1252, @Ansi1252[Low(Ansi1252)], High(Ansi1252)-Low(Ansi1252)+1);
AddConversionTable(1251, @Ansi1251[Low(Ansi1251)], High(Ansi1251)-Low(Ansi1251)+1);
end;
function FindAnsiChar(wc: cardinal; CP: integer= 1252): integer;
var
i, left, right: PtrInt;
pct: PConversionTable;
begin;
i:=ConversionTableLast;
while (i>=0) and (ConversionTable[i].CodePage<>CP) do dec(i);
if i>=0 then begin;
pct:=@ConversionTable[i];
right:=pct.WideToAnsiLast;
left:=-1;
wc:=wc shl 8;
while left<right do begin;
i:=(left + right + 1) shr 1;
if pct.WideToAnsi[i]<wc then left:=i else right:=i - 1;
end;
inc(left);
if left<=pct.WideToAnsiLast then begin;
wc:=wc xor pct.WideToAnsi[left];
if wc<256 then begin;
Result:=wc;
exit;
end;
end;
end;
Result:=-1; // invalid wide char or CP not found
end;
function TestFindChars(CodePage: integer; PWA: PWordArray; Count: integer): boolean;
var
i: integer;
begin;
Result:=false;
for i:=0 to Count-1 do if (PWA[i]>255) and (FindAnsiChar(PWA[i],CodePage)<>i+128) then exit;
Result:=true;
end;
function TestCountChars(CodePage: integer): integer;
var
i: integer;
begin;
Result:=0;
for i:=$100 to $FFFF do if FindAnsiChar(i,CodePage)>=0 then inc(Result);
end;
procedure TForm1.Button4Click(Sender: TObject);
const
msg: array[boolean] of string= ('failed', 'OK');
var
FoundAll: boolean;
CountAll: integer;
begin;
InitConversionTables;
FoundAll:=TestFindChars(1252, @Ansi1252[Low(Ansi1252)], High(Ansi1252)-Low(Ansi1252)+1);
CountAll:=TestCountChars(1252);
Memo1.Lines.Add(Format('CP1252: test1 %s, test2 %s',[msg[FoundAll], msg[CountAll=27]]));
FoundAll:=TestFindChars(1251, @Ansi1251[Low(Ansi1251)], High(Ansi1251)-Low(Ansi1251)+1);
CountAll:=TestCountChars(1251);
Memo1.Lines.Add(Format('CP1251: test1 %s, test2 %s',[msg[FoundAll], msg[CountAll=112]]));
end;
procedure TForm1.Button2Click(Sender: TObject);
var
c: array[0..255] of byte;
w: array[0..255] of word;
i: integer;
begin;
for i:=0 to 255 do c[i]:=i;
MultiByteToWideChar(1251,0,@c[0],256,@w[0],256);
i:=128;
while i<=256-16 do begin;
Memo1.Lines.Add(Format('%d: %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d, ',
[i, w[i+0],w[i+1],w[i+2], w[i+3], w[i+4], w[i+5], w[i+6], w[i+7],
w[i+8],w[i+9],w[i+10],w[i+11],w[i+12],w[i+13],w[i+14],w[i+15]]));
i:=i+16;
end;
end;
OK.
1 using found in SynPdf. (Note: pdf test also failed on CP 1251).
function TPdfWrite.ToWideChar(const Ansi: PDFString; out DLen: Integer): PWideChar;
var L: integer;
begin
L := Length(Ansi)*2+2; // maximum possible length
getmem(result,L);
if FCodePage=CODEPAGE_US then begin // use our internal fast conversion
DLen := Length(Ansi);
WinAnsiToUnicodeBuffer(WinAnsiString(Ansi), pointer(result), DLen+1);
end else begin
{$IFDEF MSWINDOWS}
DLen := MultiByteToWideChar(FCodePage, 0, Pointer(Ansi), length(Ansi), result, L);
result[DLen] := #0;
{$ENDIF}
{$IFDEF LINUX}
StringToWideChar(Ansi, result, L); // only work with current system CharSet
DLen := 0; while result[DLen]<>#0 do inc(DLen);
{$ENDIF}
end;
end;
WinAnsiString is used here for acceleration of work with CP 1252.
No specificity. So WinAnsiString = Win1252String.
I suggest acceleration for all Ansi-strings. Why not?
Certainly the code guarantees correct transformation for CP 1252.
It is good to do the same for CP 1251 and for others.
WinAnsiString are CP 1252 exactly, by definition - and some units in the framework expect this behavior.
I do not want to make it Code Page independent.
Сan I learn, what for it is necessary?
Details are here.
Procedure TTestLowLevelCommon._UTF8 in statement U := WinAnsiToUtf8(W);
calls WinAnsiToUtf8 -> WinAnsiBufferToUtf8 -> c := WinAnsiTable[c];
and assertion failed on Check(StringToUTF8(UTF8ToString(U))=U);
I am thinking about version for user defined CP (1251/1252 or other supported page)
to work with any Ansi-strings including Cyrillic strings.
I am going to move all unicode support to different unit for easy replacement with framework unit.
Is it possible?
The idea is filling or switching tables of fast transformations (for example, WinAnsiTable) dynamically.
Framework fills tables when the user globally set value of his CP (1251/1252 or other).
Calls of GetACP become unnecessary.
I found that function WinAnsiBufferToUtf8 uses WinAnsiTable (for code page 1252) only.
Second question. How to speed up working with my default code page (1251)?
Pages: 1