After some speed debates in the Delphi community, I've rewritten the _CopyRecord function of the system.pas unit, with speed in mind.
Here is the resulting code, which should work from Delphi 7 up to 2010.
procedure _CopyRecord{ dest, source, typeInfo: Pointer };
asm // faster version by AB
{ -> EAX pointer to dest }
{ EDX pointer to source }
{ ECX pointer to typeInfo }
push ebp
push ebx
push esi
push edi
movzx ebx,byte ptr [ecx].TTypeInfo.Name[0]
mov esi,edx // esi = source
mov edi,eax // edi = dest
add ebx,ecx // ebx = TFieldTable
xor eax,eax // eax = current offset
mov ebp,[ebx].TFieldTable.Count // ebp = TFieldInfo count
mov ecx,[ebx].TFieldTable.Size
test ebp,ebp
jz @fullcopy
push ecx // sizeof(record) on stack
add ebx,offset TFieldTable.Fields[0] // ebx = first TFieldInfo
@next: mov ecx,[ebx].TFieldInfo.&Offset
mov edx,[ebx].TFieldInfo.TypeInfo
sub ecx,eax
mov edx,[edx]
jle @nomov
lea esi,[esi+ecx]
lea edi,[edi+ecx]
neg ecx
@mov1: mov al,[esi+ecx] // fast inline copy of the plain (non-managed) bytes
mov [edi+ecx],al
inc ecx
jnz @mov1
@nomov: mov eax,edi
movzx ecx,byte ptr [edx] // data type (TTypeKind) of the managed field
cmp ecx,tkLString
je @@LString
jb @@err
{$ifdef UNICODE}
cmp ecx,tkUString
je @@UString
{$else} cmp ecx,tkDynArray
je @@DynArray
{$endif}
ja @@err
jmp dword ptr [ecx*4+@@tab-tkWString*4]
@@Tab: dd @@WString,@@Variant,@@Array,@@Record,@@Interface,@@err
{$ifdef UNICODE}dd @@DynArray{$endif}
@@errv: mov al,reVarInvalidOp
jmp @@err2
@@err: mov al,reInvalidPtr
@@err2: pop edi
pop esi
pop ebx
pop ebp
jmp Error
nop // all functions below have esi=source edi=dest
@@Array:
movzx ecx,byte ptr [edx].TTypeInfo.Name[0]
push dword ptr [edx+ecx].TFieldTable.Size
push dword ptr [edx+ecx].TFieldTable.Count
mov ecx,dword ptr [edx+ecx].TFieldTable.Fields[0]
mov ecx,[ecx]
mov edx,esi
call _CopyArray
pop eax // restore sizeof(Array)
jmp @@finish
@@Record:
movzx ecx,byte ptr [edx].TTypeInfo.Name[0]
mov ecx,[edx+ecx].TFieldTable.Size
push ecx
mov ecx,edx
mov edx,esi
call _CopyRecord
pop eax // restore sizeof(Record)
jmp @@finish
nop;nop;nop
@@Variant:
cmp dword ptr [VarCopyProc],0
mov edx,esi
jz @@errv
call [VarCopyProc]
mov eax,16
jmp @@finish
@@Interface:
mov edx,[esi]
call _IntfCopy
jmp @@fin4
nop
@@DynArray:
mov ecx,edx // ecx=TypeInfo
mov edx,[esi]
call _DynArrayAsg
jmp @@fin4
@@WString:
{$ifndef LINUX}
mov edx,[esi]
call _WStrAsg
jmp @@fin4
nop;nop
{$endif}
@@LString:
mov edx,[esi]
call _LStrAsg
{$ifdef UNICODE}
jmp @@fin4
nop; nop
@@UString:
mov edx,[esi]
call _UStrAsg
{$endif}
@@fin4: mov eax,4
@@finish:
add esi,eax
add edi,eax
add eax,[ebx].TFieldInfo.&Offset
dec ebp // any other TFieldInfo?
lea ebx,[ebx+8] // next TFieldInfo
jnz @next
pop ecx // ecx= sizeof(record)
@fullcopy:
mov edx,edi
sub ecx,eax
mov eax,esi
jle @nomov2
call move
@nomov2:pop edi
pop esi
pop ebx
pop ebp
end;
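For reference, the field-table layout that this asm walks (via [ebx].TFieldTable.* and [ebx].TFieldInfo.*) is roughly declared as follows; this is a sketch from memory, so check the downloadable CopyRecord.pas for the exact declarations (TTypeKind and PPTypeInfo come from the TypInfo unit):

type
  TFieldInfo = packed record
    TypeInfo: PPTypeInfo; // dereferenced once in the asm (mov edx,[edx])
    Offset: Cardinal;     // written as &Offset in BASM, since "offset" is an asm operator
  end;
  TFieldTable = packed record
    Kind: TTypeKind;  // = tkRecord
    NameLen: Byte;    // the type name characters follow, hence the "add ebx,ecx" trick
    Size: Cardinal;   // total size of the record in bytes
    Count: Cardinal;  // number of managed (reference-counted) fields
    Fields: array[0..0] of TFieldInfo;
  end;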
I've tested this source code with some unit tests, and IMHO it works fine. The speed increase is noticeable. At the very least, my code is much more readable than the original from Borland/Embarcadero, since I spelled out the field structures (TFieldInfo/TFieldTable) and commented the source.
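To see the routine in action: a plain record assignment like the one below is what the compiler turns into a call to _CopyRecord whenever the record contains managed fields. The type and field names in this little example are mine, purely for illustration:

procedure CopyRecordDemo;
type
  TPerson = record
    ID: Integer;            // plain field: handled by the inlined @mov1 byte loop
    Name: string;           // managed field: copied via _LStrAsg (or _UStrAsg since Delphi 2009)
    Tags: array of Integer; // managed field: copied via _DynArrayAsg
  end;
var
  A, B: TPerson;
begin
  A.ID := 1;
  A.Name := 'Arnaud';
  SetLength(A.Tags, 2);
  B := A; // the compiler emits something like _CopyRecord(@B, @A, TypeInfo(TPerson)) here
end;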
If you can tell whether my inlined code at @mov1 is faster than a "call Move", please let me know! It seems that Move is faster than my inlined version most of the time.
The code and test function can be downloaded from http://synopse.info/files/CopyRecord.pas
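By the way, if you want to try the @mov1 versus Move comparison outside the RTL, a quick-and-dirty timing loop like the one below gives a first approximation; this is just a sketch of mine (CopyBytes, the buffer size and the loop count are arbitrary), not the test function shipped in CopyRecord.pas:

uses
  Windows; // GetTickCount

procedure CopyBytes(Source, Dest: PAnsiChar; Count: Integer);
var
  i: Integer;
begin
  for i := 0 to Count - 1 do
    Dest[i] := Source[i]; // same work as the @mov1 byte loop
end;

procedure BenchInlineVsMove;
const
  SIZE = 64;        // typical small record size
  LOOPS = $1000000;
var
  Src, Dst: array[0..SIZE - 1] of Byte;
  i: Integer;
  T1, T2: Cardinal;
begin
  FillChar(Src, SIZE, 1);
  T1 := GetTickCount;
  for i := 1 to LOOPS do
    CopyBytes(PAnsiChar(@Src), PAnsiChar(@Dst), SIZE);
  T1 := GetTickCount - T1;
  T2 := GetTickCount;
  for i := 1 to LOOPS do
    Move(Src, Dst, SIZE);
  T2 := GetTickCount - T2;
  WriteLn('byte loop: ', T1, ' ms - Move: ', T2, ' ms');
end;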
Arnaud and I were discussing the speed of memory copying using various instructions.
I was quite interested to see how exactly they perform on actual machines, and wrote a little test program.
I ran it on several machines with different generations of processors, and got some really interesting results.

The y-axis shows the high-performance counter ticks;
The x-axis shows the number of bytes to copy;
Each data point is the time used to perform $1000 copies of the corresponding number of bytes;
The "stub" shows the overhead time, i.e. the time spent looping $1000 times and making the function calls.

Pentium M 2.0G
http://i42.tinypic.com/v2uu5u.png

Core2Duo 2.4G
http://i39.tinypic.com/6dzgie.png

Atom 1.6G
http://i44.tinypic.com/2w2lfyt.png

Xeon (Core i) 2.4G
http://i42.tinypic.com/4scysm.png

Overall, I think the best option is to call the Move function.
On Pentium M and Core2, rep movsb is always slower than loop byte copying;
rep movsd has a high start-up cost, but tends to perform on par with the Move function when the size gets large.

On Atom, rep movsb seems to beat loop byte copying from size 10 up;
rep movsd again has a high start-up cost, but both rep movs variants tend to perform on par with the Move function when the size gets large.

On Core i, rep movsb has the highest start-up cost when size > 3, but it becomes faster than loop byte copying from size 24~30;
the start-up cost of rep movsd is still a bit high, but it gains speed quickly, and even outperforms the Move function from size 38~44.

Another interesting thing to note: some processors run faster when counting up to zero, others are better at counting down...
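For reference, rep movsb / rep movsd copy routines in Delphi BASM look roughly like the ones below; this is a minimal sketch with names of my own, not the actual test program (EAX/EDX/ECX carry the parameters under the default register calling convention):

procedure CopyRepMovsb(Source, Dest: Pointer; Count: Integer);
asm // EAX=Source, EDX=Dest, ECX=Count
  push esi
  push edi
  mov esi,eax  // ESI = source
  mov edi,edx  // EDI = dest
  rep movsb    // copy ECX bytes, one byte at a time
  pop edi
  pop esi
end;

procedure CopyRepMovsd(Source, Dest: Pointer; Count: Integer);
asm // EAX=Source, EDX=Dest, ECX=Count in bytes
  push esi
  push edi
  mov esi,eax
  mov edi,edx
  push ecx
  shr ecx,2    // copy as many DWORDs as possible
  rep movsd
  pop ecx
  and ecx,3    // then the 0..3 remaining bytes
  rep movsb
  pop edi
  pop esi
end;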
Extracted from https://forums.embarcadero.com/thread.j … eID=231468
In your blog post dated March 23, 2010, you mentioned a faster CopyRecord implementation inspired by the discussion that started when I posted about TValue's speed issues. You said you could produce a FastCode-style patch unit for your CopyRecord improvement, but it doesn't look like you ever followed through on that.
Have you set one up and I just can't find the link? If not, would you mind preparing one? I'm working on some RTTI-heavy code, and any speed improvement would be quite welcome.
Thank you
OK, I've done my homework... and the CopyRecord.pas file is now a unit with a FastCode-style patch.
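In case anyone wonders what a FastCode-style patch does in practice: the usual technique is to overwrite, at startup, the first five bytes of the RTL routine with a relative JMP to the replacement. The helper below is a generic sketch of that redirection pattern (the name RedirectProc is mine, and it deliberately omits the part that locates the System._CopyRecord entry point), not the exact code from CopyRecord.pas:

uses
  Windows;

procedure RedirectProc(OldProc, NewProc: Pointer);
type
  TJump = packed record
    Code: Byte;      // $E9 = near JMP rel32
    Offset: Integer; // displacement relative to the end of the JMP instruction
  end;
  PJump = ^TJump;
var
  OldProtect: DWORD;
  Jump: PJump;
begin
  // make the code page writable, write the JMP, then restore the protection
  if VirtualProtect(OldProc, SizeOf(TJump), PAGE_EXECUTE_READWRITE, OldProtect) then
  begin
    Jump := OldProc;
    Jump^.Code := $E9;
    Jump^.Offset := Integer(NewProc) - Integer(OldProc) - SizeOf(TJump);
    VirtualProtect(OldProc, SizeOf(TJump), OldProtect, OldProtect);
    FlushInstructionCache(GetCurrentProcess, OldProc, SizeOf(TJump));
  end;
end;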