After some speed debates in the Delphi community, I've rewritten the _CopyRecord function from the System.pas unit, with speed in mind.
Here is the resulting code, which should work from Delphi 7 up to 2010.
procedure _CopyRecord{ dest, source, typeInfo: Pointer };
asm // faster version by AB
{ -> EAX pointer to dest }
{ EDX pointer to source }
{ ECX pointer to typeInfo }
push ebp
push ebx
push esi
push edi
movzx ebx,byte ptr [ecx].TTypeInfo.Name[0]
mov esi,edx // esi = source
mov edi,eax // edi = dest
add ebx,ecx // ebx = TFieldTable
xor eax,eax // eax = current offset
mov ebp,[ebx].TFieldTable.Count // ebp = TFieldInfo count
mov ecx,[ebx].TFieldTable.Size
test ebp,ebp
jz @fullcopy
push ecx // sizeof(record) on stack
add ebx,offset TFieldTable.Fields[0] // ebx = first TFieldInfo
@next: mov ecx,[ebx].TFieldInfo.&Offset
mov edx,[ebx].TFieldInfo.TypeInfo
sub ecx,eax
mov edx,[edx]
jle @nomov
lea esi,[esi+ecx]
lea edi,[edi+ecx]
neg ecx
@mov1: mov al,[esi+ecx] // byte copy of the plain, non-managed data before this field
mov [edi+ecx],al
inc ecx
jnz @mov1
@nomov: mov eax,edi
movzx ecx,byte ptr [edx] // data type
cmp ecx,tkLString
je @@LString
jb @@err
{$ifdef UNICODE}
cmp ecx,tkUString
je @@UString
{$else}
cmp ecx,tkDynArray
je @@DynArray
{$endif}
ja @@err
jmp dword ptr [ecx*4+@@tab-tkWString*4]
@@Tab: dd @@WString,@@Variant,@@Array,@@Record,@@Interface,@@err
{$ifdef UNICODE}dd @@DynArray{$endif}
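// note: the table above is indexed by (type kind - tkWString): its entries cover
// tkWString..tkInt64 (tkInt64 falls through to @@err), plus tkDynArray when UNICODE is defined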
@@errv: mov al,reVarInvalidOp
jmp @@err2
@@err: mov al,reInvalidPtr
@@err2: pop edi
pop esi
pop ebx
pop ebp
jmp Error
nop // all functions below have esi=source edi=dest
@@Array:
movzx ecx,byte ptr [edx].TTypeInfo.Name[0]
push dword ptr [edx+ecx].TFieldTable.Size
push dword ptr [edx+ecx].TFieldTable.Count
mov ecx,dword ptr [edx+ecx].TFieldTable.Fields[0]
mov ecx,[ecx]
mov edx,esi
call _CopyArray
pop eax // restore sizeof(Array)
jmp @@finish
@@Record:
movzx ecx,byte ptr [edx].TTypeInfo.Name[0]
mov ecx,[edx+ecx].TFieldTable.Size
push ecx
mov ecx,edx
mov edx,esi
call _CopyRecord
pop eax // restore sizeof(Record)
jmp @@finish
nop;nop;nop
@@Variant:
cmp dword ptr [VarCopyProc],0
mov edx,esi
jz @@errv
call [VarCopyProc]
mov eax,16
jmp @@finish
@@Interface:
mov edx,[esi]
call _IntfCopy
jmp @@fin4
nop
@@DynArray:
mov ecx,edx // ecx=TypeInfo
mov edx,[esi]
call _DynArrayAsg
jmp @@fin4
@@WString:
{$ifndef LINUX}
mov edx,[esi]
call _WStrAsg
jmp @@fin4
nop;nop
{$endif}
@@LString:
mov edx,[esi]
call _LStrAsg
{$ifdef UNICODE}
jmp @@fin4
nop; nop
@@UString:
mov edx,[esi]
call _UStrAsg
{$endif}
@@fin4: mov eax,4
@@finish:
add esi,eax
add edi,eax
add eax,[ebx].TFieldInfo.&Offset
dec ebp // any other TFieldInfo?
lea ebx,[ebx+8] // next TFieldInfo
jnz @next
pop ecx // ecx= sizeof(record)
@fullcopy:
mov edx,edi
sub ecx,eax
mov eax,esi
jle @nomov2
call move
@nomov2:pop edi
pop esi
pop ebx
pop ebp
end;
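For context, the compiler only emits a call to _CopyRecord when the record contains managed fields (long/Unicode strings, wide strings, variants, interfaces, dynamic arrays, or nested records/static arrays of those); a plain record is copied with a simple block move. A minimal illustration of the kind of type that exercises the routine above (the names below are just an example, not part of the patch):

type
  // any record containing managed fields is copied via System._CopyRecord
  TSample = record
    ID: Integer;              // plain data: copied byte by byte
    Name: string;             // tkLString (or tkUString since Delphi 2009)
    Values: array of Double;  // tkDynArray
  end;

procedure CopyOneRecord;
var
  A, B: TSample;
begin
  A.ID := 1;
  A.Name := 'test';
  SetLength(A.Values, 3);
  B := A; // compiled as a call to _CopyRecord(@B, @A, TypeInfo(TSample))
end;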
I've run this source code through some unit tests, and IMHO it works fine. The speed increase is noticeable. My code is also much more readable than the original from Borland/Embarcadero, since I spelled out the field-table structures (TFieldTable/TFieldInfo) and commented the source.
If you can tell whether my inlined loop at @mov1 is faster than a "call Move", please let me know! It seems that Move is faster than my inlined version most of the time.
The code and test function can be downloaded from http://synopse.info/files/CopyRecord.pas
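For anyone who wants to try the @mov1-versus-Move comparison themselves, here is a rough sketch of such a micro-benchmark (this is my own illustration, not the test code from CopyRecord.pas; timing is Windows-only via QueryPerformanceCounter):

uses
  Windows, SysUtils; // QueryPerformanceCounter, PByteArray

// plain byte-by-byte copy, similar in spirit to the @mov1 loop above
procedure ByteCopy(const Source; var Dest; Len: Integer);
var
  i: Integer;
begin
  for i := 0 to Len - 1 do
    PByteArray(@Dest)^[i] := PByteArray(@Source)^[i];
end;

// Len is expected to be <= 256 here
procedure CompareCopies(Len: Integer);
const
  LOOPS = $100000;
var
  Src, Dst: array[0..255] of Byte;
  i: Integer;
  Start, Stop: Int64;
begin
  FillChar(Src, SizeOf(Src), 1);
  QueryPerformanceCounter(Start);
  for i := 1 to LOOPS do
    ByteCopy(Src, Dst, Len);
  QueryPerformanceCounter(Stop);
  Writeln('byte loop: ', Stop - Start);
  QueryPerformanceCounter(Start);
  for i := 1 to LOOPS do
    Move(Src, Dst, Len);
  QueryPerformanceCounter(Stop);
  Writeln('Move     : ', Stop - Start);
end;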
Arnaud and I were discussing the speed of memory copying using various instructions.
I was quite interested to see how exactly they perform on actual machines, and wrote a little test program.
I ran it on several machines with different generations of processors, and got some really interesting results.
The y-axis shows the high-performance counter ticks;
The x-axis shows the number of bytes to copy;
Each data point is the time used to perform $1000 copies of the corresponding number of bytes;
The "stub" shows the overhead time, i.e. the time spent looping $1000 times and making the function calls.
Pentium M 2.0G: http://i42.tinypic.com/v2uu5u.png
Core2Duo 2.4G: http://i39.tinypic.com/6dzgie.png
Atom 1.6G: http://i44.tinypic.com/2w2lfyt.png
Xeon (Core i) 2.4G: http://i42.tinypic.com/4scysm.png
Overall, I think the best option is to call the Move function.
On Pentium M and Core2, rep movsb is always slower than a byte-copy loop; rep movsd has a high start-up cost, but tends to perform on par with the Move function when the size gets large.
On Atom, rep movsb seems to beat the byte-copy loop from size 10 up; rep movsd again has a high start-up cost, but both rep movs variants tend to perform on par with the Move function when the size gets large.
On Core i, rep movsb has the highest start-up cost when size > 3, but it becomes faster than the byte-copy loop from size 24~30; the start-up cost of rep movsd is still a bit high, but it gains speed quickly, and even outperforms the Move function from size 38~44.
Another interesting thing to note: some processors run faster when counting up to zero, some are better at counting down...
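For readers who want to reproduce such curves, the variants being compared probably look something like the following BASM routines (my own reconstruction for illustration, not the exact code behind the plots); each copies Count bytes from Source to Dest using the same register convention as the RTL Move:

// reconstruction for illustration only (not the exact benchmark code):
// EAX=@Source, EDX=@Dest, ECX=Count; assumes DF=0, as the Delphi calling convention requires

procedure CopyRepMovsb(const Source; var Dest; Count: Integer);
asm
        push esi
        push edi
        mov esi,eax
        mov edi,edx
        rep movsb          // one byte per iteration
        pop edi
        pop esi
end;

procedure CopyRepMovsd(const Source; var Dest; Count: Integer);
asm
        push esi
        push edi
        mov esi,eax
        mov edi,edx
        push ecx
        shr ecx,2
        rep movsd          // copy dwords first...
        pop ecx
        and ecx,3
        rep movsb          // ...then the remaining 0..3 bytes
        pop edi
        pop esi
end;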
Extracted from https://forums.embarcadero.com/thread.j … eID=231468
In your blog post dated March 23, 2010, you mentioned a faster CopyRecord implementation inspired by the discussion that came about when I posted about TValue's speed issues. You mentioned that you could produce a FastCode-style patch unit for your CopyRecord improvement, but it doesn't look like you ever followed through on that.
Have you set one up and I just can't find the link? If not, would you mind preparing one? I'm working on some RTTI-heavy code, and any speed improvement would be quite welcome.
Thank you
OK, I've done my homework... and the CopyRecord.pas file is now a unit with a FastCode-style patch.
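For those who haven't seen such a unit before, a FastCode-style patch typically overwrites the first five bytes of the original RTL routine with a near JMP to the replacement, from the patch unit's initialization section. A minimal sketch of that redirection follows (locating System.@CopyRecord itself is the harder part, and the real CopyRecord.pas does it in its own way):

uses
  Windows;

// minimal sketch of a FastCode-style redirection: overwrite the first
// 5 bytes of the original routine with "jmp rel32" to the replacement
procedure RedirectProcedure(OldProc, NewProc: Pointer);
var
  OldProtect: Cardinal;
  RelOffset: Integer;
begin
  // displacement is relative to the instruction following the 5-byte jmp
  RelOffset := Integer(NewProc) - Integer(OldProc) - 5;
  VirtualProtect(OldProc, 5, PAGE_EXECUTE_READWRITE, OldProtect);
  PByte(OldProc)^ := $E9;                       // jmp rel32 opcode
  PInteger(Integer(OldProc) + 1)^ := RelOffset; // 32-bit displacement
  VirtualProtect(OldProc, 5, OldProtect, OldProtect);
  FlushInstructionCache(GetCurrentProcess, OldProc, 5);
end;

Such a RedirectProcedure helper would then be called once at startup, e.g. RedirectProcedure(AddressOfSystemCopyRecord, @NewCopyRecord), from the unit's initialization section.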