#1 2010-07-15 16:08:06

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Delphi doesn't like multi-core CPUs (or the contrary)

If you're like me, you are proud of the new CPU your computer runs on - in my case an i7-720Q with 4 cores and 8 hardware threads...

But Delphi is not very multi-thread or multi-core friendly... guess why...

The root cause is the LOCK prefix

All x86 CPUs are equipped with the ability to lock a specific memory address, preventing other bus agents from reading or modifying it while the locked instruction runs.

Adding the LOCK prefix to an assembly instruction causes the CPU to assert the LOCK# signal, which practically ensures exclusive use of the memory address in multi-processor / multi-threaded environments.

The LOCK prefix works only with the following instructions:

BT, BTS, BTR, BTC   (mem, reg/imm)
XCHG, XADD  (reg, mem / mem, reg)
ADD, OR, ADC, SBB   (mem, reg/imm)
AND, SUB, XOR   (mem, reg/imm)
NOT, NEG, INC, DEC  (mem)

Note: XCHG with a memory operand is special: it always asserts LOCK#, regardless of the presence of the LOCK prefix. The other read-modify-write instructions above (including XADD) are atomic only when the LOCK prefix is explicitly present.

These low-level LOCK mechanisms ensure that some memory is modified by only one thread at a time.

So what is wrong with these LOCKed instructions?

On a multi-core CPU, all cores stall in order to make this LOCKed instruction thread-safe: if you have a lot of threads on more than one core, every core has to hold off its pending memory accesses, all cores wait for the LOCKed instruction to complete, and only then does execution continue.

So guess what... when the CPU has to execute such instructions, all cores just stall, and your brand new 8-core CPU runs like a single-core one...

This is the same LOCKed mechanism which is used internally by Windows for its Critical Sections. That's why Windows itself is said not to be very multi-core friendly: it uses a lot of critical sections in its internals... Linux is much more advanced, and scales pretty well on massive multi-core architectures.

What about Delphi?

In Delphi, I discovered at least two performance problems in its RTL:
1. The default memory manager, i.e. FastMM4, uses a LOCKed asm instruction for every memory allocation or deallocation.
2. String types and dynamic arrays use the same LOCKed asm instruction everywhere, i.e. for every access which may lead to a write to the string.

See what I wrote in the Embarcadero forum... the post was not very popular, but I think I've raised a real issue about the Delphi compiler internals and performance here - and I don't think Embarcadero has any plans to resolve it...

IMHO if you use strings in your application and need speed, using another memory manager than FastMM4 is not enough. You'll have to avoid most string use, and implement a safe TStringBuilder-like class.

ShortStrings could be handy here, even if they are limited to 255 characters.

Using plain PAnsiChar and fixed buffers on the stack is also a solution, but it must be done safely...

Our enhanced RTL

In our enhanced RTL for Delphi 7, we avoid this LOCKed asm instruction if your application has only one thread: so if you use our enhanced RTL and create threads yourself (not using the TThread object), you'll get the best multi-threaded performance possible.
For example, here is how we coded the clearing of a string:

procedure _LStrClr(var S);
asm     { ->    EAX pointer to str      }
        MOV     EDX,[EAX]                       { fetch str                     }
        TEST    EDX,EDX                         { if nil, nothing to do         }
        JE      @@done 
        lea ecx,[edx-skew].StrRec.refCnt
        mov dword ptr [EAX],0  // clear str
        cmp [ecx],0
        jl @@done     // refCnt=-1: literal str
{$ifdef AVOIDLOCK}
        cmp IsMultiThread,false
        jnz @@lock
        dec [ecx]     // not threadsafe dec refCount, but faster
{$else} lock  dec [ecx]     // threadsafe dec refCount
{$endif}jne @@done
@@free: push eax
        mov eax,ecx
        call MemoryManager.FreeMem // inlined _FreeMem() code
        or eax,eax
        jnz _JmpInvalidPtr
        pop eax
{$ifdef AVOIDLOCK}
        ret
@@lock: lock  dec [ecx]     // threadsafe dec refCount
        je @@free
{$else}{$ifndef NOAMD}ret{$endif} // AMD trick: avoid branch misprediction
{$endif}
@@done:{$ifndef NOAMD}db $F3{$endif} // rep ret AMD trick: avoid branch misprediction
end;

So if the AVOIDLOCK conditional is defined, and there is only one thread in your application (IsMultiThread is false), the lock dec [ecx] instruction won't be executed; a much faster (and core-friendly) dec [ecx] instruction is used instead.

Note that there is a similar check already in FastMM4: if IsMultiThread is false, no LOCKed instruction will be used.

The only drawback is that if you want to use threads in your application, you'll have to ensure that:
1. TThread is not used: the creation of a single TThread sets IsMultiThread to true, thereby enabling LOCKed instructions;
2. the BeginThread() function is avoided as well (it also sets the flag);
3. so you'll have to call the CreateThread() Win32 API directly for your threads;
4. and none of your units may use either TThread or BeginThread!

That's why it would be useful for Embarcadero to take this problem into account, and try to resolve it at the compiler level...

Offline

#2 2010-07-22 17:26:05

AdamWu
Member
Registered: 2010-07-22
Posts: 20

Re: Delphi doesn't like multi-core CPUs (or the contrary)

Well, to achieve thread synchronization and avoid race conditions, some kind of atomic operation is needed. And I think the LOCKed instruction is the finest atomic operation you will get. So I don't think it can be avoided -- maybe its usage can be reduced, but eventually you will have to use one.

Eliminating LOCKed instructions is not something a compiler can do. Architectural (hardware) changes are needed, for example transactional memory.

Offline

#3 2010-07-22 20:05:04

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

There are a lot of algorithms which are able to avoid most atomic operations.
Just look at http://www.linuxjournal.com/article/5833 and you'll see why Linux is in another galaxy than Windows... this article was written in 2002, and Linux runs on massively parallel computers with the same kernel source code as the one in your latest geeky Ubuntu... wink

AdamWu wrote:

Eliminating LOCKed instruction is not what a compiler can do.

I 100% agree with you: that's why I said the locks are to be avoided in the Delphi RTL, that is, in memory allocation and string handling.

Another approach is to have multiple heaps, and avoid most copies.

For example, I'm working on some very powerful and multi-thread ready architecture named XMLRAD.

See http://xmlrad.cvs.sourceforge.net/viewv … ision=1.96

XSystem.pas wrote:

Main ideas of UltimX:
- avoid unnecessary memory allocations when moving data from source buffer to final buffer
- avoid intermediate string copies
- memory buffers are kept on the stack when possible
- memory allocations are maintained outside the scope of variables, so variables no longer need to reserve physical memory, and can rely on pre-allocated space to move data from source to destination
- when memory allocations are needed and are local to the scope of the procedure stack, the programmer should prefer the CurrentScratch heap: with XScratchStack/XScratchUnstack the memory blocks live for the scope of the procedure (ephemeral allocation); all these local memory allocations are collected at each XScratchUnstack, and when the thread terminates, the whole CurrentScratch heap is collected at once
- when memory allocations need to survive beyond the scope of the current procedure stack, the programmer should use the heap of the current request (XMLRequest.Heap): the memory blocks live for the duration of the request (ephemeral allocation), and all these local memory allocations are collected at once when the request is released
- at last, when there is no alternative but to perform a global memory allocation, such an allocation should be performed using the global heap (with auto-detection of the NUMA node)

Without changing the compiler, just your coding style and your library, you can increase multi-core performance a lot!

I don't know if you ever noticed that, but you can even have multiple heaps in your Delphi application, and have objects which are created not on the main global heap, but on a private heap.

Here is my rewrite of the allocation part of any class in Delphi (I can post this code here because it's my code, not Embarcadero's):

class function TObject.NewInstance: TObject;
asm // faster version by AB
    push eax
    mov eax,[eax].vmtInstanceSize
    call MemoryManager.GetMem // inlined _GetMem() code
    or eax,eax
    mov edx,eax
    jz _JmpOutOfMemory
    pop eax
    jmp InitInstance
end;

So you can customize the memory allocation by using some other call than GetMem.

There are so many ways of improving speed... but you have to open your mind and use a profiler, and a timer, in order to really make things faster - not only to have fun (even if making it faster is a lot of fun for me)...

Offline

#4 2010-07-23 01:43:32

AdamWu
Member
Registered: 2010-07-22
Posts: 20

Re: Delphi doesn't like multi-core CPUs (or the contrary)

True, a local heap can avoid locking, but other problems will be introduced:
1. Some applications use massive multi-threading; if each thread opens a local heap, the address space may get very fragmented. (When Delphi eventually moves to 64-bit, we probably won't have to worry about it anymore... tongue)
2. Each heap needs an instance of a memory manager, and then there will be inter-op problems. If a piece of memory is passed between threads, it has to be somehow "linked" to the memory manager instance that allocated it, otherwise there will be problems freeing it. And the free operation also needs to be synchronized, probably using LOCK, maybe RCU -- but I think that is too heavyweight for memory management.

Last edited by AdamWu (2010-07-23 01:51:03)

Offline

#5 2010-07-23 07:10:12

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

1. I don't get your point here: a local heap may live in the same address space. A local / per-thread heap is just another way of allocating memory from the global heap, but with some memory pre-allocated per thread.

2. The local heap memory manager can be just a growing heap, with all memory deallocated at once when the thread finishes.

2.1. Exchange of data between threads is in fact not so common. You can avoid most of it in some applications (like a web server which does HTML rendering).

2.2. Indeed, every object must know its heap manager, but that's not difficult to fix by design.

2.3. With a local heap freed at once, you'll avoid most locks.

3. Most locks are performed by the current version of Delphi:
  - when you allocate memory;
  - when your memory has to grow or shrink;
  - when your memory is freed;
  - when a string is assigned to another string, which is very common if you use methods/functions which return a string as a result;
  - when a char is about to be written into a string, i.e. when a string is about to be modified (the implicit UniqueString() call generated by the compiler);
  - the same for dynamic arrays...

See the UltimX.pas code I referred to in my previous post (from XMLRad project): you'll find a lot of interesting techniques to avoid CPU locking, for Delphi.

Offline

#6 2010-07-24 01:18:39

AdamWu
Member
Registered: 2010-07-22
Posts: 20

Re: Delphi doesn't like multi-core CPUs (or the contrary)

I agree that all these are possible and can be done.
But I'm kind of worried that doing all this puts too many constraints on the programming style, to the point of being counterproductive.

For example, exchange of data between threads is not common, agreed, but the problem is that it does happen, and you won't know when.
So, in order to correctly handle those 0.1% of cases that happen randomly, you have to be prepared all the time - i.e. the alloc / free list operations must be synchronized.

There are many existing optimizations, such as the RCU used in the Linux kernel, but nearly all of them speed up just the readers; some even penalize the writers in order to optimize for reading.
But most of the cases you argue in the first post are actually writes - string assignment, freeing memory blocks... I doubt such optimizations can help.

But yes, there is an alternative way: just change the existing programming model and avoid the need for locks.
Like suggested in UltimX - whenever you need something local, just use the stack.

But everything comes at some cost, right?
1. The stack space is quite limited, and may not be extendable, so the programmer needs to be extra careful not to overflow it.
2. The local heap is basically a chunk of memory taken out of the middle of the otherwise continuous address space. So if you take too many, the address space gets fragmented.
A 32-bit program only has 2GB of address space, into which it must fit the executable plus all libraries, and the stacks and heaps for all threads.
So it is actually quite easy to get fragmented, and what happens then is that you will fail to allocate a relatively big chunk of memory (say 300MB), even when there seems to be much more available (say 600MB), because it is not in a single piece (but e.g. 3x200MB)... tongue

------
I've seen some nice hardware solutions. I have programmed on an OCTEON network processor, which is designed to handle packets at extremely high speed. On board are special hardware assists to avoid locking, such as a programmable "thread" scheduler, and a hardware memory manager. But again, there is some cost - the memory manager can only give you fixed-size blocks, so some wastage is unavoidable...

Last edited by AdamWu (2010-07-24 01:39:03)

Offline

#7 2010-07-26 12:44:19

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

You just have to write with diverse objects/allocators. What is called a thread-local heap below may also be a request-local heap, i.e. for a web server, a heap which is initialized when the HTTP request is received, used to compute the answer, then released at once when the request is finished or outdated.

Therefore you have two ways of implementing thread-local and global heaps:

1. Define distinct class types for the distinct heaps: one kind of class will have thread-local storage (by overriding the TObject.NewInstance class function as I stated above). If you define your classes carefully, Delphi's strong typing will prevent you from using thread-local allocated objects outside the thread scope.

2. Another trick is to put the thread ID into every allocated memory block (the block size is already stored in every block; you can add such a parameter just before the data, and I think a 4-byte overhead is not much). If a memory block is used outside the thread which created it, a copy will be made. Some kind of cheap RCU. But you'll have to override the memory allocator. No lock in this case, and 100% thread-safe. This could perhaps also be enabled in the system.pas long-string implementation, independently of the heap manager used... perhaps a track to follow here...

And don't forget it's always possible to use plain object types on the stack rather than class types: these object types are faster to initialize and are excellent candidates for local/temporary storage. Even shortstrings and PChar buffers are very fast if you need fast parsing and text appending, which is what most server work has to do. And I suspect the main use of massive multi-threading is in such server applications.

About memory fragmentation: it can be avoided if you use clever algorithms. If you can guess the average/maximum memory needed per thread, it's not difficult to keep such allocations unfragmented. And on a Pentium-class CPU, memory fragmentation is not the issue here: L1/L2/L3 cache latency is the main performance parameter. So if you reserve some space in the main heap for thread-local use in order to create a cache (i.e. a list of pre-allocated heaps), then allocate and deallocate the thread-local heaps from this "private heap" cache, you won't have any fragmentation. If your thread-local heap is deallocated at once when the thread (or request) is finished, you won't have to worry about memory release either (you can forget about calling Free if the Destroy destructor was not overridden to do something other than free memory).

Offline

#8 2010-07-27 17:21:30

AdamWu
Member
Registered: 2010-07-22
Posts: 20

Re: Delphi doesn't like multi-core CPUs (or the contrary)

Yes, stack-based memory allocation is definitely faster, agreed. big_smile

But as I said, extra care is needed on the programmer's side, especially when writing a library, or calling external libraries, because you can't control the stack when you don't have execution control (i.e. the external caller may leave you only a tiny stack, or the external callee may demand more stack space than you have left for it).

And sometimes a stack overflow exhibits totally weird behavior; it is hard to detect or debug.

> If a block memory is used outside the thread which created it, a copy will be made.

Who is going to make the copy?

If the user thread does the copy, then locking is needed, since the owner thread may operate on the memory during the copy, leading to a race condition. (RCU works only if ALL write accesses are first made in a separate copy, so in order to make your scheme an RCU, even the owner thread needs to make a copy of its memory block before doing any update -- this is one of the cases I mentioned where the writer is penalized for faster reads.)

If the owner thread does the copy, then the user thread needs to provide synchronized access to its memory manager, most likely via some kind of locking. (Say a memory block owned by thread A is going to be passed to thread B: then thread A needs to ask thread B's memory manager for a chunk of memory before A can perform the copy...)

Last edited by AdamWu (2010-07-27 17:38:20)

Offline

#9 2010-07-30 09:48:40

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

I have a compiler enhancement proposal which could make Delphi more multi-thread friendly:
  see http://synopse.info/forum/viewtopic.php?pid=305

Offline

#10 2010-09-07 09:05:42

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

Here are some benchmarks... even though I found it difficult to get real data about this.

http://www.dlugosz.com/Repertoire/refma … paper.html
-> on a dual Pentium Pro, it was a 178/31 = 5.7 factor; on a Pentium 4 w/ hyperthreading, it was a 38/2 = 19 factor
http://www.spiral.net/software/barrier.html
-> on a Core2Duo, it was a 530/220 = 2.4 factor; on a Core2Extreme, it was a 2500/1130 = 2.2 factor

So LOCK seems to be getting better with the newer CPU architectures.

Offline

#11 2010-09-10 06:55:29

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

DD wrote:

Bottom of page:
http://www.realworldtech.com/page.cfm?A … 182719&p=8

"Nehalem also lowers the latency for synchronization primitives such as
LOCK, XCHG and CMPXCHG, which are necessary for multi-threaded
programming. Intel claims that the latency for LOCK CMPXCHG
instructions (which serializes the whole pipeline) is 20% of what it
was for the P4 (which was absolutely horrible) and about 60% of the
Core 2. While the latency is lower, the behavior is still similar to
prior generaitons; Lock instructions are not pipelined, although
younger operations can execute ahead of a LOCK instruction."

So I guess that the timings from spiral.net are somewhat wrong.

Offline

#12 2010-11-10 16:52:09

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

DD wrote:

Lock instructions are not pipelined, although
younger operations can execute ahead of a LOCK instruction."

This will definitely be a bottleneck for fast string processing in a thread.

In multi-threaded apps, statically allocated buffers are always preferred. But that's not so easy to implement.

The best performance I ever got was by calling the Windows thread API directly, with pre-allocated memory buffers, and assigning one thread to each core (or at least one thread per core, minus one or two - or divided by two if you know that your task is not HyperThreading-friendly). It worked very well for real-time image processing.

Or you can take a look at the new http://synopse.info/fossil/finfo?name=SynCrypto.pas file, and the way I implemented multi-threaded AES encryption via the pure Windows API. Since it doesn't use TThread, IsMultiThread will stay false, and FastMM4 will perform better... if you use no TThread in your application, of course!

The resulting code is very easy to read and follow... the Windows API implementation is even clearer than creating a TThread class:

var
  CPUCount: integer = 0;

procedure GetCPUCount;
var SI: TSystemInfo;
begin
  GetSystemInfo(SI); // USETHREADSFORBIGAESBLOCKS is defined for Windows only
  CPUCount := SI.dwNumberOfProcessors;
end;

type
  TThreadParams = record
    bIn, bOut: pAESBlock;
    BlockCount: integer;
    Encrypt: boolean;
    ID: THandle;
    AES: TAES;
  end;

{ we use direct Windows threads, since we don't need any exception handling
  nor memory usage inside the thread handler
   -> avoids classes.TThread and system.BeginThread() use
   -> the application is still "officially" mono-threaded (i.e. IsMultiThread=false),
     for faster System.pas and FastMM4 (no locking)
   -> the code is even shorter than the original using TThread }
function ThreadWrapper(var P: TThreadParams): Integer; stdcall;
begin
  with P do
    AES.DoBlocks(bIn,bOut,bIn,bOut,BlockCount,Encrypt);
  ExitThread(0);
  result := 0; // make the compiler happy, but will never be reached
end;

procedure TAES.DoBlocksThread(var bIn, bOut: PAESBlock; Count: integer; doEncrypt: boolean);
var Thread: array[0..3] of TThreadParams; // faster than dynamic array
    Handle: array[0..high(Thread)] of THandle;
    nThread, i, nOne: integer;
    pIn, pOut: PAESBlock;
begin
  if Count=0 then exit;
  if CPUCount=0 then
    GetCPUCount;
  if {$ifdef USEPADLOCK} padlock_available or {$endif}
    (CPUCount=1) or // (DebugHook<>0) or
    (Count<((512*1024) div AESBlockSize)) then begin // not needed below 512 KB
    DoBlocks(bIn,bOut,bIn,bOut,Count,doEncrypt);
    exit;
  end;
  nThread := CPUCount;
  if nThread>length(Thread) then // a quad-core is enough ;)
    nThread := length(Thread);
  nOne := Count div nThread;
  pIn := bIn;
  pOut := bOut;
  for i := 0 to nThread-1 do
  with Thread[i] do begin // create threads parameters
    bIn := pIn;
    bOut := pOut;
    BlockCount := nOne;
    Encrypt := doEncrypt;
    AES := self; // local copy of the AES context for every thread
    Handle[i] := CreateThread(nil,0,@ThreadWrapper,@Thread[i],0,ID);
    inc(pIn,nOne);
    inc(pOut,nOne);
    dec(Count,nOne);
  end;
  if Count>0 then
    DoBlocks(pIn,pOut,pIn,pOut,Count,doEncrypt); // remaining blocks
  inc(Count,nOne*nThread);
  assert(integer(pIn)-integer(bIn)=Count*AESBlockSize);
  assert(integer(pOut)-integer(bOut)=Count*AESBlockSize);
  bIn := pIn;
  bOut := pOut;
  WaitForMultipleObjects(nThread,@Handle[0],True,INFINITE); 
  for i := 0 to nThread-1 do
    CloseHandle(Handle[i]);
end;

A possibility is therefore to have the main GUI thread allocate memory for all background threads before calling them, then have these threads created by the pure Windows API, instead of defining some TThread.Execute... You can also use memory allocated by the Windows API, either a local heap or VirtualAlloc: if you need big chunks of memory, VirtualAlloc is as fast as FastMM4, since for such big blocks FastMM4 just calls VirtualAlloc. smile

But be aware that you must not use exceptions in pure Windows API threads, or you'll have to duplicate the exception handling from the TThread code... sad

Offline

#13 2010-11-18 10:56:08

andrewdynamo
Member
Registered: 2010-11-18
Posts: 65

Re: Delphi doesn't like multi-core CPUs (or the contrary)

I made a new memory manager for Delphi which uses a threadvar/TLS inside, so each thread has its own memory. This way no locks are used, and memory-intensive apps scale (almost) perfectly now!
The name of the new MM is ScaleMM:
http://code.google.com/p/scalemm/

I released version 0.9 yesterday: still a little bit rough, but functional :-).
It is also 4x faster than FastMM, even single-threaded!

I had 2 major things in mind:
1 = make it scale
2 = make it as simple as possible (so faster than the bloated FastMM)

I still use a simple CAS lock for inter-thread memory (memory allocated in thread 1, freed in thread 2).
And for the global cache/pool I use a simple lock, which maybe can be replaced with an interlocked index for getting/putting an item in an array/pool.
If anybody has better ideas on how to do it without locking: please say so :-).

Also other comments or help are welcome!

Last edited by andrewdynamo (2010-11-18 11:27:33)

Offline

#14 2010-11-18 12:08:11

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

andrewdynamo wrote:

I made a new memory manager for Delphi which uses a threadvar/TLS inside, so each thread has its own memory. This way no locks are used, and memory-intensive apps scale (almost) perfectly now!

Very good unit!
It's definitely easy to read and follow, because of its nice and simple design.

In TThreadMemManager.GetMem/Reallocmem:
- You don't handle aSize<=0, which should return nil (you'll get a GPF in the current implementation).

Everywhere in the code (to be 64-bit ready):
- You should use pointer(NativeUInt()) instead of pointer(cardinal()) typecasts.
- You should use NativeInt instead of integer (to be sure the value is stored in a register, without any overflow check).

Some ideas/remarks:
- The RET at the end of _GetOwnTls is pointless.
- Perhaps joining all the code into only one ScaleMM unit would make sense (easier to deploy/install).
- RELEASE should be the default: a NORELEASE conditional could make sense instead (for version 1.0) wink
- Why did you use {$A8}? {$A4} or 'packed' wouldn't be wrong either. 16-byte alignment could be an option only, and FastMM4 doesn't always return 16-byte aligned blocks in the default compilation mode (via OldMM.GetMem).
- If you use 16 bytes per TMemHeader, you could add a "magic" cardinal number to it to identify FreeMem/ReallocMem misuses.
- You are using New() and Dispose(), which can safely be replaced with OldMM.GetMem(sizeof()) and OldMM.FreeMem() calls.
- You use some div 32 or div 256: you'd better use shr 5 and shr 8, which will be faster when working with integer/NativeInt.

IMHO in Scale_ReallocMem(), the 

//new size smaller then current size?
  if (aSize <= Owner.FItemSize) then    //we have already incremented blocksize in "TMemList.AllocMem" and "TMemStore.GetMemFromList"
  begin
    if (aSize + Owner.FItemSize >= Owner.FItemSize) then
      Result := aMemory //no resize needed
    else
    //too much downscaling
    begin
      Result := Scale_GetMem(aSize);   //new mem
      if aMemory <> Result then
      begin
        Move(aMemory^, Result^,
             aSize);                   //copy (use smaller new size)
        Scale_FreeMem(aMemory);        //free old mem
      end;
    end;
  end

doesn't make sense to me. As it is written, the "too much downscaling" block will never be triggered.


I see that you rely on FastMM4 for bigger block sizes. What about using the Windows Heap API?
It's slower than FastMM4 for single-threaded applications, but I read that it behaves better than FastMM4, even for big blocks, in multi-threaded applications.
Code extracted from our Enhanced RTL for Delphi (perhaps HEAP_NO_SERIALIZE is not a good idea and should be avoided in multi-threaded apps):

const
  HEAP_NO_SERIALIZE              = $00001;
  HEAP_GROWABLE                  = $00002;
  HEAP_GENERATE_EXCEPTIONS       = $00004;
  HEAP_ZERO_MEMORY               = $00008;
  HEAP_REALLOC_IN_PLACE_ONLY     = $00010;
  HEAP_TAIL_CHECKING_ENABLED     = $00020;
  HEAP_FREE_CHECKING_ENABLED     = $00040;
  HEAP_DISABLE_COALESCE_ON_FREE  = $00080;
  HEAP_CREATE_ALIGN_16           = $10000;
  HEAP_CREATE_ENABLE_TRACING     = $20000;
  HEAP_MAXIMUM_TAG               = $00FFF;
  HEAP_PSEUDO_TAG_FLAG           = $08000;
  HEAP_TAG_SHIFT                 = 16;

var
  HeapHandle: THandle = 0;
  {* Global handle to the heap. Do not change it! }

  HeapFlags: cardinal = 0;
  {* Possible flags are:
     HEAP_GENERATE_EXCEPTIONS - system will raise an exception to indicate a
                              function failure, such as an out-of-memory
                              condition, instead of returning NULL.
     HEAP_NO_SERIALIZE - mutual exclusion will not be used while the HeapAlloc
                       function is accessing the heap. Be careful!
                       Not recommended for multi-thread applications.
                       But faster.
     HEAP_ZERO_MEMORY - obviously. (Slower!)
  }

  { Note from MSDN:
    The granularity of heap allocations in Win32 is 16 bytes. So if you
    request a global memory allocation of 1 byte, the heap returns a pointer
    to a chunk of memory, guaranteeing that the 1 byte is available. Chances
    are, 16 bytes will actually be available because the heap cannot allocate
    less than 16 bytes at a time.
  }

function GetProcessHeap: THandle; stdcall; external kernel;
function HeapAlloc(hHeap: THandle; dwFlags, dwBytes: Cardinal): Pointer; stdcall; external kernel;
function HeapReAlloc(hHeap: THandle; dwFlags: Cardinal; lpMem: Pointer; dwBytes: Cardinal): Pointer; stdcall; external kernel;
function HeapFree(hHeap: THandle; dwFlags: Cardinal; lpMem: Pointer): LongBool; stdcall; external kernel;

function SysGetMem(size: Integer): Pointer;
// Allocate memory block
begin
  if HeapHandle=0 then
    HeapHandle := GetProcessHeap; // allow the heap to be shared within dll
  Result := HeapAlloc(HeapHandle, HeapFlags, size);
end;

function SysFreeMem(p: Pointer): Integer;
// Deallocate memory block
begin
  if HeapHandle=0 then
    Result := 0 else
    Result := Integer(not HeapFree(HeapHandle, HeapFlags and HEAP_NO_SERIALIZE, p));
end;

function SysReallocMem(p: Pointer; size: Integer): Pointer;
// Resize memory block
begin
  if HeapHandle=0 then
    HeapHandle := GetProcessHeap; // allow the heap to be shared within dll
  Result := HeapRealloc(HeapHandle, HeapFlags and (HEAP_NO_SERIALIZE or
         HEAP_GENERATE_EXCEPTIONS or HEAP_ZERO_MEMORY),
         // (Prevent using flag HEAP_REALLOC_IN_PLACE_ONLY here - to allow
         // the system to move the block if necessary).
         p, size);
end;

{$IFDEF MSWINDOWS}
function GetHeapStatus: THeapStatus;
begin
  fillchar(result,sizeof(result),0);
end;
{$endif}

procedure UninitAllocator;
// Shutdown; do nothing
begin
end;

Offline

#15 2010-11-18 13:09:31

andrewdynamo
Member
Registered: 2010-11-18
Posts: 65

Re: Delphi doesn't like multi-core CPUs (or the contrary)

Thank you for the positive reply!

- "You don't handle aSize<=0"
  Thank you, I will add a check for that (or maybe the RTL already does it?)
- "Everywhere in the code (to be 64 bit ready)"
  Yes, I will change that too; I really hope the 64-bit (preview) compiler will be available soon to test ScaleMM.
  But for now, I used my sparse spare time to get it working. 64-bit compatibility is in the (near) future, lower priority.
- "the RET in ending of _GetOwnTls is pointless"
  Ok, thanks
- "Perhaps joining all code in only one ScaleMM unit could make sense (easier to deploy/install)"
  Yes, I first thought small units would be better than one big unit, but it is not easier to install (you have to add multiple units to the dpr or use a search path)
- "RELEASE should be the default: a NORELEASE could make sense by default (for version 1.0)"
  Ok, thanks
- "why did you use {$A8} ? {$A4} or 'packed' wouldn't be wrong either. 16 bytes alignment could be an option only, and FastMM4 doesn't always return 16 byte aligned blocks in the default compilation mode (via OldMM.GetMem)."
  Just some test to see if it makes a difference :-)
  I do not have much experience with alignment yet
- "If you use 16 bytes per TMemHeader, you could add a "magic" cardinal number to it to identify Freemem/Reallocmem misuses"
  yes, I can use the 8 extra bytes for something else then, but what do you mean by "misuses"?
- "you are using New() and Dispose() which can be safely replaced with OldMM.GetMem(sizeof()) and OldMM.FreeMem() calls."
  Yes and no: this way it can use its own memory (that's the reason for the "recursive" check)
- "You use some div 32 or div 256: you'd better use  shr 5  and shr 8, which will be faster when working with integer/NativeInt."
  Aha, thanks for the tip
- about downscaling:
  E.g. aSize is 1 and itemsize is 256; then:
    1 < 256
    1 + 256 >= 256   ....  hmmm, you're right

  It should be:
    if aSize < Owner.FItemSize div 2 then   // downscale only if less than half the item size
    1 < 256 div 2    -> not ok as-is (too much waste), so downscale
    128 < 256 div 2  -> false: exactly half, which is ok, so keep the block
- about large mem:
  For the first version, I focus on "make a working POC", so ScaleMM on top of FastMM was the easiest :-).
  Btw, "on top" is also a nice feature: you can put any other MM before ScaleMM!
  I haven't thought much about how to deal with large mem (I would like to make version 2 independent of FastMM),
  but your proposal of using the heap sounds good, thanks! Do you have good ideas about a "large memory" strategy?
  I know FastMM has some special logic for medium and large blocks.
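The "magic" cardinal ab suggested can be sketched as follows in C (the header layout, names, and cookie values here are purely illustrative, not ScaleMM's actual format): each block carries a guard word that the free routine checks, so releasing a foreign or corrupted pointer is flagged instead of silently damaging the heap.

```c
#include <stdlib.h>
#include <stdint.h>

#define MAGIC_ALIVE 0x4D454D4Fu   /* arbitrary cookie marking a live block */
#define MAGIC_FREED 0xDEADBEEFu   /* poison value written on release */

typedef struct {
    uint32_t magic;   /* guard value checked on free */
    uint32_t size;    /* requested size; spare room, as in the 16-byte header */
} MemHeader;

void *guarded_getmem(size_t size)
{
    MemHeader *h = malloc(sizeof(MemHeader) + size);
    if (h == NULL)
        return NULL;
    h->magic = MAGIC_ALIVE;
    h->size  = (uint32_t)size;
    return h + 1;                 /* user pointer starts after the header */
}

/* Returns 0 on success, -1 if the pointer is not a live guarded block. */
int guarded_freemem(void *p)
{
    MemHeader *h;
    if (p == NULL)
        return 0;
    h = (MemHeader *)p - 1;
    if (h->magic != MAGIC_ALIVE)
        return -1;                /* foreign pointer or corrupted header */
    h->magic = MAGIC_FREED;       /* poison before releasing */
    free(h);
    return 0;
}
```

A debug build could raise an error on the -1 path; in release mode the check costs one compare per free.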

#16 2010-11-18 13:25:52

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

andrewdynamo wrote:

- "you are using New() and Dispose() which can be safely replaced with OldMM.GetMem(sizeof()) and OldMM.FreeMem() calls."
  Yes and no: this way it can use its own memory (that's the reason for the "recursive" check)

So replace it with GetMem and FreeMem.

andrewdynamo wrote:

  For the first version, I focus on "make a working POC", so ScaleMM on top of FastMM was the easiest :-).
  Btw, "on top" is also a nice feature: you can put any other MM before ScaleMM!
  I haven't thought much about how to deal with large mem (I would like to make version 2 independent of FastMM),
  but your proposal of using the heap sounds good, thanks! Do you have good ideas about a "large memory" strategy?
  I know FastMM has some special logic for medium and large blocks.

FastMM is perhaps the best option for blocks > 16 KB (medium blocks are handled internally up to a bit more than 256 KB, then it calls VirtualAlloc for bigger blocks).
The "on top" aspect is really really great. I really like it.

For the "div 2", replace it with "shr 1". Or use cardinals (the compiler will replace "aCardinal div 2" by "aCardinal shr 1", but it won't do it for an integer, because it must check whether the integer is < 0).
For unsigned integers, div 2 = shr 1, div 4 = shr 2, div 8 = shr 3, and so on...
Since you're using positive integers, you could safely use NativeUInt everywhere.
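The unsigned-versus-signed distinction is easy to verify in C, where `>>` plays the role of Pascal's shr: for unsigned operands, division by a power of two is exactly a right shift, while for negative signed operands a bare arithmetic shift floors instead of truncating, which is why the compiler must emit fixup code for signed types (a small sketch; the signed `>>` result is implementation-defined in C, though arithmetic shift is universal in practice):

```c
/* For unsigned operands, division by 2 is exactly one right shift,
   so the compiler can emit shr directly. */
unsigned int udiv2(unsigned int x) { return x >> 1; }

/* For signed operands the two operations differ on negative values:
   integer division truncates toward zero, while an arithmetic right
   shift floors toward negative infinity. */
int sdiv2_shift(int x) { return x >> 1; }   /* arithmetic shift (typical) */
int sdiv2_div(int x)   { return x / 2; }    /* true truncating division  */
```

For example, -7 / 2 truncates to -3, but -7 >> 1 floors to -4; the extra compare-and-adjust the compiler inserts for signed division bridges exactly this gap.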

You could have used a threadvar instead of the low-level Windows API.
The _FixedOffset trick is perhaps not worth it: reading a threadvar could be just as fast.

I've inlined the threadvar access in asm in our Enhanced RTL, for example for the ioresult function.
Using a threadvar would let you get rid of the asm and use regular pascal code in some cases (and so be ready for 64 bit or FPC).
Do you want me to rewrite the getmem/freemem versions in asm?
Perhaps once the algorithm is proven, I'll look into this, if you want.

In any case, your unit, combined with a thread pool (to avoid creating a new heap every time a thread is created), could make our server scale even better.
It's very similar to the MM included in FPC 2.4. Did you take a look at it?

#17 2010-11-18 13:41:26

andrewdynamo
Member
Registered: 2010-11-18
Posts: 65

Re: Delphi doesn't like multi-core CPUs (or the contrary)

- "So replace it with GetMem and FreeMem"
  Ok, that's better; Scale_GetMem etc. it will be then :-)
- "replace all div with shr"
  Yes, I will do that for version 1.0
- about using _fixedoffset:
  this way no (extra) multiply is done each time to calculate the offset to the thread memory, so it should be faster...
  But yes, for 64 bit (no asm support :-( ) I will probably need to use a normal threadvar.
- about the asm rewrite:
  Yes, that would be nice! I made it a lower priority because there is no 64 bit asm, but for 32 bit it would be good anyway.
  I am curious how much faster you can make it in asm!
- about the MM of FPC
  No, I coded it completely myself, but I used some ideas from TopMM (which is slower than FastMM in real-world tests and also too bloated; that's why I made ScaleMM) and of course some from FastMM. I also used the CAS32 from the lock-free queue:
http://www.thedelphigeek.com/2010/02/dy … right.html
  But I will take a look at it!

#18 2010-11-18 13:45:46

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

andrewdynamo wrote:

- about using _fixedoffset:
  this way no (extra) multiply is done each time to calculate the offset to the thread memory, so it should be faster...

In practice, the *4 multiplication is done in one CPU clock.
My point was about using the threadvar content, i.e. reading the offset from the threadvar position.
Since this call is only made once, it won't be slower, and it would avoid the memory protection trick.
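The threadvar suggestion has a direct analogue in C11's `_Thread_local` (a hedged sketch, not ScaleMM code): the compiler and linker compute the per-thread offset themselves, so no manual TLS-slot arithmetic is needed and the code stays plain and portable.

```c
#include <pthread.h>
#include <stddef.h>

/* Each thread gets its own copy of this variable; the toolchain emits the
   TLS offset computation, so no GetOwnTls-style pointer tricks are needed. */
static _Thread_local int per_thread_counter = 0;

static void *worker(void *out)
{
    /* Only this thread's copy is touched; no locking required. */
    for (int i = 0; i < 1000; i++)
        per_thread_counter++;
    *(int *)out = per_thread_counter;
    return NULL;
}

/* Runs two workers; each should observe exactly its own 1000 increments. */
int run_two_workers(int results[2])
{
    pthread_t a, b;
    if (pthread_create(&a, NULL, worker, &results[0]) != 0) return -1;
    if (pthread_create(&b, NULL, worker, &results[1]) != 0) return -1;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```

This is the same property a Delphi threadvar gives: per-thread storage without atomic operations or a LOCK prefix on the hot path.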

#19 2010-11-19 12:40:49

andrewdynamo
Member
Registered: 2010-11-18
Posts: 65

Re: Delphi doesn't like multi-core CPUs (or the contrary)

ScaleMM fixes are updated in SVN

#20 2010-11-19 13:00:50

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

Nice!

What about your testing status?
Does it scale well on real applications (like a server application)?

#21 2010-11-19 13:38:48

andrewdynamo
Member
Registered: 2010-11-18
Posts: 65

Re: Delphi doesn't like multi-core CPUs (or the contrary)

ab wrote:

Nice!
What about your testing status?
Does it scale well on real applications (like a server application)?

Not yet, only in a demo program with 30.000.000 mem allocs, strings + objects in 16 threads on a quad core machine :-).

I'm busy right now @ work with a large refactoring of old CRUD objects, so I hope next week

#22 2010-11-19 16:28:14

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

andrewdynamo wrote:

Not yet, only in a demo program with 30.000.000 mem allocs, strings + objects in 16 threads on a quad core machine :-).

Not bad. wink

andrewdynamo wrote:

I'm busy right now @ work with a large refactoring of old CRUD objects, so I hope next week

So do I. Refactoring and maintenance pay the bills...

#23 2010-11-19 17:34:14

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

I just noticed (from the svn code) that you changed some i: integer into i: byte.
Using an integer will be faster in the generated code: it will use a full 32 bit register instead of an 8 bit part of it, with some typecasts.
integer is more adequate than byte here, even if the optimizer sometimes generates a full integer in the asm code because it knows it'll be faster.
nativeint would be perfect: it always fits the processor register size.

In all cases, my advice is to always use Alt-F2 to see the generated asm code before any change.
In TMemBlockList.AddNewMemoryBlock I'm not sure your with statement picks the best one. Alt-F2 could help.

In TGlobalMemManager.GetBlockMemory, I guess your try...finally bl.FRecursive := False is pointless. The previous implementation sounded better to me.

IMHO TMemBlockList.FItemSize should be NativeInt to avoid a word to integer/NativeInt cast in the generated code.

#24 2010-11-22 07:41:31

andrewdynamo
Member
Registered: 2010-11-18
Posts: 65

Re: Delphi doesn't like multi-core CPUs (or the contrary)

ab wrote:

I just noticed (from svn code) that you changed some i: integer into i: byte
In TGlobalMemManager.GetBlockMemory, I guess your try...finally bl.FRecursive := False is pointless. Previous implementation sounded better to me.
IMHO TMemBlockList.FItemSize should be NativeInt to avoid a word to integer/NativeInt cast in the generated code.

I tried your proposal (byte to integer/nativeint) and got no speed improvement. Changing FItemSize from word to integer/nativeint even made it somewhat slower!
(probably due to alignment).
So I kept the byte and word, mainly to show that these fields hold low values and that these sizes are the minimum needed (without getting slower because of it).
The only parts where I use integer instead of NativeInt are the TMemoryManagerEx functions. I changed all other parts to nativeint where it made sense, and to
another type where an integer was not needed.

About the try..finally: it's just to be safe in case of an AV (access violation)
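The "probably due to alignment" guess is plausible: widening a 16-bit field to pointer size can grow the enclosing record and its cache footprint, which may cost more than the saved register-width conversion. A small illustrative C sketch (hypothetical layouts, not ScaleMM's actual records):

```c
#include <stdint.h>

/* A header with a word-sized (16 bit) item_size field, as in ScaleMM. */
typedef struct {
    void    *next;        /* e.g. a link to the next block       */
    uint16_t item_size;   /* compact field: low values only      */
    uint16_t item_count;
} SmallHeader;

/* The same header with the fields widened to pointer size (NativeInt). */
typedef struct {
    void     *next;
    uintptr_t item_size;  /* wider field, no narrowing casts ... */
    uintptr_t item_count; /* ... but a larger, colder record     */
} WideHeader;
```

On a 64 bit target the wide variant grows from 16 to 24 bytes, so more headers compete for each cache line; the measured slowdown andrewdynamo saw is consistent with that trade-off.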

Last edited by andrewdynamo (2010-11-22 08:51:24)

#25 2010-11-22 17:26:16

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

I'll work on the source tonight, in order to rewrite some parts according to some ideas I got.
For example, some block sizes are overlapping (256 and 2048).

I'll do the refactoring while looking at the generated asm code (Alt-F2), which is a good way to optimize for speed, whatever CPU it will run on.

I'll try to make a simple medium-sized block implementation.
