#1 2010-07-15 16:08:06

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Delphi doesn't like multi-core CPUs (or the contrary)

If you're like me, you are proud of the new CPU your computer runs on - in my case an i7-720Q with 4 cores and 8 hardware threads...

But Delphi is not very multi-thread or multi-core friendly... guess why...

The root cause is the LOCK prefix

All x86 CPUs are equipped with the ability to lock a specific memory address, preventing other bus agents from reading or modifying it while the locked instruction runs.

Adding the LOCK prefix to an assembly instruction causes the CPU to assert the LOCK# signal, which practically ensures exclusive use of the memory address in multi-processor / multi-threaded environments.

The LOCK prefix works only with the following instructions:

BT, BTS, BTR, BTC   (mem, reg/imm)
XCHG, XADD  (reg, mem / mem, reg)
ADD, OR, ADC, SBB   (mem, reg/imm)
AND, SUB, XOR   (mem, reg/imm)
NOT, NEG, INC, DEC  (mem)

Note: XCHG with a memory operand is special: it always asserts LOCK#, regardless of the presence of the LOCK prefix. The other read-modify-write instructions above (including XADD) are atomic only when the LOCK prefix is explicitly present.

These low-level LOCK mechanisms ensure that some memory is modified by only one thread at a time.

So what is wrong with these LOCKed instructions?

On a multi-core CPU, all cores stall in order to make this LOCKed instruction thread-safe: if you have a lot of threads on more than one core, every core has to hold off its pending memory accesses, all cores wait for the LOCKed instruction to complete, and only then does execution continue.

So guess what... when the CPU has to execute such instructions, all cores just stall, and your brand new 8-core CPU runs like a single-core one...

This is the same LOCKed mechanism which is used internally by Windows for its Critical Sections. That's why Windows itself is said not to be very multi-core friendly: it uses a lot of critical sections in its internals... Linux is much more advanced, and scales pretty well on massive multi-core architectures.

What about Delphi?

In Delphi, I discovered at least two performance problems in its RTL:
1. The default memory manager, i.e. FastMM4, uses a LOCKed asm instruction for every memory allocation or deallocation.
2. String types and dynamic arrays use the same LOCKed asm instruction everywhere, i.e. for every access which may lead to a write to the string.

See what I wrote in the Embarcadero forum... the post was not very popular, but I think I've raised a real issue about the Delphi compiler internals and performance here - and I don't think Embarcadero has any plans to resolve it...

IMHO if you use strings in your application and need speed, using another memory manager than FastMM4 is not enough. You'll have to avoid most string use, and implement a safe TStringBuilder-like class.

ShortStrings could be handy here, even if they are limited to 255 characters.

Using plain PAnsiChar and fixed buffers on the stack is also a solution, but it must be done safely...

Our enhanced RTL

In our enhanced RTL for Delphi 7, we avoid this LOCKed asm instruction if your application has only one thread: so if you use our enhanced RTL and create threads yourself (not using the TThread object), you'll get the best multi-threaded performance possible.
For example, here is how we coded the clearing of a string:

procedure _LStrClr(var S);
asm     { ->    EAX pointer to str      }
        MOV     EDX,[EAX]                       { fetch str                     }
        TEST    EDX,EDX                         { if nil, nothing to do         }
        JE      @@done 
        lea ecx,[edx-skew].StrRec.refCnt
        mov dword ptr [EAX],0  // clear str
        cmp [ecx],0
        jl @@done     // refCnt=-1: literal str
{$ifdef AVOIDLOCK}
        cmp IsMultiThread,false
        jnz @@lock
        dec [ecx]     // not threadsafe dec refCount, but faster
{$else} lock  dec [ecx]     // threadsafe dec refCount
{$endif}jne @@done
@@free: push eax
        mov eax,ecx
        call MemoryManager.FreeMem // inlined _FreeMem() code
        or eax,eax
        jnz _JmpInvalidPtr
        pop eax
{$ifdef AVOIDLOCK}
        ret
@@lock: lock  dec [ecx]     // threadsafe dec refCount
        je @@free
{$else}{$ifndef NOAMD}ret{$endif} // AMD trick: avoid branch misprediction
{$endif}
@@done:{$ifndef NOAMD}db $F3{$endif} // rep ret AMD trick: avoid branch misprediction
end;

So if the AVOIDLOCK conditional is defined, and there is only one thread in your application (IsMultiThread is false), the lock dec [ecx] instruction won't be executed; a much faster (and core-friendly) dec [ecx] instruction is used instead.

Note that there is a similar check already in FastMM4: if IsMultiThread is false, no LOCKed instruction will be used.

The only drawback is that if you want to use threads in your application, you'll have to ensure that:
1. TThread is not used: the creation of a single TThread sets IsMultiThread to true, thereby enabling LOCKed instructions;
2. the BeginThread() function is avoided as well (it also sets the flag);
3. so you'll have to call the CreateThread() Win32 API directly for your threads;
4. and none of your units may use either TThread or BeginThread!

That's why it would be useful for Embarcadero to take this problem into account, and try to resolve it at the compiler level...

Offline

#2 2010-07-22 17:26:05

AdamWu
Member
Registered: 2010-07-22
Posts: 20

Re: Delphi doesn't like multi-core CPUs (or the contrary)

Well, to achieve thread synchronization and avoid race conditions, some kind of atomic operation is needed. And I think the LOCKed instruction is the finest atomic operation you will get. So I don't think it can be avoided -- maybe its usage can be reduced, but eventually you will have to use one.

Eliminating LOCKed instructions is not something a compiler can do. Architectural (hardware) changes are needed, for example transactional memory.

Offline

#3 2010-07-22 20:05:04

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

There are a lot of algorithms which are able to avoid most atomic operations.
Just look at http://www.linuxjournal.com/article/5833 and you'll see why Linux is in another galaxy than Windows... this article was written in 2002, and Linux runs on massively parallel computers with the same kernel source code as the one in your latest geeky Ubuntu... wink

AdamWu wrote:

Eliminating LOCKed instruction is not what a compiler can do.

I 100% agree with you: that's why I said the locks are to be avoided in the Delphi RTL, that is, in memory allocation and string handling.

Another approach is to have multiple heaps, and avoid most copies.

For example, I'm working on some very powerful and multi-thread ready architecture named XMLRAD.

See http://xmlrad.cvs.sourceforge.net/viewv … ision=1.96

XSystem.pas wrote:

Main ideas of UltimX:
- avoid unnecessary memory allocations when moving data from source buffer to final buffer
- avoid intermediate string copies
- memory buffers are kept on the stack when possible
- memory allocations are maintained outside the scope of variables, so variables no longer need to reserve physical memory, and can rely on pre-allocated space to move data from source to destination
- when memory allocations are needed and are local to the scope of the procedure stack, the programmer should prefer the CurrentScratch heap: with XScratchStack/XScratchUnstack the memory blocks live for the scope of the procedure (ephemeral allocation); all these local memory allocations are collected at each XScratchUnstack, and when the thread terminates, the whole CurrentScratch heap is collected at once
- when memory allocations need to survive beyond the scope of the current procedure stack, the programmer should use the heap of the current request (XMLRequest.Heap): the memory blocks live for the duration of the request (ephemeral allocation), and all these local memory allocations are collected at once when the request is released
- at last, when there is no alternative but to perform a global memory allocation, such an allocation should be performed using the global heap (with auto-detection of the NUMA node)

Without changing the compiler, just your coding style and your library, you can increase multi-core performance a lot!

I don't know if you ever noticed that, but you can even have multiple heaps in your Delphi application, and have objects which are created not on the main global heap, but on a private heap.

Here is my rewrite of the allocation part of any class in Delphi (I can post this code here because it's my code, not Embarcadero's):

class function TObject.NewInstance: TObject;
asm // faster version by AB
    push eax
    mov eax,[eax].vmtInstanceSize
    call MemoryManager.GetMem // inlined _GetMem() code
    or eax,eax
    mov edx,eax
    jz _JmpOutOfMemory
    pop eax
    jmp InitInstance
end;

So you can customize the memory allocation by using some other call than GetMem.

There are so many ways of improving speed... but you have to open your mind and use a profiler, and a timer, in order to really make things faster - not only to have fun (even if making it faster is a lot of fun for me)...

Offline

#4 2010-07-23 01:43:32

AdamWu
Member
Registered: 2010-07-22
Posts: 20

Re: Delphi doesn't like multi-core CPUs (or the contrary)

True, a local heap can avoid locking, but other problems will be introduced:
1. Some applications use massive multi-threading; if each thread opens a local heap, the address space may get very fragmented. (When Delphi eventually moves to 64-bit, we probably won't have to worry about it anymore... tongue)
2. Each heap needs an instance of a memory manager, and then there will be inter-op problems. If a piece of memory is passed between threads, it has to be somehow "linked" to the memory manager instance that allocated it, otherwise there will be problems freeing it. And the free operation also needs to be synchronized, probably using LOCK, maybe RCU -- but I think that is too heavyweight for memory management.

Last edited by AdamWu (2010-07-23 01:51:03)

Offline

#5 2010-07-23 07:10:12

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

1. I don't get your point here: a local heap may live in the same address space. A local / per-thread heap is just another way of allocating memory from the global heap, but with some memory pre-allocated per thread.

2. The local heap memory manager can be just a growing heap, with all memory deallocated at once when the thread finishes.

2.1. Exchange of data between threads is in fact not so common. You can avoid most of it in some applications (like a web server which does HTML rendering).

2.2. Indeed, every object must know its heap manager, but that's not difficult to fix by design.

2.3. With a local heap freed at once, you'll avoid most locks.

3. Most locks are performed by the current version of Delphi:
  - when you allocate memory;
  - when your memory has to grow or shrink;
  - when your memory is freed;
  - when a string is assigned to another string, which is very common if you use methods/functions which return a string as a result;
  - when a char is about to be written into a string, i.e. when a string is about to be modified (the implicit UniqueString() call generated by the compiler);
  - the same for dynamic arrays...

See the UltimX.pas code I referred to in my previous post (from XMLRad project): you'll find a lot of interesting techniques to avoid CPU locking, for Delphi.

Offline

#6 2010-07-24 01:18:39

AdamWu
Member
Registered: 2010-07-22
Posts: 20

Re: Delphi doesn't like multi-core CPUs (or the contrary)

I agree that all these are possible and can be done.
But I'm kind of worried that doing all this puts too many constraints on the programming style, to the point of being counterproductive.

For example, exchange of data between threads is not common, agreed, but the problem is that it does happen, and you won't know when.
So, in order to correctly handle those 0.1% of cases that happen randomly, you have to be prepared all the time - i.e. the alloc / free list operations must be synchronized.

There are many existing optimizations, such as the RCU used in the Linux kernel, but nearly all of them speed up just the readers; some even penalize the writers in order to optimize for reading.
But most of the cases you argue in the first post are actually writes - string assignment, freeing memory blocks... I doubt such optimizations can help.

But yes, there is an alternative way: just change the existing programming model and avoid the need for locks.
Like suggested in UltimX - whenever you need something local, just use the stack.

But everything comes at some cost, right?
1. The stack space is quite limited, and may not be extendable, so the programmer needs to be extra careful not to overflow it.
2. The local heap is basically a chunk of memory taken out of the middle of the otherwise continuous address space. So if you take too many, the address space gets fragmented.
A 32-bit program only has 2GB of address space, into which it must fit the executable plus all libraries, and the stacks and heaps for all threads.
So it is actually quite easy to get fragmented, and what happens then is that you will fail to allocate a relatively big chunk of memory (say 300MB), even when there seems to be much more available (say 600MB), because it is not in a single piece (but e.g. 3x200MB)... tongue

------
I've seen some nice hardware solutions. I have programmed on an OCTEON network processor, which is designed to handle packets at extremely high speed. On board are special hardware assists to avoid locking, such as a programmable "thread" scheduler, and a hardware memory manager. But again, there is some cost - the memory manager can only give you fixed-size blocks, so some wastage is unavoidable...

Last edited by AdamWu (2010-07-24 01:39:03)

Offline

#7 2010-07-26 12:44:19

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

You just have to write with diverse objects/allocators. What is called a thread-local heap below may also be a request-local heap, i.e. for a web server, a heap which is initialized when the HTTP request is received, used to compute the answer, then released at once when the request is finished or outdated.

Therefore you have two ways of implementing thread-local and global heaps:

1. Define distinct class types for the distinct heaps: one kind of class will have thread-local storage (by overriding the TObject.NewInstance class function as I stated above). If you define your classes carefully, Delphi's strong typing will prevent you from using thread-local allocated objects outside the thread scope.

2. Another trick is to put the thread ID into every allocated memory block (the block size is already stored in every block; you can add such a parameter just before the data, and I think a 4-byte overhead is not much). If a memory block is used outside the thread which created it, a copy will be made. Some kind of cheap RCU. But you'll have to override the memory allocator. No lock in this case, and 100% thread-safe. This could perhaps also be enabled in the system.pas long-string implementation, independently of the heap manager used... perhaps a track to follow here...

And don't forget it's always possible to use plain object types on the stack rather than class types: these object types are faster to initialize and are excellent candidates for local/temporary storage. Even shortstrings and PChar buffers are very fast if you need fast parsing and text appending, which is what most server work has to do. And I suspect the main use of massive multi-threading is in such server applications.

About memory fragmentation: it can be avoided if you use clever algorithms. If you can guess the average/maximum memory needed per thread, it's not difficult to keep such allocations unfragmented. And on a Pentium-class CPU, memory fragmentation is not the issue here: L1/L2/L3 cache latency is the main performance parameter. So if you reserve some space in the main heap for thread-local use in order to create a cache (i.e. a list of pre-allocated heaps), then allocate and deallocate the thread-local heaps from this "private heap" cache, you won't have any fragmentation. If your thread-local heap is deallocated at once when the thread (or request) is finished, you won't have to worry about memory release either (you can forget about calling Free if the Destroy destructor was not overridden to do something other than free memory).

Offline

#8 2010-07-27 17:21:30

AdamWu
Member
Registered: 2010-07-22
Posts: 20

Re: Delphi doesn't like multi-core CPUs (or the contrary)

Yes, stack-based memory allocation is definitely faster, agreed. big_smile

But as I said, extra care is needed on the programmer's side, especially when writing a library, or calling external libraries, because you can't control the stack when you don't have execution control (i.e. the external caller may leave you only a tiny stack, or the external callee may demand more stack space than you have left for it).

And sometimes a stack overflow exhibits totally weird behavior; it is hard to detect or debug.

> If a block memory is used outside the thread which created it, a copy will be made.

Who is going to make the copy?

If the user thread does the copy, then locking is needed, since the owner thread may operate on the memory during the copy, leading to a race condition. (RCU works only if ALL write accesses are first made in a separate copy, so in order to make your scheme an RCU, even the owner thread needs to make a copy of its memory block before doing any update -- this is one of the cases I mentioned where the writer is penalized for faster reads.)

If the owner thread does the copy, then the user thread needs to provide synchronized access to its memory manager, most likely via some kind of locking. (Say a memory block owned by thread A is going to be passed to thread B: then thread A needs to ask thread B's memory manager for a chunk of memory before A can perform the copy...)

Last edited by AdamWu (2010-07-27 17:38:20)

Offline

#9 2010-07-30 09:48:40

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

I have a compiler enhancement proposal which could make Delphi more multi-thread friendly:
  see http://synopse.info/forum/viewtopic.php?pid=305

Offline

#10 2010-09-07 09:05:42

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

Here are some benchmarks... even though I found it difficult to get real data about this.

http://www.dlugosz.com/Repertoire/refma … paper.html
-> on a dual Pentium Pro, it was a 178/31 = 5.7 factor; on a Pentium 4 w/ hyperthreading, it was a 38/2 = 19 factor
http://www.spiral.net/software/barrier.html
-> on a Core2Duo, it was a 530/220 = 2.4 factor; on a Core2Extreme, it was a 2500/1130 = 2.2 factor

So LOCK seems to be getting better with the newer CPU architectures.

Offline

#11 2010-09-10 06:55:29

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

DD wrote:

Bottom of page:
http://www.realworldtech.com/page.cfm?A … 182719&p=8

"Nehalem also lowers the latency for synchronization primitives such as
LOCK, XCHG and CMPXCHG, which are necessary for multi-threaded
programming. Intel claims that the latency for LOCK CMPXCHG
instructions (which serializes the whole pipeline) is 20% of what it
was for the P4 (which was absolutely horrible) and about 60% of the
Core 2. While the latency is lower, the behavior is still similar to
prior generaitons; Lock instructions are not pipelined, although
younger operations can execute ahead of a LOCK instruction."

So I guess that the timings from spiral.net are somewhat wrong.

Offline

#12 2010-11-10 16:52:09

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

DD wrote:

Lock instructions are not pipelined, although
younger operations can execute ahead of a LOCK instruction."

This will definitely be a bottleneck for fast string processing in a thread.

In multi-threaded apps, statically allocated buffers are always preferred. But that's not so easy to implement.

The best performance I ever got was by calling the Windows thread API directly, with pre-allocated memory buffers, and assigning one thread to each core (or at least one thread per core, minus one or two - or divided by two if you know that your task is not HyperThreading-friendly). It worked very well for real-time image processing.

Or you can take a look at the new http://synopse.info/fossil/finfo?name=SynCrypto.pas file, and the way I implemented multi-threaded AES encryption via the pure Windows API. Since it doesn't use TThread, IsMultiThread will stay false, and FastMM4 will perform better... if you use no TThread in your application, of course!

The resulting code is very easy to read and follow... the Windows API implementation is even clearer than creating a TThread class:

var
  CPUCount: integer = 0;

procedure GetCPUCount;
var SI: TSystemInfo;
begin
  GetSystemInfo(SI); // USETHREADSFORBIGAESBLOCKS is defined for Windows only
  CPUCount := SI.dwNumberOfProcessors;
end;

type
  TThreadParams = record
    bIn, bOut: pAESBlock;
    BlockCount: integer;
    Encrypt: boolean;
    ID: THandle;
    AES: TAES;
  end;

{ we use direct Windows threads, since we don't need any exception handling
  nor memory usage inside the thread handler
   -> avoids classes.TThread and system.BeginThread() use
   -> the application is still "officially" mono-threaded (i.e. IsMultiThread=false),
     for faster System.pas and FastMM4 (no locking)
   -> the code is even shorter than the original using TThread }
function ThreadWrapper(var P: TThreadParams): Integer; stdcall;
begin
  with P do
    AES.DoBlocks(bIn,bOut,bIn,bOut,BlockCount,Encrypt);
  ExitThread(0);
  result := 0; // make the compiler happy, but will never be reached
end;

procedure TAES.DoBlocksThread(var bIn, bOut: PAESBlock; Count: integer; doEncrypt: boolean);
var Thread: array[0..3] of TThreadParams; // faster than dynamic array
    Handle: array[0..high(Thread)] of THandle;
    nThread, i, nOne: integer;
    pIn, pOut: PAESBlock;
begin
  if Count=0 then exit;
  if CPUCount=0 then
    GetCPUCount;
  if {$ifdef USEPADLOCK} padlock_available or {$endif}
    (CPUCount=1) or // (DebugHook<>0) or
    (Count<((512*1024) div AESBlockSize)) then begin // not needed below 512 KB
    DoBlocks(bIn,bOut,bIn,bOut,Count,doEncrypt);
    exit;
  end;
  nThread := CPUCount;
  if nThread>length(Thread) then // a quad-core is enough ;)
    nThread := length(Thread);
  nOne := Count div nThread;
  pIn := bIn;
  pOut := bOut;
  for i := 0 to nThread-1 do
  with Thread[i] do begin // create threads parameters
    bIn := pIn;
    bOut := pOut;
    BlockCount := nOne;
    Encrypt := doEncrypt;
    AES := self; // local copy of the AES context for every thread
    Handle[i] := CreateThread(nil,0,@ThreadWrapper,@Thread[i],0,ID);
    inc(pIn,nOne);
    inc(pOut,nOne);
    dec(Count,nOne);
  end;
  if Count>0 then
    DoBlocks(pIn,pOut,pIn,pOut,Count,doEncrypt); // remaining blocks
  inc(Count,nOne*nThread);
  assert(integer(pIn)-integer(bIn)=Count*AESBlockSize);
  assert(integer(pOut)-integer(bOut)=Count*AESBlockSize);
  bIn := pIn;
  bOut := pOut;
  WaitForMultipleObjects(nThread,@Handle[0],True,INFINITE); 
  for i := 0 to nThread-1 do
    CloseHandle(Handle[i]);
end;

A possibility is therefore to have the main GUI thread allocate memory for all background threads before calling them, then have these threads created by the pure Windows API, instead of defining some TThread.Execute... You can also use memory allocated by the Windows API, either a local heap or VirtualAlloc: if you need big chunks of memory, VirtualAlloc is as fast as FastMM4, since for such big blocks FastMM4 just calls VirtualAlloc. smile

But be aware that you must not use exceptions in pure Windows API threads, or you'll have to duplicate the exception handling from the TThread code... sad

Offline

#13 2010-11-18 10:56:08

andrewdynamo
Member
Registered: 2010-11-18
Posts: 65

Re: Delphi doesn't like multi-core CPUs (or the contrary)

I made a new memory manager for Delphi which uses a threadvar/TLS inside, so each thread has its own memory. This way no locks are used, and memory-intensive apps scale (almost) perfectly now!
The name of the new MM is ScaleMM:
http://code.google.com/p/scalemm/

I released version 0.9 yesterday: still a little bit rough, but functional :-).
It is also 4x faster than FastMM, even single-threaded!

I had 2 major things in mind:
1 = make it scale
2 = make it as simple as possible (so faster than the bloated FastMM)

I still use a simple CAS lock for inter-thread memory (memory allocated in thread 1, freed in thread 2).
And for the global cache/pool I use a simple lock, which maybe can be replaced with an interlocked index for getting/putting an item in an array/pool.
If anybody has better ideas on how to do it without locking: please say so :-).

Also other comments or help are welcome!

Last edited by andrewdynamo (2010-11-18 11:27:33)

Offline

#14 2010-11-18 12:08:11

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

andrewdynamo wrote:

I made a new memory manager for Delphi which uses a threadvar/TLS inside, so each thread has its own memory. This way no locks are used, and memory-intensive apps scale (almost) perfectly now!

Very good unit!
It's definitely easy to read and follow, because of its nice and simple design.

In TThreadMemManager.GetMem/Reallocmem:
- You don't handle aSize<=0, which should return nil (you'll get a GPF in the current implementation).

Everywhere in the code (to be 64-bit ready):
- You should use pointer(NativeUInt()) instead of pointer(cardinal()) typecasts.
- You should use NativeInt instead of integer (to be sure the value is stored in a register, without any overflow check).

Some ideas/remarks:
- The RET at the end of _GetOwnTls is pointless.
- Perhaps joining all the code into only one ScaleMM unit would make sense (easier to deploy/install).
- RELEASE should be the default: a NORELEASE conditional could make sense instead (for version 1.0) wink
- Why did you use {$A8}? {$A4} or 'packed' wouldn't be wrong either. 16-byte alignment could be an option only, and FastMM4 doesn't always return 16-byte aligned blocks in the default compilation mode (via OldMM.GetMem).
- If you use 16 bytes per TMemHeader, you could add a "magic" cardinal number to it to identify FreeMem/ReallocMem misuses.
- You are using New() and Dispose(), which can safely be replaced with OldMM.GetMem(sizeof()) and OldMM.FreeMem() calls.
- You use some div 32 or div 256: you'd better use shr 5 and shr 8, which will be faster when working with integer/NativeInt.

IMHO in Scale_ReallocMem(), the 

//new size smaller then current size?
  if (aSize <= Owner.FItemSize) then    //we have already incremented blocksize in "TMemList.AllocMem" and "TMemStore.GetMemFromList"
  begin
    if (aSize + Owner.FItemSize >= Owner.FItemSize) then
      Result := aMemory //no resize needed
    else
    //too much downscaling
    begin
      Result := Scale_GetMem(aSize);   //new mem
      if aMemory <> Result then
      begin
        Move(aMemory^, Result^,
             aSize);                   //copy (use smaller new size)
        Scale_FreeMem(aMemory);        //free old mem
      end;
    end;
  end

doesn't make sense to me. As it is written, the "too much downscaling" block will never be triggered.


I see that you rely on FastMM4 for bigger block sizes. What about using the Windows Heap API?
It's slower than FastMM4 for single-threaded applications, but I read that it behaves better than FastMM4, even for big blocks, in multi-threaded applications.
Code extracted from our Enhanced RTL for Delphi (perhaps HEAP_NO_SERIALIZE is not a good idea and should be avoided in multi-threaded apps):

const
  HEAP_NO_SERIALIZE              = $00001;
  HEAP_GROWABLE                  = $00002;
  HEAP_GENERATE_EXCEPTIONS       = $00004;
  HEAP_ZERO_MEMORY               = $00008;
  HEAP_REALLOC_IN_PLACE_ONLY     = $00010;
  HEAP_TAIL_CHECKING_ENABLED     = $00020;
  HEAP_FREE_CHECKING_ENABLED     = $00040;
  HEAP_DISABLE_COALESCE_ON_FREE  = $00080;
  HEAP_CREATE_ALIGN_16           = $10000;
  HEAP_CREATE_ENABLE_TRACING     = $20000;
  HEAP_MAXIMUM_TAG               = $00FFF;
  HEAP_PSEUDO_TAG_FLAG           = $08000;
  HEAP_TAG_SHIFT                 = 16;

var
  HeapHandle: THandle = 0;
  {* Global handle to the heap. Do not change it! }

  HeapFlags: cardinal = 0;
  {* Possible flags are:
     HEAP_GENERATE_EXCEPTIONS - system will raise an exception to indicate a
                              function failure, such as an out-of-memory
                              condition, instead of returning NULL.
     HEAP_NO_SERIALIZE - mutual exclusion will not be used while the HeapAlloc
                       function is accessing the heap. Be careful!
                       Not recommended for multi-thread applications.
                       But faster.
     HEAP_ZERO_MEMORY - obviously. (Slower!)
  }

  { Note from MSDN:
    The granularity of heap allocations in Win32 is 16 bytes. So if you
    request a global memory allocation of 1 byte, the heap returns a pointer
    to a chunk of memory, guaranteeing that the 1 byte is available. Chances
    are, 16 bytes will actually be available because the heap cannot allocate
    less than 16 bytes at a time.
  }

function GetProcessHeap: THandle; stdcall; external kernel;
function HeapAlloc(hHeap: THandle; dwFlags, dwBytes: Cardinal): Pointer; stdcall; external kernel;
function HeapReAlloc(hHeap: THandle; dwFlags: Cardinal; lpMem: Pointer; dwBytes: Cardinal): Pointer; stdcall; external kernel;
function HeapFree(hHeap: THandle; dwFlags: Cardinal; lpMem: Pointer): LongBool; stdcall; external kernel;

function SysGetMem(size: Integer): Pointer;
// Allocate memory block
begin
  if HeapHandle=0 then
    HeapHandle := GetProcessHeap; // allow the heap to be shared within dll
  Result := HeapAlloc(HeapHandle, HeapFlags, size);
end;

function SysFreeMem(p: Pointer): Integer;
// Deallocate memory block
begin
  if HeapHandle=0 then
    Result := 0 else
    Result := Integer(not HeapFree(HeapHandle, HeapFlags and HEAP_NO_SERIALIZE, p));
end;

function SysReallocMem(p: Pointer; size: Integer): Pointer;
// Resize memory block
begin
  if HeapHandle=0 then
    HeapHandle := GetProcessHeap; // allow the heap to be shared within dll
  Result := HeapRealloc(HeapHandle, HeapFlags and (HEAP_NO_SERIALIZE or
         HEAP_GENERATE_EXCEPTIONS or HEAP_ZERO_MEMORY),
         // (Prevent using flag HEAP_REALLOC_IN_PLACE_ONLY here - to allow
         // the system to move the block if necessary).
         p, size);
end;

{$IFDEF MSWINDOWS}
function GetHeapStatus: THeapStatus;
begin
  fillchar(result,sizeof(result),0);
end;
{$endif}

procedure UninitAllocator;
// Shutdown; do nothing
begin
end;

Offline

#15 2010-11-18 13:09:31

andrewdynamo
Member
Registered: 2010-11-18
Posts: 65

Re: Delphi doesn't like multi-core CPUs (or the contrary)

Thank you for the positive reply!

- "You don't handle aSize<=0"
  Thank you, I will add a check for that (or maybe the RTL already does it?)
- "Everywhere in the code (to be 64 bit ready)"
  Yes, I will change that too; I really hope the 64-bit (preview) compiler will be available soon to test ScaleMM.
  But for now, I used my sparse spare time to get it working. 64-bit compatibility is in the (near) future, lower priority.
- "the RET in ending of _GetOwnTls is pointless"
  Ok, thanks
- "Perhaps joining all code in only one ScaleMM unit could make sense (easier to deploy/install)"
  Yes, I first thought small units would be better than one big unit, but it is not easier to install (you have to add multiple units to the dpr or use a search path)
- "RELEASE should be the default: a NORELEASE could make sense by default (for version 1.0)"
  Ok, thanks
- "why did you use {$A8} ? {$A4} or 'packed' wouldn't be wrong either. 16 bytes alignment could be an option only, and FastMM4 doesn't always return 16 byte aligned blocks in the default compilation mode (via OldMM.GetMem)."
  Just some test to see if it makes a difference :-)
  I do not have much experience with alignment yet
- "If you use 16 bytes per TMemHeader, you could add a "magic" cardinal number to it to identify Freemem/Reallocmem misuses"
  yes, I can use the 8 extra bytes for something else then, but what do you mean by "misuses"?
- "you are using New() and Dispose() which can be safely replaced with OldMM.GetMem(sizeof()) and OldMM.FreeMem() calls."
  Yes and no: this way it can use its own memory (that's the reason for the "recursive" check)
- "You use some div 32 or div 256: you'd better use  shr 5  and shr 8, which will be faster when working with integer/NativeInt."
  Aha, thanks for the tip
- about downscaling:
  E.g. aSize is 1 and itemsize is 256; then:
    1 < 256
    1 + 256 >= 256   ....  hmmm, you're right

  It should be:
    if aSize < Owner.FItemSize div 2 then   // downscale only if less than half the item size
    1 < 256 div 2    -> not ok as-is (too much waste), so downscale
    128 < 256 div 2  -> false: exactly half, which is ok, so keep the block
- about large mem:
  For the first version, I focus on "make a working POC", so ScaleMM on top of FastMM was the easiest :-).
  Btw, "on top" is also a nice feature: you can put any other MM before ScaleMM!
  I haven't thought much about how to deal with large mem (I would like to make version 2 independent of FastMM),
  but your proposal of using the heap sounds good, thanks! Do you have good ideas about a "large memory" strategy?
  I know FastMM has some special logic for medium and large blocks.
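The "magic" cardinal ab suggested can be sketched as follows in C (the header layout, names, and cookie values here are purely illustrative, not ScaleMM's actual format): each block carries a guard word that the free routine checks, so releasing a foreign or corrupted pointer is flagged instead of silently damaging the heap.

```c
#include <stdlib.h>
#include <stdint.h>

#define MAGIC_ALIVE 0x4D454D4Fu   /* arbitrary cookie marking a live block */
#define MAGIC_FREED 0xDEADBEEFu   /* poison value written on release */

typedef struct {
    uint32_t magic;   /* guard value checked on free */
    uint32_t size;    /* requested size; spare room, as in the 16-byte header */
} MemHeader;

void *guarded_getmem(size_t size)
{
    MemHeader *h = malloc(sizeof(MemHeader) + size);
    if (h == NULL)
        return NULL;
    h->magic = MAGIC_ALIVE;
    h->size  = (uint32_t)size;
    return h + 1;                 /* user pointer starts after the header */
}

/* Returns 0 on success, -1 if the pointer is not a live guarded block. */
int guarded_freemem(void *p)
{
    MemHeader *h;
    if (p == NULL)
        return 0;
    h = (MemHeader *)p - 1;
    if (h->magic != MAGIC_ALIVE)
        return -1;                /* foreign pointer or corrupted header */
    h->magic = MAGIC_FREED;       /* poison before releasing */
    free(h);
    return 0;
}
```

A debug build could raise an error on the -1 path; in release mode the check costs one compare per free.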

#16 2010-11-18 13:25:52

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

andrewdynamo wrote:

- "you are using New() and Dispose() which can be safely replaced with OldMM.GetMem(sizeof()) and OldMM.FreeMem() calls."
  Yes and no: this way it can use its own memory (that's the reason for the "recursive" check)

So replace it with GetMem and FreeMem.

andrewdynamo wrote:

  For the first version, I focus on "make a working POC", so ScaleMM on top of FastMM was the easiest :-).
  Btw, "on top" is also a nice feature: you can put any other MM before ScaleMM!
  I haven't thought much about how to deal with large mem (I would like to make version 2 independent of FastMM),
  but your proposal of using the heap sounds good, thanks! Do you have good ideas about a "large memory" strategy?
  I know FastMM has some special logic for medium and large blocks.

FastMM is perhaps the best option for blocks > 16 KB (medium blocks are handled internally up to a bit more than 256 KB, then it calls VirtualAlloc for bigger blocks).
The "on top" aspect is really really great. I really like it.

For the "div 2", replace it with "shr 1". Or use cardinals (the compiler will replace "aCardinal div 2" by "aCardinal shr 1", but it won't do it for an integer, because it must check whether the integer is < 0).
For unsigned integers, div 2 = shr 1, div 4 = shr 2, div 8 = shr 3, and so on...
Since you're using positive integers, you could safely use NativeUInt everywhere.
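The unsigned-versus-signed distinction is easy to verify in C, where `>>` plays the role of Pascal's shr: for unsigned operands, division by a power of two is exactly a right shift, while for negative signed operands a bare arithmetic shift floors instead of truncating, which is why the compiler must emit fixup code for signed types (a small sketch; the signed `>>` result is implementation-defined in C, though arithmetic shift is universal in practice):

```c
/* For unsigned operands, division by 2 is exactly one right shift,
   so the compiler can emit shr directly. */
unsigned int udiv2(unsigned int x) { return x >> 1; }

/* For signed operands the two operations differ on negative values:
   integer division truncates toward zero, while an arithmetic right
   shift floors toward negative infinity. */
int sdiv2_shift(int x) { return x >> 1; }   /* arithmetic shift (typical) */
int sdiv2_div(int x)   { return x / 2; }    /* true truncating division  */
```

For example, -7 / 2 truncates to -3, but -7 >> 1 floors to -4; the extra compare-and-adjust the compiler inserts for signed division bridges exactly this gap.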

You could have used a threadvar instead of the low-level Windows API.
The _FixedOffset trick is perhaps not worth it: reading a threadvar could be just as fast.

I've inlined the threadvar access in asm in our Enhanced RTL, for example for the ioresult function.
Using a threadvar would let you get rid of the asm and use regular pascal code in some cases (and so be ready for 64 bit or FPC).
Do you want me to rewrite the getmem/freemem versions in asm?
Perhaps once the algorithm is proven, I'll look into this, if you want.

In any case, your unit, combined with a thread pool (to avoid creating a new heap every time a thread is created), could make our server scale even better.
It's very similar to the MM included in FPC 2.4. Did you take a look at it?

#17 2010-11-18 13:41:26

andrewdynamo
Member
Registered: 2010-11-18
Posts: 65

Re: Delphi doesn't like multi-core CPUs (or the contrary)

- "So replace it with GetMem and FreeMem"
  Ok, that's better; Scale_GetMem etc. it will be then :-)
- "replace all div with shr"
  Yes, I will do that for version 1.0
- about using _fixedoffset:
  this way no (extra) multiply is done each time to calculate the offset to the thread memory, so it should be faster...
  But yes, for 64 bit (no asm support :-( ) I will probably need to use a normal threadvar.
- about the asm rewrite:
  Yes, that would be nice! I made it a lower priority because there is no 64 bit asm, but for 32 bit it would be good anyway.
  I am curious how much faster you can make it in asm!
- about the MM of FPC
  No, I coded it completely myself, but I used some ideas from TopMM (which is slower than FastMM in real-world tests and also too bloated; that's why I made ScaleMM) and of course some from FastMM. I also used the CAS32 from the lock-free queue:
http://www.thedelphigeek.com/2010/02/dy … right.html
  But I will take a look at it!

#18 2010-11-18 13:45:46

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

andrewdynamo wrote:

- about using _fixedoffset:
  this way no (extra) multiply is done each time to calculate the offset to the thread memory, so it should be faster...

In practice, the *4 multiplication is done in one CPU clock.
My point was about using the threadvar content, i.e. reading the offset from the threadvar position.
Since this call is only made once, it won't be slower, and it would avoid the memory protection trick.
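The threadvar suggestion has a direct analogue in C11's `_Thread_local` (a hedged sketch, not ScaleMM code): the compiler and linker compute the per-thread offset themselves, so no manual TLS-slot arithmetic is needed and the code stays plain and portable.

```c
#include <pthread.h>
#include <stddef.h>

/* Each thread gets its own copy of this variable; the toolchain emits the
   TLS offset computation, so no GetOwnTls-style pointer tricks are needed. */
static _Thread_local int per_thread_counter = 0;

static void *worker(void *out)
{
    /* Only this thread's copy is touched; no locking required. */
    for (int i = 0; i < 1000; i++)
        per_thread_counter++;
    *(int *)out = per_thread_counter;
    return NULL;
}

/* Runs two workers; each should observe exactly its own 1000 increments. */
int run_two_workers(int results[2])
{
    pthread_t a, b;
    if (pthread_create(&a, NULL, worker, &results[0]) != 0) return -1;
    if (pthread_create(&b, NULL, worker, &results[1]) != 0) return -1;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```

This is the same property a Delphi threadvar gives: per-thread storage without atomic operations or a LOCK prefix on the hot path.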

#19 2010-11-19 12:40:49

andrewdynamo
Member
Registered: 2010-11-18
Posts: 65

Re: Delphi doesn't like multi-core CPUs (or the contrary)

ScaleMM fixes are updated in SVN

#20 2010-11-19 13:00:50

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

Nice!

What about your testing status?
Does it scale well on real applications (like a server application)?

#21 2010-11-19 13:38:48

andrewdynamo
Member
Registered: 2010-11-18
Posts: 65

Re: Delphi doesn't like multi-core CPUs (or the contrary)

ab wrote:

Nice!
What about your testing status?
Does it scale well on real applications (like a server application)?

Not yet, only in a demo program with 30.000.000 mem allocs, strings + objects in 16 threads on a quad core machine :-).

I'm busy right now @ work with a large refactoring of old CRUD objects, so I hope next week

#22 2010-11-19 16:28:14

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

andrewdynamo wrote:

Not yet, only in a demo program with 30.000.000 mem allocs, strings + objects in 16 threads on a quad core machine :-).

Not bad. wink

andrewdynamo wrote:

I'm busy right now @ work with a large refactoring of old CRUD objects, so I hope next week

So do I. Refactoring and maintenance pay the bills...

#23 2010-11-19 17:34:14

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

I just noticed (from the svn code) that you changed some i: integer into i: byte.
Using an integer will be faster in the generated code: it will use a full 32 bit register instead of an 8 bit part of it, with some typecasts.
integer is more adequate than byte here, even if the optimizer sometimes generates a full integer in the asm code because it knows it'll be faster.
nativeint would be perfect: it always fits the processor register size.

In all cases, my advice is to always use Alt-F2 to see the generated asm code before any change.
In TMemBlockList.AddNewMemoryBlock I'm not sure your with statement picks the best one. Alt-F2 could help.

In TGlobalMemManager.GetBlockMemory, I guess your try...finally bl.FRecursive := False is pointless. The previous implementation sounded better to me.

IMHO TMemBlockList.FItemSize should be NativeInt to avoid a word to integer/NativeInt cast in the generated code.

#24 2010-11-22 07:41:31

andrewdynamo
Member
Registered: 2010-11-18
Posts: 65

Re: Delphi doesn't like multi-core CPUs (or the contrary)

ab wrote:

I just noticed (from svn code) that you changed some i: integer into i: byte
In TGlobalMemManager.GetBlockMemory, I guess your try...finally bl.FRecursive := False is pointless. Previous implementation sounded better to me.
IMHO TMemBlockList.FItemSize should be NativeInt to avoid a word to integer/NativeInt cast in the generated code.

I tried your proposal (byte to integer/nativeint) and got no speed improvement. Changing FItemSize from word to integer/nativeint even made it somewhat slower!
(probably due to alignment).
So I kept the byte and word, mainly to show that these fields hold low values and that these sizes are the minimum needed (without getting slower because of it).
The only parts where I use integer instead of NativeInt are the TMemoryManagerEx functions. I changed all other parts to nativeint where it made sense, and to
another type where an integer was not needed.

About the try..finally: it's just to be safe in case of an AV (access violation)
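The "probably due to alignment" guess is plausible: widening a 16-bit field to pointer size can grow the enclosing record and its cache footprint, which may cost more than the saved register-width conversion. A small illustrative C sketch (hypothetical layouts, not ScaleMM's actual records):

```c
#include <stdint.h>

/* A header with a word-sized (16 bit) item_size field, as in ScaleMM. */
typedef struct {
    void    *next;        /* e.g. a link to the next block       */
    uint16_t item_size;   /* compact field: low values only      */
    uint16_t item_count;
} SmallHeader;

/* The same header with the fields widened to pointer size (NativeInt). */
typedef struct {
    void     *next;
    uintptr_t item_size;  /* wider field, no narrowing casts ... */
    uintptr_t item_count; /* ... but a larger, colder record     */
} WideHeader;
```

On a 64 bit target the wide variant grows from 16 to 24 bytes, so more headers compete for each cache line; the measured slowdown andrewdynamo saw is consistent with that trade-off.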

Last edited by andrewdynamo (2010-11-22 08:51:24)

#25 2010-11-22 17:26:16

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 5,535
Website

Re: Delphi doesn't like multi-core CPUs (or the contrary)

I'll work on the source tonight, in order to rewrite some parts according to some ideas I got.
For example, some block sizes are overlapping (256 and 2048).

I'll do the refactoring while looking at the generated asm code (Alt-F2), which is a good way to optimize for speed, whatever CPU it will run on.

I'll try to make a simple medium-sized block implementation.
