Delphi doesn't like multi-core CPUs (or the contrary)

andrewdynamo · 2010-11-29 07:00:33

I have some (good) thoughts about how to implement medium blocks in a fast and easy (not complex) way, but need to work (and think) it out first.

Performance with your mods is it a bit faster: 1200ms (my simple demo test) instead of 1300ms.

I found the FastCode MM Challenge this weekend, which contains good MM tests: I already fixed some small problems. Still some AVs wit realloc, hope to fix this soon.

TPrami · 2010-11-29 07:09:22

andrewdynamo wrote:

I found the FastCode MM Challenge this weekend, which contains good MM tests: I already fixed some small problems. Still some AVs wit realloc, hope to fix this soon.

I think that BV-tool would be very good starting point, it it passes, then you must be close...

-TP-

ab · 2010-11-29 07:10:39

Good news.

You're working hard, and your MM sounds very promising to me.

Staying with not complex code is a very good idea.

I knew about the Fastcode MM challenge, but didn't take a look at this one since years, because it was not benchmarking MM on multi-core CPU at this time. Scaling was not a problem, and FastMM4 was winning most challenges.

TPrami · 2010-11-29 08:19:28

ab wrote:

I knew about the Fastcode MM challenge, but didn't take a look at this one since years, because it was not benchmarking MM on multi-core CPU at this time. Scaling was not a problem, and FastMM4 was winning most challenges.

Maybe it would be worth to add multithreaded test into it

-TP-

andrewdynamo · 2010-11-29 09:09:56

I changed "the challenge" using pansichar instead of pchar (D2005 was not yet unicode :-) ) and now all tests are running fine.
Hope to post some results soon.

ab · 2010-11-29 10:51:03

Very nice!

andrewdynamo · 2010-11-29 11:13:04

Overal:
- ScaleMM is faster, only realloc's are slower than D2010 internal Fastmm.
Of course ScaleMM is much better in multi-threaded tests :-)
- ScaleMM uses more mem than d2010/fastmm

Results:
http://code.google.com/p/scalemm/source … _D2010.txt

So I have to check why reallocs are slower, especially the "StringThreadTest" has very low score!

I'll try to reduce memory overhead in ScaleMM2

ab · 2010-11-29 13:04:42

I guess most time is spent in this part:

   // Copy resizing
   for I := 0 to FStringItems - 1 do
     SetLength(B[i], B2);

Meaning that resizing to half size will use move() in every case (the shr 1 test will always fail, due to the overhead).

Two possible fixes:
1. if NativeUInt(aSize) > Owner.FItemSize shr 2 then
2. put a faster move (using SSE2?) - but I guess you won't have any significant gain here

Downsizing is not very common IMHO for strings.
You append data more often than you use SetLength(s,length(s) div 2) in a real application.

This is the limit of benchmarking... the TStringThread test doesn't sounds very relevant to me.
Perhaps with the cRandomSizes = TRUE instead of false.

Why unit did you benchmark? ScaleMM or ScaleMM2?

andrewdynamo · 2010-11-29 14:56:27

Yes, it is less relevant.
However, to get a high score (to show ScaleMM is really faster than FastMM) I have to "cheat" my realloc algoritm... :-(
Maybe a simple change of > instead of >= can make a big difference too.

I benchmarked ScaleMM, so it should be "ready for release" now (no crashes in extensive benchmark).
ScaleMM2 is only POC, however a simple app runs fine, but big app won't (mem is never released etc etc).

ab · 2010-11-29 16:47:55

Or this would be sufficient for the benchmark:

 if NativeUInt(aSize) > Owner.FItemSize shr 1-sizeof(TMemHeader) then

Congratulations!
When the updated version will be available in your repository, let me know: I'll make tests on my side.

ab · 2010-12-02 09:26:14

I've corrected an issue in our framework (fixed dual memory release in case of FTS3 use, in TSQLDataBase.Destroy) which was triggering problems with ScaleMM.

It now works without any problem on our own stress test application.

From my point of view, ScaleMM is "ready for release".

TPrami · 2010-12-02 11:44:20

Hello,

Could you send me compiled versions of those (easier to test on my machines). Or put them into some public place for download.

-TP-

ab · 2010-12-02 12:52:10

You can download the file from our source code repository.

First log in as anonymous.

Then go to http://synopse.info/fossil/finfo?name=SynScaleMM.pas

Select the revision you want (latest on top), then click on the "Download" link.

There is almost no code difference with your current r17 revision.

TPrami · 2010-12-02 13:13:35

Oops,

I was more than less uninformative

I mean .exe compiled version of test/bechmark tool, so I could test them more easily (see the difference on my machine(s))

Sorry for that, I need to concentrate more

-Tee-

andrewdynamo · 2010-12-03 15:21:01

I committed my latest version to SVN:
http://code.google.com/p/scalemm/source … caleMM.pas

StringTest Performance is still too low due to lock on global table (because a full block is free). Also FastMM has a lock on large blocks too (so 2 times a lock).

All tests and checks are OK so ready for release.

I'm busy with a new allocation algoritme (for any alloc size, not only small), I hope a POC next week.

ab · 2010-12-03 16:50:25

andrewdynamo wrote:

All tests and checks are OK so ready for release.

Great!!!

andrewdynamo wrote:

I'm busy with a new allocation algoritme (for any alloc size, not only small), I hope a POC next week.

Good news. Can't wait for it!

ab · 2010-12-04 09:46:02

Here is some code "official" article about atomic LOCK instructions.
http://software.intel.com/en-us/article … itectures/

There is definitively a penalty about using these LOCKs.

The way the Spin wait is implemented in the current implementation, for example in TGlobalMemManager.FreeBlockMemory, seems perfect to me, according to the specs published by this Intel's article (see last paragraph named "Backing Off Locks with Spin-Wait Loops"):

  // lock
  while bl.FRecursive or (LockCmpxchg(0, 1, @bl.FRecursive) <> 0) do
  begin
    // fast/simple switch to other thread and wait least possible time
    if not SwitchToThread then
      Sleep(0);
    // can we lock now?
    if not bl.FRecursive and (LockCmpxchg(0, 1, @bl.FRecursive) = 0) then
      Break;
    // longer wait (up to 15ms!), otherwise race condition + high Kernel CPU if only Sleep(0)
    Sleep(1);
  end;

I've uploaded an updated version of our forked SynScaleMM.
I made some profiling investigation, under modern Delphi (with inlining), and in some cases, inlining made a great impact to the speed.

I really like this one:

pm := Owner.GetMem(SizeOf(pm^));

Will be inline into a few asm instructions: since the block size if a constant, all the

if aSize <= (length(FMiniMemoryBlocks)*32) then

will be reduced to the corresponding matching branch, and therefore the generated asm code will be reduced!

ab · 2010-12-04 14:16:34

I've made some speed benchmark, from some source code just posted in Embarcadero forum:
https://forums.embarcadero.com/thread.j … 061#305061

Here is the code:

program ThreadsTest;
 
{$APPTYPE CONSOLE}
 
uses
  FastMM4,
  SynScaleMM,
  Windows,
  SysUtils;

const
  MAX_THREADS = 100;

type
  TThreadFuncData = record
    Counter: integer;
  end;
  PThreadFuncData = ^TTHreadFuncData;
 
  TThreadDataArray   = array of TThreadFuncData;
  TThreadHandleArray = array of THandle;
  TTHreadIDArray     = array of cardinal;
 
var
  ThreadDataArray:   TThreadDataArray;
  ThreadHandleArray: TThreadHandleArray;
  ThreadIDArray:     TTHreadIDArray;
 
function ThreadFunc(aParam: pointer): DWORD; stdcall;
var
  S: string;
begin
  while (True) do
  begin
    S := 'Reporting from thread: ' + IntToStr(PThreadFuncData(aParam)^.Counter) + #13#10;
    inc(PThreadFuncData(aParam)^.Counter);
  end;
  Result := 0;
end;

procedure ThreadTest;
var
  i, Sum: cardinal;
  Ticks: cardinal;
begin
  SetLength(ThreadDataArray, MAX_THREADS);
  SetLength(ThreadHandleArray, MAX_THREADS);
  SetLength(ThreadIDArray, MAX_THREADS);
  for i := 0 to MAX_THREADS - 1 do
  begin
    ThreadDataArray[i].Counter := i;
    ThreadIDArray[i]           := 1;
    ThreadHandleArray[i]       := CreateThread(nil, 0, @ThreadFunc, @ThreadDataArray[i], 0, ThreadIDArray[i]);
  end;
  Ticks := GetTickCount;
  repeat
    Sleep(500);
    Sum := 0;
    for i := 0 to MAX_THREADS-1 do
      inc(Sum,ThreadDataArray[i].Counter);
    writeln(Sum div (GetTickCount-Ticks));
  until false;
  ReadLn;
end;

begin
  IsMultiThread := true;
  ThreadTest;
end.

It calls the string function in an infinite loop, then calculate the average of count.
I tested this little benchmark with or without our SynScaleMM unit, and with Delphi 7, Delphi 2007 and Delphi 2010.
Delphi 7 has our enhanced RTL patches over System.pas and SysUtils.pas.. and there is quite a difference here!

On my Core i7 with 8 cores, with MAX_THREADS = 100, here are the results:

Delphi 7 with our Enhanced RTL:
- Borland's MM: 18000, with 100% CPU usage
- FastMM4: 15000, with 12% CPU Usage
- SynScaleMM over Borland's MM: 32000, with 100% CPU usage
- SynScaleMM over FastMM4: 34000, with 100% CPU usage

Delphi 2007 (no Borland's MM since FastMM4 is by default) without our Enhanced RTL:
- FastMM4: 6000, with 12% CPU Usage
- SynScaleMM over FastMM4: 19400, with 100% CPU usage

Delphi 2010 (no Borland's MM since FastMM4 is by default) without our Enhanced RTL:
- FastMM4: 3400, with 12% CPU Usage
- SynScaleMM over FastMM4: 13000, with 100% CPU usage

I guess the difference came from:
- our Enhanced RTL is doing its purpose well (i.e. speed) under Delphi 7;
- FastMM4 is not scaling well at all;
- SynScaleMM scales up to no bottleneck;
- Borland's MM scales (a little bit) better than FastMM4, but spin lock is badly implemented (it burns 100% of CPU);
- UnicodeString handling in Delphi 2010 make string size twice bigger, so it's more or less half speed than Delphi 2007

If you lower the thread count to 8 or 10 (i.e. to test the perfect fit on my 8 cores CPU), you get around the same values. A little bit lower, because no CPU nor OS is a perfect multi-threaded system, I guess.

Graph groups are the following:
1. Borland's MM (only for Delphi 7);
2. FastMM4;
3. SynScaleMM over Borland's MM (only for Delphi 7);
4. SynScaleMM over FastMM4.

The higher, the better, of course!

ab · 2010-12-04 17:35:39

The big speed gain of our Enhanced RTL comes from the fact that we used an optimized asm version of IntToStr() in SysUtils.pas, whereare the default implementation of Delphi 7/Delphi 2007/Delphi 2010 (and Delphi XE) is still calling the internal CvtInt asm functions, which used very slow div operation in loops (whereas our IntToStr() version makes less div as possible).

For Delphi 2010, it's even worse: the conversion is made using CvtInt, and not CvtIntW (which was existing in all cases), and the slow System.@UStrFromPCharLen function is called on every IntToStr() call ! During Unicode refactoring, performance was not a goal in EMB...

But when you suppress the gain of our Enhanced RTL IntToStr() function, you can see that our SynScaleMM unit scales much better than FastMM4.

andrewdynamo · 2010-12-06 12:06:47

I have committed version 1.1 with some improvements/optimizations:

Still, the "Fast Code MM Benchmark" results are not good enough:
----------------------------------------------------------------
Average Total Performance: (Scaled so that the winner = 100%)
D2010 : 100,0
ScaleMem : 84,3
Average Speed Performance: (Scaled so that the winner = 100%)
D2010 : 84,1
ScaleMem : 100,0
Average Memory Performance: (Scaled so that the winner = 100%)
D2010 : 100,0
ScaleMem : 50,2
----------------------------------------------------------------

Some tests suffers from many "FreeBlockMemoryToGlobal" and global "GetBlockMemory" calls. So bad combination of
thread memory and FastMM (large) block locking.
Also, large alloc + free + realloc have some lower performance because ScaleMM does a size check first (some extra
overhead) and passes it to FastMM after that.

Note: memory consumption is higher because less optimal memory usage: each thread has its own (partly) used memory.

So these results can be explained (less optimal working because of mixed thread memory + fastmm locking). But in the mean
time this is the best I can get.
Btw: I get much better "StringThreadTest" results if I use ScaleMM on top of HeapMM or VirtualMM (because of no FastMM "large block" locks) but then reallocation tests perform very bad. I hope to fix all these problems with ScaleMM2 (but do not expect results soon).

Last edited by andrewdynamo (2010-12-06 12:12:13)

andrewdynamo · 2010-12-06 12:25:19

ab wrote:

For Delphi 2010, it's even worse: the conversion is made using CvtInt, and not CvtIntW (which was existing in all cases), and the slow System.@UStrFromPCharLen function is called on every IntToStr() call ! During Unicode refactoring, performance was not a goal in EMB...

How about an Enhanced RTL for D2010? :-)

ab · 2010-12-06 14:04:47

andrewdynamo wrote:

I have committed version 1.1 with some improvements/optimizations

Very interesting. But some modifications don't ring any bell to me.

1. Why did you get rid of all

    if not SwitchToThread then
      sleep(0);

code in this update?

This doesn't seem coherent with Intel official recommendations, as in http://software.intel.com/en-us/article … hitectures (see last paragraph named "Backing Off Locks with Spin-Wait Loops").

It could be handy to put at least a "pause" asm instruction in this code, if there is no SwitchToThread/Sleep:

function CAS32(const oldValue: pointer; newValue: pointer; var destination): boolean;
asm
  lock cmpxchg dword ptr [destination], newValue
  setz  al
  jz @ok
  pause // tell the CPU we are in a spin lock
@ok:
end; { CAS32 }

2. This setting will always fails (GPF) during the unitary tests of my framework:

  C_ARRAYSIZE = 32;  //32 instead of 64 -> smaller overhead

3. I like the trick of the unique memory allocation in procedure TMemBlockList.AddNewMemoryBlock.

But with this new kind of allocation, we could get rid of FMemoryArray without any problem (it's pointing to itself with some constant offset).

4. In the following code, the with GetSmallMemManager^ + Getmem/Freemem is a faster call of Scale_Getmem/ScaleFreemem:

        with GetSmallMemManager^ do
        begin
          Result := GetMem(aSize); // new mem  -------------> instead of Scale_GetMem
          if aMemory <> Result then
          begin
            Move(aMemory^, Result^, aSize); // copy (use smaller new size)
            FreeMem(aMemory); // free old mem  -------------> instead of Scale_Freemem
          end;
        end;

5. I'll study your new freeing mechanism. Perhaps its the reason why C_ARRAYSIZE = 32 failed.

ab · 2010-12-06 14:06:37

andrewdynamo wrote:

How about an Enhanced RTL for D2010? :-)

When I'll use D2010 in production.

Still using Delphi 7, because exe are smaller and the Unicode is handled at the framework level.

And... because Delphi 7 IDE with Angus+CnPack add one is more productive to me than D2010.

ab · 2010-12-06 16:09:01

I've uploaded a new version of SynScaleMM.
See http://synopse.info/fossil/info/6eeb471f30

It's now synchronized with r19 revision, from Dec 6, 2010, with some enhancements proper to SynScaleMM. In particular, I tried to clean my 5 points above.
For example, there is a new SPINWAITBACKOFF conditional, which is defined by default, in order to follow the Backing Off Locks with Spin-Wait Loops, as stated by Intel documentation.

Now seems working with C_ARRAYSIZE = 32... I've cleaned some Scale_Getmem() and Scale_Freemem() calls.

AdamWu · 2010-12-06 23:47:29

The reason back-off out-performs pure spin is that when the thread that hold the lock could not get CPU, the spinning is not making any progress, and the time is wasted. However, context-switch is not cheap either.

On a many core processor, it becomes increasing likely that the thread holding the lock is actually executing, and will be finished with it very soon. Sleep immediately on the first failed attempt also can induce unnecessary performance penalty.

So I think it make sense to first hot-spin for a while, and then switch to another thread. (I didn't invent this idea, I saw this first when I was reading Windows API several years ago: InitializeCriticalSectionAndSpinCount)

ab · 2010-12-07 07:04:23

AdamWu wrote:

So I think it make sense to first hot-spin for a while, and then switch to another thread. (I didn't invent this idea, I saw this first when I was reading Windows API several years ago: InitializeCriticalSectionAndSpinCount)

As you emphasized, it depends on how many threads your application have (or how many threads all running applications have globally).
If you have only one or two threads in your application, it could make sense no to call SwitchToThread.
But if you have 100 threads, SwitchToThread could be a good idea. Because you don't have 100 cores CPUs yet.

In all case, the current implementation in SynScaleMM is the folllowing:
1. check if possible to acquire lock and change the content atomically;
2. if possible and content updated, just break the loop;
3. if not possible, try to go to next thread;
4. if there is no next thread, sleep some time;
5. go back to step 1.

If you undefine the SPINWAITBACKOFF conditional, it will perform these steps:
1. check if possible to acquire lock and change the content atomically;
2. if possible and content updated, just break the loop;
3. if not possible, execute the PAUSE asm operation;
4. go back to step 1.

But the final judgement will become from wall clock timing.
With the small benchmak program of S := 'text'+IntToStr(n)+#13#10; infinite loop, defining SPINWAITBACKOFF make a small speed enhancement, less than 1000 units on my HW.

I guess there is no definitive answer, and that the best options will depend on the CPU and the application architecture.
That's why I tried to make SynScaleMM as modular as possible, with some conditional defines, which may be set at the Project/Options level. So it could help matching your program needs.

andrewdynamo · 2010-12-07 11:33:47

I thought (or hoped) a LOCK would not be interupted by a context switch, but it does...
(was more or less a dirty test)

So I did a quick but real test, the following gives the best results:

  ia: array of Integer;
  ...
  for l := 0 to 1000 do
  for i := 0 to 10000 do
  begin
    j := ia[i];
    k := j+1;
    while not CAS32(j, k, ia[i]) do
    begin
      if not SwitchToThread then
        Sleep(0);
      j := ia[i];
      k := j+1;
      if CAS32i(j, k, ia[i]) then
        Break
      else
         Sleep(1);
    end;
  end;

Timing results:
Testing with 1 threads...
Average: 182
Testing with 2 threads...
Average: 214
Testing with 3 threads...
Average: 260
Testing with 4 threads...
Average: 279
Testing with 8 threads...
Average: 426
Testing with 16 threads...
Average: 777

SwitchToThread gives slightly better results if thread count > core count (more efficient to switch to thread on same core).
If I don't use sleep(1) I get a race condition and it runs "forever" (I stopped waiting after 30s).

I will update my ScaleMM soon, first I want to test Fast Code Benchmark (much slower at startup than Fastmm -> never mind seems due to copyfile of exe).
Btw: I use the FMemoryArray now as a cache with an increment of the offset instead of multiply (add is faster than multiply?)

Last edited by andrewdynamo (2010-12-07 11:57:53)

ab · 2010-12-07 14:20:38

andrewdynamo wrote:

SwitchToThread gives slightly better results if thread count > core count (more efficient to switch to thread on same core).

That exactly what I meant just above, when I wrote "But if you have 100 threads, SwitchToThread could be a good idea".

andrewdynamo wrote:

If I don't use sleep(1) I get a race condition and it runs "forever" (I stopped waiting after 30s).

This is a worse case, IMHO.
But if it occurs just once, due to the MM, it's a major failure!

andrewdynamo wrote:

Btw: I use the FMemoryArray now as a cache with an increment of the offset instead of multiply (add is faster than multiply?)

Perhaps a little bit, but an integer mul is handled in a very aggressive way, i.e. within a few CPU cycles with modern CPU (with iCore I mean).
What is CPU consuming, is the div operation. Add or Mul are more or less equivalent.

I'm trying to implement some bitmap-based freeing of memory blocks in SynScaleMM, but in a different way you're doing in ScaleMM2. Still finishing the implementation. It'll be set by a USEBITMAP conditional.

andrewdynamo · 2010-12-07 14:53:09

ab wrote:

Perhaps a little bit, but an integer mul is handled in a very aggressive way, i.e. within a few CPU cycles with modern CPU (with iCore I mean).
What is CPU consuming, is the div operation. Add or Mul are more or less equivalent.

OK, I thought mul was also slow

ab wrote:

I'm trying to implement some bitmap-based freeing of memory blocks in SynScaleMM, but in a different way you're doing in ScaleMM2. Still finishing the implementation. It'll be set by a USEBITMAP conditional.

I saw the conditional: can you tell a little bit about how you think to do it? How much different?
(ScaleMM2 was "Proof Of Concept" anyway )

Btw: fixed ScaleMM is committed

Last edited by andrewdynamo (2010-12-07 15:17:21)

ab · 2010-12-07 16:36:37

andrewdynamo wrote:

I saw the conditional: can you tell a little bit about how you think to do it? How much different?

Draft version commited to our Source code repository.
Not working at all!!!

AdamWu · 2010-12-07 19:23:20

ab wrote:

If you have only one or two threads in your application, it could make sense no to call SwitchToThread.
But if you have 100 threads, SwitchToThread could be a good idea. Because you don't have 100 cores CPUs yet.

Actually the number of threads does not matter that much, at least in real life. Even if there are 100 threads, if 90 of them are not requesting memory at the same time, then there are only 10 competing threads, not 100. And more over, what matters is just that single thread and holds your lock, all other threads can be ignored.

The question boils down to the following: if the thread A that holds the lock is executing and will be done faster than a context switch, then it is better to spin wait; if A is switched off, or will take a long time to finish, then it is better to backoff. Since user program have no control over context switch, the problem is probablistic.

This resembles the classic "rent or buy" problem, consider each spin is a "rent", and a context switch/sleep is a "buy", I think the optimal answer would not be "always rent" or "always buy", but rent a few times then buy.

andrewdynamo wrote:

I thought (or hoped) a LOCK would not be interupted by a context switch, but it does...

Context switch and LOCK are two different things.

LOCK ensures the operation of an instruction is atomic, i.e. no other cores can do things that interfere with this operation; LOCK only applies to individual instructions, and no context switch can happen inside an instruction, locked or not.

Context switch happens on individual cores, and do not concern other cores - no context switch does not mean atomic (other core can fiddle with your memory at any time).

Last edited by AdamWu (2010-12-07 20:24:41)

andrewdynamo · 2010-12-08 07:19:30

AdamWu wrote:

LOCK only applies to individual instructions, and no context switch can happen inside an instruction, locked or not.

I do not think this is true: if what you say is true, I would not need a Sleep() on a "simple" CAS instruction, because it would be completed (very) fast.
It seems however it can be interrupted, so it keeps a lock quite long: I got a race condition, it burned my CPU and I had to wait forever.

AdamWu · 2010-12-08 07:36:54

andrewdynamo wrote:

I do not think this is true: if what you say is true, I would not need a Sleep() on a "simple" CAS instruction, because it would be completed (very) fast.
It seems however it can be interrupted, so it keeps a lock quite long: I got a race condition, it burned my CPU and I had to wait forever.

It must be a different problem.

Basically what the LOCK prefix for an instruction does is to assert a hold signal on the common memory bus, so that no other core can perform memory operation while the locked instruction is executing. (More recent architecture such as core i may do it a bit differently, but more or less on the same line) It has nothing to do with context switch, and vice versa. You could refer to the Intel ISA and programming manual to verify what I said.

In fact, think about it, the user program must not have any means, other than provided by the OS, to affect the scheduling. Otherwise it is a serious security flaw, any program can use LOCK prefix to gain unfair schduling advantage over other programs...

Last edited by AdamWu (2010-12-08 07:43:18)

ab · 2010-12-08 20:33:32

andrewdynamo wrote:

If I don't use sleep(1) I get a race condition and it runs "forever" (I stopped waiting after 30s).

I've included an optional Sleep(1) to avoid such race condition.
For that, I've inlined the

if not SwitchToThread then
   Sleep(0);

into the CAS() function, as asm instructions, and renamed it CAS0.

And added a new CAS1() function, including a Sleep(1) in case of CAS failure.

For instance, the code now looks like this:

procedure TThreadMemManager.AddFreedMemFromOtherThread(aMemory: PMemHeader);
var
  poldmem, currentmem: POtherThreadFreedMemory;
begin
  currentmem := aMemory;
  repeat
    poldmem := FOtherThreadFreedMemory;
    currentmem.NextMem  := poldmem;  // link to current next BEFORE the swap!
    // set new item as first (to created linked list)
    if CAS0(poldmem, currentmem, FOtherThreadFreedMemory) then
      break;
{$ifdef BACKOFFSLEEP1}
    poldmem := FOtherThreadFreedMemory;
    currentmem.NextMem  := poldmem;  
    if CAS1(poldmem, currentmem, FOtherThreadFreedMemory) then
      break;
{$endif}
  until false;               
end;

Here is the code of the CAS0 and CAS1 functions:

// compare oldvalue with destination: if equal then newvalue is set
function CAS0(const oldValue: pointer; newValue: pointer; var destination): boolean;
// - if failed, try to Switch to next OS thread, or Sleep 0 ms if it no next thread 
asm // eax=oldValue, edx=newValue, ecx=Destination
  lock cmpxchg dword ptr [Destination],newValue
  // will compile as "lock cmpxchg dword ptr [ecx],edx" under Win32 e.g.
  setz al
{$ifdef SPINWAITBACKOFF}
  jz @ok
  call SwitchToThread
  test oldValue,oldValue // oldValue=eax under Win32 e.g.
  jnz @ok
  push 0
  call Sleep
  xor oldValue,oldValue // return false
{$else}
  jz @ok
  pause // let the CPU know this thread is in a Spin Wait loop
{$endif}
@ok:
end;

{$ifdef BACKOFFSLEEP1}
function CAS1(const oldValue: pointer; newValue: pointer; var destination): boolean;
// - if failed, try to Switch to next OS thread, or Sleep 1 ms if it no next thread
// (this 1 ms sleep is necessary to avoid race condition - see
//  http://synopse.info/forum/viewtopic.php?pid=914#p914 )
asm // eax=oldValue, edx=newValue, ecx=Destination
  lock cmpxchg dword ptr [Destination],newValue
  // will compile as "lock cmpxchg dword ptr [ecx],edx" under Win32 e.g.
  setz al
  jz @ok
  call SwitchToThread
  test oldValue,oldValue
  jnz @ok
  push 1
  call Sleep
  xor oldValue,oldValue
@ok:
end;
{$endif}

AdamWu · 2010-12-08 23:43:24

If you must use sleep(1) to "avoid" some race condition, this is *not* the right fix, you are just hiding the bug on your machine, and this bug is bound to happen on some other machine.

One more thing, please don't stuck SwitchToThread() or even Sleep() into a benign-named "CAS" function, it will lead to maintenance nightmare.
If you must do it, name the function CAS_AND_SLEEP_ON_FAIL().

ab · 2010-12-09 04:48:52

AdamWu wrote:

If you must use sleep(1) to "avoid" some race condition, this is *not* the right fix, you are just hiding the bug on your machine, and this bug is bound to happen on some other machine.

I'm not sure of this thing. See the sample code above.

AdamWu wrote:

One more thing, please don't stuck SwitchToThread() or even Sleep() into a benign-named "CAS" function, it will lead to maintenance nightmare.
If you must do it, name the function CAS_AND_SLEEP_ON_FAIL().

This code is not about to be easily maintained, but to be fast. And code grouping in the CAS* function makes sense to avoid even more dupplicated code.
Rename the functions makes always sense, of course!

AdamWu · 2010-12-09 05:59:12

So, the problem is that TThreadMemManager.AddFreedMemFromOtherThread() causes deadlock if you don't sleep(1), am I correct?

BTW, one of the experience I got debugging my own multithread code is that, it is easier to spot a race condition on *slower* multi-processor (core) machines.

I have discovered several race condition in my code since I used Intel Atom.
Those code were tested on faster Core2 processors for at least 100 runs, and the races just *never* show up...

Last edited by AdamWu (2010-12-09 06:07:15)

AdamWu · 2010-12-09 22:51:52

I read the ScaleMM code where CAS1 is used, and I think I found a suspecious segment.
in TGlobalMemManager.GetBlockMemory()

...
      prevmem := bl.FFirstFreedMemBlock;
      if prevmem = nil then
        Continue;
      nextmem := prevmem.FNextFreedMemBlock;
      if CAS0(prevmem, nextmem, bl.FFirstFreedMemBlock) then
...

This code segment can suffer from the A-B-A problem. That is:

1. These line are executed
prevmem := bl.FFirstFreedMemBlock;
if prevmem = nil then
Continue;
nextmem := prevmem.FNextFreedMemBlock;

2. Before the CAS line is executed, the thread is swapped off.

Then content of bl.FFirstFreedMemBlock is changed, then changed back to the same value. (For example, another thread grabs the block A, and then freed it, and then grabs several blocks and made A the head of the chain again.)

Note that, although bl.FFirstFreedMemBlock is the same, bl.FFirstFreedMemBlock.FNextFreedMemBlock may not be the same anymore!

3. CAS is executed, and because it only looks at the pointer value, which is the same, the CAS passed. But now the pointer chain is incorrect!
In lucky cases, you will lose track of some memory blocks, and the program continues without error; but in unlucky cases, the pointer chain form a loop, and the program deadlocks.

A solution would be to append a version field to each pointer, and the version monotonically increases everytime the corresponding pointer recycles. To do this, you will need to use cmpxchg8b, which performs CAS on aligned 8 bytes at one go.

-------------
Adding a sleep(1) helps to hide the ABA problem, but not sleep(0) nor yieldtothread(). The reason is that, sleep(1) forces the thread to give up current time slice, so that, when the thread wake up, it will have a whole new time slice. Then, it is very likely to completely execute the entire loop once, without being swapped off in the middle, which hides the ABA problem.

However, it is not the right fix, because the very first attempt is executed without the whole time slice guarantee. If the ABA problem manifest on the very first try, the memory still will be corrupt.

(Adding a sleep(1) before the first attempt would probably completely hide the problem, but it will bring the performance down to hell... )

Last edited by AdamWu (2010-12-10 05:15:12)

ab · 2010-12-10 09:54:53

There is a problem here, which may appear in some border-line conditions.

I'll take a look at that. Perhaps worth changing the way it's implemented... The fix using a version field sounds possible, but a bit aggressive. Perhaps a regular Critical section could be enough... But this part of the code is called often, and its purpose is to be fast. And Critical Section could'nt be as efficient.

In all cases, thanks for the report. We'll see what Andrewdinamo will tell about that...

AdamWu · 2010-12-10 15:06:12

Yes, the entire purpose of using CAS is to write lock-free code, which scales well as the number of cores increase. Using locking mechanism of any kind (CS, mutex, semaphore, etc.) defeats the purpose of using CAS.

andrewdynamo · 2010-12-13 06:57:50

About the ABA problem: I think you're right. Not easy to fix it: critical section or larger lock (FRecursive) will make it a lot slower. However, it is a lock on a block so less frequent (not every mem alloc/free).
Version check should also be possible: a little more mem overhead on a block (no problem) and block should be 8 byte aligned for a cmpxchg8b.
I will see what's the best and/or easiest :-).

AdamWu · 2010-12-14 05:13:08

andrewdynamo wrote:

About the ABA problem: I think you're right. Not easy to fix it: critical section or larger lock (FRecursive) will make it a lot slower. However, it is a lock on a block so less frequent (not every mem alloc/free).
Version check should also be possible: a little more mem overhead on a block (no problem) and block should be 8 byte aligned for a cmpxchg8b.
I will see what's the best and/or easiest :-).

I think I came up with an algorithm that don't need version field for each pointer.

The old algorithm does the following:

0. Head: [Block] ----> [Block] ----> [Next]
1. Retrieve [Block] from Head
2. Retrieve [Next] from [Block]
3. CAS on Head, compare against [Block] and store [Next] if match

The new algorithm does the following:

0. Head [Block:Next] ----> [Block] ----> [Next]
1. Retrieve [Next] from Head, if [Next] = INFINITE, spin-wait-then-yield
2. CMPXCHG8B on Head, compare against [Block:Next] and store [Next:INFINITE]
Now we have [Block:Next] as the return value, the [Block] is what we need, but we need to do the next step to maintain loop invariance:
3. Retrieve [NewNext] from [Next] and replace INFINITE in Head, which becomes [Next:NewNext]

The new algorithm is free of the A-B-A problem, because the CAS is performed on both [Block] and [Next] pointers, even if ABA happened, as long as both pointer are the same, there is no error.

However, the new algorithm is slightly inefficient, in that, after a successful step 2, no thread can continue until step 3 is performed. However, this "locked" region is just 5-7 instructions long, and should not lead to a heavy penalty.

the Put(NewBlock) method is also straight forward to implement:

0. Head [Block:Next] ----> [Block] ----> [Next]
1. Retrieve [Next] from Head, if [Next] = INFINITE, spin-wait-then-yield
2. CMPXCHG8B on Head, compare against [Block:Next] and store [NewBlock:Block]

There is no "locked" region in the Put() method either.

Last edited by AdamWu (2010-12-14 05:15:37)

ab · 2010-12-14 07:10:44

Very interesting.

What about benchmarking this new version?

AdamWu · 2010-12-18 06:14:57

I have implemented three versions of the lock-free algorithm.

The first version is the one I described above, which uses cmpxchg8b:
In the best scenario each Get uses 23 instructions, each put uses 18 instructions;
Instructions in the "locked" sections are 7 and 0, respectively.

The section version does not use cmpxchg8b (and is still ABA safe):
In the best scenario each Get uses 17 instructions, each put uses 12 instructions;
Instructions in the "locked" sections are 10 and 6, respectively.

Third version uses a version field, instead of my algorithm, and uses cmpxchg8b:
In the best scenario each Get uses 17 instructions, each put uses 16 instructions;
Instructions in the "locked" sections are 0 and 0, respectively.

I am quite interested to know which one runs faster.
But I don't have access to high core machines right now...
Would anyone be intested in testing out the code for me?

ab · 2010-12-18 13:04:26

Where could these versions be downloaded?

AdamWu · 2010-12-18 18:33:26

ab wrote:

Where could these versions be downloaded?

I have release my code along with a test program, under MPL 1.1:
http://code.google.com/p/delphi-toolbox … svn29&r=29

The test program takes two parameter, the first is the number of threads, and the second is the algortihm name.

Valid algorithm names are:
4B - without using cmpxchg8b
8B - using cmpxchg8b
8BV - using the version field and cmpxchg8b

The workload is an endless cycle of get and put.
The test program will create threads, warm them up, and then run for 10 seconds.
Finally it will print out the average time used for each get-and-put cycle

On a single core, single thread, Pentium M 2.0GHz:

1 thread,   4B,   ~64 nanoseconds per cycle
4 threads,  4B,   ~64 nanoseconds per cycle

1 thread,   8B,   ~78 nanoseconds per cycle
4 threads,  8B,   ~78 nanoseconds per cycle

1 thread,   8BV,   ~77 nanoseconds per cycle
4 threads,  8BV,   ~77 nanoseconds per cycle

On a single core, hyper thread, Atom 1.6GHz:

1 thread,   4B,   ~44 nanoseconds per cycle
4 threads,  4B,   ~47 nanoseconds per cycle

1 thread,   8B,   ~73 nanoseconds per cycle
4 threads,  8B,   ~83 nanoseconds per cycle

1 thread,   8BV,   ~69 nanoseconds per cycle
4 threads,  8BV,   ~64 nanoseconds per cycle

It seems that from the absolute number POV, the 4B routines perform the best, probably for under 8 or 16 cores.
But the power of true lock-free code (8BV) is show on Atom, as the number of cores increase, the latency of all other routine goes up, while only 8BV goes down!

But again, I don't have access to more variety of machines right now, so please post your performance measurements so the general trend is more clear.

Update:
On a dual core, single thread, Core2Duo 2.4GHz:

1 thread,   4B,   ~50 nanoseconds per cycle
4 threads,  4B,   ~92 nanoseconds per cycle

1 thread,   8B,   ~61 nanoseconds per cycle
4 threads,  8B,   ~85 nanoseconds per cycle

1 thread,   8BV,   ~61 nanoseconds per cycle
4 threads,  8BV,   ~102 nanoseconds per cycle

OK now I am confused... need more data...

Last edited by AdamWu (2010-12-18 19:05:41)

ab · 2010-12-18 19:12:04

Here are the results (after compilation with Delphi 2010), on my Core i7 laptop - 4 Cores + HT, so 8 logical cores (Win7 64 bit).

Microsoft Windows [version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. Tous droits réservés.

d:\Download>TestLFStack.exe
TestLFStack.exe <Thread Count> <Routine>
Routine: 4B / 8B / 8BV

d:\Download>TestLFStack.exe 1 4B
* Using 1 threads
* Timing routine "4B"
Initializing...
Warming up...
Start timing...
Stopped timing...
* Elapsed 10,00 seconds
* Exchanged 233414862 cycles
* Average 42,84 nanoseconds per cycle


d:\Download>TestLFStack.exe 32 4B
* Using 32 threads
* Timing routine "4B"
Initializing...
Warming up...
Start timing...
Stopped timing...
* Elapsed 10,00 seconds
* Exchanged 20939583 cycles
* Average 477,54 nanoseconds per cycle


d:\Download>TestLFStack.exe 1 8B
* Using 1 threads
* Timing routine "8B"
Initializing...
Warming up...
Start timing...
Stopped timing...
* Elapsed 10,00 seconds
* Exchanged 294280876 cycles
* Average 33,98 nanoseconds per cycle


d:\Download>TestLFStack.exe 32 8B
* Using 32 threads
* Timing routine "8B"
Initializing...
Warming up...
Start timing...
Stopped timing...
* Elapsed 10,00 seconds
* Exchanged 27258849 cycles
* Average 366,84 nanoseconds per cycle


d:\Download>TestLFStack.exe 1 8BV
* Using 1 threads
* Timing routine "8BV"
Initializing...
Warming up...
Start timing...
Stopped timing...
* Elapsed 10,00 seconds
* Exchanged 293132128 cycles
* Average 34,11 nanoseconds per cycle


d:\Download>TestLFStack.exe 32 8BV
* Using 32 threads
* Timing routine "8BV"
Initializing...
Warming up...
Start timing...
Stopped timing...
* Elapsed 10,00 seconds
* Exchanged 33581564 cycles
* Average 297,81 nanoseconds per cycle

On my CPU, 8BV sounds better.

AdamWu · 2010-12-18 22:00:36

I did various performance tunning:

1. It seems that the pause instruction is the main culprit for the abnormally high increase in latency, so I remove it from my code;

2. It seems spin cycles are not helping either, so those instructions are gone (saved a couple of push/pop too)

3. Adjusted the testing program so that the warm up is more responsive and timing should be more accurate.

Please check out my latest revision:
http://code.google.com/p/delphi-toolbox … Free/?r=30

Last edited by AdamWu (2010-12-18 22:01:25)

TPrami · 2010-12-19 07:51:24

AdamWu wrote:

I am quite interested to know which one runs faster.
But I don't have access to high core machines right now...
Would anyone be intested in testing out the code for me?

Put compiled test Exe(s) somewhere so I can run them, I have i7 980x. I am very interested into test this...

-TP-

ab · 2010-12-19 15:15:14

Here are the results with the r30 revision:

* Using 1 threads
* Timing routine "4B"
* Exchanged 301232487 cycles
* Average 33,20 nanoseconds per cycle

* Using 32 threads
* Timing routine "4B"
* Exchanged 63495940 cycles
* Average 157,48 nanoseconds per cycle

* Using 1 threads
* Timing routine "8B"
* Exchanged 287862234 cycles
* Average 34,74 nanoseconds per cycle

* Using 32 threads
* Timing routine "8B"
* Exchanged 55684771 cycles
* Average 179,57 nanoseconds per cycle

* Using 1 threads
* Timing routine "8BV"
* Exchanged 293171819 cycles
* Average 34,11 nanoseconds per cycle

* Using 32 threads
* Timing routine "8BV"
* Exchanged 42762375 cycles
* Average 233,85 nanoseconds per cycle

mORMot Open Source

#51 2010-11-29 07:00:33

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#52 2010-11-29 07:09:22

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#53 2010-11-29 07:10:39

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#54 2010-11-29 08:19:28

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#55 2010-11-29 09:09:56

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#56 2010-11-29 10:51:03

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#57 2010-11-29 11:13:04

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#58 2010-11-29 13:04:42

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#59 2010-11-29 14:56:27

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#60 2010-11-29 16:47:55

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#61 2010-12-02 09:26:14

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#62 2010-12-02 11:44:20

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#63 2010-12-02 12:52:10

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#64 2010-12-02 13:13:35

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#65 2010-12-03 15:21:01

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#66 2010-12-03 16:50:25

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#67 2010-12-04 09:46:02

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#68 2010-12-04 14:16:34

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#69 2010-12-04 17:35:39

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#70 2010-12-06 12:06:47

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#71 2010-12-06 12:25:19

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#72 2010-12-06 14:04:47

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#73 2010-12-06 14:06:37

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#74 2010-12-06 16:09:01

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#75 2010-12-06 23:47:29

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#76 2010-12-07 07:04:23

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#77 2010-12-07 11:33:47

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#78 2010-12-07 14:20:38

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#79 2010-12-07 14:53:09

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#80 2010-12-07 16:36:37

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#81 2010-12-07 19:23:20

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#82 2010-12-08 07:19:30

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#83 2010-12-08 07:36:54

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#84 2010-12-08 20:33:32

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#85 2010-12-08 23:43:24

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#86 2010-12-09 04:48:52

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#87 2010-12-09 05:59:12

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#88 2010-12-09 22:51:52

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#89 2010-12-10 09:54:53

Re: Delphi doesn't like multi-core CPUs (or the contrary)

#90 2010-12-10 15:06:12