#1 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-23 06:13:47

TPrami wrote:

Could not compile, D6 don't open the project, dies in Access Violation (Something wrong in my installation). And D2007 gives internal error when trying to compile it sad So I can't compile and test it at all for now...

I try it at home, but if it would be possible to have compiled .exe, and some bat to run some standard "test suite", with various settings.

Maybe to modify the test program to write  Log on disk, to make posting into here easier...

-TP-

Sorry, I was traveling in the past few days... tongue

I started learning lock-free coding around late 2008, so I have never tested my code on Delphi 2007 or below. It should work with 2009 and up, though.

#2 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-18 22:00:36

I did various performance tuning:

1. It seems that the pause instruction is the main culprit for the abnormally high increase in latency, so I removed it from my code;

2. The spin cycles do not seem to be helping either, so those instructions are gone (this saved a couple of push/pops too);

3. Adjusted the test program so that the warm-up is more responsive and the timing should be more accurate.

Please check out my latest revision:
http://code.google.com/p/delphi-toolbox … Free/?r=30

#3 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-18 18:33:26

ab wrote:

Where could these versions be downloaded?

I have released my code along with a test program, under MPL 1.1:
http://code.google.com/p/delphi-toolbox … svn29&r=29

The test program takes two parameters: the first is the number of threads, and the second is the algorithm name.

Valid algorithm names are:
4B - without using cmpxchg8b
8B - using cmpxchg8b
8BV - using the version field and cmpxchg8b

The workload is an endless cycle of get and put.
The test program will create the threads, warm them up, and then run for 10 seconds.
Finally it will print out the average time taken for each get-and-put cycle.

On a single-core, single-thread Pentium M 2.0GHz:

1 thread,   4B,   ~64 nanoseconds per cycle
4 threads,  4B,   ~64 nanoseconds per cycle

1 thread,   8B,   ~78 nanoseconds per cycle
4 threads,  8B,   ~78 nanoseconds per cycle

1 thread,   8BV,   ~77 nanoseconds per cycle
4 threads,  8BV,   ~77 nanoseconds per cycle

On a single-core, hyper-threaded Atom 1.6GHz:

1 thread,   4B,   ~44 nanoseconds per cycle
4 threads,  4B,   ~47 nanoseconds per cycle

1 thread,   8B,   ~73 nanoseconds per cycle
4 threads,  8B,   ~83 nanoseconds per cycle

1 thread,   8BV,   ~69 nanoseconds per cycle
4 threads,  8BV,   ~64 nanoseconds per cycle

From an absolute-numbers point of view, the 4B routines seem to perform best, probably for anything under 8 or 16 cores.
But the power of truly lock-free code (8BV) shows on the Atom: as the number of threads increases, the latency of every other routine goes up, while only 8BV's goes down!

But again, I don't have access to a wider variety of machines right now, so please post your performance measurements so that the general trend becomes clearer.

Update:
On a dual-core, single-thread Core2Duo 2.4GHz:

1 thread,   4B,   ~50 nanoseconds per cycle
4 threads,  4B,   ~92 nanoseconds per cycle

1 thread,   8B,   ~61 nanoseconds per cycle
4 threads,  8B,   ~85 nanoseconds per cycle

1 thread,   8BV,   ~61 nanoseconds per cycle
4 threads,  8BV,   ~102 nanoseconds per cycle

OK now I am confused... need more data... tongue

#4 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-18 06:14:57

I have implemented three versions of the lock-free algorithm.

The first version is the one I described above, which uses cmpxchg8b:
In the best scenario each Get uses 23 instructions and each Put uses 18 instructions;
instructions in the "locked" sections are 7 and 0, respectively.

The second version does not use cmpxchg8b (and is still ABA-safe):
In the best scenario each Get uses 17 instructions and each Put uses 12 instructions;
instructions in the "locked" sections are 10 and 6, respectively.

The third version uses a version field, instead of my algorithm, and uses cmpxchg8b:
In the best scenario each Get uses 17 instructions and each Put uses 16 instructions;
instructions in the "locked" sections are 0 and 0, respectively.

I am quite interested to know which one runs faster.
But I don't have access to machines with high core counts right now... sad
Would anyone be interested in testing out the code for me?

#5 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-14 05:13:08

andrewdynamo wrote:

About the ABA problem: I think you're right. Not easy to fix it: critical section or larger lock (FRecursive) will make it a lot slower. However, it is a lock on a block so less frequent (not every mem alloc/free).
Version check should also be possible: a little more mem overhead on a block (no problem) and block should be 8 byte aligned for a cmpxchg8b.
I will see what's the best and/or easiest :-).

I think I came up with an algorithm that doesn't need a version field for each pointer.

The old algorithm does the following:

0. Head: [Block] ----> [Block] ----> [Next]
1. Retrieve [Block] from Head
2. Retrieve [Next] from [Block]
3. CAS on Head, compare against [Block] and store [Next] if match

The new algorithm does the following:

0. Head [Block:Next] ----> [Block] ----> [Next]
1. Retrieve [Next] from Head, if [Next] = INFINITE, spin-wait-then-yield
2. CMPXCHG8B on Head, compare against [Block:Next] and store [Next:INFINITE]
Now we have [Block:Next] as the return value; the [Block] is what we need, but we must do the next step to maintain the loop invariant:
3. Retrieve [NewNext] from [Next] and replace INFINITE in Head, which becomes [Next:NewNext]

The new algorithm is free of the A-B-A problem because the CAS is performed on both the [Block] and [Next] pointers: even if ABA happens, as long as both pointers are the same, there is no error.

However, the new algorithm is slightly inefficient in that, after a successful step 2, no thread can continue until step 3 is performed. This "locked" region is only 5-7 instructions long, though, and should not incur a heavy penalty.

The Put(NewBlock) method is also straightforward to implement:

0. Head [Block:Next] ----> [Block] ----> [Next]
1. Retrieve [Next] from Head, if [Next] = INFINITE, spin-wait-then-yield
2. CMPXCHG8B on Head, compare against [Block:Next] and store [NewBlock:Block]

There is no "locked" region in the Put() method either.

#6 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-10 15:06:12

Yes, the entire purpose of using CAS is to write lock-free code, which scales well as the number of cores increases. Using a locking mechanism of any kind (CS, mutex, semaphore, etc.) defeats the purpose of using CAS.

#7 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-09 22:51:52

I read the ScaleMM code where CAS1 is used, and I think I found a suspicious segment
in TGlobalMemManager.GetBlockMemory():

...
      prevmem := bl.FFirstFreedMemBlock;
      if prevmem = nil then
        Continue;
      nextmem := prevmem.FNextFreedMemBlock;
      if CAS0(prevmem, nextmem, bl.FFirstFreedMemBlock) then
...

This code segment can suffer from the A-B-A problem. That is:

1. These lines are executed:
      prevmem := bl.FFirstFreedMemBlock;
      if prevmem = nil then
        Continue;
      nextmem := prevmem.FNextFreedMemBlock;

2. Before the CAS line is executed, the thread is swapped off.

Then the content of bl.FFirstFreedMemBlock is changed, and then changed back to the same value. (For example, another thread grabs block A, frees it, and then grabs several blocks and makes A the head of the chain again.)

Note that, although bl.FFirstFreedMemBlock is the same, bl.FFirstFreedMemBlock.FNextFreedMemBlock may not be the same anymore!

3. The CAS is executed, and because it only looks at the pointer value, which is the same, the CAS passes. But now the pointer chain is incorrect!
In lucky cases you will lose track of some memory blocks and the program continues without error; in unlucky cases the pointer chain forms a loop and the program deadlocks.

A solution would be to append a version field to each pointer, where the version monotonically increases every time the corresponding pointer is recycled. To do this, you will need to use cmpxchg8b, which performs a CAS on 8 aligned bytes in one go.

-------------
Adding a sleep(1) helps to hide the ABA problem, but sleep(0) and yieldtothread() do not. The reason is that sleep(1) forces the thread to give up its current time slice, so that when the thread wakes up it has a whole new time slice. It is then very likely to execute the entire loop once completely, without being swapped off in the middle, which hides the ABA problem.

However, it is not the right fix, because the very first attempt is executed without the whole-time-slice guarantee. If the ABA problem manifests on the very first try, the memory will still be corrupted.

(Adding a sleep(1) before the first attempt would probably completely hide the problem, but it would bring performance down to hell... tongue)

#8 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-09 05:59:12

So, the problem is that TThreadMemManager.AddFreedMemFromOtherThread() causes a deadlock if you don't sleep(1), am I correct?

BTW, one lesson I learned debugging my own multithreaded code is that it is easier to spot a race condition on *slower* multi-processor (multi-core) machines.

I have discovered several race conditions in my code since I started using an Intel Atom. big_smile
That code was tested on faster Core2 processors for at least 100 runs, and the races just *never* showed up...

#9 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-08 23:43:24

If you must use sleep(1) to "avoid" some race condition, it is *not* the right fix; you are just hiding the bug on your machine, and it is bound to show up on some other machine.

One more thing: please don't tuck SwitchToThread() or even Sleep() into a benignly named "CAS" function; it will lead to a maintenance nightmare.
If you must do it, name the function CAS_AND_SLEEP_ON_FAIL().

#10 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-08 07:36:54

andrewdynamo wrote:

I do not think this is true: if what you say is true, I would not need a Sleep() on a "simple" CAS instruction, because it would be completed (very) fast.
It seems however it can be interrupted, so it keeps a lock quite long: I got a race condition, it burned my CPU and I had to wait forever.

It must be a different problem.

Basically, what the LOCK prefix on an instruction does is assert a hold signal on the shared memory bus, so that no other core can perform a memory operation while the locked instruction is executing. (More recent architectures such as the Core i series may do it a bit differently, but more or less along the same lines.) It has nothing to do with context switches, and vice versa. You can refer to the Intel ISA and programming manuals to verify what I said. tongue

In fact, think about it: user programs must not have any means, other than those provided by the OS, to affect scheduling. Otherwise it would be a serious security flaw: any program could use the LOCK prefix to gain an unfair scheduling advantage over other programs...

#11 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-07 19:23:20

ab wrote:

If you have only one or two threads in your application, it could make sense no to call SwitchToThread.
But if you have 100 threads, SwitchToThread could be a good idea. Because you don't have 100 cores CPUs yet. wink

Actually, the number of threads does not matter that much, at least in real life. Even if there are 100 threads, if 90 of them are not requesting memory at the same time, then there are only 10 competing threads, not 100. Moreover, what matters is just the single thread that holds your lock; all the other threads can be ignored.

The question boils down to the following: if thread A, which holds the lock, is executing and will be done sooner than a context switch would take, then it is better to spin-wait; if A has been switched off, or will take a long time to finish, then it is better to back off. Since user programs have no control over context switches, the problem is probabilistic.

This resembles the classic "rent or buy" problem: consider each spin a "rent" and a context switch/sleep a "buy". I think the optimal answer is neither "always rent" nor "always buy", but to rent a few times and then buy.

andrewdynamo wrote:

I thought (or hoped) a LOCK would not be interupted by a context switch, but it does...

Context switch and LOCK are two different things.

LOCK ensures that the operation of an instruction is atomic, i.e. no other core can do anything that interferes with this operation. LOCK only applies to individual instructions, and no context switch can happen inside an instruction, locked or not.

A context switch happens on an individual core and does not concern the other cores; the absence of a context switch does not mean atomicity (another core can fiddle with your memory at any time).

#12 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-06 23:47:29

The reason back-off outperforms pure spinning is that when the thread holding the lock cannot get the CPU, the spinning makes no progress and the time is wasted. However, a context switch is not cheap either.

On a many-core processor it becomes increasingly likely that the thread holding the lock is actually executing, and will be finished with it very soon. Sleeping immediately on the first failed attempt can also incur an unnecessary performance penalty.

So I think it makes sense to first hot-spin for a while, and then switch to another thread. (I didn't invent this idea; I first saw it when reading the Windows API several years ago: InitializeCriticalSectionAndSpinCount.)

#13 Re: Delphi » Dll hell, WinSXS monster, and Delphi paradise » 2010-12-06 23:14:26

WinSxS is a monster, and you cannot avoid it. tongue
That is because Windows uses it for a lot of internal libraries, such as the common controls library, which you have no license to redistribute.

Anyway, I am just commenting to share a probably not-so-well-known quirk (API bug) of WinSxS. I encountered it when I tried to programmatically switch activation contexts to load a specific version of the common controls library.

The background (skip this paragraph if you don't care):
I wrote a plug-in DLL for a third-party exe file which is written in another language. I wanted to use an XP-style common control, but the main executable is not bundled with the manifest.
So I decided to bundle the manifest in my DLL, and programmatically create an SxS activation context from the bundled DLL before trying to load commctls.

The quirk:
The CreateActCtx API has two modes of operation: in the first mode it takes an external manifest XML file, and in the second, a resource from a loaded module.

On XP and Vista, the second mode has a bug: although the file name parameter should not be used, you cannot pass null; you have to pass some valid memory address. Otherwise the API fails.

On Win7 this bug is silently fixed, and in a totally incompatible way: now you must pass null for the file name when you use a module resource, otherwise the API fails. So the quirky workaround I had fails again on Win7, and I have to resort to checking the system version and activating the workaround accordingly...

I guess they thought nobody ever used SxS, and that it is mostly just for them to make their own stuff work properly... tongue


One more note: so far, this bug persists across service packs: XP SP1/2/3 and Vista SP1 all have it, which is GOOD. I sincerely HOPE they don't "fix" this thing in the next Vista SP, otherwise I will have to add even more complicated SP detection to make my program work!

#14 Re: Language » Compiler enhancement proposal: threadlocalvar » 2010-08-04 04:26:41

> And it'll be less expensive than a LOCK in multi-thread applications
I doubt that. Currently string assignment is by reference only; a full copy would involve many times more instruction executions than a LOCK.

Also, for threadlocal-to-threadlocal assignment, yes, it will be faster. But since threadlocal and global variables must be differentiated and treated differently, its introduction will cause all existing code to run slower; maybe just by a tiny bit, but that includes almost everything...

#15 Re: Enhanced System Run Time Library » Enhanced System Run Time for Delphi 7 and Delphi 2007 » 2010-08-04 04:23:08

If you recompile system.pas, then you will need to recompile every single source file in the runtime to make it match the modified system.pas.
Every one of them.

#16 Re: Language » Compiler enhancement proposal: threadlocalvar » 2010-08-01 07:21:46

So, if I understand it correctly, your threadlocalvar sounds quite like a stack variable, except that it also covers managed types, which are then automatically redirected to the thread-local version of the memory manager.

But, as you said, it has some confusing usage syntax and semantics.

There will be assignments between threadlocalvars and "normal" variables.
(You cannot forbid all local-nonlocal assignments, because the computation performed using the threadlocalvar has to be returned as a result, which is not a threadlocalvar.)
Such an assignment would be not only expensive (full copying is required) but also cumbersome.

For example, what should happen when a threadlocalvar instance of a class is assigned to a normal class reference?
Because the class instance itself may contain references to other threadlocalvar types/class instances, a simple memory copy of the instance may not be enough.
Then every class must implement a compiler-magic "threadlocalcopy()" method to perform deep copying.

Another possible solution is to forbid assignment between "complex" threadlocalvar types and normal types, such as class references.
But that may lead to confusion, and to extra complexity in the compiler.

Also, what happens if someone typecasts a complex threadlocalvar type to a pointer or integer and tries to pass it around?

#17 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-07-27 17:21:30

Yes, stack-based memory allocation is definitely faster, agreed. big_smile

But as I said, extra care is needed on the programmer's side, especially when writing a library or calling external libraries, because you can't control the stack when you don't have execution control (i.e. the external caller may leave you only a tiny stack, or the external callee may demand more stack space than you have left for it).

And sometimes a stack overflow exhibits totally weird behavior that is hard to detect or debug.

> If a block memory is used outside the thread which created it, a copy will be made.

Who is going to make the copy?

If the user thread does the copy, then locking is needed, since the owner thread may operate on the memory during the copy, leading to a race condition. (RCU works only if ALL write accesses are first made in a separate copy, so in order to make your scheme an RCU, even the owner thread would need to make a copy of its memory block before doing any update; this is one of the cases I mentioned where the writer is penalized for faster reads.)

If the owner thread does the copy, then the user thread needs to provide synchronized access to its memory manager, most likely via some kind of locking. (Say a memory block owned by thread A is going to be passed to thread B; then thread A needs to ask thread B's memory manager for a chunk of memory before A can perform the copy...)

#18 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-07-24 01:18:39

I agree that all of this is possible and can be done.
But I am a bit worried that doing all of this puts too many constraints on the programming style, to the point of being counterproductive.

For example, exchange of data between threads is not common, agreed, but the problem is that it does happen, and you won't know when.
So, in order to correctly handle those 0.1% of cases that happen randomly, you have to be prepared all the time; that is, the alloc/free list operations must always be synchronized.

There are many existing optimizations, such as the RCU used in the Linux kernel, but nearly all of them speed up only the readers; some even penalize the writers in order to optimize reading.
But most of the cases you argue for in the first post are actually writes: string assignment, freeing memory blocks... I doubt such optimizations can help.

But yes, there is an alternative: just change the existing programming model and avoid the need for locks altogether.
As suggested in UltimX: whenever you need something local, just use the stack.

But everything comes at some cost, right?
1. The stack space is quite limited, and may not be extendable, so the programmer needs to be extra careful not to overflow it.
2. The local heap is basically a chunk of memory taken out of the middle of the otherwise continuous address space. So if you take too many, the address space gets fragmented.
A 32-bit program has only 2GB of address space, into which it must fit the executable plus all libraries, and the stacks and heaps for all threads.
So it is actually quite easy to get fragmented, and what happens then is that you will fail to allocate a relatively big chunk of memory (say 300MB), even though much more seems to be available (say 600MB), because it is not in a single piece (e.g. 3x200MB)... tongue

------
I've seen some nice hardware solutions. I have programmed on an OCTEON networking processor, which is designed to handle packets at extremely high speed. On board are special hardware assists to avoid locking, such as a programmable "thread" scheduler and a hardware memory manager. But again, there is a cost: the memory manager can only give you fixed-size blocks, so some wastage is unavoidable...

#19 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-07-23 01:43:32

True, local heaps can avoid locking, but other problems are introduced:
1. Some applications use massive multi-threading; if each thread opens a local heap, the address space may get very fragmented. (When Delphi eventually moves to 64-bit, we probably won't have to worry about that anymore... tongue)
2. Each heap would need its own instance of the memory manager, and then there will be interop problems. If a piece of memory is passed between threads, it has to be somehow "linked" to the memory manager instance that allocated it, otherwise there will be problems freeing it. And the free operation also needs to be synchronized, probably using LOCK, maybe RCU; but I think that is too heavyweight for memory management.

#20 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-07-22 17:26:05

Well, to achieve thread synchronization and avoid race conditions, some kind of atomic operation is needed. And I think a LOCKed instruction is the finest-grained atomic operation you will get. So I don't think it can be avoided; maybe its usage can be reduced, but eventually you will have to use one.

Eliminating LOCKed instructions is not something a compiler can do. Architectural (hardware) changes are needed, for example transactional memory.
