" When can we test drive it? "
I have a working version, but it is not ready yet.
It works for normal threads, but inter-thread memory is still a problem.
So I'm trying a simple locking (interlocked) approach for now, although I would like something
without any LOCK... Anyway, it works, but thread memory is not reused or freed entirely,
so it leaks memory, etc. It is also not optimized yet.
I hope I can post a preliminary version tomorrow if you are interested, but be aware that it is a draft: it needs code
cleanup, refactoring and some finishing touches.
Offline
" You use some div 32 or div 256: you'd better use shl 5 and shl 8, which will be faster when working with integer/NativeInt. "
" For the "div 2", replace it with "shl 1". Or use cardinals (the compiler will replace "aCardinal div 2" by "aCardinal shl 1", but it won't do it for an integer, because it must check if the integer is not <0).
For unsigned integers, div 2=shl 1, div 4=shl 2, div 8=shl 3, and so on... "
???
1) Ain't shift-left a multiplication, while shift-right is a division? 0100 = 4, 0010 = 2
2) AFAIR in 80386 assembler there are three instructions:
SHL - multiplication, the sign bit is lost (shifted out, signalled via overflow)
SHR - unsigned divizion, zero-padded
SAR - signed division, sign-padded
So would the compiler refuse to optimize an integer div 2?
Offline
You are right, this was a typo.
Of course, division by two is a binary shr (not shl).
The point of this remark was that the Delphi compiler will make the optimization for unsigned integers, but NOT for signed integers. If you are sure that the integer is in fact unsigned, you can safely use shr.
This is a limitation of the Delphi compiler.
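To illustrate (a minimal sketch with hypothetical helper functions, not framework code):

function HalfUnsigned(c: Cardinal): Cardinal;
begin
  Result := c div 2;  // the compiler can emit a single "shr 1" here
end;

function HalfSigned(i: Integer): Integer;
begin
  Result := i div 2;  // needs fix-up code: -3 div 2 = -1, but a bare "sar 1" would give -2
end;

function HalfKnownNonNegative(i: Integer): Integer;
begin
  Result := i shr 1;  // fast, but only correct if you are sure that i >= 0
end;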
Online
i would not go searching for Turbo Pascal 5.5 or 7, and Virtual Pascal proves nothing here, but i feel like i saw the SAR opcode in generated code.
Anyway, given that the RTL/VCL strongly dislikes Cardinal and such, using SHR but not SAR must be an EXTREMELY lame limitation.
Is there a QC report on this?
Last edited by Arioch (2011-11-17 20:57:30)
Offline
In practice, today's CPUs are able to identify a division by a power of two, then change the micro-opcodes executed (in modern CISC architectures, the x86/x64 asm is not executed directly, but is first processed on the fly and converted into RISC-like opcodes).
So changing the compiler won't make a big difference.
Just another fact: until recently, the "best practice" for speed was unrolling loops.
But on newer CPUs, due to the very restricted L1 cache size and auto-pipelining features, optimized rolled loops are faster than plain unrolled loops. Take a look at our SynCrypto.pas unit: the rolled version I wrote in asm is faster than an unrolled one.
The main performance gain does not come from the compiler (unless you have some very specific work to do), but mainly from the algorithm.
Today's CPUs have the potential to be damn fast.
Poorly designed software and multiple architecture layers make Wirth's law true:
"Software is getting slower more rapidly than hardware becomes faster."
Online
That's inconsistent.
A year ago (when all modern x86 architectures except Bulldozer were already in place) you suggested replacing div with shr.
Now you say it is irrelevant.
And i still have a notebook with a Pentium M processor which, if i remember right, is rooted in the old Pentium Pro :-)
fixed typos
Last edited by Arioch (2011-11-23 09:33:08)
Offline
It is not irrelevant - it is sometimes not mandatory.
But as you wrote, if you want to run faster on older processors, shl/shr is still a good idea.
And it won't be slower on more modern CPUs.
Online
I have no access to Delphi 5 now, but at least i managed to find the respective line in XE2-produced 32-bit code.
TPoint is made of signed integers, isn't it?
So that's not just inconsistent but plain wrong :-)
Perhaps you've just forgotten {$O+} in the DEBUG profile?
Mmmm, in XE2 it still uses SAR even in $O- mode. If i don't forget, i should re-check in D5 in spring. I just can't believe such an inconsistency is a primary feature, rather than a bug in some particular compiler build.
Last edited by Arioch (2011-11-23 09:31:07)
Offline
SHR - unsigned divizion, zero-padded
SAR - signed division, sign-padded
Looking retrospectively, i like that shortcut typo of mine. :-)
"Unzigned" would be even better though. Missed chance for perfection.
Offline
You are right: Delphi 5-6-7 will create those sar opcodes AFAIR.
BUT the branch code (to handle CF in case of a negative number) will make it slower than a pascal SHR - if you are sure that your integer value is positive (like in for i := 0 to Count-1 do ...).
And... it is possible that on the latest CPUs, plain mov ecx,2 + div ecx opcodes could be faster than sar eax,1 + jns .. + adc eax,0.
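For reference, a commented sketch of that sequence (value in eax), assuming the classic Delphi 5-7 code generation:

// what a signed "x div 2" compiles to, as a BASM fragment:
asm
  sar eax, 1    // arithmetic shift right: CF receives the bit shifted out
  jns @@Done    // non-negative result: sar alone already matches div
  adc eax, 0    // negative result: add CF back, turning the shift's
@@Done:         // round-toward-minus-infinity into div's round-toward-zero
end;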
Online
> to handle CF in case of negative number
That would matter for SHL - "*2" - and only if overflow checking is on.
In case of "div 2", CF would be the remainder bit. IOW it only needs checking for "mod 2", which would probably be compiled as "and 1" anyway. Using "shr 1" for "mod 2", or for Odd(x)/Even(x), seems to be too clever a trick for DCC.
Truly, i can't think of a single practical use of CF after a shift that makes any difference between signed and unsigned.
Also i ain't sure that CF is used as input at all for opcodes other than RCR/RCL. Just do not remember.
http://en.wikibooks.org/wiki/X86_Assemb … and_Rotate - this says it is not, but i do not know how authoritative this source is. I remember RCL well but don't remember SCL at all. Though by logic it should exist, for 64-bit integers in 32-bit mode and 32-bit integers in 16-bit mode.
Offline
Take a look at the generated asm, and step into it with a negative number.
I suspect you'll find that you do step into the adc ecx,0 opcode.
And that it makes a difference.
Online
No, really, this is getting weird. Is it dependent on the call site?
Integer, with both {$O-} and {$O+}:
fmMain.pas.1216: x := -3;
0075EFC3 C745F4FDFFFFFF mov [ebp-$0c],$fffffffd
fmMain.pas.1217: x := x div 2;
0075EFCA B902000000 mov ecx,$00000002
0075EFCF 8B45F4 mov eax,[ebp-$0c]
0075EFD2 99 cdq
0075EFD3 F7F9 idiv ecx
0075EFD5 8945F4 mov [ebp-$0c],eax
Cardinal:
fmMain.pas.1214: x := +3;
0075F013 C745F403000000 mov [ebp-$0c],$00000003
fmMain.pas.1215: x := x div 2;
0075F01A B902000000 mov ecx,$00000002
0075F01F 8B45F4 mov eax,[ebp-$0c]
0075F022 33D2 xor edx,edx
0075F024 F7F1 div ecx
0075F026 8945F4 mov [ebp-$0c],eax
fmMain.pas.1217: x := +3;
0075F029 C745F403000000 mov [ebp-$0c],$00000003
x: cardinal; y: integer;
fmMain.pas.1214: x := +3;
0075EFDA C745F403000000 mov [ebp-$0c],$00000003
fmMain.pas.1215: x := x shr 2;
0075EFE1 C16DF402 shr dword ptr [ebp-$0c],$02
fmMain.pas.1217: y := -3;
0075EFE5 C745F0FDFFFFFF mov [ebp-$10],$fffffffd
fmMain.pas.1218: y := y shr 2;
0075EFEC C16DF002 shr dword ptr [ebp-$10],$02
fmMain.pas.1221: x := x + y;
0075EFF0 8B45F0 mov eax,[ebp-$10]
0075EFF3 0145F4 add [ebp-$0c],eax
Which is arguably wrong on signed values.
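To spell out why, a minimal console demonstration:

program ShrIsUnsigned;
{$APPTYPE CONSOLE}
var
  y: Integer;
begin
  y := -3;       // $FFFFFFFD
  y := y shr 2;  // shr shifts in zero bits: y becomes $3FFFFFFF
  WriteLn(y);    // prints 1073741823 - while -3 div 4 = 0, and sar would give -1
end.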
PS Since you made a speed-optimized RTL, wanna have a bit of thrill and awe?
Like "a bit of functional programming does not hurt, in any quantity"?
https://forums.embarcadero.com/thread.j … eID=414610
Offline
Talking about performance, how would you like this? OK, it is a legacy compatibility bridge, but still: pure insanity...
Main.pas.49: lblJvIPAddressValuesAddress.Caption := IntToStr(AddressValues.Address);
00563B20 8B45F8 mov eax,[ebp-$08]
00563B23 8945E0 mov [ebp-$20],eax
00563B26 8B45E0 mov eax,[ebp-$20]
00563B29 8945F4 mov [ebp-$0c],eax
00563B2C 8B45F4 mov eax,[ebp-$0c]
...
Main.pas.50: lblJvIPAddressValuesValue1.Caption := IntToStr(AddressValues.Value1);
00563B52 8B45F8 mov eax,[ebp-$08]
00563B55 8945DC mov [ebp-$24],eax
00563B58 8B45DC mov eax,[ebp-$24]
00563B5B 8945F0 mov [ebp-$10],eax
00563B5E 8B45F0 mov eax,[ebp-$10]
00563B61 8945D4 mov [ebp-$2c],eax
00563B64 8B45D4 mov eax,[ebp-$2c]
00563B67 BA04000000 mov edx,$00000004
....
Offline
Hi Folks,
I've created a little test case showing the difference between the standard string type and a self-made TStringBuilder class (internally using a PChar, not a string). I've found that ShortString doesn't like multi-core either, or at least not in this specific case... I've not investigated the CPU view to discover which RTL functions the ShortString test case is using, so I don't know yet why ShortString is even slower than the standard string type. My test is here: http://alexandrecmachado.blogspot.com/2 … ation.html
Any thoughts?
Regards
Last edited by Alex7691 (2012-02-07 18:44:42)
Offline
What is slow in string concatenation is not the allocation, but the internal copy of the content.
That's why ShortString is not much faster than a plain string, even though it should scale better.
We use a kind of TStringBuilder in the framework, because it is indeed much faster than string concatenation.
But our classes are much more complete than TStringBuilder...
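The core idea, as a minimal sketch (a hypothetical class, far simpler than the real thing): grow one buffer geometrically, so each append is a plain Move instead of a reallocation plus a full copy of everything written so far.

type
  TSimpleBuilder = class
  private
    FBuf: array of Char;
    FLen: Integer;
  public
    procedure Append(const S: string);
    function AsString: string;
  end;

procedure TSimpleBuilder.Append(const S: string);
var
  N: Integer;
begin
  N := Length(S);
  if N = 0 then
    Exit;
  if FLen + N > Length(FBuf) then
    SetLength(FBuf, (FLen + N) * 2);  // amortized growth: few reallocations
  Move(PChar(S)^, FBuf[FLen], N * SizeOf(Char));
  Inc(FLen, N);
end;

function TSimpleBuilder.AsString: string;
begin
  if FLen = 0 then
    Result := ''
  else
    SetString(Result, PChar(@FBuf[0]), FLen);  // one final allocation + copy
end;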
Online
Yes, I've noticed. I will narrow down my investigation using that little test. I didn't test with many threads/cores, but using only two cores ShortString didn't perform better than string - not only in time, but in CPU utilization too.
Regards
Offline
Hi,
any way to use SynScaleMM on XE2?
Offline
I used the google code of ScaleMM2, and it gives me exceptions at the end of the execution. But threading is clearly faster.
Offline
I used the google code of ScaleMM2, and it gives me exceptions at the end of the execution. But threading is clearly faster.
Did you use the latest version from Subversion, from the version2 branch?
http://code.google.com/p/scalemm/source … 2Fversion2
(some bugs were fixed lately)
And do you have some demo code/app to reproduce it?
Which version - Delphi XE2, 64-bit?
Btw: better post an issue on scalemm project page:
http://code.google.com/p/scalemm/issues/list
Offline
Btw: better post an issue on scalemm project page:
http://code.google.com/p/scalemm/issues/list
Indeed!
ScaleMM2 is much more advanced than our SynScaleMM, which is a fork of ScaleMM version 1.
Online
Most of this is pretty deep, and I don't pretend to get it all, but I think I have the main pitfalls to avoid. Looking for clarification/confirmation.
For a Delphi 7 environment, is this the correct summary of what to do to ensure we don't have issues with my binaries running on multiple-processor/multiple-core systems?
- Apply the Enhanced Runtime Library (replace with the appropriate System.dcu & SysInit.dcu)
- Replace the memory manager with ScaleMM2 or equivalent - does FastMM4 have this issue as well?
- Replace TThread with my own class that ties directly to the Win32 API for creating threads
- In threads, avoid String (or more specifically, use ShortString)
I am sure there's tons more, but from what I can gather, these are the main tasks.
I don't work too often in Delphi anymore, but have recently run into performance issues on a 32-core system that appear to come from a ton of contention within the threading; and I can't help but think it's directly tied to the points you are making here.
Thanks in advance for the help.
PS - Any other points for Delphi 7 as they relate to this topic would be appreciated.
Offline
Replacing TThread is only worth it for a very specific purpose with NO memory allocation (or pre-process memory allocation), and if you know it may be the only thread in the project, to avoid IsMultiThread=true, which slows down FastMM4.
For common thread use, and especially in big projects, it is much better to rely on the existing TThread, which is a thin layer over the API.
FastMM4 has the locking issue. ScaleMM2 scales better, but also consumes more memory.
Avoiding allocation and re-allocation of data (e.g. strings) is always worth it.
Adding "const" to string or dynamic array parameters is also a good habit.
Instead of using shortstring (I did not write about it in the blog article, on purpose), which is deprecated and not Unicode-ready, I would rather use a dedicated text-append class (like our TTextWriter), or a static stack-allocated buffer (like tmp: array[byte] of char). You would have to deal with pointers, and/or use dedicated functions, e.g. those available in our SynCommons.pas.
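As a minimal sketch of the stack-buffer idea (a hypothetical helper, with no overflow check, which real code must add):

procedure BuildGreeting(const Name: string; out Line: string);
var
  tmp: array[byte] of char;  // 256 chars on the stack: no heap allocation
  n: integer;

  procedure Append(const S: string);
  begin
    // real code must first check that n + Length(S) fits in tmp
    Move(PChar(S)^, tmp[n], Length(S) * SizeOf(char));
    Inc(n, Length(S));
  end;

begin
  n := 0;
  Append('Hello ');
  Append(Name);
  SetString(Line, PChar(@tmp[0]), n);  // a single allocation, at the very end
end;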
Online
Thanks for the feedback.
I hope I am on the right track with regard to gaining an overall improvement in performance.
Here is what I've done thus far ..
On our Delphi 7 build server:
- Patched to D7 update 1
- Downloaded the precompiled System.dcu and SysInit.dcu, and placed them over the existing ones
- Put ScaleMM2 as the preferred memory manager
I've done a build, and local tests, and everything appears to be just fine.
I've released this to a VM with 8 cores, and can see a significant performance gain.
When I take this same code to a much more powerful system (4 physical CPUs, 8 cores each --> 32 cores in total), I see a decrease in performance (it got worse, by a very large margin).
I am running 32 TThreads. These threads are pretty lightweight, but each represents a logical/isolated task (with a TDOAConnection on each).
What I'm seeing on our internal VM is: if the CPU shows 80-90%, the kernel makes up about half of that.
What I'm seeing on our more powerful, non-VM system is: if the CPU shows 80-90%, the kernel makes up about 95+% of that.
On this 32-core system, it is actually significantly faster (almost 2-3x) running 16 threads than running 32 threads. I see no other contention on the DB or anything else. It appears to be entirely within the Delphi portion.
Our threads each have their own queue of work, which they pull from under protection and which the main thread adds to under protection (protection = a TCriticalSection descendant with a private padding member: FDummy: Array[0..95] Of Byte).
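In code, the wrapper is essentially this (a sketch; the class name is illustrative, assuming SyncObjs.TCriticalSection). The unused field inflates each instance beyond a 64-byte cache line, so two critical sections never share a line and cores stop invalidating each other's caches on every lock operation:

uses
  SyncObjs;

type
  TFixedCriticalSection = class(TCriticalSection)
  private
    FDummy: array[0..95] of Byte;  // padding only - never read or written
  end;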
Am I right in thinking that kernel time = bad, and most likely related to the locking referenced here?
I appreciate the insights.
Offline
Hello,
Has anyone tested how a multithreaded program performs if the SetThreadIdealProcessor WinAPI is used?
It sounds good on paper: allocate cores evenly, but let the system do otherwise if it knows better... In a real-life scenario, when the program is not alone in the system, this could be good (IF someone else is also hogging the resources...).
I've never tested it myself, but I made a mental note once that I should some day. If someone has a good system to compare on - no thread affinity set, threads pinned to a single core, and SetThreadIdealProcessor - what kind of difference do they produce? (Some trivial demo might be too sterile for this)...
-Tee-
Offline
Has anyone tested how multithreaded program will perform if the SetThreadIdealProcessor WinAPI is used.
I've used it, and it works pretty well, for instance when network processing is confined to some cores and file access (DB) activity to others.
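A minimal usage sketch (assuming the declaration in the Windows unit; the API only hints the scheduler, unlike SetThreadAffinityMask, which forbids all other cores):

uses
  Windows, SysUtils;

procedure PreferCpuForCurrentThread(CpuIndex: DWORD);
begin
  // SetThreadIdealProcessor returns the previous ideal CPU, or DWORD(-1) on failure
  if SetThreadIdealProcessor(GetCurrentThread, CpuIndex) = DWORD(-1) then
    RaiseLastOSError;
end;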
You can get an idea at the beginning of the file http://sourceforge.net/p/xmlrad/code/ci … System.pas
...........
: :
: V
...HTTP service ...DispatchThread
:..Mux RelaySMTP :..RequestQueueThread
:..Mux SMTP :
:..Mux XMPP :
: :
----------- -----------
| CPU#1 | QPI | CPU#2 |
DDR3...| 8xCores | < --- > | 8xCores |...DDR3
8x16GB | 16xSMT | | 16xSMT | 8x16GB
----------- -----------
QPI | | QPI
----------- -----------
| | QPI | |
| Chipset#1 | < --- > | Chipset#2 |
| | | |
----------- -----------
:..LAN NIC#1#2 :..PCIe#4 Flash Z-Drive R5 Tier0: 4x600GB
: NIC#3#4 :..PCIe#5 Flash Z-Drive R5
: :..PCIe#6 Flash Z-Drive R5
: :..PCIe#7 Flash Z-Drive R5
:..PCIe#1 (9261-8i)
: :..1x SSD X25E 64GB SLC System (Disk Writes intensive IOps, ie: persistent async queues)
: :..1x SSD M4 512GB MLC Trace
: :..2x SSD M4 512GB MLC Tier3 Eden (2x500GB)
: :..4x HDD 1TB Tier4 (RAID5 on every 4 drives, 4x1TBx5=15TB PCIe#1+PCIe#2+PCIe#3)
:..PCIe#2 (9261-8i)
: :..8x HDD 1TB Tier4
:..PCIe#3 (9261-8i)
: :..8x HDD 1TB Tier4
:..SAS2 Expander
:..48x HDD 1TB Tier4
Main ideas of UltimX:
- avoid unnecessary memory allocations when moving data from a source buffer to the final buffer
- avoid intermediate string copies
- memory buffers are maintained on the stack when possible
- memory allocations are maintained outside the scope of variables,
so variables no longer need to reserve physical memory,
and can rely on pre-allocated space to move data from source to destination
- when memory allocations are needed and are local to the scope of the procedure stack,
the programmer should prefer the CurrentScratch heap with XScratchStack/Unstack:
the memory blocks live for the scope of the procedure (ephemeral allocation),
all these local memory allocations are collected at each XScratchUnstack,
and when the thread terminates, the whole CurrentScratch heap is collected at once
- when memory allocations need to survive beyond the scope of the current procedure stack,
the programmer should use the heap of the current request (XMLRequest.Heap):
the memory blocks live for the duration of the request (ephemeral allocation),
and all these local memory allocations are collected at once when the request is released
- at last, when there is no alternative but to perform a global memory allocation,
such an allocation should be performed using the global heap (with auto-detection of the NUMA node)
There is a lot of very good code in this project, especially regarding the multi-threading abilities of Delphi.
But, just like mORMot, it uses a lot of dedicated code (e.g. for string work) rather than the standard RTL, in order to scale on multi-core.
Take a look at TXThread object: it handles thread affinity.
Online
As people have pointed out, performance - especially multi-threaded performance - depends on the memory manager, and especially on how it handles locks.
FastMM4 (and the default Delphi built-in memory manager) is designed in such a way that, by default, on thread contention - when one thread cannot acquire access to data locked by another thread - it calls the Windows API function Sleep(0), and then, if the lock is still not available, enters a loop calling Sleep(1) after each check of the lock.
Each call to Sleep(0) pays the expensive cost of a context switch, which can be 10000+ cycles; it also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles. As for Sleep(1) - besides the costs associated with Sleep(0) - it also delays execution by at least 1 millisecond, ceding control to other threads, and, if there are no threads waiting to be executed by a physical CPU core, puts the core to sleep, effectively reducing CPU usage and power consumption.
That's why CPU usage never reaches 100% in multi-threaded Delphi applications that work with memory very intensively and concurrently - because of the Sleep(1) issued by FastMM4.
This way of acquiring locks can be improved by replacing it with the better methods recommended by Intel in its Developer's Optimization Guide.
A better way is a spin-lock of about 5000 "pause" instructions and, if the lock is still busy, a SwitchToThread() API call. If "pause" is not available (on very old processors with no SSE2 support) or the SwitchToThread() API call is not available (on very old Windows versions, prior to Windows 2000), the best solution is EnterCriticalSection/LeaveCriticalSection, which doesn't have the latency of Sleep(1) and also very effectively cedes control of the CPU core to other threads.
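In Pascal terms the scheme is roughly this (a sketch with a hypothetical Integer lock variable, assuming SwitchToThread is declared in the Windows unit; the real FastMM4-AVX code is in assembler):

uses
  Windows;

procedure AcquireLock(var Lock: Integer);  // 0 = free, 1 = taken
var
  Spin: Integer;
begin
  while InterlockedExchange(Lock, 1) <> 0 do
  begin
    Spin := 5000;
    // wait on plain reads first: no locked bus cycles while the lock is busy
    while (Lock <> 0) and (Spin > 0) do
    begin
      Dec(Spin);
      asm db $F3,$90 end;  // the "pause" instruction, encoded for old assemblers
    end;
    if Lock <> 0 then
      SwitchToThread;  // cede the core without the 1 ms penalty of Sleep(1)
  end;
end;

procedure ReleaseLock(var Lock: Integer);
begin
  InterlockedExchange(Lock, 0);
end;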
I have modified FastMM4, by creating a fork, to use a new approach to waiting for a lock: critical sections instead of Sleep(). With these options, Sleep() will never be used; EnterCriticalSection/LeaveCriticalSection will be used instead. Testing has shown that using critical sections instead of Sleep (which was the FastMM4 default before) provides a significant gain when the number of threads working with the memory manager is the same as or higher than the number of physical cores. The gain is even more evident on computers with multiple physical CPUs and Non-Uniform Memory Access (NUMA). I have implemented compile-time options that replace the original FastMM4 approach of Sleep(InitialSleepTime) and then Sleep(AdditionalSleepTime) (or Sleep(0) and Sleep(1)) with EnterCriticalSection/LeaveCriticalSection, to save the valuable CPU cycles wasted by Sleep(0) and to improve speed (reduce latency), which was otherwise degraded by at least 1 millisecond on each Sleep(1): critical sections are much more CPU-friendly and have definitely lower latency than Sleep(1).
When these options are enabled, FastMM4-AVX checks:
- whether the CPU supports SSE2, and thus the "pause" instruction, and
- whether the operating system has the SwitchToThread() API call,
and in that case uses a "pause" spin-loop for 5000 iterations and then SwitchToThread() instead of critical sections; if the CPU doesn't have the "pause" instruction or Windows doesn't have the SwitchToThread() API function, it will use EnterCriticalSection/LeaveCriticalSection.
I have made the fork, called FastMM4-AVX, available at https://github.com/maximmasiutin/FastMM4
Here is a comparison of the original FastMM4 version 4.992, with default options, compiled for Win64 by Delphi 10.2 Tokyo (Release with Optimization), against the current FastMM4-AVX branch. Under some scenarios, the FastMM4-AVX branch is more than twice as fast compared to the original FastMM4. The tests were run on two different computers: one with two Xeon E6-2543v2 CPU sockets, each having 6 physical cores (12 logical threads) - with only 5 physical cores per socket enabled for the test application. The other test was done on an i7-7700K CPU.
I used the "Multi-threaded allocate, use and free" and "NexusDB" test cases from the FastCode Challenge Memory Manager test suite, modified to run under 64-bit.
                     Xeon E6-2543v2 2*CPU            i7-7700K CPU
                     (allocated 20 logical           (allocated 8 logical
                     threads, 10 physical            threads, 4 physical
                     cores, NUMA)                    cores)
                      Orig.  AVX-br.   Ratio         Orig.  AVX-br.   Ratio
                     ------  -------  ------        ------  -------  ------
02-threads realloc    96552    59951  62.09%         65213    49471  75.86%
04-threads realloc    97998    39494  40.30%         64402    47714  74.09%
08-threads realloc    98325    33743  34.32%         64796    58754  90.68%
16-threads realloc   116708    45855  39.29%         71457    60173  84.21%
16-threads realloc   116273    45161  38.84%         70722    60293  85.25%
31-threads realloc   122528    53616  43.76%         70939    62962  88.76%
64-threads realloc   137661    54330  39.47%         73696    64824  87.96%
NexusDB 02 threads   122846    90380  73.72%         79479    66153  83.23%
NexusDB 04 threads   122131    53103  43.77%         69183    43001  62.16%
NexusDB 08 threads   124419    40914  32.88%         64977    33609  51.72%
NexusDB 12 threads   181239    55818  30.80%         83983    44658  53.18%
NexusDB 16 threads   135211    62044  43.61%         59917    32463  54.18%
NexusDB 31 threads   134815    48132  33.46%         54686    31184  57.02%
NexusDB 64 threads   187094    57672  30.25%         63089    41955  66.50%
You can find better tests of the memory manager in the FastCode challenge test suite at http://fastcode.sourceforge.net/
Offline
No, it's totally different from NeverSleepOnThreadContention. I've tried NeverSleepOnThreadContention with the FastCode Challenge Memory Manager test suite, and it was worse than the default behaviour. The "pause" instruction and a spin-loop of 5000 iterations, with just normal loads, not locked loads, is the essence. The number (5000) is not mandatory; any other number between 500 and 50000 also works OK.
Here is the code (an excerpt from FastFreeMem, 32-bit assembler):
@LockSmallBlockType:
mov eax, cLockByteLocked
{We are using faster, normal load to not consume the resources and only after it is ready, do once again interlocked exchange}
cmp TSmallBlockType([ebx]).SmallBlockTypeLocked, al
je @PrepareForSpinLoop
lock xchg TSmallBlockType([ebx]).SmallBlockTypeLocked, al
cmp al, cLockByteLocked
jne @GotLockOnSmallBlockType
@PrepareForSpinLoop:
push edx
@LockSmallBlockTypeLoop:
mov edx, 5000
mov eax, cLockByteLocked
@DidntLock:
@NormalLoadLoop:
dec edx
jz @SwitchToThread // for static branch prediction, jump forward means "unlikely"
pause
cmp TSmallBlockType([ebx]).SmallBlockTypeLocked, al
je @NormalLoadLoop // for static branch prediction, jump backwards means "likely"
lock xchg TSmallBlockType([ebx]).SmallBlockTypeLocked, al
cmp al, cLockByteLocked
je @DidntLock
pop edx
jmp @GotLockOnSmallBlockType
@SwitchToThread:
push ebx
push ecx
push esi
push edi
push ebp
call SwitchToThread
pop ebp
pop edi
pop esi
pop ecx
pop ebx
jmp @LockSmallBlockTypeLoop
Last edited by maximmasiutin (2017-07-18 18:57:03)
Offline
Hi!
3. Most locks are performed by the current version of Delphi:
- when you allocate memory;
- when your memory has to grow up or down;
- when your memory is freed;
The latest FastMM is quite good now, it scales well. (We used ScaleMM2, but we had to switch to FastMM because of a particular problem)
- when a string is assigned to another string, which is very common if you use methods/function which returns a string as a result;
- when a char is about to be written in the string, i.e. when a string is about to be modified (implicit UniqueString() call generated by the compiler);
- the same for dynamic arrays...
Do you have any kind of solution for the string/dynamic array locks that does not involve a complete library/program rewrite?
Offline
IIRC the "pause" asm instruction has a performance drop on the latest Skylake CPUs.
See https://news.ycombinator.com/item?id=17336853
Perhaps there are some measurements to make.
Online
Hi ab,
I have an application compiled with Delphi 2007. It is a multi-threaded application, but when it is busy it seems to use only one core, i.e. CPU0 is at 100% while the others still show very low usage, and thus the box (a Windows 2019 Server VM) becomes unresponsive. Do you think this symptom (only one core used at 100%) is also related to the same LOCK issue in Delphi?
Offline
I don't think the LOCK issue would drive a single core to 100% - it would rather make several cores lose a few percent each.
This sounds like an endless loop somewhere in your code.
You may try to debug it and see what's wrong.
And newer processors are less likely to have issues with the asm lock instruction. This article is a bit old now - today's CPUs are not the same as the ones from 2010!
Online
Great to know that recent CPUs have improved on this issue. That sounds right, as my application had been running okay for years in an on-premise VM, until it was moved to an Azure VM and started to show this 100% CPU0, non-responsive issue.
Is the Azure VM unfriendly to (Delphi) multi-threaded applications? Do you know if we need to do any optimization for a Delphi application to run okay on an Azure VM?
Offline
The customer's VM was created with 4 vCPUs in Azure; running wmic CPU Get NumberOfCores,NumberOfLogicalProcessors /Format:List returns
NumberOfCores=2
NumberOfLogicalProcessors=4
Now I can reproduce it on my own test VM, which has half of the customer's cores, i.e. NumberOfCores=1, NumberOfLogicalProcessors=2.
With Process Explorer from the Sysinternals tool set, I can see there are two threads using 100% of CPU0; even if I suspend one of them, the remaining one still uses 100% of CPU0.
Maybe the next step is to see if I can reproduce it on a real PC.
Offline
I can now reproduce the issue on my dev PC, but there is something I don't understand. The new problem is that so far I can only reproduce it if I start the application via Windows Run. After it reaches 100% on CPU0, if I attach the Delphi debugger to it, CPU0 drops immediately to normal and stays there after I let it continue; and once the debugger has been attached, I cannot reproduce it any more - nor when I start the application from the debugger. So I now have to add debug strings/logging to trace it, instead of debugging it inside the debugger.
Offline
It seems to be a problem in the TIdTCPServer component: in its OnExecute event handler, if I use ReadLn to wait for the packet start marker, in some cases (in my case, being hit by multiple packets at the same time with the connection cut in the middle, etc.) it will keep firing the OnExecute event even when there is no connection at all (i.e. after closing the sender applications). Changing it from ReadLn to WaitFor seems to avoid the infinite loop:
conn.IOHandler.ReadLn(mllp_header, -2 {IdTimeoutInfinite}, conn.IOHandler.MaxLineLength); ==> conn.IOHandler.WaitFor(mllp_header);
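For reference, the working shape of the handler is roughly this (a sketch, assuming Indy 10's TIdContext; mllp_header is the MLLP start marker from my code):

procedure TMyForm.IdTCPServer1Execute(AContext: TIdContext);
begin
  // WaitFor blocks until the start marker really arrives, whereas ReadLn
  // with an infinite timeout kept firing OnExecute on a dead connection
  AContext.Connection.IOHandler.WaitFor(mllp_header);
  // ... then read and process the rest of the frame as before ...
end;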
Thanks for helping me.
Offline