conn.IOHandler.ReadLn(mllp_header, -2 {IdTimeoutInfinite}, conn.IOHandler.MaxLineLength); ==> conn.IOHandler.WaitFor(mllp_header);
Thanks for helping me.
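The ReadLn-to-WaitFor change reflects MLLP framing: an HL7 MLLP frame is <VT> payload <FS><CR> (0x0B ... 0x1C 0x0D), not a line-oriented stream, so waiting for the frame terminator is the right approach. As a minimal sketch of that framing (plain C, hypothetical helper name, not Indy's actual implementation):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MLLP_SB 0x0B  /* start block  <VT> */
#define MLLP_EB 0x1C  /* end block    <FS> */
#define MLLP_CR 0x0D  /* trailing     <CR> */

/* Extract the payload of one complete MLLP frame from buf[0..len).
   Sets *payload and returns its length, or -1 if no complete frame yet. */
ptrdiff_t mllp_extract(const char *buf, size_t len, const char **payload)
{
    const char *start = memchr(buf, MLLP_SB, len);
    if (!start)
        return -1;
    const char *p = start + 1;                    /* payload begins after <VT> */
    const char *end = memchr(p, MLLP_EB, len - (size_t)(p - buf));
    if (!end || end + 1 >= buf + len || end[1] != MLLP_CR)
        return -1;                                /* frame not terminated yet */
    *payload = p;
    return end - p;
}
```

A line-based read would block (or split the message) because HL7 segments end in bare <CR>, which is exactly why `WaitFor` on the terminator works where `ReadLn` hangs.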
]]>Now I can reproduce it on my own test VM, which has half of the customer's cores, i.e. NumberOfCores=1, NumberOfLogicalProcessors=2.
With Process Explorer from the Sysinternals tool set, I can see two threads using 100% of CPU0; even if I suspend one of them, the remaining one still uses 100% of CPU0.
Maybe next step is to see if I can reproduce it on a real PC.
]]>I am sure Delphi multi-threaded applications are fine on Azure. Otherwise, they wouldn't get any money from it.
]]>It would make several cores lose a few percent.
And newer processors are less likely to have issues with the asm lock instruction. This article is somewhat old now - today's CPUs are not the same as those from 2010!
I have an application compiled with Delphi 2007. It is a multi-threaded application, but when it is busy it seems to use only one core, i.e. CPU0 is at 100% while the others stay at very low usage, and so the box (a Windows 2019 server VM) becomes unresponsive. Do you think this symptom (only one core used at 100%) is also related to the same LOCK issue in Delphi?
]]>Perhaps there are some measurements to make.
]]>3. Many lock operations are performed by the current version of Delphi:
- when you allocate memory;
- when a memory block has to grow or shrink;
- when your memory is freed;
The latest FastMM is quite good now; it scales well. (We used ScaleMM2, but we had to switch to FastMM because of a particular problem.)
- when a string is assigned to another string, which is very common if you use methods/functions that return a string as a result;
- when a char is about to be written into a string, i.e. when the string is about to be modified (an implicit UniqueString() call generated by the compiler);
- the same for dynamic arrays...
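Each of the cases above boils down to an interlocked increment/decrement on a reference count (or a heap lock inside the allocator); on x86 that is a LOCK-prefixed instruction, which every core must serialize on. A minimal C sketch of the copy-on-write reference-counting pattern - analogous to what the Delphi RTL does for strings, not the actual RTL code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Simplified copy-on-write payload, analogous to a Delphi string header. */
typedef struct {
    atomic_int refcount;
    char data[32];
} CowStr;

/* "a := b" - every assignment pays for a lock-prefixed increment. */
CowStr *cow_assign(CowStr *s)
{
    atomic_fetch_add(&s->refcount, 1);     /* lock inc [refcount] on x86 */
    return s;
}

/* Implicit UniqueString() before a write: copy if the data is shared. */
CowStr *cow_make_unique(CowStr *s)
{
    if (atomic_load(&s->refcount) == 1)
        return s;                          /* sole owner: write in place */
    CowStr *copy = malloc(sizeof *copy);   /* heap lock inside malloc, too */
    memcpy(copy->data, s->data, sizeof copy->data);
    atomic_init(&copy->refcount, 1);
    atomic_fetch_sub(&s->refcount, 1);     /* another lock-prefixed dec */
    return copy;
}
```

So a tight loop that assigns strings or grows dynamic arrays hammers the bus with locked operations even when no data is actually shared between threads.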
Do you have any kind of solution for the string/dynamic array locks that does not require a complete library/program rewrite?
]]>@LockSmallBlockType:
mov eax, cLockByteLocked
{Use a faster, plain load first so we don't consume bus resources; only when the lock looks free do we retry the interlocked exchange}
cmp TSmallBlockType([ebx]).SmallBlockTypeLocked, al
je @PrepareForSpinLoop
lock xchg TSmallBlockType([ebx]).SmallBlockTypeLocked, al
cmp al, cLockByteLocked
jne @GotLockOnSmallBlockType
@PrepareForSpinLoop:
push edx
@LockSmallBlockTypeLoop:
mov edx, 5000
mov eax, cLockByteLocked
@DidntLock:
@NormalLoadLoop:
dec edx
jz @SwitchToThread // for static branch prediction, jump forward means "unlikely"
pause
cmp TSmallBlockType([ebx]).SmallBlockTypeLocked, al
je @NormalLoadLoop // for static branch prediction, jump backwards means "likely"
lock xchg TSmallBlockType([ebx]).SmallBlockTypeLocked, al
cmp al, cLockByteLocked
je @DidntLock
pop edx
jmp @GotLockOnSmallBlockType
@SwitchToThread:
push ebx
push ecx
push esi
push edi
push ebp
call SwitchToThread
pop ebp
pop edi
pop esi
pop ecx
pop ebx
jmp @LockSmallBlockTypeLoop
FastMM4 (and the default Delphi built-in memory manager) is designed in such a way that, by default, on thread contention - when one thread cannot acquire access to data locked by another thread - it calls the Windows API function Sleep(0), and then, if the lock is still not available, enters a loop calling Sleep(1) after each check of the lock.
Each call to Sleep(0) incurs the expensive cost of a context switch, which can be 10000+ cycles; it also suffers the cost of ring-3-to-ring-0 transitions, which can be 1000+ cycles. As for Sleep(1): besides the costs associated with Sleep(0), it also delays execution by at least 1 millisecond, ceding control to other threads and, if no threads are waiting to be executed by the physical CPU core, putting the core to sleep, effectively reducing CPU usage and power consumption.
That's why CPU usage never reaches 100% in multi-threaded Delphi applications that work with memory very intensively and concurrently - because of the Sleep(1) calls issued by FastMM4.
This way of acquiring locks can be improved by replacing it with the better methods recommended by Intel in its Optimization Reference Manual.
A better way is a spin-loop of about 5000 `pause` instructions and, if the lock is still busy, a call to the SwitchToThread() API. If `pause` is not available (on very old processors without SSE2 support) or SwitchToThread() is not available (on very old Windows versions, prior to Windows 2000), the best fallback is EnterCriticalSection/LeaveCriticalSection, which does not have the latency associated with Sleep(1) and also very effectively cedes the CPU core to other threads.
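The same test-and-test-and-set idea as in the asm above can be written portably. A sketch in C11, assuming sched_yield() as the POSIX analog of SwitchToThread(); the spin budget of 5000 matches the value used in the asm:

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>       /* sched_yield: POSIX analog of SwitchToThread */
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_COUNT 5000

typedef struct { atomic_bool locked; } SpinLock;

static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause");   /* spin-wait hint, as in the asm above */
#endif
}

void spin_lock(SpinLock *l)
{
    /* interlocked exchange - same role as "lock xchg" above */
    while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire)) {
        int spins = SPIN_COUNT;
        /* spin on plain loads first so we don't hammer the bus */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed)) {
            cpu_relax();
            if (--spins == 0) {
                sched_yield();       /* budget exhausted: cede the core */
                spins = SPIN_COUNT;
            }
        }
    }
}

void spin_unlock(SpinLock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}

/* Small contention demo: two threads incrementing a shared counter. */
static SpinLock lk = { false };
static long counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock(&lk);
        counter++;                   /* protected increment */
        spin_unlock(&lk);
    }
    return NULL;
}
```

Unlike Sleep(1), the yield only gives up the current time slice, so the latency to reacquire a briefly-held lock stays in the microsecond range rather than a millisecond or more.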
I have modified FastMM4, by creating a fork, to use a new approach to waiting for a lock: critical sections instead of Sleep(). With these options, Sleep() is never used; EnterCriticalSection/LeaveCriticalSection is used instead. Testing has shown that using critical sections instead of Sleep (the FastMM4 default until now) provides a significant gain when the number of threads working with the memory manager is the same as or higher than the number of physical cores. The gain is even more evident on computers with multiple physical CPUs and Non-Uniform Memory Access (NUMA). I have implemented compile-time options to remove the original FastMM4 approach of Sleep(InitialSleepTime) followed by Sleep(AdditionalSleepTime) (i.e. Sleep(0) and Sleep(1)) and replace it with EnterCriticalSection/LeaveCriticalSection - saving the CPU cycles wasted by Sleep(0) and avoiding the latency of at least 1 millisecond added each time by Sleep(1), because critical sections are much more CPU-friendly and have definitely lower latency than Sleep(1).
When these options are enabled, FastMM4-AVX checks:
- whether the CPU supports SSE2 and thus the "pause" instruction, and
- whether the operating system has the SwitchToThread() API call,
and in that case uses a "pause" spin-loop for 5000 iterations followed by SwitchToThread() instead of critical sections. If the CPU doesn't have the "pause" instruction or Windows doesn't have the SwitchToThread() API function, it uses EnterCriticalSection/LeaveCriticalSection instead.
I have made the fork, called FastMM4-AVX, available at https://github.com/maximmasiutin/FastMM4
Here is a comparison of the original FastMM4 version 4.992 (default options, compiled for Win64 by Delphi 10.2 Tokyo, Release with Optimization) and the current FastMM4-AVX branch. Under some scenarios, the FastMM4-AVX branch is more than twice as fast as the original FastMM4. The tests were run on two different computers: one with two Xeon E5-2643 v2 CPU sockets, each with 6 physical cores (12 logical threads) - with only 5 physical cores per socket enabled for the test application. The other test was done on an i7-7700K CPU.
The tests used the "Multi-threaded allocate, use and free" and "NexusDB" test cases from the FastCode Challenge Memory Manager test suite, modified to run under 64-bit.
                   Xeon E5-2643 v2, 2 CPUs    i7-7700K CPU
                   (allocated 20 logical      (allocated 8 logical
                   threads, 10 physical       threads, 4 physical
                   cores, NUMA)               cores)
                    Orig.  AVX-br. Ratio       Orig. AVX-br. Ratio
                   ------  ------- -----      ----- ------- -----
02-threads realloc 96552 59951 62.09% 65213 49471 75.86%
04-threads realloc 97998 39494 40.30% 64402 47714 74.09%
08-threads realloc 98325 33743 34.32% 64796 58754 90.68%
16-threads realloc 116708 45855 39.29% 71457 60173 84.21%
16-threads realloc 116273 45161 38.84% 70722 60293 85.25%
31-threads realloc 122528 53616 43.76% 70939 62962 88.76%
64-threads realloc 137661 54330 39.47% 73696 64824 87.96%
NexusDB 02 threads 122846 90380 73.72% 79479 66153 83.23%
NexusDB 04 threads 122131 53103 43.77% 69183 43001 62.16%
NexusDB 08 threads 124419 40914 32.88% 64977 33609 51.72%
NexusDB 12 threads 181239 55818 30.80% 83983 44658 53.18%
NexusDB 16 threads 135211 62044 43.61% 59917 32463 54.18%
NexusDB 31 threads 134815 48132 33.46% 54686 31184 57.02%
NexusDB 64 threads 187094 57672 30.25% 63089 41955 66.50%
You can find more thorough tests of the memory manager in the FastCode Challenge test suite at http://fastcode.sourceforge.net/
]]>Has anyone tested how a multithreaded program will perform if the SetThreadIdealProcessor WinAPI is used?
I've used it, and it works pretty well, for instance when network processing is confined to some cores and file access (DB) activity is on others.
You can get an idea at the beginning of the file http://sourceforge.net/p/xmlrad/code/ci … System.pas
...........
: :
: V
...HTTP service ...DispatchThread
:..Mux RelaySMTP :..RequestQueueThread
:..Mux SMTP :
:..Mux XMPP :
: :
----------- -----------
| CPU#1 | QPI | CPU#2 |
DDR3...| 8xCores | < --- > | 8xCores |...DDR3
8x16GB | 16xSMT | | 16xSMT | 8x16GB
----------- -----------
QPI | | QPI
----------- -----------
| | QPI | |
| Chipset#1 | < --- > | Chipset#2 |
| | | |
----------- -----------
:..LAN NIC#1#2 :..PCIe#4 Flash Z-Drive R5 Tier0: 4x600GB
: NIC#3#4 :..PCIe#5 Flash Z-Drive R5
: :..PCIe#6 Flash Z-Drive R5
: :..PCIe#7 Flash Z-Drive R5
:..PCIe#1 (9261-8i)
: :..1x SSD X25E 64GB SLC System (Disk Writes intensive IOps, ie: persistent async queues)
: :..1x SSD M4 512GB MLC Trace
: :..2x SSD M4 512GB MLC Tier3 Eden (2x500GB)
: :..4x HDD 1TB Tier4 (RAID5 on every 4 drives, 4x1TBx5=15TB PCIe#1+PCIe#2+PCIe#3)
:..PCIe#2 (9261-8i)
: :..8x HDD 1TB Tier4
:..PCIe#3 (9261-8i)
: :..8x HDD 1TB Tier4
:..SAS2 Expander
    :..48x HDD 1TB Tier4
Main ideas of UltimX:
- avoid unnecessary memory allocations when moving data from the source buffer to the final buffer
- avoid intermediate string copies
- memory buffers are kept on the stack when possible
- memory allocations are maintained outside the scope of variables,
  so variables no longer need to reserve physical memory
  and can rely on pre-allocated space to move data from source to destination
- when memory allocations are needed and are local to the scope of the procedure,
  the programmer should favour the scratch heap CurrentScratch with XScratchStack/XScratchUnstack;
  the memory blocks live for the scope of the procedure (ephemeral allocation),
  all these local memory allocations are collected at each XScratchUnstack,
  and when the thread terminates, the whole CurrentScratch heap is collected at once
- when memory allocations need to survive beyond the scope of the current procedure,
  the programmer should use the heap of the current request (XMLRequest.Heap);
  the memory blocks live for the duration of the request (ephemeral allocation),
  and all these allocations are collected at once when the request is released
- finally, when there is no alternative but to perform a global memory allocation,
  such an allocation should be performed on the global heap (with auto-detection of the NUMA node)
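The scratch-heap idea above (XScratchStack/XScratchUnstack, per-request heaps) is essentially a stacked arena allocator: allocation is a bump of a pointer, and freeing is a single pointer reset at scope or request end, so there is no per-allocation lock at all. A hedged sketch of that idea - the function names are illustrative, not the actual UltimX API:

```c
#include <assert.h>
#include <stddef.h>

/* A stacked arena: bump-pointer allocation, bulk free on "unstack". */
typedef struct {
    unsigned char buf[64 * 1024];   /* per-thread scratch space */
    size_t top;
} Arena;

/* Remember the current level - analogous to XScratchStack. */
size_t arena_mark(Arena *a) { return a->top; }

/* Roll back to a saved level, freeing everything above it at once -
   analogous to XScratchUnstack. */
void arena_release(Arena *a, size_t mark) { a->top = mark; }

void *arena_alloc(Arena *a, size_t n)
{
    size_t aligned = (a->top + 15) & ~(size_t)15;  /* 16-byte alignment */
    if (aligned + n > sizeof a->buf)
        return NULL;                               /* scratch exhausted */
    a->top = aligned + n;
    return a->buf + aligned;
}
```

Because the arena is thread-local (or request-local), none of these operations needs a LOCK-prefixed instruction, which is exactly what makes this scheme scale where a shared general-purpose heap contends.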
There is a lot of very good code in this project, especially regarding the multi-threading abilities of Delphi.
But, just like mORMot, it uses a lot of dedicated code (e.g. for string work) rather than the standard RTL, in order to scale on multi-core.
Take a look at TXThread object: it handles thread affinity.
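For reference, SetThreadIdealProcessor is only a hint the Windows scheduler may ignore; the closest (and stronger) POSIX equivalent is a hard affinity mask. A sketch of pinning the calling thread on Linux, assuming pthread_setaffinity_np (a GNU extension), which is roughly what an object like TXThread would wrap on that platform:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one logical CPU. On Windows the softer
   equivalent is SetThreadIdealProcessor (a scheduling hint), while
   SetThreadAffinityMask / pthread_setaffinity_np are hard constraints. */
int pin_current_thread(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}
```

A hard mask keeps a thread's working set warm in one core's cache (useful for the network-vs-DB split described above), but unlike the ideal-processor hint it prevents the scheduler from migrating the thread even when its core is oversubscribed, so the hint is usually the safer default.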
]]>Has anyone tested how a multithreaded program will perform if the SetThreadIdealProcessor WinAPI is used?
This sounds good on paper: allocate cores evenly, but let the system do otherwise if it knows better... In a real-life scenario, when the program is not alone on the system, this could be good (IF someone else is also competing for the resources...).
Never tested it myself, but I once made a mental note that I should some day. If someone has a good system to compare on now: no thread affinity at all, threads pinned to a single core, and SetThreadIdealProcessor - what kind of difference do they produce? (Some trivial demo might be too sterile for this)...
-Tee-
]]>