#1 Re: Low level and performance » Fastest CRC calculation » 2020-02-22 22:04:51

The SSE 4.2 hardware crc32 instruction can only calculate a CRC32 using the Castagnoli polynomial, not the IEEE one.
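For reference, the two variants differ only in the (reflected) polynomial constant, so a generic table-less routine makes the distinction obvious. The constant and routine names below are mine; it is just a minimal, unoptimized sketch:

const
  CRC32_IEEE_POLY = $EDB88320; // reflected 0x04C11DB7 (zlib, PNG, Ethernet)
  CRC32C_POLY     = $82F63B78; // reflected 0x1EDC6F41 (iSCSI; what the SSE 4.2 crc32 instruction implements)

function Crc32Bytewise(Poly, Crc: Cardinal; Data: PByte; Len: NativeUInt): Cardinal;
var
  b: Integer;
begin
  Crc := not Crc;                  // standard initial inversion
  while Len > 0 do
  begin
    Crc := Crc xor Data^;
    for b := 0 to 7 do
      if (Crc and 1) <> 0 then
        Crc := (Crc shr 1) xor Poly
      else
        Crc := Crc shr 1;
    Inc(Data);
    Dec(Len);
  end;
  Result := not Crc;               // standard final inversion
end;

Passing CRC32C_POLY reproduces the standard CRC-32C checksum (the variant the hardware instruction accelerates), while CRC32_IEEE_POLY reproduces the zlib-style CRC32; the hardware instruction cannot be switched to the latter.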

There is also an assembler implementation of Slicing-by-8 for x64 (Win64) that you can take from https://github.com/maximmasiutin/CRC32- … 64-Asm-Pas
It is faster than any higher-level (Pascal, C) implementation and takes only about 1.20 CPU cycles per byte. It is based on the IA-32 version by Aleksandr Sharahov (http://guildalfa.ru/alsha/node/2), but with an essential modification: the second iteration of loop unrolling was removed, because testing demonstrated it had no benefit and could even make the code slower, since a branch in the middle of the unrolled loop could cause a branch misprediction penalty.


Besides that, in PUREPASCAL mode the current code of the Synopse mORMot framework implements Slicing-by-4, not Slicing-by-8. Please also consider using a Slicing-by-8 implementation for PUREPASCAL (the GitHub link above has one too).
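For reference, here is a hedged plain-Pascal sketch of the Slicing-by-8 idea for CRC-32C (little-endian; the table and routine names are mine, not mORMot's or the repository's):

{$POINTERMATH ON}

var
  Crc32cTables: array[0..7, 0..255] of Cardinal;

procedure InitCrc32cTables;
const
  POLY = $82F63B78; // reflected Castagnoli polynomial
var
  i, j, b: Integer;
  c: Cardinal;
begin
  // table 0 is the classic bytewise table
  for i := 0 to 255 do
  begin
    c := i;
    for b := 0 to 7 do
      if (c and 1) <> 0 then
        c := (c shr 1) xor POLY
      else
        c := c shr 1;
    Crc32cTables[0, i] := c;
  end;
  // tables 1..7 advance the CRC one extra byte each
  for i := 0 to 255 do
  begin
    c := Crc32cTables[0, i];
    for j := 1 to 7 do
    begin
      c := (c shr 8) xor Crc32cTables[0, c and $FF];
      Crc32cTables[j, i] := c;
    end;
  end;
end;

function Crc32cSlicing8(Crc: Cardinal; Data: PByte; Len: NativeUInt): Cardinal;
var
  d1, d2: Cardinal;
begin
  Crc := not Crc;
  // main loop: 8 bytes per iteration, 8 independent table lookups
  while Len >= 8 do
  begin
    d1 := PCardinal(Data)^ xor Crc;       // assumes little-endian, unaligned loads allowed
    d2 := PCardinal(Data + 4)^;
    Crc := Crc32cTables[7, d1 and $FF] xor
           Crc32cTables[6, (d1 shr 8) and $FF] xor
           Crc32cTables[5, (d1 shr 16) and $FF] xor
           Crc32cTables[4, d1 shr 24] xor
           Crc32cTables[3, d2 and $FF] xor
           Crc32cTables[2, (d2 shr 8) and $FF] xor
           Crc32cTables[1, (d2 shr 16) and $FF] xor
           Crc32cTables[0, d2 shr 24];
    Inc(Data, 8);
    Dec(Len, 8);
  end;
  // tail: one byte at a time through table 0
  while Len > 0 do
  begin
    Crc := (Crc shr 8) xor Crc32cTables[0, (Crc xor Data^) and $FF];
    Inc(Data);
    Dec(Len);
  end;
  Result := not Crc;
end;

The eight lookups per iteration are independent of each other, which is what lets the CPU overlap them and brings the cost down toward one or two cycles per byte.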

#2 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2017-07-18 18:54:56

No, it’s totally different from NeverSleepOnThreadContention. I’ve tried NeverSleepOnThreadContention with the FastCode Challenge Memory Manager test suite, and it was worse than the default behaviour. The essence is the "pause" instruction and a spin-loop of 5000 iterations using plain (non-locked) loads. The number (5000) is not mandatory; any other value between 500 and 50000 also works fine.
Here is the code (excerpt from the FastFreeMem, 32-bit assembler):

@LockSmallBlockType:
  mov  eax, cLockByteLocked
{Use a faster, plain (non-locked) load first, to avoid consuming bus resources; only when the byte looks free do the interlocked exchange again}
  cmp  TSmallBlockType([ebx]).SmallBlockTypeLocked, al       
  je   @PrepareForSpinLoop
  lock xchg TSmallBlockType([ebx]).SmallBlockTypeLocked, al
  cmp  al, cLockByteLocked
  jne  @GotLockOnSmallBlockType
@PrepareForSpinLoop:
  push edx
@LockSmallBlockTypeLoop:
  mov  edx, 5000
  mov  eax, cLockByteLocked
@DidntLock:
@NormalLoadLoop:
  dec  edx
  jz   @SwitchToThread // for static branch prediction, jump forward means "unlikely"
  pause
  cmp  TSmallBlockType([ebx]).SmallBlockTypeLocked, al       
  je   @NormalLoadLoop // for static branch prediction, jump backwards means "likely"
  lock xchg TSmallBlockType([ebx]).SmallBlockTypeLocked, al
  cmp  al, cLockByteLocked
  je   @DidntLock
  pop  edx
  jmp  @GotLockOnSmallBlockType
@SwitchToThread:
  // preserve the registers still in use across the API call
  push  ebx
  push  ecx
  push  esi
  push  edi
  push  ebp
  call  SwitchToThread
  pop   ebp
  pop   edi
  pop   esi
  pop   ecx
  pop   ebx

  jmp   @LockSmallBlockTypeLoop

#3 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2017-07-14 05:54:43

As people have pointed out, performance, especially multi-threaded performance, depends on the memory manager, and in particular on how it handles locks.

FastMM4 (and the default Delphi built-in memory manager) is designed in such a way that, by default, on thread contention, when one thread cannot acquire access to data locked by another thread, it calls the Windows API function Sleep(0); then, if the lock is still not available, it enters a loop, calling Sleep(1) after each check of the lock.
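In simplified form, that default pattern looks roughly like this (my own sketch, not the actual FastMM4 source; LockFlag = 0 means "unlocked"):

uses
  Winapi.Windows; // for InterlockedExchange and Sleep

function TryLock(var LockFlag: Integer): Boolean;
begin
  // InterlockedExchange returns the previous value; 0 means the lock was free
  Result := InterlockedExchange(LockFlag, 1) = 0;
end;

procedure AcquireWithSleep(var LockFlag: Integer);
begin
  if TryLock(LockFlag) then
    Exit;
  Sleep(0);                 // cede the rest of the time slice (forces a context switch)
  while not TryLock(LockFlag) do
    Sleep(1);               // wait at least ~1 ms before checking the lock again
end;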

Each call to Sleep(0) incurs the expensive cost of a context switch, which can be 10000+ cycles; it also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles. As for Sleep(1) – besides the costs associated with Sleep(0) – it also delays execution by at least 1 millisecond, ceding control to other threads, and, if there are no threads waiting to be executed on the physical CPU core, puts the core to sleep, effectively reducing CPU usage and power consumption.

That’s why CPU usage never reaches 100% in multi-threaded Delphi applications that work with memory very intensively from concurrent threads – because of the Sleep(1) issued by FastMM4.

This way of acquiring locks can be improved by replacing it with better methods, such as those recommended by Intel in its Optimization Reference Manual.

A better approach is a spin-loop of about 5000 `pause` instructions and, if the lock is still busy, a call to the SwitchToThread() API. If `pause` is not available (on very old processors without SSE2 support) or the SwitchToThread() API is not available (on very old Windows versions, prior to Windows 2000), the best solution is to use EnterCriticalSection/LeaveCriticalSection, which does not have the latency associated with Sleep(1) and also cedes control of the CPU core to other threads very effectively.
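In high-level terms the idea looks roughly like this (a simplified sketch with my own names; the real FastMM4-AVX code does this in assembler, as in the excerpt quoted in the previous post):

uses
  Winapi.Windows;

// SwitchToThread lives in kernel32 since Windows 2000; older Delphi Windows
// units may not declare it, so it is imported explicitly here.
function SwitchToThread: BOOL; stdcall; external kernel32 name 'SwitchToThread';

procedure CpuPause;
asm
  pause                     // spin-wait hint; encodes as REP NOP on pre-SSE2 CPUs
end;

procedure AcquireLock(var LockFlag: Integer);
var
  SpinCount: Integer;
begin
  repeat
    if InterlockedExchange(LockFlag, 1) = 0 then
      Exit;                 // lock acquired
    SpinCount := 5000;
    // spin on a plain (non-locked) read so the cache line is not bounced between cores
    while (LockFlag <> 0) and (SpinCount > 0) do
    begin
      CpuPause;
      Dec(SpinCount);
    end;
    if LockFlag <> 0 then
      SwitchToThread;       // still busy: cede the core to another ready thread
  until False;
end;

procedure ReleaseLock(var LockFlag: Integer);
begin
  InterlockedExchange(LockFlag, 0);
end;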

I have created a fork of FastMM4 that uses a new approach to waiting for a lock: critical sections instead of Sleep(). With these options enabled, Sleep() is never used; EnterCriticalSection/LeaveCriticalSection are used instead. Testing has shown that using critical sections instead of Sleep (the previous FastMM4 default) provides a significant gain in situations where the number of threads working with the memory manager is equal to or higher than the number of physical cores. The gain is even more evident on computers with multiple physical CPUs and Non-Uniform Memory Access (NUMA). I have implemented compile-time options that remove the original FastMM4 approach of calling Sleep(InitialSleepTime) and then Sleep(AdditionalSleepTime) (i.e. Sleep(0) and Sleep(1)) and replace them with EnterCriticalSection/LeaveCriticalSection, to save the CPU cycles wasted by Sleep(0) and to avoid the latency added each time by at least 1 millisecond by Sleep(1); critical sections are much more CPU-friendly and have considerably lower latency than Sleep(1).
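A minimal sketch of the critical-section variant (again a simplification under my own names, not the actual FastMM4-AVX code):

uses
  Winapi.Windows;

var
  SmallBlockCS: TRTLCriticalSection;

procedure LockSmallBlocks;
begin
  // blocks efficiently on contention, without the fixed >= 1 ms penalty of Sleep(1)
  EnterCriticalSection(SmallBlockCS);
end;

procedure UnlockSmallBlocks;
begin
  LeaveCriticalSection(SmallBlockCS);
end;

initialization
  // a user-mode spin count could also be requested with
  // InitializeCriticalSectionAndSpinCount(SmallBlockCS, 5000)
  InitializeCriticalSection(SmallBlockCS);
finalization
  DeleteCriticalSection(SmallBlockCS);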

When these options are enabled, FastMM4-AVX checks:

- whether the CPU supports SSE2, and thus the "pause" instruction, and
- whether the operating system provides the SwitchToThread() API call,

and, if both are available, it uses a "pause" spin-loop of 5000 iterations followed by SwitchToThread() instead of critical sections. If the CPU doesn't have the "pause" instruction or Windows doesn't provide the SwitchToThread() API function, it falls back to EnterCriticalSection/LeaveCriticalSection. The sketch below shows roughly what such a runtime check could look like.
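An illustrative sketch of such a check (it uses IsProcessorFeaturePresent as a stand-in for the CPUID test, and all names are mine):

uses
  Winapi.Windows;

const
  PF_XMMI64_INSTRUCTIONS_AVAILABLE = 10; // SSE2 (and therefore "pause") supported

type
  TSwitchToThreadFunc = function: BOOL; stdcall;

var
  SwitchToThreadFunc: TSwitchToThreadFunc = nil;

function CanUsePauseAndSwitchToThread: Boolean;
begin
  // look up SwitchToThread dynamically so the code still loads on Windows
  // versions older than Windows 2000
  @SwitchToThreadFunc := GetProcAddress(GetModuleHandle(kernel32), 'SwitchToThread');
  Result := IsProcessorFeaturePresent(PF_XMMI64_INSTRUCTIONS_AVAILABLE) and
            Assigned(SwitchToThreadFunc);
  // when Result is False, fall back to EnterCriticalSection/LeaveCriticalSection
end;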

I have made available the fork called FastMM4-AVX at https://github.com/maximmasiutin/FastMM4


Here is a comparison of the original FastMM4 version 4.992 (with default options, compiled for Win64 by Delphi 10.2 Tokyo, Release with Optimization) and the current FastMM4-AVX branch. Under some scenarios, the FastMM4-AVX branch is more than twice as fast as the original FastMM4. The tests were run on two different computers: one with two Xeon E6-2543v2 CPU sockets, each with 6 physical cores (12 logical threads), with only 5 physical cores per socket enabled for the test application; the other with an i7-7700K CPU.

I used the "Multi-threaded allocate, use and free" and "NexusDB" test cases from the FastCode Challenge Memory Manager test suite, modified to run under 64-bit.

                         Xeon E6-2543v2 2*CPU     i7-7700K CPU
                        (allocated 20 logical  (allocated 8 logical
                         threads, 10 physical   threads, 4 physical
                         cores, NUMA)           cores)

                        Orig.  AVX-br.  Ratio   Orig.  AVX-br. Ratio
                        ------  -----  ------   -----  -----  ------
    02-threads realloc   96552  59951  62.09%   65213  49471  75.86%
    04-threads realloc   97998  39494  40.30%   64402  47714  74.09%
    08-threads realloc   98325  33743  34.32%   64796  58754  90.68%
    16-threads realloc  116708  45855  39.29%   71457  60173  84.21%
    16-threads realloc  116273  45161  38.84%   70722  60293  85.25%
    31-threads realloc  122528  53616  43.76%   70939  62962  88.76%
    64-threads realloc  137661  54330  39.47%   73696  64824  87.96%
    NexusDB 02 threads  122846  90380  73.72%   79479  66153  83.23%
    NexusDB 04 threads  122131  53103  43.77%   69183  43001  62.16%
    NexusDB 08 threads  124419  40914  32.88%   64977  33609  51.72%
    NexusDB 12 threads  181239  55818  30.80%   83983  44658  53.18%
    NexusDB 16 threads  135211  62044  43.61%   59917  32463  54.18%
    NexusDB 31 threads  134815  48132  33.46%   54686  31184  57.02%
    NexusDB 64 threads  187094  57672  30.25%   63089  41955  66.50%

You can find better tests of the memory manager in the FastCode challenge test suite at http://fastcode.sourceforge.net/
