#1 2021-11-01 12:34:15

wxinix
Member
Registered: 2020-09-07
Posts: 121

FastMove that claims to be 50% faster than mORMot

https://github.com/dbyoung720/FastMove

@ab, any comments? How could it possibly be 50% faster than mORMot?

Offline

#2 2021-11-01 13:25:53

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,237
Website

Re: FastMove that claims to be 50% faster than mORMot

This is a micro-benchmark: a single size of 1024 bytes, running within the L1 CPU cache, with the same pointers over and over, and timed at millisecond resolution.
Pretty useless.
They could at least have used the Validate() test available in TTestCoreBase.CustomRTL, which covers several sizes, or the FastCode benchmark, which is a good reference.
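
For illustration, here is a minimal sketch of what a fairer benchmark could look like - the sizes, the 64MB working set and the per-size loop counts below are arbitrary choices, and this is neither the Validate() test nor the FastCode benchmark:

program movebench;

{$ifdef FPC}{$mode delphi}{$endif}

uses
  SysUtils, DateUtils;

const
  SIZES: array[0..5] of Integer = (8, 32, 256, 1024, 16 * 1024, 1024 * 1024);
  BUFSIZE = 64 * 1024 * 1024; // 64MB working set, much larger than the CPU caches

var
  src, dst: array of Byte;
  s, i, loops, offs: Integer;
  start: TDateTime;
  moved: Int64;
begin
  SetLength(src, BUFSIZE);
  SetLength(dst, BUFSIZE);
  for s := Low(SIZES) to High(SIZES) do
  begin
    loops := (256 * 1024 * 1024) div SIZES[s]; // move ~256MB for every size
    moved := 0;
    offs := 0;
    start := Now;
    for i := 1 to loops do
    begin
      Move(src[offs], dst[offs], SIZES[s]); // put the routine under test here
      Inc(moved, SIZES[s]);
      Inc(offs, SIZES[s]); // walk the buffers so each call sees fresh addresses
      if offs + SIZES[s] > BUFSIZE then
        offs := 0;
    end;
    WriteLn(Format('%8d bytes: %d MB in %d ms',
      [SIZES[s], moved shr 20, MilliSecondsBetween(Now, start)]));
  end;
end.

Walking through a working set much bigger than the caches, over several sizes, is closer to what a real application actually asks of a Move() routine.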

I decompiled some of the .obj files.
First of all, it is not their code: it is a compiled version of another project, so there is a licensing issue to be fixed. The source should also be mandatory on any such project - and a Linux/POSIX version too.
For small blocks (< 32 bytes), which are very common in practice, it seems much slower than the mormot.core.base.pas MoveFast() - there is a huge overhead in their code.
I also suspect a potential AVX / SSE transition penalty issue in production use: they use AVX registers for sizes < 256 bytes, and call vzeroupper after switching from ymm to xmm registers in the asm, which is not recommended. https://www.intel.com/content/dam/devel … 256953.pdf
The prefetchnta opcode is also used, which is not a good idea according to Agner Fog's reference material. https://www.agner.org/optimize/
Even the function signature is not the same as the standard Move(src,dst,cnt) - so you would need to rewrite all your code!
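
As a purely hypothetical illustration of that last point (FastMoveExt and its pointer-based signature below are invented for the sketch - they are not FastMove's actual API): a drop-in replacement has to keep the RTL contract Move(const Source; var Dest; Count), otherwise every call site needs either a rewrite or a thin adapter like this one:

procedure FastMoveExt(ASrc, ADst: PByte; ACount: NativeInt);
begin
  Move(ASrc^, ADst^, ACount); // stand-in body, only to keep the sketch compilable
end;

procedure MoveCompat(const Source; var Dest; Count: NativeInt);
begin
  FastMoveExt(@Source, @Dest, Count); // adapt the untyped parameters to pointers
end;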

Then the whole code is bloated: more than 120KB of code just for moving data, whereas our MoveFast is 671 bytes long. Almost every small size < 255 has its own unrolled version. It flies on benchmarks which run the exact same size of move over and over within the L1 cache, at the same address - but no real application does that. In production code, the CPU opcache would be heavily polluted by this code, and your own code would likely have to be re-decoded by the CPU more often.

So it sounds like code optimized for benchmarking, not for production use.
If your application is just moving data - why not?
But if your application does something useful, any actual gain is still to be demonstrated.

If the mORMot 2 regression tests were any faster with this code, then we would start talking. But I doubt it, especially due to the problems with small sizes, the unexpected prefetching, and the CPU opcache pollution.

edit: one additional remark: our AVX code is not available on Delphi, because even Delphi 11 does not support AVX, so that code path is unusable there. The Delphi build therefore uses a regular SSE2 loop, which is slower - this explains the difference in Project1.exe. With a good compiler like FPC, which allows writing AVX code, I guess the 1024-byte copy benchmark would run almost the same.
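
To illustrate why the compiler matters here, a minimal sketch (this is not mORMot's actual code; it assumes Delphi syntax or FPC in {$mode delphi}, and CpuHasAvx stands in for a real CPUID check): the fastest routine can be picked once and called through a pointer, but an AVX body can only exist when compiling with FPC, so a Delphi build stays on the SSE2 path even on an AVX-capable CPU.

type
  TMoveProc = procedure(const Source; var Dest; Count: NativeInt);

procedure MoveSSE2(const Source; var Dest; Count: NativeInt);
begin
  Move(Source, Dest, Count); // placeholder for an SSE2 asm implementation
end;

{$ifdef FPC}
procedure MoveAVX(const Source; var Dest; Count: NativeInt);
begin
  Move(Source, Dest, Count); // placeholder for an AVX asm implementation
end;
{$endif}

var
  MoveBest: TMoveProc;

procedure InitMoveBest(CpuHasAvx: Boolean);
begin
  MoveBest := MoveSSE2;
  {$ifdef FPC}
  if CpuHasAvx then
    MoveBest := MoveAVX; // only reachable when the compiler could emit AVX
  {$endif}
end;

Existing code would then call MoveBest(src, dst, len) instead of calling Move() directly.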

Offline
