Fast MM5

esmondb · 2020-04-30 10:04:28

May be of interest to mORMot users.

edwinsn · 2020-04-30 16:48:42

Great! Just discovered it earlier.
It seems that there is a major change in licensing. I assume one can use the GPL license in a closed-source, commercial program?

radexpol · 2020-04-30 17:05:43

No Delphi 7 support

ab · 2020-04-30 17:38:33

GPL is a "viral" licence.
So if you use FastMM5 in any program, this whole program and its source should become GPL 3.
The only alternative is to pay the license fee.

There is no older Delphi support, nor any FPC support.
I could help Eric make it FPC/Linux compatible for sure, but its use as pure GPL may be not worth it for some mORMot users.

emk · 2020-04-30 20:50:13

ab wrote:

I could help Eric make it FPC/Linux compatible for sure, but its use as pure GPL may be not worth it for some mORMot users.

Some benchmarks will tell the answer, especially in heavily multi-threaded scenarios.
If the improvement is real, I don't think commercial users will refuse to spend the equivalent of 16GB of DDR4.

Anyway FPC/Linux needs a good MM, so this is a nice opportunity to investigate - when a quality product is offered at a decent price, people will pull the trigger.

Also, please read: fastmm5-now-released-by-pierre-le-riche-small-background-story/

Last edited by emk (2020-04-30 21:00:07)

Junior/RO · 2020-05-01 13:07:30

I will only try the BorlndMM DLL with Delphi 7.

urhen · 2020-05-01 17:56:19

They developed a new memory manager (made for multithreading and other new technologies) - why the hell should they aim to support your almost 20 years old Delphi version?
Switch to a newer version with much better code optimizations and you probably get a bigger performance boost than with a new memory manager.
Sorry but your claim to support Delphi 7 is the biggest bullshit I've read here (also for mORMot2).

ab · 2020-05-02 12:12:52

@urhen
To be honest, there is almost no performance boost at compiler level since Delphi 2007...
The only noticeable performance enhancement was with Delphi 2006/2007, when inlining was enabled - and also (but it is not the compiler) when FastMM4 was included as standard in the RTL.

For mORMot2, we support Delphi 7 because almost 20% of our Users require it.
Delphi 7 is the second (yes - second) platform requested in our poll http://blog.synopse.info/post/2020/03/3 … ot2-Survey

I understand you don't use nor need it.
But please be fair and indulgent with people stuck with this version, most probably due to existing code base (also due to pricing of the Delphi IDE).
mORMot is great for adding e.g. a REST server to an existing legacy project with millions of lines of code, which never made the Unicode/UTF-16 jump... it is not your use case, but it is one of the biggest mORMot advantages.

Junior/RO · 2020-05-02 17:41:12

Delphi 7:

20% + 1

I didn't vote in this survey because I was on vacation.

Sha · 2020-05-02 20:13:23

Same, Delphi 7: 20% + 1

and I'd draw your attention to FastMM4-AVX
https://github.com/maximmasiutin/FastMM4-AVX

wesleyalves · 2020-05-03 07:34:29

Delphi 7 20% +1. rs

mpv · 2020-05-04 12:07:09

The main problem with Delphi 7 (apart from being Windows-only) is that it is 32-bit only. All modern operating systems are primarily x64. Microsoft, for example, doesn't install WoW64 by default on Windows Server Core, so 32-bit apps do not work there without additional OS tuning. In 2-3 years they will drop WoW64 support for Windows Server, and after that for Windows 10. So for a long-term project, migration to x64 is IMHO a must, even for projects with a huge code base...

So I share the opinion that Delphi 7 support is not mandatory for a new library.

Example from real life: I have a customer who uses "WinXP only" software, and therefore can't migrate to Windows 10. But if they keep WinXP, they must allow NTLM 1 on the Domain Controller, and NTLM 1 is very insecure. They can't install a recent web browser, etc. As a result, the customer installed 1 (ONE) WinXP computer (without network) with the "WinXP only" software, instead of 200 as before, so the "WinXP only" software vendor lost 199 installations. In the long run, the same will happen with 32-bit apps.

Last edited by mpv (2020-05-04 21:56:55)

ab · 2020-05-04 13:06:34

@mpv
This is what VMs are for - you can run XP in a VM, with no network defined, and it will work with no problem.

macfly · 2020-05-04 14:10:09

I fully understand the need to still use Delphi 7, as I also use it in a legacy application.

But, in my opinion, there is mORMot 1.18 for those who need compatibility with Delphi 7.

And leave mORMot 2 focused on the new compilers.

johnnysynop · 2020-05-05 13:07:22

urhen wrote:

They developed a new memory manager (made for multithreading and other new technologies) - why the hell should they aim to support your almost 20 years old Delphi version?
Switch to a newer version with much better code optimizations and you probably get a bigger performance boost than with a new memory manager.
Sorry but your claim to support Delphi 7 is the biggest bullshit I've read here (also for mORMot2).

I'm in the same position, plus my input:
1. Delphi is a commercial product, which by design is used by companies who understand that software needs updates
2. Nobody here asked to support only 10.3+, but XE3 or XE4 as a minimum for mORMot2 is reasonable (in my opinion Tokyo should be the minimum)
3. I don't like money grubbers: they made hundreds of thousands of bucks on their software and can't spend money on an update at least once every 5 years. (D7 is 18 years old!)
4. If mORMot supports the free FPC as its main target, then Delphi, being commercial, should have stricter requirements.
5. Delphi 7 = no 64-bit support
6. If someone makes no money, or very little, on their software, Delphi is FREE for such users (Community Edition), and they can always download the latest edition.
7. "But I make money on my software, I cannot use the free Community Edition" - my answer: go to point 3, it's about you.
8. 1.17 and 1.18 fully support D7 and other obsolete compilers.

I can confirm urhen's comment makes sense: supporting a 20-year-old product is total bullshit.
To move forward, old things sometimes need to be left behind, otherwise they will always be a blocker and a pain in the ass

Last edited by johnnysynop (2020-05-05 13:38:45)

radexpol · 2020-05-05 15:12:04

Could you please stop this discussion about Delphi 7 support? There are a lot of Delphi 7 users (I am one of them) and I'm really grateful that ab still supports that compiler and does not abandon D7. I understand that you throw your 5-year-old Samsung TV in the bin because its smart TV support has ended. My Delphi 7 works fine, and I have no compelling reason to move to a newer version. FYI: two months ago I bought 5 Rio licences with 3 years of support, and I'm almost sure that I will not have time to move to it this year, just as when I tried 3 years ago.

End of discussion?

Last edited by radexpol (2020-05-05 15:17:45)

sakura · 2020-05-05 16:08:38

johnnysynop wrote:

...I can confirm urhen comment make sense, total bullshit supporting 20 years old product...

While not my choice of words, it matches my sentiments ;-)

Junior/RO · 2020-05-05 17:54:55

Yes, please, end of discussion. D7 works. D7 has components that I don't want to buy again for newer versions.

Reinstalling components is hell.

Even today, Delphi has nothing that works like a Ruby bundle or a Rust crate, so I can't have a dependency list nor a place to download the dependencies from.

Vitaly · 2020-05-05 17:56:40

I wonder why some people here are so obsessed with giving up Delphi 7 support.
I'm using 10.3 and I have no problem with Delphi 7 support in mORMot. Arnaud asked everybody which versions they need to be supported in 2.0. He made a decision according to the results of the survey. Is he making you do anything to make it happen? Then what bothers you so much?

While choosing words like 'bullshit', please, don't forget that in the context you're talking about somebody's job, life business, or hobby.

zed · 2020-05-05 18:21:20

Is this topic still about FastMM5?

urhen · 2020-05-05 23:15:33

Vitaly wrote:

While choosing words like 'bullshit', please, don't forget that in the context you're talking about somebody's job, life business, or hobby.

rofl, that was the biggest trap you could fall into! Hopefully all the people who worked/are working on/for Delphi are still paid from your Delphi 7 license. *facepalm* Just read point 3 from johnnysynop again.

It just shows the bad mentality of you guys, still using Delphi 7 (= nothing paid for 20 years) and then also demanding to use the free mORMot while making money with the product built with it. Typical business mo******** - take everything for free but never give anything back.
You should give ab all the money you saved the last 20 years + 1.000.000€ because he is such a fantastic guy who is a real plus for the FPC/Delphi community.

And btw Delphi 7 code should fully work with FPC (that was their main target years ago...). It's also free, so absolutely your thing.

Now back to FastMM5. Some more background infos are here and opinions are here.

Vitaly · 2020-05-06 02:04:02

urhen wrote:

Vitaly wrote:
While choosing words like 'bullshit', please, don't forget that in the context you're talking about somebody's job, life business, or hobby.
rofl, that was the biggest trap you could fall in!

what trap? calling somebody's choice (you don't know the real people's situation) 'bullshit' is funny for you? ok, for me it is just an insult and I'm not expecting seeing this here in anybody's direction.
just reminding one rule of this forum:
5. No trolling, advertising, spamming, fundraising, illegal activity, flaming or other personal attacks, be they acrimonious or veiled in humor.

urhen wrote:

Hopefully all people worked/working on/for Delphi are still paid from your Delphi 7 license. *facepalm* Just read point 3 again from johnnysynop.
It just shows the bad mentality of you guys, still using Delphi 7 (= nothing paid for 20 years) and then also demand to use the free mORMot but making money with the product built with it. Typical business mo******** - take everything for free but never give anything back.

what are you talking about at all? Weren't you able to read that I'm using Delphi 10.3? paid Professional version if you worried so much...

urhen wrote:

You should give ab all the money you saved the last 20 years + 1.000.000€ because he is such a fantastic guy who is a real plus for the FPC/Delphi community.

I'm donating to opensource projects when I have such an ability, Synopse is also in this list. If you want free advice: stop "counting" people's money and telling them, how they should work, and how they should spend their money. In all cases, you don't know their life situation and current needs and most likely you'll always be wrong in your suspicions. They are grownups and they can deal with it by themselves. In other cases, you just show your disrespect to them (hope, it is not your purpose, anyway).
I thought Arnaud explained possible situations and his decision for supporting Delphi 7 clearly here. Or you haven't read his reply thoroughly? Then do it, please.

I didn't get the explanation, why having Delphi 7 support in mORMot particularly bothers you so much. Nothing, even tiny technical detail...
Just understand, please. It is not for you, Arnaud decided to make it for others, they are also good people (in spite of any offensive unproved thing you can tell about them). And they are fully-right members of this community. Which, I assume, Arnaud created not for your 'bullshit' and 'mo********' (whatever is hidden there, I don't think that it is pleasant), but for sharing, contribution and supporting each other. I hope, Arnaud will correct me if I'm wrong.

ab · 2020-05-06 06:47:19

Back to FastMM5.
The big change is about small blocks.

FastMM5 handles several arenas instead of the single one in FastMM4, and tries each of them until it finds one that is not currently locked, so thread contention is less likely to happen.
One area where FastMM5 could be improved is its naive use of "rep movsb", which should rather be a non-temporal SSE2 move.
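
To make the "try each arena until one is not locked" idea concrete, here is a minimal Free Pascal sketch. It is not FastMM5's code: the arena count, the TArena record and the TryAcquireArena / ReleaseArena names are all hypothetical, and only the lock word is modelled. The RTL calls used (InterlockedCompareExchange, InterlockedExchange) are the standard System unit ones.

```pascal
program ArenaPick;
{$mode objfpc}

const
  ARENA_COUNT = 4; // hypothetical value, not FastMM5's actual arena count

type
  TArena = record
    Locked: LongInt; // 0 = free, 1 = held by a thread
    // a real arena would also hold its small-block pools here
  end;

var
  Arenas: array[0..ARENA_COUNT - 1] of TArena; // zero-initialized globals

// Try each arena in turn and return the index of the first one we could
// lock, or -1 if every arena is currently held.
function TryAcquireArena: Integer;
var
  i: Integer;
begin
  for i := 0 to ARENA_COUNT - 1 do
    // atomically switch Locked from 0 to 1; the old value tells us if we won
    if InterlockedCompareExchange(Arenas[i].Locked, 1, 0) = 0 then
      Exit(i);
  Result := -1;
end;

procedure ReleaseArena(Index: Integer);
begin
  InterlockedExchange(Arenas[Index].Locked, 0);
end;

var
  a, b, c: Integer;
begin
  a := TryAcquireArena;  // arena 0 is free, so we get it
  b := TryAcquireArena;  // arena 0 is busy, so we fall through to arena 1
  WriteLn('first: ', a, ', second: ', b);
  ReleaseArena(a);
  c := TryAcquireArena;  // arena 0 is free again
  WriteLn('after release: ', c);
  ReleaseArena(c);
  ReleaseArena(b);
end.
```

A caller that gets -1 would have to wait or spin, which is exactly the contention case that having several arenas makes rarer.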

ScaleMM2 and the FPC heap both use a threadvar arena for small blocks, so they don't bother to check for any lock. They are truly lock-free.
But each thread maintains its own small blocks arena, so it consumes more memory.
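
The threadvar approach can be illustrated with a small Free Pascal program. This is only a sketch under assumptions: a per-thread counter stands in for a per-thread arena, and the constants and the Worker name are made up for the example. It shows why no lock is needed on the hot path, and also why the cost is per-thread state.

```pascal
program ThreadVarArena;
{$mode objfpc}
uses
  {$ifdef unix} cthreads, {$endif} // threading driver needed on Unix targets
  SysUtils;

const
  THREADS = 4;              // hypothetical values for the demo
  ALLOCS_PER_THREAD = 1000;

threadvar
  // each thread gets its own copy, so updating it needs no lock
  LocalAllocCount: LongInt;

var
  TotalAllocCount: LongInt = 0; // shared between threads, updated atomically

function Worker(Param: Pointer): PtrInt;
var
  i: Integer;
begin
  for i := 1 to ALLOCS_PER_THREAD do
    Inc(LocalAllocCount); // stands in for "take a block from my own arena"
  // publish the per-thread result once, at the end
  InterlockedExchangeAdd(TotalAllocCount, LocalAllocCount);
  Result := 0;
end;

var
  Handles: array[1..THREADS] of TThreadID;
  t: Integer;
begin
  for t := 1 to THREADS do
    Handles[t] := BeginThread(@Worker, nil);
  for t := 1 to THREADS do
    WaitForThreadTerminate(Handles[t], 0); // 0 = no timeout
  WriteLn('total: ', TotalAllocCount);
end.
```

In a real memory manager each thread's private state is a set of block pools rather than one integer, which is where the extra memory consumption comes from.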

Other Memory Managers like Intel TBB or JeMalloc also have a similar per-thread approach, but consume much more memory.
For instance, TBB is a huge winner in performance, but it consumes up to 60 (sixty!) times more memory! So it is not usable in practice for serious server work - it may help for heavily multi-threaded apps, but not for common services.
I tried those other MMs with mORMot. Please check the comments at the beginning of https://synopse.info/fossil/artifact/f85c957ff5016106

Our only problem with FPC heap is that with long term servers, it tends to fragment the memory and consumes some GB of RAM, whereas a FastMM-like memory manager would have consumed much less.
FPC heap memory consumption doesn't leak: it stabilizes after a few days, but it is still higher than FastMM.
The problem with FastMM (both 4 and 5) is that they don't work on Linux x86_64 with FPC.

emk · 2020-05-06 12:03:25

ab,

Is FastMM4-AVX any better? Can you make it FPC/Linux compatible?

ab · 2020-05-07 21:24:29

Please check https://github.com/synopse/mORMot2/blob … cx64mm.pas

It is based on FastMM4 (not FastMM5) and not the FastMM4-AVX version - instead of AVX, we use plain good (non-temporal) SSE2 opcode, and rely on http://man7.org/linux/man-pages/man2/mremap.2.html on Linux for very efficient reallocation.
It runs all our regression tests with huge performance and stability - including multi-threaded tests with almost no slow down (Sleep reported for 2ms during a 1 minute test).

Feedback is welcome!

    A Multi-thread Friendly Memory Manager for FPC written in x86_64 assembly
    - based on FastMM4 proven algorithms by Pierre le Riche
    - targeting Windows and Linux multi-threaded Services
    - only for FPC on the x86_64 target - use the original heap on Delphi or ARM
    - code has been reduced to the only necessary featureset for production
    - huge asm refactoring for cross-platform, compactness and efficiency
    - detailed statistics gathering (also about leaks and threads contention)
    - mremap() makes large block ReallocMem a breeze on Linux :)

Why another Memory Manager on FPC?
- The built-in heap.inc is well written and cross-platform and cross-CPU, but its threadvar arena for small blocks tends to consume a lot of memory on multi-threaded servers, and has suboptimal allocation performance
- C memory managers (glibc, Intel TBB, jemalloc) have a very high RAM consumption (especially Intel TBB) and panic/SIGKILL on any GPF
- Pascal alternatives (FastMM4,ScaleMM2,BrainMM) are Windows+Delphi specific
- It was so fun diving into SSE2 x86_64 assembly and Pierre's insight
- Resulting code is still easy to understand and maintain, and performs very well
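
For readers who have never replaced the FPC heap: the RTL exposes the active manager as a TMemoryManager record of function pointers, read with GetMemoryManager and swapped with SetMemoryManager. The sketch below is not taken from mormot.core.fpcx64mm; it only wraps the existing GetMem entry with a hypothetical call counter (CountingGetMem, GetMemCalls) to show where a replacement manager plugs in.

```pascal
program CountingMM;
{$mode objfpc}

var
  OldMM, NewMM: TMemoryManager;
  GetMemCalls: PtrUInt = 0; // not thread-safe: fine for a single-threaded demo

// forward every request to the previous manager, counting calls on the way
function CountingGetMem(Size: PtrUInt): Pointer;
begin
  Inc(GetMemCalls);
  Result := OldMM.GetMem(Size);
end;

var
  p: Pointer;
  before: PtrUInt;
begin
  GetMemoryManager(OldMM);
  NewMM := OldMM;                  // keep all the other entries untouched
  NewMM.GetMem := @CountingGetMem;
  SetMemoryManager(NewMM);

  before := GetMemCalls;
  GetMem(p, 64);                   // goes through CountingGetMem
  FreeMem(p);                      // still handled by the original FreeMem
  WriteLn('GetMem calls seen: ', GetMemCalls - before);

  SetMemoryManager(OldMM);         // restore the original manager
end.
```

A full replacement manager fills in every entry of the record (GetMem, FreeMem, ReAllocMem, MemSize and so on) and has to be installed before any other unit allocates memory, which is why such units are normally listed first in the program's uses clause.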

macfly · 2020-05-08 00:36:16

Very good news!

Any chance of integration with mORMot 1.18?

urhen · 2020-05-08 01:08:38

macfly wrote:

Any chance of integration with mORMot 1.18?

Better let him focus on finishing mORMot2

But I'm really impressed on how fast you wrote a new memory manager.

Last edited by urhen (2020-05-08 01:09:11)

macfly · 2020-05-08 01:26:20

mORMot 2 seems to have a long way to go.

It would be great to be able to test on mORMot 1.18.

But if in 2 days absent from the forum he created a new memory manager, in 1 month we will have mORMot 2 ready.

edwinsn · 2020-05-08 04:04:18

Mr. Bouchez,

You never cease to surprise us with your ability and fast speed!

ab · 2020-05-08 06:26:53

The unit is stand alone so it works with mormot 1.18 too.

Note: tests show no problem on Linux, but report some issues on Win64.
More investigation (and feedback) is needed!
Our main target for services (and for this unit) is Linux, so help on the Windows side is welcome.

ttomas · 2020-05-08 11:51:23

@ab You make my day
Just tested TestSQL3
Manjaro Linux - Linux 5.4.36-1-MANJARO (cp65001)
4 x Intel(R) Core(TM) i3-8145U CPU @ 2.10GHz (x64)
Using mORMot 1.18.5970
TSQLite3LibraryStatic 3.31.0 with internal MM
Generated with: Free Pascal 3.2 64 bit compiler

Time elapsed for all tests: 2m36
Total assertions failed for all test suits: 0 / 43,251,504

with mormot.core.fpcx64mm.pas

Time elapsed for all tests: 1m04
Total assertions failed for all test suits: 0 / 43,253,899

I can't believe this speed boost :-)
Notice total No. of test is different?

Last edited by ttomas (2020-05-08 11:52:29)

mdbs99 · 2020-05-08 13:17:29

ab wrote:

The unit is stand alone so it works with mormot 1.18 too.

Sure. But during the transition to 2.0, it would be nice to have a batch script to copy this unit from 2.0 to 1.18 (renamed to follow the unit naming pattern of 1.18, I presume).

RObyDP · 2020-05-08 14:31:38

Will you release a Delphi version?

ab · 2020-05-08 15:39:09

Thanks for the kind words.
I really missed a good MM in FPC/Linux for serious stuff.
I didn't want to make another "toy MM". It is not a proof-of-concept: it is expected to be the MM we use for high performance mORMot servers - faster than anything else on the market.
From my guess, fpcx64mm is faster and uses less memory than even the well-known MMs written in C. This is the benefit of asm with a single target and CPU.

@macfly
I worked two days (and part of the night) on it, true. But I already knew a lot about MMs in general (theory and practice - I worked on ScaleMM, remember) and I had a precise idea of which direction to go: take the best of FastMM4 (which is very proven, stable and efficient), but drive it a little further in terms of multi-threading and code quality (FastMM4 asm is 32-bit oriented, and its x86_64 version was sometimes not very optimized for this target - just see its abuse of globals, no knowledge of micro-op fusion or CPU cache lines and locks, and sparse use of registers). Also, focusing on a single compiler and a single CPU, without all the features of FastMM4 in pascal mode, helped make it happen fast.
Last but not least, I spent a lot of time this last year in x86_64 assembly, so I know which patterns are expected to be faster. And the huge regression test suite of mORMot helps by providing a real benchmark - much more aggressive and realistic than the microbenchmarks (like string concatenation in threads, or even the FastCode benchmark) on which most other MMs rely for measurement. When the regression tests are more than twice as fast on Linux - as @ttomas reported - then we are talking.

@ttomas
The speed boost is expected, due to how the FPC heap is not very fast by itself. It was meant to be cross-platform, and correct - not the fastest. Memory consumption should also be lower with fpcx64mm.
Note that the regression tests are not very aggressive about memory usage.
I guess that some user code, not written with performance in mind, and e.g. abusing str := str+'something' patterns, would also be more than twice as fast. And if it has to reallocate huge buffers (>256KB) in a loop, using mremap on Linux may give a huge performance boost, since no data would be copied at all.
About the different number of tests: this is because some tests use true random values to do some fuzz testing, even for the iteration count or the data processed. And some multi-thread tests just stop after a time period, during which a faster MM makes more iterations, therefore more checks. So the number of check() calls is expected to be of the same order of magnitude between runs, not exactly equal.
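
For reference, this is the kind of str := str+'something' loop meant above, next to a pre-sized alternative. It is only an illustrative Free Pascal sketch: the function names and the iteration count are made up, and it makes no claim about how much faster either variant is under a given memory manager.

```pascal
program ConcatVsPrealloc;
{$mode objfpc}{$H+}

const
  COUNT = 1000; // hypothetical iteration count

// the pattern from the post: every "+" may have to grow the string
function BuildNaive(const Piece: AnsiString): AnsiString;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to COUNT do
    Result := Result + Piece;
end;

// alternative: size the result once, then copy each piece into place
function BuildPrealloc(const Piece: AnsiString): AnsiString;
var
  i, len: Integer;
begin
  Result := '';
  len := Length(Piece);
  if len = 0 then
    Exit;
  SetLength(Result, len * COUNT); // single allocation
  for i := 0 to COUNT - 1 do
    Move(Piece[1], Result[1 + i * len], len);
end;

var
  a, b: AnsiString;
begin
  a := BuildNaive('something');
  b := BuildPrealloc('something');
  WriteLn('same content: ', a = b, ', length: ', Length(b));
end.
```

The first variant leans on the memory manager's ReallocMem on almost every iteration, which is why it benefits most from a faster MM; the second barely touches the MM at all.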

@mdbs99
Yes, I think I will make the unit fully stand-alone, and put it into its current state to mORMot 1.18 - named SynFPCx64MM.pas I suppose.

@RObyDP
No Delphi version: use FastMM4/5/AVX instead (as noted in the source code). The asm details are not the same, and I focus on FPC/Linux. Even Delphi/Linux is not an option to me.
I guess fpcx64mm may be slightly faster than those, but I didn't make any test.

macfly · 2020-05-08 17:20:16

Thanks ab for your usual attention and detailed information.

emk · 2020-05-08 20:08:05

@ab,
I'm not using Linux yet, but THANK YOU! for your hard work and attention to detail.

edwinsn · 2020-05-09 05:35:01

emk wrote:

@ab,
I'm not using Linux yet, but THANK YOU! for your hard work and attention to detail.

Please note that fpcx64mm is **also for Windows** under FPC.

ab · 2020-05-09 10:06:21

I have made the unit truly stand-alone, and also included it in mORMot 1.18.
Please check https://synopse.info/fossil/info/0374f51eb7a0a9bd

danielkuettner · 2020-05-09 11:21:23

I can confirm fpcx64mm is faster than cmem (in my test scenario) and stable.

macfly · 2020-05-09 13:58:13

As fully standalone, it just got better.

Checklist to use only FPC + Lazarus in new projects.
[X] Best memory manager.

No pending.

ab · 2020-05-09 15:00:45

@danielkuettner
This is good news! Especially when we know cmem = glibc = ptmalloc which has been very optimized through the years.
The more feedback, the better!
Just a question: faster, how much?

BeRo1985 · 2020-05-09 18:49:47

Hm, it crashes on me at Application.Initialize under Windows. Here is the stack trace:

#0 _FREEMEM(0x0) at mormot.core.fpcx64mm.pas:1463
#1 ?? at :0
#2 ?? at :0
#3 SYSTEM_$$_FREEMEM$POINTER$$QWORD at :0
#4 FREEDATA({DESCRIPTION = {FORMAT = RICFGRAY, WIDTH = 310, HEIGHT = 310, DEPTH = 1, BITORDER = RIBOBITSINORDER, BYTEORDER = RIBOLSBFIRST, LINEORDER = RILOTOPTOBOTTOM, LINEEND = RILETIGHT, BITSPERPIXEL = 1, REDPREC = 1, REDSHIFT = 0, GREENPREC = 0, GREENSHIFT = 0, BLUEPREC = 0, BLUESHIFT = 0, ALPHAPREC = 0, ALPHASHIFT = 0, MASKBITSPERPIXEL = 0, MASKSHIFT = 0, MASKLINEEND = RILETIGHT, MASKBITORDER = RIBOBITSINORDER, PALETTECOLORCOUNT = 0, PALETTEBITSPERINDEX = 0, PALETTESHIFT = 0, PALETTELINEEND = RILETIGHT, PALETTEBITORDER = RIBOBITSINORDER, PALETTEBYTEORDER = RIBOLSBFIRST}, DATA = 0x0, DATASIZE = 0, MASK = 0x0, MASKSIZE = 0, PALETTE = 0x0, PALETTESIZE = 0}) at graphtype.pp:1310
#5 MASKHANDLENEEDED(0x61a00d0) at include\icon.inc:753
#6 GETMASKHANDLE(0x61a00d0) at include\icon.inc:576
#7 HANDLENEEDED(0x61a00d0) at include\icon.inc:1390
#8 RELEASEHANDLE(0x61a00d0) at include\icon.inc:1363
#9 BIGICONHANDLE(0x66927f0) at include\application.inc:1158
#10 ICONCHANGED(0x66927f0, 0x61a00d0) at include\application.inc:1115
#11 CHANGED(0x61a00d0, 0x61a00d0) at include\graphic.inc:56
#12 CHANGED(0x61a00d0, 0x61a00d0) at include\rasterimage.inc:395
#13 ENDUPDATE(0x61a00d0, true) at include\rasterimage.inc:271
#14 LOADFROMSTREAM(0x61a00d0, 0x4690910, 35898) at include\rasterimage.inc:457
#15 LOADFROMSTREAM(0x61a00d0, 0x4690910) at include\rasterimage.inc:417
#16 READDATA(0x61a00d0, 0x4690910) at include\icon.inc:801
#17 LOADFROMRESOURCEHANDLE(0x61a00d0, 4294967296, 4303999108) at include\icon.inc:1440
#18 INITIALIZE(0x66927f0) at include\application.inc:459
#19 main at BeRoSequencer.lpr:189

The FPC version is SVN trunk from 03.03.2020 and the Lazarus version is the 2.0.7 patches SVN branch, also from 03.03.2020, with some small fpc-syntax-change-bugfix changes of my own, so that it compiles with FPC 3.3.1 without any compilation errors.

Last edited by BeRo1985 (2020-05-09 18:53:35)

ab · 2020-05-09 19:37:33

@bero
We validated here with FPC 3.2 fixes, not 3.3.1/trunk.

Also ensure you have the latest revision.
I fixed some problems specific to Win64.

danielkuettner · 2020-05-09 20:12:33

Btw I used fpc 3.3.1 without issues

macfly · 2020-05-10 01:27:26

@ab, the file SynFPCx64MM.pas is missing on git.

mpv · 2020-05-10 09:50:53

I tested SynFPCx64MM on three different server scenarios (Environment: Linux 5.3.0-51-generic (cp65001) 8 x Intel(R) Core(TM) i5-8300H CPU @ 2.30GHz (x64) FPC 3.2_fixes)

1) Samples/21 - HTTP Client-Server performance with KeepAlive (default socket server behavior)
2) Samples/21 - HTTP Client-Server performance without KeepAlive
3) UnityBase server in ORM mode (http server in HTTP 1.0 (no KeepAlive), many string concatenations / object allocation / many SpiderMonkey C library calls)

and compared the results with FPC_SYNCMEM (SynFPCCMemAligned). The testing methodology is the same as the one we use for HTTP socket server benchmarking - see this link

TL;DR:
a) CMem is faster in all scenarios
b) when the application uses C libraries (SpiderMonkey in my case, but most apps use C libraries such as libpq, OpenSSL, libcurl, ODBC, OleDb etc.), CMem uses less memory. Maybe I'm wrong, but most likely with CMem both the C library and the FPC code use the same MM, while with SynFPCx64MM the FPC code has its own memory pool?

I understand that the speed problems may be solved after some optimizations, but I don't see how to solve the memory problem.

Detailed results:

Keep-Alive /timestamp request ON (wrk -t8 -c400 -d5s http://localhost:8881/root/timestamp)

MM            RPS        MaxMem    Latency
FPC_SYNCMEM   176 145    18 220     1.11ms
FPC_X64MM     102 208    20 040    27.56ms

As we can see, with the new MM the average request latency is very high (for each HTTP request, many FPC structures are allocated concurrently here)

A closer-to-real-life scenario (which prevents a thread spawn for each keep-alive request) is
HTTP 1.0 Keep-Alive OFF /timestamp request (wrk -t8 -c400 -d5s http://localhost:8881/root/timestamp)

MM            RPS        MaxMem    Latency
FPC_SYNCMEM    68 654     9 596     6.20ms
FPC_X64MM      64 221     9 724     8.03ms

UnityBase server scenario: Keep-Alive OFF, many string concatenations / object allocations / many C library calls (SpiderMonkey) / SQLite3 in WAL mode

MM            RPS        MaxMem     Latency
FPC_SYNCMEM    24 140    266 128     0.93ms
FPC_X64MM      22 257    534 688     1.04ms

The biggest problem here is MaxMem - with the new MM the application uses twice as much memory. I think this is because the C library (SpiderMonkey) and SynFPCx64MM each allocate their own memory pool (with FPC_SYNCMEM both use the same CMem-allocated memory pool)

Last edited by mpv (2020-05-10 09:55:02)

mpv · 2020-05-10 11:19:08

I have often wondered why, in the Delphi/FPC world, we need a memory manager like FastMM at all, while in the C world the libc memory manager is enough for most scenarios (only some very specific software uses jemalloc, and rarely).

From my POV, the main reason is the RTL data structures a typical Delphi program uses (and, for historical reasons, a typical FPC program too). Most of them were created (by Borland in the 1990s) for GUI applications and are not suitable for server-side use.
And (IMHO) the way to speed up an application is not the memory manager but Algorithms + Data Structures = Programs. And Data Structures are what we miss in Pascal.

I don't even want to talk about the global lock in TPersistent.Destroy() or Notify in TList (these are not related to the memory manager, and are fixed in TSynPersistent / TfpList), but rather about, for example, TMemoryStream.

TMemoryStream uses a monolithic memory block to hold its content. Every time it grows, the memory block is reallocated. As a result we get huge memory fragmentation. For such scenarios, people in the C world use a linked list of small buffers (ideally the linked list node size should match the memory page size).
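
A linked list of fixed-size buffers, as described above, could be sketched in Free Pascal as follows. This is an illustration only: TChunk, TChunkedBuffer and the helper routines are hypothetical names, and the 4096-byte payload is an assumed page size (the node itself is slightly larger because of its header fields).

```pascal
program ChunkedBuffer;
{$mode objfpc}

const
  CHUNK_SIZE = 4096; // assumed payload per node, roughly one memory page

type
  PChunk = ^TChunk;
  TChunk = record
    Next: PChunk;
    Used: Integer;                          // bytes filled in Data
    Data: array[0..CHUNK_SIZE - 1] of Byte;
  end;

  TChunkedBuffer = record
    First, Last: PChunk;
    TotalSize: PtrUInt;
  end;

// append Len bytes; existing chunks are never moved or reallocated
procedure AppendData(var Buf: TChunkedBuffer; const Src; Len: Integer);
var
  p: PByte;
  n: Integer;
  c: PChunk;
begin
  p := @Src;
  while Len > 0 do
  begin
    if (Buf.Last = nil) or (Buf.Last^.Used = CHUNK_SIZE) then
    begin
      New(c);                               // tail is full: link a new chunk
      c^.Next := nil;
      c^.Used := 0;
      if Buf.Last = nil then
        Buf.First := c
      else
        Buf.Last^.Next := c;
      Buf.Last := c;
    end;
    n := CHUNK_SIZE - Buf.Last^.Used;       // room left in the tail chunk
    if n > Len then
      n := Len;
    Move(p^, Buf.Last^.Data[Buf.Last^.Used], n);
    Inc(Buf.Last^.Used, n);
    Inc(Buf.TotalSize, n);
    Inc(p, n);
    Dec(Len, n);
  end;
end;

function ChunkCount(const Buf: TChunkedBuffer): Integer;
var
  c: PChunk;
begin
  Result := 0;
  c := Buf.First;
  while c <> nil do
  begin
    Inc(Result);
    c := c^.Next;
  end;
end;

procedure FreeBuffer(var Buf: TChunkedBuffer);
var
  c, following: PChunk;
begin
  c := Buf.First;
  while c <> nil do
  begin
    following := c^.Next;
    Dispose(c);
    c := following;
  end;
  Buf.First := nil;
  Buf.Last := nil;
  Buf.TotalSize := 0;
end;

var
  Buf: TChunkedBuffer;
  Block: array[0..9999] of Byte;
begin
  FillChar(Buf, SizeOf(Buf), 0);
  FillChar(Block, SizeOf(Block), $AA);
  AppendData(Buf, Block, SizeOf(Block));    // 10000 bytes = 4096 + 4096 + 1808
  WriteLn('bytes: ', Buf.TotalSize, ', chunks: ', ChunkCount(Buf));
  FreeBuffer(Buf);
end.
```

Growth never copies existing data, but reading the content back means walking the list, which is the trade-off raised later in the thread.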

Another scenario is string concatenation. It's very simple in Pascal to write s1 := s2 + 'some text', but this kills performance in a multi-threaded application (same in Java/.NET).
SynCommons.TTextWriter helps here (BTW it has the same problem as TMemoryStream, but it is in any case faster than string concatenation), but TTextWriter is in mORMot, not in the RTL (in .NET / Java, StringBuilder is part of the standard library).

Or the SetLength() function - it's very simple to call, and in a typical Pascal program it is called many times (even if the developer does not call it directly, it is called by the RTL, for example in TList), but it reallocates a monolithic memory block and we get memory fragmentation as a result.
A good solution here is to use data structures that fit the task we need to solve. Ideally, a pre-allocated memory segment of fixed size: a RingBuffer, for example. But we miss it in the RTL.

Even when we do have something in the RTL, for example TStack / TQueue, internally it uses a TList (TList = memory fragmentation) instead of a linked list as in Java, for example (I'm sure it is the same in other languages).

ab · 2020-05-10 12:34:54

@mpv
Very interesting!

Your tests are clearly very multi-threaded. In mostly single-threaded process, FPCx64MM is faster than libc (50% faster when allocating strings).
On my computer, the mORMot tests run in 1m04 with FPCx64MM, 1m10 with libc, and 1m18 with FPC default heap.
I am able to reproduce your behavior in the regression tests, and have a lot of sleep() calls, writing:

procedure TDDDThreadsThread.Execute;
...
  for i := 1 to 10000 do
    fHttpClient.ServerTimestampSynchronize; // force /timestamp

So I guess the libc MM leads in multi-thread context, due to its per-thread cache https://sourceware.org/glibc/wiki/Mallo … 8tcache.29
We may add such a cache to our memory manager. Or a set of arenas, as FastMM5 does.

The memory consumption difference in SpiderMonkey tests may indeed come from the shared libc memory manager by the whole process.
You are right: it is a big advantage, in which using our own pascal MM could never win. Unless we use it also as libc malloc/free replacement. It is possible for the process IIRC with some linking tweaks as stated by https://www.gnu.org/software/libc/manua … alloc.html

A linked list of buffers also has its drawbacks: it is worse in terms of memory fragmentation, and awful in terms of CPU cache misses. A simple dynamic array staying in the CPU L1/L2 cache is faster - this is what C game engines do: they don't use linked lists, but contiguous arrays.
About SetLength, I guess it is not so simple. It doesn't move any data if the current chunk is already big enough. For strings, there is a call to _MemSize() in the FPC RTL to do that, and avoid any _reallocmem call. And _memsize is thread-safe with no lock with FPCx64MM. Also, _reallocmem is able to do the same and not move any data when you use a raw dynamic array. Also note that mremap is called by _reallocmem for huge blocks with libc and FPCX64MM (but not by the FPC MM), which avoids moving data and just lets the kernel adjust the memory page mappings (TLB).
The RingBuffer is what a TDynArray is for - or with mORMot, a dynamic array with an external count: you set the capacity, then you use its external Count integer for the actual size. This is what we usually do in mORMot. Having contiguous items doesn't waste any memory as linked lists do, and allows a fast CPU-cache-friendly brute force scan.
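
The "dynamic array with an external count" idiom can be sketched in plain Free Pascal without any mORMot unit. The record and procedure names and the growth formula below are assumptions made for the example (mORMot's own TDynArray wrapper has its own growth policy); the point is only that Length() acts as the capacity while a separate Count tracks the used part.

```pascal
program ExternalCount;
{$mode objfpc}

type
  TIntBuffer = record
    Items: array of Integer; // Length(Items) is the capacity
    Count: Integer;          // number of items actually in use
  end;

procedure AddItem(var B: TIntBuffer; Value: Integer);
begin
  if B.Count = Length(B.Items) then
    // grow geometrically so that reallocations stay rare
    // (hypothetical growth formula, chosen for the demo)
    SetLength(B.Items, 16 + Length(B.Items) * 2);
  B.Items[B.Count] := Value;
  Inc(B.Count);
end;

var
  Buf: TIntBuffer;
  i, sum: Integer;
begin
  for i := 1 to 100 do
    AddItem(Buf, i);
  sum := 0;
  for i := 0 to Buf.Count - 1 do // scan only the used part, contiguously
    Inc(sum, Buf.Items[i]);
  WriteLn('count: ', Buf.Count, ', capacity: ', Length(Buf.Items),
    ', sum: ', sum);
end.
```

With this formula, 100 insertions trigger only three SetLength calls (capacity 16, then 48, then 112), and the scan loop touches one contiguous block of memory.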

mpv · 2020-05-10 16:47:17

I also checked SynDBPostgres in the scenario I use to test its speed in multi-threaded mode - see this post (run query + FetchAllAsJSON)
I ran the test scenario using the low-level BeginThread function to avoid any TThread overhead

for nt := 1 to T do
    BeginThread(@singleThreadPerfTestDBPostgres, pointer(nt));

for T = 1, 4, 16 the result is (QPS = query + fetch per second):

MM \ threads   1 QPS   1 MaxMem   4 QPS   4 MaxMem   16 QPS   16 MaxMem
FPC_SYNCMEM   14 552     13 392  35 061     13 776   45 710      15 432
FPC_X64MM     15 017     13 404  34 812     13 964   45 141      15 664

So you are right - in single-thread mode the new MM is faster than CMem, and in multi-thread mode it is slower. (CMem thread-level cache)

The memory consumption difference grows with the thread count (libpq does not allocate much, but the difference is measurable). I read the topic about how to replace CMem for dynamically linked programs. But to guarantee that it works, additional procs would have to be implemented (aligned_alloc, malloc_usable_size, memalign, posix_memalign, pvalloc, valloc), so I didn't test a replacement.

I understand and agree with your arguments about SetLength.

About linked list vs array - I will investigate deeper (for example by implementing a simple TTextWriter that uses a linked list of 4 KB chunks). For example, in the article about the CMem per-thread cache the author uses a linked list.

Last edited by mpv (2020-05-10 16:49:03)

ab · 2020-05-10 19:33:31

TThread has almost no overhead in the FPC RTL, IIRC.

About MaxMem, what is it? Virtual memory? Shared memory?
The only interesting value is the resident memory (RES in top/htop): it is the actual memory consumed by the process.

Please check https://synopse.info/fossil/info/85a4927449
I have added a set of round-robin arenas of smallest blocks (<=128 bytes) for better thread scaling.
Also with a new FPCMM_BOOST experimental mode, which may be tested on some HW.

From my tests, multi-thread performance is now similar to CMem (or even slightly faster), e.g. when using 40 concurrent clients making 10,000 root/timestamp requests in parallel in the same process.
Your own feedback is welcome!
