Fast MM5

mpv · 2020-05-11 11:02:20

I measure each program with the command /usr/bin/time -v program; MaxMem is the "Maximum resident set size (kbytes)" value from its output. (The shell built-in time and /usr/bin/time are different things!)

For the HTTP test I run the program, then in a second terminal I run wrk 5 times with a 15 sec delay between runs (in keep-alive mode the HTTP server keeps eating ~50% of CPU for ~10 seconds after the load ends), and select the best result.

I added a BOOST mode test. In non-boost mode UnityBase hangs.

In boost mode x64MM wins over CMem on the /timestamp test!

Updated results:

SynDBPostgres - results for CMem, x64MM and x64MM (boost) are nearly the same, with a small x64MM win

Keep-Alive /timestamp request ON (wrk -t8 -c400 -d5s http://localhost:8881/root/timestamp)

MM            RPS         MaxMem    
FPC_SYNCMEM   170 369        19 012
FPC_X64MM     171 654        17 120
FPC_X64MM(b)  173 626        20 184 (average RPS is much better compared to w/o boost)

NO Keep-Alive /timestamp request ON (wrk -t8 -c400 -d5s http://localhost:8881/root/timestamp)

MM            RPS         MaxMem 
FPC_SYNCMEM   68 077         9 832  
FPC_X64MM     66 748         9 896
FPC_X64MM(b)  68 406        11 228 (average RPS is much better compared to w/o boost)

UnityBase server scenario. Keep-Alive OFF, many string concatenations / object allocations / many SpiderMonkey C library calls / SQLite3 in WAL mode

MM            RPS         MaxMem 
FPC_SYNCMEM   23887        266 324
FPC_X64MM     hangs after ~15000 http requests
FPC_X64MM(b)  23117        970 896

Last edited by mpv (2020-05-11 11:04:58)

mpv · 2020-05-11 11:22:18

In my test suite I also got an AV in _FreeMem on the marked line (in both boost and non-boost modes):

function _FreeMem(P: pointer): PtrInt; nostackframe; assembler;
asm
  {$ifdef MSWINDOWS}
  push rsi
  {$else}
  mov rcx, P
  {$endif MSWINDOWS}
  push rbx
  {Get the block header in rdx}
  mov rdx, [P - BlockHeaderSize]  <-----------

From the call stack, it comes from some TFPObjectList.Destroy call.

The same AV occurs with the initial version of SynFPCx64MM (from 2020-05-09), so the problem is not in the latest optimizations.

P.S.
The AV occurs randomly while some object is being destroyed (different objects each time).

Last edited by mpv (2020-05-11 11:31:30)

ab · 2020-05-11 13:19:14

Isn't it an object released twice?
Or with P=nil ?

Please test with the latest version:
https://synopse.info/fossil/info/dafd1bbe3b

macfly · 2020-05-11 17:05:50

My Results (Ubuntu inside a VM).

Ubuntu 20.04 LTS - Linux 5.4.0-29-generic (cp65001)
    4 x Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz (x64)
Using mORMot 1.18.5979
    TSQLite3LibraryStatic 3.31.0 with internal MM
Generated with: Free Pascal 3.2 64 bit compiler

Time elapsed for all tests: 1m13

Total assertions failed for all test suits:  0 / 43,122,020
! Some tests FAILED: please correct the code.

With FPC_X64MM

Ubuntu 20.04 LTS - Linux 5.4.0-29-generic (cp65001)
    4 x Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz (x64)
Using mORMot 1.18.5979
    TSQLite3LibraryStatic 3.31.0 with internal MM
Generated with: Free Pascal 3.2 64 bit compiler

Time elapsed for all tests: 1m10

Total assertions failed for all test suits:  0 / 43,117,753
! Some tests FAILED: please correct the code.

SynFPCx64MM Memory Usage Report:
 Medium:  22MB/197MB  sleep=1
 Large:     0B/738MB  sleep=0
 Sleep:    count=919
 Small Waits: getmem=0 freemem=918
 freemem(288)=102 freemem(48)=86 freemem(48)=77 freemem(48)=75 freemem(48)=74 freemem(48)=69 freemem(48)=68 freemem(48)=65 freemem(48)=55 freemem(224)=24 freemem(160)=23 freemem(64)=21 freemem(240)=19 freemem(192)=19 freemem(64)=17 freemem(64)=17 freemem(144)=14 freemem(64)=13 freemem(64)=13 freemem(64)=11 freemem(64)=10

mpv · 2020-05-12 11:27:33

ab wrote:

Isn't it an object released twice?

I don't think the problem is an object released twice (with CMem everything works as expected). The latest MM changes fix the AV - I am now able to execute the UnityBase regression/integration tests and they pass (almost always).
This is very good - my integration tests verify real-life client-server scenarios (in multi-thread mode with external database access, SpiderMonkey calls, etc.)

For now I still get an AV about once per 20 test executions (this does not happen with CMem) - I will try to reproduce it in my development environment to give more details.

ab · 2020-05-12 18:17:38

Ensure you use the latest version.
It includes code simplification and some fixes.

Please check https://synopse.info/fossil/info/796ec5c965

urhen · 2020-05-12 19:38:55

@mpv
Could you please include FastMM4 in your tests? It also works on Linux with FPC and might give an indication on how much it improved compared to the 'normal' version

mpv · 2020-05-12 19:56:19

Now it's stable!

The memory leak is fixed in https://synopse.info/fossil/info/796ec5c965

I wrote a memory-related test: for each HTTP request the server (8 threads) runs a random number (1-100) of DB queries (each returns 1 record from a table using a random ID), serializes each query result into JSON with 1 record and passes it to the JS engine. In JS the query results are concatenated into an array, sorted and returned in the HTTP response.

Before you fixed the memory leak, after 150 sec of stress test (~120 000 HTTP requests, >1 000 000 DB queries) I got 5 GB of memory usage. Now it is ~200 MB (this is OK, about the same as CMem).

BTW - I take back my words about "CMem shares memory pools between SpiderMonkey and FPC" - SpiderMonkey uses its own arena-based allocator (from the NSPR library), so memory pools are not shared.

I also can't reproduce the AV anymore (maybe it was related to the memory leak in the previous version).

After testing for a memory leak with valgrind (simple start/stop of the service, without payload) I discovered 2 things:
1) no MM memory leak anymore (the 24 blocks are my problem, not related to the memory manager)
2) the new MM uses fewer heap allocations compared to CMem

CMem valgrind results

==8047== HEAP SUMMARY:
==8047==     in use at exit: 71,117 bytes in 24 blocks
==8047==   total heap usage: 201,323 allocs, 201,299 frees, 191,894,022 bytes allocated

SynFPCx64MM (in BOOST mode)

==8158== HEAP SUMMARY:
==8158==     in use at exit: 71,117 bytes in 24 blocks
==8158==   total heap usage: 115,637 allocs, 115,613 frees, 160,543,207 bytes allocated

About speed. I test using UnityBase in TechEmpower `/db` scenario (heavy multi-thread):

SynFPCx64MM (BOOST) - 39 792 RPS 289 732 Kb memory
SynFPCCMemAligned - 40 547 RPS 290 460 Kb memory

Results are nearly the same. This does not mean that x64MM is slower than CMem; the point is just to ensure that in near-real-life tests we do not lose performance.

Congratulations!

mpv · 2020-05-12 20:01:21

FastMM4 does not work as expected under Linux with FPC x64 - see https://synopse.info/forum/viewtopic.php?id=4330 (at least as of 1 year ago)

Last edited by mpv (2020-05-12 20:01:36)

urhen · 2020-05-12 21:56:07

mpv wrote:

FastMM4 does not work as expected under Linux with FPC x64 - see https://synopse.info/forum/viewtopic.php?id=4330 (at least as of 1 year ago)

That could be the reason why we had issues with 32-bit FPC some time ago but those issues are gone since almost a year.
We use FastMM4 for FPC since 2017 or 2018 without problems (except the one issue which vanished after some time and compiler updates).

Last edited by urhen (2020-05-12 21:57:09)

ab · 2020-05-13 07:51:47

@mpv
Thanks again for the very detailed feedback!
Yes, from my side too the MM seems more stable and with no leak. It can now properly report user memory leaks, so it seems not to leak by itself any more.
These were in fact not direct leaks, but a consequence of how FPC expects ReallocMem(nil, size) or ReallocMem(P, 0) to behave like Getmem(size) or Freemem(P), which is not the case with Delphi. Some reallocmem() calls with those patterns (I guess from the FPC RTL) were not handled properly, so some memory was never released.
Also note that when FPCMM_REPORTMEMORYLEAKS is defined, the first qword of any Freemem() block is set to 0, so the TObject VMT or string/dynarray header is reset (with no performance penalty). As a consequence, a GPF would occur immediately in case of messing with the memory pointers: it will help identify e.g. accessing an object or reallocating a buffer once freed, or double releases.
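
The two ReallocMem() patterns mentioned above can be checked with a few lines. This is only a minimal sketch of the behaviour FPC code relies on, not code taken from SynFPCx64MM; the program and function names are made up for the illustration.

```pascal
program ReallocSemantics;
{$mode objfpc}{$H+}
{$assertions on}

// Allocates and releases a block using ReallocMem() only.
// Returns True if the pointer is nil again at the end.
function ReallocRoundTrip(Size: PtrUInt): boolean;
var
  P: pointer;
begin
  P := nil;
  ReallocMem(P, Size);       // P = nil: must behave like GetMem(P, Size)
  Assert(P <> nil);
  FillChar(P^, Size, $2A);   // the whole block must be usable
  ReallocMem(P, 0);          // Size = 0: must behave like FreeMem(P)
  Result := P = nil;         // the FPC default heap manager also resets P to nil
end;

begin
  Assert(ReallocRoundTrip(64));
  writeln('ReallocMem(nil, n) / ReallocMem(P, 0) round trip OK');
end.
```

A memory manager that treats ReallocMem(P, 0) as "shrink to a minimal block" instead of a release would keep such a block alive, which matches the kind of leak described above.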

@urhen
I was never able to use FastMM4 on Linux/FPCx64 in production due to several bugs in its implementation on that target.
Note also that FastMM4 disables its x86_64 asm code, which is Win64 specific. And using this "purepascal" version of FastMM4 is of little benefit compared to the FPC standard MM, or the libc malloc/free.

I am now happy with the current asm state of our MM.
FastMM4 assembly was sometimes really awesome, and sometimes very verbose/redundant and complex.
I also deleted the "MoveUp" procedures, which were of little benefit in my benchmarks compared to inlining a sse2 movaps loop. So I could reuse the MoveUp procedure slot in the block type info to track the getmem/freemem call counts, and add very useful statistics about small block allocation usage.
There are far fewer asm lines in our MM than in FastMM4, which is a good sign.

The fact that we are close to the libc allocator, in terms of performance, is very good.
I had troubles with libc on some Linux systems, where free() over a dangling pointer does not just trigger a GPF, but aborts/kills the process, which is not what we expect on a production server. The libc seems to be compiled with paranoid settings on some systems, with no option to make it less paranoid (the tuning API seems to have no effect in SynFPCCMemAligned).

So I will use our MM as standard for all my projects so far (on Linux), to ensure it is stable and efficient as expected.
At least, if there is some GPF, it will be more explicit about its context, and may help debugging and enhance stability.

BeRo1985 · 2020-05-14 04:36:29

ab wrote:

@bero
We validated here with FPC 3.2 fixes, not 3.3.1/trunk.
Also ensure you have the latest revision.
I fixed some problems specific to Win64.

The most current mormot.core.fpcx64mm.pas still crashes for me with Free Pascal 3.3.1 SVN revision 45361 and Lazarus 2.0.9 fixes branch SVN revision 63147 (with my small patch http://rootserver.rosseaux.net/stuff/la … .3.1.patch to compile the stable Lazarus 2.0.9 fixes branch with FPC 3.3.1 SVN trunk), both updated and freshly compiled on 14.05.2020. As far as I can see, it always crashes somewhere in the image resource loading (icon, bitmap, etc.) code: at Application.Initialize for the app icon, or later at Application.CreateForm for the TImage and TImageList components, and so on.

I think, I should resume my work on my own lock-free memory manager, which is based on the core concepts of https://github.com/mjansson/rpmalloc.

Last edited by BeRo1985 (2020-05-14 04:43:08)

ab · 2020-05-14 07:23:58

@Bero
mormot.core.fpcx64mm.pas is not always in sync.
I work on and validate https://synopse.info/fossil/finfo?name=SynFPCx64MM.pas which seems stable now.
Good idea to have alternatives! I just wanted to build on the FastMM4 main algorithms, because they really keep fragmentation low.

And from my findings, a lock-free structure may not necessarily be faster than a structure with small locks. It just never sleeps and doesn't call the OS, but it still needs to spin and asm "pause" on contention.
When I introduced https://synopse.info/fossil/info/ca971c118945a761 there was only a very slight performance benefit for a heavily multi-threaded process (20 HTTP client threads bombarding a multi-thread server).
It called the OS much less, but performance was almost the same. Lock-free structures are not magic. Even a CAS/ABA pattern has an overhead and needs to spin if the same slot is shared among threads and the CAS failed.
Compared to FastMM4, for SynFPCx64MM we defined a lot of small independent locks, to reduce contention. But in our HTTP client/server benchmark, on my CPU all cores are used, with 50% in the kernel, even if there is no explicit OS call, just "pause".
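
To make the "CAS still has to spin" point concrete, here is a small sketch of a lock-free counter. It is not code from SynFPCx64MM: the names CasIncrement, Counter and Retries are invented for the example, and it only uses InterlockedCompareExchange() from the FPC System unit.

```pascal
program CasSpin;
{$mode objfpc}{$H+}
{$assertions on}
uses
  {$ifdef unix} cthreads, {$endif}
  Classes;

var
  Counter: longint = 0;   // slot shared by all threads
  Retries: longint = 0;   // number of CAS attempts which failed and had to retry

// Lock-free increment: read the slot, then publish the new value with a CAS.
// If another thread changed the slot in between, the CAS fails and we spin.
procedure CasIncrement(var Target: longint);
var
  seen: longint;
begin
  repeat
    seen := Target;
    if InterlockedCompareExchange(Target, seen + 1, seen) = seen then
      exit;                          // our value was stored
    InterlockedIncrement(Retries);   // contention: try again
  until false;
end;

type
  TWorker = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TWorker.Execute;
var
  i: integer;
begin
  for i := 1 to 100000 do
    CasIncrement(Counter);
end;

var
  workers: array[0..3] of TWorker;
  i: integer;
begin
  for i := 0 to High(workers) do
    workers[i] := TWorker.Create(false);
  for i := 0 to High(workers) do
  begin
    workers[i].WaitFor;
    workers[i].Free;
  end;
  Assert(Counter = 4 * 100000);      // no update was lost
  writeln('counter=', Counter, ' failed CAS attempts=', Retries);
end.
```

The number of failed attempts varies from run to run and from machine to machine; on a multi-core CPU it is usually not zero, and that retry work is the overhead discussed above.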

This is the threads overhead. The problem here is not the memory manager, but the multi-thread paradigm.
Event-based HTTP clients and servers scale much better than multi-threaded processes with blocking calls.

EduardAppelhans · 2020-05-14 11:19:58

Hi Arnaud,

Does it then make sense to use such an event based approach with mORMot instead of the thread based approach, or are the practical requirements still missing and would have to be created first ?

If the framework would have to create these prerequisites, how much work would that be and would it be worth it ?

Just to understand the background.

ab · 2020-05-14 12:32:54

The event-based HTTP server is planned for mORMot2.
It is already prepared in https://github.com/synopse/mORMot2/blob … b.sock.pas

mpv · 2020-05-21 12:51:29

@ab - the current (2020-05-21) mORMot master is unstable for FPC/Linux x64, with both MMs: with x64MM I get an AV on application termination, with CMEM the application hangs. I think something is wrong in the string-related commits.

mpv · 2020-05-30 15:49:55

@ab - I merged with the latest trunk and (thanks to DOPATCHTRTL) could pin down the AV problem (FPC 3.2.0 Linux x64).

The reason is the System.FillChar -> FillcharFast redirection.

The System.Move -> MoveFast redirection is OK.
All RTL patches (with DOPATCHTRTL enabled) are also OK.

The only problem is the redirection of FillChar (cnt=-24 is a very strange argument value!)

The common stack trace for both CMEM / x64MM is

#0 FILLCHARFAST(-24, 32) at libs/Synopse/SynCommons.pas:37160
#1 SYSTEM_$$_FILLCHAR$formal$INT64$CHAR at :0
#2 ?? at :0
#3 FGL$_$TFPSLIST_$__$$_SETCAPACITY$LONGINT at :0
#4 RTTI_$SYNCOMMONS_$$_RAWUNICODE$indirect at :0
// below different calls depending on CMEM/x64MM

What can I do to help you identify the reason?

P.S. Maybe because I use generics?

Last edited by mpv (2020-05-30 15:53:54)

ab · 2020-05-31 17:40:42

I guess Count<0 should be rejected (i.e. handled as 0), whereas we use an unsigned comparison.
The FPC RTL FillChar() handles it, whereas we didn't.

So it should be fixed by https://synopse.info/fossil/info/69f664c486
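
The difference between the signed and the unsigned view of cnt=-24 can be shown in a few lines. This is a hypothetical guard written for illustration, not the actual FillCharFast code nor the content of the commit above.

```pascal
program FillCountCheck;
{$mode objfpc}{$H+}
{$assertions on}
{$R-}{$Q-} // the demo reinterprets a negative value on purpose

// Hypothetical guard: a negative or zero count means "nothing to do".
procedure GuardedFill(var Dest; Count: PtrInt; Value: byte);
begin
  if Count <= 0 then   // signed comparison: -24 is rejected here
    exit;
  FillChar(Dest, Count, Value);
end;

// What an unsigned comparison would see for the same count.
function AsUnsigned(Count: PtrInt): PtrUInt;
begin
  Result := PtrUInt(Count);
end;

var
  buf: array[0..31] of byte;
  i: integer;
begin
  FillChar(buf, SizeOf(buf), 0);
  GuardedFill(buf, -24, 32);          // the arguments seen in the stack trace
  for i := 0 to High(buf) do
    Assert(buf[i] = 0);               // nothing was written
  Assert(AsUnsigned(-24) > PtrUInt(High(PtrInt))); // huge when seen as unsigned
  GuardedFill(buf, 8, 32);
  Assert((buf[7] = 32) and (buf[8] = 0));
  writeln('negative count ignored; as unsigned it would be ', AsUnsigned(-24));
end.
```

With only an unsigned "count above some threshold" test, -24 looks like an enormous block size, so the fill runs far past the destination buffer and an AV is the likely outcome.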

About the RTL patches, did you find any performance impact in your tests?
I found some better numbers in the http server, but I may be misled.

mpv · 2020-06-01 16:42:23

After the negative count patch in FillcharFast all my integration tests pass (at least in my dev environment).
I tested the RTL patches performance in real scenarios - performance is nearly the same, with a small (~1-2%) win for the patched RTL. But in such scenarios most of the time is spent on HTTP/database calls.
Please revert localasvoid if possible (or confirm you will not revert it, and I will revert it in my branch only) - without this I can't run my CI (all platform / DB tests).

mpv · 2020-06-02 18:58:47

The current state (FPC 3.2.0):
- Linux x64 target FPC_X64MM + FPCMM_BOOST + DOPATCHRTL passes all my regression tests (SQLite / SynDBPostgres / SynDBOracle / SynDBODBC)
- Linux x64 target FPC_X64MM + FPCMM_SERVER + DOPATCHRTL fails (hangs) if I execute it from the console, but works if I execute it from the Lazarus IDE, so for now I can't identify the reason (will try)
- Windows x64 target FPC_X64MM + FPCMM_BOOST + DOPATCHRTL fails on my CI server (millions of AVs, so I can't even understand the reason because the backlog overflows) - I will try from the IDE when I reboot my dev computer into Windows

Last edited by mpv (2020-06-02 18:59:57)

ab · 2020-06-03 06:46:58

Windows is not fully tested with FPC_X64MM. So there may be a problem somewhere for sure.
And I didn't test DOPATCHTRTL with Windows at all yet. I will soon!

About BOOST/SERVER difference, it is pretty weird.
Perhaps in BOOST mode there is less locking, this may be the reason. With heavy multi-threading, there may be some contention and timeouts from the client side.
Try FPCMM_SERVER + FPCMM_LOCKLESSFREE or FPCMM_SERVER + FPCMM_PAUSE to see if the problem comes from the contention.

And without DOPATCHTRTL?

mpv · 2020-06-03 13:35:14

Can you please add an ifdef condition to SynDprUses to disable FPC_X64MM on Windows while it is not tested? My CI uses a single command, lazbuild prog.lpi, with an identical set of defines inside the lpi for compilation on both Lin/Win. I don't know how to "define different defines" inside the lpi depending on the platform...

ab · 2020-06-03 15:05:53

We don't use lazbuild but a regular .sh file here... easier to work with... and no problem having custom switches on the command line...
.lpi are reserved for the IDE / developer side.

Please check https://synopse.info/fossil/info/dde5820b3c

mpv · 2020-06-03 19:16:49

I agree that using a sh script is better, but I can't find a way to embed versionInfo resources into the executable without lazbuild - currently I store them in the lpi file <VersionInfo> section and lazbuild embeds this section correctly on both Win/Lin...

Under Linux FPCMM_SERVER works correctly - my app previously hung because I forgot to remove {$DEFINE FPC_SYNCMEM} just before {$I SynDprUses.inc} at the top of my lpr file - for an unknown reason the app hangs in -O2 optimization mode (dark magic, IMHO).

So on the Linux x64 target both "FPC_X64MM + FPCMM_BOOST + DOPATCHRTL" & "FPC_X64MM + FPCMM_SERVER + DOPATCHRTL" work as expected for SQLite / SynDBPostgres / SynDBOracle / SynDBODBC / SyNode / LibCurl / OpenSSL and a couple of other libs in both O1 & O2 optimization modes!

Under Win x64 I get AVs (even with the patches from above) - I will reboot into Windows now and try to find the reason.

Last edited by mpv (2020-06-03 19:17:59)

ab · 2020-06-03 19:53:20

You can just compile the .res using http://manpages.ubuntu.com/manpages/xen … res.1.html
It is similar to brcc32 from Delphi.

Thanks for the feedback!
Note that the correct conditional is DOPATCHTRTL not DOPATCHRTL - there is an unexpected T which was a typo but I kept it from Delphi RTL patches time.

mpv · 2020-06-03 21:51:53

OK - it finally passes all my tests on FPC+Windows with DOPATCHRTL but without the new MM.

For the new MM in a Windows x64 build both FPC_X64MM (for SynCommons.pas) and FPC_X64MM_WIN (for SynDprUses.inc) MUST be defined - my mistake was to define only one of them.
In this case, with -dDOPATCHTRTL -dFPC_X64MM -dFPCMM_SERVER -dFPC_X64MM_WIN (and the same without DOPATCHTRTL), my tests pass but I get an AV during program shutdown inside ..\inc\dynarr.inc:215 (this is inside the RTL fpc_dynarray_setlength).

I will try to rewrite my CI to use windres - thanks! While I'm on lazbuild I use such a hack and -dDOPATCHTRTL -dFPC_X64MM -dFPCMM_SERVER for a cross-platform build from the same lpi.

mpv · 2020-06-04 09:50:27

@ab - I think the dependency on SynFPCx64MM (or any other memory manager) inside SynCommons should be removed - SynCommons should use GetMem and FreeMem, i.e. the MM defined for the current process.
If an MM call is hardcoded in the unit (even with an ifdef) then this unit is not usable as part of any library (so / dll).

I think this is critical - isn't it?

ab · 2020-06-04 13:50:59

I don't understand why you think any SynFPCx64MM hardcoded call is not usable with dynamic loading. It is a local relative call within the very same executable.

About random GPF, please check https://synopse.info/fossil/info/439953494b
This fixes an awful bug which may be the cause of your AV.

mpv · 2020-06-05 20:11:43

With the latest fixes the GPF moved from ..\inc\dynarr.inc:215 (fpc_dynarray_setlength) to ..\inc\dynarr.inc:84 (fpc_dynarray_clear), but it seems in both cases it occurs while the RTL accesses the dynarray refcount

realp:=pdynarray(p-sizeof(tdynarray));
if realp^.refcount=0 then <---- here

About SynCommons, I mean lines like

{$ifdef FPC_X64MM}result := _Getmem({$else}GetMem(result,{$endif}len+(STRRECSIZE+4));

If my main executable is compiled without FPC_X64MM and a dll is compiled WITH FPC_X64MM, can I load such a dll into the main executable? As I understand from your explanation, the answer is "yes" because memory blocks allocated by the main executable and by the dll do not intersect (each uses its own memory, so each can use a different MM)?
Did I understand correctly? (sorry for the stupid question)

In any case we should be careful and not forget to include {$I SynDprUses.inc} at the top of the dpr in case FPC_X64MM is defined and the dll uses SynCommons.

Last edited by mpv (2020-06-05 20:12:15)

ab · 2020-06-05 20:24:12

The MM is not shared between the dll and the main process. Each one has its own MM: you can't use a getmem() in one part, then freemem() in another.
With FastMM4 you can share the MM. But with FPC, it is not possible IIRC, because there is no easy cross-platform way of doing it.
Check https://wiki.lazarus.freepascal.org/sha … ry_manager
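
Since each module has its own MM, the usual way to pass buffers across the boundary is to let the module which allocated a block also release it. A minimal sketch of that pattern follows; LibNewBuffer and LibFreeBuffer are hypothetical names, and both routines are kept in one program only so that the example runs as-is (in a real project they would be exported from the .so/.dll).

```pascal
program OwnerFrees;
{$mode objfpc}{$H+}
{$assertions on}

var
  LiveBlocks: integer = 0;   // blocks handed out by the "library" and not yet returned

// Allocates with the library's own memory manager.
function LibNewBuffer(Size: PtrUInt): pointer; cdecl;
begin
  GetMem(Result, Size);
  inc(LiveBlocks);
end;

// Releases with the same memory manager which allocated the block.
procedure LibFreeBuffer(Buffer: pointer); cdecl;
begin
  if Buffer = nil then
    exit;
  FreeMem(Buffer);
  dec(LiveBlocks);
end;

var
  buf: pointer;
begin
  buf := LibNewBuffer(256);
  FillChar(buf^, 256, 0);      // the host may use the block...
  LibFreeBuffer(buf);          // ...but hands it back instead of calling FreeMem()
  Assert(LiveBlocks = 0);
  writeln('every block went back to its allocator');
end.
```

With this rule the main executable and the dll can be compiled with different memory managers, because no pointer is ever released by an MM which did not allocate it.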

The GPF sounds like a dangling pointer to me.
Are you sure the dynamic array is not freed twice, e.g. not well protected from one thread to another?
I don't have any GPF on Windows any more here... even with aggressive multi-thread testing... so it is weird.

mpv · 2020-06-10 20:25:06

Thanks for the explanation about the dll's MM! As far as I understand, the only reasonable MM under Linux in my case (I use many dynamic C libraries in my code) is CMEM.

I tried to verify your idea about dangling pointers using valgrind (you know, this is not a trivial task; under Windows I do not know how to deal with it, so I started from Linux) but discovered another problem - when DOPATCHRTL is enabled (with any MM) valgrind can't be used because of an "unrecognized instruction". From my POV _ansistr_decr_ref doesn't contain any unusual instructions (but my asm knowledge is very low)

vex amd64->IR: unhandled instruction bytes: 0x17 0x48 0x89 0xC7 0x48 0x39 0x50 0xF0 0x7C 0x7
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==12169== valgrind: Unrecognised instruction at address 0x4196dc.
==12169==    at 0x4196DC: fpc_unicodestr_decr_ref
==12169==    by 0x428188: SYSTEM_$$_ASSIGN$TEXT$RAWBYTESTRING
==12169==    by 0x42F49B: SYSTEM_$$_SYSENTRY$TENTRYINFORMATION

Can you look at what is wrong? valgrind is a very useful tool and I finally found my problem using it.

The good news is that valgrind now understands SSE4, so we do not need to undefine the HASAESNI conditional in Synopse.inc

ab · 2020-06-10 20:44:12

I guess this is because we overwrite the first bytes of the original fpc_ansistr_decr_ref with the _ansistr_decr_ref bytes, which are shorter, so some bytes are still there and valgrind is lost.
We put a "JMP" instruction, or we copy all the opcodes of the new version... but we don't overwrite all the previous bytes.

Here is the disassembly of the function:
https://gist.github.com/synopse/f98f2cf … a18e24d468

I can't find the bytes reported by valgrind there.

mpv · 2020-06-12 18:51:47

I was able to create a minimal reproducible example of the problem with DOPATCHRTL and valgrind.
The program below throws EResourceNotFound when "Include version info in executable" is not checked in the Lazarus project but FPCUSEVERSIONINFO is defined. This is OK. After this an AV is raised (this is not OK, but not a problem). Under valgrind ./vgerr I get the usual AV output.
But! If compiled with -dDOPATCHTRTL -dFPCUSEVERSIONINFO, valgrind reports unhandled instruction bytes as noted in my previous post.

program vgerr;
uses
  SynCommons;
begin
  writeln('FPCUSEVERSIONINFO + DOPATCHTRTL');
end.

In real life my scenario is completely different - no exception, no AV, just an invalid instruction but with the same bytes. If I compile my real-life project w/o DOPATCHTRTL valgrind is happy, so the problem is not in the exception itself but somewhere in the RTL patch.

P.S.
Arnaud - can you enable an Issues section on mORMot github? I think it's better to discuss such problems there. Or I should create a ticket in fossil (less usable IMHO, markdown is not supported etc.)?

ab · 2020-06-13 10:10:28

It didn't help me find a real root cause of the problem.
I don't think the exception for FPCUSEVERSIONINFO does anything meaningful...
It just raises an exception which is not caught during the initialization.
We will now intercept the exception in case FPCUSEVERSIONINFO is defined but no resource info is available.
Please check https://synopse.info/fossil/info/6273ad17b3

On second investigation, the valgrind bytes come from our patched function.
You can in fact see them in https://gist.github.com/synopse/f98f2cf … a18e24d468
at offset 0C:

        mov     qword ptr [rdi], rdx                    ; 000A _ 48: 89. 17
        mov     rdi, rax                                ; 000D _ 48: 89. C7
        cmp     qword ptr [rax-10H], rdx                ; 0010 _ 48: 39. 50, F0

So it is like valgrind is trying to interpret the opcodes in the middle of the function for no reason.
I don't find any jmp to this address.

I asked myself: couldn't the "unhandled instruction bytes" paranoid check be disabled?
It seems it is not http://valgrind.10908.n7.nabble.com/how … 42100.html

What if you disable those fpc_unicodestr_ patches?

  // PatchJmp(@fpc_unicodestr_incr_ref,@_ansistr_incr_ref,$17);  // fpclen=$2f
  // PatchJmp(@fpc_unicodestr_decr_ref,@_ansistr_decr_ref,$27);  // fpclen=$3f
  // PatchJmp(@fpc_unicodestr_assign,@_ansistr_assign,$3f);      // fpclen=$3f

Also try to disable

      //RedirectRtl(@_fpc_ansistr_concat,@_ansistr_concat_utf8);
      //RedirectRtl(@_fpc_ansistr_concat_multi,@_ansistr_concat_multi_utf8);

since I have found some weird register use of high(s) during my gdb sessions.

mpv · 2020-06-13 12:00:32

I think the problem is in PatchCode. This function changes the program code segments, right? My idea is that valgrind instruments (changes) the program code before PatchCode is called and therefore we get this unexpected result.
OK, we can disable RTL patching while executing our program under valgrind.

But what happens when we execute multiple instances of the program? In this case Linux (and Windows also?) shares the memory for the program code segments (only stack and heap are different). Will the second instance try to patch memory segments already patched by the first instance? Or does the kernel perform a "copy on write" after the first patch, so that a copy of the program code segments (all segments, I think) is created? Sorry, so many questions...

P.S.
Usually I run many server instances on the same host: either one instance per customer, or multiple instances with load balancing for a big customer. Shared code segments help me save a lot of memory.

Last edited by mpv (2020-06-13 12:06:05)

ab · 2020-06-13 12:02:07

There is some COW process. When we patch, we remap the memory page (4KB block) as writable, so it is copied as private in the TLB and patched once.

We have patched the executable on all systems for the vmtAutoRef slot (e.g. TSQLRecord) with no problem for years.
Patching is a standard technique.

mpv · 2020-06-14 11:42:33

I researched more deeply how PatchCode affects the program code segments memory, just to ensure everything is OK.

TL;DR: it gives no performance/memory penalty (a tiny 72Kb memory overuse doesn't really matter), at least under Linux (I don't even know how to do similar research under other OSes).

How I researched:

For the vgerr program from above (compiled without FPCUSEVERSIONINFO) I use sudo pmap `pidof vgerr` | grep "x-.*vgerr" to take a memory snapshot just before the call to SynCommons.InitFunctionsRedirection and after it (for simplicity I put a readln in the first and last lines of InitFunctionsRedirection to pause the program).
Before patching, the program code segment (the memory marked as x) uses 1300K of memory

0000000000400000   1300K r-x-- vgerr

after patching we can see that some segmentation occurs (as expected) but most of the code is still in the 412 segment:

0000000000400000      4K rwx-- vgerr
0000000000401000     36K r-x-- vgerr
000000000040a000     12K rwx-- vgerr
000000000040d000     12K r-x-- vgerr
0000000000410000      8K rwx-- vgerr
0000000000412000   1228K r-x-- vgerr

After this I use the page-types program from this answer to check the physical memory address of the 412 virtual segment before and after the RTL patch

$sudo ./page_types -p `pidof vgerr` -l -N | grep 412
412	1ef69a	1	__RU_lA____M

in my case this is the same physical memory address 1ef69a (patched segments change their physical address).
So the kernel does a COW (copy on write) only for patched blocks (it does not move the main program code to another location).

The second part is to run two or more vgerr programs to ensure they use the same "unpatched" code segments (the 412 segment). And yes - all instances of the same program use the same physical memory address for the 412 segment; only the patched segments (72K total for each instance) are different.
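
The same kind of snapshot can also be taken from inside the program, without pmap, by reading /proc/self/maps (Linux only). The sketch below is an assumption about how one might do it, not part of mORMot: IsExecutableMapping is a made-up helper and the two sample lines in the asserts are invented for the self-check.

```pascal
program ExecMaps;
{$mode objfpc}{$H+}
{$assertions on}
uses
  SysUtils;

// True if a /proc/<pid>/maps line describes an executable mapping.
// The permissions are the second space-separated field, e.g. "r-xp".
function IsExecutableMapping(const Line: string): boolean;
var
  sp: integer;
begin
  sp := Pos(' ', Line);
  Result := (sp > 0) and (Length(Line) >= sp + 3) and (Line[sp + 3] = 'x');
end;

var
  f: TextFile;
  line: string;
begin
  // self-check on two invented sample lines
  Assert(IsExecutableMapping('00400000-00545000 r-xp 00000000 08:01 42 /tmp/vgerr'));
  Assert(not IsExecutableMapping('00745000-00750000 rw-p 00145000 08:01 42 /tmp/vgerr'));
  if not FileExists('/proc/self/maps') then
  begin
    writeln('/proc/self/maps is not available on this system');
    exit;
  end;
  AssignFile(f, '/proc/self/maps');
  Reset(f);
  try
    while not Eof(f) do
    begin
      ReadLn(f, line);
      if IsExecutableMapping(line) then
        writeln(line);   // one line per executable mapping of this process
    end;
  finally
    CloseFile(f);
  end;
end.
```

Running the loop once before and once after the patching call should show the same kind of split into rwx and r-x ranges as the pmap output above, assuming the patch remaps pages the way described in this thread.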

Last edited by mpv (2020-06-14 11:48:43)

ab · 2020-06-14 13:11:57

You didn't trust me?
Or you didn't trust Linux?

mpv · 2020-06-14 16:56:56

In fact, I did not find anything on the internet about such a "patching" technique (maybe I asked Google the wrong questions), so I decided to double-check. Now I'm sure it's OK for my use cases.

ab · 2020-06-14 19:06:38

It is in fact how virtual memory mapping has worked on Intel/AMD CPUs for years.

BeRo1985 · 2020-06-21 21:28:17

A status update from my side:

The latest SynFPCx64MM version now works with my Digital Audio Workstation project without access violations at Application.Initialize() .

As a result it is now possible to load multiple large, multi-gigabyte Soundfont 2.x soundbanks in my Digital Audio Workstation without corrupting the internal heap data structures when using SynFPCx64MM. That corruption happened before with FPC's own default memory manager after a longer runtime, where the internal heap data structures around the Lazarus LCL message queue code were often corrupted.

So there seems to be a bug somewhere in FPC's own memory manager that corrupts these heap data structures after multiple large memory allocations and some runtime, but so far I haven't been able to find out which code location in FPC's own memory manager is responsible.

Hence my warning: avoid using FPC's own memory manager if you have to perform several large memory allocations (>=512MB).

At least I can freeze the work on my own memory manager, as SynFPCx64MM now works for me, and my Digital Audio Workstation project is x86-64-only anyway. Thanks a lot.

Last edited by BeRo1985 (2020-06-21 21:39:04)

ab · 2020-06-22 07:13:40

Thanks a lot BeRo for the feedback!
It is very difficult to have a correct memory manager, this is why we started from FastMM4 which had a very proven code base.

Did you use Windows for your tests?
I guess the advantage of synFpcx64mm for large blocks is that it just wraps mmap/mremap on Posix or VirtualAlloc on Windows.

I am happy to have been helpful to you.
Do you have any information to publish about your upcoming Digital Audio Workstation project?

BeRo1985 · 2020-06-22 11:33:41

ab wrote:

Thanks a lot BeRo for the feedback!
It is very difficult to have a correct memory manager, this is why we started from FastMM4 which had a very proven code base.
Did you use Windows for your tests?
I guess the advantage of synFpcx64mm for large blocks is that it just wraps mmap/mremap on Posix or VirtualAlloc on Windows.
I am happy to have been helpful to you.
Do you have any information to publish about your upcoming Digital Audio Workstation project?

Yes, it's Win64, because of the VST plugin support.

On YouTube I have some videos of versions that are now some weeks/months old:

Mini-DAW-Project - Safri Duo - Played a live - BeRo Remake - Draft 2

Mini-DAW - PasMP Profiler integrated for to show how strong parallized the whole audio engine is

[Mini-DAW] Scorpions - Wind Of Change (with synthesized vocals) (together with my own pascal-native singing-able diphone-based speech synthesizer with the PSOLA-style "Multi-Band Resynthesis OverLap Add" algorithm)

Mini-DAW - Slavko Avsenik - Auf der Autobahn - Techno Remix with extra much BASS

It's fully pascal-native (with some inline assembler parts) and fully multithreaded and work-stealing-fork-join-style parallelized with the help of my PasMP project.

It has a VST 2.x and VST 3.x plugin host. The funny thing is that the VST 3.x plugin spec ABI/API is actually hardcoded against the MSVC C++ class VMT data structure (i.e. its ABI), wrapped into Fake-COM-interface constructs (Steinberg calls their C++ ABI Raw-MSVC-Class-VMT Fake-COM "VST-MA"). It took some time before my VST 3.x host code worked correctly, with dirty hacks for the ABI wrapping in pure Pascal code. Almost all VST3 plugins work now, except those of iZotope and a few others, where I'm guessing that some other C++ ABI problems occur.

It also integrates my own Sobanth Soundbank engine.

Most of the GUI is rendered on the GPU with OpenGL 4.3 or Direct2D/DirectWrite as a compile-time option, and with GDI as a fallback option. I've hijacked the canvas code in the Lazarus LCL for this purpose, so that almost all GDI operations are redirected and emulated with OpenGL 4.3 or Direct2D/DirectWrite (with DirectX 12 feature level). The OpenGL render part uses shader-based 2D Signed-Distance-Fields for its rendering concept, especially for the antialiased text rendering.

I could write more, but I don't have the time right now. It is, or it will be, a full DAW.

Last edited by BeRo1985 (2020-06-22 11:49:38)

BeRo1985 · 2020-06-22 11:41:18

And by the way, I once wrote a GPU memory manager, because with the very explicit Vulkan API one has to manage the GPU vRAM almost completely by oneself.

See:

https://github.com/BeRo1985/pasvulkan/b … k.pas#L656 and
https://github.com/BeRo1985/pasvulkan/b … .pas#L8238 and
https://youtu.be/EKcwo_mrBDM?t=395

for more details.

ab · 2020-06-22 12:51:11

On Win64 it just calls VirtualAlloc for the large blocks so if you don't reallocate them it relies on the OS.

Very interesting project your DAW!

BeRo1985 · 2020-06-22 23:19:15

ab wrote:

On Win64 it just calls VirtualAlloc for the large blocks so if you don't reallocate them it relies on the OS.
Very interesting project your DAW!

Thanks.

Another fact:

My DAW continuously allocates new tiny and small memory blocks (and releases them again shortly afterwards) for the MIDI events and automation events in the PasMP-based thread-safe lockless playback queues, under real-time audio conditions. Depending on the type of song, this can amount to several thousand allocations in far less than one millisecond. So far SynFPCx64MM performs pretty well in this respect. Normally, memory allocations within real-time DSP audio routines are a total no-go and taboo in the audio software industry, because the duration of a memory allocation is not predictable, which is a major disadvantage in real-time applications. But SynFPCx64MM handles it pretty well anyway, without PCM audio buffer underruns in the end, at least with tiny and small memory blocks.
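
For readers wondering what the usual workaround for that taboo looks like, here is a minimal C sketch of a preallocated fixed-size block pool, where allocating and freeing are a few pointer operations with a bounded cost. This is not what the DAW or SynFPCx64MM does: the names `event_pool`, `pool_init`, `pool_alloc` and `pool_free` and the sizes are assumptions for the example, and the sketch is single-threaded (a real audio engine would need one pool per thread or a lock-free free list).

```c
#include <stddef.h>

#define POOL_BLOCK_SIZE  64     /* bytes per block, assumed big enough for one event */
#define POOL_BLOCK_COUNT 1024   /* blocks reserved up front */

/* A free block stores the link to the next free block inside itself. */
typedef union pool_block {
    union pool_block *next;
    unsigned char payload[POOL_BLOCK_SIZE];
} pool_block;

typedef struct {
    pool_block storage[POOL_BLOCK_COUNT];
    pool_block *free_list;
} event_pool;

/* Thread all blocks onto the free list; call once, outside the audio path. */
void pool_init(event_pool *pool) {
    pool->free_list = NULL;
    for (size_t i = POOL_BLOCK_COUNT; i > 0; i--) {
        pool->storage[i - 1].next = pool->free_list;
        pool->free_list = &pool->storage[i - 1];
    }
}

/* Pop one block in constant time; returns NULL when the pool is exhausted. */
void *pool_alloc(event_pool *pool) {
    pool_block *b = pool->free_list;
    if (b != NULL)
        pool->free_list = b->next;
    return b;
}

/* Push a block back in constant time; ptr must come from pool_alloc. */
void pool_free(event_pool *pool, void *ptr) {
    pool_block *b = ptr;
    b->next = pool->free_list;
    pool->free_list = b;
}
```

The price is a fixed block size and a hard capacity limit, which is exactly the constraint a general-purpose memory manager avoids.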

ab · 2020-06-23 07:35:44

Thanks for this interesting feedback too.
FreeMem on tiny/small blocks is indeed where the MM may lock, as far as I have seen in heavily multi-threaded processes.
So it is good news that you find SynFPCx64MM good enough for your multi-threaded app!

Leslie7 · 2020-06-24 07:48:44

BeRo1985,

As you are deep into this stuff, here is an off-topic question related to audio processing; you might be able to point me in the right direction:
I need a relatively simple way to split compressed audio files, and to insert into them, at a given time position, with Delphi or Free Pascal. All platforms are relevant, but the most important ones are Windows, Android and iOS. So far I am struggling to find a working solution for this.

Last edited by Leslie7 (2020-06-24 08:03:42)

pvn0 · 2021-08-10 11:03:10

Is there anyone here who likes to modularize their applications (on Linux) by writing dynamic libraries? If so, do you use a shared memory manager, and which one? Any pitfalls or recommendations?

ab · 2021-08-10 11:53:59

We try to build monolithic apps, but when we use third-party or external libraries, like SQLite3, we provide our own malloc/free versions that redirect to our mormot.core.fpcx64mm.pas unit, which gives very good multi-thread scaling and memory usage.
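
The general idea of routing a library's allocations through the host's memory manager can be sketched in C as below. This is only an illustration under stated assumptions: the hook table `alloc_hooks` and the functions `lib_set_allocator`, `lib_make_buffer`, `lib_drop_buffer` and `host_install` are hypothetical names, not the SQLite3 API and not the mORMot code, and plain `malloc`/`realloc`/`free` stand in for the host memory manager.

```c
#include <stddef.h>
#include <stdlib.h>

/* --- "library" side: every allocation goes through a hook table --- */
typedef struct {
    void *(*alloc)(size_t size);
    void *(*resize)(void *ptr, size_t size);
    void  (*release)(void *ptr);
} alloc_hooks;

/* Default to the C runtime until the host installs its own functions. */
static alloc_hooks lib_hooks = { malloc, realloc, free };

void lib_set_allocator(const alloc_hooks *hooks) { lib_hooks = *hooks; }

char *lib_make_buffer(size_t size) { return lib_hooks.alloc(size); }
void  lib_drop_buffer(char *buf)   { lib_hooks.release(buf); }

/* --- "host" side: wrappers that forward to the host memory manager --- */
static size_t host_live_blocks = 0;   /* blocks currently owned by the library */

static void *host_alloc(size_t size) {
    void *p = malloc(size);           /* stand-in for the host MM's GetMem */
    if (p != NULL)
        host_live_blocks++;
    return p;
}

static void *host_resize(void *ptr, size_t size) {
    if (ptr == NULL)
        return host_alloc(size);
    return realloc(ptr, size);        /* stand-in for the host MM's ReallocMem;
                                         the size == 0 case is not handled here */
}

static void host_release(void *ptr) {
    if (ptr != NULL) {
        host_live_blocks--;
        free(ptr);                    /* stand-in for the host MM's FreeMem */
    }
}

size_t host_live(void) { return host_live_blocks; }

/* Make the library use the host allocator from now on. */
void host_install(void) {
    static const alloc_hooks hooks = { host_alloc, host_resize, host_release };
    lib_set_allocator(&hooks);
}
```

Whatever the mechanism (a hook table as here, or exporting replacement `malloc`/`free` symbols to a statically linked library), the point is that a block is always released by the same manager that allocated it.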

An alternative may be to use the libc memory manager everywhere in the project.
