#1 2023-01-09 16:31:28

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,183

High-Performance Frameworks

This is the discussion topic about integrating mORMot 2 with the TechEmpower Framework Benchmarks (TFB) challenge.
https://github.com/TechEmpower/FrameworkBenchmarks

This is a follow-up of https://synopse.info/forum/viewtopic.php?id=5547 in the new "mORMot 2" thread.

For reference, the current status of the TFB challenge internal rounds is available at https://tfb-status.techempower.com

#2 2023-01-09 16:32:35

ab

Info: the pull request has been merged.
https://github.com/TechEmpower/Framewor … /pull/7833

So I hope next round will show better numbers.
I am very upset with the current "updates" performance on their high-end HW - much slower on their system than on my old laptop!

#3 2023-01-09 19:54:59

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,534

I'm thinking about updates. After a deep dive into the Postgres architecture I have some ideas; I will check them (in a few days) and report the results.

#4 2023-01-10 11:29:26

mpv

It looks like I have fixed the update performance.
The reason is that we update the table in random order, so simultaneous updates lock each other (I figured this out after reading an epic book from the Postgres Pro team, https://postgrespro.ru/education/books/internals, unfortunately available only in Russian).
So I added ORDER BY id (alternatively, we could sort by ID at the application server level, but it is easier at the database level) - see PR134

I set up an update test on a 48-core server: performance (with 512 concurrent connections and /updates?queries=20) increased from ~4k RPS to 16k RPS
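
The lock-ordering idea can also be illustrated at the application level. The following is only a minimal Free Pascal sketch, not the code from PR134 (which adds the ORDER BY on the database side): the TWorldRow record and the SortById procedure are hypothetical names. It sorts the rows of one request by ID, so that every concurrent transaction would take its row locks in the same ascending order.

```pascal
program SortUpdatesById;

{$mode objfpc}

type
  // hypothetical row layout, mirroring the id/randomNumber pair of the World table
  TWorldRow = record
    ID: Integer;
    RandomNumber: Integer;
  end;
  TWorldRows = array of TWorldRow;

// insertion sort by ID: enough for the few rows of a single request
procedure SortById(var Rows: TWorldRows);
var
  i, j: Integer;
  tmp: TWorldRow;
begin
  for i := 1 to High(Rows) do
  begin
    tmp := Rows[i];
    j := i - 1;
    while (j >= 0) and (Rows[j].ID > tmp.ID) do
    begin
      Rows[j + 1] := Rows[j];
      Dec(j);
    end;
    Rows[j + 1] := tmp;
  end;
end;

var
  data: TWorldRows;
  i: Integer;
begin
  Randomize;
  SetLength(data, 20); // as with /updates?queries=20
  for i := 0 to High(data) do
  begin
    data[i].ID := 1 + Random(10000);
    data[i].RandomNumber := 1 + Random(10000);
  end;
  SortById(data);
  // the UPDATE statements would now be issued in ascending ID order
  for i := 0 to High(data) do
    WriteLn(data[i].ID, ' -> ', data[i].RandomNumber);
end.
```

When two requests touch overlapping rows, both now reach the shared rows in the same order, so one simply waits for the other instead of each holding a lock the other needs.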

#5 2023-01-10 12:25:50

ab

This is a great finding!

I have merged your PR.
Thanks for the feedback.

I guess the new round will start today.
I hope they will include the latest trunk, and your previous PR with the new thread layout...

#6 2023-01-10 17:41:22

mpv

As far as I understand, every new round starts from trunk, so it will include the PR with the new thread layout.
While we wait for the results of the new round, I will try to solve the problem with pipelining mode.

#7 2023-01-14 12:56:42

mpv

The results for the new round are ready. I have prepared a historical overview of the composite results - the first row is the weight of each column:

Weights                   1.000    1.737    21.745     4.077   68.363      0.163
Composite #  Framework     JSON  1-query  20-query  Fortunes  Updates  Plaintext  Weighted score  Date - notes
38           mormot     731,119  308,233    19,074   288,432    3,431  2,423,283  3,486           2022-10-26 - 64 thread limitation
43           mormot     320,078  354,421    19,460   322,786    2,757  2,333,124  3,243           2022-11-13 - 112 thread (28CPU*4)
44           mormot     317,009  359,874    19,303   324,360    1,443  2,180,582  3,138           2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51           mormot     563,506  235,378    19,145   246,719    1,440  2,219,248  2,854           2022-12-01 - 112 thread (28CPU*4) CPU affinity
51           mormot     394,333  285,352    18,688   205,305    1,345  2,216,469  2,586           2022-12-22 - 112 threads CPU affinity + pthread_mutex
34           mormot     859,539  376,786    18,542   349,999    1,434  2,611,307  3,867           2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
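
As a reading aid for the table: the weighted score column appears to be the sum of each requests-per-second value multiplied by its weight, divided by 1000 with the fraction dropped. This is inferred from the numbers in the table, not taken from an official TFB formula. A small Free Pascal check for the 2022-10-26 row:

```pascal
program WeightedScore;

{$mode objfpc}

const
  // weights from the first row of the table:
  // JSON, 1-query, 20-query, Fortunes, Updates, Plaintext
  Weights: array[0..5] of Double =
    (1.000, 1.737, 21.745, 4.077, 68.363, 0.163);
  // requests per second of the 2022-10-26 row
  Rps: array[0..5] of Double =
    (731119, 308233, 19074, 288432, 3431, 2423283);

var
  i: Integer;
  score: Double;
begin
  score := 0;
  for i := 0 to High(Weights) do
    score := score + Weights[i] * Rps[i];
  // assumption: the weighted sum is divided by 1000 and the fraction is dropped
  WriteLn(Trunc(score / 1000)); // prints 3486, the score listed for that row
end.
```

With a weight of 68.363, the Updates column dominates: a gain of 1,000 RPS there is worth about 68 points, while the same gain in Plaintext is worth well under one point.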

#8 2023-01-14 22:42:32

mpv

I have detected a bottleneck: it is in the ThreadSafeConnection implementation.
The current implementation (under high load) loops almost every time to find the connection for a thread ID. An ideal implementation should use an array of fixed size (equal to the HTTP server thread count) and get the connection by thread index.

I had almost finished prototyping and got at least 10% in /db performance, but @#$%^&* russians turned off the electricity with their @#$%^&* missiles and I lost all modifications.

@ab, what do you think about a fixed thread pool for DB connections?
Filled with nulls on start, and with indexSafeConnection(threadIndex) instead of ThreadSafeConnection?
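
The proposal above can be sketched as follows. This is a hypothetical, self-contained Free Pascal illustration and not mORMot code: TFakeConnection stands in for a real connection class, and TIndexedConnections is an invented name. It only shows the shape of the idea: a fixed-size array filled with nil, and a lookup by thread index instead of a search by thread ID.

```pascal
program IndexedConnections;

{$mode objfpc}

type
  // stand-in for a real database connection class (hypothetical)
  TFakeConnection = class
  public
    OwnerIndex: Integer;
    constructor Create(aOwnerIndex: Integer);
  end;

  // one slot per HTTP server thread, filled with nil on start
  TIndexedConnections = class
  private
    fSlots: array of TFakeConnection;
  public
    constructor Create(aThreadCount: Integer);
    destructor Destroy; override;
    // direct lookup: no loop over thread IDs; no lock is needed as long
    // as each slot is only ever touched by its own thread
    function IndexSafeConnection(aThreadIndex: Integer): TFakeConnection;
  end;

constructor TFakeConnection.Create(aOwnerIndex: Integer);
begin
  inherited Create;
  OwnerIndex := aOwnerIndex;
end;

constructor TIndexedConnections.Create(aThreadCount: Integer);
begin
  inherited Create;
  SetLength(fSlots, aThreadCount); // new slots are zero-filled, i.e. nil
end;

destructor TIndexedConnections.Destroy;
var
  i: Integer;
begin
  for i := 0 to High(fSlots) do
    fSlots[i].Free;
  inherited Destroy;
end;

function TIndexedConnections.IndexSafeConnection(
  aThreadIndex: Integer): TFakeConnection;
begin
  Result := fSlots[aThreadIndex];
  if Result = nil then
  begin
    // connect lazily, on the first request served by this thread
    Result := TFakeConnection.Create(aThreadIndex);
    fSlots[aThreadIndex] := Result;
  end;
end;

var
  pool: TIndexedConnections;
begin
  pool := TIndexedConnections.Create(28);
  try
    // two calls with the same index return the same connection
    WriteLn(pool.IndexSafeConnection(3) = pool.IndexSafeConnection(3));
  finally
    pool.Free;
  end;
end.
```

The sketch deliberately leaves out what a production pool also needs, such as reconnection and the removal of stale connections.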

#9 2023-01-15 11:23:07

ab

This is a bit unexpected, because the loop is protected by a ReadLock, so it is not blocking.
I have added a threadvar, which is the safest way to make it work as expected.
Because we never know how many threads there will be. There is no such thing as a "thread index" - just a thread ID.
https://github.com/synopse/mORMot2/commit/82f5085b

About the numbers, it is weird that the "updates" numbers are so low.
I hope with your latest fix about sorting the IDs it could be better. Or we could use a more complex OR/IN statement (as other ORMs do) instead of the nested SELECT. I am not sure...

Stay safe.

#10 2023-01-15 14:55:21

mpv

TFB PR 7860 is ready.
Using a threadvar improves performance by ~1%. The previous implementation does not lock - it loops over all connections in the thread pool. And that takes some time...
In the case of the async thread pool we have a TAsyncConnectionsThread.Index.
And we know the threadPoolSize, so an efficient array-based implementation is possible (at least for /raw* endpoints) - the problem is that Ctxt.ConnectionThread is not filled by AsyncConnection: here nil is passed. Can it be filled?

#11 2023-01-15 16:58:34

ab

Yes, you are right: it may be possible only for the raw code, since there is no link between mormot.sql.db and mormot.net.async - the thread pool is not known by the SQL layer.

Please check https://github.com/synopse/mORMot2/commit/99f97333
Now ConnectionThread should be populated.

Edit:
I have added a new TAsyncConnectionsThread.CustomObject property.
You could put your DB connection directly in this field: it will be freed by TAsyncConnectionsThread.Destroy.
https://github.com/synopse/mORMot2/commit/4b676390
But by design, it will be very basic, and will lack e.g. the auto-deletion-once-deprecated feature of TSqlDBConnectionPropertiesThreadSafe.

#12 2023-01-15 19:17:19

mpv

I tested the new feature by assigning the connection to the thread CustomObject, and there is almost no difference with the threadvar version (at least with thread pool size = 28). So you are right - it is better to keep ThreadSafeConnection for auto-deletion, reconnection, etc.

Please look at https://github.com/synopse/mORMot2/pull/135/files. If you think that removing the lock is unsafe, we can at least use ReadLock in the function body and WriteLock in the TryPrepare sub-function.

#13 2023-01-15 19:39:13

ab

This lock is for each connection, so it will never lock in our case of one connection per thread.

It won't be a bottleneck I think.

#14 2023-01-16 11:51:44

mpv

I traced /plaintext in pipelining mode on the server.
The program spends 12% of its time parsing headers, and 7% retrieving the HOST header - on this line https://github.com/synopse/mORMot2/blob … .pas#L1284 (3.5% of it on string comparison -> THttpRequestContext.SetRawUtf -> function SortDynArrayPUtf8Char(const A, B): integer; )
When I replace the host by a constant:

// 'HOST:' 
Host := '10.0.0.1'; // GetTrimmed(P + 5, Host);      

I got +200 000 RPS on plaintext

@ab, please see the valgrind file - https://drive.google.com/file/d/1G1527e … share_link (viewing instructions are inside the archive) - maybe you have some ideas for optimization

#15 2023-01-16 19:00:52

ab

You may rather try to disable hsoHeadersInterning and see what happens, especially with the libc memory manager.

I have also made https://github.com/synopse/mORMot2/commit/cfbc5694
Which may help a little.

#16 2023-01-17 12:22:13

mpv

Yes, the updated HTTP headers parser helps by a few % for both /json and /plaintext. And with these changes there is no reason to verify with hsoHeadersInterning disabled. I switched TFB PR 7860 to this mORMot version (it is not merged yet, so we are waiting for the next round).
In PR 7860 I also removed hsoThreadSmooting - it seems that it improves only plaintext, but is bad for the other endpoints.
I am still investigating the bad pipelining performance - no results yet, but at least I reproduced it on server hardware (it is not reproduced on my PC)

#17 2023-01-18 17:55:46

mpv

PR 7860 has been merged by the TFB team. Now we are waiting for the results (~ on 2023-01-28). By my estimate, /updates should be around 12,000, which gives +650 to the composite score; /json is also improved; for the other endpoints the results depend on the reaction to the removal of `hsoThreadSmooting`.

BTW, in the last round:
- mORMot is the #2 ORM in the /db test! (Orm=full). The first is the Rust-based `xitca-web`
- #2 ORM in the /fortunes test (actually #3, but I exclude lithium - it's not an ORM) - the first is `asp.net core`
Not bad, not bad...

#18 2023-01-21 19:18:13

mpv

Solved the TFB /rawqueries and /rawupdates performance problems:
- PostgreSQL pipeline mode has been rethought - it is better to use Flush after each statement instead of PipelineSync after the last statement; in this case the server starts executing queries ASAP;
- use a ::bigint[] instead of a ::NUMERIC[] typecast in /rawupdates - NUMERIC is an arbitrary-precision decimal type, but we need Int64; added order by id to minimize lock waits (as in the ORM)

See mORMot PR #140
Now I expect /rawupdates and /rawqueries to be in the top 10. At least on my server hardware the results are:
- 52000 RPS for /rawqueries?queries=20
- 26000 RPS for /rawupdates?queries=20

I will test once more after @ab merges PR #140, and then prepare a merge request to TFB.....

Also a small performance tip to be fixed: TRestOrm.Retrieve calls Model.GetTableIndexExisting twice - once here inside Retrieve, and a second time inside fCache.Retrieve
@ab - maybe you could add an optional third parameter tblIdx to Cache.Retrieve?
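
To make the ::bigint[] point concrete, here is a hedged Free Pascal sketch. The SQL text is a hypothetical statement written for illustration (it is not copied from PR #140), and ToPgArray is an invented helper: it formats Int64 values as a PostgreSQL array literal, which is what a parameter cast to bigint[] would receive in text form.

```pascal
program BigIntArrayParam;

{$mode objfpc}{$H+}

uses
  SysUtils;

const
  // hypothetical statement, not the one from the mORMot sources: both
  // parameters are cast to bigint[] and the rows are ordered by id
  UpdateSql =
    'update world as t set randomnumber = v.r from ' +
    '(select unnest($1::bigint[]) as id, unnest($2::bigint[]) as r ' +
    'order by 1) as v where t.id = v.id';

// format Int64 values as a PostgreSQL array literal, e.g. {1,2,3}
function ToPgArray(const Values: array of Int64): string;
var
  i: Integer;
begin
  Result := '{';
  for i := 0 to High(Values) do
  begin
    if i > 0 then
      Result := Result + ',';
    Result := Result + IntToStr(Values[i]);
  end;
  Result := Result + '}';
end;

begin
  WriteLn(UpdateSql);
  WriteLn('$1 = ', ToPgArray([12, 345, 6789]));  // ids
  WriteLn('$2 = ', ToPgArray([4242, 17, 9001])); // new random numbers
end.
```

Whether the planner applies the row locks strictly in the subquery order is not guaranteed by this sketch; it only shows the shape of a single-statement batch update with integer arrays.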

Last edited by mpv (2023-01-21 19:19:58)

#19 2023-01-22 18:23:25

ab

Great!
I have merged the pull request.

About hsoThreadSmooting, what is your feedback about its impact on the Citrine HW?
The mORMot numbers are part of latest https://www.techempower.com/benchmarks/ … ched-query
I don't understand the numbers of cached-query. They should be close to the /json numbers, and we reached only 100,000 per second.
Perhaps it is because hsoThreadSmooting is missing...

About TOrmCache.Retrieve see https://github.com/synopse/mORMot2/commit/440ffa93

#20 2023-01-22 20:12:10

mpv

The current round does *NOT* include the last PR, where we removed hsoThreadSmooting and added `order by` for updates - let's wait for the next round results, ~ on 2023-01-28

I am also worried about the cached-queries performance. The bad thing is that I can't reproduce it on my server. Independently of hsoThreadSmooting, I always get good numbers, ~400k for cached-queries?count=20
The only difference is that, in contrast to Citrine, I execute both wrk and the app on the same server over loopback - maybe this is the reason.

#21 2023-01-22 20:36:10

mpv

I tested the improved TOrmCache.Retrieve - it gives a small but measurable improvement of about +3000 RPS for /db.

If you don't mind, I prefer to wait for the result of the next round (without hsoThreadSmooting) to decide whether we need it or not before making a new PR.
And during this time, I might find a way to reproduce the cached-queries problem....

#22 2023-01-23 10:10:02

ab

Please try https://github.com/synopse/mORMot2/commit/87aa8faf
I think it is not needed to return the ID field from the DB when it is already part of the "where" clause.
I have done this for both ORM and rawdb queries. In fact, the ORM was allowing the DB layer to not return the ID value: it does set it manually (Value.IDValue := ID) after parsing the JSON returned by the DB layer.
There are some other minor optimizations in the previous commits.

From what I could read in the TFB requirements, it is not forbidden to do so, and I suppose it will leverage the DB a little more.
Numbers are better on my side.

#23 2023-01-23 11:40:23

mpv

No, it's forbidden to read only randomNumber - see point i.

i. For every request, a single row from a World table must be retrieved from a database table. It is not acceptable to read just the randomNumber. 

#24 2023-01-23 12:31:42

ab

But what if the ORM does this pretty valid optimization?

We could try to let the ORM use its default behavior, but use a SELECT id,randomnumber for the raw queries.
https://github.com/synopse/mORMot2/commit/0eca2dff

#25 2023-01-23 14:01:18

mpv

I do not measure a significant performance difference between the ORM with ID and without ID. But in the case of the ORM without ID we break the rules. When we get to the top, someone will look at the source and say that this is not "fair play".

I propose to roll back the ORM change as well. In fact, for Postgres, having the primary key in the select fields affects only serialization and a tiny amount of traffic - the PK value is always in the buffer cache...

#26 2023-01-23 14:23:09

ab

Makes sense for PostgreSQL.
I saw a slight performance impact with SQLite3.
But I guess we could roll it back for all external databases, and strictly follow the TFB rules (even if they do not make much sense).

Please check https://github.com/synopse/mORMot2/commit/5a69112b

And I have committed several ORM optimizations in https://github.com/synopse/mORMot2/commit/88742d08
On my side, performance is slightly better.

#27 2023-01-24 14:52:36

mpv

After the "several ORM raw optimizations", /db performance increased by +2000 RPS (mostly because fCache.NotifyAllFields is no longer called for non-cached entities, as far as I understand)

Here is the server-side valgrind profile data for cached-queries?count=20 for mORMot 8765c931
Maybe you will find some optimization ideas.
Since I cannot reproduce the poor cached-queries performance on my server, I think the results on the Citrine HW differ because of something like the CPU cache.

Last edited by mpv (2023-01-24 14:53:00)

#28 2023-01-24 18:53:47

ab

@mpv
I looked at the cachegrind information... but I am not sure what to do.
Please try with https://github.com/synopse/mORMot2/commit/5e6f3685

#29 2023-01-26 14:37:20

mpv

As @ab suggested about a month ago, I tried the libc memory manager (using cmem) instead of fpcx64mm in FPCMM_SERVER mode and....

all results for the libc MM are better on a modern CPU. On my PC it is a little better (10%), but on the server (28 Xeon cores) the results increased dramatically. The most significant is /fortunes - from 180k RPS to 350k RPS. The other tests are about +20%, for example /rawfortunes - from 350k to 408k

Unfortunately there is an intermittent AV (access violation) which happens very rarely; I will try to find it after the blackout. I will also prepare detailed endpoint statistics in cmem mode for comparison.

P.S.
In cmem mode the AV occurs for plaintext in pipelining mode

Last edited by mpv (2023-01-26 14:45:01)

#30 2023-01-26 19:32:47

mpv

I can't reproduce the libc problems in the TFB bench on my PC, where I can debug, but the application crash is also easily reproducible in `mormot2tests` - when I put CMem as the first unit

uses
  CMem,
  {$I ..\src\mormot.uses.inc} // may include mormot.core.fpcx64mm.pas

and compile without any defines

@ab - do we need to do something specific like in mORMot1 SynFPCCMemAligned instead of using CMem?

#31 2023-01-26 20:56:20

tbo
Member
Registered: 2015-04-20
Posts: 335

@mpv
Sorry, I can be totally wrong (only use Delphi). The build is created (setup_and_build.sh) with the defines FPC_X64MM, FPCMM_SERVER, NOSYNDBZEOS, NOSYNDBIBX, FPCMM_REPORTMEMORYLEAKS. Wouldn't it be better to disable FPCMM_REPORTMEMORYLEAKS? FPCMM_SERVER also activates FPCMM_DEBUG. Shouldn't it be disabled for better performance? Couldn't definition FPCMM_BOOSTER be an option for the test scenario?

With best regards
Thomas

#32 2023-01-26 21:20:03

ab

Yes, the correct way is
- to disable FPC_X64MM conditional
- include CMem as first unit
- include {$I ..\src\mormot.uses.inc}

Something like this, to compile also on Windows:

uses
  {$ifdef OSPOSIX}
  {$ifndef FPC_X64MM}
  CMem,  // or SynFPCCMemAligned
  {$endif FPC_X64MM}
  {$endif OSPOSIX}
  {$I ..\src\mormot.uses.inc} 

I will try to make a mormot 2 unit to use the libc memory manager.
The CMem works, but is a bit old in its implementation (we could call the libc directly with no prefix - as SynFPCCMemAligned does).
So @mpv try also with SynFPCCMemAligned instead of CMem.

#33 2023-01-27 09:48:17

mpv

The current TFB round result is ready for mORMot - as expected, the /updates rate increased to 11k RPS (from 2-3k) because of the order by. All test results increased because of the removed hsoThreadSmooting. After the round ends we will be #28

Weights                   1.000    1.737    21.745     4.077   68.363      0.163
Composite #  Framework     JSON  1-query  20-query  Fortunes  Updates  Plaintext  Weighted score  Date - notes
38           mormot     731,119  308,233    19,074   288,432    3,431  2,423,283  3,486           2022-10-26 - 64 thread limitation
43           mormot     320,078  354,421    19,460   322,786    2,757  2,333,124  3,243           2022-11-13 - 112 thread (28CPU*4)
44           mormot     317,009  359,874    19,303   324,360    1,443  2,180,582  3,138           2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51           mormot     563,506  235,378    19,145   246,719    1,440  2,219,248  2,854           2022-12-01 - 112 thread (28CPU*4) CPU affinity
51           mormot     394,333  285,352    18,688   205,305    1,345  2,216,469  2,586           2022-12-22 - 112 threads CPU affinity + pthread_mutex
34           mormot     859,539  376,786    18,542   349,999    1,434  2,611,307  3,867           2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28           mormot     948,354  373,531    18,496   366,488   11,256  2,759,065  4,712           2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates

@tbo - you are right. I tried with FPCMM_REPORTMEMORYLEAKS disabled and FPCMM_BOOSTER enabled (which disables FPCMM_DEBUG)

-dFPC_X64MM -dFPCMM_SERVER -dFPCMM_BOOSTER -dNOSYNDBZEOS -dNOSYNDBIBX 

but the results are nearly the same as with the previous parameters.

@ab - I adapted SynFPCCMemAligned for mORMot2, but the auto-test (mormot2tests) still fails with "core dumped"

#34 2023-01-27 11:37:51

ab

Please try https://github.com/synopse/mORMot2/commit/01fd9895
There is the new mormot.core.fpclibcmm.pas unit.
To enable it, just define FPC_LIBCMM but not FPC_X64MM with {$I mormot.uses.inc} in the dpr.

But libc would abort/SIG_KILL the process on any problem.
And it seems a bit paranoid, because "s := s + s" raises an exception - https://github.com/synopse/mORMot2/commit/bdc67a02

We may also test the FPC RTL MM, which is twice as slow as fpcx64mm on a few threads, but is likely to scale better with 28 CPU cores, because it maintains a threadvar for small blocks.
And it won't abort/SIG_KILL without notice!

#35 2023-01-27 12:17:54

mpv

Already verified. The FPC RTL MM is much slower than x64mm.

#36 2023-01-28 14:29:21

mpv

I made TFB PR#7879 with the glibc MM and the improved PG pipelining mode for raw* tests

Even with the randomly occurring error with glibc and plaintext in pipelining mode, this version should work: the error does not occur when the /plaintext tests in pipelining mode are executed after a warm-up (plaintext w/o pipelining), as is done in the TFB benchmark

#37 2023-01-28 18:13:49

mpv

For comparison - results for the 28-core server, which show the performance increase. The first 3 are because of the MM; raw* mostly because of the new PG pipelining. For the other endpoints, which allocate almost nothing, the results are nearly the same

Endpoint                      x64mm       libc
/fortunes                   181 000    361 000
/rawfortunes                367 000    424 000
/queries?queries=20          33 000     35 000
-- raw perf increased because of new PG pipeline impl
/rawqueries?queries=20        6 000     50 000
/rawupdates?queries=20        3 000     26 000

If the server does not crash, I expect mORMot can be in the top 10

#38 2023-01-31 16:04:30

mpv

A small rawfortunes improvement (avoiding a record copy) gives +4000 RPS (+80 composite points)
Now I expect mormot to be #10 in fortunes (just above asp.net core)

Last edited by mpv (2023-01-31 16:07:09)

#39 2023-01-31 18:11:45

ab

See my remark in the last PR - you could try arr.NewPtr.

So fpcx64mm is a bottleneck with a lot of concurrent cores.
As we may expect due to its design, which was better than the original FastMM4 (much less contention), but still prone to contention.
For "regular" CPUs (up to 12-16 threads), my guess is that it is faster.

I also rewrote the mormot.core.fpclibcmm unit.
https://github.com/synopse/mORMot2/comm … bc2f487a0a
The prefix trick was not consistent and cmem fails to run mormot2tests on Linux x86_64.
This won't change for the TFB but it could help on other POSIX systems.

#40 2023-01-31 20:48:55

mpv

Last fpclibcmm changes tested - all OK.
Will prepare new TFB PR, hope they don't get sick of me..

#41 2023-02-01 18:28:42

ab

If you have time, please try
https://github.com/synopse/mORMot2/commit/0f944e51
The MM should now scale better on high-end hardware...

I have rewritten the fpcx64mm lockless free list
- to be really lockless
- and with no limit of size
- let GetMem() use this free list if it can instead of locking the block

#42 2023-02-01 19:12:43

mpv

Unfortunately there are no changes in fortunes at all... New x64mm - 181K RPS (~90% of CPU in user space, 10% in kernel), glibc MM - 360K RPS (50% CPU in user space, 50% in kernel).

I checked the syscalls - both MMs make nearly the same number of mmap and munmap calls.
And I do not see anything strange in valgrind
If you need some additional help - please tell me...

P.S.
compiled with -dFPC_X64MM -dFPCMM_SERVER -dFPCMM_BOOSTER

Last edited by mpv (2023-02-01 19:15:38)

#43 2023-02-01 19:41:33

mpv

By filtering the valgrind profiling data using `mem`, I found that the only significant difference is the self time of _getmem and _freemem:
        _getmem  _freemem
x64mm:     0.69      0.46
glibc:     0.26      0.23

#44 2023-02-01 20:29:13

ab

So we will use the glibc MM then on this HW.

But could you try
- to enable FPCMM_DEBUG and FPCMM_BOOST (and not FPCMM_BOOSTER which disables FPCMM_DEBUG)
- run the tests on the 28 cores HW,
- and report the

  WriteHeapStatus(' ', 16, 8, {compileflags=}true); 

output on the console?
- perhaps including https://github.com/synopse/mORMot2/commit/a5195136 change which will ensure that the arena round-robin is really thread-safe.

It may help us see the actual contentions/locks/sleeps involved in the code.

Another possibility may be to change the following constants:

  NumTinyBlockTypesPO2  = 4; // tiny are <= 256 bytes
  NumTinyBlockArenasPO2 = 4; // 16 arenas

Perhaps 16 arenas are not enough with 28 cores (glibc maintains one pool per thread anyway)... so NumTinyBlockArenasPO2 = 5 would create 32 arenas, which should not block on 28 cores, with a thread-safe round-robin.
Or TFB has contention on allocations > 256 bytes (which is not what I have seen).
The WriteHeapStatus() report may help identifying the problem.

So please try with:

  NumTinyBlockTypesPO2  = 3; // 4=256 bytes triggers more medium locks
  NumTinyBlockArenasPO2 = 5; // or 6 
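
As a quick reading aid for these constants, the PO2 suffix means "power of two". The short Free Pascal sketch below just prints the resulting counts; the 16-byte step between tiny block sizes is an assumption, inferred from the "4 = 256 bytes" comment above (16 types * 16 bytes), not taken from the fpcx64mm sources.

```pascal
program ArenaCounts;

{$mode objfpc}

const
  // assumption: tiny block sizes step by 16 bytes, which matches
  // the "4 = 256 bytes" comment (16 types * 16 bytes)
  TinyGranularity = 16;

var
  po2: Integer;
begin
  for po2 := 3 to 4 do
    WriteLn('NumTinyBlockTypesPO2 = ', po2, ': ', 1 shl po2,
      ' tiny block types, tiny <= ', (1 shl po2) * TinyGranularity, ' bytes');
  for po2 := 4 to 7 do
    WriteLn('NumTinyBlockArenasPO2 = ', po2, ': ', 1 shl po2, ' arenas');
end.
```

Under that assumption, a value of 3 would limit tiny blocks to 128 bytes, and values of 5 and 6 give 32 and 64 arenas, both above the 28 cores of the test server.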

#45 2023-02-02 20:11:23

ab

Edit:
You may also try FPCMM_BOOSTER
with https://github.com/synopse/mORMot2/commit/60024584
- it will define 32 tiny arenas, and also several (31) medium arenas which are split around the tiny arenas and small blocks.

#46 2023-02-02 20:48:46

mpv

Just tried with commit/60024584 and FPCMM_BOOSTER - the results are better: 243K RPS on fortunes (instead of 181K)

Flags: BOOSTER  assumulthrd smallpools erms                                              
Small:  blocks=3K size=309KB (part of Medium arena)                                      
Medium: 43MB/43MB  sleep=137                                                             
Large:  0B/640KB  sleep=0                                                                
Total Sleep: count=137                                                                   
Small Blocks since beginning: 180M/22GB (as small=41/46 tiny=466/496)                    
  48=68M  112=28M  80=20M  128=14M  32=10M  96=7M  64=7M  160=3M                         
  144=3M  256=3M  880=3M  416=3M  1264=3M  272=2M  960=310K  448=308K                    
 Small Blocks current: 3K/309KB                                                          
  48=2K  64=426  352=200  32=87  128=80  112=73  80=48  96=21                            
  192=14  416=8  576=7  880=7  288=6  736=5  672=4  624=4

#47 2023-02-02 20:55:40

mpv

BTW - by default the glibc MM for x64 uses an arena count of CPU cores * 8

#48 2023-02-02 21:13:12

mpv

With FPCMM_BOOST the result is 226K

Flags: BOOST  assumulthrd smallpool erms debug                                             
Small:  blocks=3K size=309KB (part of Medium arena)                                        
 Medium: 13MB/13MB    peak=13MB current=11 alloc=11 free=0 sleep=229                       
 Large:  0B/640KB    peak=640KB current=0 alloc=2 free=2 sleep=0                           
 Total Sleep: count=229                                                                    
 Small Blocks since beginning: 157M/19GB (as small=43/46 tiny=112/112)                     
  48=56M  112=25M  80=18M  128=12M  32=9M  96=6M  64=6M  160=3M                            
  144=3M  256=3M  880=2M  416=2M  1264=2M  272=1M  448=277K  960=273K                      
 Small Blocks current: 3K/309KB                                                            
  48=2K  64=426  352=200  32=87  128=80  112=73  80=48  96=21                              
  192=14  416=8  576=7  880=7  288=6  736=5  672=4  160=4    

#49 2023-02-03 18:25:15

ab

So it is better.

The good news is the number of "Sleep". It is low, and it seems to affect only the medium part: there is no sleep/contention for the small/tiny blocks.

I have tried another approach.
In FPCMM_BOOSTER, we now have 64 arenas for tiny blocks, and we use the current thread ID to return to the same arena from each thread. Using the thread ID is very close to what the libc MM does for the smallest blocks. But we don't track the threads, we just redirect them to the same slot in the 64 arenas.
There is now also a lock-free list of free medium blocks, when the arena is locked - but it won't affect the TFB benchmark.
See https://github.com/synopse/mORMot2/commit/290bedf9

Please try on your 28-cores system, with FPCMM_BOOSTER option...
And perhaps try to use 128 arenas instead, playing with constant NumTinyBlockArenasPO2 = 7 instead of 6...
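
The "redirect each thread to the same slot" idea can be sketched as below. This is a hypothetical Free Pascal illustration, not the actual fpcx64mm code: the ArenaIndex function, the 12-bit shift and the sample thread ID are all made up. It only shows that a thread ID can be reduced to an arena slot with a shift and a mask, without keeping any per-thread state.

```pascal
program ThreadIdToArena;

{$mode objfpc}

const
  NumTinyBlockArenasPO2 = 6; // 64 arenas, as described for FPCMM_BOOSTER
  NumTinyBlockArenas = 1 shl NumTinyBlockArenasPO2;

// hypothetical mapping, not the actual fpcx64mm code: drop the low bits
// (assumed to carry little information) and mask to the arena count
function ArenaIndex(aThreadID: QWord): Integer;
begin
  Result := Integer((aThreadID shr 12) and QWord(NumTinyBlockArenas - 1));
end;

var
  id: QWord;
begin
  id := QWord($7F3A5C001000); // made-up thread ID, for illustration only
  // the same ID always yields the same slot, with no per-thread bookkeeping
  WriteLn(ArenaIndex(id) = ArenaIndex(id));
  // the slot is always within 0 .. NumTinyBlockArenas - 1
  WriteLn(ArenaIndex(id) < NumTinyBlockArenas);
end.
```

Two threads can still land in the same slot with such a mapping, so each arena keeps its own lock; more arenas only make such collisions less likely.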

#50 2023-02-03 20:50:44

mpv

With arenas bound to the thread ID, the fortunes result is 313K RPS - very close to the 355K with libc. Congratulations!

Flags: BOOSTER  assumulthrd smallpools perthrd erms                              
Small:  blocks=3K size=309KB (part of Medium arena)                              
Medium: 51MB/51MB  sleep=10K                                                     
Large:  0B/640KB  sleep=0                                                        
Total Sleep: count=10K                                                           
Small Getmem Sleep: count=16                                                     
288=14 80=2                                                                      
Small Blocks since beginning: 234M/28GB (as small=42/46 tiny=746/1008)           
48=89M  112=37M  80=27M  128=18M  32=14M  96=9M  64=9M  144=4M                   
160=4M  256=4M  416=3M  880=3M  1264=3M  272=2M  960=465K  1376=464K             
Small Blocks current: 3K/309KB                                                   
48=2K  64=426  352=200  32=87  128=80  112=73  80=48  96=21                      
192=14  416=8  576=7  880=7  288=6  736=5  160=4  672=4

P.S.
The sleep count has increased, but so has the overall speed

Last edited by mpv (2023-02-03 20:52:42)
