This is the discussion topic about integrating mORMot 2 with the TechEmpower Framework Benchmarks (TFB) challenge.
https://github.com/TechEmpower/FrameworkBenchmarks
This is a follow-up of https://synopse.info/forum/viewtopic.php?id=5547 in the new "mORMot 2" thread.
For reference, the current status of the TFB challenge internal rounds is available at https://tfb-status.techempower.com
Offline
Info: the pull request has been merged.
https://github.com/TechEmpower/Framewor … /pull/7833
So I hope the next round will show better numbers.
I am very upset with the current "updates" performance on their high-end HW - much slower on their system than on my old laptop!
Offline
Looks like I fixed the update performance.
The reason is that we update the table in random order, so simultaneous updates lock each other (I figured this out after reading an epic book from the postgrespro team, https://postgrespro.ru/education/books/internals, unfortunately available only in Russian).
So I added an ORDER BY id (alternatively we could sort by ID at the application-server level, but it is easier at the database level) - see PR #134.
I set up an updates test on a 48-core server: with 512 concurrent connections and /updates?queries=20, performance increased from ~4k RPS to ~16k RPS.
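To illustrate the locking issue, here is a minimal sketch - the statements are purely illustrative (not what the ORM emits); only the World table comes from the TFB schema:

// Illustrative only - not the statements the ORM actually emits.
// Session A updates rows in the order (5, 7); session B in the order (7, 5),
// each inside one transaction. Each takes one row lock and then waits for the
// other, so PostgreSQL has to detect a deadlock and abort one of them.
// If both sessions sort their ids first, locks are always acquired in
// ascending order and one session simply waits for the other to commit.
const
  SessionA: array[0..1] of string = (
    'UPDATE World SET randomNumber = 1 WHERE id = 5',
    'UPDATE World SET randomNumber = 2 WHERE id = 7');
  SessionB: array[0..1] of string = (
    'UPDATE World SET randomNumber = 3 WHERE id = 7',
    'UPDATE World SET randomNumber = 4 WHERE id = 5');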
Offline
This is a great finding!
I have merged your PR.
Thanks for the feedback.
The new round should start today, I guess.
I hope they will include the latest trunk, and your previous PR with the new thread layout...
Offline
Results for the new round are ready. I have prepared a historical overview of composite results - the first row is the weight of each column:
Weights 1.000 1.737 21.745 4.077 68.363 0.163
Composite# Framework JSON 1-query 20-query Fortunes Updates Plaintext Weighted-score Date - notes
38 mormot 731,119 308,233 19,074 288,432 3,431 2,423,283 3,486 2022-10-26 - 64 thread limitation
43 mormot 320,078 354,421 19,460 322,786 2,757 2,333,124 3,243 2022-11-13 - 112 thread (28CPU*4)
44 mormot 317,009 359,874 19,303 324,360 1,443 2,180,582 3,138 2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51 mormot 563,506 235,378 19,145 246,719 1,440 2,219,248 2,854 2022-12-01 - 112 thread (28CPU*4) CPU affinity
51 mormot 394,333 285,352 18,688 205,305 1,345 2,216,469 2,586 2022-12-22 - 112 threads CPU affinity + pthread_mutex
34 mormot 859,539 376,786 18,542 349,999 1,434 2,611,307 3,867 2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
Offline
I detected a bottleneck - it's in the ThreadSafeConnection implementation.
The current implementation (under high load) loops almost every time to find the connection for the thread ID. An ideal implementation would use an array of fixed size (equal to the HTTP server thread count) and get the connection by thread index.
I had almost finished prototyping and got at least 10% more /db performance, but the @#$%^&* russians cut the electricity with their @#$%^&* missiles and I lost all my modifications.
@ab, what do you think about a fixed thread pool for DB connections?
Filled with nils on start, and an indexSafeConnection(threadIndex) lookup instead of ThreadSafeConnection? Something like the sketch below.
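A rough sketch of the idea, with hypothetical names - TDbConnection is just a placeholder and IndexSafeConnection is the name proposed above; none of this is existing mORMot API:

type
  TDbConnection = class(TObject); // placeholder for the real connection class

  TFixedConnectionPool = class
  private
    fConnections: array of TDbConnection; // length = HTTP server thread count
  public
    constructor Create(aThreadCount: integer);
    destructor Destroy; override;
    // each slot is only ever touched by its own thread, so no locking is needed
    function IndexSafeConnection(aThreadIndex: PtrInt): TDbConnection;
  end;

constructor TFixedConnectionPool.Create(aThreadCount: integer);
begin
  SetLength(fConnections, aThreadCount); // filled with nil on start
end;

destructor TFixedConnectionPool.Destroy;
var
  i: PtrInt;
begin
  for i := 0 to high(fConnections) do
    fConnections[i].Free;
  inherited Destroy;
end;

function TFixedConnectionPool.IndexSafeConnection(aThreadIndex: PtrInt): TDbConnection;
begin
  result := fConnections[aThreadIndex];
  if result = nil then
  begin
    result := TDbConnection.Create; // lazily created on first use by this thread
    fConnections[aThreadIndex] := result;
  end;
end;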
Offline
This is a bit unexpected, because the loop is protected by a ReadLock, so it is not blocking.
I have added a threadvar, which is the safest way to make it work as expected.
We never know how many threads there will be, and there is no such thing as a "thread index" at this level - just a thread ID.
https://github.com/synopse/mORMot2/commit/82f5085b
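Simplified, the idea behind the threadvar is something like this - not the exact code of the commit, and it ignores multiple connection properties and connection expiration, which the real implementation still handles (TSqlDBConnectionPropertiesThreadSafe comes from mormot.db.sql):

// Simplified sketch - not the actual code of the commit above.
// Each OS thread caches the connection it resolved last time, so the hot
// path avoids looping over the pool (and taking the ReadLock) at all.
threadvar
  LastThreadConnection: TObject; // really a TSqlDBConnection

function FastThreadSafeConnection(
  aProps: TSqlDBConnectionPropertiesThreadSafe): TObject;
begin
  result := LastThreadConnection;
  if result <> nil then
    exit;                                // fast path: already resolved
  result := aProps.ThreadSafeConnection; // slow path: regular pool lookup
  LastThreadConnection := result;
end;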
About the numbers, it is weird that the "updates" numbers are so low.
I hope with your latest fix about sorting the IDs it could be better. Or we could use a more complex OR/IN statement (as other ORMs do) instead of the nested SELECT. I am not sure...
Stay safe.
Offline
TFB PR 7860 is ready.
Using the threadvar improves performance by ~1%. The previous implementation does not lock - it loops over all connections in the thread pool, and that takes some time...
In case of the async thread pool we have TAsyncConnectionsThread.Index.
And we know the threadPoolSize, so an efficient array-based implementation is possible (at least for the /raw* endpoints) - the problem is that Ctxt.ConnectionThread is not filled by the async connection: nil is passed here. Can it be filled?
Offline
Yes, you are right: it may be possible only for the raw code, since there is no link between mormot.db.sql and mormot.net.async - the thread pool is not known by the SQL layer.
Please check https://github.com/synopse/mORMot2/commit/99f97333
Now ConnectionThread should be populated.
Edit:
I have added a new TAsyncConnectionsThread.CustomObject property.
You could put your DB connection directly in this field: it will be freed by TAsyncConnectionsThread.Destroy.
https://github.com/synopse/mORMot2/commit/4b676390
But by design, it will be very basic, and will lack e.g. the auto-deletion-once-deprecated feature of TSqlDBConnectionPropertiesThreadSafe.
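A hedged sketch of how a /raw* handler could use it - the handler shape and the NewRawConnection factory are purely illustrative; only CustomObject (and retrieving the thread from Ctxt.ConnectionThread, as per the commits above) come from the actual changes:

// Illustrative sketch only: aThread would come from Ctxt.ConnectionThread.
function NewRawConnection: TObject; // hypothetical factory
begin
  result := TObject.Create; // placeholder - a real app would open a DB connection
end;

procedure EnsureThreadConnection(aThread: TAsyncConnectionsThread);
var
  conn: TObject;
begin
  conn := aThread.CustomObject;
  if conn = nil then
  begin
    conn := NewRawConnection;      // open one dedicated connection per thread
    aThread.CustomObject := conn;  // owned - freed by TAsyncConnectionsThread.Destroy
  end;
  // ... run the query using conn ...
end;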
Offline
I tested the new feature by assigning the connection to the thread's CustomObject, and there is almost no difference compared to the threadvar version (at least with a thread pool size of 28). So - you are right - it is better to keep ThreadSafeConnection for auto-deletion/reconnection etc.
Please look at https://github.com/synopse/mORMot2/pull/135/files. If you think that removing the lock is unsafe, we can at least use a ReadLock in the function body and a WriteLock in the TryPrepare sub-function.
Offline
I traced /plaintext in pipelining mode on the server.
The program spends 12% of its time parsing headers, 7% of it on retrieving the HOST header - on this line https://github.com/synopse/mORMot2/blob … .pas#L1284 (3.5% of that on string comparison -> THttpRequestContext.SetRawUtf -> function SortDynArrayPUtf8Char(const A, B): integer; )
When I replace the host with a constant:
// 'HOST:'
Host := '10.0.0.1'; // GetTrimmed(P + 5, Host);
I get +200,000 RPS on /plaintext.
@ab, please see the valgrind file - https://drive.google.com/file/d/1G1527e … share_link (viewing instructions are inside the archive) - maybe you have some ideas for optimization.
Offline
You may rather try to disable hsoHeadersInterning and see what happens, especially with the libc memory manager.
I have also made https://github.com/synopse/mORMot2/commit/cfbc5694
Which may help a little.
Offline
Yes, the updated HTTP header parser helps by a few % for both /json and /plaintext. And with these changes there is no reason to test with hsoHeadersInterning disabled. I switched TFB PR 7860 to this mORMot version (it is not merged yet, so it will wait for the next round).
In PR 7860 I also removed hsoThreadSmooting - it seems to improve only /plaintext, and is bad for the other endpoints.
I am still investigating the bad pipelining performance - no results yet, but at least I have reproduced it on the server hardware (it is not reproducible on my PC).
Offline
PR 7860 has been merged by the TFB team. Now we are waiting for the results (~ 2023-01-28). From my estimates /updates should be around 12,000, which gives +650 composite points; /json is also improved; for the other endpoints the results depend on the reaction to the `hsoThreadSmooting` removal.
BTW, in the last round:
- mORMot is the #2 ORM in the /db test (ORM = Full)! First is the Rust-based `xitca-web`
- #2 ORM in the /fortunes test (actually #3, but I exclude lithium - it's not an ORM) - first is `asp.net core`
Not bad, not bad...
Offline
Solved the TFB /rawqueries and /rawupdates performance problems:
- PostgreSQL pipeline mode has been rethought - it is better to use a Flush after each statement instead of a single PipelineSync after the last one; this way the server starts executing queries as soon as possible;
- use a ::bigint[] instead of a ::NUMERIC[] typecast in /rawupdates - NUMERIC is an arbitrary-precision type, but we need Int64; also added an ORDER BY id to minimize lock waits (as in the ORM) - see the sketch below
See mORMot PR #140
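The resulting batch update has roughly this shape - the exact statement is in PR #140; this sketch only shows the ::bigint[] casts and the ORDER BY that serializes the row locks:

// Rough shape only - see mORMot PR #140 for the exact statement.
// $1 / $2 are the bound arrays of ids and of new random numbers.
const
  WORLD_BATCH_UPDATE =
    'UPDATE world SET randomnumber = v.r FROM (' +
      'SELECT unnest($1::bigint[]) AS id, unnest($2::bigint[]) AS r ' +
      'ORDER BY 1' +
    ') AS v WHERE world.id = v.id';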
Now I expect /rawupdates and /rawqueries to be in the top 10. At least on my server hardware the results are:
- 52000 RPS for /rawqueries?queries=20
- 26000 RPS for /rawupdates?queries=20
I will test once more after @ab merges PR #140, and then prepare a merge request to TFB...
Also a small performance issue to be fixed: TRestOrm.Retrieve calls Model.GetTableIndexExisting twice - once inside Retrieve itself, and a second time inside fCache.Retrieve.
@ab - maybe you could add an optional third parameter tblIdx to Cache.Retrieve?
Last edited by mpv (2023-01-21 19:19:58)
Offline
Great!
I have merged the pull request.
About hsoThreadSmooting, what is your feedback about its impact on the Citrine HW?
The mORMot numbers are part of latest https://www.techempower.com/benchmarks/ … ched-query
I don't understand the cached-query numbers. They should be close to the /json numbers, and we reached only 100,000 per second.
Perhaps it is because hsoThreadSmooting is missing...
About TOrmCache.Retrieve see https://github.com/synopse/mORMot2/commit/440ffa93
Offline
The current round does *NOT* include the last PR, where we remove hsoThreadSmooting and add an `order by` for updates - let's wait for the next round results, ~ 2023-01-28.
I am also worried about the cached-queries performance. The bad thing is that I can't reproduce it on my server: independent of hsoThreadSmooting, I always get good numbers, ~400k for cached-queries?count=20.
The only difference is that, in contrast to Citrine, I execute both wrk and the app on the same server over loopback - maybe this is the reason.
Offline
I tested the improved TOrmCache.Retrieve - it gives a small but measurable improvement for /db, about +3,000 RPS.
If you don't mind, I prefer to wait for the result of the next round (without Smooting) to decide whether we need Smooting or not before making a new PR.
And during this time, I might find a way to reproduce the cached-queries problem....
Offline
Please try https://github.com/synopse/mORMot2/commit/87aa8faf
I think it is not needed to return the ID field from the DB when it is already part of the "where" clause.
I have done this for both ORM and rawdb queries. In fact, the ORM was allowing the DB layer to not return the ID value: it does set it manually (Value.IDValue := ID) after parsing the JSON returned by the DB layer.
There are some other minor optimizations in the previous commits.
From what I could read in the TFB requirements, it is not forbidden to do so, and I suppose it will relieve the DB a little more.
Numbers are better on my side.
Offline
No, it's forbidden to read only randomNumber - see point i:
i. For every request, a single row from a World table must be retrieved from a database table. It is not acceptable to read just the randomNumber.
Offline
But what if the ORM does this pretty valid optimization?
We could try to let the ORM use its default behavior, but use a SELECT id,randomnumber for the raw queries.
https://github.com/synopse/mORMot2/commit/0eca2dff
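For clarity, the two shapes under discussion for the raw single-row query (illustrative constants only - parameter syntax and casing may differ from the actual code):

const
  // what TFB rule i. expects - the full row comes back from the DB:
  WORLD_READ_FULL  = 'SELECT id, randomNumber FROM World WHERE id = $1';
  // the shortcut discussed above - the id is already known from the WHERE
  // clause, so only randomNumber travels back over the wire:
  WORLD_READ_SHORT = 'SELECT randomNumber FROM World WHERE id = $1';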
Offline
I did not measure a noticeable performance difference between the ORM with ID and without ID. But with the ORM not selecting the ID we break the rules, and when we get to the top, someone will look at the source and say that this is not "fair play".
I propose to roll back the ORM change as well. In fact, for Postgres, having the primary key in the select fields affects only serialization and a tiny amount of traffic - the PK value is always in the buffer cache...
Offline
Makes sense for PostgreSQL.
I saw a slight performance impact with SQLite3.
But I guess we could roll it back for all external databases, and strictly follow the TFB rules (even if they do not make much sense).
Please check https://github.com/synopse/mORMot2/commit/5a69112b
And I have committed several ORM optimizations in https://github.com/synopse/mORMot2/commit/88742d08
On my side, performance is slightly better.
Offline
After "several ORM raw optimizations" /db performance increased by +2000 RPS (mostly because fCache.NotifyAllFields now not called for non-cached entities, as far as I understand)
Here is server-side valgrind profile data for cached-queries?count=20 for mORMot 8765c931
May be you find some optimization ideas.
Since I do not reproduce pure cached queries performance on my server, I think on Citrine HW results are difference because of something like CPU cache.
Last edited by mpv (2023-01-24 14:53:00)
Offline
@mpv
I looked at the cachegrind information... but I am not sure what to do about it.
Please try with https://github.com/synopse/mORMot2/commit/5e6f3685
Offline
As @ab suggested about a month ago, I tried the libc memory manager (uses cmem) instead of fpcx64mm in FPCMM_SERVER mode and...
all results with the libc MM are better on modern CPUs. On my PC it is a little better (10%), but on the server (28 Xeon cores) the results increase dramatically. The most valuable is /fortunes - from 180k RPS to 350k RPS. Other tests gain about +20%, for example /rawfortunes - from 350k to 408k.
Unfortunately there is a floating AV which happens very rarely; I will try to find it after the blackout. I will also prepare detailed per-endpoint statistics in cmem mode for comparison.
P.S.
in cmem mode the AV occurs on /plaintext in pipelining mode
Last edited by mpv (2023-01-26 14:45:01)
Offline
I can't reproduce the libc problems in the TFB bench on my PC, where I could debug them, but the application crash is also easily reproducible in `mormot2tests` - when I put CMem as the first unit
uses
CMem,
{$I ..\src\mormot.uses.inc} // may include mormot.core.fpcx64mm.pas
and compile without any defines
@ab - do we need to do something specific, like mORMot 1's SynFPCCMemAligned, instead of using CMem?
Offline
@mpv
Sorry, I may be totally wrong (I only use Delphi). The build is created (setup_and_build.sh) with the defines FPC_X64MM, FPCMM_SERVER, NOSYNDBZEOS, NOSYNDBIBX, FPCMM_REPORTMEMORYLEAKS. Wouldn't it be better to disable FPCMM_REPORTMEMORYLEAKS? FPCMM_SERVER also activates FPCMM_DEBUG. Shouldn't it be disabled for better performance? Couldn't the FPCMM_BOOSTER define be an option for the test scenario?
With best regards
Thomas
Offline
Yes, the correct way is
- to disable FPC_X64MM conditional
- include CMem as first unit
- include {$I ..\src\mormot.uses.inc}
Something like this, to compile also on Windows:
uses
{$ifdef OSPOSIX}
{$ifndef FPC_X64MM}
CMem, // or SynFPCCMemAligned
{$endif FPC_X64MM}
{$endif OSPOSIX}
{$I ..\src\mormot.uses.inc}
I will try to make a mormot 2 unit to use the libc memory manager.
CMem works, but it is a bit old in its implementation (we could call the libc directly with no prefix - as SynFPCCMemAligned does).
So @mpv, please also try SynFPCCMemAligned instead of CMem.
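For reference, calling the libc allocator directly with no prefix boils down to declarations like these - a sketch only; the real unit (see SynFPCCMemAligned) also has to install a TMemoryManager record and map FreeMem/MemSize semantics:

// Sketch of direct glibc allocator bindings on Linux - declarations only.
function malloc(size: PtrUInt): pointer; cdecl; external 'c' name 'malloc';
function calloc(count, size: PtrUInt): pointer; cdecl; external 'c' name 'calloc';
function realloc(p: pointer; size: PtrUInt): pointer; cdecl; external 'c' name 'realloc';
procedure free(p: pointer); cdecl; external 'c' name 'free';
function malloc_usable_size(p: pointer): PtrUInt; cdecl; external 'c' name 'malloc_usable_size';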
Offline
The current TFB round result is ready for mORMot - as expected, the /updates rate increased to 11k RPS (from 2-3k) because of the ORDER BY. All test results increased because of the removed Smooting. After the round ends we will be #28.
Weights 1.000 1.737 21.745 4.077 68.363 0.163
Composite# Framework JSON 1-query 20-query Fortunes Updates Plaintext Weighted-score Date - notes
38 mormot 731,119 308,233 19,074 288,432 3,431 2,423,283 3,486 2022-10-26 - 64 thread limitation
43 mormot 320,078 354,421 19,460 322,786 2,757 2,333,124 3,243 2022-11-13 - 112 thread (28CPU*4)
44 mormot 317,009 359,874 19,303 324,360 1,443 2,180,582 3,138 2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51 mormot 563,506 235,378 19,145 246,719 1,440 2,219,248 2,854 2022-12-01 - 112 thread (28CPU*4) CPU affinity
51 mormot 394,333 285,352 18,688 205,305 1,345 2,216,469 2,586 2022-12-22 - 112 threads CPU affinity + pthread_mutex
34 mormot 859,539 376,786 18,542 349,999 1,434 2,611,307 3,867 2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28 mormot 948,354 373,531 18,496 366,488 11,256 2,759,065 4,712 2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
@tbo - you are right. I tried with FPCMM_REPORTMEMORYLEAKS disabled and FPCMM_BOOSTER enabled (which disables FPCMM_DEBUG):
-dFPC_X64MM -dFPCMM_SERVER -dFPCMM_BOOSTER -dNOSYNDBZEOS -dNOSYNDBIBX
but the results are nearly the same as with the previous parameters.
@ab - I adapted SynFPCCMemAligned for mORMot 2, but the auto-test (mormot2tests) still fails with "core dumped".
Offline
Please try https://github.com/synopse/mORMot2/commit/01fd9895
There is the new mormot.core.fpclibcmm.pas unit.
To enable it, just define FPC_LIBCMM but not FPC_X64MM with {$I mormot.uses.inc} in the dpr.
But libc would abort/SIG_KILL the process on any problem.
And it seems a bit paranoid, because "s := s + s" raised an exception - https://github.com/synopse/mORMot2/commit/bdc67a02
We may also test the FPC RTL MM, which is twice as slow as fpcx64mm with a few threads, but is likely to scale better with 28 CPU cores, because it maintains a threadvar for small blocks.
And it won't abort/SIG_KILL without notice!
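A minimal project header for that setup could look like this - a sketch, assuming the project is compiled with -dFPC_LIBCMM and without -dFPC_X64MM:

program raw;
// sketch: with FPC_LIBCMM defined (and FPC_X64MM not defined), the include
// below is expected to pull in mormot.core.fpclibcmm.pas as memory manager
uses
  {$I mormot.uses.inc}
  sysutils;

begin
end.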
Offline
I made TFB PR #7879 with the glibc MM and the improved PG pipelining mode for the raw* tests.
Even with the randomly occurring glibc error on plaintext in pipelining mode, this version should work - the error does not occur when the /plaintext tests in pipelining mode are executed after a warm-up (plaintext without pipelining), as is done in the TFB benchmark.
Offline
For comparison - results on the 28-core server, showing the performance increase. The first three are because of the MM, the raw* ones mostly because of the new PG pipelining. For the other endpoints, which hardly allocate, the results are nearly the same:
x64mm libc
/fortunes 181 000 361 000
/rawfortunes 367 000 424 000
/queries?queries=20 33 000 35 000
-- raw perf increased because of new PG pipeline impl
/rawqueries?queries=20 6 000 50 000
/rawupdates?queries=20 3 000 26 000
If the server does not crash, I expect mORMot can be in the top 10.
Offline
A small /rawfortunes improvement (avoiding a record copy) gives +4,000 RPS (+80 composite points).
Now I expect mORMot to be #10 in fortunes (just above asp.net core).
Last edited by mpv (2023-01-31 16:07:09)
Offline
See my remark in the last PR - you could try arr.NewPtr.
So fpcx64mm is a bottleneck with a lot of concurrent cores.
As we may expect due to its design, which was better than the original FastMM4 (much less contention), but still prone to contention.
For "regular" CPUs (up to 12-16 threads), my guess is that it is faster.
I also rewrote the mormot.core.fpclibcmm unit.
https://github.com/synopse/mORMot2/comm … bc2f487a0a
The prefix trick was not consistent and cmem fails to run mormot2tests on Linux x86_64.
This won't change for the TFB but it could help on other POSIX systems.
Offline
If you have time, please try
https://github.com/synopse/mORMot2/commit/0f944e51
The MM should now scale better on high-end hardware...
I have rewritten the fpcx64mm lockless free list
- to be really lockless
- and with no limit of size
- let GetMem() use this free list if it can instead of locking the block
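Conceptually, such a lock-free free list can be a simple CAS-based (Treiber-style) stack - an illustrative sketch only, not the actual fpcx64mm code, which also has to care about ABA and memory ordering:

// Illustrative sketch only - not the actual fpcx64mm code.
// Freed blocks are pushed with a CAS loop; GetMem() can pop from the list
// instead of locking the arena when a suitable block is available.
type
  PFreeBlock = ^TFreeBlock;
  TFreeBlock = record
    Next: PFreeBlock; // stored inside the freed block itself
  end;

var
  FreeListHead: PFreeBlock = nil;

procedure PushFreeBlock(aBlock: PFreeBlock);
var
  old: PFreeBlock;
begin
  repeat
    old := FreeListHead;
    aBlock^.Next := old;
  until InterlockedCompareExchangePointer(
    pointer(FreeListHead), aBlock, old) = old;
end;

function PopFreeBlock: PFreeBlock; // returns nil if the list is empty
begin
  repeat
    result := FreeListHead;
    if result = nil then
      exit; // nothing to reuse: caller falls back to the regular path
  until InterlockedCompareExchangePointer(
    pointer(FreeListHead), result^.Next, result) = result;
end;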
Offline
Unfortunately there is no change in fortunes at all... New x64mm - 181K RPS (~90% of CPU in user space, 10% in kernel), glibc MM - 360K RPS (50% of CPU in user space, 50% in kernel).
I checked the syscalls - both MMs do nearly the same number of mmap and munmap calls.
And I do not see anything strange in valgrind.
If you need some additional help, please tell me...
P.S.
compiled with -dFPC_X64MM -dFPCMM_SERVER -dFPCMM_BOOSTER
Last edited by mpv (2023-02-01 19:15:38)
Offline
So we will use the glibc MM then on this HW.
But could you try
- to enable FPCMM_DEBUG and FPCMM_BOOST (and not FPCMM_BOOSTER which disables FPCMM_DEBUG)
- run the tests on the 28 cores HW,
- and report the
WriteHeapStatus(' ', 16, 8, {compileflags=}true);
output on the console?
- perhaps including https://github.com/synopse/mORMot2/commit/a5195136 change which will ensure that the arena round-robin is really thread-safe.
It may help us see the actual contentions/locks/sleeps involved in the code.
Another possibility may be to change the following constants:
NumTinyBlockTypesPO2 = 4; // tiny are <= 256 bytes
NumTinyBlockArenasPO2 = 4; // 16 arenas
Perhaps 16 arenas is not enough with 28 cores (glibc maintains one pool per thread anyway)... so NumTinyBlockArenasPO2 = 5 would create 32 arenas, which should not block on 28 cores, with a thread-safe round-robin.
Or TFB has contention on allocations > 256 bytes (which is not what I have seen).
The WriteHeapStatus() report may help identify the problem.
So please try with:
NumTinyBlockTypesPO2 = 3; // 4=256 bytes triggers more medium locks
NumTinyBlockArenasPO2 = 5; // or 6
Offline
Edit:
You may also try FPCMM_BOOSTER
with https://github.com/synopse/mORMot2/commit/60024584
- it will define 32 tiny arenas, and also several (31) medium arenas which are split around the tiny arenas and small blocks.
Offline
I just tried with commit/60024584 and FPCMM_BOOSTER - the results are better - 243K RPS on fortunes (instead of 181K).
Flags: BOOSTER assumulthrd smallpools erms
Small: blocks=3K size=309KB (part of Medium arena)
Medium: 43MB/43MB sleep=137
Large: 0B/640KB sleep=0
Total Sleep: count=137
Small Blocks since beginning: 180M/22GB (as small=41/46 tiny=466/496)
48=68M 112=28M 80=20M 128=14M 32=10M 96=7M 64=7M 160=3M
144=3M 256=3M 880=3M 416=3M 1264=3M 272=2M 960=310K 448=308K
Small Blocks current: 3K/309KB
48=2K 64=426 352=200 32=87 128=80 112=73 80=48 96=21
192=14 416=8 576=7 880=7 288=6 736=5 672=4 624=4
Offline
BTW - the glibc MM on x64 by default uses an arena count of CPUcores*8.
Offline
With FPCMM_BOOST the result is 226K.
Flags: BOOST assumulthrd smallpool erms debug
Small: blocks=3K size=309KB (part of Medium arena)
Medium: 13MB/13MB peak=13MB current=11 alloc=11 free=0 sleep=229
Large: 0B/640KB peak=640KB current=0 alloc=2 free=2 sleep=0
Total Sleep: count=229
Small Blocks since beginning: 157M/19GB (as small=43/46 tiny=112/112)
48=56M 112=25M 80=18M 128=12M 32=9M 96=6M 64=6M 160=3M
144=3M 256=3M 880=2M 416=2M 1264=2M 272=1M 448=277K 960=273K
Small Blocks current: 3K/309KB
48=2K 64=426 352=200 32=87 128=80 112=73 80=48 96=21
192=14 416=8 576=7 880=7 288=6 736=5 672=4 160=4
Offline
So it is better.
The good news is the number of "Sleep". It is low, and it seems to affect only the medium part: there is no sleep/contention for the small/tiny blocks.
I have tried another approach.
In FPCMM_BOOSTER, we now have 64 arenas for tiny blocks, and we use the current thread ID so that each thread returns to the same arena. Using the thread ID is very close to what the libc MM does for the smallest blocks. But we don't track the threads, we just redirect each of them to the same slot among the 64 arenas.
There is now also a lock-free list of free medium blocks, when the arena is locked - but it won't affect the TFB benchmark.
See https://github.com/synopse/mORMot2/commit/290bedf9
Please try on your 28-cores system, with FPCMM_BOOSTER option...
And perhaps try to use 128 arenas instead, playing with constant NumTinyBlockArenasPO2 = 7 instead of 6...
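The redirection itself can be as simple as deriving an arena index from the OS thread ID - an illustrative sketch only, not the actual fpcx64mm code:

const
  NumTinyArenas = 64; // must be a power of two

// Illustrative only: the OS thread ID is hashed into one of the tiny-block
// arenas, so a given thread always comes back to the same arena without any
// per-thread registration or tracking.
function TinyArenaIndex: PtrUInt;
begin
  // drop the low bits: POSIX thread ids are pointer-like and aligned
  result := (PtrUInt(GetCurrentThreadId) shr 6) and (NumTinyArenas - 1);
end;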
Offline
With the arenas bound to the thread ID, the fortunes result is 313K RPS - very close to the 355K with libc. Congratulations!
Flags: BOOSTER assumulthrd smallpools perthrd erms
Small: blocks=3K size=309KB (part of Medium arena)
Medium: 51MB/51MB sleep=10K
Large: 0B/640KB sleep=0
Total Sleep: count=10K
Small Getmem Sleep: count=16
288=14 80=2
Small Blocks since beginning: 234M/28GB (as small=42/46 tiny=746/1008)
48=89M 112=37M 80=27M 128=18M 32=14M 96=9M 64=9M 144=4M
160=4M 256=4M 416=3M 880=3M 1264=3M 272=2M 960=465K 1376=464K
Small Blocks current: 3K/309KB
48=2K 64=426 352=200 32=87 128=80 112=73 80=48 96=21
192=14 416=8 576=7 880=7 288=6 736=5 160=4 672=4
P.S.
the sleep count increased, but so did the overall speed
Last edited by mpv (2023-02-03 20:52:42)
Offline