With the latest sources, TFB plaintext in pipeline mode crashes the server with the message "free(): invalid next size (normal) Aborted (core dumped)" after ~200k requests.
For the mormot test there are too many errors in memcheck mode.
There is an error (FPC 3.2.2) in:
mormot.net.ldap.pas(1725,16) Error: Wrong number of parameters specified for call to "ASNObject"
Already verified. The FPC RTL MM is much slower compared to x64mm.
Also I found that cmem uses less memory (at least in some cases) - see the results for TFB /fortunes for the mormot server with 168 threads:
x64mm
Maximum resident set size (kbytes): 38952
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 21113
Voluntary context switches: 4118110
Involuntary context switches: 2813
cmem
Maximum resident set size (kbytes): 28292
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 19733
Voluntary context switches: 8334379
Involuntary context switches: 863
While improving the framework speed for the TechEmpower benchmark we found that on modern hardware the libc memory manager is faster (sometimes 3x faster) compared to mORMot2 x64mm, but there are some errors in mORMot2 that cause an AV while using the glibc MM.
This thread is for finding and solving such errors.
The default FPC CMem unit is a bit old in its implementation (we could call the libc directly with no prefix - as SynFPCCMemAligned does), but even in this case there are errors.
My favorite tool, valgrind, can help us solve memory errors - see https://valgrind.org/docs/manual/quick-start.html
valgrind --leak-check=yes --track-origins=yes ./mormot2tests
My first attempts show many "Conditional jump or move depends on uninitialised value(s)" errors. Does mORMot expect the allocated memory to be zero-filled?
The current TFB round result is ready for mORMot - as expected, the /updates rate increased to 11k RPS (from 2-3k) because of the ORDER BY. All test results increased because hsoThreadSmooting was removed. After the round ends we will be #28.
Weights 1.000 1.737 21.745 4.077 68.363 0.163
Composite # JSON 1-query 20-query Fortunes Updates Plaintext Weighted score
38 mormot 731,119 308,233 19,074 288,432 3,431 2,423,283 3,486 2022-10-26 - 64 thread limitation
43 mormot 320,078 354,421 19,460 322,786 2,757 2,333,124 3,243 2022-11-13 - 112 thread (28CPU*4)
44 mormot 317,009 359,874 19,303 324,360 1,443 2,180,582 3,138 2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51 mormot 563,506 235,378 19,145 246,719 1,440 2,219,248 2,854 2022-12-01 - 112 thread (28CPU*4) CPU affinity
51 mormot 394,333 285,352 18,688 205,305 1,345 2,216,469 2,586 2022-12-22 - 112 threads CPU affinity + pthread_mutex
34 mormot 859,539 376,786 18,542 349,999 1,434 2,611,307 3,867 2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28 mormot 948,354 373,531 18,496 366,488 11,256 2,759,065 4,712 2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
@tbo - you are right. I tried with FPCMM_REPORTMEMORYLEAKS disabled and FPCMM_BOOSTER enabled (which disables FPCMM_DEBUG):
-dFPC_X64MM -dFPCMM_SERVER -dFPCMM_BOOSTER -dNOSYNDBZEOS -dNOSYNDBIBX
but the results are nearly the same as with the previous parameters.
@ab - I adapted SynFPCCMemAligned for mORMot2 but the auto-test (mormot2tests) still fails with "core dumped".
I can't reproduce the libc problems in the TFB bench on my PC, where I can debug, but the application crash is also easily reproducible in `mormot2tests` - when I put CMem as the first unit
uses
CMem,
{$I ..\src\mormot.uses.inc} // may include mormot.core.fpcx64mm.pas
and compile without any defines.
@ab - do we need to do something specific, like SynFPCCMemAligned in mORMot1, instead of using CMem?
As @ab suggested about a month ago, I tried the libc memory manager (uses cmem) instead of fpcx64mm in FPCMM_SERVER mode and...
all results for the libc MM are better on a modern CPU. On my PC it's a little better (10%), but on the server (28 Xeon cores) the results increase dramatically. The most valuable is /fortunes - from 180k RPS to 350k RPS. The other tests gain about +20%, for example /rawfortunes - from 350k to 408k.
Unfortunately there is a floating AV that happens very rarely; I will try to find it after the blackout. I will also prepare detailed endpoint statistics in cmem mode for comparison.
P.S.
in cmem mode the AV occurs for /plaintext in pipelining mode
BTW - the names of the TDynArraySortCompare SortDynArray* function family (SortDynArrayInt64, ...) are very confusing (the DynArray part of the name). I understand the reason for such naming - it follows FPC's TListSortCompare - but in fact these functions are not related to DynArray at all: they can compare more than DynArray elements, and not only for sorting purposes, right?
Maybe (while mORMot2 is not yet released) rename all of them to Compare* (CompareInt64, ...) and the family itself to TCompare. Or TfnCompare -> fnCompareInt64, fnCompareQWord etc. to prevent possible conflicts.
After "several ORM raw optimizations" /db performance increased by +2000 RPS (mostly because fCache.NotifyAllFields is now not called for non-cached entities, as far as I understand).
Here is the server-side valgrind profile data for cached-queries?count=20 for mORMot 8765c931.
Maybe you will find some optimization ideas there.
Since I cannot reproduce the pure cached-queries performance on my server, I think the results differ on the Citrine hardware because of something like the CPU cache.
I do not measure a meaningful performance difference between the ORM with ID and without ID. But in the case of the ORM without ID we break the rules. When we get to the top, someone will look at the source and say that this is not "fair play".
I propose to roll back the ORM as well. In fact, for Postgres, having the primary key in the select fields affects only serialization and a tiny amount of traffic - the PK value is always in the buffer cache...
Tested the improved TOrmCache.Retrieve - it gives a small but measurable improvement for /db performance (~ +3000 RPS).
If you don't mind, I prefer to wait for the result of the next round (without hsoThreadSmooting) to decide whether we need it or not before making a new PR.
And during this time, I might find a way to reproduce the cached-queries problem....
The current round does *NOT* include the last PR, where we remove hsoThreadSmooting and add `order by` for updates - let's wait for the next round results, ~ on 2023-01-28.
I am also worried about the cached-queries performance. The bad thing is that I can't reproduce it on my server. Independent of hsoThreadSmooting I always get good numbers, ~400k for cached-queries?count=20.
The only difference is that, in contrast to Citrine, I execute both wrk and the app on the same server over loopback - maybe this is the reason.
Solved TFB /rawqueries and /rawupdates performance problems:
- the PostgreSQL pipeline mode has been rethought - it is better to use a Flush after each statement instead of a PipelineSync after the last statement; in this case the server starts to execute queries ASAP;
- use a ::bigint[] instead of a ::NUMERIC[] typecast in /rawupdates - NUMERIC is a variable-precision decimal type, but we need Int64; added ORDER BY id to minimize lock waits (as in the ORM)
See mORMot PR #140
Now I expect /rawupdates and /rawqueries to be in the top 10.
At least on my server hardware the results are:
- 52000 RPS for /rawqueries?queries=20
- 26000 RPS for /rawupdates?queries=20
Will test once more after @ab merges PR#140, and then prepare a merge request to TFB...
Also a small performance tip to be fixed: TRestOrm.Retrieve calls Model.GetTableIndexExisting twice - once inside Retrieve itself and a second time inside fCache.Retrieve.
@ab - maybe you could add an optional third parameter tblIdx to Cache.Retrieve?
PR 7860 is merged by the TFB team. Now we are waiting for the results (~ on 2023-01-28). From my approximations /updates should be around 12000, which gives +650 to the composite score; /json is also improved; for the other endpoints the results depend on the reaction to removing `hsoThreadSmooting`.
BTW, in the last round:
- mORMot is the #2 ORM in the /db test! (Orm=full). First is the Rust-based `xitca-web`
- #2 ORM in the /fortunes test (actually #3, but I exclude lithium - it's not an ORM) - first is `asp.net core`
Not bad, not bad...
Yes, the updated HTTP headers parser helps by a few % for both /json and /plaintext. And with these changes there is no reason to verify with hsoHeadersInterning disabled. I switched TFB PR 7860 to this mORMot version (it is not merged yet, so waiting for the next round).
In PR 7860 I also removed hsoThreadSmooting - it seems that it improves only /plaintext, but is bad for the other endpoints.
I am still investigating the bad pipelining performance - no results yet, but at least I reproduced it on server hardware (it is not reproducible on my PC).
I traced /plaintext in pipelining mode on the server.
The program spends 12% of its time parsing headers; 7% is on retrieving the HOST header - on this line https://github.com/synopse/mORMot2/blob … .pas#L1284 (3.5% of that on string comparison -> THttpRequestContext.SetRawUtf -> function SortDynArrayPUtf8Char(const A, B): integer;)
When I replace the host by a const:
// 'HOST:'
Host := '10.0.0.1'; // GetTrimmed(P + 5, Host);
I get +200 000 RPS on /plaintext.
@ab, please see the valgrind file - https://drive.google.com/file/d/1G1527e … share_link (viewing instructions are inside the archive) - maybe you will have some optimization ideas.
I tested the new feature by assigning the connection to the thread's CustomObject and there is almost no difference from the threadvar version (at least with thread pool size = 28). So - you are right - better to keep ThreadSafeConnection for auto-deletion/reconnection etc.
Please look at https://github.com/synopse/mORMot2/pull/135/files. If you think that removing the lock is unsafe, we can at least use a ReadLock in the function body and a WriteLock in the TryPrepare sub-function.
TFB PR 7860 is ready.
Using a threadvar improves performance by ~1%. The previous implementation does not lock - it loops over all connections in the thread pool. And that takes some time...
In the case of the async thread pool we have a TAsyncConnectionsThread.Index.
And we know the threadPoolSize, so an efficient array-based implementation is possible (at least for the /raw* endpoints) - the problem is that Ctxt.ConnectionThread is not filled by AsyncConnection; nil is passed here. Can it be filled?
I detected a bottleneck - it's in the ThreadSafeConnection implementation.
The current implementation (under high load) loops almost every time to find the connection for the thread ID. An ideal implementation would use an array of fixed size (equal to the HTTP server thread count) and get the connection by thread index.
I have almost finished prototyping and gained at least 10% in /db performance, but the @#$%^&* russians turned off the electricity with their @#$%^&* missiles and I lost all my modifications.
@ab, what do you think about a fixed thread pool for DB connections?
Filled with nulls on start, and indexSafeConnection(threadIndex) instead of ThreadSafeConnection?
Results for the new round are ready. I prepared a historical overview of the composite results - the first row is the weight of each column:
Weights 1.000 1.737 21.745 4.077 68.363 0.163
Composite # JSON 1-query 20-query Fortunes Updates Plaintext Weighted score
38 mormot 731,119 308,233 19,074 288,432 3,431 2,423,283 3,486 2022-10-26 - 64 thread limitation
43 mormot 320,078 354,421 19,460 322,786 2,757 2,333,124 3,243 2022-11-13 - 112 thread (28CPU*4)
44 mormot 317,009 359,874 19,303 324,360 1,443 2,180,582 3,138 2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51 mormot 563,506 235,378 19,145 246,719 1,440 2,219,248 2,854 2022-12-01 - 112 thread (28CPU*4) CPU affinity
51 mormot 394,333 285,352 18,688 205,305 1,345 2,216,469 2,586 2022-12-22 - 112 threads CPU affinity + pthread_mutex
34 mormot 859,539 376,786 18,542 349,999 1,434 2,611,307 3,867 2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
As far as I understand, every new round starts from trunk, so it will include the PR with the new thread layout.
While we wait for the results of the new round, I will try to solve the problem with pipelining mode.
2. Next, in TRestServerUriContext.FillInput I changed limit from 128 parameters to 512 parameters, because I use DataTables Javascript library (https://datatables.net/). And it uses up to a dozen parameters per table column. So 128 is not enough.
In real life the URL length is limited (by web browsers, by proxies, by CDNs etc.). I've encountered such limitations several times. The best practice is to keep the URL length < 2000 characters (so 128 parameters is more than enough IMHO).
Looks like I fixed the update performance.
The reason is that we update the table in random order, and simultaneous updates lock each other (understood after reading an epic book from the postgrespro team https://postgrespro.ru/education/books/internals, unfortunately available only in Russian).
So I added an ORDER BY ID (alternatively we could sort by ID at the application server level, but at the database level it is easier) - see PR134
I set up an update test on a 48-core server - performance (with 512 concurrent connections and /updates?queries=20) increased from ~4k RPS to 16k RPS.
I'm thinking about updates. After a deep dive into the Postgres architecture I have some ideas; I will check them (in a few days) and report the results.
@ab, most open source maintainers are not as quick as you (thanks for your perfect support!)
So yes, we will wait for the next round...
The closest to the TFB environment is the Dual Xeon Raise-5 for $215/month - we can use the first node for the app server and the second for Postgres+wrk.
The core of the cloud provider business is to sell more resources than they have.
For servers with 32 physical cores they usually sell at least 32*4 vCPUs. We never know what tasks are running in the VMs on the same host where our VM is located, so test results will vary over a wide range.
This is why I prefer dedicated servers.
But you can try.
On the server I use for testing I have very limited rights. I can't verify updates, reconfigure postgres or tune the OS, so let's wait for the TFB round...
Such kinds of bugs often occur when a variable on the stack is of an unexpected size (see for example this topic - https://synopse.info/forum/viewtopic.ph … 798#p28798 and the fix - https://github.com/synopse/mORMot/pull/299/files).
If so, it can disappear when compiling in debug/release/different optimization levels, or just by changing the unit order. It is a symptom of "dangling pointers" or an incorrect variable size.
With hsoThreadSmooting but WITHOUT hsoThreadCpuAffinity the results are a little better:
/json 1 175 000
/db 452 000
/fortunes 181 000
/cached-queries?count=20 612 000
/queries?queries=20 33 000
/plaintext 1 335 000
/plaintext 2 900 000 (pipelining 1024 concurrent conns)
/plaintext 2 761 000 (pipelining 16384 concurrent conns)
so I removed the "if servers = 1 then" check to always enable it, and made an MR to TFB.
Just finished verifying WITHOUT hsoThreadCpuAffinity.
For 28 CORES (as on the TFB server), in 28 thread * 6 server mode with PostgreSQL on the same server (unfortunately I can't test pipelining mode because libpq is v12), the results are:
/json 1 218 000
/db 444 000
/fortunes 182 000
/cached-queries?count=20 498 000
/queries?queries=20 35 000
/plaintext 1 314 000
/plaintext 2 830 000 (pipelining 1024 concurrent conns)
/plaintext 2 700 000 (pipelining 16384 concurrent conns)
Will try hsoThreadSmooting...
I tried to set up the DB on the server - there are some problems there...
With the new pool of asynchronous threads the results are worse than with the previous one (in all cases):
~280k for 64 threads and 1 server (vs 350k with the old algorithm), 720k in 24 thread * 6 server mode (vs 1050k).
CPU load is strange - see this picture for 6 server mode - https://drive.google.com/file/d/1jPiYQI … share_link
The best /json and /plaintext performance for 24 cores is reached in
num CPU=24, num thread=24, num servers=6, total workers=144 mode
/json 1 050 000 RPS
/plaintext 1 110 000 RPS
/plaintext 2 410 000 RPS in pipelining mode
P.S. - tried before commit 54090417
@ab, please add an option hsoReusePort for THttpAsyncServer, and in function NewSocket at line 1733, before the bind:
// Server-side binding/listening of the socket to the address:port
v := 1;
sock.SetOpt(SOL_SOCKET, SO_REUSEPORT, @v, SizeOf(v));
if (bind(sock.Socket, @addr, addr.Size) <> NO_ERROR)
I create multiple server instances inside a single process:
// rawServer := TRawAsyncServer.Create(threads);
setLength(rawServers, servers);
for i := 0 to servers-1 do
rawServers[i] := TRawAsyncServer.Create(threads);
For THttpAsyncServer running on localhost:8080 with num CPU=24, num thread=48, num servers=4, total workers=192
I got
/json = 937 651 RPS
/plaintext = 1 008 705 RPS
Will prepare a true test to be executed with the tfb command ASAP (we are under massive missile attack right now, so the electricity can go off at any time).
An experiment: I added SO_REUSEPORT to the listening socket of the async server - now it's possible to run several instances of the server on the same port, and the kernel redistributes incoming connections between them.
And YES!!! for /plaintext (instances are limited to the first 24 CPU cores) with 256 concurrent connections:
- 1 server with 192 threads = 564k RPS
- 2 servers with 96 threads = 750k RPS - faster than one
- 3 servers with 64 threads = 790k RPS - the same total number of worker threads, but redistributed across 3 thread pools
- 4 servers with 48 threads = 920k RPS (~840k for /json)
Here is a picture for 4 servers - https://drive.google.com/file/d/1e3Qgwe … share_link
For 4 server mode and 512 concurrent connections /json is ~940k, so the problem is not in the per-process epoll size.
It's definitely somewhere in the async server queue implementation.
With THttpServer and 256 connections I got ~865k RPS and 100% CPU load on /json (vs ~370k RPS and 45% CPU for the async server).
With 512 connections there is ~750k RPS.
For more than 512 connections I got too many read errors (it does not depend on HttpQueueLength - verified with 0 and with 10000).
For /plaintext results are:
- 256 conn = 1 050 000 RPS
(vs ~370k RPS for async)
- 512 conn = 854 000
/plaintext pipelining mode fails (as expected; I wonder that it works for the async server).
So - the memory manager is not the bottleneck.
I am sure this is not a syscall limitation, because on the same hardware nodejs (in cluster mode, limited to 16 cores) shows better results for /json (500k RPS). And wrk stays at the same load.
Node creates a listening socket per fork using SO_REUSEPORT https://lwn.net/Articles/542629/. Maybe this is the way to go?
PS
Will try CMem after the blackout, but I think the memory manager is not the bottleneck.
With /plaintext the picture is the same as with /json - most of the time most of the threads are in the S (sleep) state.
And the max RPS is the same, ~370 000 (2.4 GHz CPUs).
BUT! /plaintext in pipelining mode (I just found how they do it):
wrk -H 'Host: 10.0.0.1' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 256 --timeout 8 -t 28 http://10.0.0.1:8080/plaintext -s pipeline.lua -- 16
where pipeline.lua is:
init = function(args)
local r = {}
local depth = tonumber(args[1]) or 1
for i=1,depth do
r[i] = wrk.format()
end
req = table.concat(r)
end
request = function()
return req
end
RPS is ~2 mln and our server consumes 100% of each core.
BTW - for /plaintext TFB runs 256, 1,024, 4,096 and 16,384 connections (see the Data Table section in the plaintext visualization).
How do I enable the glibc memory manager?
The TFB round ended, but the CPU statistics are empty.
Anyway, I found hardware similar to TFB (2 CPUs, 24 cores each) and reproduced the problem with /json.
I removed the hsoThreadCpuAffinity option and bound the server to the first 16 cores and wrk to the last 16 cores using taskset:
taskset 0xFFFF00000000 wrk...
taskset 0xFFFF raw
The two screenshots below (one for "raw 32" and one for "raw 64") show that our server does not consume the full CPU power. RPS is ~200k for 32 threads vs ~370k for 64 threads, but the CPU consumption is nearly the same.
The problem is:
- not in the memory manager - I tried all possible combinations (default MM, FPC_X64MM, FPCMM_SERVER)
- not in the pthread/spin locks - I tried with the current and old mORMot2
It seems that the gap is in "waits" inside the workers...
Default MM 32 thread - https://drive.google.com/file/d/1JMBYxp … share_link
Default MM 64 thread - https://drive.google.com/file/d/1df9njW … share_link
P.S.
Setting fHttpServer.HttpQueueLength := 0; does not change performance; only a small number (~300) of read errors occurs in wrk.
P.S.2
Increasing the number of available cores
taskset 0xFFFFFF raw 64
decreases the RPS (~10%).
P.S.3
binding the thread affinity to the first 16 cores using the hack
SystemInfo.dwNumberOfProcessors := 16;
// + enabling hsoThreadCpuAffinity
does not change anything (a small perf increase).
P.S.5
Using poll instead of epoll - nearly the same results (for 256 connections).
I will include the UriRouter. Let's wait for the end of the TFB round to see the stats.
I already checked the FPC MM some months ago; on 12 cores the mormot MM is faster...
The MR is ready. Unfortunately the last two Citrine rounds are stuck - see this issue. The reason is a "power problem". I hope their servers are not in Ukraine.
In fact, it is better to rely on the OS than let our TLightLock spin
I completely agree.
Just tested it with the latest commit on my PC (12 cores) - the results are a little better in all tests (I think I am CPU-bound because app/DB/wrk are on the same PC).
The very good news is that for all tests the RPS increases together with the wrk concurrency.
So I am starting to prepare a TFB MR with the current sources, threadPoolSize = CPU*4 and hsoThreadCpuAffinity enabled (with hsoThreadCpuAffinity disabled I still see a huge amount of cpu-migrations).
I will test it first in my env. Hopefully this evening.
I found that for completed rounds TFB publishes detailed statistics - the "details" link above "visualize" - here it is for the last completed round - https://tfb-status.techempower.com/unzi … /mormot/db
The only interesting thing I see is "Total CPU usage".
For example for cachedQueries it is here - https://tfb-status.techempower.com/unzi … s.txt.json
The first number is the # of the wrk execution from raw.txt (the first two, "Running Primer..." and "Running Warmup", are ignored), so the stat for "cached-queries?count=20" is at #2 - check the timestamps.
For "cached-queries?count=20" the statistics show
total cpu usage
sys 1.786
stl 0
idl 0
usr 98.214
which is strange...
For /db 38% is idle, which is strange too.