We may have to investigate the mORMot locks too.
The async server uses its own set of locks per connection (one for reads, one for writes). These are home-made locks, which spin and then call fpsleep() after a while.
They are meant to be small and efficient, but perhaps they don't scale well when there are a lot of threads involved, i.e. a lot of threads waiting to access the queue.
Maybe a true lock-free ring buffer would help.
But CPU affinity is also a good avenue to explore.
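To give an idea of the pattern, here is a minimal sketch of the spin-then-sleep idea - NOT the actual mORMot locks; the spin threshold and the Sleep(0) fallback are arbitrary assumptions:

program SpinThenSleep;

{$mode objfpc}

uses
  sysutils; // Sleep()

var
  LockFlag: longint = 0; // 0 = free, 1 = taken

procedure TinyLock;
var
  spin: integer;
begin
  spin := 0;
  // try to atomically flip 0 -> 1; spin for a while, then yield the CPU
  while InterlockedCompareExchange(LockFlag, 1, 0) <> 0 do
  begin
    inc(spin);
    if spin > 1000 then
      Sleep(0); // after spinning, let the scheduler run another thread
  end;
end;

procedure TinyUnLock;
begin
  InterlockedExchange(LockFlag, 0); // release with an atomic store
end;

begin
  TinyLock;
  writeln('inside the protected section');
  TinyUnLock;
end.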
Offline
You may try https://github.com/synopse/mORMot2/commit/46d19d8e
I only have a single CPU socket system, so I was not able to see if it helps...
Edit 1: Current code won't work as expected on Linux because pthread_setaffinity_np() is about logical cores, not HW sockets.
I will fix it.
Edit 2: It should now work as expected, thanks to https://github.com/synopse/mORMot2/commit/b5e6868f
But I still have a doubt about the "physical id: ###" lines. I expected 0, 1... but it may be 0, 3...
Could you post the content of /proc/cpuinfo on a dual socket system?
Offline
@ab - as far as I understand, you are slightly mistaken about pthread_attr_setaffinity_np - it sets the affinity mask for logical cores (including hyper-threaded ones), not for physical CPU sockets. In other words, the cores are those listed by:
ls -dl /sys/devices/system/cpu/cpu[0-9]*
My server is off due to the electricity situation (I have no idea when I can turn it back on, the country is in power-saving mode), but most of my previous tests were done on my PC with one socket / 6 physical / 12 logical cores (cpu0 - cpu11 for the command above).
And perf stat's cpu-migrations counts migrations across cores, too.
Last edited by mpv (2022-11-28 19:02:42)
Offline
Yes, this mistake was fixed.
See my Edit 2.
It should work with both Windows and Linux.
And now I should have fixed the "physical id: ###" lines parsing error.
Physical CPU sockets should be properly identified now.
Please see https://github.com/synopse/mORMot2/commit/3d9feaa4
So do you mean that with a single socket, you had the problem?
Offline
That's not what I mean. Instead of
if CpuSockets > 1 then
  ...
use
for i := 0 to aThreadPoolCount - 1 do
  ok := SetThreadCpuAffinity(fThreads[i], i mod SystemInfo.dwNumberOfProcessors);
This is exactly what h2o does, as far as I understand.
After my changes I get the same ~700k RPS on /plaintext for both the 96-thread and 24-thread modes!
But I am not sure it is applicable to real production.
Offline
This is how htop looks for 96 threads with affinity masks - https://drive.google.com/file/d/166Op86 … sp=sharing
and this is perf stat
$ perf stat ./raw 96
THttpAsyncServer running on localhost:8080; num thread=96 db=PostgreSQL
Performance counter stats for './raw 96':
25588,36 msec task-clock # 1,676 CPUs utilized
472073 context-switches # 18,449 K/sec
92 cpu-migrations # 3,595 /sec
1430 page-faults # 55,885 /sec
3,859381000 seconds user
22,026711000 seconds sys
I like these statistics.
In my opinion, binding threads to CPUs is OK for TFB, because all threads handle nearly the same type of load, only our application is running on the host, and so on. But I am not sure it is a valid approach in real life. Or is it? Maybe add an option for this?
Last edited by mpv (2022-11-28 19:38:44)
Offline
Please try https://github.com/synopse/mORMot2/commit/65dfc3ae
You can set the new hsoThreadCpuAffinity option to tune the thread pool core affinity with the socket-based servers.
See https://github.com/synopse/mORMot2/comm … 8f0f747eeb
I have measured a slight performance penalty for the DB-related queries...
Offline
Now everything works as expected. Thanks! From my tests, DB-related queries even become a little faster. Could you please make a mORMot release with SQLite 3.40.0, as wanted by the latest sources, because the current 2.0.4148 release contains SQLite 3.39.4.
Offline
@mpv, can you push to TFB a version with threads = CPU cores, to see the performance for /json and /plaintext (to test the HTTP server performance without the DB)? I expect better performance with the DB also. Looking at https://techcommunity.microsoft.com/t5/ … 06462#fn:1 -
in a test on a 96 CPU core system, Postgres has its best performance at 100 clients: 1,538,186 tps (100%); x4 = 400 clients: 1,305,039 tps (85%); x5 = 500 clients: 1,390,359 tps (90%).
Offline
I pushed a version with CPU*4 threads + affinity. I think we need results comparable with the previous tests, to see whether affinity really helps or not.
BTW, @ttomas posted a link to a similar benchmark for PG on AWS (https://pganalyze.com/blog/postgres-14- … monitoring), but from my point of view and my tests we need at least workers = CPUs * 2, because with workers = 1 * CPUs, PostgreSQL would wait at least half of the time (while mORMot parses the HTTP request, serializes/deserializes results and sends the response).
I set CPUs * 4 because, according to Valgrind profiling, the PG part of /db takes ~25% of all the work (before the affinity fix).
Let's wait for PR 7755 results
Offline
I checked the latest results on their test server yesterday.
It is better with the thread affinity, but there are still some very weird issues like the "cached queries" still being very slow.
Offline
It is better with the thread affinity
Just a note: the previous test used threads = CPU*5, the last one CPU*4. We can't compare the results.
Offline
I found that for completed rounds TFB publishes detailed statistics - the "details" link above "visualize". Here it is for the last completed round - https://tfb-status.techempower.com/unzi … /mormot/db
The only interesting thing I see is "Total CPU usage".
For example, for cachedQueries it is here - https://tfb-status.techempower.com/unzi … s.txt.json
The first number is the # of the wrk execution from raw.txt (the first two, "Running Primer..." and "Running Warmup", are ignored), so the stats for "cached-queries?count=20" are at #2 - matching the timestamps.
For "cached-queries?count=20" the statistics show
total cpu usage
sys 1.786
stl 0
idl 0
usr 98.214
which is strange...
For /db, 38% is idle, which is strange too.
Offline
There was a problem if a lot of threads were trying to get the cache at the same time.
The TLightLock is not meant for that, so a lot of time was spent spinning the CPU in user land.
Note that such an ORM usage does not seem very realistic.
But it may happen, and so much spinning should be avoided for sure.
https://github.com/synopse/mORMot2/comm … 19ff2e806e
would change it into a TRWLock which will allow parallel search in the cache, without any spinning.
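For context, here is a minimal sketch of how a multi-read/single-write lock allows parallel lookups; the ReadOnlyLock/ReadOnlyUnLock/WriteLock/WriteUnLock names are assumptions - check the actual TRWLock API in mormot.core.os:

uses
  mormot.core.base,
  mormot.core.os;

var
  cacheSafe: TRWLock;        // several concurrent readers, exclusive writer
  cache: array of RawUtf8;

function CacheLookup(index: PtrInt): RawUtf8;
begin
  cacheSafe.ReadOnlyLock;    // parallel searches do not block each other
  try
    result := cache[index];
  finally
    cacheSafe.ReadOnlyUnLock;
  end;
end;

procedure CacheStore(index: PtrInt; const value: RawUtf8);
begin
  cacheSafe.WriteLock;       // exclusive, only while actually updating
  try
    cache[index] := value;
  finally
    cacheSafe.WriteUnLock;
  end;
end;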
Offline
https://github.com/synopse/mORMot2/comm … 19ff2e806e
would change it into a TRWLock which will allow parallel search in the cache, without any spinning.
mormot.rest.server also needs 3 renames from TRestCache to TOrmCache.
Offline
@dcoun
A file was missing from the commit - it should be fixed now.
And I changed the TOrmCache to use binary serialization instead of JSON: it gives even better numbers on my PC.
https://github.com/synopse/mORMot2/commit/2c8d7262
Offline
After investigation, I decided to go back to an OS lock for the internal HTTP sockets pending events list.
https://github.com/synopse/mORMot2/commit/b5ac514e
It is not slower with my 2-core CPU, and I guess it would scale much better with a high number of cores and a high number of threads.
Our TLightLock was likely to be spinning a lot.
We should try this new version instead on the new Citrine round.
Offline
You are right.
Delphi compilation should be fixed by https://github.com/synopse/mORMot2/commit/b1b8f5c2
Offline
I have introduced a new TOSLightLock wrapper.
It uses Slim Reader/Writer (SRW) locks on Windows, or directly calls the pthread_mutex*() API on Linux, without the overhead of the cthreads recursive mutexes and the TRTLCriticalSection redirection.
I have seen a 15% performance increase on the /plaintext benchmark.
And I suspect it should fix some scaling issues we have with a high number of threads over a high number of CPU cores.
In fact, it is better to rely on the OS than let our TLightLock spin.
I would advise using TOSLightLock when contention could happen, and TLightLock only when contention is unlikely, i.e. when the protected code executes in O(1) in a few cycles, or when it protects a one-time initialization process.
I have made a big code review of the whole framework to ensure TOSLightLock/TLightLock/TRWLightLock/TSynLocker are each used where appropriate.
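A short usage sketch, assuming the record exposes Init/Done/Lock/UnLock like the other mORMot locks - check mormot.core.os for the exact API:

program OsLockDemo;

{$mode objfpc}

uses
  mormot.core.os;

var
  safe: TOSLightLock; // SRW lock on Windows, a raw pthread mutex on Linux
  counter: integer = 0;

procedure SafeIncrement;
begin
  safe.Lock;   // may block in the kernel under contention, instead of spinning
  try
    inc(counter);
  finally
    safe.UnLock;
  end;
end;

begin
  safe.Init;   // the OS handle needs explicit initialization...
  SafeIncrement;
  safe.Done;   // ...and finalization
end.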
Offline
In fact, it is better to rely on the OS than let our TLightLock spin
I completely agree.
I just tested it with the latest commit on my PC (12 cores) - results are a little better on all tests (I think I am CPU-bound because the app/DB/wrk all run on the same PC).
The very good news is that for all tests, RPS increases together with wrk concurrency.
So I am starting to prepare a TFB MR with the current sources, threadPoolSize = CPU*4 and hsoThreadCpuAffinity enabled (with hsoThreadCpuAffinity disabled I still see a huge number of cpu-migrations).
Offline
The MR is ready. Unfortunately, the last two Citrine rounds are stuck - see this issue. The reason is a "power problem". I hope their servers are not in Ukraine.
Offline
I have modified the sample source to use our new TUriRouter.
https://blog.synopse.info/?post/2022/12 … -Christmas
Please check
https://github.com/synopse/mORMot2/commit/37f6e089
The resulting source is also 20 lines shorter.
In fact, it seems no slower than our previous IdemPPChar() routing scheme, on my machine at least.
The TFB requirements expect the use of the framework's canonical router and parameter extraction, which is now TUriRouter for those custom REST endpoints.
So we stick to the rules as it should be.
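As an illustration of the routing style - a hedged sketch only: the handler names are invented, and the exact Route.Get() signature should be checked against the sample in the commit above:

// hypothetical registration, loosely modelled on the TFB raw.pas sample
fHttpServer.Route.Get('/plaintext', DoPlaintext);  // static endpoint
fHttpServer.Route.Get('/json', DoJson);
// <id> is a routed parameter, extracted by TUriRouter for the handler
fHttpServer.Route.Get('/user/<id>', DoUser);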
@mpv You can try it, and include it if you find it interesting.
I looked at the current state: mORMot is still not at the rank we could expect.
I am also wondering if we should not try the default FPC memory manager on the TFB hardware: it uses a threadvar for small blocks, so it could scale better than our fpcx64mm with a very high number of cores...
Offline
Note that we could use the new SetOutJson() convenience method for our TFB sample.
Not faster, but perhaps cleaner and shorter in code.
https://github.com/synopse/mORMot2/comm … 6390aa64c3
Edit:
The new TUriRouter.RunMethods() could also be convenient for our use case.
Even fewer source code lines, using RTTI for the routing.
https://github.com/synopse/mORMot2/commit/e1804902
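A rough idea of what RTTI-based registration looks like - everything below is an assumption sketched from the description; the method signatures, the urmGet flag and the RunMethods() parameters should be verified against the commit:

type
  // published methods become GET routes named after the method, e.g. /json
  TRawAsyncServer = class
  published
    function json(ctxt: THttpServerRequestAbstract): cardinal;
    function plaintext(ctxt: THttpServerRequestAbstract): cardinal;
  end;

// register all published methods of self as GET endpoints in one call
fHttpServer.Route.RunMethods([urmGet], self);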
Offline
The TFB round ended, but the CPU statistics are empty.
Anyway, I found hardware similar to TFB (2 CPUs, 24 cores each) and reproduced the problem with /json.
I removed the hsoThreadCpuAffinity option and bound the server to the first 16 cores and wrk to the last 16 cores using taskset:
taskset 0xFFFF00000000 wrk...
taskset 0xFFFF raw
The two screenshots below (one for "raw 32" and one for "raw 64") show that our server does not consume the full CPU power. RPS is ~200k for 32 threads vs ~370k for 64 threads, but CPU consumption is nearly the same.
The problem is:
- not in the memory manager - I tried all possible combinations (default MM, FPC_X64MM, FPCMM_SERVER)
- not in the pthread/spin lock - I tried with both current and old mORMot2
It seems that the gap is in "waits" inside the workers...
Default MM 32 thread - https://drive.google.com/file/d/1JMBYxp … share_link
Default MM 64 thread - https://drive.google.com/file/d/1df9njW … share_link
P.S.
Setting fHttpServer.HttpQueueLength := 0; does not change performance; only a small number (~300) of read errors occurs in wrk.
P.S.2
Increasing the number of available cores
taskset 0xFFFFFF raw 64
decreases RPS (by ~10%).
P.S.3
Binding thread affinity to the first 16 cores using the hack
SystemInfo.dwNumberOfProcessors := 16;
// + enabling hsoThreadCpuAffinity
does not change anything (only a small perf increase).
P.S.5
Using poll instead of epoll gives nearly the same results (for 256 connections).
Last edited by mpv (2022-12-30 11:48:12)
Offline
And with the regular/venerable THttpServer instead of THttpAsyncServer?
It should create a thread per HTTP/1.1 kept alive client.
One thread per client is a waste of resources, but TFB does not make more than 512 concurrent clients at the same time, so I guess it could be good enough... 512 threads may not be too much for Linux.
Of course, we would have to look out for the PostgreSQL connections...
But at least we could see about /json issue.
Is it a /json issue only?
Or a /plaintext scaling problem too?
/json makes some memory allocations, whereas /plaintext does not.
On my old laptop with a 2-core / 4-thread Core i5, I reach more than 100,000 RPS with /plaintext.
So we could expect much higher numbers...
Last but not least, try with the glibc memory manager?
Offline
With /plaintext the picture is the same as with /json - most of the time, most of the threads are in the S (sleep) state.
And the max RPS is the same, ~370,000 (2.4 GHz CPUs).
BUT! /plaintext in pipelining mode (I just found how they do it):
wrk -H 'Host: 10.0.0.1' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 256 --timeout 8 -t 28 http://10.0.0.1:8080/plaintext -s pipeline.lua -- 16
where pipeline.lua is:
init = function(args)
  local r = {}
  local depth = tonumber(args[1]) or 1
  for i = 1, depth do
    r[i] = wrk.format()
  end
  req = table.concat(r)
end

request = function()
  return req
end
RPS is ~2 million and our server consumes 100% of each core.
BTW - for /plaintext TFB uses 256, 1,024, 4,096 and 16,384 connections (see the Data Table section in the plaintext visualization).
How do I enable the glibc memory manager?
Offline
My guess is that both the wrk and raw programs don't use 100% of each core because they are slowed down by the socket API.
The fact that both wrk and raw use only 40% of each core may be perfectly normal. There is very little logic involved in their HTTP client or server code, and what is slow is the syscalls accessing the sockets.
With pipelining there are fewer socket syscalls, because the buffers are filled with several pipelined requests per syscall. So fewer syscalls, more CPU processing, higher numbers. Makes sense to me. But I may be wrong.
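To make that concrete, a conceptual sketch - ExtractNextRequest/ProcessRequest are hypothetical helpers, not framework functions; only the syscall count is the point:

uses
  sockets; // fprecv/fpsend

var
  sock: longint;                    // an accepted connection
  buf: array[0..65535] of byte;
  bytes: longint;
  request, responses: ansistring;
begin
  // one fprecv() may deliver several pipelined requests at once
  bytes := fprecv(sock, @buf[0], SizeOf(buf), 0);          // 1 syscall for N requests
  responses := '';
  while ExtractNextRequest(buf, bytes, request) do         // memory scanning, no syscall
    responses := responses + ProcessRequest(request);
  fpsend(sock, pointer(responses), length(responses), 0);  // 1 syscall for N answers
end;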
For glibc you can use SynFPCCMemAligned.pas from mORMot 1 (it is a stand-alone unit).
Then define the -dFPC_NO_DEFAULT_MEMORYMANAGER -dFPC_SYNCMEM conditionals.
Or use the FPC RTL cmem unit (which should work directly).
Also with -dFPC_NO_DEFAULT_MEMORYMANAGER and with cmem in first position in the program uses clause, instead of fpcx64mm:
uses
  //{$I mormot.uses.inc} // include mormot.core.fpcx64mm
  cmem,
  cthreads,
  sysutils,
  classes,
  ....
and comment out the WriteHeapStatus() call at the end of the source.
On my computer, mormot fpcx64mm is slightly faster, but perhaps with more cores, the glibc memory manager may scale better.
Offline
I am sure this is not a syscall limitation, because on the same hardware node.js (in cluster mode, limited to 16 cores) shows better results for /json (500k RPS). And wrk stays at the same load.
Node creates a listening socket per fork using SO_REUSEPORT (https://lwn.net/Articles/542629/). Maybe this is the way to go?
PS
I will try cmem after the blackout, but I think the memory manager is not the bottleneck.
Last edited by mpv (2022-12-30 17:26:15)
Offline
AFAIR SO_REUSEPORT was to speed up accept(), or to multiplex it in the context of forking like in Node.
We don't have trouble with accept(), which is perfectly fine in terms of performance. Our accept() thread has nothing to do with the thread repartition, because we maintain our own thread pool.
If Node creates one fork per connection, we could try to create one thread per connection... 16,384 threads may not be too much for the HW (if there is enough memory).
And the old THttpServer should react similarly to the Node forks, because it blocks until there is something to read on the input socket.
Offline
With THttpServer and 256 connections I got ~865k RPS and 100% CPU load on /json (vs ~370k RPS and 45% CPU for AsyncServer)
With 512 connections there is ~750k RPS.
For more than 512 connections I got too many read errors (independent of HttpQueueLength - verified with 0 and with 10000).
For /plaintext results are:
- 256 conn = 1 050 000 RPS (vs ~370k RPS for async)
- 512 conn = 854 000
/plaintext pipelining mode fails (as expected; I am surprised it works for the async server).
So - the memory manager is not the bottleneck.
Last edited by mpv (2022-12-31 09:00:01)
Offline
An experiment: I added SO_REUSEPORT to the listening socket of the async server - now it is possible to run several instances of the server on the same port, and the kernel redistributes incoming connections between them.
And YES!!! For /plaintext (instances limited to the first 24 CPU cores) with 256 concurrent connections:
- 1 server with 192 threads = 564k RPS
- 2 servers with 96 threads = 750k RPS - faster than one
- 3 servers with 64 threads = 790k RPS - the same total number of working threads, but redistributed over 3 thread pools
- 4 servers with 48 threads = 920k RPS (~840k for /json)
Here is a picture for 4 servers - https://drive.google.com/file/d/1e3Qgwe … share_link
For the 4-server mode and 512 concurrent connections, /json is ~940k, so the problem is not the per-process epoll size.
It is definitely somewhere in the async server queue implementation.
Last edited by mpv (2022-12-31 10:37:24)
Offline
I created multiple server instances inside a single process:
// rawServer := TRawAsyncServer.Create(threads);
setLength(rawServers, servers);
for i := 0 to servers - 1 do
  rawServers[i] := TRawAsyncServer.Create(threads);
For "THttpAsyncServer running on localhost:8080; num CPU=24, num thread=48, num servers=4, total workers=192"
I got:
/json = 937 651 RPS
/plaintext = 1 008 705 RPS
I will prepare a true test to be executed with the tfb command ASAP (we are under massive missile attack right now, so the electricity can go off at any time).
Offline
@ab, please add an hsoReusePort option for THttpAsyncServer, and in function NewSocket, at line 1733 before bind:
// Server-side binding/listening of the socket to the address:port
v := 1;
sock.SetOpt(SOL_SOCKET, SO_REUSEPORT, @v, SizeOf(v));
if (bind(sock.Socket, @addr, addr.Size) <> NO_ERROR)
Last edited by mpv (2022-12-31 12:23:16)
Offline
In the meanwhile, I have rewritten the async thread pool.
https://github.com/synopse/mORMot2/commit/54090417
From my tests, it does not always wake up the sub-threads, but tries to leverage the lowest threads (e.g. R1, R2, R3...) to process more requests:
- on small load or quick responses (like /plaintext or /json), only the R1 thread is involved
- on slow processing (e.g. remote DB access), R1 is identified as blocking, and the R2..Rmax threads are awakened
Could you try it?
I will add SO_REUSEPORT anyway, as an hsoReusePort option. It could help in some cases, e.g. with multiple CPUs: we could mask one server process per CPU socket, then we could probably achieve very good performance...
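For instance, something like this - illustrative only, it assumes cores 0-23 sit on socket 0 and cores 24-47 on socket 1, as /proc/cpuinfo would confirm:
taskset -c 0-23 ./raw 24    # one instance bound to CPU socket 0
taskset -c 24-47 ./raw 24   # another instance bound to CPU socket 1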
Offline
The best /json and /plaintext performance for 24 cores is reached in the
num CPU=24, num thread=24, num servers=6, total workers=144
mode:
/json 1 050 000 RPS
/plaintext 1 110 000 RPS
/plaintext 2 410 000 RPS in pipelining mode
P.S. - tried before the 54090417 commit
Last edited by mpv (2022-12-31 14:39:16)
Offline
With the new asynchronous thread pool, the results are worse than with the previous one (in all cases):
~280k for 64 threads and 1 server (vs 350k with the old algorithm), 720k in 24 threads * 6 servers mode (vs 1050k).
CPU load is strange - see this picture for 6 server mode - https://drive.google.com/file/d/1jPiYQI … share_link
Last edited by mpv (2022-12-31 15:04:41)
Offline
Oops... so I will make the new async thread algorithm an option, to be enabled only if needed.
https://github.com/synopse/mORMot2/commit/a4bf3555
But on my Core i5 2 cores / 4 threads CPU, with wrk on localhost, I got 30% better results with the new algorithm: from 100K to 130K RPS... so it seems not so good for high-end CPUs...
I have added hsoReusePort/acoReusePort options:
https://github.com/synopse/mORMot2/commit/71abd980
It is likely to be the best solution for proper scaling, within the same process...
Can be enabled in the TFB sample: https://github.com/synopse/mORMot2/commit/be6bfbe3
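For reference, enabling those options when creating the server would look roughly like this - a sketch: the parameter order follows THttpServerSocketGeneric (port, OnStart, OnStop, process name, thread pool size, keep-alive timeout, options), but check raw.pas in the commits above for the real call:

fHttpServer := THttpAsyncServer.Create(
  '8080', nil, nil, '',
  threads,               // worker thread pool size
  30000,                 // keep-alive timeout in ms
  [hsoThreadSmooting,    // the new wake-up algorithm, now opt-in
   hsoReusePort]);       // SO_REUSEPORT: several listening instances on one port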
And what about the DB requests?
Offline
You did not set the hsoReusePort option in the raw.pas code....
So I am afraid several server instances won't start.
And I guess you should also make it download new mORMot source code that includes this option.
Best wishes and a peaceful year!
Edit: Could you try hsoThreadSmooting but WITHOUT hsoThreadCpuAffinity?
By design, hsoThreadSmooting focuses on the first thread of the pool, and hsoThreadCpuAffinity will pin it to core #0, so performance will surely not be good, especially with several bound instances.
My guess is that we had better disable hsoThreadCpuAffinity and let the system work as it wants, especially if the number of threads equals the number of cores.
See https://github.com/synopse/mORMot2/commit/de25ae6f
Offline
I just finished verifying WITHOUT hsoThreadCpuAffinity.
For 28 cores (as on the TFB server) in 28 threads * 6 servers mode, with PostgreSQL on the same server (unfortunately I can't test pipelining mode because libpq is v12), the results are:
/json 1 218 000
/db 444 000
/fortunes 182 000
/cached-queries?count=20 498 000
/queries?queries=20 35 000
/plaintext 1 314 000
/plaintext 2 830 000 (pipelining 1024 concurrent conns)
/plaintext 2 700 000 (pipelining 16384 concurrent conns)
I will try hsoThreadSmooting...
Offline
With hsoThreadSmooting but WITHOUT hsoThreadCpuAffinity the results are a little better:
/json 1 175 000
/db 452 000
/fortunes 181 000
/cached-queries?count=20 612 000
/queries?queries=20 33 000
/plaintext 1 335 000
/plaintext 2 900 000 (pipelining 1024 concurrent conns)
/plaintext 2 761 000 (pipelining 16384 concurrent conns)
So I removed the "if servers = 1 then" check to always enable it, and made an MR to TFB.
Last edited by mpv (2023-01-01 12:28:54)
Offline
Wow!
About /plaintext: 3 million requests per second seems promising...
And in fact, THttpAsyncServer seems to be "pipelining-ready", because it relies on its internal buffers for its reading process, so it behaves as a pipelined client expects.
What about /updates ?
We had pretty bad numbers on this test in https://www.techempower.com/benchmarks/ … est=update
I don't see what's wrong with this particular test...
And /rawqueries?queries=20 also seems to be pretty low (much lower than the ORM). Is our PostgreSQL pipelining wrong?
Let's wait and see if the new algorithms give better numbers...
Note that with the DB process, we may enhance performance with the hsoThreadSmooting option by adding some more threads to each server, giving more opportunities to wait for the DB to answer.
With the hsoThreadSmooting algorithm, the extra threads should not be used unless needed... but perhaps the DB access is fast enough not to trigger the thread repartition...
Offline