#151 2022-11-27 18:04:29

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

We may have to investigate the mORMot locks too.
The async server uses its own set of locks per connection (one for reads, one for writes). These are home-made locks, which spin and then call fpsleep() after a while.
They are meant to be small and efficient, but perhaps they don't scale well when a lot of threads are involved, i.e. a lot of threads waiting to access the queue.
Maybe a true lock-free ring would help.

But CPU affinity is also a good avenue to explore.
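
To make the spin-then-sleep pattern concrete, here is a minimal sketch in Python. It is not the mORMot code: the class name, the spin count and the sleep delay are assumptions chosen for illustration, and a threading.Lock stands in for the atomic flag a real implementation would use.

```python
import threading
import time


class SpinThenSleepLock:
    """Illustrative lock: spin briefly, then sleep between retries.

    Hypothetical sketch of the pattern only, not the mORMot implementation.
    """

    def __init__(self, spin_count=1000, sleep_seconds=0.0001):
        self._flag = threading.Lock()  # stands in for an atomic flag
        self._spin_count = spin_count
        self._sleep_seconds = sleep_seconds

    def acquire(self):
        spins = 0
        while not self._flag.acquire(blocking=False):
            spins += 1
            if spins > self._spin_count:
                # contention lasts too long: stop burning CPU and sleep
                time.sleep(self._sleep_seconds)

    def release(self):
        self._flag.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()


def contended_increment(thread_count=8, per_thread=2000):
    """Increment a shared counter from many threads under the lock."""
    lock = SpinThenSleepLock(spin_count=50)
    counter = {"value": 0}

    def worker():
        for _ in range(per_thread):
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

With many waiters, each of them either spins or sleeps and retries, so the cost grows with the number of threads competing for the same lock. That is the scaling concern raised above.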

Offline

#152 2022-11-28 17:01:44

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

You may try https://github.com/synopse/mORMot2/commit/46d19d8e

I only have a single-socket system, so I was not able to check whether it helps... ;)

Edit 1: Current code won't work as expected on Linux because pthread_setaffinity_np() is about logical cores, not HW sockets.
I will fix it.

Edit 2: It should now work as expected, thanks to https://github.com/synopse/mORMot2/commit/b5e6868f
But I still have a doubt about the "physical id: ###" lines. I expected 0, 1... but it may be 0, 3...
Could you post the content of /proc/cpuinfo on a dual socket system?
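
As an illustration of the parsing concern, here is a small Python sketch (not the mORMot parser). It assumes the usual x86 Linux layout of /proc/cpuinfo, with one "processor" line and one "physical id" line per logical CPU. The function names are hypothetical. The point is to count distinct "physical id" values instead of assuming they are 0, 1, 2...

```python
def logical_cpus_by_socket(cpuinfo_text):
    """Map each "physical id" value to the logical CPU numbers it hosts."""
    sockets = {}
    current_cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between CPU blocks
        key = key.strip()
        value = value.strip()
        if key == "processor":
            current_cpu = int(value)
        elif key == "physical id" and current_cpu is not None:
            # ids are used as opaque keys: they need not be contiguous
            sockets.setdefault(int(value), []).append(current_cpu)
    return sockets


def read_socket_layout(path="/proc/cpuinfo"):
    """Read the real file; returns an empty dict if no "physical id" exists."""
    with open(path, "r", encoding="utf-8") as f:
        return logical_cpus_by_socket(f.read())
```

The number of sockets is then len(result), whatever the actual id values are.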

Offline

#153 2022-11-28 19:01:11

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

@ab - as far as I understand, you are slightly mistaken about pthread_attr_setaffinity_np: it sets the affinity mask for logical cores (including hyperthreaded ones), not for physical CPUs. In other words, the cores are those listed by

ls -dl /sys/devices/system/cpu/cpu[0-9]*

My server is off because of the electricity situation (I do not know at all when I can turn it back on, the country is in power-saving mode), but I did most of my previous tests on my PC with one socket / 6 physical / 12 logical cores (cpu0 - cpu11 for the command above).

And the cpu-migrations counter of perf stat also counts migrations between logical cores.

Last edited by mpv (2022-11-28 19:02:42)

Offline

#154 2022-11-28 19:08:26

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

Yes, this mistake was fixed.
See my Edit 2. ;)
It should work with both Windows and Linux.

And now I should have fixed the "physical id: ###" lines parsing error.
Physical CPU sockets should be properly identified now.
Please see https://github.com/synopse/mORMot2/commit/3d9feaa4

So do you mean that with a single socket, you had the problem?

Offline

#155 2022-11-28 19:24:25

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

That is not what I mean. Instead of

if CpuSockets > 1 then
 ......

use

for i := 0 to aThreadPoolCount - 1 do
  ok := SetThreadCpuAffinity(fThreads[i], i mod SystemInfo.dwNumberOfProcessors);

This is exactly what h2o does, as far as I understand.

After my changes I get the same plaintext ~700k RPS in both the 96-thread and 24-thread modes!
But I am not sure it is applicable to real production :(
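
For readers who want to try the round-robin idea outside of mORMot, here is a hedged Python sketch for Linux. The function names are hypothetical. It relies on os.sched_setaffinity(0, ...) applying to the calling thread, and it pins each thread to one logical CPU picked as "index mod number of allowed CPUs".

```python
import os
import threading


def round_robin_plan(thread_count, cpus):
    """Return the CPU chosen for each thread index: cpus[i mod len(cpus)]."""
    ordered = sorted(cpus)
    return [ordered[i % len(ordered)] for i in range(thread_count)]


def run_pinned_threads(thread_count):
    """Start threads, pin each one to a single logical CPU, report the masks.

    Linux-only: relies on os.sched_setaffinity, where pid 0 means the caller.
    """
    allowed = os.sched_getaffinity(0)  # logical CPUs this process may use
    plan = round_robin_plan(thread_count, allowed)
    seen = [None] * thread_count

    def worker(index):
        os.sched_setaffinity(0, {plan[index]})  # applies to this thread only
        seen[index] = os.sched_getaffinity(0)

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(thread_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return plan, seen
```

Using the allowed set instead of the raw CPU count keeps the sketch valid when the process itself is restricted, e.g. by taskset or a container.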

Offline

#156 2022-11-28 19:28:57

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

This is how htop looks for 96 threads with affinity masks - https://drive.google.com/file/d/166Op86 … sp=sharing

and this is perf stat

$ perf stat ./raw 96
THttpAsyncServer running on localhost:8080; num thread=96 db=PostgreSQL
 Performance counter stats for './raw 96':

         25588,36 msec task-clock                #    1,676 CPUs utilized          
           472073      context-switches          #   18,449 K/sec                  
                92      cpu-migrations            #    3,595 /sec                   
             1430      page-faults               #   55,885 /sec                   
       3,859381000 seconds user
      22,026711000 seconds sys

I like these statistics.

To me, binding threads to CPUs is OK for TFB, because all threads handle nearly the same type of load at the same time, only our application is running on the host, and so on. But in real life I am not sure it is a valid approach. Or is it? Maybe add an option for this?

Last edited by mpv (2022-11-28 19:38:44)

Offline

#157 2022-11-28 20:50:28

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

Please try https://github.com/synopse/mORMot2/commit/65dfc3ae

You can set the new hsoThreadCpuAffinity option to tune the thread pool core affinity with the socket-based servers.
See https://github.com/synopse/mORMot2/comm … 8f0f747eeb

I have measured a slight performance penalty for the DB-related queries...

Offline

#158 2022-11-28 22:17:25

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

Now everything works as expected. Thanks! In my tests, DB-related queries even became a little faster. Could you please make a mORMot release with the sqlite 3.40.0 wanted by the latest sources? The current 2.0.4148 release contains sqlite 3.39.4.

Offline

#159 2022-11-29 11:20:26

ttomas
Member
Registered: 2013-03-08
Posts: 135

Re: High-performance frameworks

@mpv, can you push to TFB a version with threads = CPU cores, to see the performance for json and plaintext (to test HTTP server performance without the DB)? I expect better performance with the DB too. Looking at https://techcommunity.microsoft.com/t5/ … 06462#fn:1
in a test on a 96 CPU core system, postgres has its best performance at 100 clients with 1,538,186 tps (100%); at x4, 400 clients, it reaches 1,305,039 tps (85%); at x5, 500 clients, 1,390,359 tps (90%).

Offline

#160 2022-11-29 19:28:50

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

I pushed a version with CPU*4 + affinity. I think we need results comparable with the previous tests to see whether affinity really helps or not.
BTW @ttomas posted a link to a similar bench for PG on AWS (https://pganalyze.com/blog/postgres-14- … monitoring), but from my POV and my tests we need at least workers = CPUs * 2, because with workers = 1 * CPUs postgres will wait at least half of the time (while mORMot parses the HTTP request, serializes/deserializes the results and sends the response).
I set CPUs * 4 because, in the valgrind profiling results for /db, the PG part takes ~25% of all work (before the affinity fix).
Let's wait for the PR 7755 results.
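
The sizing rule of thumb above can be written as a tiny calculation. This is only a sketch of that reasoning, with hypothetical function names, and it ignores lock contention and network latency: if the DB accounts for a fraction f of the time spent on a request, then about 1/f workers per core are needed so that, on average, one of them is waiting on the DB.

```python
import math


def worker_multiplier(db_fraction):
    """Workers per core so that, on average, one of them is in the DB."""
    if not 0.0 < db_fraction <= 1.0:
        raise ValueError("db_fraction must be in (0, 1]")
    return math.ceil(1.0 / db_fraction)


def thread_pool_size(cpu_count, db_fraction):
    """Thread pool size following the CPUs * multiplier rule of thumb."""
    return cpu_count * worker_multiplier(db_fraction)
```

With the ~25% measured for /db this gives CPUs * 4, and with a 50% share it gives CPUs * 2.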

Offline

#161 2022-12-07 13:43:35

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

Yesterday I checked the latest results on their test server.

It is better with the thread affinity, but there are still some very weird issues like the "cached queries" still being very slow.

Offline

#162 2022-12-07 16:44:23

ttomas
Member
Registered: 2013-03-08
Posts: 135

Re: High-performance frameworks

ab wrote:

It is better with the thread affinity

Just note: the previous test used threads=CPU*5, the last one CPU*4. We can't compare the results.

Offline

#163 2022-12-08 14:27:23

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

I found that, for completed rounds, TFB publishes detailed statistics behind the "details" link above "visualize". Here it is for the last completed round: https://tfb-status.techempower.com/unzi … /mormot/db
The only interesting thing I see is "Total CPU usage".

For example, for cachedQueries it is here: https://tfb-status.techempower.com/unzi … s.txt.json
The first number is the index of the wrk execution in raw.txt (the first two, "Running Primer..." and "Running Warmup", are ignored), so the stats for "cached-queries?count=20" are under #2 - when timestamps

For "cached-queries?count=20" the statistics show

 
total cpu usage	
sys	1.786
stl	0
idl	0
usr	98.214

which is strange...

For db, 38% is idle, which is strange too.

Offline

#164 2022-12-08 20:35:43

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

There was a problem when a lot of threads were trying to access the cache at the same time.
The TLightLock is not meant for that: a lot of time was spent spinning the CPU in user land.

Note that such ORM usage does not seem very realistic.
But it may happen, and so much spinning should be avoided for sure.

https://github.com/synopse/mORMot2/comm … 19ff2e806e
would change it into a TRWLock which will allow parallel search in the cache, without any spinning.
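
To illustrate why a read/write lock suits a cache that is mostly searched, here is a minimal Python sketch (not TRWLock itself; the class and method names are hypothetical). Readers share the lock, a writer gets it exclusively, and waiters block on a condition variable instead of spinning. This simple version does not give writers priority, so a steady stream of readers could starve a writer.

```python
import threading


class ReadWriteLock:
    """Many concurrent readers or one exclusive writer (blocking, no spin)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # a waiting writer may proceed

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers > 0:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()


class Cache:
    """Dictionary cache: parallel lookups, exclusive updates."""

    def __init__(self):
        self._lock = ReadWriteLock()
        self._items = {}

    def get(self, key):
        self._lock.acquire_read()
        try:
            return self._items.get(key)
        finally:
            self._lock.release_read()

    def put(self, key, value):
        self._lock.acquire_write()
        try:
            self._items[key] = value
        finally:
            self._lock.release_write()
```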

Offline

#165 2022-12-08 21:17:38

dcoun
Member
From: Crete, Greece
Registered: 2020-02-18
Posts: 430

Re: High-performance frameworks

ab wrote:

https://github.com/synopse/mORMot2/comm … 19ff2e806e
would change it into a TRWLock which will allow parallel search in the cache, without any spinning.

mormot.rest.server also needs 3 renames from TRestCache to TOrmCache

Offline

#166 2022-12-09 11:12:25

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

@dcoun
Missing a file commit - should be fixed now.

And I changed the TOrmCache to use binary serialization instead of JSON: it gives even better numbers on my PC.
https://github.com/synopse/mORMot2/commit/2c8d7262

Offline

#167 2022-12-15 20:45:08

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

After investigation, I decided to go back to an OS lock for the internal HTTP sockets pending events list.
https://github.com/synopse/mORMot2/commit/b5ac514e

It is not slower with my 2 cores CPU, and I guess it would be much better for scaling with a high number of cores, and a high number of threads.
Our TLightLock was likely to be spinning a lot.

We should try this new version instead on the new Citrine round.

Offline

#168 2022-12-16 05:54:01

sakura
Member
From: Germany
Registered: 2018-02-21
Posts: 239
Website

Re: High-performance frameworks

Leads to a compiler error atm.

When using {$ifdef USE_WINIOCP}, in mormot.core.thread, line 3216 you call fSafe.Init, which is not defined for USE_WINIOCP.

Offline

#169 2022-12-16 08:01:25

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

You are right.

Delphi compilation should be fixed by https://github.com/synopse/mORMot2/commit/b1b8f5c2

Offline

#170 2022-12-16 08:18:10

sakura
Member
From: Germany
Registered: 2018-02-21
Posts: 239
Website

Re: High-performance frameworks

It is, thanks!

Offline

#171 2022-12-16 12:33:07

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

I will test it first on my env. Hopefully this evening.

Offline

#172 2022-12-16 14:19:51

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

I have introduced a new TOSLightLock wrapper.

It uses Slim Multi-Read Single-Write locks on Windows, or directly calls the pthread_mutex*() functions on Linux, without the overhead of the cthreads recursive mutexes and the TRTLCriticalSection redirection.
I have seen a 15% performance increase on the /plaintext benchmark.

And I suspect it should fix some scaling issues we have with a high number of threads over a high number of CPU cores.
In fact, it is better to rely on the OS than let our TLightLock spin.
I would advise using TOSLightLock when contention could happen, and TLightLock only when contention is unlikely, i.e. when the protected code executes in O(1) in a few cycles, or when it protects a one-time initialization process.

I have made a big code review of the whole framework to ensure TOSLightLock/TLightLock/TRWLightLock/TSynLocker are used on purpose.

Offline

#173 2022-12-16 19:18:10

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

ab wrote:

In fact, it is better to rely on the OS than let our TLightLock spin

I completely agree.

I just tested it with the latest commit on my PC (12 cores): the results are a little better in all tests (I think I am CPU-bound, because the app, the DB and wrk are on the same PC).
The very good news is that, for all tests, RPS now increases together with wrk concurrency.
So I am starting to prepare a TFB MR with the current sources, threadPoolSize = CPU*4 and hsoThreadCpuAffinity enabled (with hsoThreadCpuAffinity disabled I still get a huge number of cpu-migrations).

Offline

#174 2022-12-16 20:45:20

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

@mpv
Sounds just fine - I think (and hope?) that it will give better numbers on the Citrine high-end CPU.

Offline

#175 2022-12-16 21:01:29

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

The MR is ready. Unfortunately the last two Citrine rounds are stuck - see this issue. The reason is a "power problem". I hope their servers are not in Ukraine :(

Offline

#176 2022-12-17 10:24:35

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

So  perhaps the next round will come sooner and will include the pull request. :-)

Offline

#177 2022-12-26 17:20:11

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

I have modified the sample source to use our new TUriRouter.
https://blog.synopse.info/?post/2022/12 … -Christmas

Please check
https://github.com/synopse/mORMot2/commit/37f6e089
Resulting source is also 20 lines shorter.

In fact, it seems not slower than our previous IdemPPChar() routing scheme, on my machine at least.

The TEB requirements expect the framework's canonical router and parameter extraction to be used, which is now TUriRouter for those custom REST endpoints.
So we stick to the rules, as it should be. ;)

@mpv You can try it, and include it if you find it interesting.
I looked at the current state: mORMot is still not in the rank we could expect.
I am also wondering if we should not try the default FPC memory manager with the TEB hardware: it uses a threadvar for small blocks, so could scale better with a very high number of cores, than our fpcx64mm.... hmm

Offline

#178 2022-12-26 21:25:14

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

I will include UriRouter. Let's wait until the end of the TFB round to see the stats.
I already checked the FPC MM some months ago: on 12 cores the mORMot MM is faster...

Offline

#179 2022-12-27 17:28:47

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

Note that we could use the new SetOutJson() convenient method for our TFB sample.
Not faster, but perhaps cleaner and shorter in code.
https://github.com/synopse/mORMot2/comm … 6390aa64c3

Edit:
The new TUriRouter.RunMethods() could also be convenient for our use case.
It needs even fewer lines of source code, using RTTI for the routing. ;)
https://github.com/synopse/mORMot2/commit/e1804902

Offline

#180 2022-12-30 10:38:19

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

The TFB round has ended, but the CPU statistics are empty :(
Anyway, I found hardware similar to TFB's (2 CPUs, 24 cores each) and reproduced the problem with /json.

I removed the hsoThreadCpuAffinity option and bound the server to the first 16 cores and wrk to the last 16 cores using taskset:

taskset  0xFFFF00000000 wrk...
taskset  0xFFFF raw
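
For reference, the taskset masks are plain bitmasks where bit N selects logical CPU N. The small Python sketch below (helper names are hypothetical) shows how the two masks map to CPUs 0-15 and CPUs 32-47 on a 48-core machine.

```python
def cpu_mask(first_cpu, count):
    """Bitmask with `count` consecutive bits set, starting at bit `first_cpu`."""
    return ((1 << count) - 1) << first_cpu


def cpus_in_mask(mask):
    """List the logical CPU numbers selected by a taskset-style mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


if __name__ == "__main__":
    print(hex(cpu_mask(0, 16)))    # first 16 cores
    print(hex(cpu_mask(32, 16)))   # last 16 cores of a 48-core machine
```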

The two screenshots below (one for "raw 32" and one for "raw 64") show that our server does not consume the full CPU power. RPS is ~200k for 32 threads vs ~370k for 64 threads, but CPU consumption is nearly the same.
The problem is:
- not in the memory manager: I tried all possible combinations (default MM, FPC_X64MM, FPCMM_SERVER)
- not in the pthread/spin locks: I tried with both the current and the old mORMot2

It seems that the gap is in "waits" inside the workers...

Default MM 32 thread - https://drive.google.com/file/d/1JMBYxp … share_link
Default MM 64 thread - https://drive.google.com/file/d/1df9njW … share_link

P.S.
Setting fHttpServer.HttpQueueLength := 0; does not change performance; only a small number (~300) of read errors occur in wrk.

P.S.2
Increasing the number of available cores

taskset  0xFFFFFF raw 64

decreases RPS (by ~10%)

P.S.3
Binding thread affinity to the first 16 cores using a hack:

SystemInfo.dwNumberOfProcessors := 16;
// + enabling hsoThreadCpuAffinity

changes almost nothing (a small perf increase)

P.S.5
Using poll instead of epoll gives nearly the same results (for 256 connections).

Last edited by mpv (2022-12-30 11:48:12)

Offline

#181 2022-12-30 12:30:26

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

And with the regular/venerable THttpServer instead of THttpAsyncServer?
It should create a thread per HTTP/1.1 kept alive client.
One thread per client is a waste of resources, but TFB does not use more than 512 concurrent clients at the same time, so I guess it could be good enough... 512 threads may not be too much for Linux.

Of course, we would have to look out for the PostgreSQL connections...
But at least we could see about the /json issue.

Is it a /json issue only?
Or a /plaintext scaling problem too?
/json makes some memory allocations, whereas /plaintext does not.

On my old laptop with a Core i5 of 2 cores / 4 threads, I reach more than 100,000RPS with /plaintext.
So we could expect much higher numbers...

Last but not least, try with the glibc memory manager?

Offline

#182 2022-12-30 14:06:16

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

With /plaintext the picture is the same as with /json: most of the time, most of the threads are in the S (sleep) state.
And the max RPS is the same, ~370 000 (2.4 GHz CPUs).

BUT! /plaintext in pipelining mode (just found how they do it)

wrk -H 'Host: 10.0.0.1' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 256 --timeout 8 -t 28 http://10.0.0.1:8080/plaintext -s pipeline.lua -- 16

where pipeline.lua is

init = function(args)
  local r = {}
  local depth = tonumber(args[1]) or 1
  for i=1,depth do
    r[i] = wrk.format()
  end
  req = table.concat(r)
end

request = function()
  return req
end

RPS is ~2 million and our server consumes 100% of each core.

BTW, for plaintext TFB uses 256, 1,024, 4,096 and 16,384 connections (see the Data Table section in the plaintext visualization).

How do I enable the glibc memory manager?

Offline

#183 2022-12-30 16:43:45

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

My guess is that both the wrk and raw programs don't use 100% of each core because they are slowed down by the socket API.
The fact that both wrk and raw use only 40% of each core may be perfectly normal. There is very little logic involved in their HTTP client or server code, and what is slow is the syscalls to access the sockets.
With pipelining there are fewer socket syscalls, because the buffers are filled with several pipelined requests per syscall. So fewer syscalls, more CPU processing, higher numbers. Makes sense to me. But I may be wrong.
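
A small Python sketch can illustrate the syscall argument. It is an assumption-laden toy, not the wrk or mORMot code: the request text and function names are made up, and a local socketpair replaces the TCP connection. The client writes 16 requests with a single send, and the receiving side splits its buffer on the empty line that ends each header-only request.

```python
import socket

# Hypothetical minimal request; TFB clients send more headers than this.
REQUEST = b"GET /plaintext HTTP/1.1\r\nHost: localhost\r\n\r\n"


def pipelined_payload(depth):
    """Concatenate `depth` requests, as the wrk pipeline script does."""
    return REQUEST * depth


def count_requests_after_one_write(depth):
    """Send `depth` requests in one write; return (requests seen, reads)."""
    client, server = socket.socketpair()
    try:
        client.sendall(pipelined_payload(depth))  # one write, many requests
        client.shutdown(socket.SHUT_WR)
        buffered = b""
        reads = 0
        while True:
            chunk = server.recv(65536)
            if not chunk:
                break
            reads += 1
            buffered += chunk
        # each header-only GET request ends with an empty line
        return buffered.count(b"\r\n\r\n"), reads
    finally:
        client.close()
        server.close()
```

The number of reads stays well below the number of requests, which is where the saving in syscalls comes from.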

For glibc you can use SynFPCCMemAligned.pas from mORMot 1 (it is a stand-alone unit).
Then define the -dFPC_NO_DEFAULT_MEMORYMANAGER -dFPC_SYNCMEM conditionals.

Or use the FPC RTL cmem unit (which should work directly).
Also with  -dFPC_NO_DEFAULT_MEMORYMANAGER  and with cmem in first position in the program uses clause, instead of fpcx64mm:

uses
  //{$I mormot.uses.inc} // include mormot.core.fpcx64mm
  cmem,
  cthreads,
  sysutils,
  classes,
 ....

and comment out WriteHeapStatus() at the end of the source.
On my computer, mormot fpcx64mm is slightly faster, but perhaps with more cores, the glibc memory manager may scale better.

Offline

#184 2022-12-30 17:23:04

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

I am sure this is not a syscall limitation, because on the same hardware nodejs (in cluster mode, limited to 16 cores) shows better results for json (500k RPS), while wrk remains under the same load.
Node creates a listening socket per fork using SO_REUSEPORT https://lwn.net/Articles/542629/. Maybe this is the way to go?

PS
Will try CMem after blackout, but I think memory manager is not a bottleneck

Last edited by mpv (2022-12-30 17:26:15)

Offline

#185 2022-12-31 08:46:43

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

AFAIR SO_REUSEPORT was to speed up accept(), or to multiplex it in the context of forking like in Node.
We don't have troubles with accept(), which is perfectly fine in terms of performance. Our accept() thread has nothing to do with how the work is distributed among threads, because we maintain our own thread pool.

If Node creates one fork per connection, we could try to create one thread per connection... 16,384 threads may not be too much for the HW (if there is enough memory).
And the old THttpServer should be just reacting similarly to the node forks, because it blocks until there is something to read over the input socket.

Offline

#186 2022-12-31 08:57:16

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

With THttpServer and 256 connections I got ~865k RPS and 100% CPU load on /json (vs ~370k RPS and 45% CPU for the async server).
With 512 connections there is ~750k RPS.
For more than 512 connections I got too many read errors (this does not depend on HttpQueueLength: verified with 0 and with 10000).

For /plaintext the results are:
- 256 conn = 1 050 000 RPS :) (vs ~370k RPS for async)
- 512 conn =    854 000
plaintext pipelining mode fails (as expected; I am surprised that it works for the async server)

So the memory manager is not the bottleneck.

Last edited by mpv (2022-12-31 09:00:01)

Offline

#187 2022-12-31 10:34:42

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

An experiment: I added SO_REUSEPORT to the listening socket of the async server. Now it is possible to run several instances of the server on the same port, and the kernel distributes incoming connections between them.
And YES!!! For /plaintext (instances are limited to the first 24 CPU cores) with 256 concurrent connections:
- 1 server   with 192 threads = 564k RPS
- 2 servers with   96 threads = 750k RPS  - faster than one
- 3 servers with   64 threads = 790k RPS  - the same total number of working threads, but distributed over 3 thread pools
- 4 servers with   48 threads = 920k RPS  (~840k for /json)

Here is a picture for 4 servers - https://drive.google.com/file/d/1e3Qgwe … share_link

In 4-server mode with 512 concurrent connections /json is ~940k, so the problem is not in the per-process epoll size.
It is definitely somewhere in the async server queue implementation.
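
For anyone who wants to reproduce the socket-level part of this experiment without mORMot, here is a minimal Python sketch for Linux (the function names are hypothetical). Every listener sets SO_REUSEPORT before bind(), so several sockets can listen on the same port and the kernel spreads incoming connections among them.

```python
import socket


def reuseport_listener(port, host="127.0.0.1"):
    """Create a listening TCP socket with SO_REUSEPORT set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
        sock.listen(128)
    except OSError:
        sock.close()
        raise
    return sock


def open_listeners(count, port=0):
    """Open `count` listeners on one port (0 lets the kernel choose it)."""
    first = reuseport_listener(port)
    listeners = [first]
    actual_port = first.getsockname()[1]
    for _ in range(count - 1):
        listeners.append(reuseport_listener(actual_port))
    return listeners
```

Without the option, the second bind() on the same address and port would fail with "Address already in use".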

Last edited by mpv (2022-12-31 10:37:24)

Offline

#188 2022-12-31 12:19:25

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

I created multiple server instances inside a single process:

// rawServer := TRawAsyncServer.Create(threads);
SetLength(rawServers, servers);
for i := 0 to servers - 1 do
  rawServers[i] := TRawAsyncServer.Create(threads);

For THttpAsyncServer running on localhost:8080; num CPU=24, num thread=48, num servers=4, total workers=192
I got
/json        =   937 651RPS
/plaintext =1 008 705RPS

I will prepare a true test to be executed with the tfb command ASAP (we are under massive missile attack right now, so the electricity can go off at any time).

Offline

#189 2022-12-31 12:23:01

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

@ab, please add an hsoReusePort option for THttpAsyncServer and, in function NewSocket at line 1733, set it before bind:

      // Server-side binding/listening of the socket to the address:port
      v := 1;
      sock.SetOpt(SOL_SOCKET, SO_REUSEPORT, @v, SizeOf(v));
      if (bind(sock.Socket, @addr, addr.Size) <> NO_ERROR)

Last edited by mpv (2022-12-31 12:23:16)

Offline

#190 2022-12-31 14:37:24

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

In the meanwhile, I have rewritten the async thread pool.
https://github.com/synopse/mORMot2/commit/54090417

From my tests, it does not always wake up the sub-threads, but tries to leverage the lower threads (e.g. R1,R2,R3...) to process more requests.
- on small load or quick responses (like /plaintext or /json), only the R1 thread is involved
- on slow processing (e.g. remote DB access), R1 is identified as blocking, and the R2..Rmax threads are awakened

Could you try it?
I will add SO_REUSEPORT anyway, as an hsoReusePort option. It could help in some cases, e.g. with multiple CPUs: we could mask one server process per CPU socket, then we could probably achieve very good performance...

Offline

#191 2022-12-31 14:38:00

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

The best /json and /plaintext performance for 24 cores is reached in

num CPU=24, num thread=24, num servers=6, total workers=144 

mode
/json         1 050 000 RPS
/plaintext  1 110 000 RPS
/plaintext  2 410 000 RPS in pipelining mode

P.S. This was tried before commit 54090417.

Last edited by mpv (2022-12-31 14:39:16)

Offline

#192 2022-12-31 15:04:05

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

With the new pool of asynchronous threads the results are worse than with the previous one (in all cases).

~280k for 64 threads and 1 server (vs 350k with the old algorithm), 720k in 24 threads * 6 servers mode (vs 1,050k).

CPU load is strange - see this picture for 6 server mode - https://drive.google.com/file/d/1jPiYQI … share_link

Last edited by mpv (2022-12-31 15:04:41)

Offline

#193 2022-12-31 15:21:02

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

Oops... so I will make the new async thread algorithm an option, to be enabled only if needed.
https://github.com/synopse/mORMot2/commit/a4bf3555
But on my Core i5 2 cores / 4 threads CPU, with wrk on localhost, I got 30% better results with the new algorithm: from 100K to 130K RPS...  so it seems not so good for high-end CPUs...

I have added hsoReusePort/acoReusePort options:
https://github.com/synopse/mORMot2/commit/71abd980
It is likely to be the best solution for proper scaling, within the same process...
Can be enabled in the TFB sample: https://github.com/synopse/mORMot2/commit/be6bfbe3

And what about the DB requests?

Offline

#194 2022-12-31 16:03:44

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

I tried to set up the DB on the server - there are some problems there...

Offline

#195 2022-12-31 17:09:52

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

Added ability to set used cores and server instances count - see MR132
I will try to set up the database tomorrow to verify /db and the other endpoints.
Happy New Year!

Offline

#196 2022-12-31 17:50:31

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

You did not set the hsoReusePort option in the raw.pas code....
So I am afraid several server instances won't start.
And I guess you should also let it download a newer mORMot source code including this option.

Best wishes and peaceful year!

Edit: Could you try hsoThreadSmooting but WITHOUT hsoThreadCpuAffinity ?
By design, hsoThreadSmooting focuses on the first thread of the pool, and hsoThreadCpuAffinity will assign it to core #0, so performance will for sure not be good, especially with several bound instances.
My guess is that we should better disable hsoThreadCpuAffinity and let the system work as it wants, especially if number of cores = number of threads.
See https://github.com/synopse/mORMot2/commit/de25ae6f

Offline

#197 2023-01-01 12:17:01

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

:) I just finished verifying WITHOUT hsoThreadCpuAffinity.
For 28 cores (as on the TFB server) in 28 threads * 6 servers mode, with PostgreSQL on the same server (unfortunately I can't test pipelining mode because libpq is v12), the results are:

/json                     1 218 000
/db                         444 000
/fortunes                   182 000
/cached-queries?count=20    498 000
/queries?queries=20          35 000
/plaintext                1 314 000
/plaintext                2 830 000 (pipelining 1024 concurrent conns)
/plaintext                2 700 000 (pipelining 16384 concurrent conns)

will try hsoThreadSmooting..

Offline

#198 2023-01-01 12:28:25

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

With hsoThreadSmooting but WITHOUT hsoThreadCpuAffinity, the results are a little better:

/json                     1 175 000
/db                         452 000
/fortunes                   181 000
/cached-queries?count=20    612 000
/queries?queries=20          33 000
/plaintext                1 335 000
/plaintext                2 900 000 (pipelining 1024 concurrent conns)
/plaintext                2 761 000 (pipelining 16384 concurrent conns)

So I removed "if servers = 1 then " to always enable it, and made an MR to TFB.
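
To see the size of the differences at a glance, the two result listings can be compared with a few lines of Python. The numbers are copied from the two runs above. The dictionary names are hypothetical, and the sketch assumes the first run had hsoThreadSmooting disabled.

```python
# RPS figures copied from the two listings (28 threads * 6 servers).
WITHOUT_SMOOTHING = {
    "/json": 1_218_000,
    "/db": 444_000,
    "/fortunes": 182_000,
    "/cached-queries?count=20": 498_000,
    "/queries?queries=20": 35_000,
    "/plaintext": 1_314_000,
    "/plaintext (pipelining, 1024 conns)": 2_830_000,
    "/plaintext (pipelining, 16384 conns)": 2_700_000,
}
WITH_SMOOTHING = {
    "/json": 1_175_000,
    "/db": 452_000,
    "/fortunes": 181_000,
    "/cached-queries?count=20": 612_000,
    "/queries?queries=20": 33_000,
    "/plaintext": 1_335_000,
    "/plaintext (pipelining, 1024 conns)": 2_900_000,
    "/plaintext (pipelining, 16384 conns)": 2_761_000,
}


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def compare(before, after):
    """Percent change per endpoint, rounded to one decimal."""
    return {
        name: round(percent_change(before[name], after[name]), 1)
        for name in before
    }


if __name__ == "__main__":
    for name, delta in compare(WITHOUT_SMOOTHING, WITH_SMOOTHING).items():
        print(f"{name:40s} {delta:+.1f}%")
```

The largest gain is on cached queries; /json, /fortunes and /queries are slightly lower, and the other endpoints move by a few percent.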

Last edited by mpv (2023-01-01 12:28:54)

Offline

#199 2023-01-01 13:10:03

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: High-performance frameworks

TFB MR is ready.
I hope it helps to solve the scaling on TFB hardware.

But I am still sure we have some places in the async server that unexpectedly "wait"...

Offline

#200 2023-01-01 16:46:56

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,700
Website

Re: High-performance frameworks

Wow!
For /plaintext, 3 million requests per second seems promising...
And in fact, THttpAsyncServer seems to be "pipelining-ready" because it relies on its internal buffers for its reading process, which seems to be what a pipelined client expects.

What about /updates ?
We had pretty bad numbers on this test in https://www.techempower.com/benchmarks/ … est=update
I don't see what's wrong with this particular test...
And also with /rawqueries?queries=20 which seems to be pretty low too (much lower than the ORM). Is our PostgreSQL pipelining wrong?

Let's wait and see if the new algorithms give better numbers...

Note that with the DB process, we may enhance performance with the hsoThreadSmooting option by adding some more threads to each server, so that there are more possibilities to wait for the DB to answer.
With the hsoThreadSmooting algorithm, the extra threads should not be used unless they are needed... but perhaps the DB access is fast enough not to trigger the thread distribution...

Offline
