With NumTinyBlockArenasPO2 = 7 instead of 6, the result is 327K
CPU load in user space is ~10% higher than when using libc in both cases
Flags: BOOSTER  assumulthrd smallpools perthrd erms                            
Small:  blocks=3K size=309KB (part of Medium arena)                            
Medium: 60MB/60MB  sleep=15K                                                   
Large:  0B/640KB  sleep=0                                                      
Total Sleep: count=15K                                                         
Small Getmem Sleep: count=4                                                    
288=4                                                                          
Small Blocks since beginning: 239M/29GB (as small=42/46 tiny=1K/2032)          
48=91M  112=38M  80=27M  128=18M  32=14M  96=9M  64=9M  144=4M                 
160=4M  256=4M  416=3M  880=3M  1264=3M  272=2M  1376=485K  960=475K           
Small Blocks current: 3K/309KB                                                 
48=2K  64=427  352=200  32=87  128=79  112=73  80=48  96=21                    
192=14  416=8  576=7  880=7  288=6  160=5  736=5  624=4
Memory usage statistics
//libc
Maximum resident set size (kbytes): 28896
Minor (reclaiming a frame) page faults: 12867
Voluntary context switches: 5888357
Involuntary context switches: 5049
//x64mm (NumTinyBlockArenasPO2 = 7)
Maximum resident set size (kbytes): 124380              
Minor (reclaiming a frame) page faults: 44196          
Voluntary context switches: 5220211                    
Involuntary context switches: 8087
Great!
Please try in FPCMM_BOOSTER mode with https://github.com/synopse/mORMot2/commit/412fd883
It now has 128 arenas, and a bigger number of pools to feed from.
But of course, as you detected, it consumes more RAM to initialize its internal pools.
Some memory is wasted in the process when blocks do not stay allocated but go through very quick getmem/freemem cycles (as in this server benchmark).
327K RPS for /fortunes. Memory consumption is higher
Flags: BOOSTER  assumulthrd smallpools perthrd erms                                
Small:  3K/309KB  including tiny<=256B arenas=128 pools=95                         
Medium: 126MB/126MB  sleep=2K                                                      
Large:  0B/640KB  sleep=0                                                          
Total Sleep: count=2K                                                              
Small Getmem Sleep: count=1                                                        
288=1                                                                              
Small Blocks since beginning: 244M/29GB (as small=42/46 tiny=1K/2032)              
48=93M  112=39M  80=28M  128=18M  32=14M  96=9M  64=9M  160=4M                     
144=4M  256=4M  416=3M  880=3M  1264=3M  272=2M  1376=509K  960=488K               
Small Blocks current: 3K/309KB                                                     
48=2K  64=426  352=200  32=87  128=80  112=73  80=48  96=21                        
192=14  416=8  576=7  880=7  288=6  736=5  672=4  160=4
Maximum resident set size (kbytes): 271852                                         
Minor (reclaiming a frame) page faults: 77196                                      
Voluntary context switches: 5309185                                                
Involuntary context switches: 7768
Using the -O4 optimization level (I never used it before because of the "beware" notes) slightly increases performance (+44k for json, for example) and passes all tests.
I also tried Whole Program Optimization: it decreases the executable size from 5 MB to 3 MB, but without visible performance changes (compared to -O4).
@ab - what do you think, can we use -O4 for TFB (I'm afraid of accidental failures)?
Makes sense: it only consumes more memory, with no fewer collisions or sleeps.
So I will revert the previous commit to keep the memory lower - it is already more than glibc.
https://github.com/synopse/mORMot2/commit/19bcf72c
And for the TFB benchmarks, we would rather use the glibc MM.
And I never tested -O4 and I doubt there is any benefit of using it.
mORMot results are ready. At the end of the round we will be #16 in the composite score - mostly because of improved PG pipelining, which affects queries and updates in raw mode.
Weights                  1.000    1.737    21.745    4.077     68.363   0.163
Composite #  Framework  JSON     1-query  20-query  Fortunes  Updates  Plaintext  Weighted score  Date and notes
38           mormot     731,119  308,233  19,074    288,432   3,431    2,423,283  3,486           2022-10-26 - 64 thread limitation
43           mormot     320,078  354,421  19,460    322,786   2,757    2,333,124  3,243           2022-11-13 - 112 thread (28CPU*4)
44           mormot     317,009  359,874  19,303    324,360   1,443    2,180,582  3,138           2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51           mormot     563,506  235,378  19,145    246,719   1,440    2,219,248  2,854           2022-12-01 - 112 thread (28CPU*4) CPU affinity
51           mormot     394,333  285,352  18,688    205,305   1,345    2,216,469  2,586           2022-12-22 - 112 threads CPU affinity + pthread_mutex
34           mormot     859,539  376,786  18,542    349,999   1,434    2,611,307  3,867           2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28           mormot     948,354  373,531  18,496    366,488   11,256   2,759,065  4,712           2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
16           mormot     957,252  392,683  49,339    393,643   22,446   2,709,301  6,293           2023-02-14 - 168 threads, cmem, improved PG pipelining
Also we will be:
 - #1 for /queries in raw mode with PG pipelining. I hope everything was done correctly there
 - #3 for /fortunes in ORM(orm=full) mode. Result improved by +30%, thanks to cmem. We are behind lithium(C++) where template engine is not used and xitca-web (rust)
 - #2 for /db in ORM(orm=full) mode. A little (8000RPS) behind xitca-web
A possible way to optimize is to find a bottleneck in the async HTTP server, which should improve json and plaintext.
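As a cross-check of the table above, the weighted score column can be reproduced from the per-test results and the weights row. This is only a sketch: the formula (weighted sum divided by 1000, then truncated) is inferred from the numbers in the table, not taken from the official TFB scoring code, and the function and variable names are made up for the illustration.

```python
# Weights and per-test results (requests per second) copied from the table above.
# The formula itself is an assumption inferred from those numbers.
WEIGHTS = {
    "json": 1.000,
    "1-query": 1.737,
    "20-query": 21.745,
    "fortunes": 4.077,
    "updates": 68.363,
    "plaintext": 0.163,
}


def weighted_score(results):
    """Sum of (result * weight) over all test types, scaled down by 1000."""
    return sum(results[name] * weight for name, weight in WEIGHTS.items()) / 1000


round_2023_01_27 = {
    "json": 948_354, "1-query": 373_531, "20-query": 18_496,
    "fortunes": 366_488, "updates": 11_256, "plaintext": 2_759_065,
}
round_2023_02_14 = {
    "json": 957_252, "1-query": 392_683, "20-query": 49_339,
    "fortunes": 393_643, "updates": 22_446, "plaintext": 2_709_301,
}

print(int(weighted_score(round_2023_01_27)))  # 4712
print(int(weighted_score(round_2023_02_14)))  # 6293
```

With these weights, one extra request per second on /updates is worth about as much as 420 extra requests per second on /plaintext, which is why the PG pipelining changes move the score so much.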
Last edited by mpv (2023-02-14 15:36:18)
This is great news!
/queries with raw pipelining is indeed impressive. I may eventually add the PG pipelining feature to the ORM too (using a batch for reads not only for writes). 
Access to the round info (still not finished, but with both mORMot raw and full ORM results) is at
https://www.techempower.com/benchmarks/ … test=query
About the bottleneck in the async HTTP server, I don't know what more we could do.
There is no memory allocation for the /plaintext requests, and I guess that the /json implementation is as minimal as it could be.
Perhaps valgrind may help?
Is TPollSockets.GetOnePending the bottleneck?
Should we try to switch the fPending: TPollSocketResults structure into a truly lock-less algorithm?
Great results! Thanks a lot @ab and @mpv for your hard work.
When you finish testing all new improvements, push one round with 28 threads. I expect best performance for /json and /plaintext.
Nice blog post for libreactor https://talawah.io/blog/extreme-http-pe … o-million/
Unfortunately, as noted here, our PG pipeline optimization breaks rule #7 of the General Test Requirements:
   "If using PostgreSQL's extended query protocol, each query must be separated by a Sync message"
I do not understand why this requirement exists for the synchronous case, but we have what we have. So I will rewrite the PG pipelining again.
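To make the rule concrete, here is a small sketch of the message order it implies. Bind, Execute and Sync are real message names from the PostgreSQL extended query protocol; everything else is an assumption for the illustration: the statement is taken to be already prepared, the helper name is made up, and the "one Sync per batch" variant is only a guess at what a non-compliant pipeline looks like, not a description of the mORMot code.

```python
def extended_query_messages(query_count, sync_after_each):
    """Return the ordered frontend messages for `query_count` executions
    of an already prepared statement (hypothetical helper)."""
    messages = []
    for _ in range(query_count):
        messages += ["Bind", "Execute"]
        if sync_after_each:
            messages.append("Sync")      # what the TFB rule asks for
    if not sync_after_each:
        messages.append("Sync")          # assumed: a single Sync closing the whole batch
    return messages


print(extended_query_messages(2, sync_after_each=False))
# ['Bind', 'Execute', 'Bind', 'Execute', 'Sync']
print(extended_query_messages(2, sync_after_each=True))
# ['Bind', 'Execute', 'Sync', 'Bind', 'Execute', 'Sync']
```

All messages can still be written to the socket in one go in both variants; the rule only changes how many Sync messages the server has to process.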
About valgrind - I can't see a bottleneck there (mostly because it runs slowly), but now I am trying perf + flamegraph - this technique shows more details because of high-frequency sampling.
What I have found so far is a way to optimize /plaintext in HTTP pipelining mode. The flamegraph for pipelined plaintext shows that most of the time is spent in the send syscall - see the flamegraph for ./raw 4 4 1 (clickable)
This is because the current HTTP pipelining implementation does fewer recv syscalls (it reads several GET requests in one syscall), but we could also buffer the output and do fewer sends!
This can be done if the HTTP state machine detects that, for a GET request with Connection: keep-alive and HTTP 1.1, there is an additional GET after `\r\n`,
and then sends the buffered responses either when the send buffer overflows (for example, nodeJS buffers such responses in a 64k buffer) or when the pipeline ends (no additional bytes read)
Test case for 2 pipelined HTTP queries:
(echo -en "GET /plaintext HTTP/1.1\nHost: foo.com\nConnection: keep-alive\n\nGET /plaintext HTTP/1.1\nHost: foo.com\n\n"; sleep 10) | telnet localhost 8080
It currently produces 2 sendto(7, "HTTP/1.1... syscalls, but could produce one.
Since HTTP pipelining should not be used in production for many reasons, this optimization can be done under hsoHttpPipelining flag. @ab - can you implement this idea?
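The buffering idea described above can be sketched as follows. This is an illustration only, not the mORMot implementation: the function name, the fixed response and the 64 KB threshold (borrowed from the nodeJS remark) are assumptions, and only body-less GET requests terminated by an empty line are handled.

```python
FLUSH_THRESHOLD = 64 * 1024  # assumed limit, mirroring the 64k buffer mentioned above

RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"Hello, World!"
)


def answer_pipelined(received, send):
    """Answer every complete request in `received`, calling `send` as rarely as possible.

    Returns the unparsed tail (an incomplete request, if any).
    """
    out = bytearray()
    while True:
        _head, separator, rest = received.partition(b"\r\n\r\n")
        if not separator:
            break                      # no complete request left: the pipeline ends
        received = rest
        out += RESPONSE                # buffer the answer instead of sending it now
        if len(out) >= FLUSH_THRESHOLD:
            send(bytes(out))           # buffer overflow: flush early
            out.clear()
    if out:
        send(bytes(out))               # one send for everything still buffered
    return received


sent = []
two_requests = (
    b"GET /plaintext HTTP/1.1\r\nHost: foo.com\r\nConnection: keep-alive\r\n\r\n"
    b"GET /plaintext HTTP/1.1\r\nHost: foo.com\r\n\r\n"
)
leftover = answer_pipelined(two_requests, sent.append)
print(len(sent))  # 1
```

Here `sent.append` stands in for the send syscall, so the length of `sent` is the number of syscalls that would be issued for the two pipelined requests.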
P.S. - to make the flamegraph clickable, download it first and then open it in a browser - clicks do not work when the svg is opened in the Google Drive preview
Last edited by mpv (2023-02-15 11:07:25)
In fact, we can force to use the write buffer by setting the acoWritePollOnly option.
Then all writes will be done in a background thread, using connection buffers...
Try to uncomment line 3413 of THttpAsyncServer.Create in mormot.net.async.
Edit: it does not behave as we expect, because THttpAsyncConnection.AfterWrite is delayed to the next write operation, so the whole numbers are down...
acoWritePollOnly works great when there is no additional process (e.g. for our RTSP server) but not for HTTP.
So I have implemented what you proposed, and enabled pipelined sending for THttpAsyncConnection, using one output buffer if pipelined input is detected.
https://github.com/synopse/mORMot2/commit/d6d0919d
I measured a performance boost from 150,000/s without pipelining to 2,000,000/s with pipelined /plaintext requests on my laptop.
I am curious to see the numbers on high-end hardware, but I guess we could be back in the race in TFB for /plaintext too with this!
Great!
We are waiting for your feedback.
During my tests in pipelined mode (using -- 3200 as parameter to multiply the queries count up to a big size) I discovered a nasty memory problem.
It should be fixed now with https://github.com/synopse/mORMot2/commit/0a59bffb
All TFB tests pass.
/plaintext results:
my PC: 1 271 496 -> 3 764 547
server:  2 621 195 -> 4 343 425 (nearly the same results should be seen on TFB hardware)
Perfect!
I need some time to solve PG Sync() and will then prepare a PR
PG pipelining is now implemented using Sync() as required by TFB rules. raw* performance decreases a little (from 50K to 46K for /rawqueries)
@ab, please merge https://github.com/synopse/mORMot2/pull/144 - I added a Conn.CheckPipelineSync method and need it to prepare the TFB PR
TFB PR #7926 is ready:
 -   /queries and /updates for the raw test case: PostgreSQL pipelining uses Bind->Exec->Sync as required by General Rule 7
 -   /plaintext: improved HTTP pipelining at the mORMot level - added response buffering to minimize send syscalls
 -   general optimization: use an aggressive compiler optimization level (-O4)
I expect that with these changes we will be #12 (+600 points for /plaintext, -70 points for /rawqueries, plus some points because of -O4 for all endpoints)
So, the next goal is to be in the TOP 10
I have some ideas but am still investigating....
Great!
Please try https://github.com/synopse/mORMot2/commit/7d7fde57
I tried to refactor the DB layer, and especially our PostgreSQL direct access classes, to gain a few points.
Yes, unexpectedly..
I posted an announcement about our work on the Free Pascal forum. Maybe the community will generate more ideas for optimizations....
Not much input in the FP forum yet... let's see.
I have made some optimizations today.
The main one may be https://github.com/synopse/mORMot2/commit/1511dc7af05 about general HTTP processing.
My guess is also that https://github.com/synopse/mORMot2/commit/0f9f4092 could increase the /fortunes test performance.
Other micro-optimizations at ORM level may help too.
Today I tried to find bottlenecks using perf, but without success.
Your changes to HTTP processing (nice idea!) give, on the server:
  +20k RPS for json
  +60k RPS for pipelined plaintext
   +3k for /db, /rawdb and /rawfortunes
The cumulative /fortunes effect is +25K
Now, apart from switching to io_uring instead of epoll/recv/send, I don't see what could make a much bigger difference...
https://unixism.net/loti/index.html
But this would be a lot of work, only making a difference for the special case of very small requests with no DB access...
Correct me if I'm wrong: looking at the Python source, it looks like the benchmark starts the postgres docker, starts the app docker, then runs all the tests defined in benchmark_config.json. If this is true, then any previous test has an impact on the next test, mostly by leaving MM garbage/fragmentation behind; also, postgres is not restarted between the orm and raw tests.
Looking at benchmark_config.json from ntex (No. 1 in the composite score of the previous round) and also asp.net.core: instead of using one docker/monolith app, I propose to create 3 docker/apps: 1 - json/plaintext, 2 - orm, 3 - raw. The composite score will aggregate the best results from all 3 apps. Also, for json/plaintext you can use a different number of threads than for the db tests, which scales better.
Switching to io_uring is too radical a change, IMHO. And as far as I can see, the top 10 frameworks use epoll...
In my opinion, our current problem is that, with the same number of threads, increasing the listening socket count improves json performance.
I caught a case where the performance difference is dramatic both on the server and on my PC (for other thread/socket counts, more sockets also always win, but the difference is not so big)
Server (28 cores)
wrk -d 15 -c 64 --timeout 8 -t 28 "http://localhost:8080/json" 
./raw 56 28 1 (1x56 thread) -  143K RPS
./raw 28 28 2 (2x28 thread) -  770K RPS
My PC (12 cores)
wrk -d 15 -c 64 --timeout 8 -t 12 "http://localhost:8080/json" 
./raw 24 12 1 - 300K
./raw 12 12 2 - 500K
perf (for the server) shows that:
in the case of one socket, most of the load goes into only 2 threads, R18080 and R28080 - see flamergath_json_56_1
in the case of two sockets, the load is distributed between many threads - see flamergath_json_28_2
I do not understand the reason for such behavior
P.S. 
I also tried
 - switching the PG protocol to binary mode - performance is the same  
 - setting HttpQueueLength to 0 - no change in performance except a small number of socket errors
Several processes were my first attempt - one process with 6 listeners and 28 threads each is better than 6 processes x 28 threads (I think because of memory management)
Last edited by mpv (2023-02-18 16:51:01)
About @ttomas ideas above
https://synopse.info/forum/viewtopic.ph … 094#p39094
Tuning the dockers, to match what the best tests do, is a good idea.
Several docker instances, with specific tuning of our raw.dpr program (e.g. about threading), may have some benefits.
Also tuning the dockers may help too - e.g. disabling iptables, or NAT, or whatever idea dockers magicians could find out - I don't know much about docker myself.
Tuning the OS/container may help for high-performance: it has been seen in several places.
"Looking at benchmark_config.json from ntex (No. 1 in the composite score of the previous round) and also asp.net.core: instead of using one docker/monolith app, I propose to create 3 docker/apps: 1 - json/plaintext, 2 - orm, 3 - raw. The composite score will aggregate the best results from all 3 apps. Also, for json/plaintext you can use a different number of threads than for the db tests, which scales better."
Nice idea, I will investigate it! 
Different thread counts for the non-db tests seem to make sense. I'll try to determine the best ones and maybe enable CPU affinity for them.
As I understand from the Python sources, the DB container is restarted for each test (once for default and once for postgres-raw), and the current test order db->json->queries->updates->fortunes->plaintext is OK (it's important that updates come after db/queries). So I do not see a reason to create separate containers for db and rawdb (but I will check whether it makes sense)
About "tuning the dockers" - we can't do this from inside a container. The Docker daemon is configured on the host.
P.S.
My understanding of docker (simplified):
Docker is a slim wrapper around Linux namespaces and cgroups (+AuFS as a read-only FS with layers). So everything is executed on the host, but the executable links to the libraries from the docker image. No magic at all 
BTW this is the reason we can't use io_uring - as far as I understand, the TFB host machine is based on Ubuntu 18 and io_uring is not supported there. TFB has plans to migrate it to 22.04 in the next round.
Last edited by mpv (2023-02-19 17:31:00)
My bad about https://synopse.info/forum/viewtopic.ph … 095#p39095 - in X X 1 mode raw enables hsoThreadSmooting - and this is why we got such a strange distribution of load across threads
Last edited by mpv (2023-02-19 17:55:07)
Makes sense about hsoThreadSmooting. 
Please see
https://github.com/synopse/mORMot2/commit/789b70e5
I allow customization of JSON serialization of ORM items/arrays, for more strict following of the TFB requirements.
"Is TPollSockets.GetOnePending the bottleneck?
Should we try to switch the fPending: TPollSocketResults structure into a truly lock-less algorithm?"
I think so. It's not visible in the profiler, but here are the syscall stats (cleaned up a little) for 1667410 /json requests on a 24-thread server:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- -------------------
 40,34  242,887052         281    863994    163691 futex
 19,11  115,090885          69   1667410           sendto
 16,25   97,843747          58   1667499        15 recvfrom
  9,66   58,190893      843346        69           nanosleep
  0,14    0,826164          61     13415           epoll_wait <--- I played with the event count - 256 is good enough
  0,01    0,032096          63       504           epoll_ctl
I think the futexes are because of Lock; RTLCriticalSection and other friends in GetOnePending
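To put the strace summary above in proportion, the call counts can be divided by the number of requests. This is plain arithmetic on the figures from the table; the variable names are made up for the illustration.

```python
# Call counts copied from the strace summary above.
calls = {"futex": 863994, "sendto": 1667410, "recvfrom": 1667499, "epoll_wait": 13415}
requests = 1667410  # number of /json requests served during the trace

per_request = {name: count / requests for name, count in calls.items()}
for name, ratio in per_request.items():
    print(f"{name:<10} {ratio:.3f} calls per request")
# futex      0.518 calls per request
# sendto     1.000 calls per request
# recvfrom   1.000 calls per request
# epoll_wait 0.008 calls per request
```

So there is roughly one futex call for every two requests, while one epoll_wait covers on average more than a hundred requests.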
I also tried @ttomas's idea with different containers/threads settings for different tests.
After many attempts, I have achieved a small (100K) increase in /json, but the number of threads/instances is just magical (for 28 processors, 8 instances with 7 threads is the best). Also, separate containers do nothing for database-related tests. I suggest leaving everything as it is.
Last edited by mpv (2023-02-20 19:07:50)
I expect more difference playing with servers and threads for pipelined /plaintext.
Tested on an old Dell R710, 2xE5645 (6 cores, 12 threads), XCP-NG (XenServer), VM Ubuntu 22.04 with 24 vCPU.
@mpv nice catch about hsoThreadSmooting, I deleted
if servers = 1 then
    include(flags, hsoThreadSmooting); // 30% better /plaintext e.g. on i5 7300U
raw is running in docker, as is ntex [tokio,platform] (in the last TFB round ~7M req/s)
mormot raw has better results than ntex on this platform!!!
wrk is run on the same VM with -t 6; I later found that -t 7 gives better results. 3 tests without warm-up.
About docker: we use ubuntu 20.04 focal for the builder and run on ubuntu 22.04. If TFB runs docker on Ubuntu 18, maybe it is better to use ubuntu 18. I will try the same docker test on a U18 VM.
wrk -d 15 -c 256 --timeout 8 -t 6 "http://localhost:8080/plaintext" -s pipeline.lua -- 16
Workers Servers Threads T1         T2         T3         AVG
24      1       24      1,342,597  1,431,568  1,408,269  1,394,145
24      2       12      1,372,775  1,395,293  1,399,634  1,389,234
24      3       8       1,402,108  1,399,117  1,387,009  1,396,078
24      4       6       1,424,154  1,419,472  1,389,615  1,411,080
48      1       48      1,292,962  1,341,772  1,278,475  1,304,403
48      2       24      1,391,753  1,380,388  1,384,175  1,385,439
48      3       16      1,404,478  1,417,240  1,403,237  1,408,318
48      4       12      1,375,386  1,384,120  1,383,750  1,381,085
48      6       8       1,374,656  1,410,551  1,378,311  1,387,839
72      1       72      1,268,166  1,018,549    966,550  1,084,422
72      2       36      1,372,683  1,338,490  1,357,713  1,356,295
72      3       24      1,384,253  1,376,085  1,362,731  1,374,356
72      4       18      1,363,745  1,400,351  1,360,997  1,375,031
72      6       12      1,379,854  1,401,052  1,376,467  1,385,791
96      1       96        918,802    972,689    946,801    946,098
96      2       48      1,282,146  1,329,924  1,341,937  1,318,003
96      3       32      1,356,407  1,358,175  1,379,885  1,364,822
96      4       24      1,355,498  1,387,323  1,361,684  1,368,168
96      6       16      1,379,136  1,367,597  1,393,248  1,379,994
ntex [tokio,platform]   1,042,539  1,014,515  1,042,050  1,033,035
About the futexes, I am not sure the syscalls come from GetOnePending.
I suspect they come from the thread awakening (SetEvent) if hsoThreadSmooting is not defined.
Please try to change TOSLightLock into TLightLock for fPendingSafe definition in mormot.net.sock line 800.
If the number of futex call is lower, then GetOnePending is the culprit.
But I doubt it is... I may be wrong.
Thanks anyway ttomas for the feedback.
Any such set of numbers is very interesting...
Perhaps we could also try hsoThreadSmooting but tuning the ThreadPollingWakeupLoad parameter to something smaller for /plaintext or /json
fThreadPollingWakeupLoad is the number of events the thread balancer accepts before waking up a new thread. Clearly, 32 is too high for /plaintext.
This default value of 32 was tuned on my 2 cores / 4 threads PC for high-level REST/JSON process.
You can try to change the default value of 32 in TAsyncConnections.Create to 16 or 8...
Perhaps a formula could emerge, like  
fThreadPollingWakeupLoad := (cardinal(aThreadPoolCount) div SystemInfo.dwNumberOfProcessors) * 8;  or something....
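For a feel of what the proposed formula yields, here is a small sketch of the same integer arithmetic in Python. The function name and the sample thread/CPU counts are only illustrations (168 threads on 28 cores is the TFB setup mentioned in this thread); it is not the code that ended up in mORMot.

```python
def wakeup_load(thread_pool_count, cpu_count, factor=8):
    """Integer version of (threads div cpus) * 8, as proposed above."""
    return (thread_pool_count // cpu_count) * factor


print(wakeup_load(168, 28))  # 48
print(wakeup_load(24, 24))   # 8
print(wakeup_load(4, 8))     # 0
```

Note that the integer division returns 0 as soon as there are fewer threads than cores, so a formula of this shape would presumably need a lower bound.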
I don't have any high-end hardware (yet) for testing.
But I will try to implement a lock-free GetOnePending next week (this week it will be difficult for me).
Results are ready - we are still #16. Impressive plaintext improvements, rawqueries (with PG Sync) worse than I expected, small improvements for json and db (I think because of -O4)
Weights                  1.000    1.737    21.745    4.077     68.363   0.163
Composite #  Framework  JSON     1-query  20-query  Fortunes  Updates  Plaintext  Weighted score  Date and notes
38           mormot     731,119  308,233  19,074    288,432   3,431    2,423,283  3,486           2022-10-26 - 64 thread limitation
43           mormot     320,078  354,421  19,460    322,786   2,757    2,333,124  3,243           2022-11-13 - 112 thread (28CPU*4)
44           mormot     317,009  359,874  19,303    324,360   1,443    2,180,582  3,138           2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51           mormot     563,506  235,378  19,145    246,719   1,440    2,219,248  2,854           2022-12-01 - 112 thread (28CPU*4) CPU affinity
51           mormot     394,333  285,352  18,688    205,305   1,345    2,216,469  2,586           2022-12-22 - 112 threads CPU affinity + pthread_mutex
34           mormot     859,539  376,786  18,542    349,999   1,434    2,611,307  3,867           2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28           mormot     948,354  373,531  18,496    366,488   11,256   2,759,065  4,712           2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
16           mormot     957,252  392,683  49,339    393,643   22,446   2,709,301  6,293           2023-02-14 - 168 threads, cmem, improved PG pipelining
15           mormot     963,953  394,036  33,366    393,209   18,353   6,973,762  6,368           2023-02-21 - 168 threads, improved HTTP pipelining, PG pipelining uses Sync() as required, -O4 optimization
@ab - you are right. Replacing TOSLightLock with TLightLock does not change the futex call count. So, this is thread awakening. I will play with WakeupLoad later...
@ttomas - the current round result for mormot /plaintext is ~7M - and this is the 10Gb network limit of the TFB hardware. I think we should focus on /json. If we speed up HTTP for json, this automatically speeds up the db-related tests.
Last edited by mpv (2023-02-23 19:31:00)
Wow, /plaintext results are impressive. The idea of 3 dockers/tests for json/plaintext was to tune workers to 2*CPU. Differences are small, within the range of statistical error.
wrk -d 15 -c 256 --timeout 8 -t 7 "http://localhost:8080/json"                         
Workers  Servers  Threads  T1       T2       T3       AVG
24       4         6       119.802  123.277  122.998  122.026
48       4        12       117.842  125.779  126.378  123.333
72       4        18       121.465  121.547  119.914  120.975
96       4        24       116.398  116.500  120.684  117.860
ntex [tokio,platform]      139.886  138.929  141.283  140.033
Last edited by ttomas (2023-02-21 10:03:35)
Perhaps https://github.com/synopse/mORMot2/commit/1e76b7d7 may help a little.
But it is likely to be also within the measurement error margin.
At least the code is slightly cleaner.
"I don't get why the 20-query decreased so much"
All frameworks decreased in the same way after adding Sync() - we are not first.
When the PG server is local, the decrease is not so significant.
@ab - crazy idea - what if we use named pipes (mkfifo) to wake up threads in the pool instead of futexes? Have you considered this option?
I mean TAsyncConnections creates a fifo pipe and writes to it when data is ready, and all workers simply read from the pipe in blocking mode. And let the kernel decide which worker wakes up by reading successfully..
In this case we double the syscall count (write + read instead of futex) but avoid loops over fThreads and simplify the code a lot, don't we?
P.S.
Even a simple pipe should work. And there is also a message queue if pipes are slow...
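The pipe idea above can be tried out with a few lines of Python. This is only a toy model of the proposal, with made-up names: worker threads block in read() on one shared anonymous pipe, the producer writes one byte per event, and the kernel decides which reader gets each byte. It says nothing about how fast this is compared to futexes.

```python
import os
import threading


def run_pipe_wakeup(worker_count, event_count):
    """Dispatch `event_count` one-byte tokens to `worker_count` threads blocked on a pipe."""
    read_fd, write_fd = os.pipe()
    handled = []                          # one entry per token consumed
    handled_lock = threading.Lock()

    def worker(index):
        while True:
            token = os.read(read_fd, 1)   # blocks until the kernel hands over a byte
            if not token:                 # b"" means the write end was closed
                return
            with handled_lock:
                handled.append(index)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(worker_count)]
    for thread in threads:
        thread.start()
    for _ in range(event_count):
        os.write(write_fd, b"x")          # one write syscall per event
    os.close(write_fd)                    # EOF lets every worker leave its loop
    for thread in threads:
        thread.join()
    os.close(read_fd)
    return handled


handled = run_pipe_wakeup(worker_count=4, event_count=100)
print(len(handled))  # 100
```

Which worker handles which event is up to the kernel scheduler, so only the total count is deterministic.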
Last edited by mpv (2023-02-21 20:59:04)
When I investigated, MQ was not better than futex: "On Linux, mq_timedreceive() is a system call, and mq_receive() is a library function layered on top of that system call." says the manual.
About mkfifo performance, I am not sure it would be faster either: there will be a syscall on waiting anyway.
IIRC Linus wrote that futexes are the lightest and safest way of waiting in user space.
Other mechanisms use kernel futexes themselves.
I think we could get some ideas for the queue in
https://github.com/h2o/h2o/blob/master/ … tithread.c
Perhaps https://man7.org/linux/man-pages/man2/eventfd.2.html could be used in a way similar to your mkfifo: let the threads wait for data, and let the kernel do the scheduling, instead of waking them up manually with our custom ThreadPollingWakeupLoad algorithm.
static void init_async(h2o_multithread_queue_t *queue, h2o_loop_t *loop)
{
#if defined(__linux__)
    /**
     * The kernel overhead of an eventfd file descriptor is
     * much lower than that of a pipe, and only one file descriptor is required
     */
An eventfd is what io_uring seems to use for multi-threading: https://unixism.net/loti/tutorial/register_eventfd.html
BTW, did you try to play with ThreadPollingWakeupLoad ?
Yes, I played with ThreadPollingWakeupLoad + hsoThreadSmooting (4, 8, 16, 32). The best results are for ThreadPollingWakeupLoad = 8. In the best case (if the boost on TFB HW is about the same) we get +150 points, but I am still not sure smoothing is a good solution in terms of being "realistic". So my proposal is to enable it when 150 points matter, not now...
            w\o     tpw4      tpw8      tpw16    tpw32(m)
json      1338 754  1360 205  1457 470  1406 111  1163 232
rawdb      470 228   471 954   477 038   476 576   474 571
db         460 406   455 805   460 086   456 671   455 382
fortunes   386 097   388 822   392 342   387 425   387 932
rawfort    437 250   435 545   436 378   438 344
plaintext 4238 095  4125 256  4353 065  4249 592  4223 741
rawQuery    45 925              45 809    45 819
queries     34 712              34 658    34 826
I can try to use eventfd on the weekend, or, if you plan to try it yourself (the mormot async code is still a little complex for me), please notify me.
Last edited by mpv (2023-02-22 11:21:40)
Thanks for the input.
So I have set up a simple calculation for this constant, as
https://github.com/synopse/mORMot2/commit/aa5786ad93ad
It is always customizable anyway, once the server is launched.
I will try to include eventfd this weekend, without breaking the current algorithm - which is still to be used outside of Linux.
I am confident that letting the kernel wake up the threads when needed is a better option than doing it from the web server side.
This eventfd mechanism seems to be used everywhere async performance is needed with the Linux kernel.
"I will try to include eventfd this weekend"
Thank you very much!
I will make a new PR to TFB based on today's sources, with all the minor changes and a cleaned-up raw.pas (without enabling smoothing). This gives us a clean picture for comparing event vs eventfd in the future.
P.S.
 Ready #7944
Last edited by mpv (2023-02-22 17:11:13)
Just tried Microsoft mimalloc MM. https://github.com/microsoft/mimalloc
No noticeable improvements, same results as libc MM. ldd /usr/local/bin/raw confirms libmimalloc.so is used.
I used alpine docker https://github.com/emerzon/alpine-mimalloc.
Dockerfile
# ... Same builder
FROM emerzon/alpine-mimalloc
COPY --from=builder /build/bin/fpc-x86_64-linux/raw /usr/local/bin/raw
RUN apk --update add postgresql-client && \
    # Workaround musl vs glibc entrypoint
    mkdir /lib64 && \
    ln -s /lib/ld-musl-x86_64.so.1 /lib64/ld-linux-x86-64.so.2
EXPOSE 8080
CMD ["raw"]
Last round has mormot results
@ttomas - thanks for the investigation. At least now we know it didn't help. BTW I investigated jemalloc with the same result. So for now we can consider the glibc MM good enough
@dcoun - this round (started 2023-02-23) should be the same as the previous one, because our last MR with minor improvements is not merged yet.
@ab - I am investigating a pipe (a simple program with one writer, which writes a pointer into the pipe, and many readers concurrently reading it) - performance depends very much on the kernel version. For an old kernel (4.18 as on my server) it's terrible. On a newer one (5.18 as on my desktop) it's a lot faster (x10). I am still planning to try eventfd + ring buffer. I will post a gist when finished.
@mpv
I have made a first eventfd() implementation in https://github.com/synopse/mORMot2/commit/a0e916a1
It is disabled by default because the numbers I got were not very good...
Perhaps the EFD_SEMAPHORE mode is not the one we need. But it was the one which fits into the existing code AFAICT.
You can uncomment the line #1981 of mormot.net.async and make some tests on your side.
https://github.com/synopse/mORMot2/comm … 26c50R1981
At least, it consumes all CPU cores on my PC - but with a lot of syscalls for sure.
Edit: I have added a new hsoEventFD option, since the numbers don't feel so bad after all even on my PC.
See https://github.com/synopse/mORMot2/commit/e3589c99
Please try it, perhaps with thread affinity to cpu or socket...
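For readers not familiar with the EFD_SEMAPHORE mode mentioned above, here is a minimal sketch of its semantics using Python's os.eventfd wrappers (Linux only, Python 3.10 or later). It is not related to the mORMot code; the helper name is made up, and EFD_NONBLOCK is used only so that the demo stops instead of sleeping once the counter reaches zero.

```python
import os


def drain_semaphore_eventfd(posted):
    """Post `posted` wake-ups to an EFD_SEMAPHORE eventfd and read them back one by one."""
    efd = os.eventfd(0, os.EFD_SEMAPHORE | os.EFD_NONBLOCK)
    try:
        os.eventfd_write(efd, posted)     # adds `posted` to the kernel counter
        values = []
        while True:
            try:
                # Semaphore mode: each read returns 1 and decrements the counter by 1.
                values.append(os.eventfd_read(efd))
            except BlockingIOError:
                # Counter is zero: a blocking reader would be put to sleep here.
                return values
    finally:
        os.close(efd)


if hasattr(os, "eventfd"):                # not available outside Linux / older Pythons
    print(drain_semaphore_eventfd(3))     # [1, 1, 1]
```

Without EFD_SEMAPHORE, the first read would return the whole counter (3) and reset it to zero, so a single reader would consume all pending wake-ups at once.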
Just checked /json on Ubuntu 18.04, kernel 4.15, as the docker host (previous tests were on a U22.04 VM), to see whether a different kernel or glibc/MM version has any impact.
Small improvements on all tests, also for ntex. The Ubuntu version does not affect the test!
        Workers  Ser. Thr. T1       T2       T3       AVG
libc MM   48      2   24   121.680  126.686  124.550  124.305
mimalloc  48      2   24   125.354  120.953  127.821  124.709
ntex [tokio,platform]      144.824  148.111  146.022  146.319
Last edited by ttomas (2023-02-28 10:10:29)
I tried it yesterday (uncommenting line #1981) - on the server HW the numbers are slightly lower compared to the Events-based algorithm.
We can reduce syscalls by using blocking IO for the eventFD, and remove fOwner.fThreadPollingEventFD.WaitFor(5000) altogether.
The blocked read() call should be terminated by a signal (and return -1) when the application (daemon) stops.
I tried this approach but the threads lock up somewhere.....