With NumTinyBlockArenasPO2 = 7 instead of 6, the result is 327K
CPU load in user space is ~10% higher than when using libc in both cases
Flags: BOOSTER assumulthrd smallpools perthrd erms
Small: blocks=3K size=309KB (part of Medium arena)
Medium: 60MB/60MB sleep=15K
Large: 0B/640KB sleep=0
Total Sleep: count=15K
Small Getmem Sleep: count=4
288=4
Small Blocks since beginning: 239M/29GB (as small=42/46 tiny=1K/2032)
48=91M 112=38M 80=27M 128=18M 32=14M 96=9M 64=9M 144=4M
160=4M 256=4M 416=3M 880=3M 1264=3M 272=2M 1376=485K 960=475K
Small Blocks current: 3K/309KB
48=2K 64=427 352=200 32=87 128=79 112=73 80=48 96=21
192=14 416=8 576=7 880=7 288=6 160=5 736=5 624=4
Offline
Memory usage statistics
//libc
Maximum resident set size (kbytes): 28896
Minor (reclaiming a frame) page faults: 12867
Voluntary context switches: 5888357
Involuntary context switches: 5049
//x64mm (NumTinyBlockArenasPO2 = 7)
Maximum resident set size (kbytes): 124380
Minor (reclaiming a frame) page faults: 44196
Voluntary context switches: 5220211
Involuntary context switches: 8087
Offline
Great!
Please try in FPCMM_BOOSTER mode with https://github.com/synopse/mORMot2/commit/412fd883
It now has 128 arenas, and a bigger number of pools to feed from.
But of course, as you detected, it consumes more RAM to initialize its internal pools.
Some memory is lost in the process, if the memory does not remain allocated, but has very quick getmem/freemem (as in this server benchmark).
Offline
327K RPS for /fortunes. Memory consumption is higher
Flags: BOOSTER assumulthrd smallpools perthrd erms
Small: 3K/309KB including tiny<=256B arenas=128 pools=95
Medium: 126MB/126MB sleep=2K
Large: 0B/640KB sleep=0
Total Sleep: count=2K
Small Getmem Sleep: count=1
288=1
Small Blocks since beginning: 244M/29GB (as small=42/46 tiny=1K/2032)
48=93M 112=39M 80=28M 128=18M 32=14M 96=9M 64=9M 160=4M
144=4M 256=4M 416=3M 880=3M 1264=3M 272=2M 1376=509K 960=488K
Small Blocks current: 3K/309KB
48=2K 64=426 352=200 32=87 128=80 112=73 80=48 96=21
192=14 416=8 576=7 880=7 288=6 736=5 672=4 160=4
Maximum resident set size (kbytes): 271852
Minor (reclaiming a frame) page faults: 77196
Voluntary context switches: 5309185
Involuntary context switches: 7768
Offline
Using the -O4 optimization level (I never used it before because of the "beware" notes) slightly increases performance (+44k for json, for example) and passes all tests.
I also tried Whole Program Optimization - it decreases the executable size from 5MB to 3MB, but without visible performance changes (compared to -O4).
@ab - what do you think - can we use -O4 for TFB (I'm afraid of unexpected failures)?
Offline
Makes sense: only more memory consumed, with no fewer collisions or sleeps.
So I will revert the previous commit to keep the memory lower - already more than glibc.
https://github.com/synopse/mORMot2/commit/19bcf72c
And for the TFB benchmarks, we would rather use the glibc MM.
And I never tested -O4 and I doubt there is any benefit of using it.
Offline
mORMot results are ready. At the end of the round we will be #16 in the composite score - mostly because of improved PG pipelining, which affects queries and updates in raw mode.
Weights                1.000    1.737    21.745     4.077   68.363      0.163
Composite #             JSON  1-query  20-query  Fortunes  Updates  Plaintext  Weighted score
38           mormot  731,119  308,233    19,074   288,432    3,431  2,423,283           3,486  2022-10-26 - 64 thread limitation
43           mormot  320,078  354,421    19,460   322,786    2,757  2,333,124           3,243  2022-11-13 - 112 thread (28CPU*4)
44           mormot  317,009  359,874    19,303   324,360    1,443  2,180,582           3,138  2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51           mormot  563,506  235,378    19,145   246,719    1,440  2,219,248           2,854  2022-12-01 - 112 thread (28CPU*4) CPU affinity
51           mormot  394,333  285,352    18,688   205,305    1,345  2,216,469           2,586  2022-12-22 - 112 threads CPU affinity + pthread_mutex
34           mormot  859,539  376,786    18,542   349,999    1,434  2,611,307           3,867  2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28           mormot  948,354  373,531    18,496   366,488   11,256  2,759,065           4,712  2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
16           mormot  957,252  392,683    49,339   393,643   22,446  2,709,301           6,293  2023-02-14 - 168 threads, cmem, improved PG pipelining
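For readers wondering how the last numeric column relates to the weights row: the figures in this table are reproduced by multiplying each test's RPS by its weight, summing, dividing by 1000 and dropping the fraction. This is only an observation from the numbers posted here, not TFB's official scoring code; the function name and the truncation step are assumptions.

```python
# Weights as listed in the first row of the table, in column order.
WEIGHTS = {
    "json": 1.000,
    "1-query": 1.737,
    "20-query": 21.745,
    "fortunes": 4.077,
    "updates": 68.363,
    "plaintext": 0.163,
}


def weighted_score(rps: dict) -> int:
    """Sum of RPS * weight per test, divided by 1000 and truncated (assumed)."""
    total = sum(rps[name] * weight for name, weight in WEIGHTS.items())
    return int(total / 1000)


# Row dated 2023-02-14 from the table.
round_2023_02_14 = {
    "json": 957_252, "1-query": 392_683, "20-query": 49_339,
    "fortunes": 393_643, "updates": 22_446, "plaintext": 2_709_301,
}
print(weighted_score(round_2023_02_14))
```

With these inputs the result matches the 6,293 listed for that row, and it shows why the Updates and 20-query columns dominate the score despite their small absolute values.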
Also we will be:
- #1 for /queries in raw mode with PG pipelining. I hope everything was done correctly there
- #3 for /fortunes in ORM(orm=full) mode. Result improved by +30%, thanks to cmem. We are behind lithium(C++) where template engine is not used and xitca-web (rust)
- #2 for /db in ORM(orm=full) mode. A little (8000RPS) behind xitca-web
A possible way to optimize is to find a bottleneck in the async HTTP server, which should improve json and plaintext
Last edited by mpv (2023-02-14 15:36:18)
Offline
This is great news!
/queries with raw pipelining is indeed impressive. I may eventually add the PG pipelining feature to the ORM too (using a batch for reads not only for writes).
Access to the round info (still not finished, but with both mORMot raw and full ORM results) is at
https://www.techempower.com/benchmarks/ … test=query
About the bottleneck in the async HTTP server, I don't really know what we could do.
There is no memory allocation for the /plaintext requests, and I guess that the /json implementation is as minimal as it could be.
Perhaps valgrind may help?
Is TPollSockets.GetOnePending the bottleneck?
Should we try to switch the fPending: TPollSocketResults structure into a truly lock-less algorithm?
Offline
Great results! Thanks a lot @ab and @mpv for your hard work.
When you finish testing all new improvements, push one round with 28 threads. I expect best performance for /json and /plaintext.
Nice blog post for libreactor https://talawah.io/blog/extreme-http-pe … o-million/
Offline
Unfortunately, as noted here, our PG pipeline optimization breaks rule #7 of the General Test Requirements:
"If using PostgreSQL's extended query protocol, each query must be separated by a Sync message"
I do not understand why this requirement exists for the synchronous case, but we have what we have. So I will rewrite PG pipelining again.
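To make the rule concrete, here is a small sketch of the message order for a batch of queries in the extended query protocol, with one Sync per query versus a single Sync closing the whole batch. Bind, Execute and Sync are real PostgreSQL protocol messages; the function itself, the assumption that the statement is already prepared, and the "single trailing Sync" variant are illustrative guesses, not the actual mORMot code.

```python
def pipeline_messages(query_count: int, sync_each: bool) -> list[str]:
    """Names of the messages a client would queue before flushing the socket."""
    messages = []
    for _ in range(query_count):
        messages += ["Bind", "Execute"]
        if sync_each:
            messages.append("Sync")  # rule 7: every query is followed by a Sync
    if not sync_each:
        messages.append("Sync")      # assumed variant: one Sync for the batch
    return messages


print(pipeline_messages(2, sync_each=True))
print(pipeline_messages(2, sync_each=False))
```

The compliant variant sends one extra message per query, which is consistent with the small /rawqueries slowdown reported later in this thread.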
About valgrind - I can't see a bottleneck there (mostly because it runs slowly), but now I am trying perf + flamegraph - this technique shows more details because of high-frequency sampling.
What I have found so far is a way to optimize /plaintext in HTTP pipelining mode. The flamegraph for pipelined plaintext shows that most of the time is spent in the send syscall - see the flamegraph for ./raw 4 4 1 (clickable)
This is because the current HTTP pipelining implementation does fewer recv syscalls (it reads several GET requests in one syscall), but we could also buffer the output and do fewer sends!
This can be done if the HTTP state machine detects that, for a GET request with Connection: keep-alive and HTTP 1.1, there is an additional GET after the `\r\n`,
and then sends the buffered responses either when the send buffer overflows (for example, nodeJS buffers such responses in a 64k buffer) or when the pipeline ends (no additional bytes read).
Test case for 2 pipelined HTTP queries:
(echo -en "GET /plaintext HTTP/1.1\nHost: foo.com\nConnection: keep-alive\n\nGET /plaintext HTTP/1.1\nHost: foo.com\n\n"; sleep 10) | telnet localhost 8080
currently produces 2 sendto(7, "HTTP/1.1... syscalls, but could produce one.
Since HTTP pipelining should not be used in production for many reasons, this optimization can be done under hsoHttpPipelining flag. @ab - can you implement this idea?
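The buffering idea described above can be sketched roughly as follows. Everything here is hypothetical: the function name, the fixed response, and the naive split on the blank line stand in for a real HTTP state machine, and partial requests are not handled. It only shows that all responses for the requests found in one recv buffer are appended to a single output buffer, which is flushed when it would exceed a limit or when the pipeline ends.

```python
def buffer_pipelined_responses(recv_buffer: bytes, max_out: int = 64 * 1024) -> list[bytes]:
    """Return the payloads that would each be passed to one send() call."""
    body = b"Hello, World!"
    response = (b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n"
                b"Content-Length: " + str(len(body)).encode() + b"\r\n\r\n" + body)
    sends = []
    out = bytearray()
    # A body-less GET ends at the first blank line (simplification).
    requests = [r for r in recv_buffer.split(b"\r\n\r\n") if r]
    for _ in requests:
        if out and len(out) + len(response) > max_out:
            sends.append(bytes(out))   # output buffer full: flush it
            out.clear()
        out += response
    if out:
        sends.append(bytes(out))       # pipeline ended: flush the remainder
    return sends


two_requests = (b"GET /plaintext HTTP/1.1\r\nHost: foo.com\r\n"
                b"Connection: keep-alive\r\n\r\n") * 2
print(len(buffer_pipelined_responses(two_requests)))
```

For the two pipelined requests this yields a single payload containing both responses, instead of one send per request.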
P.S. - to be clickable, the flamegraph should be downloaded first and then opened in a browser - clicks do not work when the svg is opened in the Google Drive preview
Last edited by mpv (2023-02-15 11:07:25)
Offline
In fact, we can force to use the write buffer by setting the acoWritePollOnly option.
Then all writes will be done in a background thread, using connection buffers...
Try to uncomment line 3413 of THttpAsyncServer.Create in mormot.net.async.
Edit: it does not behave as we expect, because THttpAsyncConnection.AfterWrite is delayed to the next write operation, so the whole numbers are down...
acoWritePollOnly works great when there is no additional process (e.g. for our RTSP server) but not for HTTP.
Offline
So I have implemented what you proposed, and enabled pipelined sending for THttpAsyncConnection, using one output buffer if pipelined input is detected.
https://github.com/synopse/mORMot2/commit/d6d0919d
And measured a performance boost from 150,000/s without pipelining to 2,000,000/s with pipelined /plaintext requests on my laptop.
I am curious to see the numbers on high-end hardware, but I guess we could be back in the race in TFB for /plaintext too with this!
Offline
Great!
We are waiting for your feedback.
During my tests in pipelined mode (using -- 3200 as parameter to multiply the queries count up to a big size) I discovered a nasty memory problem.
It should be fixed now with https://github.com/synopse/mORMot2/commit/0a59bffb
Offline
All TFB tests pass
/plaintext results:
my PC: 1 271 496 -> 3 764 547
server: 2 621 195 -> 4 343 425 (nearly the same results are expected on TFB hardware)
Perfect!
I need some time to solve PG Sync() and will prepare a PR
Offline
PG pipelining is now implemented using Sync() as required by the TFB rules. raw* performance decreases a little (from 50K to 46K for /rawqueries)
@ab, please merge https://github.com/synopse/mORMot2/pull/144 - I added the Conn.CheckPipelineSync method and need it to prepare the TFB PR
Offline
TFB PR #7926 is ready
- /queries and /updates for raw test case: Postgres SQL pipelining uses Bind->Exec->Sync as required by General Rule 7
- /plaintext: improved HTTP pipelining on mORMot level - added response buffering to minimize send syscalls
- general optimization: use aggressive compiler optimization level (-O4)
I expect that with these changes we will be #12 (+600 points for /plaintext, -70 points for /rawqueries, plus some points from -O4 for all endpoints)
So, the next goal is to be in TOP 10
I have some ideas but still investigating....
Offline
Great!
Please try https://github.com/synopse/mORMot2/commit/7d7fde57
I tried to refactor the DB layer, and especially our PostgreSQL direct access classes, to gain a few points.
Offline
Yes, unexpectedly..
I posted an announcement about our work on the freepascal forum. Maybe the community will generate more ideas for optimizations....
Offline
Not much input in the FP forum yet... let's see.
I have made some optimizations today.
The main one may be https://github.com/synopse/mORMot2/commit/1511dc7af05 about general HTTP processing.
My guess is also that https://github.com/synopse/mORMot2/commit/0f9f4092 could increase the /fortunes test performance.
Other micro-optimizations at ORM level may help too.
Offline
Today I tried to find bottlenecks using perf, but without success..
Your changes to HTTP processing (nice idea!) give, on the server:
+20k RPS for json
+60k RPS for pipelined plaintext
+3k for /db and /rawdb /rawfortunes
Cumulative /fortunes effect is +25K
Offline
Now, apart from switching to io_uring instead of epoll/recv/send, I don't see what could make much of a difference...
https://unixism.net/loti/index.html
But this would be a lot of work, only making a difference for the special case of very small requests with no DB access...
Offline
Correct me if I'm wrong: looking at the python source, it looks like the benchmark starts the postgres docker, starts the app docker, then runs all tests defined in benchmark_config.json. If this is true, then any previous test has an impact on the next test, mostly by leaving MM garbage/fragmentation behind; also postgres is not restarted for the orm/raw tests.
Looking at benchmark_config.json from ntex (No1 in the composite score from the previous round), and also asp.net.core: instead of using one docker/monolith app, I propose to create 3 docker/apps: 1-json/plaintext, 2-orm, 3-raw. The composite score will aggregate the best results from all 3 apps. Also, for json/plaintext you can use a different number of threads than for the db tests, which scales better.
Offline
Switching to io_uring is too radical a change, IMHO. And as I see, the top 10 frameworks use epoll...
In my opinion, our current problem is that, with the same number of threads, increasing the listening sockets count improves json performance
I caught a case where the performance difference is dramatic on both the server and my PC (for other thread/socket counts, more sockets also always win, but the difference is not so big)
Server (28 cores)
wrk -d 15 -c 64 --timeout 8 -t 28 "http://localhost:8080/json"
./raw 56 28 1 (1x56 thread) - 143K RPS
./raw 28 28 2 (2x28 thread) - 770K RPS
My PC (12 cores)
wrk -d 15 -c 64 --timeout 8 -t 12 "http://localhost:8080/json"
./raw 24 12 1 - 300K
./raw 12 12 2 - 500K
perf (for the server) shows that:
in the case of one socket, most of the load goes into only 2 threads, R18080 and R28080 - see flamergath_json_56_1
in the case of two sockets, the load is distributed between many threads - see flamergath_json_28_2
I do not understand the reason for such behavior
P.S.
I also tried
- switching the PG protocol to binary mode - performance is the same
- setting HttpQueueLength to 0 - no change in performance, except a small number of socket errors
Offline
Several processes were my first attempt - one process with 6 listeners and 28 threads for each is better than 6 processes x 28 threads (I think because of memory management)
Last edited by mpv (2023-02-18 16:51:01)
Offline
About @ttomas ideas above
https://synopse.info/forum/viewtopic.ph … 094#p39094
Tuning the dockers, to match what the best tests do, is a good idea.
Several docker instances, with specific tuning of our raw.dpr program (e.g. about threading), may have some benefits.
Also tuning the dockers may help too - e.g. disabling iptables, or NAT, or whatever idea dockers magicians could find out - I don't know much about docker myself.
Tuning the OS/container may help for high-performance: it has been seen in several places.
Offline
"Looking at benchmark_config.json from ntex (No1 in the composite score from the previous round), and also asp.net.core: instead of using one docker/monolith app, I propose to create 3 docker/apps: 1-json/plaintext, 2-orm, 3-raw. The composite score will aggregate the best results from all 3 apps. Also, for json/plaintext you can use a different number of threads than for the db tests, which scales better."
Nice idea, I will investigate it!
Different thread counts for the non-db tests seem to make sense. I'll try to determine the best ones and maybe enable CPU affinity for them.
As I understand from the python sources, the DB container is restarted for each test run (once for default and once for postgres-raw), and the current test order db->json->queries->updates->fortunes->plaintext is OK (it's important that updates come after db/queries). So I do not see a reason to create separate containers for db and rawdb (but I will check whether it makes sense)
About "tuning the dockers" - we can't do this from inside a container. Docker daemon is configured on the host.
P.S.
My understanding of docker (simplified):
Docker is a slim wrapper around Linux namespaces and cgroups (+AuFS as a read only FS with layers). So everything is executed on host, but executable links to the libraries from docker image. No magic at all
BTW this is the reason we can't use io_uring - as far as I understand, the TFB host machine is based on Ubuntu 18 and io_uring is not supported there. TFB has plans to migrate it to 22.04 in the next round.
Last edited by mpv (2023-02-19 17:31:00)
Offline
My bad about https://synopse.info/forum/viewtopic.ph … 095#p39095 - in X X 1 mode raw enables hsoThreadSmooting - and this is why we got such a strange thread usage distribution
Last edited by mpv (2023-02-19 17:55:07)
Offline
Makes sense about hsoThreadSmooting.
Please see
https://github.com/synopse/mORMot2/commit/789b70e5
I allow customization of JSON serialization of ORM items/arrays, for more strict following of the TFB requirements.
Offline
"Is TPollSockets.GetOnePending the bottleneck?
Should we try to switch the fPending: TPollSocketResults structure into a truly lock-less algorithm?"
I think - yes. It's not visible in the profiler, but here are the syscall statistics (cleaned up a little) for 1667410 /json requests on a 24-thread server
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- -------------------
 40,34  242,887052         281    863994    163691 futex
 19,11  115,090885          69   1667410           sendto
 16,25   97,843747          58   1667499        15 recvfrom
  9,66   58,190893      843346        69           nanosleep
  0,14    0,826164          61     13415           epoll_wait <--- I play with event count - 256 is good enough
  0,01    0,032096          63       504           epoll_ctl
I think the futexes are because of Lock, RTLCriticalSection and other friends in GetOnePending
I also tried @ttomas's idea of different containers/thread settings for different tests.
After many attempts, I have achieved a small (100K) increase in /json, but the number of threads/instances is just magical (for 28 processors, 8 instances with 7 threads is the best). Also, separate containers do nothing for database-related tests. I suggest leaving everything as it is
Last edited by mpv (2023-02-20 19:07:50)
Offline
I expect more difference playing with servers and threads for pipeline /plaintext.
Tested on old Dell R710, 2xE5645 (6 cores, 12 threads), XCP-NG(XenServer), VM Ubuntu 22.04 with 24 vCPU.
@mpv nice catch about hsoThreadSmooting, I deleted
  if servers = 1 then
    include(flags, hsoThreadSmooting); // 30% better /plaintext e.g. on i5 7300U
raw is running in docker, as is ntex [tokio,platform] (in the last tfb round ~7M req/s)
mormot raw has better results than ntex on this platform!!!
wrk is run on the same VM with -t 6; later I found that -t 7 gives better results. 3 tests without warm-up.
About docker, we use ubuntu 20.04 focal for the builder and run on ubuntu 22.04. If TFB runs docker on Ubuntu 18, maybe it is better to use ubuntu 18. I will try the same docker test on a U18 VM.
wrk -d 15 -c 256 --timeout 8 -t 6 "http://localhost:8080/plaintext" -s pipeline.lua -- 16
Workers Servers Threads T1 T2 T3 AVG
24 1 24 1,342,597 1,431,568 1,408,269 1,394,145
24 2 12 1,372,775 1,395,293 1,399,634 1,389,234
24 3 8 1,402,108 1,399,117 1,387,009 1,396,078
24 4 6 1,424,154 1,419,472 1,389,615 1,411,080
48 1 48 1,292,962 1,341,772 1,278,475 1,304,403
48 2 24 1,391,753 1,380,388 1,384,175 1,385,439
48 3 16 1,404,478 1,417,240 1,403,237 1,408,318
48 4 12 1,375,386 1,384,120 1,383,750 1,381,085
48 6 8 1,374,656 1,410,551 1,378,311 1,387,839
72 1 72 1,268,166 1,018,549 966,550 1,084,422
72 2 36 1,372,683 1,338,490 1,357,713 1,356,295
72 3 24 1,384,253 1,376,085 1,362,731 1,374,356
72 4 18 1,363,745 1,400,351 1,360,997 1,375,031
72 6 12 1,379,854 1,401,052 1,376,467 1,385,791
96 1 96 918,802 972,689 946,801 946,098
96 2 48 1,282,146 1,329,924 1,341,937 1,318,003
96 3 32 1,356,407 1,358,175 1,379,885 1,364,822
96 4 24 1,355,498 1,387,323 1,361,684 1,368,168
96 6 16 1,379,136 1,367,597 1,393,248 1,379,994
ntex [tokio,platform] 1,042,539 1,014,515 1,042,050 1,033,035
Offline
About the futexes, I am not sure the syscalls come from GetOnePending.
I suspect they come from the thread awakening (SetEvent) if hsoThreadSmoothing is not defined.
Please try to change TOSLightLock into TLightLock for fPendingSafe definition in mormot.net.sock line 800.
If the number of futex call is lower, then GetOnePending is the culprit.
But I doubt it is... I may be wrong.
Thanks anyway ttomas for the feedback.
Any such set of numbers is very interesting...
Perhaps we could also try hsoThreadSmoothing but tuning the ThreadPollingWakeupLoad parameter to something smaller for /plaintext or /json
fThreadPollingWakeupLoad is the number of events the thread balancer accepts before waking up a new thread. Clearly, 32 is too high for /plaintext.
This default value of 32 was tuned on my 2 cores / 4 threads PC for high-level REST/JSON process.
You can try to change the default value of 32 in TAsyncConnections.Create to 16 or 8...
Perhaps a formula could emerge, like
fThreadPollingWakeupLoad := (cardinal(aThreadPoolCount) div SystemInfo.dwNumberOfProcessors) * 8;
or something....
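To get a feel for what such a formula would yield, here is a direct Python transcription of the expression above, evaluated for a few thread/CPU combinations mentioned in this thread. The function name is made up for the illustration, and `//` is used to mirror Pascal's integer `div`.

```python
def wakeup_load(thread_pool_count: int, cpu_count: int) -> int:
    """(threads div cpus) * 8, as in the proposed formula."""
    return (thread_pool_count // cpu_count) * 8


for threads, cpus in [(24, 12), (56, 28), (168, 28), (4, 8)]:
    print(threads, cpus, wakeup_load(threads, cpus))
```

Two things stand out with these inputs: 168 threads on 28 CPUs gives 48, which is above the current default of 32, and a pool smaller than the CPU count gives 0, so any real version would probably need a lower bound.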
I don't have any high-end hardware (yet) for testing.
But I will try to implement a lock-free GetOnePending next week (this week it will be difficult for me).
Offline
Results are ready - we are still #16. Impressive plaintext improvements, rawqueries (with PG Sync) worse than I expected, small improvements for json and db (I think because of -O4)
Weights                1.000    1.737    21.745     4.077   68.363      0.163
Composite #             JSON  1-query  20-query  Fortunes  Updates  Plaintext  Weighted score
38           mormot  731,119  308,233    19,074   288,432    3,431  2,423,283           3,486  2022-10-26 - 64 thread limitation
43           mormot  320,078  354,421    19,460   322,786    2,757  2,333,124           3,243  2022-11-13 - 112 thread (28CPU*4)
44           mormot  317,009  359,874    19,303   324,360    1,443  2,180,582           3,138  2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51           mormot  563,506  235,378    19,145   246,719    1,440  2,219,248           2,854  2022-12-01 - 112 thread (28CPU*4) CPU affinity
51           mormot  394,333  285,352    18,688   205,305    1,345  2,216,469           2,586  2022-12-22 - 112 threads CPU affinity + pthread_mutex
34           mormot  859,539  376,786    18,542   349,999    1,434  2,611,307           3,867  2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28           mormot  948,354  373,531    18,496   366,488   11,256  2,759,065           4,712  2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
16           mormot  957,252  392,683    49,339   393,643   22,446  2,709,301           6,293  2023-02-14 - 168 threads, cmem, improved PG pipelining
15           mormot  963,953  394,036    33,366   393,209   18,353  6,973,762           6,368  2023-02-21 - 168 threads, improved HTTP pipelining, PG pipelining uses Sync() as required, -O4 optimization
@ab - you are right. Replacing TOSLightLock with TLightLock does not change the futex call count. So, this is thread awakening. I will play with WakeupLoad later...
@ttomas - the current round result for mormot /plaintext is ~7M - and this is the 10Gb network limit on TFB hw. I think we should focus on /json. If we speed up HTTP for json, this automatically speeds up the db-related tests.
Last edited by mpv (2023-02-23 19:31:00)
Offline
Wow, /plaintext results are impressive. Idea of 3 docker/test for json/plaintext was to tune workers to 2*CPU. Differences are small, in range of statistical errors.
wrk -d 15 -c 256 --timeout 8 -t 7 "http://localhost:8080/json"
Workers Servers Threads T1 T2 T3 AVG
24 4 6 119.802 123.277 122.998 122.026
48 4 12 117.842 125.779 126.378 123.333
72 4 18 121.465 121.547 119.914 120.975
96 4 24 116.398 116.500 120.684 117.860
ntex [tokio,platform] 139.886 138.929 141.283 140.033
Last edited by ttomas (2023-02-21 10:03:35)
Offline
Perhaps https://github.com/synopse/mORMot2/commit/1e76b7d7 may help a little.
But it is likely to be also within the measurement error margin.
At least the code is slightly cleaner.
Offline
"I don't get why the 20-query decreased so much"
All frameworks decreased in the same way after adding Sync() - we are not the first.
When the PG server is local, the decrease is not so significant.
@ab - crazy idea - what if we use named pipes (mkfifo) to wake up the threads in the pool instead of futex? Have you considered this option?
I mean TAsyncConnections creates a fifo pipe and writes to it when data is ready, and all workers simply read from the pipe in blocking mode. And let the kernel decide which worker wakes up by reading successfully..
In this case we double the syscall count (write + read instead of futex), but avoid loops over fThreads and simplify the code a lot. Don't we?
P.S.
Even a simple pipe should work. And there is also a message queue if pipes are slow...
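A minimal sketch of the pipe idea, assuming one writer and several blocking readers on an anonymous pipe (names and sizes are made up; this is not mORMot code). Each job is written as a fixed 8-byte record, so every blocking read gets exactly one job, and the kernel alone decides which blocked reader gets it. Closing the write end wakes every remaining reader with EOF.

```python
import os
import struct
import threading


def pipe_dispatch(job_ids, worker_count=4):
    """Hand job ids to blocked worker threads through one shared pipe."""
    read_fd, write_fd = os.pipe()
    handled = []
    lock = threading.Lock()

    def worker():
        while True:
            data = os.read(read_fd, 8)   # blocks until a job id is available
            if not data:                 # EOF: the write end was closed
                return
            (job_id,) = struct.unpack("<Q", data)
            with lock:
                handled.append(job_id)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for job_id in job_ids:
        # 8-byte writes are far below PIPE_BUF, so each one is atomic
        os.write(write_fd, struct.pack("<Q", job_id))
    os.close(write_fd)                   # wakes every blocked reader with EOF
    for t in threads:
        t.join()
    os.close(read_fd)
    return sorted(handled)


print(pipe_dispatch(range(10)))
```

As noted in the post, this costs one write plus one read per event, so it only shows the control flow; it says nothing about whether it would beat the futex-based events.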
Last edited by mpv (2023-02-21 20:59:04)
Offline
When I investigated, MQ was not better than futex: "On Linux, mq_timedreceive() is a system call, and mq_receive() is a library function layered on top of that system call." says the manual.
About mkfifo performance, I am not sure it would be faster either: there will be a syscall on waiting anyway.
IIRC Linus wrote that futex are the lightest and safe way of waiting in the user space.
Other mechanisms use kernel futexes themselves.
I think we could get some ideas for the queue in
https://github.com/h2o/h2o/blob/master/ … tithread.c
Perhaps https://man7.org/linux/man-pages/man2/eventfd.2.html could be used in a way similar to your mkfifo: let the threads wait for data, and let the kernel do the scheduling, instead of waking them up manually with our ThreadPollingWakeupLoad custom algorithm.
static void init_async(h2o_multithread_queue_t *queue, h2o_loop_t *loop)
{
#if defined(__linux__)
/**
* The kernel overhead of an eventfd file descriptor is
* much lower than that of a pipe, and only one file descriptor is required
*/
An eventfd is what io_uring seems to use for multi-threading: https://unixism.net/loti/tutorial/register_eventfd.html
BTW, did you try to play with ThreadPollingWakeupLoad ?
Offline
Yes, I played with ThreadPollingWakeupLoad + hsoThreadSmoothing (4, 8, 16, 32). The best results are for ThreadPollingWakeupLoad = 8. In the best case (if the boost on TFB HW is nearly the same) we get +150 points, but I am still not sure Smoothing is a good solution in terms of being "realistic". So my proposal is to enable it when 150 points matter, not now...
w\o tpw4 tpw8 tpw16 tpw32(m)
json 1,338,754 1,360,205 1,457,470 1,406,111 1,163,232
rawdb 470,228 471,954 477,038 476,576 474,571
db 460,406 455,805 460,086 456,671 455,382
fortunes 386,097 388,822 392,342 387,425 387,932
rawfort 437,250 435,545 436,378 438,344
plaintext 4,238,095 4,125,256 4,353,065 4,249,592 4,223,741
rawQuery 45,925 45,809 45,819
queries 34,712 34,658 34,826
I can try to use eventfd on the weekend, or, if you plan to try it yourself (the mormot async code is still a little complex for me), please notify me.
Last edited by mpv (2023-02-22 11:21:40)
Offline
Thanks for the input.
So I have setup a simple calculation for this constant, as
https://github.com/synopse/mORMot2/commit/aa5786ad93ad
It is always customizable anyway, once the server is launched.
I will try to include eventfd this weekend, without breaking the current algorithm - which is still to be used outside of Linux.
I am confident that letting the kernel wake up the threads when needed is a better option than doing it from the web server side.
This eventfd mechanism seems to be used everywhere async performance is needed with the Linux kernel.
Offline
"I will try to include eventfd this weekend"
Thank you very much!
I will make a new PR to TFB based on today's sources with all the minor changes and a cleaned-up raw.pas (without enabling smoothing). This gives us a clean picture for a future comparison of event vs eventfd.
P.S.
Ready #7944
Last edited by mpv (2023-02-22 17:11:13)
Offline
Just tried Microsoft mimalloc MM. https://github.com/microsoft/mimalloc
No noticeable improvements, same results as libc MM. ldd /usr/local/bin/raw confirms libmimalloc.so is used.
I used alpine docker https://github.com/emerzon/alpine-mimalloc.
Dockerfile
# ... Same builder
FROM emerzon/alpine-mimalloc
COPY --from=builder /build/bin/fpc-x86_64-linux/raw /usr/local/bin/raw
RUN apk --update add postgresql-client && \
    # Workaround musl vs glibc entrypoint
    mkdir /lib64 && \
    ln -s /lib/ld-musl-x86_64.so.1 /lib64/ld-linux-x86-64.so.2
EXPOSE 8080
CMD ["raw"]
Offline
Last round has mormot results
Offline
@ttomas - thanks for the investigation. At least now we know it doesn't help. BTW I investigated jemalloc with the same result. So for now we can consider the glibc MM good enough
@dcoun - this round (started 2023-02-23) should be the same as the previous one, because our last MR with minor improvements is not merged yet.
@ab - I am investigating a pipe (a simple program with one writer, which writes a pointer into the pipe, and many readers concurrently reading it) - performance depends very much on the kernel version. On an old kernel (4.18, as on my server) it's terrible. On a newer one (5.18, as on my desktop) it's a lot faster (x10). I am still planning to try eventfd + ring buffer. I will post a gist when finished.
Offline
@mpv
I have made a first eventfd() implementation in https://github.com/synopse/mORMot2/commit/a0e916a1
It is disabled by default because the numbers I got were not very good...
Perhaps the EFD_SEMAPHORE mode is not the one we need. But it was the one which fits into the existing code AFAICT.
You can uncomment the line #1981 of mormot.net.async and make some tests on your side.
https://github.com/synopse/mORMot2/comm … 26c50R1981
At least, it consumes all CPU cores on my PC - but with a lot of syscalls for sure.
Edit: I have added a new hsoEventFD option, since the numbers don't feel so bad after all even on my PC.
See https://github.com/synopse/mORMot2/commit/e3589c99
Please try it, perhaps with thread affinity to cpu or socket...
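For anyone who wants to experiment with the EFD_SEMAPHORE behaviour outside of mORMot, here is a small stand-alone sketch. It assumes Linux and Python 3.10+ (for `os.eventfd`); the deque-based job queue and all names are invented for the example and do not reflect how mormot.net.async is written. In semaphore mode each blocking read consumes exactly one unit of the counter, so one write of 1 wakes exactly one waiting thread.

```python
import collections
import os
import threading


def eventfd_dispatch(job_ids, worker_count=4):
    """Wake worker threads through an eventfd used as a counting semaphore."""
    efd = os.eventfd(0, os.EFD_SEMAPHORE)    # Linux only, Python 3.10+
    queue = collections.deque()              # append/popleft are thread-safe
    handled = []
    lock = threading.Lock()

    def worker():
        while True:
            os.eventfd_read(efd)             # blocks; consumes exactly one token
            job_id = queue.popleft()
            if job_id is None:               # sentinel: shut down
                return
            with lock:
                handled.append(job_id)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for job_id in job_ids:
        queue.append(job_id)                 # publish the job first...
        os.eventfd_write(efd, 1)             # ...then let the kernel wake a reader
    queue.extend([None] * worker_count)
    os.eventfd_write(efd, worker_count)      # one token per sentinel
    for t in threads:
        t.join()
    os.close(efd)
    return sorted(handled)


if hasattr(os, "eventfd"):
    print(eventfd_dispatch(range(10)))
```

Each job still costs one write and one read syscall here, which matches the remark above about "a lot of syscalls"; batching several jobs per write would be one way to reduce that, at the price of waking several threads at once.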
Offline
Just checked /json on Ubuntu 18.04, kernel 4.15, as the docker host (previous tests were on a U22.04 VM), to be sure whether a different kernel or glibc/MM version has any impact.
Small improvements on all tests, also for ntex. The Ubuntu version does not affect the test!
Workers Ser. Thr. T1 T2 T3 AVG
libc MM 48 2 24 121.680 126.686 124.550 124.305
mimalloc 48 2 24 125.354 120.953 127.821 124.709
ntex [tokio,platform] 144.824 148.111 146.022 146.319
Last edited by ttomas (2023-02-28 10:10:29)
Offline
I tried it yesterday (uncommenting line #1981) - on server HW the numbers are slightly lower compared to the Events-based algorithm.
We can reduce syscalls by using blocking IO for the eventFD, and remove fOwner.fThreadPollingEventFD.WaitFor(5000) altogether.
The blocked read() call should be terminated by a signal (and return -1) when the application (daemon) stops.
I tried this approach, but the threads lock up somewhere.....
Offline