WaitFor() is not the bottleneck under high contention, because most of the time is spent within the "while GetNext and GetOnePending do ProcessRead" loop.
In fact, GetNext() will make a read() syscall to decrement the eventfd counter.
This GetNext() / read() is mandatory in EFD_SEMAPHORE mode.
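As an illustration (a standalone sketch using a direct libc binding, not the mORMot wrapper): in EFD_SEMAPHORE mode each wake-up costs its own read() syscall, because every read() only decrements the counter by one.
program EfdSemaphoreDemo;
{$mode objfpc}
// sketch only: EFD_SEMAPHORE forces one read() syscall per wake-up,
// because each read() returns 1 and decrements the counter by only one
uses
  BaseUnix, ctypes;
const
  EFD_SEMAPHORE = 1; // from <sys/eventfd.h>
// direct libc binding (an assumption for this sketch, not the mORMot declaration)
function eventfd(initval: cuint; flags: cint): cint; cdecl; external 'c' name 'eventfd';
var
  fd: cint;
  cnt: QWord;
  i: integer;
begin
  fd := eventfd(0, EFD_SEMAPHORE);
  cnt := 3;
  FpWrite(fd, cnt, SizeOf(cnt)); // post 3 "events" at once
  for i := 1 to 3 do
  begin
    FpRead(fd, cnt, SizeOf(cnt)); // one syscall per event: cnt = 1 each time
    WriteLn('woken #', i, ' value=', cnt);
  end;
  FpClose(fd);
end.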
Edit:
I managed to run eventfd() in blocking mode, which gives clean syscalls: read(eventfd) + recvfrom(socket) + sendto(socket) in a loop in the threads.
But the numbers were slower than my previous attempt with poll()... and I had to tweak the R0 thread to consume more CPU...
So eventfd + EFD_SEMAPHORE does not sound like a good option. I don't see how we could use the non-semaphore mode anyway.
Offline
But the correct solution may be that, on Linux, TSynEvent uses eventfd() instead of PRTLEvent:
- eventfd() in blocking and non-semaphore mode reduces the syscalls to simple read/write/poll
- seems to be stable in practice - feedback is welcome!
https://github.com/synopse/mORMot2/commit/c551e487
Now, in the syscall traces, we have no futex call any more (unless there is a real lock contention) during the sub-thread process.
Just some read() and write() for inter-thread communication.
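As an illustration only (a minimal sketch of the idea, not the actual TSynEvent code, which also handles timeouts and Windows): in blocking non-semaphore mode, setting the event is a single write() and waiting is a single read() which resets the counter.
program EfdEventSketch;
{$mode objfpc}
// rough sketch of the idea only - not the TSynEvent implementation:
// non-semaphore blocking eventfd = one write() to set, one read() to wait,
// the read() returns the whole counter and resets it to zero atomically
uses
  BaseUnix, ctypes;
function eventfd(initval: cuint; flags: cint): cint; cdecl; external 'c' name 'eventfd';
type
  TEfdEvent = record
    fd: cint;
  end;
procedure EventInit(var e: TEfdEvent);
begin
  e.fd := eventfd(0, 0); // blocking, non-semaphore mode
end;
procedure EventSet(var e: TEfdEvent);
var
  one: QWord;
begin
  one := 1;
  FpWrite(e.fd, one, SizeOf(one)); // single write() syscall to wake a waiter
end;
procedure EventWait(var e: TEfdEvent);
var
  got: QWord;
begin
  FpRead(e.fd, got, SizeOf(got)); // blocks until the counter <> 0, then resets it
end;
var
  ev: TEfdEvent;
begin
  EventInit(ev);
  EventSet(ev);  // would normally be called from another thread
  EventWait(ev); // returns immediately here because the event is already set
  WriteLn('event consumed');
  FpClose(ev.fd);
end.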
My guess is that the whole previous acoEventFD code could be dropped.
It will always be slower than this solution.
Offline
Digging deeper, I found that with the new implementation we get better results with fewer total threads. This is very good - I expect the DB tests will be faster. Moreover, 7 threads * 14 servers is better than 14 threads * 7 servers.
I will investigate more.
BTW - the TFB guys plan to begin setting up NEW servers with Ubuntu 22.04 this week (see https://github.com/TechEmpower/Framewor … 449036999), so we have some time...
Offline
Nice to read!
Did you try with hsoThreadSmooting now?
It could help focus the server on a few threads for /plaintext or /json, but spread the flow to all the threads in case of slower DB access.
Perhaps the next step may be to implement DB pipelining for the ORM, maybe using a batch...
This is where the "raw" results are really behind the "ORM" results.
Offline
Playing with the mormot2 master from today. The best results on 24 vCPUs are 2 servers with 12 threads each, including the hsoThreadSmooting and hsoReusePort flags. There are some strange results when running docker with -p 8080:8080 vs --net=host, compared with ntex: ntex runs faster with -p, mORMot with --net=host. I also include results from running raw natively on the host without docker. mORMot is faster than ntex when run with --net=host!!!
                           T1        T2        T3        AVG
Native
  raw 12 24 2              196.817   196.177   196.558   196.517
docker run -p 8080:8080
  raw 12 24 2              132.468   123.207   126.877   127.517
  ntex                     144.824   148.111   146.022   146.319
docker run --net=host
  raw 12 24 2              186.369   188.466   186.378   187.071
  ntex                     137.510   136.848   136.738   137.032
Online
Nice numbers!
Is there a way to ensure TFB uses --net=host for the mORMot image?
Final numbers of previous round are available.
mORMot is #16 as expected - no regression.
https://www.techempower.com/benchmarks/ … =composite
The previous PR from Pavel has been merged, so the next round should include the code as cleaned before our eventfd() modifications.
https://synopse.info/forum/viewtopic.ph … 154#p39154
Then the following round, maybe on new Hardware, and new Linux/Ubuntu version, could include the latest TSynEvent / eventfd() ideas.
Sounds like we are on the right track.
We need to validate whether hsoThreadSmooting is really better on the new HW/SW stack too. I wonder how many cores the new HW will have.
Perhaps io_uring may be the next big step. We could statically link liburing-ffi.a wrapper from https://github.com/axboe/liburing to try it.
That is, if the new Ubuntu kernel supports io_uring. Ubuntu 22.04.2 LTS seems to have a 5.19 HWE kernel... I don't know if they will use this just-released update.
Side note:
I looked at the ntex source, and I found
https://github.com/TechEmpower/Framewor … /db.rs#L54
my guess is that pl += 1; is incorrect in this loop - it should be pl += 2;
So half of the ntex updates are clearly wrong.
The TFB validator should check that the /updates actually do what is expected.
Offline
Update:
I propose to change the /rawupdates SQL query
https://github.com/synopse/mORMot2/commit/725087dc and
https://github.com/synopse/mORMot2/commit/0d3804cf
- the ORM-like nested SELECT is replaced by " update table set randomNumber = CASE id when $1 then $2 when $3 then $4 ... when $9 then $10 else randomNumber end where id in ($1,$3,$5,$7,$9) " (a small generator sketch is shown after this list)
- this weird syntax gives the best numbers for TFB /rawupdates?queries=20 but is not as good for smaller or higher counts
- we won't include it in the ORM but only for our RAW results - as other frameworks (e.g. ntex) do
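For illustration, a hypothetical generator of that statement with the '?' placeholders expected by the mORMot DB layer (a sketch, not the committed code):
program CaseWhenSketch;
{$mode objfpc}
// hypothetical helper, not the committed mORMot code: build the CASE WHEN
// update statement with the '?' placeholders used by the mORMot DB layer
function BuildCaseWhenUpdate(count: integer): string;
var
  i: integer;
  ids: string;
begin
  Result := 'update World set randomNumber = CASE id';
  ids := '';
  for i := 1 to count do
  begin
    Result := Result + ' when ? then ?';
    if i > 1 then
      ids := ids + ',';
    ids := ids + '?';
  end;
  // count*2 bound parameters for the CASE part + count more for the IN clause
  Result := Result + ' else randomNumber end where id in (' + ids + ')';
end;
begin
  WriteLn(BuildCaseWhenUpdate(3));
end.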
Offline
Configuring the host for heavy benchmarking such as TFB is a separate issue. I recommend:
- kernel - turn off mitigation patches at the kernel level (by adding mitigations=off to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub)
- docker - disable the userland proxy (by adding "userland-proxy": false to /etc/docker/daemon.json) @ttomas - try this instead of --net
- do not use any VM - on a VM results are always unpredictable
- increase the max open files limit (example settings after this list)
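For example (my assumption of the exact file contents - adjust paths and values to your distro):
/etc/default/grub (then run update-grub and reboot):
GRUB_CMDLINE_LINUX_DEFAULT="mitigations=off"
/etc/docker/daemon.json (then restart the docker service):
{ "userland-proxy": false }
/etc/security/limits.conf (or a systemd LimitNOFILE= override) to raise the open files limit:
* soft nofile 1048576
* hard nofile 1048576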
I tried with hsoThreadSmooting - for /json the result is better - 1450K vs 1300K without, for /plaintext - 4220K vs 4200K without, for /rawdb and /rawfortunes - nearly the same (but not slower), queries and updates are a little (<1K) slower.
About the updates SQL - your version does not work for Postgres (the SynDB layer expects ? as the parameter placeholder). When I fixed it to generate "when ? then ? ... where id in (?....)" and bind 60 parameters instead of 40, performance did not change (+-100) compared to the array-bound version. Sorting the Worlds array does not change anything.
Offline
Fixed /rawupdates - https://github.com/synopse/mORMot2/pull/152, but, since performance is the same, I propose reverting to the sub-query as more realistic.
Offline
Does /rawupdates really get no benefit from using UPDATE ... CASE WHEN ...?
It sounds like the frameworks ranking better than mORMot use this pattern.
And on my side, with SQLite3 it is even better for /rawupdates?queries=20, which is the one used for the TFB ranking IIRC.
We could at least try on the TFB hardware.
About ntex it was my bad - but the main issue remains: I don't think they validate that updates do actually take place.
Perhaps we could try enabling hsoThreadSmooting by default.
I made some small refactoring today:
does https://github.com/synopse/mORMot2/commit/6e242a7149 make any difference on server HW?
I had to switch the timer on for Windows, because select() polling is not the same as epoll().
But on Linux, the syscall trace seems cleaner now. There is no poll() any more in the R0 thread, and this thread uses less CPU than before AFAICT.
I also fixed the TAsyncConnections.ThreadPollingWakeup() algorithm, so the R1..Rn threads should be woken up more efficiently.
Offline
For /rawupdates - there is really no benefit, at least on my server.
I think our problem is not in the SQL clause, but in concurrency. The other better-ranking frameworks are async and create a connection count = CPU count. We need at least connections = CPU*3. I hope that with eventfd I can decrease connections from the current CPU*6 to CPU*3 without performance loss. And also apply the new rawupdate - just to verify...
I will verify the latest changes this evening (the server I use for tests is someone's production box and additional load exists during the workday). But the code is much cleaner now, for sure.
About io_uring - let's wait until some framework implements it and see the numbers...
Offline
The other frameworks use coroutines/futures to manage the slow DB process, whereas we use blocking access to the DB, so we need more threads than they do.
Good point about io_uring: in practice, the 7M RPS on /plaintext is about the maximum of the current HW network IIRC.
It may be possible to scale better with /json but it is not certain, since benchmarks of io_uring vs epoll are not really consistent...
What is sure is that we spend a lot of time in the kernel during read/write operations, whereas the io_uring scheme may help by greatly reducing the number of syscalls.
Ensure you include https://github.com/synopse/mORMot2/commit/20b5f3c1 in your tests.
Offline
I rethought the way I run tests on the server hardware (2 sockets * 12 cores * 2 threads = 48 logical CPUs).
All my previous tests were executed on the same server with:
- the app server limited to the first 28 CPUs (taskset -c 0-28 ./raw ...) to match the TFB hardware logical CPU count
- wrk limited to the last 20 cores
- Postgres w/o CPU limits
This is totally wrong for several reasons:
- the good: the network is local, but we can't do anything about this. TFB uses a 10G switch, so we can expect it to be nearly the same as a local network
- the bad: Postgres is not limited to specific CPUs - this is bad and we get unexpected results for the DB-related tests (our numbers do not match TFB; for example I got better /rawqueries)
- the ugly: the app server uses CPUs from different sockets - this is ugly, for sure
So I decided to set CPU limits of 16 cores for each part - app 0-15, Postgres 16-31 (systemctl set-property postgresql.service AllowedCPUs=15-31 + restart) and wrk 32-48 - and to repeat all tests. The numbers are lower, but proportionally they should be closer to the TFB HW.
Will publish numbers soon..
Last edited by mpv (2023-03-05 19:15:40)
Offline
Results for the 16-16-16 CPU split are on Google Drive.
The best mode is the 8-16 one - 8 threads per server * 1 server per CPU;
The mode I selected for TFB (16 threads * 4 servers in this case, 28 * 6 for their HW) is one of the worst :(
hsoThreadSmooting availability and/or eventfd vs PRTLEvent results are very close;
In all tests the app server consumes ~98% of its CPUs.
@ab - take a look at the cached-queries x2 difference with many threads per server.
Last edited by mpv (2023-03-05 21:28:31)
Offline
Very good fully proven testing!
A lot of time spent, I suppose....
About the cached-queries x2 difference, this is weird.
There is a R/W lock, so multiple threads should be able to access the cache in parallel.
Perhaps the O(log(n)) algorithm of the ORM cache could be improved - it was the easiest and is good enough in terms of memory. I could use an O(1) TDynArrayHasher instead of the O(log(n)) binary/sorted search. For 10,000 items, it makes a difference.
Here I tried to follow your findings:
https://github.com/synopse/mORMot2/commit/213542ca
Offline
Your changes are adopted for the TFB PR, see small fixes in pull/153. I will make a pull request to TFB (based on mORMot 2.0 stable) after we get the results for the current run (they may stop it to tune the server; I hope our results will be ready before this happens). Many thanks for statics.tgz!
I found a thread-safe lock-free multiple-producer multiple-consumer queue implementation by @BeRo1985.
@ab - have you seen it? Maybe it would be better for waking threads than eventfd/RTLEvent?
Offline
I already looked at the Bero code for the queue, but it sounded a bit too complex for the task.
There are alternative lock-free queues around, which just use indexes and not double-pointer slots - which are not native in the FPC RTL.
AFAIR the pending list is not a point of contention.
There is no switch to the kernel futex during the process, and I looked at the log timings, which showed only a few microseconds per run. The TOsLightLock protects an O(1) branchless algorithm which accesses data in the L1 CPU cache.
Offline
I have rewritten the TOrmCache process.
https://github.com/synopse/mORMot2/commit/abb5e068
Now we can directly retrieve a ready-to-be-serialized TOrm instance from the cache.
Numbers should be better now for /cached-queries - even if they are not included in the main composite ranking.
Edit: ensure you got at least https://github.com/synopse/mORMot2/commit/46f5360a with the latest ORM cache optimizations.
Offline
TFB MR 7994 is ready - based on commit 46f5360a:
- thread pool auto-tuning: use 1 listening socket per logical CPU and 8 working threads per listening socket
- for /updates with count <= 20 use the 'case .. when .. then' pattern
- [mORMot] update to mORMot 2.0.stable
- [mORMot] improved cached queries performance
Our DB layer does not support a parameter count > 999, so I added a fallback to UNNEST for /rawupdates (the TFB tests use count=500 for verification = 1500 parameters in the query; the usual unnest form is recalled below) - will do an MR to mORMot ex/ later...
Cached-queries performance increased from 788k to 923k in the 8 threads * 16 servers mode.
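For reference, the usual Postgres unnest form (as commonly seen in TFB tests - column names follow the World table here, exact casing may differ) is something like: " update World set randomNumber = v.r from (select unnest(?::int[]) as id, unnest(?::int[]) as r) as v where World.id = v.id " - only two array parameters are bound, whatever the row count.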
Offline
Yes, they updated only the OS (Ubuntu 18.04 -> Ubuntu 22.04). Numbers are ~10-12% lower for all frameworks; I think they did not turn off all the new mitigation patches...
In any case we decreased the gap with .NET.
Let's wait for the next round with the improved thread pool and new updates query. I hope my thread pool size investigations are correct...
Offline
Current TFB status:
In the round completed 2023-03-10 we moved down in composite score #16 -> #17 (a new rust (tokio) framework, viz, was added), but in the endpoint tests we improved our results:
- /json #67 -> #65
- /db #31 -> #25
- /queries #30 -> #29
- /cached #46 -> #38 (we expect a huge improvement in the next results)
- /fortunes #19 -> #20 (unfortunately)
- /updates #29 -> #25 (some improvements expected in next round with updates w/o unnest)
- /plain #23 -> #21
Some of the results improved because of our changes, some because other frameworks are more affected by the new kernel mitigation patches.
The next results are expected on 2023-03-15.
Offline
I have slightly updated the TFB benchmark sample:
- introducing a /rawcached endpoint
- updated README.md content
See https://github.com/synopse/mORMot2/commit/8e9313c4
I have added /rawcached because some frameworks with no ORM do implement it, so it could be a good idea to include it in the tests.
It is implemented just by using an in-memory dynamic array of TOrmWorld instances - this seems to be allowed by the specifications, because we are allowed to assume that the IDs will be in the range 1..10000, so we can just use an array as the caching mechanism. On my notebook, the numbers are better than with the ORM cache, because the O(log(n)) binary lookup clearly has a cost.
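As a sketch of why the plain array wins (illustration only, not the actual /rawcached code): with IDs known to be 1..10000, the lookup is a single index computation instead of an O(log(n)) sorted search.
program ArrayCacheSketch;
{$mode objfpc}
// sketch only: IDs are known to be 1..10000, so a dynamic array
// indexed by ID-1 gives O(1) lookups with no hash and no binary search
type
  TWorldRec = record
    ID: integer;
    RandomNumber: integer;
  end;
const
  WORLD_COUNT = 10000;
var
  Cache: array of TWorldRec;
procedure FillCache;
var
  i: integer;
begin
  SetLength(Cache, WORLD_COUNT);
  for i := 0 to WORLD_COUNT - 1 do
  begin
    Cache[i].ID := i + 1;
    Cache[i].RandomNumber := 1 + Random(10000); // would come from the DB
  end;
end;
function GetCachedWorld(id: integer): TWorldRec;
begin
  Result := Cache[id - 1]; // direct index: no hash, no binary search
end;
begin
  Randomize;
  FillCache;
  WriteLn(GetCachedWorld(42).RandomNumber);
end.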
Offline
For the /raw kind of endpoints, I don't see why it should be avoided. And we don't use a key-value map but an array.
But you are right, it seems to break the rule - even if a cached query without ORM does not make much sense.
For the ORM, we use the standard/canonical mORMot ORM cache: this is what matters IMHO.
Offline
Weights 1.000 1.737 21.745 4.077 68.363 0.163
# JSON 1-query 20-q Fortunes Updates Plaintext Scores
38 731,119 308,233 19,074 288,432 3,431 2,423,283 3,486 2022-10-26 - 64 thread limitation
43 320,078 354,421 19,460 322,786 2,757 2,333,124 3,243 2022-11-13 - 112 thread (28CPU*4)
44 317,009 359,874 19,303 324,360 1,443 2,180,582 3,138 2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51 563,506 235,378 19,145 246,719 1,440 2,219,248 2,854 2022-12-01 - 112 thread (28CPU*4) CPU affinity
51 394,333 285,352 18,688 205,305 1,345 2,216,469 2,586 2022-12-22 - 112 threads CPU affinity + pthread_mutex
34 859,539 376,786 18,542 349,999 1,434 2,611,307 3,867 2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28 948,354 373,531 18,496 366,488 11,256 2,759,065 4,712 2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
16 957,252 392,683 49,339 393,643 22,446 2,709,301 6,293 2023-02-14 - 168 threads, cmem, improved PG pipelining
15 963,953 394,036 33,366 393,209 18,353 6,973,762 6,368 2023-02-21 - 168 threads, improved HTTP pipelining, PG pipelining uses Sync() as required, -O4 optimization
17 915,202 376,813 30,659 350,490 17,051 6,824,917 5,943 2023-03-03 - 168 threads, minor improvements, Ubuntu 22.04
17 1,011,928 370,424 30,674 357,605 13,994 6,958,656 5,871 2023-03-10 - 224 threads (8 thread * 28 instances) eventfd, ThreadSmooting, update use when..then
Conclusions:
- because of ThreadSmooting the scores changed: json +96, db -6, fortunes +28 = +118
- rawupdate with the new when..then algo: -4k RPS (-272 score). But ORM updates improved by +3k. The good news here is that I am 95% sure this is because we bind int params as strings. I will implement binary binding today (retrieving textual results is OK - I verified several times)
- cached-queries improved 148K -> 349K. Mostly because of 8*28 threads + the ready-to-be-serialized TOrm instances
Last edited by mpv (2023-03-14 06:54:08)
Offline
I added binary parameter binding - please see https://github.com/synopse/mORMot2/pull/159
This MR breaks parameter logging, since I rewrite the p^.VInt64 value into network (htonl) byte order as required by the PG binary protocol.
Can we extend TSqlDBParam by adding a VInt64BE field? Because the comment below says not to extend it...
// - don't change this structure, since it will be serialized as binary
// for TSqlDBProxyConnectionCommandExecute
TSqlDBParam = packed record
This speeds up a little all the endpoints that retrieve one row (at both the ORM and raw level). Maybe in the TFB environment the increase will be more significant.
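For the record, a tiny sketch of the byte-order step only (an illustration, not the PR code): the PG binary protocol expects int4/int8 values in network byte order, so on a little-endian CPU the value must be swapped before binding.
program BigEndianBindSketch;
{$mode objfpc}
// sketch only, not the PR code: the PG binary protocol wants big-endian
// values, so an int8 parameter must be byte-swapped on little-endian CPUs
function Int64ToPgBinary(value: Int64): Int64;
begin
  {$ifdef ENDIAN_LITTLE}
  Result := SwapEndian(value); // NtoBE() would do the same
  {$else}
  Result := value;
  {$endif}
end;
var
  v: Int64;
begin
  v := Int64ToPgBinary(7991);
  // the 8 bytes of v are then passed as-is with paramFormats[] = 1 (binary)
  WriteLn(v);
end.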
For /rawupdates on my server there is almost no difference between unnest / when..then / when..then+bin_bind (I always get +-25k). So I can't explain why the TFB /rawupdates result is so poor.
The structure for binding an array in binary format is undocumented, quite complex and uses 4x more traffic than the string representation we currently use, so I do not see any reason to implement it for binding binary arrays into unnest (see the array representation here https://stackoverflow.com/a/66499392).
My proposal is to use when..then for count <= 10 and fall back to unnest otherwise - this gives the best speed for any parameter count.
Offline
Great results for /json /plaintext.
Looking again at the read-only connection table https://techcommunity.microsoft.com/t5/ … 06462#fn:1
for 96 vCPUs, compare 100 and 800 connections.
I can't find a similar table for write/update connections, but for sure the impact will be much worse for shared I/O (HDD/SSD).
Maybe you can create one more test (docker) with the same code but reduced threads/connections (2*CPU) and see the differences in one round.
Last edited by ttomas (2023-03-14 14:17:13)
Online
@mpv
I have merged the PR.
- IMHO the logs should be correct, as I wrote in a comment on the PR: the log text is computed BEFORE BindParams, within the SqlDoBegin(sllSQL) method call.
- It makes sense for array binding to keep the text format.
- About the <= 10 limit: why not, but we must be sure it is not worth using CASE WHEN THEN on the TFB HW, as other frameworks do.
@ttomas
IIRC the docker image runs the writing tests after the reading tests, so it should not affect the performance.
Do you mean that 228 threads is slower than fewer threads?
Anyway 2*CPU is very likely to be slower for the DB access - but it should not change much for /plaintext or /json; anyway mpv did a lot of testing to arrive at this 8*CPU count, which makes sense. Our "thread smoothing" algorithm won't use more threads than really necessary, in all cases.
Offline
@ttomas - I really, really did verify different thread counts. We work with PG in blocking mode, so we need at least connections = CPU*3 (yes, some of them will be idle periodically). Take a look at the first 3 rows of the results table 3 posts above: for /db the 3rd row with 140 connections is better than the 2nd with 112 connections, and the 2nd is better than the 1st with 64.
@ab - the logging problem is only for /rawqueries - see my comment on GitHub
Offline
@mpv
Please try https://github.com/synopse/mORMot2/commit/2ae346fe about TSqlDBPostgresStatement.GetPipelineResult logging.
Offline
About /rawupdates - in the last round with when..then we got only 4k RPS (13K is the ORM /updates result). I have no idea why the /updates results on the TFB hardware are so strange and do not match my tests, but in fact unnest is better for mORMot.
Offline
@mpv
Please try https://github.com/synopse/mORMot2/commit/2ae346fe about TSqlDBPostgresStatement.GetPipelineResult logging.
It's a little strange. Was it intended to have an empty q= on the second result retrieval?
17:40:07.957 28 DB mormot.db.sql.postgres.TSqlDBPostgresStatement(7fb9c80013c0) Prepare t=2.98ms c=01 q=select id,randomNumber from World where id=?
...
17:40:07.958 28 SQL mormot.db.sql.postgres.TSqlDBPostgresStatement(7fb9c80013c0) Execute t=2.99ms c=01 q=select id,randomNumber from World where id=7991
17:40:07.958 28 SQL mormot.db.sql.postgres.TSqlDBPostgresStatement(7fb9c80013c0) Execute t=2.99ms c=01 q=select id,randomNumber from World where id=4057
17:40:07.958 28 Result mormot.db.sql.postgres.TSqlDBPostgresStatement(7fb9c80013c0) Execute t=3.15ms c=01 r=1 q=select id,randomNumber from World where id=4057
17:40:07.958 28 Result mormot.db.sql.postgres.TSqlDBPostgresStatement(7fb9c80013c0) Execute t=3.24ms c=01 r=1 q=
17:40:07.958 28 Result mormot.db.sql.postgres.TSqlDBPostgresStatement(7fb9c80013c0) Execute t=3.25ms c=01 r=1 q=
...
Offline
@ab
Do you mean that 228 threads is slower than fewer threads?
Yes, looking at the pgbench test table for the 96-CPU server, the best result for read-only is when connections = CPUs. We are testing the Postgres server and depend on the shared I/O resource speed (HDD): the bigger the bottleneck, the more that too many threads and too many connections drop performance. Write/update connections put even more stress on I/O.
@mpv, different HW, HDD/SSD/RAIDx, maybe the TFB HW. I know we miss an async client. On a slow HDD the thread count is more important, as on my old server.
Online
@mpv
With https://github.com/synopse/mORMot2/commit/24ef0cbc there is no q= any more.
@ttomas
Sadly, we don't have any async coding in FPC. Even with anonymous methods it won't change much...
The only possibility may be to use a state machine instead of blocking code for the DB execution in the /raw endpoints... such a state machine (as we use in the async HTTP/WebSockets server) is even better than modern languages' async code (less overhead).
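Purely as an illustration of that idea (hypothetical code, nothing like this exists in mORMot today): each in-flight DB request would carry an explicit state, advanced by epoll events instead of a blocked worker thread.
program DbStateMachineSketch;
{$mode objfpc}
// hypothetical illustration of the state-machine idea discussed above,
// not mORMot code: a DB request advances through explicit states instead
// of blocking one worker thread per in-flight query
type
  TDbRequestState = (rsSendQuery, rsWaitResult, rsWriteAnswer, rsDone);
  TDbRequest = record
    State: TDbRequestState;
  end;
// would be called whenever epoll reports the related socket as ready
procedure Advance(var req: TDbRequest);
begin
  case req.State of
    rsSendQuery:
      begin
        WriteLn('queue the prepared statement (non-blocking send)');
        req.State := rsWaitResult;
      end;
    rsWaitResult:
      begin
        WriteLn('DB socket readable: fetch the rows');
        req.State := rsWriteAnswer;
      end;
    rsWriteAnswer:
      begin
        WriteLn('serialize and push the HTTP answer');
        req.State := rsDone;
      end;
    rsDone: ;
  end;
end;
var
  req: TDbRequest;
begin
  req.State := rsSendQuery;
  while req.State <> rsDone do
    Advance(req); // in a real server, driven by epoll events, not a loop
end.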
Offline
Oops. @ab - with the latest sources the TFB tests fail (server crashed) with the message "mormot: double free or corruption (!prev)"
Did your regression tests pass?
Update - it crashes on commit "fixed TSqlDBPostgresStatement.GetPipelineResult logging"
Last edited by mpv (2023-03-14 16:03:00)
Offline
No problem with the regression tests.
I will investigate further.
Using mORMot 2.1.5100 x64MMs
TSqlite3LibraryStatic 3.41.0 with internal MM
Generated with: Free Pascal 3.2.3 64 bit Linux compiler
Time elapsed for all tests: 2m11
Performed 2023-03-14 17:12:19 by abouchez on tisab
Total assertions failed for all test suits: 0 / 75,969,743
! All tests passed successfully.
Heap dump by heaptrc unit of /home/abouchez/dev/lib2/test/fpc/bin/x86_64-linux/mormot2tests
66769342 memory blocks allocated : 17206425493/17380405688
66769342 memory blocks freed : 17206425493/17380405688
0 unfreed memory blocks : 0
True heap size : 0
True free heap : 0
Ensure you include my latest fix.
Offline
It's strange. Another 3 runs on the latest sources, and everything is OK. Maybe it was a problem on my PC...
I made TFB PR 8031:
- for /updates with count <= 15 use the 'case .. when .. then' pattern, for count > 15 - the 'unnest' pattern
- use the binary parameter binding format for Int4/Int8 parameter types - should be faster than textual
Offline
Our test is used by the TFB team to verify their CI, just because we build quickly.
See https://github.com/TechEmpower/Framewor … 1467058544
Offline
Good news - I found a way to improve PG pipelining performance (rawqueries, rawupdates).
The libpq PQpipelineSync function flushes the socket on each call (it does a write syscall).
I traced the justjs implementation and observed that they send all pipeline commands in one write syscall. After this I rebuilt libpq with the flush commented out: everything works correctly and performance increased by +4k (10%) RPS for rawqueries and +1k for rawupdates. (I'm using a local Postgres; it should increase more over the network.) This should add ~150 composite score points.
Now I will either implement PQpipelineSync in Pascal (which needs access to internal libpq structures) or, if I can't, add a modified libpq to the docker file.
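To illustrate the batching idea with stock libpq calls (a sketch under assumptions: it needs a PostgreSQL 14+ client library, the 'sel_world' prepared statement name is hypothetical, non-blocking mode and error handling are omitted, and the actual gain above was obtained by patching libpq itself, since TFB currently requires one Sync() per query):
program PipelineBatchSketch;
{$mode objfpc}
// sketch only: queue several prepared-statement executions and let a single
// PQpipelineSync() + PQflush() write the whole batch to the socket at once
uses
  ctypes;
type
  PPGconn = pointer;
  PPGresult = pointer;
// minimal libpq bindings (pipeline functions need PostgreSQL >= 14)
function PQconnectdb(conninfo: PAnsiChar): PPGconn; cdecl; external 'pq';
function PQenterPipelineMode(conn: PPGconn): cint; cdecl; external 'pq';
function PQsendQueryPrepared(conn: PPGconn; stmtName: PAnsiChar;
  nParams: cint; paramValues: PPChar; paramLengths: pcint;
  paramFormats: pcint; resultFormat: cint): cint; cdecl; external 'pq';
function PQsendFlushRequest(conn: PPGconn): cint; cdecl; external 'pq';
function PQpipelineSync(conn: PPGconn): cint; cdecl; external 'pq';
function PQflush(conn: PPGconn): cint; cdecl; external 'pq';
function PQgetResult(conn: PPGconn): PPGresult; cdecl; external 'pq';
procedure PQclear(res: PPGresult); cdecl; external 'pq';
procedure RunBatch(conn: PPGconn; count: integer);
var
  i: integer;
  res: PPGresult;
begin
  PQenterPipelineMode(conn);
  for i := 1 to count do
  begin
    // assumes a statement named 'sel_world' was prepared beforehand
    PQsendQueryPrepared(conn, 'sel_world', 0, nil, nil, nil, 0);
    PQsendFlushRequest(conn); // asks the server to flush; no client write yet
  end;
  PQpipelineSync(conn); // single Sync: one write() for the whole batch
  PQflush(conn);
  for i := 1 to count do
  begin
    res := PQgetResult(conn); // the rows of query #i
    if res <> nil then
      PQclear(res);
    res := PQgetResult(conn); // nil marks the end of query #i's results
  end;
  res := PQgetResult(conn); // PGRES_PIPELINE_SYNC for the single Sync point
  if res <> nil then
    PQclear(res);
end;
begin
  // connection setup, statement preparation and error checking are omitted:
  // RunBatch(PQconnectdb('dbname=hello_world'), 20);
end.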
Offline
Or even a pull request to the official libpq?
Just a thought: could it be possible to use a single connection among threads, but make it use pipelining instead of blocking commands when running Bind and ExecutePrepared?
Offline
TFB's requirement to call Sync() after each pipeline step is highly debatable. Moreover, this discussion was started by the .NET team, and they may have their reasons for doing so (after all, MS sponsors the citrine hardware, so they can do it).
My opinion is that we do not require sync() at all; the Postgres test authors hold the same opinion in their test_nosync.
So I decided not to do a PR to libpq but to use a modified version (I placed it on GitHub). The new TFB #8057 is ready, based on the latest mORMot sources.
About the current state - the mORMot results for round 2023-03-16 are ready, but it seems all frameworks' results are lower in this round, so we can't be sure the binary binding helps. We moved one place up in composite score.
Offline
About using one connection for several threads - I don't like this idea (even if it may improve performance), because it's not "realistic" in terms of transactions. In the TFB bench we don't need transactions, but in real life - yes. Several threads may want to commit/rollback their own transactions (in parallel) and this is impossible with a single connection.
In fact, I still don't understand why our /rawdb result is half that of the top "async" frameworks. This is abnormal. The DB server CPU load is ~70% for our test in this case, so the bottleneck is on our side. But where? In the libpq select calls? Or because all our threads are busy? No answer yet, but solving this problem is for sure a way to the top 10. I am sure we can solve it without going "async".
Offline
And a small hack - set the server name to 'M' instead of the default "mORMot (linux)". Over 7 million responses it matters. I just saw that .NET calls its server 'K' instead of 'Kestrel' for the TFB tests.
Offline
Thanks for the feedback.
Nice numbers, perhaps a bit disappointing for the integer parameters binding.
We will see how your "hacked libpq" will make a difference - hacking the lib is a good solution.
We could try to make a round with more threads per core, e.g. 16 instead of 8.
Perhaps it makes a difference with their HW...
And on another subject, you were right about the LDAP client: it is a nightmare to develop. I spent 2 days in Wireshark to find out what Microsoft did NOT do for their AD with respect to the RFC.
They don't follow the RFC for the Kerberos handshake, then they make SASL encryption mandatory (whereas it should be customizable)... But at least it seems to work now from Linux and Windows clients. Until the next bug/unexpected MS "feature".
Offline