And we do not need the asoForceConnectionFlush option at all. Even with the modified libpq, PQgetResult will internally call flush first. In rawqueries, `pConn.Flush;` can also be removed.
Please see this PR
Last edited by mpv (2023-04-28 18:56:21)
Offline
For now, the best /asyncfortunes result I have been able to achieve is with servers=CPUCount*2, threads per server=1, pinned:
# taskset -c 0-15 ./raw12 -s 32 -t 1
....
num servers=32, threads per server=1, total threads=32, total CPU=48, accessible CPU=16, pinned=TRUE
taskset -c 31-47 ./wrk -H 'Host: 10.0.0.1' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 10 -c 512 --timeout 8 -t 16 "http://localhost:8080/asyncfortunes"
Requests/sec: 482751.02
This is +25%, which is very VERY good, but I am almost sure I need to play more with the parameters...
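The layout reported in the log line above (16 accessible CPUs, 32 servers, 1 thread each) is simple arithmetic. Here is a minimal sketch in Python. The name `plan_layout` and the round-robin assignment of servers to CPUs are assumptions made for illustration only; they are not taken from the raw sources.

```python
def plan_layout(accessible_cpus, servers, threads_per_server, pinned=True):
    """Return (total_threads, pin_map) for a given -s / -t combination.

    pin_map maps each server index to the CPU it would be pinned to.
    Round-robin assignment is an assumption made for this sketch only.
    """
    cpus = list(accessible_cpus)
    total_threads = servers * threads_per_server
    pin_map = {}
    if pinned:
        for server in range(servers):
            pin_map[server] = cpus[server % len(cpus)]
    return total_threads, pin_map


# The run quoted above: taskset -c 0-15, -s 32 -t 1, pinned
total, pins = plan_layout(range(16), servers=32, threads_per_server=1)
print(total)              # total threads
print(pins[0], pins[16])  # under round-robin, servers 0 and 16 share a CPU
```

With `servers = CPU*2` and one thread per server, every accessible CPU ends up with two single-threaded servers in this sketch.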
Offline
For DB queries, you need to use more cores for ./raw and less for wrk.
Perhaps https://github.com/synopse/mORMot2/commit/35dbef14
makes sense.
I guess there are some missing files in your PR.
Offline
Now I have merged your PR.
I did miss the line about Flush - now I got it.
My next step is to create a new per-connection ExecuteAsyncPrepared() method, in addition to the current per-properties pattern.
As a result, there should be no lock at all during the DB requests. The responses would be handled by a new async thread, one per connection thread, so one per core.
My guess is that we could try to have a more regular threading model, perhaps with a single server, and no pinning.
Offline
Just tested the current implementation - for now the best results are with `-s CPU*2 -t 1 -p`. /asyncdb and /asyncfortunes are faster (+25%) compared to rawdb/fortunes.
I can wait until you implement `ExecuteAsyncPrepared`, or update the TFB PR with the current implementation - what is your opinion?
Offline
I've updated TFB PR 8182 with the current state of the sources (refactored PostgreSQL async DB) - a new async test suite is added. They usually merge on Saturday night (today), so we may participate with async in Monday's run.
Offline
BTW - a modification to libpq, similar to ours, has been applied by the Postgres reviewers and should be included in Postgres v17 (in ~1 year). The next h2o test should also use a modified libpq that does not flush on every sync.
Offline
Yes, better have a round with the first algorithm of async writings.
I won't be able to deliver something stable until the next round.
-s CPU*2 -t 1 -p
somehow makes sense - but it is a pretty weird setting for sure.
Nice seeing the modified libpq.
We could identify the new endpoint and use it with a "pre-release" build - which no one could argue against. This is just the future official version.
Offline
TFB state
Weights      1.000    1.737  21.745     4.077   68.363      0.163
#             JSON  1-query    20-q  Fortunes  Updates  Plaintext  Scores  Date        Notes
38         731,119  308,233  19,074   288,432    3,431  2,423,283   3,486  2022-10-26  64 thread limitation
43         320,078  354,421  19,460   322,786    2,757  2,333,124   3,243  2022-11-13  112 thread (28CPU*4)
44         317,009  359,874  19,303   324,360    1,443  2,180,582   3,138  2022-11-25  140 thread (28CPU*5) SQL pipelining
51         563,506  235,378  19,145   246,719    1,440  2,219,248   2,854  2022-12-01  112 thread (28CPU*4) CPU affinity
51         394,333  285,352  18,688   205,305    1,345  2,216,469   2,586  2022-12-22  112 threads CPU affinity + pthread_mutex
34         859,539  376,786  18,542   349,999    1,434  2,611,307   3,867  2023-01-10  168 threads (28 thread * 6 instances) no affinity
28         948,354  373,531  18,496   366,488   11,256  2,759,065   4,712  2023-01-27  168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
16         957,252  392,683  49,339   393,643   22,446  2,709,301   6,293  2023-02-14  168 threads, cmem, improved PG pipelining
15         963,953  394,036  33,366   393,209   18,353  6,973,762   6,368  2023-02-21  168 threads, improved HTTP pipelining, PG pipelining uses Sync() as required, -O4 optimization
17         915,202  376,813  30,659   350,490   17,051  6,824,917   5,943  2023-03-03  168 threads, minor improvements, Ubuntu 22.02
17       1,011,928  370,424  30,674   357,605   13,994  6,958,656   5,871  2023-03-10  224 threads (8 thread * 28 instances) eventfd, ThreadSmooting, update use when..then
11       1,039,306  362,739  29,363   354,564   15,748  6,959,479   5,964  2023-03-16  224 threads (8*28 eft, ts), update with unnest, binary binding
17       1,045,953  362,716  30,896   353,131   16,568  6,994,573   6,060  2023-04-13  224 threads (8*28 eft, ts), update using VALUES (),().., removed Connection: Keep-Alive resp header
13       1,109,267  363,671  31,652   352,706   16,897  6,956,038   6,156  2023-04-24  224 threads (-s 28 -t8 -p), each server (with all threads) are pinned to the different CPU
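For readers wondering how the Scores column relates to the Weights row: from the numbers in this table, the score appears to be the weighted sum of the six results, divided by 1000 and truncated. This is inferred from the table itself (it reproduces the rows checked below), not taken from the TFB documentation, so treat it as a sketch. The function name `composite_score` is made up for the example.

```python
# Weights row of the table, in column order:
# JSON, 1-query, 20-q, Fortunes, Updates, Plaintext
WEIGHTS = (1.000, 1.737, 21.745, 4.077, 68.363, 0.163)


def composite_score(json, one_query, twenty_q, fortunes, updates, plaintext):
    """Weighted sum of the six requests/sec figures, scaled down by 1000."""
    results = (json, one_query, twenty_q, fortunes, updates, plaintext)
    weighted = sum(w * r for w, r in zip(WEIGHTS, results))
    return int(weighted / 1000)  # truncation, as the table values suggest


# 2022-10-26 row of the table
print(composite_score(731_119, 308_233, 19_074, 288_432, 3_431, 2_423_283))  # -> 3486
```

The Updates weight (68.363) is by far the largest, which is why a change of a few thousand updates/sec moves the composite score more than a large change in JSON or Plaintext.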
Thanks to the CPU pinning, we are now #13 (above .NET). Today's round started without a merge.
We hope that our MR with *async* test suite and improved Int64 JSON serialization will be merged in the next round.
It is very likely that we will be in the top 10 (and #1 in cached queries) after that.
Last edited by mpv (2023-05-01 11:30:11)
Offline
Your MR has been merged.
We will see in the next round what's up with the initial async process.
Here is some new threading model for async process, with one async/pipelined connection per thread (in addition to the default non-pipelined connection per thread).
This thread and connection is only initialized if async methods are used - so there is no change for regular/non-pipelined connections.
Please try https://github.com/synopse/mORMot2/commit/44cc2507
and https://github.com/synopse/mORMot2/commit/6b4e1a98
From my tests, it gives better results with a single instance and no pinning nor affinity, and around 2-8 threads per cpu core.
It makes sense to me that several instances and core pinning may not be mandatory for the best performance: we could let the kernel do its scheduling job...
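To illustrate the idea of "one async connection per thread, only created if async methods are used", here is a rough sketch in Python. It is not the mORMot implementation: `AsyncPipeline`, `async_pipeline` and the `execute` callable are hypothetical names, and a queue plus a reader thread stands in for the real pipelined PostgreSQL connection.

```python
import queue
import threading


class AsyncPipeline:
    """Stand-in for one pipelined DB connection plus its own reader thread."""

    def __init__(self, execute):
        self._execute = execute            # hypothetical: callable(sql) -> result
        self._pending = queue.Queue()
        self._reader = threading.Thread(target=self._run, daemon=True)
        self._reader.start()

    def submit(self, sql, on_result):
        # The calling thread only enqueues and returns; it never waits on the DB.
        self._pending.put((sql, on_result))

    def _run(self):
        # The reader thread delivers each response to its callback, in order.
        while True:
            sql, on_result = self._pending.get()
            try:
                on_result(self._execute(sql))
            finally:
                self._pending.task_done()

    def drain(self):
        # Block until every submitted request has been answered.
        self._pending.join()


_per_thread = threading.local()


def async_pipeline(execute):
    """Create the calling thread's pipeline on first use, then reuse it."""
    pipe = getattr(_per_thread, "pipe", None)
    if pipe is None:
        pipe = _per_thread.pipe = AsyncPipeline(execute)
    return pipe


results = []
pipe = async_pipeline(lambda sql: sql.upper())   # fake "database"
pipe.submit("select 1", results.append)
pipe.drain()
print(results)
```

A thread that never calls `async_pipeline` never pays for the extra connection or reader thread, which mirrors the "no change for regular/non-pipelined connections" point above.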
Offline
Tested the new async implementation on 2x Xeon(R) Silver 4214R CPU @ 2.40GHz. Each component is limited by taskset: the first 16 CPUs for the app, the second 16 CPUs for the db and the third 16 CPUs for wrk (to emulate three TFB servers).
The result is better than the initial async implementation (first table row). The best values are still for the -s 32 -t 1 -p mode.
See table - on google drive
Offline
Thanks a lot for the numbers!
Which I don't fully understand, to be honest: with -s 32 my guess would be that the new raw14 implementation would be slower than the previous raw12.
But anyway, it sounds like a good step up in respect to the non-async version of the code.
We could try a round on TFB HW with -s 32 -t 1 -p then at least two rounds with -s 1 -t 64, one with hsoThreadSmooting and another without.
(with 32 or 64 numbers changed to match the TFB core counts, of course)
Offline
I also do not fully understand the numbers, but we have what we have. The TFB results for the first async implementation should appear on 2023-05-11; after this I'll make a PR with the new implementation and one more test case for async, so we will verify both the `-s CPU*2 -t 1 -p` and `-s 1 -t CPU*4` cases.
Offline
I submitted this issue https://github.com/TechEmpower/Framewor … ssues/8205 to TFB.
Some of the frameworks or benchmarks are clearly cheating and are not robust HTTP servers at all.
Therefore, I proposed to validate that the server is able to properly respond to HTTP/1.0 or Connection: Close input.
My guess is that a few frameworks should be marked as "Stripped" and disappear from the benchmark ranking - until the most basic HTTP behavior is implemented.
Some of the buggy/unrealistic benchmarks are part of the top #20 - e.g. several rust implementations or even some asp.net core.
If you find it meaningful, you can comment on the issue too.
Even propose a simple script to validate the fact (I am not fluent in bash/grep).
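As a starting point for such a validation, here is a small Python sketch using only the standard `socket` module. It sends either an HTTP/1.0 request or an HTTP/1.1 request with `Connection: close`, and checks that the server answers and then closes the socket by itself. The function names and the 2 second timeout are arbitrary choices for this example; this is not what TFB runs.

```python
import socket


def build_request(path, host, http10):
    """Build a minimal GET request, either HTTP/1.0 or HTTP/1.1 + Connection: close."""
    if http10:
        return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    return (f"GET {path} HTTP/1.1\r\nHost: {host}\r\n"
            "Connection: close\r\n\r\n").encode("ascii")


def closes_after_response(host, port, request, timeout=2.0):
    """Return True if the server answers and then closes the socket itself."""
    received = b""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        try:
            while True:
                chunk = sock.recv(65536)
                if not chunk:                  # EOF: the server closed the connection
                    return received.startswith(b"HTTP/")
                received += chunk
        except socket.timeout:                 # still open: keep-alive was forced
            return False
```

Usage against a locally running server would look like `closes_after_response("localhost", 8080, build_request("/plaintext", "localhost", http10=True))`. A server that ignores HTTP/1.0 or `Connection: close` keeps the socket open, so the call returns False after the timeout.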
I guess we will appear soon in https://www.techempower.com/benchmarks/ … d7dc0a0f74
And I hope we will be higher with the asynchronous database requests.
Offline
We will wait and see :)
Offline
Numbers for this round did appear.
Not as good as I hoped.
They are pretty weird, not consistent with what we expected by using taskset on our hardware.
The async version is not faster than the blocking version, apart from the updates.
This is exactly the contrary of what we observed with taskset: no benefit for updates, but faster db/queries/fortunes.
My best guess is that
1) this first async mode was using a single async connection per server, which is not scaling so well.
2) pinning was not beneficial at all with the async way of execution - or we should also pin the async thread.
3) the raw config was not properly set up for async (to be verified in the logs when they are available)
But thanks to the better updates numbers, we are now higher in the composite ranking. We should be in the top #10 now.
@mpv
Perhaps we could now make a MR with the current state of the framework and raw source, i.e. with the new async code.
With whatever config `-s CPU*2 -t 1 -p` or `-s 1 -t CPU*4` you want.
I would not be surprised that `-s 1 -t CPU*4` would be better than we measured before.
Offline
I understand why updates are better for async - this is because of less concurrency - in fact on my environment I also got ~23K for updates.
I will make a new PR today with the latest sources and a new async test case with `-s 1 -t CPU*4`.
And YES - we are in TOP10 now!!!! Congratulations!!!!
Offline
New MR 8207 based on latest sources and a new test case `-s 1 -t CPU*2 -nopin` is ready.
I use the `unnest` pattern for /asyncUpdates (as in the prev. MR), because your implementation fails on the ?queries=501 test (too many parameters). In /rawupdates the rule is: if count>20 use `unnest`, else use `select from values`. But unnest works well, IMHO.
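To make the difference between the two patterns concrete, here is a sketch in Python that builds both statements and counts their bound parameters. The SQL text is an illustrative guess at the general shape (table `world`, columns `id` and `randomnumber`, as in the TFB schema), not a copy of the statements in the raw sources; only the threshold of 20 comes from the post above.

```python
UNNEST_THRESHOLD = 20  # from the post: above 20 rows, switch to unnest


def build_values_update(count):
    """VALUES pattern: two placeholders per row, so 2 * count parameters."""
    rows = ", ".join(f"(${2 * i + 1}::int, ${2 * i + 2}::int)" for i in range(count))
    sql = ("UPDATE world SET randomnumber = v.r "
           f"FROM (VALUES {rows}) AS v(id, r) "
           "WHERE world.id = v.id")
    return sql, 2 * count


def build_unnest_update():
    """unnest pattern: always two array parameters, whatever the row count."""
    sql = ("UPDATE world SET randomnumber = v.r "
           "FROM (SELECT unnest($1::int[]) AS id, unnest($2::int[]) AS r) AS v "
           "WHERE world.id = v.id")
    return sql, 2


def build_update(count):
    """Pick the pattern the way the post describes for /rawupdates."""
    if count > UNNEST_THRESHOLD:
        return build_unnest_update()
    return build_values_update(count)


print(build_update(2)[0])
print(build_values_update(500)[1], build_update(500)[1])
```

With 500 rows the VALUES form needs 1000 placeholders while the unnest form still needs 2, which is consistent with the "too many parameters" failure mentioned above.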
Offline
Nice!
It makes sense to use "unnest" - especially because this is what the previous async did - with pretty good numbers.
And the new test case would give us results within the next round, without the need to wait for another one.
Perhaps https://github.com/synopse/mORMot2/commit/37fe0d8d may help for updates too.
It would reduce the number of memory allocations during the array binding.
Worth trying on the next MR.
Hope they will merge the request before the next round.
Edit: perhaps https://github.com/synopse/mORMot2/commit/f58c289e would help a little more too.
Edit2: I have added binary array binding for 32-bit and 64-bit parameters.
https://github.com/synopse/mORMot2/commit/fa3cd430
But from my tests, it is not really faster - as you already stated.
Offline
The env. variable is now correctly passed into the app container - see the modified dockerfile.
I suggest running it once without a binary array binding to have a basis for comparison..
Offline
Nice finding!
Sad that your MR was not taken into account before this new round.
Hope they will include it in the next!
Do you know enough Python to write a check that HTTP 1.0 is properly interpreted, for my issue?
Offline
TFB state
Weights      1.000    1.737  21.745     4.077   68.363      0.163
#             JSON  1-query    20-q  Fortunes  Updates  Plaintext  Scores  Date        Notes
38         731,119  308,233  19,074   288,432    3,431  2,423,283   3,486  2022-10-26  64 thread limitation
43         320,078  354,421  19,460   322,786    2,757  2,333,124   3,243  2022-11-13  112 thread (28CPU*4)
44         317,009  359,874  19,303   324,360    1,443  2,180,582   3,138  2022-11-25  140 thread (28CPU*5) SQL pipelining
51         563,506  235,378  19,145   246,719    1,440  2,219,248   2,854  2022-12-01  112 thread (28CPU*4) CPU affinity
51         394,333  285,352  18,688   205,305    1,345  2,216,469   2,586  2022-12-22  112 threads CPU affinity + pthread_mutex
34         859,539  376,786  18,542   349,999    1,434  2,611,307   3,867  2023-01-10  168 threads (28 thread * 6 instances) no affinity
28         948,354  373,531  18,496   366,488   11,256  2,759,065   4,712  2023-01-27  168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
16         957,252  392,683  49,339   393,643   22,446  2,709,301   6,293  2023-02-14  168 threads, cmem, improved PG pipelining
15         963,953  394,036  33,366   393,209   18,353  6,973,762   6,368  2023-02-21  168 threads, improved HTTP pipelining, PG pipelining uses Sync() as required, -O4 optimization
17         915,202  376,813  30,659   350,490   17,051  6,824,917   5,943  2023-03-03  168 threads, minor improvements, Ubuntu 22.02
17       1,011,928  370,424  30,674   357,605   13,994  6,958,656   5,871  2023-03-10  224 threads (8 thread * 28 instances) eventfd, ThreadSmooting, update use when..then
11       1,039,306  362,739  29,363   354,564   15,748  6,959,479   5,964  2023-03-16  224 threads (8*28 eft, ts), update with unnest, binary binding
17       1,045,953  362,716  30,896   353,131   16,568  6,994,573   6,060  2023-04-13  224 threads (8*28 eft, ts), update using VALUES (),().., removed Connection: Keep-Alive resp header
13       1,109,267  363,671  31,652   352,706   16,897  6,956,038   6,156  2023-04-24  224 threads (-s 28 -t8 -p), each server (with all threads) are pinned to the different CPU
7        1,109,693  381,633  32,725   353,182   23,022  6,975,086   6,634  2023-05-13  224 threads, added async test in -s 28 -t8 -p mode: db, queries & updates is for async, fortunes for direct
We are #7 even with non-optimal thread/server count for async tests. And #2 in cached-queries
Tomorrow new results are expected - async tests will be executed in `-s 56 -t 1 -p` and `-s 1 -t 56 --nopin`. I am waiting for the results with bated breath..
Last edited by mpv (2023-05-23 10:54:32)
Offline