About "with more threads per core, e.g. 16 instead of 8." - we can. But I almost sure this not help, because currently our server uses 100% CPU on rawdb. Let's wait for next round and after try with more threads.
About LDAP - I discover different MS implementations with different Windows Server versions. Also Azure AD (ADFS) has its own nuances
When you get tired of fighting it - here is how I use libldap (mormot1 compatible). I need only ldapbind (use it to verify user password), but I sure it's work in many scenarios. Here is URL's example and some troubleshooting.
Offline
Using LDAP for authentication as such works, but it is insecure.
As you wrote:
Security warning - the password for LDAP authentication is passed in plain text over the wire, so the server should accept only HTTPS connections to be secure.
Even on the server side, the connection with the LDAP server is in plain text.
I would never advise using it in production.
But why is calling libldap needed? A simple plain bind is very easy to code.
See also how we retrieve the LDAP addresses from the system information, and some DNS service discovery in https://github.com/synopse/mORMot2/blob … s.pas#L304
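For illustration, here is a minimal sketch of such a plain (simple) LDAPv3 bind over a raw FPC socket - not the mormot.net.ldap code, just the bare BER encoding. It assumes single-byte BER lengths (DN and password under 128 bytes), a one-byte messageID in the answer, and a dotted-IP host; and of course it sends the password in clear text, as warned above:

uses
  SysUtils, Sockets;

function BerTLV(tag: Byte; const value: RawByteString): RawByteString;
begin
  // definite single-byte length only: value must be shorter than 128 bytes
  Result := AnsiChar(tag) + AnsiChar(Length(value)) + value;
end;

function LdapSimpleBind(const ip: string; port: Word;
  const dn, password: RawByteString): Boolean;
var
  sock: Longint;
  addr: TInetSockAddr;
  req: RawByteString;
  buf: array[0..511] of Byte;
  n: Longint;
begin
  Result := false;
  sock := fpSocket(AF_INET, SOCK_STREAM, 0);
  if sock < 0 then
    exit;
  try
    FillChar(addr, SizeOf(addr), 0);
    addr.sin_family := AF_INET;
    addr.sin_port := htons(port);
    addr.sin_addr := StrToNetAddr(ip);      // expects a dotted IP here
    if fpConnect(sock, @addr, SizeOf(addr)) <> 0 then
      exit;
    // LDAPMessage ::= SEQUENCE { messageID, BindRequest }
    // BindRequest ::= [APPLICATION 0] { version=3, name, authentication simple [0] }
    req := BerTLV($30,
             BerTLV($02, #1) +               // messageID = 1
             BerTLV($60,
               BerTLV($02, #3) +             // LDAP protocol version 3
               BerTLV($04, dn) +             // bind DN
               BerTLV($80, password)));      // simple (plain) password
    if fpSend(sock, pointer(req), Length(req), 0) <> Length(req) then
      exit;
    n := fpRecv(sock, @buf, SizeOf(buf), 0);
    // expected answer: 30 len 02 01 01 61 len 0A 01 <resultCode> ...
    // -> success when resultCode = 0
    Result := (n >= 10) and (buf[0] = $30) and (buf[5] = $61) and (buf[9] = 0);
  finally
    CloseSocket(sock);
  end;
end;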
Offline
About libldap it is a long story - first we used Synapse, but there were TLS problems there, then libcurl, but it also has known LDAP issues. So we switched to libldap. Also, the ldapsearch utility (which is built on top of libldap) is well documented, and our customers can use it to diagnose their problems. libldap has worked for us for a long time.
Our latest TFB changes in PR 8057 generated some discussion (as expected)
Offline
Current round ends
Weights 1.000 1.737 21.745 4.077 68.363 0.163
# JSON 1-query 20-q Fortunes Updates Plaintext Scores Date - Notes
38 731,119 308,233 19,074 288,432 3,431 2,423,283 3,486 2022-10-26 - 64 thread limitation
43 320,078 354,421 19,460 322,786 2,757 2,333,124 3,243 2022-11-13 - 112 thread (28CPU*4)
44 317,009 359,874 19,303 324,360 1,443 2,180,582 3,138 2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51 563,506 235,378 19,145 246,719 1,440 2,219,248 2,854 2022-12-01 - 112 thread (28CPU*4) CPU affinity
51 394,333 285,352 18,688 205,305 1,345 2,216,469 2,586 2022-12-22 - 112 threads CPU affinity + pthread_mutex
34 859,539 376,786 18,542 349,999 1,434 2,611,307 3,867 2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28 948,354 373,531 18,496 366,488 11,256 2,759,065 4,712 2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
16 957,252 392,683 49,339 393,643 22,446 2,709,301 6,293 2023-02-14 - 168 threads, cmem, improved PG pipelining
15 963,953 394,036 33,366 393,209 18,353 6,973,762 6,368 2023-02-21 - 168 threads, improved HTTP pipelining, PG pipelining uses Sync() as required, -O4 optimization
17 915,202 376,813 30,659 350,490 17,051 6,824,917 5,943 2023-03-03 - 168 threads, minor improvements, Ubuntu 22.02
17 1,011,928 370,424 30,674 357,605 13,994 6,958,656 5,871 2023-03-10 - 224 threads (8 thread * 28 instances) eventfd, ThreadSmooting, update use when..then
11 1,039,306 362,739 29,363 354,564 15,748 6,959,479 5,964 2023-03-16 - 224 threads (8*28 eft, ts), update with unnest, binary binding
We are at #11, mostly because many top-rated frameworks failed in this round. The good news is that we are VERY close to .NET now.
It looks like the next round will be without our latest changes, which caused a lot of discussion.
I found a way to improve DB-related performance, but such a change requires rewriting a part of libpq in Pascal: currently, to get a result, libpq calls poll and then recv. The poll call can be avoided - it is used only to implement a timeout, and on Linux we can use SO_RCVTIMEO for this. Such a change should improve the DB round-trip by 10-30%.
So my idea is to use libpq for connection establishment, and then operate directly on the socket returned by PQsocket. I will do it little by little...
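A minimal sketch of that idea (the helper name is mine, and it assumes the change is done at the Pascal wrapper level rather than inside libpq itself; PQsocket is the standard libpq call): set SO_RCVTIMEO on the connection socket, so recv() itself enforces the timeout and the extra poll() round-trip can be skipped.

uses
  SysUtils, BaseUnix, Sockets;

// standard libpq entry point returning the raw socket fd of a PGconn*
// (adjust the library name to how libpq is linked/loaded in your setup)
function PQsocket(conn: pointer): longint; cdecl; external 'pq';

procedure SetPgRecvTimeout(pgConn: pointer; timeoutMS: integer);
var
  sock: longint;
  tv: TTimeVal;
begin
  sock := PQsocket(pgConn);
  if sock < 0 then
    raise Exception.Create('invalid PGconn socket');
  tv.tv_sec := timeoutMS div 1000;
  tv.tv_usec := (timeoutMS mod 1000) * 1000;
  // with SO_RCVTIMEO the blocking recv() fails by itself after the timeout,
  // so a poll() call before each PQgetResult is no longer needed
  if fpsetsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, @tv, SizeOf(tv)) <> 0 then
    raise Exception.CreateFmt('setsockopt(SO_RCVTIMEO) failed: errno=%d', [fpgeterrno]);
end;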
Last edited by mpv (2023-03-23 12:55:05)
Offline
It is weird that the DB readings are a little bit lower.
Of course, some pure mORMot client code could help... perhaps including our own socket polling...
I am also thinking about adding an event to send back the answer at HTTP server level. That is, an event method called when needed.
So we could have another thread pool just for the DB requests... or even merge the DB sockets with the main HTTP server threads and epoll...
Edit:
I looked at the just-js source code.
It is very expressive, and even the libs are very cleverly designed.
But is it usable in any realistic work? For instance, the pg.js driver seems to only handle text and integer values.
Anyway it gave me some clues that implementing a native PG client may not be too difficult - at least the protocol is very well documented at https://www.postgresql.org/docs/current/protocol.html and there are several implementations around.
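Indeed, the v3 wire framing is simple: after startup, every message is one type byte, a four-byte big-endian length (which includes itself), then the payload. A minimal sketch (plain FPC, not mORMot code) building a 'Q' simple-query message:

// build a PostgreSQL v3 'Q' (simple Query) message: type byte + length + SQL + #0
function PgSimpleQueryMsg(const sql: RawByteString): RawByteString;
var
  len: cardinal;
begin
  len := 4 + cardinal(Length(sql)) + 1;        // length field + SQL text + trailing #0
  SetLength(Result, 1 + len);
  Result[1] := 'Q';
  PCardinal(@Result[2])^ := SwapEndian(len);   // big-endian on the wire (swap on little-endian CPUs)
  if sql <> '' then
    Move(pointer(sql)^, Result[6], Length(sql));
  Result[1 + len] := #0;
end;

Responses are framed the same way ('T' row description, 'D' data row, 'C' command complete, 'Z' ready for query), so a reader only needs the type byte and the length to split the stream.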
Offline
Yes, I also saw the just-js code - it's good. But it is just a proof of concept, as the author notes, and the repository has not been maintained for a long time. In the last round just-js (like many others who implement the PG protocol by hand) failed because the TFB team changed the PG auth algorithm from MD5 to something else.
So my plan is to use libpq as much as possible, and to implement only a subset of methods, and only for the raw* tests.
About having a separate pool of DB connections: IMHO this will complicate everything, and I am not sure it gives better results. .NET, for example, has a separate DB thread pool, but their results are not better compared to our current implementation.
I am almost sure that removing the unneeded `poll` call in libpq will give us a very valuable boost.
P.S.
The PG auth problem is described here - https://github.com/TechEmpower/Framewor … ssues/8061
Last edited by mpv (2023-03-24 08:25:44)
Offline
Current TFB results
Weights 1.000 1.737 21.745 4.077 68.363 0.163
# JSON 1-query 20-q Fortunes Updates Plaintext Scores Date - Notes
38 731,119 308,233 19,074 288,432 3,431 2,423,283 3,486 2022-10-26 - 64 thread limitation
43 320,078 354,421 19,460 322,786 2,757 2,333,124 3,243 2022-11-13 - 112 thread (28CPU*4)
44 317,009 359,874 19,303 324,360 1,443 2,180,582 3,138 2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51 563,506 235,378 19,145 246,719 1,440 2,219,248 2,854 2022-12-01 - 112 thread (28CPU*4) CPU affinity
51 394,333 285,352 18,688 205,305 1,345 2,216,469 2,586 2022-12-22 - 112 threads CPU affinity + pthread_mutex
34 859,539 376,786 18,542 349,999 1,434 2,611,307 3,867 2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28 948,354 373,531 18,496 366,488 11,256 2,759,065 4,712 2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
16 957,252 392,683 49,339 393,643 22,446 2,709,301 6,293 2023-02-14 - 168 threads, cmem, improved PG pipelining
15 963,953 394,036 33,366 393,209 18,353 6,973,762 6,368 2023-02-21 - 168 threads, improved HTTP pipelining, PG pipelining uses Sync() as required, -O4 optimization
17 915,202 376,813 30,659 350,490 17,051 6,824,917 5,943 2023-03-03 - 168 threads, minor improvements, Ubuntu 22.02
17 1,011,928 370,424 30,674 357,605 13,994 6,958,656 5,871 2023-03-10 - 224 threads (8 thread * 28 instances) eventfd, ThreadSmooting, update use when..then
11 1,039,306 362,739 29,363 354,564 15,748 6,959,479 5,964 2023-03-16 - 224 threads (8*28 eft, ts), update with unnest, binary binding
16 1,046,044 360,576 30,919 352,592 16,509 6,982,578 6,048 2023-03-30 - 224 threads (8*28 eft, ts), modified libpq, header `Server: M`
- tiny (<1%) improvement for plaintext and json (shortened Server header value)
- +1.5K (~5%) improvement for rawqueries (and rawupdates as a side effect), thanks to the modified libpq
I tried to use the UPDATE table SET .. FROM (VALUES (), ()) pattern for rawupdates in MR 8128. On my environment it works better than the CASE and UNNEST patterns.
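For reference, this is the general shape of that pattern - a simplified sketch with inlined values (names are illustrative; the real code binds parameters instead of formatting them into the SQL):

uses
  SysUtils;

// build an UPDATE ... FROM (VALUES ...) statement for a batch of (id, randomNumber) pairs
function BuildUpdateFromValues(const ids, randoms: array of integer): string;
var
  i: integer;
begin
  Result := 'UPDATE world SET randomnumber = v.r FROM (VALUES ';
  for i := 0 to High(ids) do
  begin
    if i > 0 then
      Result := Result + ',';
    Result := Result + Format('(%d,%d)', [ids[i], randoms[i]]);
  end;
  Result := Result + ') AS v(id, r) WHERE world.id = v.id';
end;

When the pairs are bound as parameters instead, explicit ::int casts in the VALUES list may be needed so PostgreSQL can infer the column types.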
I also periodically run some tests with a directly modified libpq to improve DB performance, so far without success.
Last edited by mpv (2023-04-06 16:16:18)
Offline
I modified my previous post after the round finished - we are #16 (all frameworks returned to the rating).
@ab - I found that here we add a `Connection: Keep-Alive` header for HTTP/1.1. This is not necessary - HTTP/1.1 is keep-alive by default.
So, I propose to replace
result^.AppendShort('Connection: Keep-Alive'#13#10#13#10);
by
result^.AppendCRLF;
Or, if you prefer, add an option for this.
I checked - this replacement works correctly and improves plaintext performance (maybe we will even get a beautiful 7M req/sec on the TFB hardware).
Last edited by mpv (2023-04-06 17:49:26)
Offline
Should be set with https://github.com/synopse/mORMot2/commit/b99133a466168
I also assume it won't break anything.
Offline
Thanks! I updated the TFB MR. I will sync all changes back into mORMot after the new update algo is verified.
Starting from commit [2ae346fe11b91fbe6fa1945cf535abed3de99d37] (Mar 14, 2023) I observe memory problems.
It occurs randomly (sometimes after /db, sometimes after /json), but always after the wrk session is finished (on socket closing?).
I can't reproduce it in normal execution, only during tfb --benchmark.
It was also reproduced once on 2023-03-30 in the TFB environment - this is why this round does not contain cached-queries results.
glibc MM messages are:
- corrupted size vs. prev_size while consolidating
- double free or corruption (!prev)
Offline
I don't see why https://github.com/synopse/mORMot2/comm … ed3de99d37 would generate memory problems.
Around this date, I don't see many potential memory issues - perhaps https://github.com/synopse/mORMot2/comm … 9680d09016 was faulty, but it has been fixed afterwards.
Could you try to come closer to a faulty commit?
Offline
I am not able to reproduce the issue with the latest version of mORMot 2, even with a new 20 cores CPU I now have access to.
The only potential memory problem I could see in the /json context, with high multi-threading, is in TRawByteStringStream.GetAsText, but it seems fine with the current pattern:
...
begin
  Text := fDataString;
  fDataString := ''; // release it ASAP to avoid multi-threading reuse bug
  EnsureRawUtf8(Text);
end
It is the version from https://github.com/synopse/mORMot2/commit/6dbc8b811
Offline
The memory problem is reproduced when I run `./tfb --test mormot --query-levels 20 -m benchmark` - not on every run, randomly. And in random places...
I hadn't seen it before 2023-03-08 [46f5360a66]; it first appeared when I checked out commit [2ae346fe11b] (2023-03-14), so it was introduced somewhere between March 08 and 14.
I'll try to get closer to the faulty commit using a bisect technique, but this is a long process...
Offline
I found that we do not initialize the global flags variable in raw.pas - maybe unexpected flags are added and this is the reason for our memory problems... Will fix it in the next MR (to both TFB and mORMot).
I also verified my new idea - we create 28 servers with 8 threads each, and I bind all threads of each server to the same CPU; on my hardware this gives a 1,002K -> 1,200K boost for /json. Please give me access to TAsyncConnections.fThreads - MR #171 - it will allow me to set the affinity mask from the TFB test program. A sketch of the idea is below.
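Roughly like this (a hypothetical sketch: the SetThreadCpuAffinity helper and the way the worker threads are exposed are assumptions here, not a confirmed mORMot API):

uses
  Classes;

// bind all worker threads of one server instance to a single accessible core
procedure PinServerThreads(serverIndex: integer;
  const workerThreads: array of TThread; const accessibleCores: array of integer);
var
  t: PtrInt;
  core: integer;
begin
  // server #serverIndex gets one core from the set left visible e.g. by taskset
  core := accessibleCores[serverIndex mod Length(accessibleCores)];
  for t := 0 to High(workerThreads) do
    SetThreadCpuAffinity(workerThreads[t], core); // assumed affinity helper
end;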
Last edited by mpv (2023-04-10 18:06:43)
Offline
I have merged your MR in mORMot 2.
Setting CPU affinity makes perfect sense in our context.
About flags, global variables are always initialized with zeros at startup - this is a requirement on all systems.
So "flags" is always [].
Edit:
If you want, you can give me more information about where the memory errors occur.
It could help reduce the scope of the investigation.
Offline
Added the CPU pinning feature to the TFB example - see mORMot2 PR #172. I will do the same PR for TFB after getting the results of the next run (it should start on 2023-04-13, with the new update algo and the removed keep-alive header).
In this PR I also add accessible-CPU analysis - for testing purposes, when we limit CPUs using `taskset`.
About the memory error - unfortunately this is all I currently have. If I enable logging it is not reproduced; currently it is reproduced ONLY during `./tfb --test mormot --query-levels 20 -m benchmark`, but not on every run.
Last edited by mpv (2023-04-11 21:48:50)
Offline
Pinning a given server instance to a core is a bit weird to me.
I am afraid it would not scale as expected, because we will lose most of the concurrent work of the server process.
It may help with /json on your HW, but does it also help with the other endpoints?
I have indeed worse numbers with pinning (run on a 20-core machine with a single CPU):
/json /plaintext /cached-queries?count=20
pinning 975549 1022849 779124
default 1234377 1306604 983743
IMHO we should rather pin on HW cores, not on SW/logical cores.
That is, calling SetThreadSocketAffinity() over each HW CPU socket.
Offline
The TFB hardware is a single-socket CPU...
I run tests on a 48-core server (2 sockets * 24 cores each) using
taskset -c 0-15 ./raw
num thread=8, total CPU=48, accessible CPU=16, num servers=16, pinned=TRUE, total workers=128, db=PostgreSQL
Postgres is limited to cores 15-31 by adding a systemd drop-in /etc/systemd/system.control/postgresql.service.d/50-AllowedCPUs.conf with the content
[Service]
AllowedCPUs=15-31
and wrk is limited to the last 16 cores
taskset -c 31-47 ./wrk
In this case the results are
json 1,207,744
rawdb 412,057
rawfortunes 352,382
rawqueries?queries=20 48,465
cached-queries?count=100 483,290
db 376,684
queries?queries=20 32,878
updates?queries=20 22,016
fortunes 300,411
plaintext 3,847,097
while the same without pinning are
json 1,076,755
rawdb 409,145
rawfortunes 359,764
rawqueries?queries=20 47,887
cached-queries?count=100 456,215
db 395,335
queries?queries=20 33,542
updates?queries=20 22,148
fortunes 306,237
plaintext 3,838,749
There is a small degradation in the DB-related tests, but the composite score is better. I plan to check pinning on the TFB hardware and decide what to do depending on the results. We can, for example, create a separate docker file with pinning for the non-DB endpoints and without pinning for the DB-related ones (as @ttomas proposes).
Last edited by mpv (2023-04-13 19:13:53)
Offline
I think we could enhance the /json performance without changing the thread affinity.
There is no reason /json is 4 times slower than /plaintext, because it is pure code with no syscall - just a few memory allocations with minimal JSON processing.
IIRC there are only two or three memory allocations during the process (one for the TJsonWriter, one for TRawByteStringStream when heavily threaded, one for the result RawUtf8), then O(1) linear JSON serialization work.
Perhaps valgrind could help find the bottlenecks.
Offline
Actually /json is not 4x slower, because /plaintext uses pipelining with 16 HTTP requests per packet, so there are 7,000,000/16 packets, and performance is limited by the 10G network.
I have analysed /json with valgrind many times and currently do not see any possible improvements, except minimizing CPU migrations and context switches using CPU pinning.
Your results look strange to me... Did you try to use the first 10 CPUs for the app and the second 10 for wrk? And please check that you use cmem.
Offline
Without pipelining (with cmem) the results are (note: count=100 for cached queries - as in the TFB test):
/json /plaintext /cached-queries?count=100
pinning 1,281,204 1,301,311 493,913
default 1,088,939 1,168,009 471,235
I put the program I use to create load for smoke tests in this gist. CORES2USE and CORES2USE_COUNT should be edited to match the CPUs used by wrk.
Offline
The .sh script fails with
./tfb-smoke.sh: line 20: unexpected EOF while looking for matching ``'
I am no bash expert so I can't understand what is wrong here...
Edit:
On my 20-core CPU:
taskset -c 10-19 wrk -d 5 -c 128 -t 10 http://localhost:8080/json
./raw -s 10 -t 8
Requests/sec: 1269521.04
./raw -s 10 -t 8 -p
Requests/sec: 1537866.42
So with proper taskset I got better results with pinning on my HW too.
But weirdly enough, /plaintext numbers are lower than /json when pinning is used.
So we still have room for improvement in the HTTP server.
Offline
@mpv, nice gist. Just a comment about concurrency: maybe add it as a parameter, so that all DB tests use 512 (it will have an impact on the number of active connections to postgres) and plaintext uses 1k or 4k - the best concurrency for each test.
Last edited by ttomas (2023-04-14 12:19:33)
Offline
@ttomas - thanks for the idea - I added a CONN param to the gist - the connection count for wrk; for plaintext 1024 is used (all frameworks show their best results with 1024).
@ab - I added a shebang to the gist (first line) - maybe your default shell is not bash. Also ensure you have the `bc` utility (apt install bc).
Nice to hear that our measurements with pinning match now... I do not understand why in your case /json is better than /plaintext - in my case /plaintext is always better.
I will make the PR to TFB on Sunday (when the current run results for mormot appear) - we will see what pinning gives us on real hardware. BTW pinning is a common practice for async servers - even nginx has a worker affinity option in its config. In the TFB tests pinning is used at least by libreactor and H2O.
Last edited by mpv (2023-04-14 19:07:16)
Offline
@ab - HTTP pipelining is currently broken. The regression was introduced by the "added Basic and Digest auth" feature.
The last good commit is [1434d3e1] "prepare HTTP server authentications" - 2023-04-13 1:48. After that there is a series of commits that do not compile due to the new aAuthorize param for THttpServerRequestAbstract.Prepare, and the first commit that compiles responds only to the first pipelined request.
It can be verified using the console command below - it should return two "Hello, World!" responses:
(echo -en "GET /plaintext HTTP/1.1\nHost: foo.com\nConnection: keep-alive\n\nGET /plaintext HTTP/1.1\nHost: foo.com\n\n"; sleep 10) | telnet localhost 8080
Last edited by mpv (2023-04-16 18:11:01)
Offline
You are right.
Should be fixed by https://github.com/synopse/mORMot2/commit/1eb4ac4e
State machines are great, but it is sometimes difficult to track their logic.
Offline
HTTP pipelining is fixed - thanks! I made TFB PR 8153 with CPU pinning - let's wait for the results.
The memory problems still exist. Today I caught it twice (out of 5-6 runs) - once after /db and once after /rawqueries while running
./tfb --test mormot mormot-postgres-raw --query-levels 20 -m benchmark
I still can't reproduce it in a more "debuggable" way.
I also synced my latest TFB changes with ex/techempower-bench/raw.pas - see PR 175 for mORMot2.
Offline
About the memory problems.
Perhaps it is due to the fact that we run several THttpAsyncServer instances.
Do you confirm it occurs at server shutdown?
And that it occurred also after a /json set of calls?
Anyway, I tried to rewrite some memory allocation code used during /json
https://github.com/synopse/mORMot2/commit/412c9deb
Edit:
Look at https://github.com/synopse/mORMot2/blob … w.pas#L665
using the new TExecuteCommandLine parser - I tried to use the best ideas from https://pkg.go.dev/flag
Its usage seems easier (and more powerful) than FPC TCustomApplication command line parsing.
Offline
Today I ran the TFB tests 5 times (each run takes ~30 minutes) and the memory error did not occur (with the old sources, without the GetAsText change), so it's really a heisenbug.
It occurs NOT on server shutdown, but just after the wrk command ends - I think when the sockets are closing... I will continue to investigate...
About the command line parameters - nice code. Please look at PR 176 - I made the help message formatting more Unix-like.
Offline
Current TFB status
Weights 1.000 1.737 21.745 4.077 68.363 0.163
# JSON 1-query 20-q Fortunes Updates Plaintext Scores Date - Notes
38 731,119 308,233 19,074 288,432 3,431 2,423,283 3,486 2022-10-26 - 64 thread limitation
43 320,078 354,421 19,460 322,786 2,757 2,333,124 3,243 2022-11-13 - 112 thread (28CPU*4)
44 317,009 359,874 19,303 324,360 1,443 2,180,582 3,138 2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51 563,506 235,378 19,145 246,719 1,440 2,219,248 2,854 2022-12-01 - 112 thread (28CPU*4) CPU affinity
51 394,333 285,352 18,688 205,305 1,345 2,216,469 2,586 2022-12-22 - 112 threads CPU affinity + pthread_mutex
34 859,539 376,786 18,542 349,999 1,434 2,611,307 3,867 2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28 948,354 373,531 18,496 366,488 11,256 2,759,065 4,712 2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
16 957,252 392,683 49,339 393,643 22,446 2,709,301 6,293 2023-02-14 - 168 threads, cmem, improved PG pipelining
15 963,953 394,036 33,366 393,209 18,353 6,973,762 6,368 2023-02-21 - 168 threads, improved HTTP pipelining, PG pipelining uses Sync() as required, -O4 optimization
17 915,202 376,813 30,659 350,490 17,051 6,824,917 5,943 2023-03-03 - 168 threads, minor improvements, Ubuntu 22.02
17 1,011,928 370,424 30,674 357,605 13,994 6,958,656 5,871 2023-03-10 - 224 threads (8 thread * 28 instances) eventfd, ThreadSmooting, update use when..then
11 1,039,306 362,739 29,363 354,564 15,748 6,959,479 5,964 2023-03-16 - 224 threads (8*28 eft, ts), update with unnest, binary binding
17 1,045,953 362,716 30,896 353,131 16,568 6,994,573 6,060 2023-04-13 - 224 threads (8*28 eft, ts), update using VALUES (),().., removed Connection: Keep-Alive resp header
We are still #17, but the composite score improves with every new run. Also we moved up from #7 to #3 in the cached-queries test.
Now we try CPU pinning - I expect a good improvement in /json and /cached-queries...
Offline
In the current round we moved above actix and .NET Core.
The final results will be in 3 days; I expect we will be #15.
@ab - is it correct to compute POrmCacheTable once in the TRawAsyncServer constructor (instead of computing it every time here)? This should give us the few extra requests we need to be #1 in cached queries...
Last edited by mpv (2023-04-23 18:53:49)
Offline
Nice!
About POrmCacheTable of course we could put it as a field. But I doubt it would make any performance change: it is a O(1) lookup process.
Perhaps more performance could be achieved for the benchmark composite scores if we include the /rawcached endpoint too, in addition to the /cached_queries endpoint.
https://github.com/synopse/mORMot2/blob … w.pas#L495
It has no per-ID lookup as POrmCacheTable does, so it is perfectly O(1), whereas POrmCacheTable.Get() uses a binary search, so a few iterations as O(log(n)).
Now the bottleneck seems to be the endpoints which make a single DB request, i.e. 1-query and Fortunes.
Perhaps we could feed an internal request queue to execute those requests in a pipelined DB request.
I would need to modify the HTTP server to be able to return its answer later, from a callback.
Offline
About POrmCacheTable of course we could put it as a field. But I doubt it would make any performance change: it is a O(1) lookup process.
It is called 400k times per second. Caching it can give us the +0.1% performance boost we need to be #1... At least in my environment, this is what happens.
Unfortunately, rawcached breaks the rules. There are already discussions in the TFB issues that such implementations should be banned - I don't want to take risks.
About pipelining DB requests for /db and /fortunes - this is an interesting idea. Actually the top-rated frameworks do this.
In this case we need:
- a callback at the HTTP server level, and
- a callback at the DB level.
Each server can use a single per-server DB connection and a new DB-layer method stmt.ExecutePipelining(maxCnt, timeout, callback);
ExecutePipelining can buffer up to maxCnt statements (or until the timeout), run them in a single pipeline and notify the callback for each caller - see the sketch below.
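A rough interface-level sketch of that proposal (all names here are hypothetical - this is the idea, not existing mORMot code):

type
  // uses mormot.db.sql.postgres for TSqlDBPostgresStatement
  // called once the pipelined result for a queued statement has been read
  TOnPipelinedDone = procedure(Stmt: TSqlDBPostgresStatement; Error: Exception) of object;

  TPendingStatement = record
    Stmt: TSqlDBPostgresStatement;
    Done: TOnPipelinedDone;
  end;

  // buffers prepared statements, then sends them as one PostgreSQL pipeline
  TStatementPipeline = class
  private
    fPending: array of TPendingStatement;
    fMaxCount: integer;   // flush when this many statements are queued
    fTimeoutMS: integer;  // ... or when the oldest one has waited this long
  public
    // queue a statement; triggers an immediate Flush if fMaxCount is reached
    procedure ExecutePipelining(Stmt: TSqlDBPostgresStatement; const Done: TOnPipelinedDone);
    // send all queued statements in one pipeline (PQpipelineSync), then read the
    // results in submission order and fire each Done callback
    procedure Flush;
  end;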
And finally we get callback hell (especially when handling exceptions) - I've seen this in old .NET and JavaScript before they implemented async/await at the runtime level.
But for benchmark purposes we can try.
Offline
While looking at cached-queries performance I found a *VERY* unexpected thing:
TTextWriter.Add(Value: PtrInt) uses a fast lookup table for values < 999.
I decided to increase it to 9999 (TFB IDs are 0..10000) and... performance got worse.
If I comment out the lookup code, performance increases.
For cached-queries?count=100:
- no lookup: 511k RPS
- 999 lookup size: 503k RPS
- 9999 lookup size: 466k RPS
@ab - do you have any idea why? The relative numbers do not depend on CPU pinning, server count, thread count...
Last edited by mpv (2023-04-25 17:51:53)
Offline
I guess it is because of cache pollution.
The StrInt32 asm code uses a two-digit lookup table of 200 bytes, and a multiplication by reciprocal, which is very fast.
Whereas a big table is likely to quickly fill and pollute the L1 cache, which is a bad thing for performance.
The bigger SmallUInt32Utf8[] is, the more the L1 cache is filled, and the slower it becomes.
My guess is that if we have a fast multiplication by reciprocal for the "div 100", then we could bypass the cache for values < 999.
Only for RawUtf8 generation does it make sense to have a pre-computed array of ref-counted RawUtf8 instances.
We may try https://github.com/synopse/mORMot2/commit/c4256371
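To illustrate the technique in plain Pascal (not the actual StrInt32 asm): two digits per iteration from a 200-byte table, written backwards into a small temp buffer; the div/mod by the constant 100 is typically compiled into a multiply-by-reciprocal, so no real division happens.

const
  // 100 pairs '00'..'99' = 200 bytes, small enough to stay warm in the L1 cache
  TwoDigits: string[200] =
    '00010203040506070809101112131415161718192021222324' +
    '25262728293031323334353637383940414243444546474849' +
    '50515253545556575859606162636465666768697071727374' +
    '75767778798081828384858687888990919293949596979899';

function UInt32ToText(value: cardinal): string;
var
  tmp: array[0..15] of AnsiChar;
  p, pEnd: PAnsiChar;
  d: cardinal;
begin
  pEnd := PAnsiChar(@tmp) + SizeOf(tmp);
  p := pEnd;                         // digits are written backwards from the end
  while value >= 100 do
  begin
    d := value mod 100;              // compiled as multiply-by-reciprocal, not a div
    value := value div 100;
    Dec(p, 2);
    Move(TwoDigits[d * 2 + 1], p^, 2);   // copy two digits at once
  end;
  if value >= 10 then
  begin
    Dec(p, 2);
    Move(TwoDigits[value * 2 + 1], p^, 2);
  end
  else
  begin
    Dec(p);
    p^ := AnsiChar(Ord('0') + value);
  end;
  SetString(Result, p, pEnd - p);    // single copy once the length is known
end;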
Offline
No, because StrInt*() writes the digits backwards into the temp buffer, so a copy is needed once the length is known.
Pre-computing the length is possible in constant time (with no branch) but it is actually slower than the current pattern with movefast.
About callbacks, I find this worth reading:
https://devblogs.microsoft.com/dotnet/h … lly-works/
Offline
I know this cool article about async in .NET. In fact, the same steps were taken in JS. In the browser client for UnityBase I started with callbacks 13 years ago, then moved to an iterator-based Promises polyfill, then to Promises, and finally to async/await.
In Pascal we need at least iterator support at the compiler level; without it the only option is callbacks, but this is hell... An example of a callback-based implementation is h2o.
I like our current implementation - at the app level, everything is quite simple. Complicating it to the level of manual asynchronous programming is likely to alienate potential users.
I'm still confident that we can find a way to improve the current implementation (and I'm working on it periodically) - we only need +200 composite points to get into the top 10 TFB...
TFB PR 8182 is ready - it should improve /cached-queries and maybe /queries also.
Offline
I have written some new methods in raw.pas:
https://github.com/synopse/mORMot2/commit/510ef7b3
It should add new asynchronous pipeline-based /asyncdb and /asyncfortunes endpoints.
The code is not difficult to follow. It is not a true asynchronous system, just the basic callbacks we want for our purpose.
https://github.com/synopse/mORMot2/commit/4dea86cd
Note that there are some corresponding changes in THttpAsyncServer too.
https://github.com/synopse/mORMot2/commit/bd33c64e
But I have NOT tested it at all yet.
I will try to do the tests and debugging tomorrow or this weekend.
What still needs to be done is to run the process asynchronously only under some conditions (e.g. the number of connected clients, or the recent requests per second).
I am almost sure /asyncdb will run awfully with only a few connections.
Offline
There is a missing connect:
--- a/src/db/mormot.db.sql.postgres.pas
+++ b/src/db/mormot.db.sql.postgres.pas
@@ -1379,6 +1379,7 @@ begin
fProperties := Owner;
fStatements := TSynObjectListLightLocked.Create;
fConnection := fProperties.NewConnection as TSqlDBPostgresConnection;
+ fConnection.Connect;
fConnection.EnterPipelineMode;
end;
And it should be tested, because currently the result is always `{"id":0,"randomNumber":0}` and only 19 RPS per server.
Looking forward to it! We still have at least 7 days until the next merge request...
Last edited by mpv (2023-04-27 18:38:02)
Offline
With some fixes to TSqlDBPostgresAsync
https://github.com/synopse/mORMot2/commit/c362604b
What I see:
- it seems more stable
- for a low number of connections, the numbers are awful - but fine with 512 concurrent clients, which is the point of TFB
- now, on a local docker PostgreSQL instance, I reach the same performance level as /rawdb and /rawfortunes - only slightly slower
- I expect /asyncdb and /asyncfortunes to scale better on a remote PostgreSQL instance, with a slower network
The next step may be to use a dedicated thread, polling the PostgreSQL socket for reading, then executing the callbacks from this thread.
It could help scale better with low numbers of connections, and also simplify the whole process and - I hope - enhance the responsiveness even with 512 clients.
Offline
I have implemented the dedicated thread for TSqlDBPostgresAsync, and got rid of the previous complex (and inefficient) scaling algorithm.
Now I have pretty good numbers:
ab@dev-ab:~/Downloads$ wrk -c 512 -d 5 -t 8 http://localhost:8080/rawfortunes
Running 5s test @ http://localhost:8080/rawfortunes
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.82ms 1.12ms 21.49ms 74.29%
Req/Sec 27.91k 5.03k 47.72k 70.28%
1104299 requests in 5.07s, 1.39GB read
Requests/sec: 218023.57
Transfer/sec: 280.90MB
ab@dev-ab:~/Downloads$ wrk -c 512 -d 5 -t 8 http://localhost:8080/asyncfortunes
Running 5s test @ http://localhost:8080/asyncfortunes
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.64ms 1.07ms 25.53ms 91.41%
Req/Sec 38.84k 2.43k 61.20k 94.78%
1553960 requests in 5.06s, 1.96GB read
Requests/sec: 307239.69
Transfer/sec: 395.85MB
It means that I got 50% more requests!!!!!
Please try https://github.com/synopse/mORMot2/commit/b6dc0c81
There is still some issue with the statement initialization - perhaps some more % to come.
Offline
Yes, the statement was not cached on the server side...
https://github.com/synopse/mORMot2/commit/5e502ff4
(it was obvious using the strace tool: I could see the whole SQL "select" in the output frame)
Now /asyncfortunes is almost twice as fast as /rawfortunes!
I am confident we are in the top 10 now.
And the raw.pas code is still very readable, from my POV.
The next step is to try to apply asynchronous requests to two new /asyncqueries and /asyncupdates endpoints.
For this, we would need a small class implementing a state machine for the various SELECT (+ UPDATE) steps to follow. But it should be feasible.
@mpv
I also guess:
1) we may need to review the whole pinning / thread count algorithm.
There is no DB latency any more, so perhaps we need another pass at finding the best possible parameters for the /async* endpoints - which should shine in the TFB ranking.
2) perhaps having a modified libpq is pointless with the new async thread.
3) we should both be very proud now.
Offline
BTW - nice pictures
About /asyncupdates - from my POV it is not correct to pipeline updates - in a realistic /updates scenario we should do all the selects together with the update in one transaction (even if TFB does not require this). But in our "async" model we can't do transactions at all (actually we can, but from a consistency POV it is not correct) - only atomic select operations.
This is why I am considering doing the /async* endpoints only for the db queries and fortunes - I'll create a separate test case in benchmark_config.json with "approach": "Stripped" for such endpoints.
Am I right, or am I missing something?
Last edited by mpv (2023-04-28 15:19:24)
Offline
I have finally implemented /asyncqueries and /asyncupdates too.
- so we have full ORM + RAW + ASYNC endpoint coverage for comparison
- it was a nice showcase of the new callback mechanism: we can in fact run UPDATE statements in pipelined mode
- I agree that the TFB scenario of SELECT + UPDATE without a transaction is not realistic
- see https://github.com/synopse/mORMot2/commit/252b256d
From my measurements, /asyncqueries is also faster than /rawqueries:
- it is twice as fast for ?queries=1 - as expected
- it is only slightly faster for ?queries=20
- and it seems to be much more reliable - the first run of wrk over /rawqueries always hits a timeout due to the high number of connections, whereas /asyncqueries is much more stable thanks to its single connection
On the other hand, /asyncupdates is not faster than /rawupdates.
I suspect this is because
1) we start the UPDATE statement from a callback inside the async thread, so it scales less
2) we can only use the array binding + unnest SQL statement, because the async statements have to be prepared ahead of time (they cannot be prepared in the middle of some pipelined work)
Anyway, the updates test was not the one where we had much benefit to expect, compared to the best frameworks in the top 10.
I also tried a set of naive 'update World set randomNumber=? where id=?' statements, but the unnest SQL is way faster (the pattern is sketched below).
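For reference, the array-binding + unnest shape looks like this (reconstructed from memory, not copied from raw.pas - the exact SQL text and the mormot.db.sql binding overloads may differ slightly):

// uses mormot.db.sql
const
  // one prepared statement updates any number of rows from two bound arrays;
  // the '?' placeholders are mapped to $1/$2 by the PostgreSQL layer
  WORLD_UPDATE_UNNEST =
    'UPDATE world SET randomnumber = v.r' +
    ' FROM (SELECT unnest(?::bigint[]) AS id, unnest(?::bigint[]) AS r) AS v' +
    ' WHERE world.id = v.id';

procedure BatchUpdateWorlds(props: TSqlDBConnectionProperties;
  const ids, randoms: array of Int64);
var
  stmt: ISqlDBStatement;
begin
  stmt := props.NewThreadSafeStatementPrepared(WORLD_UPDATE_UNNEST, {expectresults=}false);
  stmt.BindArray(1, ids);      // world ids
  stmt.BindArray(2, randoms);  // new random numbers
  stmt.ExecutePrepared;
end;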
My first remark is that we had better restart the raw server between the PLAIN+JSON / ORM / RAW / ASYNC modes of queries.
The thread and DB connection usage is not the same at all.
BTW I was occasionally able to reproduce some unexpected GPF during my tests, after a few dozen runs.
They seem to appear within the THttpAsyncServer, in some obscure case. I hope I will be able to find something.
Offline
What command line parameters do you use for testing (threads/servers/pinning)? On my server HW the async* results (with servers=CPUCount, threads=8, pinning) are a little worse compared to raw*:
num servers=16, threads per server=8, total threads=128, total CPU=48, accessible CPU=16, pinned=TRUE, db=PostgreSQL
taskset -c 31-47 ./wrk -H 'Host: 10.0.0.1' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 10 -c 512 --timeout 8 -t 16 "http://localhost:8080/asyncfortunes"
Requests/sec: 353990.97
taskset -c 31-47 ./wrk -H 'Host: 10.0.0.1' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 10 -c 512 --timeout 8 -t 16 "http://localhost:8080/rawfortunes"
Requests/sec: 393226.26
Last edited by mpv (2023-04-28 18:11:59)
Offline