About "with more threads per core, e.g. 16 instead of 8." - we can. But I almost sure this not help, because currently our server uses 100% CPU on rawdb. Let's wait for next round and after try with more threads.
About LDAP - I discover different MS implementations with different Windows Server versions. Also Azure AD (ADFS) has its own nuances
When you get tired of fighting it - here is how I use libldap (mormot1 compatible). I need only ldapbind (use it to verify user password), but I sure it's work in many scenarios. Here is URL's example and some troubleshooting.
Offline
Using LDAP for authentication as such works, but it is insecure.
As you wrote:
Security warning - the password for LDAP authentication is passed in plain text over the wire, so the server should accept only HTTPS connections to be secure.
Even on the server side, the connection with the LDAP server is in plain text.
I would never advise using it in production.
But why is calling libldap needed? A simple plain bind is very easy to code.
See also how we retrieve the LDAP addresses from the system information, and some DNS service discovery in https://github.com/synopse/mORMot2/blob … s.pas#L304
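For illustration, here is a minimal sketch of such a plain (simple) LDAPv3 bind over a raw FPC socket - not the mormot.net.ldap code, just the bare BER encoding. It assumes single-byte BER lengths (DN and password under 128 bytes), a one-byte messageID in the answer, and a dotted-IP host; and of course it sends the password in clear text, as warned above:

uses
  SysUtils, Sockets;

function BerTLV(tag: Byte; const value: RawByteString): RawByteString;
begin
  // definite single-byte length only: value must be shorter than 128 bytes
  Result := AnsiChar(tag) + AnsiChar(Length(value)) + value;
end;

function LdapSimpleBind(const ip: string; port: Word;
  const dn, password: RawByteString): Boolean;
var
  sock: Longint;
  addr: TInetSockAddr;
  req: RawByteString;
  buf: array[0..511] of Byte;
  n: Longint;
begin
  Result := false;
  sock := fpSocket(AF_INET, SOCK_STREAM, 0);
  if sock < 0 then
    exit;
  try
    FillChar(addr, SizeOf(addr), 0);
    addr.sin_family := AF_INET;
    addr.sin_port := htons(port);
    addr.sin_addr := StrToNetAddr(ip);      // expects a dotted IP here
    if fpConnect(sock, @addr, SizeOf(addr)) <> 0 then
      exit;
    // LDAPMessage ::= SEQUENCE { messageID, BindRequest }
    // BindRequest ::= [APPLICATION 0] { version=3, name, authentication simple [0] }
    req := BerTLV($30,
             BerTLV($02, #1) +               // messageID = 1
             BerTLV($60,
               BerTLV($02, #3) +             // LDAP protocol version 3
               BerTLV($04, dn) +             // bind DN
               BerTLV($80, password)));      // simple (plain) password
    if fpSend(sock, pointer(req), Length(req), 0) <> Length(req) then
      exit;
    n := fpRecv(sock, @buf, SizeOf(buf), 0);
    // expected answer: 30 len 02 01 01 61 len 0A 01 <resultCode> ...
    // -> success when resultCode = 0
    Result := (n >= 10) and (buf[0] = $30) and (buf[5] = $61) and (buf[9] = 0);
  finally
    CloseSocket(sock);
  end;
end;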
Offline
About libldap it is a long story - first we used Synapse, but there were TLS problems there, then libcurl, but it also has known LDAP issues. So we switched to libldap. Also, the ldapsearch utility (which is built on top of libldap) is well documented, and our customers can use it to diagnose their problems. libldap has worked for us for a long time.
Our latest TFB changes in PR 8057 generated some discussion (as expected)
Offline
Current round ends
Weights 1.000 1.737 21.745 4.077 68.363 0.163
# JSON 1-query 20-q Fortunes Updates Plaintext Scores Date - Notes
38 731,119 308,233 19,074 288,432 3,431 2,423,283 3,486 2022-10-26 - 64 thread limitation
43 320,078 354,421 19,460 322,786 2,757 2,333,124 3,243 2022-11-13 - 112 thread (28CPU*4)
44 317,009 359,874 19,303 324,360 1,443 2,180,582 3,138 2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51 563,506 235,378 19,145 246,719 1,440 2,219,248 2,854 2022-12-01 - 112 thread (28CPU*4) CPU affinity
51 394,333 285,352 18,688 205,305 1,345 2,216,469 2,586 2022-12-22 - 112 threads CPU affinity + pthread_mutex
34 859,539 376,786 18,542 349,999 1,434 2,611,307 3,867 2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28 948,354 373,531 18,496 366,488 11,256 2,759,065 4,712 2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
16 957,252 392,683 49,339 393,643 22,446 2,709,301 6,293 2023-02-14 - 168 threads, cmem, improved PG pipelining
15 963,953 394,036 33,366 393,209 18,353 6,973,762 6,368 2023-02-21 - 168 threads, improved HTTP pipelining, PG pipelining uses Sync() as required, -O4 optimization
17 915,202 376,813 30,659 350,490 17,051 6,824,917 5,943 2023-03-03 - 168 threads, minor improvements, Ubuntu 22.02
17 1,011,928 370,424 30,674 357,605 13,994 6,958,656 5,871 2023-03-10 - 224 threads (8 thread * 28 instances) eventfd, ThreadSmooting, update use when..then
11 1,039,306 362,739 29,363 354,564 15,748 6,959,479 5,964 2023-03-16 - 224 threads (8*28 eft, ts), update with unnest, binary binding
We are at #11, mostly because many top-rated frameworks failed in this round. The good news is that we are VERY close to .NET now.
It looks like the next round will be without our latest changes, which caused a lot of discussion.
I found a way to improve DB-related performance, but such a change requires rewriting a part of libpq in Pascal: currently, to get a result, libpq calls poll and then recv. The poll call can be avoided - it is used only to implement a timeout, and on Linux we can use SO_RCVTIMEO for this. Such a change should improve the DB round-trip by 10-30%.
So my idea is to use libpq for connection establishment, and then operate directly on the socket returned by PQsocket. I will do it little by little...
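A minimal sketch of that idea (the helper name is mine, and it assumes the change is done at the Pascal wrapper level rather than inside libpq itself; PQsocket is the standard libpq call): set SO_RCVTIMEO on the connection socket, so recv() itself enforces the timeout and the extra poll() round-trip can be skipped.

uses
  SysUtils, BaseUnix, Sockets;

// standard libpq entry point returning the raw socket fd of a PGconn*
// (adjust the library name to how libpq is linked/loaded in your setup)
function PQsocket(conn: pointer): longint; cdecl; external 'pq';

procedure SetPgRecvTimeout(pgConn: pointer; timeoutMS: integer);
var
  sock: longint;
  tv: TTimeVal;
begin
  sock := PQsocket(pgConn);
  if sock < 0 then
    raise Exception.Create('invalid PGconn socket');
  tv.tv_sec := timeoutMS div 1000;
  tv.tv_usec := (timeoutMS mod 1000) * 1000;
  // with SO_RCVTIMEO the blocking recv() fails by itself after the timeout,
  // so a poll() call before each PQgetResult is no longer needed
  if fpsetsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, @tv, SizeOf(tv)) <> 0 then
    raise Exception.CreateFmt('setsockopt(SO_RCVTIMEO) failed: errno=%d', [fpgeterrno]);
end;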
Last edited by mpv (2023-03-23 12:55:05)
Offline
It is weird that the DB readings are a little bit lower.
Of course, some pure mORMot client code could help... perhaps including our own socket polling...
I am also thinking about adding an event to send back the answer at HTTP server level. That is, an event method called when needed.
So we could have another thread pool just for the DB requests... or even merge the DB sockets with the main HTTP server threads and epoll...
Edit:
I looked at the just-js source code.
It is very expressive, and even the libs are very cleverly designed.
But is it usable in any realistic work? For instance, the pg.js driver seems to only handle text and integer values.
Anyway it gave me some clues that implementing a native PG client may not be too difficult - at least the protocol is very well documented at https://www.postgresql.org/docs/current/protocol.html and there are several implementations around.
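Indeed, the v3 wire framing is simple: after startup, every message is one type byte, a four-byte big-endian length (which includes itself), then the payload. A minimal sketch (plain FPC, not mORMot code) building a 'Q' simple-query message:

// build a PostgreSQL v3 'Q' (simple Query) message: type byte + length + SQL + #0
function PgSimpleQueryMsg(const sql: RawByteString): RawByteString;
var
  len: cardinal;
begin
  len := 4 + cardinal(Length(sql)) + 1;        // length field + SQL text + trailing #0
  SetLength(Result, 1 + len);
  Result[1] := 'Q';
  PCardinal(@Result[2])^ := SwapEndian(len);   // big-endian on the wire (swap on little-endian CPUs)
  if sql <> '' then
    Move(pointer(sql)^, Result[6], Length(sql));
  Result[1 + len] := #0;
end;

Responses are framed the same way ('T' row description, 'D' data row, 'C' command complete, 'Z' ready for query), so a reader only needs the type byte and the length to split the stream.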
Offline
Yes, I also saw the just-js code - it's good. But it is just a proof of concept, as the author notes, and the repository has not been maintained for a long time. In the last round just-js (like many others who implement the PG protocol by hand) failed because the TFB team changed the PG auth algorithm from MD5 to something else.
So my plan is to use libpq as much as possible, and to implement only a subset of methods, and only for the raw* tests.
About having a separate pool of DB connections: IMHO this will complicate everything, and I am not sure it gives better results. .NET, for example, has a separate DB thread pool, but their results are not better compared to our current implementation.
I am almost sure that removing the unneeded `poll` call in libpq will give us a very valuable boost.
P.S.
The PG auth problem is described here - https://github.com/TechEmpower/Framewor … ssues/8061
Last edited by mpv (2023-03-24 08:25:44)
Offline
Current TFB results
Weights 1.000 1.737 21.745 4.077 68.363 0.163
# JSON 1-query 20-q Fortunes Updates Plaintext Scores Date - Notes
38 731,119 308,233 19,074 288,432 3,431 2,423,283 3,486 2022-10-26 - 64 thread limitation
43 320,078 354,421 19,460 322,786 2,757 2,333,124 3,243 2022-11-13 - 112 thread (28CPU*4)
44 317,009 359,874 19,303 324,360 1,443 2,180,582 3,138 2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51 563,506 235,378 19,145 246,719 1,440 2,219,248 2,854 2022-12-01 - 112 thread (28CPU*4) CPU affinity
51 394,333 285,352 18,688 205,305 1,345 2,216,469 2,586 2022-12-22 - 112 threads CPU affinity + pthread_mutex
34 859,539 376,786 18,542 349,999 1,434 2,611,307 3,867 2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28 948,354 373,531 18,496 366,488 11,256 2,759,065 4,712 2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
16 957,252 392,683 49,339 393,643 22,446 2,709,301 6,293 2023-02-14 - 168 threads, cmem, improved PG pipelining
15 963,953 394,036 33,366 393,209 18,353 6,973,762 6,368 2023-02-21 - 168 threads, improved HTTP pipelining, PG pipelining uses Sync() as required, -O4 optimization
17 915,202 376,813 30,659 350,490 17,051 6,824,917 5,943 2023-03-03 - 168 threads, minor improvements, Ubuntu 22.02
17 1,011,928 370,424 30,674 357,605 13,994 6,958,656 5,871 2023-03-10 - 224 threads (8 thread * 28 instances) eventfd, ThreadSmooting, update use when..then
11 1,039,306 362,739 29,363 354,564 15,748 6,959,479 5,964 2023-03-16 - 224 threads (8*28 eft, ts), update with unnest, binary binding
16 1,046,044 360,576 30,919 352,592 16,509 6,982,578 6,048 2023-03-30 - 224 threads (8*28 eft, ts), modified libpq, header `Server: M`
- tiny (<1%) improvement for plaintext and json (shortened Server header value)
- +1.5K (~5%) improvement for rawqueries (and rawupdates as a side effect), thanks to the modified libpq
I tried to use the UPDATE table SET .. FROM (VALUES (), ()) pattern for rawupdates in MR 8128. On my environment it works better than the CASE and UNNEST patterns.
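For reference, this is the general shape of that pattern - a simplified sketch with inlined values (names are illustrative; the real code binds parameters instead of formatting them into the SQL):

uses
  SysUtils;

// build an UPDATE ... FROM (VALUES ...) statement for a batch of (id, randomNumber) pairs
function BuildUpdateFromValues(const ids, randoms: array of integer): string;
var
  i: integer;
begin
  Result := 'UPDATE world SET randomnumber = v.r FROM (VALUES ';
  for i := 0 to High(ids) do
  begin
    if i > 0 then
      Result := Result + ',';
    Result := Result + Format('(%d,%d)', [ids[i], randoms[i]]);
  end;
  Result := Result + ') AS v(id, r) WHERE world.id = v.id';
end;

When the pairs are bound as parameters instead, explicit ::int casts in the VALUES list may be needed so PostgreSQL can infer the column types.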
I also periodically run some tests with a directly modified libpq to improve DB performance, so far without success.
Last edited by mpv (2023-04-06 16:16:18)
Offline
I modified my previous post after the round finished - we are #16 (all frameworks returned to the rating).
@ab - I found that here we add a `Connection: Keep-Alive` header for HTTP/1.1. This is not necessary - HTTP/1.1 is keep-alive by default.
So, I propose to replace
result^.AppendShort('Connection: Keep-Alive'#13#10#13#10);
by
result^.AppendCRLF;
Or, if you prefer, add an option for this.
I checked - this replacement works correctly and improves plaintext performance (maybe we will even get a beautiful 7M req/sec on the TFB hardware).
Last edited by mpv (2023-04-06 17:49:26)
Offline
Should be set with https://github.com/synopse/mORMot2/commit/b99133a466168
I also assume it won't break anything.
Offline
Thanks! I updated the TFB MR. I will sync all changes back into mORMot after the new update algo is verified.
Starting from commit [2ae346fe11b91fbe6fa1945cf535abed3de99d37] (Mar 14, 2023) I observe memory problems.
It occurs randomly (sometimes after /db, sometimes after /json), but always after the wrk session is finished (on socket closing?).
I can't reproduce it in normal execution, only during tfb --benchmark.
It was also reproduced once on 2023-03-30 in the TFB environment - this is why this round does not contain cached-queries results.
glibc MM messages are:
- corrupted size vs. prev_size while consolidating
- double free or corruption (!prev)
Offline
I don't see why https://github.com/synopse/mORMot2/comm … ed3de99d37 would generate memory problems.
Around this date, I don't see many potential memory issues - perhaps https://github.com/synopse/mORMot2/comm … 9680d09016 was faulty, but it has been fixed afterwards.
Could you try to come closer to a faulty commit?
Offline
I am not able to reproduce the issue with the latest version of mORMot 2, even with a new 20 cores CPU I now have access to.
The only potential memory problem I could see in the /json context, with high multi-threading, is in TRawByteStringStream.GetAsText, but it seems fine with the current pattern:
...
begin
  Text := fDataString;
  fDataString := ''; // release it ASAP to avoid multi-threading reuse bug
  EnsureRawUtf8(Text);
end
It is the version from https://github.com/synopse/mORMot2/commit/6dbc8b811
Offline
The memory problem is reproduced when I run `./tfb --test mormot --query-levels 20 -m benchmark` - not on every run, randomly. And in random places...
I hadn't seen it before 2023-03-08 [46f5360a66]; it first appeared when I checked out commit [2ae346fe11b] (2023-03-14), so it was introduced somewhere between March 08 and 14.
I'll try to get closer to the faulty commit using a bisect technique, but this is a long process...
Offline
I found that we do not initialize the global flags variable in raw.pas - maybe unexpected flags are added and this is the reason for our memory problems... Will fix it in the next MR (to both TFB and mORMot).
I also verified my new idea - we create 28 servers with 8 threads each, and I bind all threads of each server to the same CPU; on my hardware this gives a 1,002K -> 1,200K boost for /json. Please give me access to TAsyncConnections.fThreads - MR #171 - it will allow me to set the affinity mask from the TFB test program. A sketch of the idea is below.
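Roughly like this (a hypothetical sketch: the SetThreadCpuAffinity helper and the way the worker threads are exposed are assumptions here, not a confirmed mORMot API):

uses
  Classes;

// bind all worker threads of one server instance to a single accessible core
procedure PinServerThreads(serverIndex: integer;
  const workerThreads: array of TThread; const accessibleCores: array of integer);
var
  t: PtrInt;
  core: integer;
begin
  // server #serverIndex gets one core from the set left visible e.g. by taskset
  core := accessibleCores[serverIndex mod Length(accessibleCores)];
  for t := 0 to High(workerThreads) do
    SetThreadCpuAffinity(workerThreads[t], core); // assumed affinity helper
end;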
Last edited by mpv (2023-04-10 18:06:43)
Offline
I have merged your MR in mORMot 2.
Setting CPU affinity makes perfect sense in our context.
About flags, global variables are always initialized with zeros at startup - this is a requirement on all systems.
So "flags" is always [].
Edit:
If you want, you can give me more information about where the memory errors occur.
It could help reduce the scope of the investigation.
Offline
Added the CPU pinning feature to the TFB example - see mORMot2 PR #172. I will do the same PR for TFB after getting the results of the next run (it should start on 2023-04-13, with the new update algo and the removed keep-alive header).
In this PR I also add accessible-CPU analysis - for testing purposes, when we limit CPUs using `taskset`.
About the memory error - unfortunately this is all I currently have. If I enable logging it is not reproduced; currently it is reproduced ONLY during `./tfb --test mormot --query-levels 20 -m benchmark`, but not on every run.
Last edited by mpv (2023-04-11 21:48:50)
Offline
Pinning a given server instance to a core is a bit weird to me.
I am afraid it would not scale as expected, because we will lose most of the concurrent work of the server process.
It may help with /json on your HW, but does it also help with the other endpoints?
I have indeed worse numbers with pinning (run on a 20-core machine with a single CPU):
/json /plaintext /cached-queries?count=20
pinning 975549 1022849 779124
default 1234377 1306604 983743
IMHO we should rather pin on HW cores, not on SW/logical cores.
That is, calling SetThreadSocketAffinity() over each HW CPU socket.
Offline
The TFB hardware is a single-socket CPU...
I run tests on a 48-core server (2 sockets * 24 cores each) using
taskset -c 0-15 ./raw
num thread=8, total CPU=48, accessible CPU=16, num servers=16, pinned=TRUE, total workers=128, db=PostgreSQL
Postgres is limited to cores 15-31 by adding a systemd drop-in /etc/systemd/system.control/postgresql.service.d/50-AllowedCPUs.conf with the content
[Service]
AllowedCPUs=15-31
and wrk is limited to the last 16 cores
taskset -c 31-47 ./wrk
In this case the results are
json 1,207,744
rawdb 412,057
rawfortunes 352,382
rawqueries?queries=20 48,465
cached-queries?count=100 483,290
db 376,684
queries?queries=20 32,878
updates?queries=20 22,016
fortunes 300,411
plaintext 3,847,097
while the same without pinning are
json 1,076,755
rawdb 409,145
rawfortunes 359,764
rawqueries?queries=20 47,887
cached-queries?count=100 456,215
db 395,335
queries?queries=20 33,542
updates?queries=20 22,148
fortunes 306,237
plaintext 3,838,749
There is a small degradation in the DB-related tests, but the composite score is better. I plan to check pinning on the TFB hardware and decide what to do depending on the results. We can, for example, create a separate docker file with pinning for the non-DB endpoints and without pinning for the DB-related ones (as @ttomas proposes).
Last edited by mpv (2023-04-13 19:13:53)
Offline
I think we could enhance the /json performance without changing the thread affinity.
There is no reason /json is 4 times slower than /plaintext, because it is pure code with no syscall - just a few memory allocations with minimal JSON processing.
IIRC there are only two or three memory allocations during the process (one for the TJsonWriter, one for TRawByteStringStream when heavily threaded, one for the result RawUtf8), then O(1) linear JSON serialization work.
Perhaps valgrind could help find the bottlenecks.
Offline
Actually /json is not 4x slower, because /plaintext uses pipelining with 16 HTTP requests per packet, so there are 7,000,000/16 packets, and performance is limited by the 10G network.
I have analysed /json with valgrind many times and currently do not see any possible improvements, except minimizing CPU migrations and context switches using CPU pinning.
Your results look strange to me... Did you try to use the first 10 CPUs for the app and the second 10 for wrk? And please check that you use cmem.
Offline
Without pipelining (with cmem) the results are (note: count=100 for cached queries - as in the TFB test):
/json /plaintext /cached-queries?count=100
pinning 1,281,204 1,301,311 493,913
default 1,088,939 1,168,009 471,235
I put the program I use to create load for smoke tests in this gist. CORES2USE and CORES2USE_COUNT should be edited to match the CPUs used by wrk.
Offline
The .sh script fails with
./tfb-smoke.sh: line 20: unexpected EOF while looking for matching ``'
I am no bash expert so I can't understand what is wrong here...
Edit:
On my 20-core CPU:
taskset -c 10-19 wrk -d 5 -c 128 -t 10 http://localhost:8080/json
./raw -s 10 -t 8
Requests/sec: 1269521.04
./raw -s 10 -t 8 -p
Requests/sec: 1537866.42
So with proper taskset I got better results with pinning on my HW too.
But weirdly enough, /plaintext numbers are lower than /json when pinning is used.
So we still have room for improvement in the HTTP server.
Offline
@mpv, nice gist. Just a comment about concurrency: maybe add it as a parameter, so that all DB tests use 512 (it will have an impact on the number of active connections to postgres) and plaintext uses 1k or 4k - the best concurrency for each test.
Last edited by ttomas (2023-04-14 12:19:33)
Offline
@ttomas - thanks for the idea - I added a CONN param to the gist - the connection count for wrk; for plaintext 1024 is used (all frameworks show their best results with 1024).
@ab - I added a shebang to the gist (first line) - maybe your default shell is not bash. Also ensure you have the `bc` utility (apt install bc).
Nice to hear that our measurements with pinning match now... I do not understand why in your case /json is better than /plaintext - in my case /plaintext is always better.
I will make the PR to TFB on Sunday (when the current run results for mormot appear) - we will see what pinning gives us on real hardware. BTW pinning is a common practice for async servers - even nginx has a worker affinity option in its config. In the TFB tests pinning is used at least by libreactor and H2O.
Last edited by mpv (2023-04-14 19:07:16)
Offline
@ab - HTTP pipelining is currently broken. The regression was introduced by the "added Basic and Digest auth" feature.
The last good commit is [1434d3e1] "prepare HTTP server authentications" - 2023-04-13 1:48. After that there is a series of commits that do not compile due to the new aAuthorize param for THttpServerRequestAbstract.Prepare, and the first commit that compiles responds only to the first pipelined request.
It can be verified using the console command below - it should return two "Hello, World!" responses:
(echo -en "GET /plaintext HTTP/1.1\nHost: foo.com\nConnection: keep-alive\n\nGET /plaintext HTTP/1.1\nHost: foo.com\n\n"; sleep 10) | telnet localhost 8080
Last edited by mpv (2023-04-16 18:11:01)
Offline
You are right.
Should be fixed by https://github.com/synopse/mORMot2/commit/1eb4ac4e
State machines are great, but it is sometimes difficult to track their logic.
Offline
HTTP pipelining is fixed - thanks! I made TFB PR 8153 with CPU pinning - let's wait for the results.
The memory problems still exist. Today I caught it twice (out of 5-6 runs) - once after /db and once after /rawqueries while running
./tfb --test mormot mormot-postgres-raw --query-levels 20 -m benchmark
I still can't reproduce it in a more "debuggable" way.
I also synced my latest TFB changes with ex/techempower-bench/raw.pas - see PR 175 for mORMot2.
Offline
About the memory problems.
Perhaps it is due to the fact that we run several THttpAsyncServer instances.
Do you confirm it occurs at server shutdown?
And that it occurred also after a /json set of calls?
Anyway, I tried to rewrite some memory allocation code used during /json
https://github.com/synopse/mORMot2/commit/412c9deb
Edit:
Look at https://github.com/synopse/mORMot2/blob … w.pas#L665
using the new TExecuteCommandLine parser - I tried to use the best ideas from https://pkg.go.dev/flag
Its usage seems easier (and more powerful) than FPC TCustomApplication command line parsing.
Offline
Today I ran the TFB tests 5 times (each run takes ~30 minutes) and the memory error did not occur (with the old sources, without the GetAsText change), so it's really a heisenbug.
It occurs NOT on server shutdown, but just after the wrk command ends - I think when the sockets are closing... I will continue to investigate...
About the command line parameters - nice code. Please look at PR 176 - I made the help message formatting more Unix-like.
Offline
Current TFB status
Weights 1.000 1.737 21.745 4.077 68.363 0.163
# JSON 1-query 20-q Fortunes Updates Plaintext Scores Date - Notes
38 731,119 308,233 19,074 288,432 3,431 2,423,283 3,486 2022-10-26 - 64 thread limitation
43 320,078 354,421 19,460 322,786 2,757 2,333,124 3,243 2022-11-13 - 112 thread (28CPU*4)
44 317,009 359,874 19,303 324,360 1,443 2,180,582 3,138 2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51 563,506 235,378 19,145 246,719 1,440 2,219,248 2,854 2022-12-01 - 112 thread (28CPU*4) CPU affinity
51 394,333 285,352 18,688 205,305 1,345 2,216,469 2,586 2022-12-22 - 112 threads CPU affinity + pthread_mutex
34 859,539 376,786 18,542 349,999 1,434 2,611,307 3,867 2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28 948,354 373,531 18,496 366,488 11,256 2,759,065 4,712 2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
16 957,252 392,683 49,339 393,643 22,446 2,709,301 6,293 2023-02-14 - 168 threads, cmem, improved PG pipelining
15 963,953 394,036 33,366 393,209 18,353 6,973,762 6,368 2023-02-21 - 168 threads, improved HTTP pipelining, PG pipelining uses Sync() as required, -O4 optimization
17 915,202 376,813 30,659 350,490 17,051 6,824,917 5,943 2023-03-03 - 168 threads, minor improvements, Ubuntu 22.02
17 1,011,928 370,424 30,674 357,605 13,994 6,958,656 5,871 2023-03-10 - 224 threads (8 thread * 28 instances) eventfd, ThreadSmooting, update use when..then
11 1,039,306 362,739 29,363 354,564 15,748 6,959,479 5,964 2023-03-16 - 224 threads (8*28 eft, ts), update with unnest, binary binding
17 1,045,953 362,716 30,896 353,131 16,568 6,994,573 6,060 2023-04-13 - 224 threads (8*28 eft, ts), update using VALUES (),().., removed Connection: Keep-Alive resp header
We are still #17, but the composite score improves with every new run. Also we moved up from #7 to #3 in the cached-queries test.
Now we try CPU pinning - I expect a good improvement in /json and /cached-queries...
Offline
In the current round we moved above actix and .NET Core.
The final results will be in 3 days; I expect we will be #15.
@ab - is it correct to compute POrmCacheTable once in the TRawAsyncServer constructor (instead of computing it every time here)? This should give us the few extra requests we need to be #1 in cached queries...
Last edited by mpv (2023-04-23 18:53:49)
Offline
Nice!
About POrmCacheTable of course we could put it as a field. But I doubt it would make any performance change: it is a O(1) lookup process.
Perhaps more performance could be achieved for the benchmark composite scores if we include the /rawcached endpoint too, in addition to the /cached_queries endpoint.
https://github.com/synopse/mORMot2/blob … w.pas#L495
It has no per-ID lookup as POrmCacheTable does, so it is perfectly O(1), whereas POrmCacheTable.Get() uses a binary search, so a few iterations as O(log(n)).
Now the bottleneck seems to be the endpoints which make a single DB request, i.e. 1-query and Fortunes.
Perhaps we could feed an internal request queue to execute those requests in a pipelined DB request.
I would need to modify the HTTP server to be able to return its answer later, from a callback.
Offline
About POrmCacheTable of course we could put it as a field. But I doubt it would make any performance change: it is a O(1) lookup process.
It is called 400k times per second. Caching it can give us the +0.1% performance boost we need to be #1... At least in my environment, this is what happens.
Unfortunately, rawcached breaks the rules. There are already discussions in the TFB issues that such implementations should be banned - I don't want to take risks.
About pipelining DB requests for /db and /fortunes - this is an interesting idea. Actually the top-rated frameworks do this.
In this case we need:
- a callback at the HTTP server level, and
- a callback at the DB level.
Each server can use a single per-server DB connection and a new DB-layer method stmt.ExecutePipelining(maxCnt, timeout, callback);
ExecutePipelining can buffer up to maxCnt statements (or until the timeout), run them in a single pipeline and notify the callback for each caller - see the sketch below.
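A rough interface-level sketch of that proposal (all names here are hypothetical - this is the idea, not existing mORMot code):

type
  // uses mormot.db.sql.postgres for TSqlDBPostgresStatement
  // called once the pipelined result for a queued statement has been read
  TOnPipelinedDone = procedure(Stmt: TSqlDBPostgresStatement; Error: Exception) of object;

  TPendingStatement = record
    Stmt: TSqlDBPostgresStatement;
    Done: TOnPipelinedDone;
  end;

  // buffers prepared statements, then sends them as one PostgreSQL pipeline
  TStatementPipeline = class
  private
    fPending: array of TPendingStatement;
    fMaxCount: integer;   // flush when this many statements are queued
    fTimeoutMS: integer;  // ... or when the oldest one has waited this long
  public
    // queue a statement; triggers an immediate Flush if fMaxCount is reached
    procedure ExecutePipelining(Stmt: TSqlDBPostgresStatement; const Done: TOnPipelinedDone);
    // send all queued statements in one pipeline (PQpipelineSync), then read the
    // results in submission order and fire each Done callback
    procedure Flush;
  end;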
And finally we get callback hell (especially when handling exceptions) - I've seen this in old .NET and JavaScript before they implemented async/await at the runtime level.
But for benchmark purposes we can try.
Offline
While looking at cached-queries performance I found a *VERY* unexpected thing:
TTextWriter.Add(Value: PtrInt) uses a fast lookup table for values < 999.
I decided to increase it to 9999 (TFB IDs are 0..10000) and... performance got worse.
If I comment out the lookup code, performance increases.
For cached-queries?count=100:
- no lookup: 511k RPS
- 999 lookup size: 503k RPS
- 9999 lookup size: 466k RPS
@ab - do you have any idea why? The relative numbers do not depend on CPU pinning, server count, thread count...
Last edited by mpv (2023-04-25 17:51:53)
Offline
I guess it is because of cache pollution.
The StrInt32 asm code uses a two-digit lookup table of 200 bytes, and a multiplication by reciprocal, which is very fast.
Whereas a big table is likely to quickly fill and pollute the L1 cache, which is a bad thing for performance.
The bigger SmallUInt32Utf8[] is, the more the L1 cache is filled, and the slower it becomes.
My guess is that if we have a fast multiplication by reciprocal for the "div 100", then we could bypass the cache for values < 999.
Only for RawUtf8 generation does it make sense to have a pre-computed array of ref-counted RawUtf8 instances.
We may try https://github.com/synopse/mORMot2/commit/c4256371
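To illustrate the technique in plain Pascal (not the actual StrInt32 asm): two digits per iteration from a 200-byte table, written backwards into a small temp buffer; the div/mod by the constant 100 is typically compiled into a multiply-by-reciprocal, so no real division happens.

const
  // 100 pairs '00'..'99' = 200 bytes, small enough to stay warm in the L1 cache
  TwoDigits: string[200] =
    '00010203040506070809101112131415161718192021222324' +
    '25262728293031323334353637383940414243444546474849' +
    '50515253545556575859606162636465666768697071727374' +
    '75767778798081828384858687888990919293949596979899';

function UInt32ToText(value: cardinal): string;
var
  tmp: array[0..15] of AnsiChar;
  p, pEnd: PAnsiChar;
  d: cardinal;
begin
  pEnd := PAnsiChar(@tmp) + SizeOf(tmp);
  p := pEnd;                         // digits are written backwards from the end
  while value >= 100 do
  begin
    d := value mod 100;              // compiled as multiply-by-reciprocal, not a div
    value := value div 100;
    Dec(p, 2);
    Move(TwoDigits[d * 2 + 1], p^, 2);   // copy two digits at once
  end;
  if value >= 10 then
  begin
    Dec(p, 2);
    Move(TwoDigits[value * 2 + 1], p^, 2);
  end
  else
  begin
    Dec(p);
    p^ := AnsiChar(Ord('0') + value);
  end;
  SetString(Result, p, pEnd - p);    // single copy once the length is known
end;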
Offline
No, because StrInt*() writes the digits backwards into the temp buffer, so a copy is needed once the length is known.
Pre-computing the length is possible in constant time (with no branch) but it is actually slower than the current pattern with movefast.
About callbacks, I find this worth reading:
https://devblogs.microsoft.com/dotnet/h … lly-works/
Offline
I know this cool article about async in .NET. In fact, the same steps were taken in JS. In the browser client for UnityBase I started with callbacks 13 years ago, then moved to an iterator-based Promises polyfill, then to Promises, and finally to async/await.
In Pascal we need at least iterator support at the compiler level; without it the only option is callbacks, but this is hell... An example of a callback-based implementation is h2o.
I like our current implementation - at the app level, everything is quite simple. Complicating it to the level of manual asynchronous programming is likely to alienate potential users.
I'm still confident that we can find a way to improve the current implementation (and I'm working on it periodically) - we only need +200 composite points to get into the top 10 TFB...
TFB PR 8182 is ready - it should improve /cached-queries and maybe /queries also.
Offline
I have written some new methods in raw.pas:
https://github.com/synopse/mORMot2/commit/510ef7b3
It should add new asynchronous pipeline-based /asyncdb and /asyncfortunes endpoints.
The code is not difficult to follow. It is not a true asynchronous system, just the basic callbacks we want for our purpose.
https://github.com/synopse/mORMot2/commit/4dea86cd
Note that there are some corresponding changes in THttpAsyncServer too.
https://github.com/synopse/mORMot2/commit/bd33c64e
But I have NOT tested it at all yet.
I will try to do the tests and debugging tomorrow or this weekend.
What still needs to be done is to run the process asynchronously only under some conditions (e.g. the number of connected clients, or the recent requests per second).
I am almost sure /asyncdb will run awfully with only a few connections.
Offline
There is a missing connect:
--- a/src/db/mormot.db.sql.postgres.pas
+++ b/src/db/mormot.db.sql.postgres.pas
@@ -1379,6 +1379,7 @@ begin
fProperties := Owner;
fStatements := TSynObjectListLightLocked.Create;
fConnection := fProperties.NewConnection as TSqlDBPostgresConnection;
+ fConnection.Connect;
fConnection.EnterPipelineMode;
end;
And it should be tested, because currently the result is always `{"id":0,"randomNumber":0}` and only 19 RPS per server.
Looking forward to it! We still have at least 7 days until the next merge request...
Last edited by mpv (2023-04-27 18:38:02)
Offline
With some fixes to TSqlDBPostgresAsync
https://github.com/synopse/mORMot2/commit/c362604b
What I see:
- it seems more stable
- for a low number of connections, the numbers are awful - but fine with 512 concurrent clients, which is the point of TFB
- now, on a local docker PostgreSQL instance, I reach the same performance level as /rawdb and /rawfortunes - only slightly slower
- I expect /asyncdb and /asyncfortunes to scale better on a remote PostgreSQL instance, with a slower network
The next step may be to use a dedicated thread, polling the PostgreSQL socket for reading, then executing the callbacks from this thread.
It could help scale better with low numbers of connections, and also simplify the whole process and - I hope - enhance the responsiveness even with 512 clients.
Offline
I have implemented the dedicated thread for TSqlDBPostgresAsync, and got rid of the previous complex (and inefficient) scaling algorithm.
Now I have pretty good numbers:
ab@dev-ab:~/Downloads$ wrk -c 512 -d 5 -t 8 http://localhost:8080/rawfortunes
Running 5s test @ http://localhost:8080/rawfortunes
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.82ms 1.12ms 21.49ms 74.29%
Req/Sec 27.91k 5.03k 47.72k 70.28%
1104299 requests in 5.07s, 1.39GB read
Requests/sec: 218023.57
Transfer/sec: 280.90MB
ab@dev-ab:~/Downloads$ wrk -c 512 -d 5 -t 8 http://localhost:8080/asyncfortunes
Running 5s test @ http://localhost:8080/asyncfortunes
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.64ms 1.07ms 25.53ms 91.41%
Req/Sec 38.84k 2.43k 61.20k 94.78%
1553960 requests in 5.06s, 1.96GB read
Requests/sec: 307239.69
Transfer/sec: 395.85MB
It means that I got 50% more requests!!!!!
Please try https://github.com/synopse/mORMot2/commit/b6dc0c81
There is still some issue with the statement initialization - perhaps some more % to come.
Offline
Yes, the statement was not cached on the server side...
https://github.com/synopse/mORMot2/commit/5e502ff4
(it was obvious using the strace tool: I could see the whole SQL "select" in the output frame)
Now /asyncfortunes is almost twice as fast as /rawfortunes!
I am confident we are in the top 10 now.
And the raw.pas code is still very readable, from my POV.
The next step is to try to apply asynchronous requests to two new /asyncqueries and /asyncupdates endpoints.
For this, we would need a small class implementing a state machine for the various SELECT (+ UPDATE) steps to follow. But it should be feasible.
@mpv
I also guess:
1) we may need to review the whole pinning / thread count algorithm.
There is no DB latency any more, so perhaps we need another pass at finding the best possible parameters for the /async* endpoints - which should shine in the TFB ranking.
2) perhaps having a modified libpq is pointless with the new async thread.
3) we should both be very proud now.
Offline
BTW - nice pictures
About /asyncupdates - from my POV it is not correct to pipeline updates - in a realistic /updates scenario we should do all the selects together with the update in one transaction (even if TFB does not require this). But in our "async" model we can't do transactions at all (actually we can, but from a consistency POV it is not correct) - only atomic select operations.
This is why I am considering doing the /async* endpoints only for the db queries and fortunes - I'll create a separate test case in benchmark_config.json with "approach": "Stripped" for such endpoints.
Am I right, or am I missing something?
Last edited by mpv (2023-04-28 15:19:24)
Offline
I have finally implemented /asyncqueries and /asyncupdates too.
- so we have full ORM + RAW + ASYNC endpoint coverage for comparison
- it was a nice showcase of the new callback mechanism: we can in fact run UPDATE statements in pipelined mode
- I agree that the TFB scenario of SELECT + UPDATE without a transaction is not realistic
- see https://github.com/synopse/mORMot2/commit/252b256d
From my measurements, /asyncqueries is also faster than /rawqueries:
- it is twice as fast for ?queries=1 - as expected
- it is only slightly faster for ?queries=20
- and it seems to be much more reliable - the first run of wrk over /rawqueries always hits a timeout due to the high number of connections, whereas /asyncqueries is much more stable thanks to its single connection
On the other hand, /asyncupdates is not faster than /rawupdates.
I suspect this is because
1) we start the UPDATE statement from a callback inside the async thread, so it scales less
2) we can only use the array binding + unnest SQL statement, because the async statements have to be prepared ahead of time (they cannot be prepared in the middle of some pipelined work)
Anyway, the updates test was not the one where we had much benefit to expect, compared to the best frameworks in the top 10.
I also tried a set of naive 'update World set randomNumber=? where id=?' statements, but the unnest SQL is way faster (the pattern is sketched below).
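For reference, the array-binding + unnest shape looks like this (reconstructed from memory, not copied from raw.pas - the exact SQL text and the mormot.db.sql binding overloads may differ slightly):

// uses mormot.db.sql
const
  // one prepared statement updates any number of rows from two bound arrays;
  // the '?' placeholders are mapped to $1/$2 by the PostgreSQL layer
  WORLD_UPDATE_UNNEST =
    'UPDATE world SET randomnumber = v.r' +
    ' FROM (SELECT unnest(?::bigint[]) AS id, unnest(?::bigint[]) AS r) AS v' +
    ' WHERE world.id = v.id';

procedure BatchUpdateWorlds(props: TSqlDBConnectionProperties;
  const ids, randoms: array of Int64);
var
  stmt: ISqlDBStatement;
begin
  stmt := props.NewThreadSafeStatementPrepared(WORLD_UPDATE_UNNEST, {expectresults=}false);
  stmt.BindArray(1, ids);      // world ids
  stmt.BindArray(2, randoms);  // new random numbers
  stmt.ExecutePrepared;
end;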
My first remark is that we had better restart the raw server between the PLAIN+JSON / ORM / RAW / ASYNC modes of queries.
The thread and DB connection usage is not the same at all.
BTW I was occasionally able to reproduce some unexpected GPF during my tests, after a few dozen runs.
They seem to appear within the THttpAsyncServer, in some obscure case. I hope I will be able to find something.
Offline
What command line parameters do you use for testing (threads/servers/pinning)? On my server HW the async* results (with servers=CPUCount, threads=8, pinning) are a little worse compared to raw*:
num servers=16, threads per server=8, total threads=128, total CPU=48, accessible CPU=16, pinned=TRUE, db=PostgreSQL
taskset -c 31-47 ./wrk -H 'Host: 10.0.0.1' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 10 -c 512 --timeout 8 -t 16 "http://localhost:8080/asyncfortunes"
Requests/sec: 353990.97
taskset -c 31-47 ./wrk -H 'Host: 10.0.0.1' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 10 -c 512 --timeout 8 -t 16 "http://localhost:8080/rawfortunes"
Requests/sec: 393226.26
Last edited by mpv (2023-04-28 18:11:59)
Offline