WaitFor() is not the bottleneck under high contention, because most of the time is spent within the "while GetNext and GetOnePending do ProcessRead" loop.
In fact, GetNext() will make a read() syscall to decrement the eventfd counter.
This GetNext() / read() is mandatory in EFD_SEMAPHORE mode.
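For reference, here is a minimal FPC sketch (not mORMot code; the eventfd() import and the EFD_SEMAPHORE value are assumptions about the libc/kernel interface) showing why EFD_SEMAPHORE costs one read() per event: each read() returns 1 and only decrements the counter by one.

program EfdSemaphoreDemo;

{$mode objfpc}{$H+}

uses
  BaseUnix, ctypes;

const
  EFD_SEMAPHORE = 1;  // assumed flag value from <sys/eventfd.h>

// eventfd() is not wrapped by the FPC RTL, so import it from libc
function eventfd(initval: cuint; flags: cint): cint; cdecl; external 'c';

var
  efd: cint;
  v: QWord;
  i: integer;
begin
  efd := eventfd(0, EFD_SEMAPHORE);  // blocking + semaphore mode
  v := 3;
  FpWrite(efd, v, SizeOf(v));        // post 3 "events" at once
  for i := 1 to 3 do
  begin
    FpRead(efd, v, SizeOf(v));       // each read() returns 1 and decrements
    writeln('woken, got ', v);       // -> 3 read() syscalls for 3 events
  end;
  FpClose(efd);
end.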
Edit:
I managed to run eventfd() in blocking mode, which gives clean syscall sequences: read(eventfd) + recvfrom(socket) + sendto(socket) in a loop in the threads.
But the numbers were slower than my previous attempt with poll()... and I had to tweak the R0 thread to consume more CPU...
So eventfd + EFD_SEMAPHORE does not sound like a good option. I don't see how we could use the non-semaphore mode for this purpose anyway.
Offline
But the correct solution may be that, on Linux, TSynEvent uses eventfd() instead of PRTLEvent:
- eventfd() in blocking and non-semaphore mode reduces the syscalls to simple read/write/poll
- seems to be stable in practice - feedback is welcome!
https://github.com/synopse/mORMot2/commit/c551e487
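As an illustration only - a hypothetical TSimpleEventFD wrapper, not the actual TSynEvent code from the commit above - the blocking, non-semaphore pattern boils down to a plain write() to signal and a blocking read() to wait, which also resets the counter to zero:

// Hypothetical wrapper - a sketch of the blocking, non-semaphore pattern,
// not the actual TSynEvent implementation from the commit above.
uses
  BaseUnix, ctypes;

// eventfd() is not wrapped by the FPC RTL, so import it from libc
function eventfd(initval: cuint; flags: cint): cint; cdecl; external 'c';

type
  TSimpleEventFD = object
    fd: cint;
    procedure Init;
    procedure SetEvent;  // wake a blocked reader
    procedure WaitFor;   // block until signaled
    procedure Done;
  end;

procedure TSimpleEventFD.Init;
begin
  fd := eventfd(0, 0);             // blocking, non-semaphore mode
end;

procedure TSimpleEventFD.SetEvent;
var
  one: QWord;
begin
  one := 1;
  FpWrite(fd, one, SizeOf(one));   // a single write() syscall
end;

procedure TSimpleEventFD.WaitFor;
var
  v: QWord;
begin
  // blocks in the kernel until the counter is > 0, then returns the whole
  // counter and resets it to 0: several SetEvent calls can be coalesced
  // into a single wakeup - no futex involved
  FpRead(fd, v, SizeOf(v));
end;

procedure TSimpleEventFD.Done;
begin
  FpClose(fd);
end;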
Now, in the syscall traces, we have no futex call any more (unless there is a real lock contention) during the sub-thread process.
Just some read() and write() for inter-thread communication.
My guess is that the whole previous acoEventFD code could be dropped.
It will always be slower than this solution.
Offline
Digging deeper, I found that with the new implementation we get better results with a lower total thread count. This is very good - I expect the DB tests will be faster. Moreover, 7 threads * 14 servers is better than 14 threads * 7 servers.
I will investigate more.
BTW - the TFB team plans to begin setting up NEW servers with Ubuntu 22.04 this week (see https://github.com/TechEmpower/Framewor … 449036999), so we have some time...
Offline
Nice to read!
Did you try with hsoThreadSmooting now?
It could help focus the server on a few threads for /plaintext or /json, but spread the flow over all the threads in case of slower DB access.
Perhaps the next step may be to implement DB pipelining for the ORM, maybe using a batch....
This is where the "raw" results are really behind the "ORM" results.
Offline
Playing with mormot2 master from today. The best results on 24 vCPU are 2 servers with 12 threads each, including the hsoThreadSmooting and hsoReusePort flags. There are some strange results when running docker with -p 8080:8080 vs --net=host, compared with ntex: ntex runs faster with -p, mORMot with --net=host. I also include results from running raw natively on the host without docker. mORMot is faster than ntex if run with --net=host!
T1 T2 T3 AVG
Native raw 12 24 2 196.817 196.177 196.558 196.517
docker run -p 8080:8080
raw 12 24 2 132.468 123.207 126.877 127.517
ntex 144.824 148.111 146.022 146.319
docker run --net=host
raw 12 24 2 186.369 188.466 186.378 187.071
ntex 137.510 136.848 136.738 137.032
Offline
Nice numbers!
Is there a way to ensure TFB use --net=host for the mORMot image?
Final numbers of previous round are available.
mORMot is #16 as expected - no regression.
https://www.techempower.com/benchmarks/ … =composite
The previous PR from Pavel has been merged, so the next round will include the cleaned-up code from before our eventfd() modifications.
https://synopse.info/forum/viewtopic.ph … 154#p39154
Then the following round, maybe on new Hardware, and new Linux/Ubuntu version, could include the latest TSynEvent / eventfd() ideas.
Sounds like we are on the right track.
We need to validate whether hsoThreadSmooting is really better on the new HW/SW stack too. I wonder how many cores the new HW will have.
Perhaps io_uring may be the next big step. We could statically link liburing-ffi.a wrapper from https://github.com/axboe/liburing to try it.
That is, if the new Ubuntu kernel supports io_uring well enough. Ubuntu 22.04.2 LTS seems to have a 5.19 HWE kernel... I don't know if they will use this just-released update.
Side note:
I looked at the ntex source, and I found
https://github.com/TechEmpower/Framewor … /db.rs#L54
my guess is that pl += 1; is incorrect in this loop - it should be pl += 2;
So the ntex updates are clearly wrong for half of them.
The TFB validator should check that the /updates endpoint actually does what is expected.
Offline
Update:
I propose to change the /rawupdates SQL query
https://github.com/synopse/mORMot2/commit/725087dc and
https://github.com/synopse/mORMot2/commit/0d3804cf
- the ORM-like nested SELECT is replaced by " update table set randomNumber = CASE id when $1 then $2 when $3 then $4 ... when $9 then $10 else randomNumber end where id in ($1,$3,$5,$7,$9) "
- this weird syntax gives the best numbers for TFB /rawupdates?queries=20 but is not as good for smaller or higher counts
- we won't include it in the ORM but only in our RAW results - as other frameworks (e.g. ntex) do - a sketch of how such a statement can be generated follows below
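For illustration, a hypothetical helper - not the code from the commits above - that builds this kind of statement for a given row count, using the same $1..$n placeholder style as the quoted SQL:

// Hypothetical helper (not the committed code): builds the CASE WHEN statement
// quoted above for a given number of rows, using $1..$n placeholders.
uses
  SysUtils;

function BuildCaseWhenUpdateSql(RowCount: integer): string;
var
  i: integer;
  ids: string;
begin
  result := 'update World set randomNumber = CASE id';
  ids := '';
  for i := 0 to RowCount - 1 do
  begin
    // odd placeholders carry the id, even ones the new randomNumber
    result := result + Format(' when $%d then $%d', [i * 2 + 1, i * 2 + 2]);
    if ids <> '' then
      ids := ids + ',';
    ids := ids + '$' + IntToStr(i * 2 + 1);
  end;
  result := result + ' else randomNumber end where id in (' + ids + ')';
end;

// BuildCaseWhenUpdateSql(2) returns:
//   update World set randomNumber = CASE id when $1 then $2 when $3 then $4
//   else randomNumber end where id in ($1,$3)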
Offline
Configuring the host for benchmarking as heavy as TFB is a separate issue. I recommend:
- kernel: turn off the mitigation patches at the kernel level (by adding mitigations=off to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub)
- docker: disable the userland proxy (by adding "userland-proxy": false to /etc/docker/daemon.json) - @ttomas, try this instead of --net=host
- do not use any VM - results on a VM are always unpredictable
- increase the max open files limit (see the snippet after this list)
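A minimal sketch of those host tweaks (file paths and values are assumptions to adapt to the actual box):

# /etc/default/grub - disable the CPU mitigations, then run update-grub and reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash mitigations=off"

# /etc/docker/daemon.json - disable the userland proxy, then restart the docker service
{
  "userland-proxy": false
}

# /etc/security/limits.conf - raise the open-files limit for the benchmark user
*  soft  nofile  1048576
*  hard  nofile  1048576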
I tried with hsoThreadSmooting - for /json the result is better (1450K vs 1300K without), for /plaintext 4220K vs 4200K without, for /rawdb and /rawfortunes nearly the same (but not slower); queries and updates are a little (<1K) slower.
About the updates SQL - your version does not work for Postgres (the SynDB layer expects ? as the parameter placeholder). When I fixed it to generate "when ? then ? ... where id in (?....)" and bound 60 parameters instead of 40, performance did not change (+-100) compared to the array-bound version. Sorting the Worlds array does not change anything.
Offline
Fixed /rawupdates - https://github.com/synopse/mORMot2/pull/152, but since performance is the same, I propose to revert back to the sub-query as more realistic.
Offline
Does /rawupdates really see no benefit from using UPDATE ... CASE WHEN ...?
It sounds like the frameworks ranking better than mORMot use this pattern.
And on my side, with SQLite3 it is even better for /rawupdates?queries=20 which is the one used for the TFB ranking IIRC.
We could at least try on the TFB hardware.
About ntex it was my bad - but the main issue remains: I don't think they validate that updates do actually take place.
Perhaps we could try enabling hsoThreadSmooting by default.
I made some small refactoring today:
does https://github.com/synopse/mORMot2/commit/6e242a7149 make any difference on server HW?
I had to switch the timer on for Windows, because the select() polling is not the same as epoll().
But on Linux, the syscall trace seems cleaner now. There is no poll() any more in the R0 thread, and this thread uses less CPU than before AFAICT.
I also fixed the TAsyncConnections.ThreadPollingWakeup() algorithm, so the R1..Rn threads should be awakened more effectively.
Offline
For /rawupdates - there is really no benefit, at least on my server.
I think our problem is not in the SQL clause, but in concurrency. The other, better-ranking frameworks are async and create connection count = CPU count. We need at least connections = CPU*3. I hope that with eventfd I can decrease connections from the current CPU*6 to CPU*3 without performance loss. And also apply the new rawupdates - just to verify..
I will verify the latest changes this evening (the server I use for tests is someone's production box and there is additional load during the workday). But the code is much cleaner now, for sure.
About io_uring - let's wait until some framework implements it and see the numbers....
Offline
The other frameworks use coroutines/futures to manage the slow DB process, whereas we use blocking access to the DB, so we need more threads than they do.
Good point about io_uring: in practice, the 7M RPS on /plaintext is about the maximum for the current HW network IIRC.
It may be possible to scale better with /json but it is not certain, since benchmarks with io_uring vs epoll are not really consistent...
What is sure is that we spend a lot of time in the kernel during read/write operations, whereas the io_uring scheme may help by greatly reducing the number of syscalls.
Ensure you include https://github.com/synopse/mORMot2/commit/20b5f3c1 in your tests.
Offline
I rethought the way I run tests on server hardware (2 sockets * 12 cores * 2 threads = 48 logical CPUs).
All my previous tests were executed on the same server with:
- the app server limited to the first 28 CPUs (taskset -c 0-28 ./raw ...) to match the TFB hardware logical CPU count
- wrk limited to last 20 cores
- Postgres w/o CPU limits
This is totally wrong for several reasons:
- the good: the network is local, but we can't do anything about this. TFB uses a 10G switch, so we can expect it to be nearly the same as a local network
- the bad: Postgres is not limited to specific CPUs - this is bad, and we get unexpected results for DB-related tests (our numbers do not match TFB; for example, I got better /rawqueries)
- the ugly: the app server uses CPUs from different sockets - this is ugly, for sure
So I decided to set CPU limits of 16 cores for each part - app 0-15, Postgres 16-31 (systemctl set-property postgresql.service AllowedCPUs=16-31 + restart) and wrk 32-47 - and repeat all tests. The numbers are lower, but proportionally they should be closer to the TFB HW.
Will publish numbers soon..
Last edited by mpv (2023-03-05 19:15:40)
Offline
Results for the 16-16-16 CPU split are on Google Drive.
The best mode is the 8-16 one - 8 threads per server, 1 server per CPU (16 servers);
The mode I selected for TFB (16 threads * 4 servers in this case, 28 * 6 for their HW) is one of the worst :(
Results with or without thread smoothing, and with eventfd vs PRTLEvent, are very close;
In all tests the app server consumes ~98% of its CPUs.
@ab - take a look at the x2 cached-queries difference with many threads per server.
Last edited by mpv (2023-03-05 21:28:31)
Offline
Very good, thorough testing!
A lot of time spent, I suppose....
About the cached-queries x2 difference, this is weird.
There is a R/W lock, so multiple threads should be able to access the cache in parallel.
Perhaps the O(log(n)) algorithm of the ORM cache could be improved - it was the easiest approach and good enough in terms of memory. I could use an O(1) TDynArrayHasher instead of the O(log(n)) binary/sorted search. For 10,000 items, it makes a difference.
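To illustrate the difference - a generic FPC sketch (objfpc mode) using Generics.Collections, not the actual TOrmCache/TDynArrayHasher code - a hash map gives O(1) average lookups, while a binary search over a sorted ID array costs about 14 comparisons per hit for 10,000 items:

// Generic sketch (not the mORMot implementation): O(1) hash lookup vs
// O(log n) binary search over a sorted ID array.
uses
  Generics.Collections;

type
  TWorldCacheMap = specialize TDictionary<Int64, TObject>;

var
  ByHash: TWorldCacheMap;     // created once: ByHash := TWorldCacheMap.Create
  SortedIds: array of Int64;  // kept sorted, as a binary-search cache would be

function FindByBinarySearch(id: Int64): integer; // ~log2(10000) = 14 steps
var
  lo, hi, mid: integer;
begin
  lo := 0;
  hi := high(SortedIds);
  while lo <= hi do
  begin
    mid := (lo + hi) shr 1;
    if SortedIds[mid] = id then
      exit(mid)
    else if SortedIds[mid] < id then
      lo := mid + 1
    else
      hi := mid - 1;
  end;
  result := -1;
end;

function FindByHash(id: Int64): TObject; // O(1) on average
begin
  if not ByHash.TryGetValue(id, result) then
    result := nil;
end;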
Here I tried to follow your findings:
https://github.com/synopse/mORMot2/commit/213542ca
Offline
Your changes are adopted for the TFB PR - see the small fixes in pull/153. I will make a pull request to TFB (based on mORMot 2.0 stable) after we get the results for the current run (they may stop it to tune the servers; I hope our results will be ready before this happens). Many thanks for statics.tgz!
I found a thread-safe lock-free multiple-producer multiple-consumer queue implementation by @BeRo1985.
@ab - have you seen it? Maybe it would be better for waking threads than eventfd/RTLEvent?
Offline
I already looked at Bero's code for the queue, but it sounded a bit too complex for the task.
There are alternative lock-free queues around which just use indexes, and not double-pointer slots - which are not native in the FPC RTL.
AFAIR the pending list is not a contention point.
There is no switch to a kernel futex during the process, and I looked at the log timings, which showed only a few microseconds per run. The TOsLightLock protects an O(1) branchless algorithm which accesses data in the L1 CPU cache.
Offline
I have rewritten the TOrmCache process.
https://github.com/synopse/mORMot2/commit/abb5e068
Now we can directly retrieve a ready-to-be-serialized TOrm instance from the cache.
Numbers should be better now for /cached-queries - even if they are not included in the main composite ranking.
Edit: ensure you got at least https://github.com/synopse/mORMot2/commit/46f5360a with the latest ORM cache optimizations.
Offline
TFB MR 7994 is ready - based on 46f5360a commit:
- thread pool auto-tuning: use 1 listening socket per logical CPU and 8 working threads per listening socket
- for /updates with count <=20 use 'case .. when .. then' pattern
- [mORMot] update to mORMot 2.0.stable
- [mORMot] improved cached queries performance
Our DB layer does not support a parameter count > 999, so I added a fallback to UNNEST for /rawupdates (the TFB tests use count=500 for verification = 1500 parameters in the query) - will do an MR to the mORMot ex/ folder later... A sketch of the UNNEST statement is below.
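For reference, the UNNEST fallback only ever binds two array parameters, whatever the row count - a sketch of such a statement (the exact SQL in the PR may differ):

// Sketch of the UNNEST fallback statement (the exact SQL in the PR may differ):
// only two array parameters are bound, whatever the row count, so the
// 999-parameter limit of the DB layer is never hit.
const
  SQL_UPDATE_UNNEST =
    'UPDATE World SET randomNumber = v.r FROM (' +
    ' SELECT unnest($1::int[]) AS id, unnest($2::int[]) AS r' +
    ') AS v WHERE World.id = v.id';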
Cached-queries performance increased from 788k to 923k in the 8 threads * 16 servers mode.
Offline
Yes, they updated only the OS (Ubuntu 18.04 -> Ubuntu 22.04). Numbers are ~10-12% lower for all frameworks; I think they did not turn off all the new mitigation patches..
In any case we decreased the gap with .NET.
Let's wait for the next round with the improved thread pool and the new updates query. I hope my thread pool size investigations are correct....
Offline
Current TFB status:
in the round completed 2023-03-10 we moved down in composite score #16 -> #17 (a new rust(tokio) framework, viz, was added), but in the endpoint tests we improved our results:
- /json #67 -> #65
- /db #31 -> #25
- /queries #30 -> #29
- /cached #46 -> #38 (we expect huge improvement in next results)
- /fortunes #19 -> #20 (unfortunately)
- /updates #29 -> #25 (some improvements expected in next round with updates w/o unnest)
- /plain #23 -> #21
Some results improved because of our changes, some because other frameworks are more affected by the new kernel mitigation patches.
Next results expected on 2023-03-15
Offline
I have slightly updated the TFB benchmark sample:
- introducing /rawcached entrypoint
- updated README.md content
See https://github.com/synopse/mORMot2/commit/8e9313c4
I have added /rawcached because some frameworks with no ORM do implement it, so it could be a good idea to include it in the tests.
It is implemented just by using an in-memory dynamic array of TOrmWorld instances - this seems to be allowed by the specifications, because we are allowed to assume that the IDs will be in the range 1..10000, so we can just use an array as the caching mechanism. On my notebook, the numbers are better than with the ORM cache, because the O(log(n)) binary lookup clearly has a cost.
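A rough sketch of that caching scheme, with hypothetical names around the sample's TOrmWorld class (not the committed /rawcached code): the World ID is the array index, so a lookup is a single bounds-checked array access.

// Rough sketch with hypothetical names (not the committed /rawcached code):
// the World ID is used directly as an array index - no hashing, no search.
var
  CachedWorlds: array of TOrmWorld; // filled once at startup from the DB

function GetCachedWorld(id: PtrInt): TOrmWorld;
begin
  if (id >= 1) and (id <= length(CachedWorlds)) then
    result := CachedWorlds[id - 1]  // IDs are guaranteed to be 1..10000
  else
    result := nil;
end;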
Offline
For the /raw kind of endpoints, I don't see why it should be avoided. And we don't use a key-value map but an array.
But you are right, it seems to break the rule - even if a cached query without ORM does not make much sense.
For the ORM, we use the standard/canonical mORMot ORM cache: this is what matters IMHO.
Offline
Weights 1.000 1.737 21.745 4.077 68.363 0.163
# JSON 1-query 20-q Fortunes Updates Plaintext Scores
38 731,119 308,233 19,074 288,432 3,431 2,423,283 3,486 2022-10-26 - 64 thread limitation
43 320,078 354,421 19,460 322,786 2,757 2,333,124 3,243 2022-11-13 - 112 thread (28CPU*4)
44 317,009 359,874 19,303 324,360 1,443 2,180,582 3,138 2022-11-25 - 140 thread (28CPU*5) SQL pipelining
51 563,506 235,378 19,145 246,719 1,440 2,219,248 2,854 2022-12-01 - 112 thread (28CPU*4) CPU affinity
51 394,333 285,352 18,688 205,305 1,345 2,216,469 2,586 2022-12-22 - 112 threads CPU affinity + pthread_mutex
34 859,539 376,786 18,542 349,999 1,434 2,611,307 3,867 2023-01-10 - 168 threads (28 thread * 6 instances) no affinity
28 948,354 373,531 18,496 366,488 11,256 2,759,065 4,712 2023-01-27 - 168 threads (28 thread * 6 instances) no hsoThreadSmooting, improved ORM batch updates
16 957,252 392,683 49,339 393,643 22,446 2,709,301 6,293 2023-02-14 - 168 threads, cmem, inproved PG pipelining
15 963,953 394,036 33,366 393,209 18,353 6,973,762 6,368 2023-02-21 - 168 threads, improved HTTP pipelining, PG pipelining uses Sync() as required, -O4 optimization
17 915,202 376,813 30,659 350,490 17,051 6,824,917 5,943 2023-03-03 - 168 threads, minor improvements, Ubuntu 22.02
17 1,011,928 370,424 30,674 357,605 13,994 6,958,656 5,871 2023-03-10 - 224 threads (8 thread * 28 instances) eventfd, ThreadSmooting, update use when..then
Conclusions:
- because of ThreadSmooting the scores changed: json +96, db -6, fortunes +28 = +118 in total
- rawupdates with the new when..then algorithm: -4k RPS (-272 score). But the ORM updates improved by +3k. The good news here is that I am 95% sure this is because we bind int parameters as strings. I will implement binary binding today (retrieving textual results is OK - I verified this several times)
- cached-queries improved 148K -> 349K, mostly because of 8*28 threads + the ready-to-be-serialized TOrm instances
Last edited by mpv (2023-03-14 06:54:08)
Offline
I added binary parameter binding - please see https://github.com/synopse/mORMot2/pull/159
This MR breaks parameter logging, since I rewrite the p^.VInt64 value into network (big-endian) byte order, as required by the PG binary protocol.
Can we extend TSqlDBParam by adding a VInt64BE field? Because the comment below says not to extend it....
// - don't change this structure, since it will be serialized as binary
// for TSqlDBProxyConnectionCommandExecute
TSqlDBParam = packed record
This speeds up a little all the endpoints that retrieve one row (at both the ORM and raw levels). Maybe on the TFB environment the increase will be more significant.
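For clarity, the binary binding essentially means swapping the integer into network byte order before handing the buffer to libpq - a minimal FPC sketch, not the exact PR code:

// Minimal sketch (not the exact PR code): the PG binary format expects
// big-endian integers, so the value is byte-swapped on little-endian CPUs
// before being passed as a binary parameter to libpq.
function Int64ToPgBinary(value: Int64): Int64;
begin
  {$ifdef ENDIAN_LITTLE}
  result := SwapEndian(value); // from the System unit
  {$else}
  result := value;             // already big-endian
  {$endif}
end;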
For /rawupdates on my server there is almost no difference between unnest / when..then / when..then + binary binding (I always get +-25k). So I can't explain why the TFB /rawupdates result is so poor.
The structure for binding an array in binary format is undocumented, quite complex, and uses 4x more traffic than the string representation we currently use, so I do not see any reason to implement binary array binding for unnest (see the array representation here: https://stackoverflow.com/a/66499392).
My proposal is to use when..then for count <= 10 and fall back to unnest otherwise - this gives the best speed for any parameter count.
Offline
Great results for /json /plaintext.
Looking again at the read-only connection table https://techcommunity.microsoft.com/t5/ … 06462#fn:1
for 96 vCPU, compare 100 and 800 connections.
I can't find a similar table for write/update connections, but for sure the impact will be much worse for shared I/O (HDD/SSD).
Maybe you can create one more test (docker) with the same code but reduced threads/connections (2*CPU) and see the differences within one round.
Last edited by ttomas (2023-03-14 14:17:13)
Offline
@mpv
I have merged the PR.
- IMHO the logs should be correct, as I wrote in a comment on the PR: the log text is computed BEFORE BindParams, within the SqlDoBegin(sllSQL) method call.
- Makes sense for array binding to keep the text format.
- about the <= 10 limit: why not, but we must be sure it is not worth using CASE WHEN THEN on the TFB HW, as other frameworks do.
@ttomas
IIRC the docker images run the write tests after the read tests, so it should not affect the performance.
Do you mean that 228 threads is slower than fewer threads?
Anyway, 2*CPU is very likely to be slower for DB access - but it should not change much for /plaintext and /json; and mpv did a lot of testing to arrive at this 8*CPU count, which makes sense. In all cases, our "thread smoothing" algorithm won't use the additional threads unless really necessary.
Offline
@ttomas - I really did verify different thread counts. We work with PG in blocking mode, so we need at least connections = CPU*3 (yes, some of them will be idle periodically). Take a look at the first 3 rows of the results table 3 posts above: for /db the 3rd row with 140 connections is better than the 2nd with 112 connections, and the 2nd is better than the 1st with 64.
@ab - the logging problem is only for /rawqueries - see my comment on GitHub
Offline
@mpv
Please try https://github.com/synopse/mORMot2/commit/2ae346fe about TSqlDBPostgresStatement.GetPipelineResult logging.
Offline
About /rawupdates - in the last round with when..then we got only 4k RPS (13K is the ORM /updates result). I have no idea why the /updates results on the TFB hardware are so strange and do not match my tests, but in fact unnest is better for mORMot.
Offline
@mpv
Please try https://github.com/synopse/mORMot2/commit/2ae346fe about TSqlDBPostgresStatement.GetPipelineResult logging.
It's a little strange. Was it intentional to have an empty q= when retrieving the second and subsequent results?
17:40:07.957 28 DB mormot.db.sql.postgres.TSqlDBPostgresStatement(7fb9c80013c0) Prepare t=2.98ms c=01 q=select id,randomNumber from World where id=?
...
17:40:07.958 28 SQL mormot.db.sql.postgres.TSqlDBPostgresStatement(7fb9c80013c0) Execute t=2.99ms c=01 q=select id,randomNumber from World where id=7991
17:40:07.958 28 SQL mormot.db.sql.postgres.TSqlDBPostgresStatement(7fb9c80013c0) Execute t=2.99ms c=01 q=select id,randomNumber from World where id=4057
17:40:07.958 28 Result mormot.db.sql.postgres.TSqlDBPostgresStatement(7fb9c80013c0) Execute t=3.15ms c=01 r=1 q=select id,randomNumber from World where id=4057
17:40:07.958 28 Result mormot.db.sql.postgres.TSqlDBPostgresStatement(7fb9c80013c0) Execute t=3.24ms c=01 r=1 q=
17:40:07.958 28 Result mormot.db.sql.postgres.TSqlDBPostgresStatement(7fb9c80013c0) Execute t=3.25ms c=01 r=1 q=
...
Offline
@ab
Do you mean that 228 threads is slower than fewer threads?
Yes - looking at the pgbench test table for a 96-CPU server, the best result for read-only is when connections = CPUs. We are testing a Postgres server that depends on the speed of a shared I/O resource (HDD); the bigger the bottleneck, the more that too many threads and too many connections drop performance. Write/update connections put even more stress on the I/O.
@mpv, different HW (HDD/SSD/RAIDx), maybe the TFB HW. I know we lack an async client. On a slow HDD the thread count matters more, as on my old server.
Offline
@mpv
With https://github.com/synopse/mORMot2/commit/24ef0cbc there is no q= any more.
@ttomas
Sadly, we don't have any async coding in FPC. Even with anonymous methods it won't change much...
The only possibility may be to use a state machine instead of blocking code for the DB execution in the /raw endpoints... such a state machine (as we use in the async HTTP/WebSockets server) is even better than modern languages' async code (less overhead).
Offline
Oops. @ab - with the latest sources the TFB tests fail (server crashes) with the message "mormot: double free or corruption (!prev)"
Do your regression tests pass?
Update - it crashes on the commit "fixed TSqlDBPostgresStatement.GetPipelineResult logging"
Last edited by mpv (2023-03-14 16:03:00)
Offline
No problem with the regression tests.
I will investigate further.
Using mORMot 2.1.5100 x64MMs
TSqlite3LibraryStatic 3.41.0 with internal MM
Generated with: Free Pascal 3.2.3 64 bit Linux compiler
Time elapsed for all tests: 2m11
Performed 2023-03-14 17:12:19 by abouchez on tisab
Total assertions failed for all test suits: 0 / 75,969,743
! All tests passed successfully.
Heap dump by heaptrc unit of /home/abouchez/dev/lib2/test/fpc/bin/x86_64-linux/mormot2tests
66769342 memory blocks allocated : 17206425493/17380405688
66769342 memory blocks freed : 17206425493/17380405688
0 unfreed memory blocks : 0
True heap size : 0
True free heap : 0
Ensure you include my latest fix.
Offline
It's strange. Another 3 runs on the latest sources, and everything is OK. Maybe it was a problem on my PC...
I made a TFB PR 8031:
- for /updates with count <=15 using 'case .. when .. then' pattern, for count > 15 - 'unnest' pattern
- using binary parameter binding format for Int4/Int8 parameter types - should be faster than textual
Offline
Our test is used by the TFB team to verify their CI, just because we build quickly.
See https://github.com/TechEmpower/Framewor … 1467058544
Offline
Good news - I found a way to improve PG pipelining performance (rawqueries, rawupdates)
The libpq PQpipelineSync function flushes the socket on every call (it does a write syscall).
I traced the justjs implementation and observed that they send all the pipeline commands in one write syscall. After this I rebuilt libpq with the flush commented out: everything works correctly and performance increased by +4k (10%) RPS for rawqueries and +1k for rawupdates. (I'm using a local Postgres; it should increase even more over the network.) This should add ~150 composite score points.
Now I will either implement PQpipelineSync in Pascal (this needs access to internal libpq structures) or, if I can't, add a modified libpq to the docker file.
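To make the issue concrete, here is a hedged sketch of the pipelining loop using plain libpq calls (the external declarations are assumptions with simplified signatures; libpq >= 14 is required, and this is not the modified-libpq code):

// Sketch only: simplified external declarations (assumptions), requires
// libpq >= 14 for the pipeline API; this is not the modified-libpq code.
uses
  ctypes;

type
  PPGconn = pointer;

function PQenterPipelineMode(conn: PPGconn): cint; cdecl; external 'pq';
function PQsendQueryPrepared(conn: PPGconn; stmtName: PAnsiChar;
  nParams: cint; paramValues: PPAnsiChar; paramLengths, paramFormats: pcint;
  resultFormat: cint): cint; cdecl; external 'pq';
function PQpipelineSync(conn: PPGconn): cint; cdecl; external 'pq';

procedure SendPipelinedSelects(conn: PPGconn; const ids: array of AnsiString);
var
  i: integer;
  value: PAnsiChar;
begin
  PQenterPipelineMode(conn);
  for i := 0 to high(ids) do
  begin
    value := PAnsiChar(ids[i]);
    // queue one prepared SELECT per id, text format, one bound parameter
    PQsendQueryPrepared(conn, 'selectworld', 1, @value, nil, nil, 0);
    // stock libpq: PQpipelineSync() flushes the socket here, i.e. one write()
    // syscall per query; the modified libpq keeps the Sync message in the
    // output buffer, so the whole batch leaves in a single write()
    PQpipelineSync(conn);
  end;
  // ...then PQgetResult() is called in the same order to collect the replies
end;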
Offline
Or even a pull request to the official libpq?
Just a thought: would it be possible to share a single connection among threads, but make it use pipelining instead of blocking commands when running Bind and ExecutePrepared?
Offline
TFB's requirement to call Sync() after each pipeline step is highly debatable. Moreover, this discussion was started by the .NET team, and they may have their reasons for doing so (after all, MS sponsors the Citrine hardware, so they can do it).
My opinion is that we do not require Sync() at all; the Postgres test authors share the same opinion in their test_nosync.
So I decided not to do a PR to libpq but to use a modified version (I placed it on GitHub). The new TFB #8057 is ready, based on the latest mORMot sources.
About the current state - the mORMot results for the 2023-03-16 round are ready, but it seems all frameworks' results are lower in this round, so we can't be sure the binary binding helps. We moved one place up in composite score.
Offline
About using one connection for several threads - I don't like this idea (even if it may improve performance), because it's not "realistic" in terms of transactions. In the TFB bench we don't need transactions, but in real life we do. Several threads may want to commit/rollback their own transactions (in parallel), and this is impossible with a single connection.
In fact, I still don't understand why our /rawdb result is half that of the top "async" frameworks. This is abnormal. The DB server CPU load is ~70% for our test in this case, so the bottleneck is on our side. But where? In the libpq select calls? Or because all our threads are busy? No answer yet, but solving this problem is for sure a way into the top 10. I'm sure we can solve it without going "async".
Offline
And a small hack - set the server name to 'M' instead of the default "mORMot (linux)". Over 7 million responses it matters. I just saw that .NET reports its server as 'K' instead of 'Kestrel' for the TFB tests.
Offline
Thanks for the feedback.
Nice numbers - perhaps a bit disappointing for the integer parameter binding.
We will see how your "hacked libpq" will make a difference - hacking the lib is a good solution.
We could try to make a round with more threads per core, e.g. 16 instead of 8.
Perhaps it makes a difference with their HW...
And on another subject, you were right about the LDAP client: it is a nightmare to develop. I spent 2 days in Wireshark to find out what Microsoft did NOT do for their AD with respect to the RFC.
They don't follow the RFC for the Kerberos handshake, then they make SASL encryption mandatory (whereas it should be customizable)... But at least it seems to work now from Linux and Windows clients. Until the next bug/unexpected MS "feature".
Offline