Good idea
For my part, I'll try to use `mimalloc` in the next run - the top Rust frameworks `ntex` (made by MS, by the way), `may-minihttp`, and `xitca-web` use mimalloc. On my computer, mimalloc works a little worse, but on TFB the situation may change.
From https://tfb-status.techempower.com :
Update 1.10.2024: We have 3 new servers on the way (and will be sending back the current servers). Will have the upgraded specs posted shortly. It may take some time to set up the new environment. We appreciate your patience!
This is good news!
We will have some new HW to play with.
@pavel Hope you are not doing too badly. We think daily about you, your family and your country.
@pavel Hope you are not doing too badly. We think daily about you, your family and your country.
True, hope everything is as OK as it can be...
It is not easy here, in Ukraine, but this winter is so far easier than the previous one - we were ready for the russian terror.
As for TFB, let's see what the new HW brings. It's a shame that our last commit never ran on the old HW - we have no baseline for comparison. After one run on the new HW, I plan to switch our test to mormot@2.2 (and possibly change threads/servers). Life goes on...
After almost a year of discussion, our idea for improving the PostgreSQL pipelining mode has been merged into PostgreSQL upstream. After the Postgres 17 release in September 2024 we will switch to it - for now I do not want to increase the container build time by compiling libpq from sources.
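If I recall correctly, the merged improvement is the non-flushing PQsendPipelineSync() added to libpq in v17 (the flushing PQpipelineSync() exists since v14) - presumably this is the function meant here. A minimal dynamic-binding sketch in FPC, illustrative only:

uses
  SysUtils, DynLibs;

var
  // queue a synchronization point and flush the buffer (libpq >= 14)
  PQpipelineSync: function(conn: pointer): integer; cdecl;
  // queue a synchronization point without flushing (libpq >= 17)
  PQsendPipelineSync: function(conn: pointer): integer; cdecl;

procedure LoadPipelineApi;
var
  h: TLibHandle;
begin
  h := LoadLibrary('libpq.so.5');
  pointer(PQpipelineSync) := GetProcAddress(h, 'PQpipelineSync');
  // stays nil on libpq < 17: callers must fall back to PQpipelineSync()
  pointer(PQsendPipelineSync) := GetProcAddress(h, 'PQsendPipelineSync');
end;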
There is an update about the new TFB environment: https://github.com/TechEmpower/Framewor … 1973835104
56 logical CPUs and a 40Gb network.
Still waiting for the tests to resume...
Not my aunt Suzy's computer configuration...
Even for a corporation, it is a pretty huge and unusual setup, especially the network part.
Only the SSD is a weird choice: a SATA version for benchmarking a database process? In 2004, why not, but in 2024? Really?
Has mORMot 2 not been run in the tests recently?
The TFB benchmarks are not running properly.
After 'martini' (no joke), all tests were aborted by their script.
https://tfb-status.techempower.com/resu … 09bab0d538
They have configuration issues on their side.
The people behind TFB do not seem very experienced at this kind of IT work, and they are having difficulties with their setup.
They received the new HW, but still seem to be fighting with it.
The TFB run on the new hardware just reached mORMot.
And we are not badly ranked:
https://www.techempower.com/benchmarks/ … =composite
25,000,000 pipelined requests per second for the /plaintext test, and 3,000,000 RPS for /json - nice!
Congratulations!
Currently, with 56 CPUs, mormot [async,nopin] (1 process, 112 threads, no pinning) is good for updates, but bad for the other scenarios (compared to async and direct with pinning).
Let's wait for the next run with the latest PR (the current run is based on the same sources as Round 22).
The bad thing is that we don't have a test server that matches the TFB configuration now, so we can't test different thread/server/pin combinations to find the best one.
Hello,
I've been wondering how close the mORMot used in TFB is to the dev version.
Is it synced all the time, or feature by feature? Or are some functionality and changes developed over there and merged back, etc.?
-Tee-
It sounds like we will be in the top 10 with this high-end hardware, with the exact same software (rank #6 eventually, I guess).
Nice to see that we went from #12 to #6 just by upgrading the hardware.
@TPrami We use the very same source code version as last year's official Round 22. It is an old version, and current trunk may be slightly faster.
What we can observe is that Java did not scale as well as Rust and Pascal on this new hardware: the best Java frameworks fell behind mORMot.
I find it interesting. Especially for long-running servers, I would not consider Java a viable solution for leveraging resources (mainly memory). On high-end hardware, Java has structural bottlenecks which prevent it from scaling.
Just.js is more like an experiment - a very clever one, very fast for JavaScript, but not usable in production. We already saw that.
libh2o claims to be the fastest C library for HTTP networking, and the mORMot numbers are very close to it (even better for pipelined /plaintext).
For its DB access, it is no framework, but raw access to the libpq pipelined API, interleaved with the h2o socket layer. Still, it is good reference code to read https://github.com/TechEmpower/Framewor … database.c - perhaps useful if we want to reuse our async socket layer over the PostgreSQL connection.
It shines on the /fortunes endpoint, but at the expense of very complex low-level code, with manual lookup of variable and section names. Imagine the work for a realistic template.
So, performance aside, the libh2o entry is not comparable to mORMot: it is also an experimental solution.
On the contrary, ntex / may-minihttp / xitca-web Rust frameworks are not so experimental.
They leverage the async nature of Rust (its .await syntax), and the code is still readable - especially for Rust.
The runtime behind most of those frameworks seems to be https://github.com/tokio-rs/tokio
It sounds like their HTTP server uses io_uring on a modern Linux kernel, but with no dramatic speed improvement for the /json endpoint. Perhaps we won't need to support io_uring in mORMot immediately, if it makes no difference on such high-end HW compared to our regular epoll-based server.
One more observation is that may-minihttp, ntex, and xitca-web use mimalloc. I also plan to try it in one of the next rounds. I've already tried mimalloc on my server - performance is the same as with the libc allocator, but on the new TFB hardware it might change the numbers a bit.
P.S.
#6 is a good place IMHO
Last edited by mpv (2024-04-10 15:21:17)
Yes, #6 is already great!
Perhaps mimalloc could help a little.
I have just tried to optimize the TFB /rawfortunes and /asyncfortunes endpoints:
- it is where we are the furthest behind the other frameworks (only 66% of the best);
- we no longer allocate the TFortune.Message string but use a PUtf8Char (avoiding some locked refcount processing);
- we reuse a per-thread Mustache rendering context (stored within the per-thread connection instance).
https://github.com/synopse/mORMot2/commit/1a71fbf0
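The gist of the change, as a rough sketch (illustrative names, not the actual raw.pas declarations):

type
  TFortune = record
    id: integer;
    message: PUtf8Char; // borrowed pointer into the DB result buffer:
                        // no RawUtf8 allocation, no locked refcount
  end;
  // stored within the per-thread/per-connection instance, so the parsed
  // template and the working buffers are reused between requests
  TFortunesContext = class
  protected
    fMustache: TSynMustache;      // parsed once at startup
    fFortunes: array of TFortune; // reused buffer between renderings
  end;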
Please try it on your side with PostgreSQL.
But I suspect it won't be enough.
The problem seems to be in the DB request itself.
We are also only at around 66% of the best frameworks with the single-query /rawdb endpoint.
Perhaps we could improve both entries, if we can fix the bottleneck of a single SELECT.
It is weird that [mormot-async] is slower than [mormot-direct] and [mormot-orm] for these single SELECT requests.
We may be able to do better. My guess is that TSqlDBPostgresAsyncThread.Execute could be done in a better way.
Edit:
Maybe https://github.com/synopse/mORMot2/commit/c4c43f03 could help a little.
The TSqlDBPostgresAsyncThread.Execute method was perhaps calling sleep() too much.
I hope [mormot-async] could benefit from this.
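The general pattern - a sketch only, not the actual TSqlDBPostgresAsyncThread code: block on an event which is set when a statement is queued, instead of polling with a fixed sleep().

uses
  Classes, SyncObjs;

type
  TAsyncDbThread = class(TThread)
  protected
    fWakeUp: TEvent; // auto-reset event, set by the producer side
    procedure Execute; override;
  public
    constructor Create;
    procedure NotifyStatementQueued;
  end;

constructor TAsyncDbThread.Create;
begin
  fWakeUp := TEvent.Create(nil, {manual=}false, {signaled=}false, '');
  inherited Create({suspended=}false);
end;

procedure TAsyncDbThread.NotifyStatementQueued;
begin
  fWakeUp.SetEvent; // wake the worker at once - no sleep() granularity
end;

procedure TAsyncDbThread.Execute;
begin
  while not Terminated do
    if fWakeUp.WaitFor(500) = wrSignaled then
    begin
      // drain the pending statements and read the pipelined results here
    end;
end;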
Yes, disabling the sleep should help for the asynchronous server, because according to dstat (it is running right now) the processor is ~66% idle (only 40% is used). By the way, for raw, idle is ~44%, which is also too high; for example, h2o idle is ~20%.
In today's result based on mORMot2.2.stable, our score increased from 18,281 to 19,303 - I hope we will be #5.
I'll test the fresh sources on my computer (my test server is in a region with power outages and is unavailable) and make an MR (hopefully today).
@ab - it seems escaping is broken in the latest Mustache implementation.
A valid fortune response should be (with <script> escaped):
<tr><td>11</td><td>&lt;script&gt;alert(&quot;This should not be displayed in a browser alert box.&quot;);&lt;/script&gt;</td></tr>
but the current implementation does not escape the {{ }} template value, and the result is
<tr><td>11</td><td><script>alert("This should not be displayed in a browser alert box.");</script></td></tr>
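A quick way to reproduce it in isolation, using the documented TSynMustache API (mORMot 2 unit names, from memory):

uses
  mormot.core.base,
  mormot.core.mustache;

var
  m: TSynMustache;
begin
  m := TSynMustache.Parse('<td>{{message}}</td>');
  // {{message}} must HTML-escape; {{{message}}} would emit the raw value
  writeln(m.RenderJson('{"message":"<script>alert(1)</script>"}'));
  // expected: <td>&lt;script&gt;alert(1)&lt;/script&gt;</td>
end.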
Last edited by mpv (2024-04-15 18:25:19)
Oops...
The logic was inverted in https://github.com/synopse/mORMot2/comm … 39b7438856
Please try with https://github.com/synopse/mORMot2/commit/2ada2b4b
BTW I am not sure that mimalloc could make a real difference with the latest version of /rawfortunes, because there is almost no memory allocation left after https://github.com/synopse/mORMot2/commit/1a71fbf01a770
Verified with 2.2.7351 - escaping is now OK. TFB MR #8883 is ready.
I won't bother with mimalloc either - we'll see how it goes with the current improvements.
The last MR gives better results.
Not changing the rank, but higher numbers, especially for the single SELECT runs (we are now at 70% of the best instead of 60%).
https://www.techempower.com/benchmarks/ … =composite
Perhaps we could try another MR including some of the latest commits:
https://github.com/synopse/mORMot2/commit/72934e6609 (avoid memory alloc of the TTextWriter)
and
https://github.com/synopse/mORMot2/commit/15e284e9b05 (small optimization of raw.dpr)
and perhaps also
https://github.com/synopse/mORMot2/commit/d312c00d (faster random world id)
If memory is still a bottleneck, TTextWriter reuse could help a little more, for all entries but /fortunes.
And using our own TLecuyer instance avoids a somewhat slow access to a threadvar.
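The idea, as a sketch - TLecuyer and its Next() method are from mormot.core.base, while the class and field names here are hypothetical:

uses
  mormot.core.base;

type
  TRawWorker = class
  protected
    // one generator per instance: no threadvar lookup on each call;
    // only safe if a single thread ever uses a given instance
    fRnd: TLecuyer;
  public
    function ComputeRandomWorld: cardinal;
  end;

function TRawWorker.ComputeRandomWorld: cardinal;
begin
  result := fRnd.Next(10000) + 1; // TFB world IDs are 1..10000
end;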
@mpv question #1
BTW, I have seen that some frameworks use a small parameter name for the queries, e.g. ?q= in https://github.com/TechEmpower/Framewor … ain.rs#L18
Perhaps we could also use this trick, if it is allowed, and always use a fixed 'Q=' search parameter in GetQueriesParamValue(ctxt)?
function GetQueriesParamValue(ctxt: THttpServerRequest): cardinal; inline;
begin
  if not ctxt.UrlParam('Q=', result) or
  ...
and change the test URIs to use the ?q= encoded parameter instead of ?queries= and ?count=..., with
"query_url": "/queries?q=",
"fortune_url": "/fortunes",
"update_url": "/updates?q=",
"plaintext_url": "/plaintext",
"cached_query_url": "/cached-queries?q=",
in the config file.
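For reference, a complete body could look like this - the 1..500 clamping comes from the TFB requirements, and the actual raw.pas code may differ:

function GetQueriesParamValue(ctxt: THttpServerRequest): cardinal; inline;
begin
  if not ctxt.UrlParam('Q=', result) or
     (result = 0) then
    result := 1    // missing or invalid parameter means a single query
  else if result > 500 then
    result := 500; // TFB caps the queries parameter at 500
end;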
@mpv question #2
It is interesting also how connection pools are implemented in may-minihttp.
They allocate a connection pool of 1000 instances, and assign one to each connection, using a modulo of the connection ID (a sequence, I guess).
It may be a better way than our per-thread connection pool, for single queries... and it may help to avoid forking the executable.
IIRC the max number of connections for DB benchmarking is up to 512 concurrent clients, so each HTTP client connection would have its own DB access. Only /plaintext scales up to 16384 clients, but with no DB involved.
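The may-minihttp scheme, as a hedged Pascal sketch - TSqlDBConnection is the real mORMot class, but the pool itself is hypothetical:

const
  POOL_SIZE = 1000;
var
  pool: array[0 .. POOL_SIZE - 1] of TSqlDBConnection; // pre-opened at startup

function ConnectionFor(httpConnectionID: Int64): TSqlDBConnection;
begin
  // note that two HTTP connections may still share a slot (IDs 1 and 1001
  // collide), which is the objection raised in the next post
  result := pool[httpConnectionID mod POOL_SIZE];
end;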
I'll make an MR with the latest changes..
As for `q=`, I tested this case a year ago and it gained nothing. I'm afraid that even if we do this, we will get a ton of criticism, just like with Server-Name. I'll make it a separate MR (after the MR with the latest changes).
About `using a modulo of the connection ID` - but what if we have 1001 clients? We can't use one connection for both client 1 and client 1001. As far as I remember, I tested an implementation with a per-worker connection (call ThreadSafeConnection once and memoize it in the worker context) and performance is nearly the same as with ThreadSafeConnection.
BTW, currently our raw server creates 448 DB connections (num servers=56, threads per server=8, total threads=448, total CPU=56, accessible CPU=56, pinned=TRUE, db=PostgreSQL); maybe I'll increase `threads per server` to 10 to get 560 connections, so each concurrent client will have its own - that might work (after the MR with q=).
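The per-worker memoization mentioned above, reconstructed as a sketch - ThreadSafeConnection is the real mormot.db.sql API, the worker class is hypothetical:

type
  TWorker = class
  protected
    fProps: TSqlDBConnectionProperties; // shared pool
    fConn: TSqlDBConnection;            // memoized per worker
  public
    function Conn: TSqlDBConnection;
  end;

function TWorker.Conn: TSqlDBConnection;
begin
  if fConn = nil then
    fConn := fProps.ThreadSafeConnection; // resolved once, then cached
  result := fConn;
end;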
Last edited by mpv (2024-04-23 14:58:51)
@mpv
Makes sense!
So we will try with the latest changes in the next round, and I won't investigate the connection pool any further. Perhaps 10 threads per server may help, since all tests use up to 512 connections.
Note that the /async* endpoints create an additional connection per existing connection: so it would create 1024 connections in all - I don't know whether that would be too much.
56*(8-1)=392 db connections
Looking at the Single query data table, the async servers have worse results than ORM!
[async,nopin] mormot-postgres-async2 uses 1 server with 56*2 threads (56*2-1=111 db connections),
while all other tests use 56*(8-1)=392 db connections.
Not fair for [async,nopin]!
Edited:
Looking at the Data updates data table, for multiple updates (10, 15, 20) [async,nopin] benefits from the lower number of db connections.
Last edited by ttomas (2024-04-24 12:11:09)
@ttomas - I'll increase the threads 2 -> 4 for [async,nopin]
@ab - Please fix raw.pas, because it does not compile (TRawAsyncServer.ComputeRandomWorld is not accessible from inside TAsyncWorld.DoUpdates / Queries) - I'll follow your fix and make an MR
@mpv
Try with https://github.com/synopse/mORMot2/commit/7b76af41
Now it compiles; all tests except the updates pass, but for /updates (and /rawupdates) the TFB validation does not pass.
FAIL for http://tfb-server:8080/updates?queries=501
Only 1 items were updated in the database out of roughly 500 expected.
See https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Framework-Tests-Overview#specific-test-requirements
PASS for http://tfb-server:8080/updates?queries=20
Executed queries: 10752/10752
See https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Framework-Tests-Overview#specific-test-requirements
PASS for http://tfb-server:8080/updates?queries=20
Rows read: 10635/10240
See https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Framework-Tests-Overview#specific-test-requirements
FAIL for http://tfb-server:8080/updates?queries=20
Only 506 rows updated in the database out of roughly 10240 expected
@ab - can you imagine what could have happened? The methods themselves have not changed... (Sorry, I'm not able to debug at the moment, only run tests in docker)
I guess this is because of a collision when the TLecuyer random generator is shared between threads.
Should be fine now with
https://github.com/synopse/mORMot2/commit/4f100299
Sorry for the delay. Now it crashes in TRawAsyncServer.rawqueries, because an uninitialized PLecuyer is passed to GetRawRandomWorlds.
@ab - maybe we should go back to https://github.com/synopse/mORMot2/commit/d312c00d ? Because these Lecuyer instances, which are now everywhere, have made the code unreadable (and not necessarily faster).
The previous run has finished.
We went higher, so we are #6 now, above redkale.
https://www.techempower.com/benchmarks/ … =composite
Sadly, the pending pull requests have not been integrated into the new run.
https://github.com/TechEmpower/Framewor … arks/pulls
We will be able to see the stability of their HW installation: numbers should stay the same with no software update.
Unfortunately, with MR 8949, it seems that the server crashed on /cached-queries (orm), so plaintext was not executed and we do not appear in the composite score this time. We will see what happened in the text logs after the round ends.
Increasing the number of threads for the asynchronous server from cpuCount * 2 to cpuCount * 4 does not increase the speed of async(nopin) - it becomes slower.
But the /cached-queries implementation for the raw server is now #1!
I will look into /cached-queries for the ORM.
It should not trigger any GPF at all when only reading values.
I am not sure https://github.com/synopse/mORMot2/commit/2b2957f0 would be enough. I don't understand why Value could be nil at any point during our run...
Since I did not change anything in this part of the ORM, I suspect it is a pre-existing problem, which we have already faced randomly (AFAIR).
This time we crashed on /plaintext (not on cached queries) at 256 concurrency. See https://tfb-status.techempower.com/unzi … xt/raw.txt
Let's wait for the next round to see whether this is a heisenbug or not.
Last edited by mpv (2024-05-13 20:26:54)
Sounds like there is a problem with /plaintext - we got 600K instead of 2M requests per second, certainly because the server crashed and was not able to finish all the requests.
/json and the others seem stable, so I guess it is fine to assume the problem is with pipelining.
I was not able to reproduce it yet.
Tomorrow I will try again.
Yes, the crash is in /plaintext for 56 pinned servers with 8 threads each - this is why the ORM cached-queries test was not executed at all.
The error is:
malloc(): invalid next size (unsorted)
Last edited by mpv (2024-05-16 21:25:28)
From the error message, it sounds as if overwriting the end of the allocated memory by a single byte is enough to kill malloc().
It should not be the case here, because we always allocate 4 chars more than needed for strings in our framework...
I will check the code again, and try to reproduce it.
Edit: I tried for hours to reproduce the issue, and I was not able to make it break...
Perhaps we could at least update our source code to today's version for TFB.
Edit 2: I got one error once when disconnecting clients:
$ ./raw --nopin -s 4
THttpAsyncServer running on localhost:8080
num servers=4, threads per server=8, total threads=32, total CPU=20, accessible CPU=20, pinned=0, db=SQLite
options=hsoReusePort
Press [Enter] or Ctrl+C or send SIGTERM to terminate
corrupted size vs. prev_size while consolidating
Aborted
Then I enabled the logs to have more context, then it... did not break...
Ahh.. years ago I also tried to catch this bug, without success. I even tried valgrind in memcheck mode, but there are so many warnings (mostly in the mORMot crypto code) that the root of the evil stays hidden.
For now I have made MR 9036, which replaces the glibc MM with mimalloc.
I suggest we start looking into the valgrind warnings (see https://valgrind.org/docs/manual/mc-manual.html). Maybe if we remove them one by one, we will find the problem. By the way, the heisenbug also exists in mORMot1 - very rarely, about once a week with an average load of 1000 RPS, I get an AV in my mORMot1 application in production.
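For reference, the MM swap via LD_PRELOAD needs no recompilation - something like the following, assuming the Ubuntu libmimalloc package path:
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so.2 ./raw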
Is the LD_PRELOAD trick still working on Ubuntu 22.04?
IIRC it was disabled on recent systems because replacing the memory manager is a typical unsafe pattern.
Do you have any valgrind report already available - not in mormot2tests but in our raw TFB sample?
Could you send me one?
There is no crypto invoked in our process, so I doubt we need to look at anything in mormot.core.crypto.
Valgrind was reporting some uninitialised variables
https://synopse.info/forum/viewtopic.ph … 932#p38932
but it sounds like a perfectly valid assumption to have all threadvar and var global instances filled with zeroes. This is even what FPC expects.
I expect that LD_PRELOAD works, because with an invalid path to the library I was getting a warning.
I still can't run ./raw on my machine, only in docker. I hope to repair it and check valgrind again.
I think the current run fails again because of /plaintext (therefore cached-queries is not started at all).
BTW, the previous run crashed on /plaintext with the message munmap_chunk(): invalid pointer (the one before that with malloc(): invalid next size (unsorted)).
The PR with mimalloc was just merged - let's see how we run with it in the next round.
The result with mimalloc is ready. The server crashes again on /plaintext (but runs a little longer).
The previous round crashed with double free or corruption (out); the log for the mimalloc round is not ready yet - maybe its output gives us some ideas.
But in general, mimalloc is slower than glibc, so I removed it and moved /plaintext to be the last test (currently it runs just before /cached-queries).
These are the valgrind warnings I got on /plaintext with pipelining:
==1515797== Thread 14 R1:8080:
==1515797== Use of uninitialised value of size 8
==1515797== at 0x527D43: MORMOT.CORE.DATA$_$TDYNARRAYHASHER_$__$$_FINDORNEWCOMP$LONGWORD$POINTER$TDYNARRAYSORTCOMPARE$$INT64 (mormot.core.data.pas:9774)
==1515797== by 0x9CC166DE: ???
==1515797== by 0x6D437AF: ???
==1515797== by 0x492B7F: ??? (in /home/pavelmash/dev/mORMot2/ex/techempower-bench/exe/raw)
==1515797==
==1515797== Use of uninitialised value of size 8
==1515797== at 0x527BCF: MORMOT.CORE.DATA$_$TDYNARRAYHASHER_$__$$_FINDORNEW$LONGWORD$POINTER$PPTRINT$$INT64 (mormot.core.data.pas:9718)
==1515797== by 0xDE: ???
==1515797== by 0x9CC166DE: ???
==1515797==
==1515797== Conditional jump or move depends on uninitialised value(s)
==1515797== at 0x5281D5: MORMOT.CORE.DATA$_$TDYNARRAYHASHER_$__$$_FINDBEFOREADD$POINTER$BOOLEAN$LONGWORD$$INT64 (mormot.core.data.pas:9891)
==1515797== by 0xFFFFFFFFFFFFFF1F: ???
==1515797==
==1515797== Use of uninitialised value of size 8
==1515797== at 0x527E88: MORMOT.CORE.DATA$_$TDYNARRAYHASHER_$__$$_HASHADD$LONGWORD$INT64 (mormot.core.data.pas:9815)
==1515797== by 0xAF5034F: ???
==1515797==
==1515797== Conditional jump or move depends on uninitialised value(s)
==1515797== at 0x48881F: MORMOT.CORE.BASE_$$_BYTESCANINDEX$PBYTEARRAY$INT64$BYTE$$INT64 (mormot.core.base.asmx64.inc:2091)
==1515797== by 0x6BA1E1: MORMOT.NET.HTTP$_$THTTPREQUESTCONTEXT_$__$$_PROCESSREAD$TPROCESSPARSELINE$BOOLEAN$$BOOLEAN (mormot.net.http.pas:3693)
==1515797==
==1515797== Thread 104 R1:8080:
==1515797== Use of uninitialised value of size 8
==1515797== at 0x527E88: MORMOT.CORE.DATA$_$TDYNARRAYHASHER_$__$$_HASHADD$LONGWORD$INT64 (mormot.core.data.pas:9815)
==1515797== by 0xAF5198F: ???
.....
Please wait: Shutdown 12 servers and 96 threads
==1515797== Thread 64 R1:8080:
==1515797== Conditional jump or move depends on uninitialised value(s)
==1515797== at 0x48881F: MORMOT.CORE.BASE_$$_BYTESCANINDEX$PBYTEARRAY$INT64$BYTE$$INT64 (mormot.core.base.asmx64.inc:2091)
==1515797== by 0x6BA11C: MORMOT.NET.HTTP$_$THTTPREQUESTCONTEXT_$__$$_PROCESSREAD$TPROCESSPARSELINE$BOOLEAN$$BOOLEAN (mormot.net.http.pas:3672)
==1515797== by 0xAF52E8F: ???
==1515797== by 0x6DEA18: MORMOT.NET.ASYNC$_$THTTPASYNCSERVERCONNECTION_$__$$_FLUSHPIPELINEDWRITE$$TPOLLASYNCSOCKETONREADWRITE (mormot.net.async.pas:4283)
==1515797== by 0x1FFEFFF06F: ???
==1515797== by 0x49247CF: ??? (pthread_create.c:321)
==1515797== by 0x76B61FF: ???
==1515797== by 0xA2D9900: ???
==1515797== by 0x76B61EF: ???
==1515797== by 0x6DED35: MORMOT.NET.ASYNC$_$THTTPASYNCSERVERCONNECTION_$__$$_ONREAD$$TPOLLASYNCSOCKETONREADWRITE (mormot.net.async.pas:4334)
==1515797==