Fresh results
Max RPS:
┌──────────────┬────────────┬────────────┬────────────┬────────────┐
│   (index)    │mormot(0720)│mormot(0730)│mormot(0801)│  lithium   │
├──────────────┼────────────┼────────────┼────────────┼────────────┤
│   fortune    │   74318    │   90500    │   91287    │   90064    │
│  plaintext   │   920198   │   977024   │   986253   │  3388906   │
│      db      │   111119   │   116756   │   117624   │   99463    │
│    update    │   10177    │   10108    │   10981    │   25718    │
│     json     │   422771   │   446284   │   458358   │   544247   │
│    query     │   106665   │   113516   │   114842   │   94638    │
│ cached-query │   384818   │   416903   │   419020   │   528433   │
└──────────────┴────────────┴────────────┴────────────┴────────────┘
We have reached a level of performance at which room temperature changes affect the measurements. So I upgraded my PC case to a mid-tower with a manual cooler speed switch. During normal work I set it to minimum for silence, and during measurements to maximum, to prevent the CPU temperature from rising.
Offline
Perhaps with hsoNoStats to disable low-level statistic counters, it may vary a little bit.
Or it may be the room temperature.
https://github.com/synopse/mORMot2/commit/59e1f82c
Offline
I added the new hsoNoStats option.
Also some other small (1%) improvements in PR #111:
- our StrLen is twice as fast as PQGetLength (0.2% vs 0.4% of CPU on the TFB /db test)
- prevent an unnecessary PQGetIsNull call - it only needs to be made for empty values, to distinguish NULL from an empty string (0.6%); see the sketch below
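For illustration, here is a rough sketch of that last idea using the raw libpq entry points (the ColumnIsNull helper and the external declarations are mine, for illustration only - this is not the actual mORMot code):

const
  LIBPQ = 'libpq.so.5';

// raw libpq imports used by the sketch
function PQgetlength(res: pointer; row, col: integer): integer;
  cdecl; external LIBPQ name 'PQgetlength';
function PQgetisnull(res: pointer; row, col: integer): integer;
  cdecl; external LIBPQ name 'PQgetisnull';

// hypothetical helper: only an empty value can actually be NULL,
// so the extra PQgetisnull() call is needed just for that case
function ColumnIsNull(res: pointer; row, col: integer): boolean;
begin
  if PQgetlength(res, row, col) <> 0 then
    result := false
  else
    result := PQgetisnull(res, row, col) = 1;
end;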
Last edited by mpv (2022-08-01 14:45:44)
Offline
How fast is the new MoveFast in mORMot? So fast that I decided to add drogon - the TFB #1 - for comparison (results are without the latest PostgreSQL improvements).
Max RPS:
┌──────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│   (index)    │mormot(0720)│mormot(0730)│mormot(0801)│ mormot(mf) │   drogon   │  lithium   │
├──────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│   fortune    │   74318    │   90500    │   91287    │   113073   │   176131   │   90064    │
│  plaintext   │   920198   │   977024   │   986253   │  1436231   │  3583444   │  3388906   │
│      db      │   111119   │   116756   │   117624   │   153009   │   176776   │   99463    │
│    update    │   10177    │   10108    │   10981    │   15476    │   90230    │   25718    │
│     json     │   422771   │   446284   │   458358   │   590979   │   554328   │   544247   │
│    query     │   106665   │   113516   │   114842   │   148187   │   171092   │   94638    │
│ cached-query │   384818   │   416903   │   419020   │   547307   │            │   528433   │
└──────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
Last edited by mpv (2022-08-01 18:10:18)
Offline
Please find some small changes in https://gist.github.com/synopse/83f4522 … 70d25b9c4d
- use the new TSynMustache.RenderDataArray(), which allows rendering directly from the local variables, with no temporary TDocVariant conversion, for fortune
- tuned TRestBatch options for update
- use an external count variable for rawfortune
We should note that drogon is an interesting C++ framework, much more realistic than lithium, even if what it calls an "ORM" is not really a runtime ORM - there is still a lot of boilerplate code to generate at compile time, as in https://github.com/TechEmpower/Framewor … Fortune.cc . It does not start from an Object, it generates some class types from a .cc controller file.
Ahead-of-time code generation is of course very efficient in terms of performance, but we lose the ORM approach.
There is still a lot of code to write for the methods; for instance, it seems to support only manual JSON serialization.
Offline
Today's state: +10% for fortunes thanks to TSynMustache.RenderDataArray()
Max RPS:
┌──────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│   (index)    │mormot(0720)│mormot(0730)│mormot(0801)│mormot(0802)│mormot(0813)│   drogon   │  lithium   │
├──────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│   fortune    │   74318    │   90500    │   91287    │   113073   │   126055   │   176131   │   90064    │
│  plaintext   │   920198   │   977024   │   986253   │  1436231   │  1373177   │  3583444   │  3388906   │
│      db      │   111119   │   116756   │   117624   │   153009   │   154033   │   176776   │   99463    │
│    update    │   10177    │   10108    │   10981    │   15476    │   15336    │   90230    │   25718    │
│     json     │   422771   │   446284   │   458358   │   590979   │   584294   │   554328   │   544247   │
│    query     │   106665   │   113516   │   114842   │   148187   │   149122   │   171092   │   94638    │
│ cached-query │   384818   │   416903   │   419020   │   547307   │   551230   │            │   528433   │
└──────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
Last edited by mpv (2022-08-13 14:09:15)
Offline
Hi ab,
I have been following mORMot for a long time because your work is amazing! I'm not sure about a runtime ORM. Here is what my ORM does:
/// <summary>
///
/// </summary>
  TDBException = class(TDBModel)
  public
    const FIELD_ID          = 'id';
    const FIELD_MESSAGE     = 'message';
    const FIELD_STACK_TRACE = 'stack_trace';
    const FIELD_MODULE      = 'module';
    const FIELD_CREATED_AT  = 'created_at';
  private
    fID        : Int64;
    fMessage   : string;
    fStackTrace: string;
    fModule    : string;
    fCreatedAt : TDateTime;
    procedure SetMessage(const aValue: string);
    procedure SetStackTrace(const aValue: string);
    procedure SetModule(const aValue: string);
    procedure SetCreatedAt(const aValue: TDateTime);
  protected
    function DoGetFieldValue(const aField: string): TValue; override;
    class function DoGetFieldType(const aField: string): TFieldType; override;
    procedure DoSetInteger(const aField: string; const aValue: Int64); override;
    procedure DoSetDouble(const aField: string; const aValue: Double); override;
    procedure DoSetText(const aField: string; const aValue: string); override;
  public
    class function GetFieldList: TArray<string>; override;
    class function GetDataFieldList: TArray<string>; override;
    class function GetPrimaryKeyFields: TArray<string>; override;
    class function GetTableName: string; override;
  public
    property ID: Int64 read fID;
    property Message: string read fMessage write SetMessage;
    property StackTrace: string read fStackTrace write SetStackTrace;
    property Module: string read fModule write SetModule;
    property CreatedAt: TDateTime read fCreatedAt write SetCreatedAt;
  end;
class function TDBException.DoGetFieldType(const aField: string): TFieldType;
begin
  if InArray(aField, [FIELD_ID]) then
    Result := ftInteger
  else if InArray(aField, [FIELD_MESSAGE, FIELD_STACK_TRACE, FIELD_MODULE]) then
    Result := ftText
  else if InArray(aField, [FIELD_CREATED_AT]) then
    Result := ftDouble
  else
    Result := ftNone;
end;

function TDBException.DoGetFieldValue(const aField: string): TValue;
begin
  if aField = FIELD_ID then
    Result := fID
  else if aField = FIELD_MESSAGE then
    Result := fMessage
  else if aField = FIELD_STACK_TRACE then
    Result := fStackTrace
  else if aField = FIELD_MODULE then
    Result := fModule
  else if aField = FIELD_CREATED_AT then
    Result := fCreatedAt;
end;

procedure TDBException.DoSetDouble(const aField: string; const aValue: Double);
begin
  if aField = FIELD_CREATED_AT then
    fCreatedAt := aValue;
end;

procedure TDBException.DoSetInteger(const aField: string; const aValue: Int64);
begin
  if aField = FIELD_ID then
    fID := aValue;
end;

procedure TDBException.DoSetText(const aField: string; const aValue: string);
begin
  if aField = FIELD_MESSAGE then
    fMessage := aValue
  else if aField = FIELD_STACK_TRACE then
    fStackTrace := aValue
  else if aField = FIELD_MODULE then
    fModule := aValue;
end;

class function TDBException.GetDataFieldList: TArray<string>;
begin
  Result := [
    FIELD_MESSAGE,
    FIELD_STACK_TRACE,
    FIELD_MODULE,
    FIELD_CREATED_AT
  ];
end;

class function TDBException.GetFieldList: TArray<string>;
begin
  Result := [
    FIELD_ID,
    FIELD_MESSAGE,
    FIELD_STACK_TRACE,
    FIELD_MODULE,
    FIELD_CREATED_AT
  ];
end;

class function TDBException.GetPrimaryKeyFields: TArray<string>;
begin
  Result := [
    FIELD_ID
  ];
end;

class function TDBException.GetTableName: string;
begin
  Result := 'exceptions';
end;

procedure TDBException.SetCreatedAt(const aValue: TDateTime);
begin
  if (CompareDateTime(fCreatedAt, aValue) <> EqualsValue) or IsNull(FIELD_CREATED_AT) then
  begin
    fCreatedAt := aValue;
    MarkNotNullAndDirtyField(FIELD_CREATED_AT);
  end;
end;

procedure TDBException.SetMessage(const aValue: string);
begin
  if (fMessage <> aValue) or IsNull(FIELD_MESSAGE) then
  begin
    fMessage := aValue;
    MarkNotNullAndDirtyField(FIELD_MESSAGE);
  end;
end;

procedure TDBException.SetModule(const aValue: string);
begin
  if (fModule <> aValue) or IsNull(FIELD_MODULE) then
  begin
    fModule := aValue;
    MarkNotNullAndDirtyField(FIELD_MODULE);
  end;
end;

procedure TDBException.SetStackTrace(const aValue: string);
begin
  if (fStackTrace <> aValue) or IsNull(FIELD_STACK_TRACE) then
  begin
    fStackTrace := aValue;
    MarkNotNullAndDirtyField(FIELD_STACK_TRACE);
  end;
end;
This is just an example. It does not use RTTI. It runs fast in my use cases. If you would like to use the idea behind my ORM, please let me know, and I will post the full source here!
Offline
Hello nglthach and welcome.
But please follow the forum rules, and don't post huge sets of code in the forum itself.
See https://synopse.info/forum/misc.php?action=rules
Use gist for instance to share some code.
So if I understand correctly, you don't use RTTI but you generate some pascal code from a data model?
This is not really what an ORM is - because, to be honest, there is no Object mapping. It is close to what the pseudo-ORMs of lithium or drogon do in the benchmark.
The benefit of RTTI is that you can reuse a lot of code, once you handle the type mapping between pascal RTTI and SQL columns.
I am not sure your approach, which is text-based for the field lookup, will make a huge performance difference - on the contrary, using RTTI and a bitset is likely to be faster, as we do in TOrmProperties/TOrmModelProperties, especially if you have more than a few fields.
The mORMot approach is not perfect either - it is not a "pure ORM"; for instance we are limited to a TID=Int64 primary key. But this limitation allows us to use the SQLite3 virtual table mechanism, which is a huge benefit. We also have to inherit from the TOrm class, which has a lot of benefits too, because we can use this base type in our ORM CRUD layer with no need for bloated generics, for instance.
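To illustrate the RTTI approach, here is a minimal sketch of the same kind of table expressed as a TOrm descendant - every published property is mapped to a SQL column via RTTI, so none of the per-field boilerplate is needed (the TOrmException name and field layout are just for illustration):

uses
  mormot.core.base,
  mormot.orm.core;

type
  // the inherited ID: TID (Int64) property is the primary key;
  // each published property becomes a column thanks to RTTI
  TOrmException = class(TOrm)
  private
    fMessage: RawUtf8;
    fStackTrace: RawUtf8;
    fModule: RawUtf8;
    fCreatedAt: TDateTime;
  published
    property Message: RawUtf8 read fMessage write fMessage;
    property StackTrace: RawUtf8 read fStackTrace write fStackTrace;
    property Module: RawUtf8 read fModule write fModule;
    property CreatedAt: TDateTime read fCreatedAt write fCreatedAt;
  end;

Register such a class in a TOrmModel and the ORM CRUD layer (Add/Retrieve/Update...) works on it with no per-class code.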
You can share some code on gist or github: it is always interesting to share ideas.
Try to make something similar to https://github.com/synopse/mORMot2/tree … xtdb-bench with your ORM, and share the numbers.
Offline
Hi ab,
Thank you for the notice! I will follow the forum rules. I do know there are many things that need to be optimized. I will do a benchmark and post it in another thread. Now let's return this thread to the TechEmpower Framework Benchmarks.
Offline
I have included a copy of the TFB mORMot test in the repository.
See https://github.com/synopse/mORMot2/tree … ower-bench
With some additional information, and above numbers.
Offline
Please allow me an uneducated question (this is far outside my expertise):
Would building these same benchmarks with both the FPC and Delphi compilers and running them on the same hardware provide any meaningful comparison between the compilers as well, or would the current tests not provide relevant data?
Offline
On Delphi, I expect it to be slower for such a benchmark.
Because
1. it will run only on Windows - and from our tests, both the http.sys and the socket server are slower there than on Linux
2. it won't use our x86_64 memory manager, which scales better than the original FastMM4
3. some of our AVX/AVX2 asm compiles only on FPC
So it won't be a fair comparison - unless we only compare the Windows builds, but that is not the main point of such benchmarks.
In production, I would use a Linux server.
In practice, Delphi and FPC have a similar level of code generation on x86_64. On Win32, Delphi generates slightly better code in my experience.
And mORMot bypasses most of the RTL to use its own routines, so the difference is not noticeable.
Offline
Our TFB pull 7481 is merged into master.
The next tfb-status check starts in ~97 hours, so we should get results after ~225 hours, i.e. around 2022-09-01.
Offline
We have never tested mORMot before on such powerful hardware (Intel Xeon Gold 5120 CPU, 32 GB RAM, enterprise SSD; 3 servers (DB, app and load generator) connected by a dedicated Cisco 10-gigabit link), so some unexpected things may happen, but I hope everything will be OK.
Last edited by mpv (2022-08-22 16:07:04)
Offline
If ZLib is used - I happened to stumble upon this:
https://aws.amazon.com/blogs/opensource … lib-forks/
These are quite a bit faster forks of ZLib; some are not API compatible, and some are only for in-memory operations, but anyhow, check those out.
-Tee-
Last edited by TPrami (2022-08-23 05:58:18)
Offline
For the TFB tests compression is not permitted - see rule ix of the requirements.
In mORMot client <-> mORMot server scenarios the proprietary SynLZ is used.
In real-life high-load Web scenarios (mORMot <-> reverse proxy <-> browser) IMHO the preferred compression is Brotli (see the comparison with gzip), and it can be enabled at the reverse proxy level.
For gzip compression mORMot uses libdeflate. @ab - it would be a good idea to note the sources used to build the static libraries for mORMot in statics/README.md - `/res/static/` is enough.
And inside the /res/static/ library folders - a link to the original sources, because currently it is not clear which exact implementation is used.
Last edited by mpv (2022-08-23 07:18:53)
Offline
Just follow the Readme link and you will find https://github.com/synopse/mORMot2/tree … res/static
All the information and the original C code are there.
And from my tests, libdeflate at level 1 (what we use for HTTP) is faster than brotli.
I will include the latest libdeflate 1.13 statics in the next days, which made level 1 even faster than before.
This AWS blog article is not very accurate.
They just missed the fastest zlib library around, which is libdeflate. Much faster than cloudflare fork, because it is a full rewrite.
It is exciting to wait for next September - and potential results on this configuration.
I am confident it should pass with this HW - which is not much different from what we already used.
And since we reduced the memory allocation to bare minimum, and our MM has good scaling abilities, I expect almost linear progression.
The only blocking process is currently the ORM update, which has a critical section in the ORM core (to be optimized later). But all other process should be non blocking.
Offline
And from my tests, libdeflate at level 1 (what we use for HTTP) is faster than brotli.
I will include the latest libdeflate 1.13 statics in the next days, which made level 1 even faster than before. This AWS blog article is not very accurate.
They just missed the fastest zlib library around, which is libdeflate. Much faster than cloudflare fork, because it is a full rewrite.
Slightly off topic, but I made a QP report for Delphi about the ZLib implementation; if someone cares, please vote and give more info: https://quality.embarcadero.com/browse/RSP-38978
-Tee-
Offline
I have optimized TSynMustache, making some extensions and a code review:
- avoid searching for a space in tag names (to see whether it may be a helper)
- a lot of fixes and refactoring to enhance code generation
https://github.com/synopse/mORMot2/commit/117bd27c
Now on the TFB benchmark /fortunes is very close to /db - as it should be in a perfect world.
See https://gist.github.com/synopse/7ec565b … 435d7f7b3a
My notebook is not a production server for sure (only 2 cores), so we can expect very good numbers on dedicated HW.
Here are the server-side statistics on the above requests:
https://gist.github.com/synopse/6305b88 … 5397f37dd5
As we can see, a lot of requests were processed, and our MM did not sleep once, i.e. it had no multi-thread contention, and consumed only 7MB of RAM when allocating 3GB of small blocks. Nice.
Offline
You are right!
Should be fine with https://github.com/synopse/mORMot2/commit/1392558f
Numbers are good
abouchez@tisab:~/Downloads$ wrk -c 100 -d 15s -t 2 http://localhost:8080/db
Running 15s test @ http://localhost:8080/db
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 4.81ms 1.76ms 27.47ms 79.16%
Req/Sec 10.47k 581.61 11.71k 78.33%
312643 requests in 15.01s, 55.99MB read
Requests/sec: 20835.23
Transfer/sec: 3.73MB
abouchez@tisab:~/Downloads$ wrk -c 100 -d 15s -t 2 http://localhost:8080/fortunes
Running 15s test @ http://localhost:8080/fortunes
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 5.32ms 1.94ms 26.68ms 75.58%
Req/Sec 9.37k 706.97 11.15k 74.33%
279683 requests in 15.02s, 370.48MB read
Requests/sec: 18623.19
Transfer/sec: 24.67MB
Offline
On my 12-thread desktop (Ryzen 5 5600G overclocked to 4.2GHz, DDR4-2666 memory (16)) there is no visible difference after the Mustache optimizations.
The new test results are a little slower compared to the ones from 2022-08-13, but this is because I added 2 RAM modules, and now can't overclock the memory to DDR4-3200.
The best result (as expected) is in -c 12 wrk mode (12 threads).
For ./tfb --benchmark:
- fortunes before mustache opt - 124114 RPS
- fortunes after mustache opt - 124510 RPS
If I run the server on the host and test manually with
wrk -H 'Host: tfb-server' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 128 --timeout 8 -t 12 "http://tfb-server:8080/fortunes"
- fortunes before mustache opt - 126819
- fortunes after mustache opt - 127688
BTW, when Postgres is in a container and the app is on the host, `docker-proxy` becomes the bottleneck (tfb runs all parts in docker-compose, so there is no docker-proxy in their case).
I disabled it by forcing Docker port forwarding to use iptables - adding { "userland-proxy": false } to /etc/docker/daemon.json - and now the results from the tfb utility and from manual test execution are nearly the same.
Offline
Our TFB pull 7481 is merged into master.
The next tfb-status check starts in ~97 hours, so we should get results after ~225 hours, i.e. around 2022-09-01.
The current run is still not finished. They seem to have paused their "citrine" server instance.
It did not include the mORMot merge request in this round anyway (they grabbed the 19/8 commit, and mORMot was merged on 22/8).
So I don't know when we could have some official data including mORMot.
Let's wait as marmots do at the end of the Winter.
Offline
The test is finally running. TechEmpower had a lot of problems and stopped the tests, changed HW and moved office.
Thanks in advance to Pavel and Arnaud. Waiting for results...
Pavel, I hope you are OK. I really admire you and your work in this terrible time for you and your family.
Offline
Yes, the tests are running.
Here in Ukraine we currently have huge problems with electricity - russian terrorists are destroying our electric transformers across the country, so we save energy as much as possible. On the positive side, I'm reading paper books again during the blackouts. And we believe in the Armed Forces of Ukraine, together with the help of all civilized countries of the world.
Offline
mormot is #34 for the moment. I saw it also at #52.
Offline
There is definitely a scalability problem. If we look at the "Data table" tab for "Single query", we'll see that we do not scale linearly as the connection count increases.
Con  16     32      64      128     256     512
RPS  69406  116648  217865  305213  308233  306508
If we sort the data table by connection count (I copied it into Calc), then for 16 connections raw mORMot is #10, for 32 - #9, for 64 - #22, for 128 - #23, for 256 - #37, for 512 - #46. The distribution is nearly the same for "orm" mORMot.
Meanwhile the top frameworks show their best at 512 connections.
So my expectation that on powerful hardware we could get some unexpected results is unfortunately true (I test on a 6-core/12-thread CPU, while the TFB Citrine environment is 2x 14-core/28-thread CPUs).
Hopefully the main problem is that we limit the DB connection pool to 64.
PostgreSQL in the TFB test suite is configured with max_connections=2000 (the default is 100!).
I got temporary access to a server with 24 cores, and (when blackouts allow) will try to play with the pool size.
Offline
I was doing the same before @mpv's post :-) - rank/concurrency data for the Fortunes test.
No improvement between 256 and 512.
Framework    16   32   64  128  256  512
Mormot-raw   10    9   13   20   35   38
Mormot-orm   33   32   30   47   54   51
Last edited by ttomas (2022-10-30 10:09:20)
Offline
Huge problems with electricity today in Ukraine. The horde is very active on Mondays.
But I tested on 2 environments. The first is:
1) wrk on a laptop -> 1Gb Ethernet -> DB+App on the Ryzen 5 PC (12-thread CPU)
The second is:
2) DB+wrk+App on 2x Intel Xeon E5649 (24 CPU threads in total)
See the full results in Calc on Google Drive.
In short:
- on the 6-core/12-thread Ryzen 5 CPU the best result is with 64 app threads.
- on the 2x 6-core/12-thread Xeon CPUs - with 128 app threads.
From the TFB Citrine environment description I do not understand whether their Dell R440 is equipped with 2 CPUs or with one (both are possible).
If Citrine has 2 CPUs, we should limit the app server thread pool size to 256; if one - to 128 (instead of 64).
P.S.
On Xeon processors the test results are very reproducible, in contrast to the desktop CPU (even with good cooling). On a laptop CPU, performance tests are nearly impossible because of throttling.
P.S.2
Does anyone know how many CPUs are in Citrine? Tomorrow I plan to prepare a PR for TFB with the new thread pool limits and the latest 2.0.4148 mORMot.
Last edited by mpv (2022-10-31 21:05:12)
Offline
Testing on the same HW will produce misleading thread/worker optimization. We need 2 servers (app and db) with the same CPU/cores.
With this bench we also test PostgreSQL. Looking at
https://pganalyze.com/blog/postgres-14- … monitoring
we can see that the best TPS is when # connections = CPU cores (96 vCPU).
Looking at the drogon source config.json https://github.com/TechEmpower/Framewor … onfig.json
app.threads_num is 0, which means worker threads = CPU cores.
Looking at the actix-http source, the number of worker threads is not changed; the actix default is CPU cores.
For a production server it is OK for worker threads to be (1-4)x CPU cores; for this type of benchmark I would try # CPU cores.
Last edited by ttomas (2022-11-01 09:22:51)
Offline
Testing on the same HW will produce misleading thread/worker optimization. We need 2 servers (app and db) with the same CPU/cores.
Yes, you are right (moreover, we need 3 servers - a third one for wrk), but currently I do not have such hardware.
About "connections = CPU cores" - all frameworks at the top of the TFB rating use HTTP and Postgres pipelining. This IMHO is good for benchmarking or for writing tools like PgBouncer, but in real life it is a big pain (problems with transactions, debugging and so on).
So our approach with a fixed thread pool of synchronous connections is OK. But in this case we need connections > CPU cores. Yes, some connections will be idle (while the app server parses HTTP headers and so on), but (as noted in the article you provided) this is not a big problem for a connection count < 500.
My proposal for now (see my MR) is to remove the 64-worker limitation, so on Citrine (in case there is one CPU there) we spawn 24 cores * 4 = 96 worker threads. Let's wait for the next run and see.
And another note - even with the current results, if we filter for ORM type=full in the /fortunes test, mORMot is #6! #1 is lithium, but it's not a true ORM, as noted by @ab in this post. So actually we are #3, which is PERFECT IMHO.
Last edited by mpv (2022-11-01 11:10:16)
Offline
I think it's possible to implement SQL pipelining for Postgres in synchronous mode. It should speed up the `/queries` endpoint. I will do it for raw mode, and maybe even for the ORM.
Last edited by mpv (2022-11-02 11:22:27)
Offline
I implemented SQL pipelining for PostgreSQL in synchronous mode - see MR #127.
For the TFB "/rawqueries?queries=20" endpoint it gives a ~70% boost, for "rawupdates?queries=20" ~10%.
When the MR is accepted (it would be good to do a mORMot release after this) I'll make an MR to TFB.
@ab - please take a look; maybe in the future you can add pipelining at the ORM level also?
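To make the idea concrete, here is a rough, self-contained sketch of synchronous pipelining using the raw libpq pipeline-mode API (available since the v14 client library) - this is only an illustration of the technique, not the MR #127 code, and error handling is omitted:

const
  LIBPQ = 'libpq.so.5';

type
  PPGconn = pointer;
  PPGresult = pointer;

function PQenterPipelineMode(conn: PPGconn): integer;
  cdecl; external LIBPQ name 'PQenterPipelineMode';
function PQexitPipelineMode(conn: PPGconn): integer;
  cdecl; external LIBPQ name 'PQexitPipelineMode';
function PQpipelineSync(conn: PPGconn): integer;
  cdecl; external LIBPQ name 'PQpipelineSync';
function PQsendQueryParams(conn: PPGconn; command: PAnsiChar; nParams: integer;
  paramTypes, paramValues, paramLengths, paramFormats: pointer;
  resultFormat: integer): integer;
  cdecl; external LIBPQ name 'PQsendQueryParams';
function PQgetResult(conn: PPGconn): PPGresult;
  cdecl; external LIBPQ name 'PQgetResult';
procedure PQclear(res: PPGresult);
  cdecl; external LIBPQ name 'PQclear';

// send all SELECTs in a single network round-trip, then read all the results
procedure PipelinedSelect(conn: PPGconn; const ids: array of AnsiString);
var
  i: PtrInt;
  res: PPGresult;
  p: PAnsiChar;
begin
  PQenterPipelineMode(conn);
  for i := 0 to high(ids) do
  begin
    p := pointer(ids[i]);
    // statements are only queued here - no waiting for individual round-trips
    PQsendQueryParams(conn, 'select id, randomNumber from World where id = $1',
      1, nil, @p, nil, nil, 0);
  end;
  PQpipelineSync(conn);         // flush + explicit synchronization point
  for i := 0 to high(ids) do
  begin
    res := PQgetResult(conn);   // PGRES_TUPLES_OK result of statement #i
    // ... consume the row via PQgetvalue() here ...
    PQclear(res);
    PQclear(PQgetResult(conn)); // nil separator after each statement's results
  end;
  PQclear(PQgetResult(conn));   // PGRES_PIPELINE_SYNC for the sync point above
  PQexitPipelineMode(conn);
end;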
Offline
@mpv I wanted to say thank you and I am following the work.
Offline
Thanks a lot @mpv for your PR !
It is merged now.
I could make a release tomorrow, because I am currently on vacation, and should be away from the keyboard for two weeks.
So I guess I won't be able to finalize something stable for the ORM in a few hours - even if I suspect it is pretty feasible in the future (just integrate the pipelining as one official feature of the abstract TSqlDBConnection).
I spoke about you today (about TFB and PG) during my session at EKON 26 - so some more Delphiers are now looking toward the Ukraine situation. Stay safe!
Offline
I made a PR to TFB based on the [d2a2d829] commit. On my local 12-core environment (after editing "query_url": "/rawqueries?queries=" in benchmark_config.json, because I don't know how to run the raw tests locally):
./tfb --test mormot --type query --query-levels 20
wrk -H 'Host: tfb-server' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 512 --timeout 8 -t 12 "http://tfb-server:8080/rawqueries?queries=20"
WITH PIPELINING
Requests/sec: 17927.58
W/O PIPELINING
Requests/sec: 10449.76
A new TFB run with our MR (where the threads are not limited to 64) should start today, and we will see whether it helps with concurrency; I hope the next run will include pipelining. Let's wait. Have a good vacation!
P.S.
In my previous posts I missed the `--query-levels 20` parameter, so the /query endpoint was verified with 1 query; this is why the numbers are so big.
Last edited by mpv (2022-11-07 17:57:15)
Offline
mormot #30 & #52 position for the moment
Offline
As expected, the results for the server with 96 workers are better than with 64 as in the previous round. Below is a /rawdb concurrency comparison.
Con      16     32      64      128     256     512
64thRPS  69406  116648  217865  305213  308233  306508
96thRPS  68852  116590  217524  232641  354421  349668
As a result, at the end of this run we gain (approximately) +20 positions for db and queries, and +10 for fortunes:
/rawdb #51 (instead of #76)
/db #52 (instead of #77)
/rawfortunes #43 (#52)
/fortunes #73 (#78)
/rawqueries #62 (#81)
/queries #74 (#83)
Hopefully the pipelining MR will be merged by the TFB team, and in the next run we will get much better results for /rawqueries (I expect we will be at least #30 or even better).
Last edited by mpv (2022-11-11 11:29:50)
Offline
From my tests:
See the full results in Calc on Google Drive.
In short:
- on the 6-core/12-thread Ryzen 5 CPU the best result is with 64 app threads.
- on the 2x 6-core/12-thread Xeon CPUs - with 128 app threads.
With 256 threads I get a bad result. I suppose the "reactor" thread (the thread where the async HTTP processing is performed) becomes a bottleneck, but I am not sure.
Offline
@ab - I removed the SQLite3 code from the TFB tests to prevent confusion like this - https://github.com/TechEmpower/Framewor … ssues/7692
Also increased the threads to CPUCount*5 (instead of *4), so it will be 24*5 = 120.
Last edited by mpv (2022-11-12 12:27:53)
Offline
About the problems with the /plaintext endpoint:
My current exploration shows that if I increase the thread pool, then the CPU consumption of the main thread (in user space, so this is our code - not epoll) increases, and performance decreases.
The content of the HTTP headers doesn't matter; the problem is reproduced without headers (so it's not the header parser) using the command
wrk -d 10 -c 512 -t 8 "http://192.168.1.6:8080/plaintext"
On the 12-CPU desktop:
for "./raw 24" the main thread of the 24-thread async server consumes 600% CPU (6 cores), 50% of the overall CPU load is in kernel space, and the result is 751294 RPS
......
for "./raw 96" the main thread of the 96-thread async server consumes 1000% CPU (10 cores), 90% of the overall CPU load is in USER space, and the result is 262234 RPS
Unfortunately, the problem is not reproduced under valgrind, because valgrind in instrumentation mode is slow by itself...
Last edited by mpv (2022-11-13 15:48:59)
Offline
Update to the previous post - it was my measurement mistake - all threads use nearly the same portion of CPU (25% - 9%). But overall CPU usage moves from kernel space to user space as noted above.
Last edited by mpv (2022-11-13 17:09:33)
Offline
@mpv
Hope you are not too bad.
What do you call the "main" thread?
What is its name in htop for instance?
Is it the main process thread (named "main"), or is it TAsyncServer.Execute (named "A..."), or THttpAsyncServer.Execute (named "W..."), or TAsyncConnectionsThread.Execute (named "R..." - probably "R0" for the epoll main thread)?
I only have a small CPU (2 cores) so I can't reproduce the issue.
Offline
I'm currently without electricity, GSM and internet most of the time. I hope we recover our infrastructure soon... In any case, it is better than russian occupation.
As I noted above, all threads use nearly the same portion of CPU (the idea that all the load was in one thread was my measurement mistake), but when I increase the worker count, overall CPU usage moves from kernel space (red mark in htop) to user space (green mark).
Last edited by mpv (2022-11-25 10:34:05)
Offline
Could you send a screenshot of htop so that I could see the various threads?
8-)
If you have a little time, look at https://forum.lazarus.freepascal.org/in … #msg461281
Where our little mORMot can do wonders for a more realistic use case than the TechEmpower benchmark.
Offline
Nice - now I know how to turn on custom thread names in htop.
Command
wrk -d 10 -c 512 -t 8 "http://192.168.1.6:8080/plaintext"
HTOP
24 th ~ 700k RPS : (https://drive.google.com/file/d/1LDU3C1 … sp=sharing)
96 th ~ 400k RPS: (https://drive.google.com/file/d/1_FmtiQ … sp=sharing)
Sorry for the delay.
Last edited by mpv (2022-11-26 18:14:35)
Offline
look at https://forum.lazarus.freepascal.org/in … #msg461281
Where our little mORMot can do wonders for a more realistic use case than the TechEmpower benchmark.
Nice. It would be good to put the mORMot numbers into the project README.md, because there are no Pascal numbers on its main page.
But TFB is a well-known benchmark and it would be good to be in the top 20 there IMHO.
I see strange results for /rawqueries in pipelining mode - 636 RPS is very strange to me and I can't reproduce it.
Offline
You are right: /rawqueries should be faster than /queries.
Also the cached queries are at 3,082 RPS, whereas they should be close to JSON serialization.
It is weird that the relative performance between tests on their HW does not match what we obtain in our own tests.
And also the relative performance against Lithium, e.g. for /plaintext.
It is likely to be a problem of worker threads scheduling, triggered only with a lot of threads.
If there are more threads, it should not be slower. The threads should just remain inactive.
Yes, by default, thread names are not displayed in htop - you have to change the config.
The fact that the R0* thread is the first one is not a good sign.
Normally its CPU usage should be very low: all the other threads are working, so there is nothing to do in R0, which should just wait for new data.
On my PC, R0* has a low CPU usage.
We may try to switch to the poll API instead of epoll... and see.
Try to define the USEPOLL conditional (e.g. compile with -dUSEPOLL).
With two potential follow-ups:
1. There may be a problem with the "triggering" mode of epoll.
2. Or we may unsubscribe/subscribe each socket when it is handled in a thread... for epoll this is supposed to be fast, and it would avoid false positives which let the epoll wait API call return without waiting.
Offline
My observations:
- on my hardware I also DO NOT reproduce the bad performance for "/cached-queries?count=20" - it gives 75561 RPS vs, for example, 9535 RPS for "/queries?queries=20"
- using poll gives -10% RPS vs epoll. R0 is on top there also. The degradation is the same for 24 vs 96 threads, so the problem is not in poll/epoll.
About CPU scheduling - you are right - there is an abnormal increase of cpu-migrations (and a HIGH context-switch count in both cases, IMHO) and of user-space CPU utilization, for both poll and epoll, when the worker count increases - see the truncated perf output below for 24 vs 96 workers.
I think this can be the root of the problem. I found that in the h2o test the author sets the affinity mask for each thread manually - see here.
Do you think it can help?
$ perf stat ./raw 24
Performance counter stats for './raw 24':
63457,69 msec task-clock # 4,301 CPUs utilized
1739496 context-switches # 27,412 K/sec
7200 cpu-migrations # 113,461 /sec
1021 page-faults # 16,089 /sec
...
13,730222000 seconds user
50,388775000 seconds sys
$ perf stat ./raw 96
Performance counter stats for './raw 96':
78413,93 msec task-clock # 5,293 CPUs utilized
3461114 context-switches # 44,139 K/sec
117782 cpu-migrations # 1,502 K/sec
1460 page-faults # 18,619 /sec
...
36,417579000 seconds user
42,630085000 seconds sys
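For reference, here is a hypothetical sketch of how a thread could be pinned to one logical CPU on Linux, via glibc's sched_setaffinity() with pid=0 meaning the calling thread - this is not an existing mORMot API, just an illustration of the affinity idea mentioned above:

program pincpu;

{$mode objfpc}

uses
  ctypes;

// glibc entry point: pid = 0 applies the mask to the calling thread
function sched_setaffinity(pid: cint; cpusetsize: csize_t; mask: pointer): cint;
  cdecl; external 'c' name 'sched_setaffinity';

type
  TCpuSet = array[0..127] of byte; // 1024-bit mask, like glibc cpu_set_t

procedure PinCallingThreadToCpu(cpu: integer);
var
  mask: TCpuSet;
begin
  FillChar(mask, SizeOf(mask), 0);
  mask[cpu shr 3] := 1 shl (cpu and 7); // CPU_SET(cpu) on a zeroed mask
  sched_setaffinity(0, SizeOf(mask), @mask);
end;

begin
  PinCallingThreadToCpu(0); // e.g. bind this thread to logical CPU #0
end.

Whether pinning the R* worker threads this way would actually reduce the cpu-migrations shown above is exactly the open question.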
Last edited by mpv (2022-11-27 12:13:57)
Offline