Fresh results
Max RPS:
┌──────────────┬────────────┬────────────┬────────────┬────────────┐
│   (index)    │mormot(0720)│mormot(0730)│mormot(0801)│ lithium    │
├──────────────┼────────────┼────────────┼────────────┼────────────┤
│   fortune    │   74318    │   90500    │   91287    │   90064    │
│  plaintext   │   920198   │   977024   │   986253   │  3388906   │
│      db      │   111119   │   116756   │   117624   │   99463    │
│    update    │   10177    │   10108    │   10981    │   25718    │
│     json     │   422771   │   446284   │   458358   │   544247   │
│    query     │   106665   │   113516   │   114842   │   94638    │
│ cached-query │   384818   │   416903   │   419020   │   528433   │
└──────────────┴────────────┴────────────┴────────────┴────────────┘
We have reached a level of performance at which room temperature changes affect the measurements. So I upgraded my PC case to a mid-tower with a manual fan speed switch: during normal work I set it to minimum for silence, during measurements to maximum to prevent the CPU temperature from rising.
Offline
Perhaps with hsoNoStats to disable low-level statistic counters, it may vary a little bit.
Or it may be the room temperature. 
https://github.com/synopse/mORMot2/commit/59e1f82c
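For anyone wanting to try it, here is a minimal sketch of enabling the option on the async server - the Create() parameter order and the OnRequest callback signature below are written from memory of the mORMot 2 samples, so check them against the actual mormot.net.async / mormot.net.server sources:

program NoStatsServer;

{$I mormot.defines.inc}

uses
  mormot.core.base,
  mormot.net.http,
  mormot.net.server,   // THttpServerOptions / hsoNoStats
  mormot.net.async;    // THttpAsyncServer

type
  TSimpleLogic = class
    function DoRequest(Ctxt: THttpServerRequestAbstract): cardinal;
  end;

function TSimpleLogic.DoRequest(Ctxt: THttpServerRequestAbstract): cardinal;
begin
  Ctxt.OutContentType := 'text/plain';
  Ctxt.OutContent := 'Hello, World!';
  result := 200; // HTTP success
end;

var
  logic: TSimpleLogic;
  server: THttpAsyncServer;
begin
  logic := TSimpleLogic.Create;
  // assumed parameter order: port, OnStart, OnStop, process name,
  // thread pool size, keep-alive (ms), options set
  server := THttpAsyncServer.Create('8080', nil, nil, '', 16, 30000,
    [hsoNoStats]); // skip the low-level per-request statistic counters
  try
    server.OnRequest := logic.DoRequest;
    server.WaitStarted; // raises an exception if binding port 8080 failed
    writeln('serving on :8080 - press [Enter] to quit');
    readln;
  finally
    server.Free;
    logic.Free;
  end;
end.

With hsoNoStats set, the low-level statistic counters mentioned above are simply skipped during request processing.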
Offline
I have added the new hsoNoStats option.
Also some other small (1%) improvements in PR #111:
- our StrLen is twice as fast as PQGetLength (0.2% vs 0.4% on TFB /db)
- avoid an unnecessary PQGetIsNull call - it only needs to be called for an empty string, to distinguish a null from an empty string result (0.6%)
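To make the second point concrete, here is a hedged sketch of the idea against raw libpq (the actual code in mormot.db.sql.postgres / mormot.db.raw.postgres is organized differently); it assumes mormot.core.base is in the uses clause for RawUtf8, StrLen() and FastSetString():

const
  LIBPQ = 'libpq.so.5';

function PQgetvalue(res: pointer; row, col: integer): PUtf8Char;
  cdecl; external LIBPQ name 'PQgetvalue';
function PQgetisnull(res: pointer; row, col: integer): integer;
  cdecl; external LIBPQ name 'PQgetisnull';

// returns true if the column value is a SQL NULL, false otherwise
function GetTextOrNull(res: pointer; row, col: integer;
  out Value: RawUtf8): boolean;
var
  p: PUtf8Char;
  len: PtrInt;
begin
  p := PQgetvalue(res, row, col); // libpq returns '' (not nil) for NULL values
  len := StrLen(p);               // our own StrLen instead of a PQgetlength() call
  if len <> 0 then
    result := false               // a non-empty text can never be NULL
  else
    result := PQgetisnull(res, row, col) = 1; // only '' needs the NULL check
  FastSetString(Value, p, len);
end;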
Last edited by mpv (2022-08-01 14:45:44)
Offline
How fast is the new MoveFast in mORMot? So fast that I decided to add the TFB #1 drogon for comparison (results are without the latest PostgreSQL improvements).
Max RPS:
┌──────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│   (index)    │mormot(0720)│mormot(0730)│mormot(0801)│ mormot(mf) │ drogon     │ lithium    │
├──────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│   fortune    │   74318    │   90500    │   91287    │   113073   │   176131   │   90064    │
│  plaintext   │   920198   │   977024   │   986253   │  1436231   │  3583444   │  3388906   │
│      db      │   111119   │   116756   │   117624   │   153009   │   176776   │   99463    │
│    update    │   10177    │   10108    │   10981    │   15476    │   90230    │   25718    │
│     json     │   422771   │   446284   │   458358   │   590979   │   554328   │   544247   │
│    query     │   106665   │   113516   │   114842   │   148187   │   171092   │   94638    │
│ cached-query │   384818   │   416903   │   419020   │   547307   │            │   528433   │
└──────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
Last edited by mpv (2022-08-01 18:10:18)
Offline
Please find some small changes in https://gist.github.com/synopse/83f4522 … 70d25b9c4d
- use the new TSynMustache.RenderDataArray(), which allows direct use of the local variables, with no temporary TDocVariant conversion, for fortune
- tuned TRestBatch options for update
- use an external count variable for rawfortune
We need to note that drogon is an interesting C++ framework, much more realistic than lithium, even if what it calls an "ORM" is not really a runtime ORM - there is still a lot of boilerplate code to generate at compile time, as in https://github.com/TechEmpower/Framewor … Fortune.cc. It does not start from an Object: it generates some class types from a .cc controller file.
Ahead-of-time code generation is of course very efficient performance-wise, but we lose the ORM approach.
There is still a lot of code to write for the methods, for instance, it seems to support only manual JSON serialization.
Offline
Today's state: +10% for fortunes, thanks to TSynMustache.RenderDataArray()
Max RPS:
┌──────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│   (index)    │mormot(0720)│mormot(0730)│mormot(0801)│mormot(0802)│mormot(0813)│ drogon     │ lithium    │
├──────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│   fortune    │   74318    │   90500    │   91287    │   113073   │   126055   │   176131   │   90064    │
│  plaintext   │   920198   │   977024   │   986253   │  1436231   │  1373177   │  3583444   │  3388906   │
│      db      │   111119   │   116756   │   117624   │   153009   │   154033   │   176776   │   99463    │
│    update    │   10177    │   10108    │   10981    │   15476    │   15336    │   90230    │   25718    │
│     json     │   422771   │   446284   │   458358   │   590979   │   584294   │   554328   │   544247   │
│    query     │   106665   │   113516   │   114842   │   148187   │   149122   │   171092   │   94638    │
│ cached-query │   384818   │   416903   │   419020   │   547307   │   551230   │            │   528433   │
└──────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
Last edited by mpv (2022-08-13 14:09:15)
Offline
Hi ab,
I have been following mORMot for a long time because your work is amazing! I'm not sure about runtime ORM, though. Here is what my ORM does:
  /// <summary>
  ///   ORM model persisted into the 'exceptions' table
  /// </summary>
  TDBException = class(TDBModel)
  public
    const FIELD_ID          = 'id';
    const FIELD_MESSAGE     = 'message';
    const FIELD_STACK_TRACE = 'stack_trace';
    const FIELD_MODULE      = 'module';
    const FIELD_CREATED_AT  = 'created_at';
  private
    fID        : Int64;
    fMessage   : string;
    fStackTrace: string;
    fModule    : string;
    fCreatedAt : TDateTime;
    procedure SetMessage(const aValue: string);
    procedure SetStackTrace(const aValue: string);
    procedure SetModule(const aValue: string);
    procedure SetCreatedAt(const aValue: TDateTime);
  protected
    function DoGetFieldValue(const aField: string): TValue; override;
    class function DoGetFieldType(const aField: string): TFieldType; override;
    procedure DoSetInteger(const aField: string; const aValue: Int64); override;
    procedure DoSetDouble(const aField: string; const aValue: Double); override;
    procedure DoSetText(const aField: string; const aValue: string); override;
  public
    class function GetFieldList: TArray<string>; override;
    class function GetDataFieldList: TArray<string>; override;
    class function GetPrimaryKeyFields: TArray<string>; override;
    class function GetTableName: string; override;
  public
    property ID: Int64 read fID;
    property Message: string read fMessage write SetMessage;
    property StackTrace: string read fStackTrace write SetStackTrace;
    property Module: string read fModule write SetModule;
    property CreatedAt: TDateTime read fCreatedAt write SetCreatedAt;
  end;
class function TDBException.DoGetFieldType(const aField: string): TFieldType;
begin
  if InArray(aField,
    [
      FIELD_ID
    ])
  then
    Result := ftInteger
  else if InArray(aField,
    [
      FIELD_MESSAGE,
      FIELD_STACK_TRACE,
      FIELD_MODULE
    ])
  then
    Result := ftText
  else if InArray(aField,
    [
      FIELD_CREATED_AT
    ])
  then
    Result := ftDouble
  else
    Result := ftNone;
end;
function TDBException.DoGetFieldValue(const aField: string): TValue;
begin
  if aField = FIELD_ID then
    Result := fID
  else if aField = FIELD_MESSAGE then
    Result := fMessage
  else if aField = FIELD_STACK_TRACE then
    Result := fStackTrace
  else if aField = FIELD_MODULE then
    Result := fModule
  else if aField = FIELD_CREATED_AT then
    Result := fCreatedAt;
end;
procedure TDBException.DoSetDouble(const aField: string; const aValue: Double);
begin
  if aField = FIELD_CREATED_AT then
    fCreatedAt := aValue;
end;
procedure TDBException.DoSetInteger(const aField: string; const aValue: Int64);
begin
  if aField = FIELD_ID then
    fID := aValue;
end;
procedure TDBException.DoSetText(const aField: string; const aValue: string);
begin
  if aField = FIELD_MESSAGE then
    fMessage := aValue
  else if aField = FIELD_STACK_TRACE then
    fStackTrace := aValue
  else if aField = FIELD_MODULE then
    fModule := aValue;
end;
class function TDBException.GetDataFieldList: TArray<string>;
begin
  Result := [
    FIELD_MESSAGE,
    FIELD_STACK_TRACE,
    FIELD_MODULE,
    FIELD_CREATED_AT
  ];
end;
class function TDBException.GetFieldList: TArray<string>;
begin
  Result := [
    FIELD_ID,
    FIELD_MESSAGE,
    FIELD_STACK_TRACE,
    FIELD_MODULE,
    FIELD_CREATED_AT
  ];
end;
class function TDBException.GetPrimaryKeyFields: TArray<string>;
begin
  Result := [
    FIELD_ID
  ];
end;
class function TDBException.GetTableName: string;
begin
  Result := 'exceptions';
end;
procedure TDBException.SetCreatedAt(const aValue: TDateTime);
begin
  if (CompareDateTime(fCreatedAt, aValue) <> EqualsValue) or IsNull(FIELD_CREATED_AT) then
  begin
    fCreatedAt := aValue;
    MarkNotNullAndDirtyField(FIELD_CREATED_AT);
  end;
end;
procedure TDBException.SetMessage(const aValue: string);
begin
  if (fMessage <> aValue) or IsNull(FIELD_MESSAGE) then
  begin
    fMessage := aValue;
    MarkNotNullAndDirtyField(FIELD_MESSAGE);
  end;
end;
procedure TDBException.SetModule(const aValue: string);
begin
  if (fModule <> aValue) or IsNull(FIELD_MODULE) then
  begin
    fModule := aValue;
    MarkNotNullAndDirtyField(FIELD_MODULE);
  end;
end;
procedure TDBException.SetStackTrace(const aValue: string);
begin
  if (fStackTrace <> aValue) or IsNull(FIELD_STACK_TRACE) then
  begin
    fStackTrace := aValue;
    MarkNotNullAndDirtyField(FIELD_STACK_TRACE);
  end;
end;
This is just an example. It does not use RTTI, and it runs fast in my cases. If you would like to use the idea behind my ORM, please let me know and I will post the full source here!
Offline
Hello nglthach and welcome.
But please follow the forum rules, and don't post huge sets of code in the forum itself.
See https://synopse.info/forum/misc.php?action=rules
Use gist for instance to share some code.
So if I understand correctly, you don't use RTTI but you generate some pascal code from a data model?
This is not really what an ORM is - because, to be honest, there is no Object mapping. It is close to what the pseudo-ORMs of lithium or drogon do in the benchmark.
The benefit of RTTI is that you can reuse a lot of code, once you handle the type mapping between Pascal RTTI and SQL columns.
I am not sure your approach, which is text-based for the field lookup, will make a huge performance difference - on the contrary, using RTTI and a bitset is likely to be faster, as we do in TOrmProperties/TOrmModelProperties, especially if you have more than a few fields.
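As a hedged illustration (not the actual TOrmProperties code): once RTTI has assigned every published field a fixed index, null/dirty tracking becomes a bit test instead of one string comparison per field name:

type
  TFieldBits64 = set of 0..63; // mORMot uses a similar set of 0..MAX_SQLFIELDS-1

procedure MarkNotNullAndDirty(var NotNull, Dirty: TFieldBits64; FieldIndex: byte);
begin
  include(NotNull, FieldIndex); // one bit-set operation each
  include(Dirty, FieldIndex);
end;

function IsNullField(const NotNull: TFieldBits64; FieldIndex: byte): boolean;
begin
  // no 'if aField = FIELD_XXX then' string comparisons involved
  result := not (FieldIndex in NotNull);
end;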
The mORMot approach is not perfect - it is not a "pure ORM"; for instance we are limited to a TID=Int64 primary key. But this limitation allows us to use the SQLite3 virtual table mechanism, which is a huge benefit. And we have to inherit from the TOrm class, which also has a lot of benefits, because we can use this base type in our ORM CRUD layer with no need for bloated generics, for instance.
You can share some code on gist or github: it is always interesting to share ideas.
Try to make something similar to https://github.com/synopse/mORMot2/tree … xtdb-bench with your ORM, and share the numbers. 
Offline
Hi ab,
Thank you for the notice! I will follow the forum rules. I know there are many things that need to be optimized. I will do a benchmark and post it in another thread. Now back to the TechEmpower Framework Benchmarks.
Offline
I have included a copy of the TFB mORMot test in the repository.
See https://github.com/synopse/mORMot2/tree … ower-bench
With some additional information, and above numbers.
Offline

Please allow me an uneducated question (this is far outside my expertise): 
Would building these same benchmarks with both the FPC and Delphi compilers and running them on the same hardware provide any meaningful comparison between the compilers as well, or would the current tests not provide relevant data?
Offline
On Delphi, I expect it to be slower for such a benchmark.
Because
1. it will run only on Windows - and from our tests, both the http.sys and socket servers are slower there than on Linux
2. it won't use our x86_64 memory manager, which scales better than the original FastMM4.
3. some of our AVX/AVX2 asm compiles only on FPC
So it won't be a fair comparison - unless we compile the Windows version, but that is not the main point of such benchmarks.
On production, I would use a Linux server.
In practice, Delphi and FPC have a similar level of code generation on x86_64. On Win32, Delphi generates slightly better code in my experiments.
And mORMot bypasses most of the RTL to use its own routines, so the difference is not noticeable.
Offline
Our TFB pull request 7481 is merged into master.
The next tfb-status check starts in ~97 hours, so we should get results after ~225 hours = 2022-09-01.
Offline
We have never tested mORMot before on such powerful hardware (Intel Xeon Gold 5120 CPU, 32 GB, enterprise SSD; 3 servers (DB, app and load generator) connected through a dedicated Cisco 10-gigabit switch), so some unexpected things may happen, but I hope everything will be OK.
Last edited by mpv (2022-08-22 16:07:04)
Offline
If ZLib is used - I happened to stumble upon this:
https://aws.amazon.com/blogs/opensource … lib-forks/
Quite a bit faster forks of ZLib; some are not API compatible, and some are only for in-memory operations, but anyhow, check those out.
-Tee-
Last edited by TPrami (2022-08-23 05:58:18)
Offline
For the TFB tests compression is not permitted - see rule ix of the requirements.
In mORMot client <-> mORMot server scenarios the proprietary SynLZ compression is used.
In real-life high-load web scenarios (mORMot <-> reverse proxy <-> browser) the preferred compression is IMHO Brotli (see the comparison with gzip), and it can be enabled at the reverse proxy level.
For gzip compression mORMot uses libdeflate. @ab - it would be a good idea to note the sources used to build the static libraries for mORMot in statics/README.md - `/res/static/` is enough.
And inside the /res/static/ library folders, a link to the original sources, because currently it is not clear which exact implementation is used.
Last edited by mpv (2022-08-23 07:18:53)
Offline
Just follow the Readme link and you will find https://github.com/synopse/mORMot2/tree … res/static
And there you have all the information and the original C code.
And from my tests, libdeflate at level 1 (what we use for HTTP) is faster than brotli.
I will include the latest libdeflate 1.13 statics in the next days, which made level 1 even faster than before.
This AWS blog article is not very accurate.
They just missed the fastest zlib library around, which is libdeflate - much faster than the cloudflare fork, because it is a full rewrite.
It is exciting to wait for next September - and potential results on this configuration.
I am confident it should pass with this HW - which is not much more diverse than what we already used.
And since we reduced the memory allocation to bare minimum, and our MM has good scaling abilities, I expect almost linear progression.
The only blocking process is currently the ORM update, which has a critical section in the ORM core (to be optimized later). But all other processing should be non-blocking.
Offline
And from my tests, libdeflate at level 1 (what we use for HTTP) is faster than brotli.
I will include the latest libdeflate 1.13 statics in the next days, which made level 1 even faster than before.
This AWS blog article is not very accurate.
They just missed the fastest zlib library around, which is libdeflate. Much faster than cloudflare fork, because it is a full rewrite.
Slightly off topic, but I made a QP report for Delphi about the ZLib implementation; if someone cares, please vote and add more info: https://quality.embarcadero.com/browse/RSP-38978
-Tee-
Offline
I have optimized TSynMustache, making some extensions and a code review:
- avoid searching for a space in tag names (to see if it may be a helper)
- a lot of fixes and refactoring to enhance code generation
https://github.com/synopse/mORMot2/commit/117bd27c
Now in the TFB benchmark /fortunes is very close to /db - as it should be in a perfect world.
See https://gist.github.com/synopse/7ec565b … 435d7f7b3a
My notebook is surely not a production server (only 2 cores), so we can expect very good numbers on dedicated HW.
Here are the server-side statistics on the above requests:
https://gist.github.com/synopse/6305b88 … 5397f37dd5
As we can see, a lot of requests were processed, and our MM did not sleep once, i.e. it had no multi-thread contention, and consumed only 7MB of RAM when allocating 3GB of small blocks. Nice. 
Offline
You are right!
Should be fine with https://github.com/synopse/mORMot2/commit/1392558f
Numbers are good
abouchez@tisab:~/Downloads$ wrk -c 100 -d 15s -t 2 http://localhost:8080/db
Running 15s test @ http://localhost:8080/db
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.81ms    1.76ms  27.47ms   79.16%
    Req/Sec    10.47k   581.61    11.71k    78.33%
  312643 requests in 15.01s, 55.99MB read
Requests/sec:  20835.23
Transfer/sec:      3.73MB
abouchez@tisab:~/Downloads$ wrk -c 100 -d 15s -t 2 http://localhost:8080/fortunes
Running 15s test @ http://localhost:8080/fortunes
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.32ms    1.94ms  26.68ms   75.58%
    Req/Sec     9.37k   706.97    11.15k    74.33%
  279683 requests in 15.02s, 370.48MB read
Requests/sec:  18623.19
Transfer/sec:     24.67MB
Offline
On my 12-thread desktop (Ryzen 5 5600G overclocked to 4.2GHz, DDR4-2666 memory (16)) there is no visible difference after the Mustache optimizations.
The new test results are a little slower compared to the ones from 2022-08-13, but this is because I added 2 RAM modules and now can't overclock the memory to DDR4-3200.
The best result (as expected) is in -c 12 wrk mode (12 threads).
For ./tfb --benchmark:
- fortunes before the Mustache opt - 124114 RPS
- fortunes after the Mustache opt - 124510 RPS
If I run the server on the host and test manually with
wrk -H 'Host: tfb-server' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 128 --timeout 8 -t 12 "http://tfb-server:8080/fortunes"
- fortunes before the Mustache opt - 126819
- fortunes after the Mustache opt - 127688
BTW, when Postgres is in a container and the app is on the host, `docker-proxy` becomes a bottleneck (tfb runs all parts in docker-compose, so there is no docker-proxy in their case).
I disabled it by forcing Docker port forwarding to use iptables - adding { "userland-proxy": false } to /etc/docker/daemon.json - and now the results from the tfb utility and from manual test execution are nearly the same.
Offline
Our TFB pull request 7481 is merged into master.
The next tfb-status check starts in ~97 hours, so we should get results after ~225 hours = 2022-09-01.
The current work is still not finished. They seemed to have paused their "citrine" server instance.
It did not include the mORMot merge request in this round anyway (they grabbed the 19/8 commit, and mORMot was merged on 22/8).
So I don't know when we could have some official data including mORMot. 
Let's wait as marmots do at the end of the Winter. 
Offline
The test is finally running. TechEmpower had a lot of problems - they stopped the tests, changed HW and moved office.
Thanks in advance to Pavel and Arnaud. Waiting for results...
Pavel, I hope you are OK. I really admire you and your work in this terrible time for you and your family.
Offline
Yes, the tests are running
Here in Ukraine we currently have huge problems with electricity - russian terrorists destroy our electric transformers across the country, so we save energy as much as possible. On the positive side - I'm reading paper books again during blackouts. And we believe in the Armed Forces of Ukraine, together with help from all civilized countries of the world.
Offline
#34 for mormot at the moment. I saw it also at #52.
Offline
There is definitely a scalability problem. If we look at the "Data table" tab for "Single query", we'll see that we do not scale linearly as the connection count increases.
Con  16      32      64      128     256     512
RPS  69406   116648  217865  305213  308233  306508
If we sort the data table by connection count (I copied it to Calc), then for 16 connections raw mORMot is #10, for 32 - #9, for 64 - #22, for 128 - #23, for 256 - #37, for 512 - #46. Nearly the same distribution holds for the "orm" mORMot.
Meanwhile the top frameworks show their best at 512 connections.
So my expectation that on powerful hardware we could get some unexpected results is unfortunately true (I test on a 6-core/12-thread CPU, while the TFB Citrine environment is 2x 14-core/28-thread CPUs).
Hopefully the main problem is that we limit the DB connection pool to 64.
PostgreSQL in the TFB test suite is configured with max_connections=2000 (the default is 100!).
I got temporary access to a server with 24 cores, and (when blackouts allow) will try to play with the pool size.
Offline
I was doing the same before @mpv's post :-) - rank/concurrency for the Fortunes test data.
No improvement between 256 and 512.
Framework   16  32  64  128  256  512
Mormot-raw  10   9  13   20   35   38
Mormot-orm  33  32  30   47   54   51
Last edited by ttomas (2022-10-30 10:09:20)
Offline
Huge problems with electricity today in Ukraine. The horde is very active on Mondays.
But I test on 2 environments - the first is
1) wrk on a laptop -> 1Gb Ethernet -> DB+App on a Ryzen 5 PC (12-thread CPU)
The second is
2) DB+wrk+App on 2x Intel® Xeon® Processor E5649 (24 CPU threads in total)
See the full results in Calc on Google Drive
In short:
- on the 6/12 Ryzen 5 CPU the best result is with 64 app threads,
- on the 2x6/12 Xeon CPUs - with 128 app threads.
From the TFB Citrine environment description I do not understand whether their Dell R440 is equipped with 2 CPUs or with one (both are possible).
If Citrine has 2 CPUs - we should limit the app server thread pool size to 256, if one - to 128 (instead of 64).
P.S.
On the Xeon processors test results are very reproducible, in contrast to the desktop CPU (even with good cooling). On a laptop CPU performance testing is nearly impossible because of throttling.
P.S.2
Does anyone know how many CPUs are in Citrine? Tomorrow I plan to prepare a PR for TFB with the new thread pool limits and the latest 2.0.4148 mORMot.
Last edited by mpv (2022-10-31 21:05:12)
Offline
Testing on the same HW will produce a false thread/worker optimization. We need 2 servers (app and db) with the same CPU/cores.
With this bench we also test PostgreSQL. Looking at
https://pganalyze.com/blog/postgres-14- … monitoring
we can see that the best TPS is when #connections = CPU cores (96 vCPU).
Looking at the drogon source config.json https://github.com/TechEmpower/Framewor … onfig.json
app.threads_num: 0 means worker threads = CPU cores.
Looking at the actix-http source, the number of worker threads is not changed; the actix default is CPU cores.
For a production server (1-4)*CPU cores worker threads is OK; for this type of benchmark I will try with # = CPU cores.
Last edited by ttomas (2022-11-01 09:22:51)
Offline
Testing on the same HW will produce a false thread/worker optimization. We need 2 servers (app and db) with the same CPU/cores.
Yes, you are right (moreover - we need 3 servers, the third for wrk), but currently I do not have such hardware.
About "connections = CPU cores" - all the frameworks at the top of the TFB rating use HTTP and Postgres pipelining. IMHO this is good for benchmarking or for writing tools like PgBouncer, but in real life it is a big pain (problems with transactions, debugging and so on).
So our approach with a fixed thread pool of synchronous connections is OK. But in this case we need connections > CPU cores. Yes, some connections will be idle (while the app server parses HTTP headers and so on), but (as noted in the article you linked) this is not a big problem for a connection count < 500.
My proposal for now (see my MR) is to remove the 64 worker limitation, so on Citrine (in case there is one CPU there) we spawn 24 cores * 4 = 96 worker threads. Let's wait for the next run and see.
And another note - even with the current results, if we filter for ORM type=full in the /fortunes test, mORMot is #6! #1 is lithium, but it's not a true ORM, as noted by @ab in this post. So actually we are #3, which is PERFECT IMHO.
Last edited by mpv (2022-11-01 11:10:16)
Offline
I think it's possible to implement SQL pipelining for Postgres in synchronous mode. It should speed up the `/queries` endpoint. I will do it for raw mode, and maybe even for the ORM.
Last edited by mpv (2022-11-02 11:22:27)
Offline
I implemented SQL pipelining for PostgreSQL in synchronous mode - see MR #127.
For the TFB "/rawqueries?queries=20" endpoint it gives a ~70% boost, for "rawupdates?queries=20" ~10%.
When the MR is accepted (it would be good to do a mORMot release after this) I'll make a MR to TFB.
@ab - please take a look; maybe in the future you could add pipelining at the ORM level as well?
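For the record, here is a hedged sketch of what synchronous pipelining looks like at the raw libpq level (PostgreSQL 14+ client library; the actual mORMot integration differs, and the SQL below just reuses the TFB World lookup as an illustration):

const
  LIBPQ = 'libpq.so.5';

function PQenterPipelineMode(conn: pointer): integer; cdecl; external LIBPQ;
function PQexitPipelineMode(conn: pointer): integer; cdecl; external LIBPQ;
function PQpipelineSync(conn: pointer): integer; cdecl; external LIBPQ;
function PQsendQueryParams(conn: pointer; command: PAnsiChar; nParams: integer;
  paramTypes, paramValues, paramLengths, paramFormats: pointer;
  resultFormat: integer): integer; cdecl; external LIBPQ;
function PQgetResult(conn: pointer): pointer; cdecl; external LIBPQ;
procedure PQclear(res: pointer); cdecl; external LIBPQ;

// send all statements, then read all results back in order - fine for small
// batches like queries=20; huge batches would need a non-blocking read loop
procedure PipelinedWorldSelects(conn: pointer; const ids: array of AnsiString);
var
  i: integer;
  val: PAnsiChar;
  res: pointer;
begin
  if PQenterPipelineMode(conn) <> 1 then
    exit; // pipeline mode needs the libpq from PostgreSQL 14+
  for i := 0 to high(ids) do
  begin
    val := pointer(ids[i]); // single text parameter $1
    PQsendQueryParams(conn,
      'select id, randomNumber from World where id = $1', 1,
      nil, @val, nil, nil, 0);
  end;
  PQpipelineSync(conn); // delimits the batch and flushes the send buffer
  for i := 0 to high(ids) do
  begin
    res := PQgetResult(conn); // the PGresult of statement #i
    // ... fetch the row(s) from res here ...
    PQclear(res);
    res := PQgetResult(conn); // nil terminates statement #i (per the libpq docs)
  end;
  res := PQgetResult(conn);   // consume the PGRES_PIPELINE_SYNC marker
  PQclear(res);
  PQexitPipelineMode(conn);
end;

The gain on /rawqueries comes from paying the network round-trip once per batch instead of once per statement.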
Offline
@mpv I wanted to say thank you and I am following the work.
Offline
Thanks a lot @mpv for your PR !
It is merged now.
I could make a release tomorrow, because I am currently on vacation, and should be away from the keyboard for two weeks.
So I guess I won't be able to finalize something stable for the ORM in a few hours - even if I suspect it is pretty feasible in the future (just integrate the pipelining as an official feature of the abstract TSqlDBConnection).
I spoke about you today (about TFB and PG) during my session at EKON 26 - so a few more Delphi developers are looking toward the Ukraine situation. Stay safe!
Offline
I made a PR to TFB based on the [d2a2d829] commit. On my local 12-core environment (after editing "query_url": "/rawqueries?queries=" in benchmark_config.json, because I don't know how to run the raw test locally):
./tfb --test mormot --type query --query-levels 20
wrk -H 'Host: tfb-server' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 512 --timeout 8 -t 12 "http://tfb-server:8080/rawqueries?queries=20"
WITH PIPELINING
Requests/sec:  17927.58
W/O PIPELINING
Requests/sec:  10449.76
Today a new TFB run should start with our MR where the thread count is not limited to 64, and we will see whether it helps with concurrency; hopefully the next run will also include pipelining. Let's wait. Have a good vacation!
P.S.
In my previous posts I missed the `--query-levels 20` parameter, so the /query endpoint was verified with 1 query; this is why the numbers are so big.
Last edited by mpv (2022-11-07 17:57:15)
Offline
mormot is at positions #30 & #52 for the moment
Offline
As expected - the results for the server with 96 workers are better than with 64 in the previous round. Below is a /rawdb concurrency comparison:
Con       16      32      64      128     256     512
64th RPS  69406   116648  217865  305213  308233  306508
96th RPS  68852   116590  217524  232641  354421  349668
As a result, at the end of this run we gain (approximately) +20 positions for db and queries, +10 for fortunes:
 /rawdb  #51 (instead of #76)
 /db       #52 (instead of #77)
/rawfortunes #43 (#52)
/fortunes       #73 (#78)
/rawqueries  #62 (#81)
/queries        #74 (#83)
Hopefully the pipelining MR will be merged by the TFB team and in the next run we'll get much better results for /rawqueries (I expect we will be at least #30 or even better).
Last edited by mpv (2022-11-11 11:29:50)
Offline
From my tests
See the full results in Calc on Google Drive
In short:
- on the 6/12 Ryzen 5 CPU the best result is with 64 app threads,
- on the 2x6/12 Xeon CPUs - with 128 app threads.
With 256 threads I have a bad result. I suppose the "reactor" thread (the thread where the async HTTP polling is performed) becomes a bottleneck, but I am not sure.
Offline
@ab - I removed the SQLite3 code from the TFB tests to prevent confusion like this - https://github.com/TechEmpower/Framewor … ssues/7692
Also increased the thread count to CPUCount*5 (instead of *4), so there will be 24*5 = 120.
Last edited by mpv (2022-11-12 12:27:53)
Offline
About the problems with the /plaintext endpoint:
My current exploration shows that if I increase the thread pool, then the CPU consumption of the main thread (in user space, so this is our code - not epoll) increases, and performance decreases.
The content of the HTTP headers doesn't matter; the problem is reproduced without headers (so it's not the header parser) using the command
wrk -d 10 -c 512 -t 8 "http://192.168.1.6:8080/plaintext"
On the 12 CPU desktop:
for "./raw 24" the main thread of the 24-thread async server consumes 600% CPU (6 cores), 50% of the overall CPU load is in kernel space, and the result is 751294 RPS
......
for "./raw 96" the main thread of the 96-thread async server consumes 1000% CPU (10 cores), 90% of the overall CPU load is in USER space, and the result is 262234 RPS
Unfortunately the problem is not reproduced under valgrind, because valgrind in instrumentation mode is slow by itself...
Last edited by mpv (2022-11-13 15:48:59)
Offline
Update to the previous post - it was my measurement mistake: all threads use nearly the same portion of CPU (25% - 9%). But the overall CPU usage moved from kernel space to user space, as noted above.
Last edited by mpv (2022-11-13 17:09:33)
Offline
@mpv 
I hope things are not too bad for you.
What do you call the "main" thread?
What is its name in htop, for instance?
Is it the main process thread (named "main"), or is it TAsyncServer.Execute (named "A..."), THttpAsyncServer.Execute (named "W...") or TAsyncConnectionsThread.Execute (named "R..." - probably "R0" for the epoll main thread)?
I only have a small CPU (2 cores), so I can't reproduce the issue.
Offline
I'm currently without electricity, GSM and internet most of the time. Hope we recover our infrastructure soon... In any case, it is better than russian occupation.
As I noted above - all threads use nearly the same portion of CPU (all the load being in one thread was my measurement mistake), but when I increase the worker count, the overall CPU usage moves from kernel space (red mark in htop) to user space (green mark).
Last edited by mpv (2022-11-25 10:34:05)
Offline
Could you send a screenshot of htop so that I could see the diverse threads?
8-)
If you have a little time, look at https://forum.lazarus.freepascal.org/in … #msg461281
Where our little mORMot can do wonders for a more realistic use case than the TechEmpower benchmark. 
Offline
Nice - now I know how to turn on custom thread names in htop
Command
wrk -d 10 -c 512 -t 8 "http://192.168.1.6:8080/plaintext"
htop:
24 th ~ 700k RPS: (https://drive.google.com/file/d/1LDU3C1 … sp=sharing)
96 th ~ 400k RPS: (https://drive.google.com/file/d/1_FmtiQ … sp=sharing)
Sorry for the delay.
Last edited by mpv (2022-11-26 18:14:35)
Offline
look at https://forum.lazarus.freepascal.org/in … #msg461281
Where our little mORMot can do wonders for a more realistic use case than the TechEmpower benchmark.
Nice. It would be good to put the mORMot numbers in the project README.md, because there are no Pascal numbers on the main page.
But TFB is a well-known benchmark and it would be good to be in the top 20 there IMHO.
I see strange results for /rawqueries in pipelining mode - 636 RPS is very strange to me and I can't reproduce it.
Offline
You are right: /rawqueries should be faster than /queries.
Also the cached queries are at 3,082 RPS whereas they should be close to the JSON serialization.
It is weird that the relative performance between tests on their HW does not match what we obtain in our own tests.
And also the relative performance against Lithium, e.g. for /plaintext.
It is likely to be a problem of worker threads scheduling, triggered only with a lot of threads. 
If there are more threads, it should not be slower. The threads should just remain inactive.
Yes, by default, thread names are not displayed in htop - you have to change the config. 
The fact that R0* thread is the first is not a good sign.
Normally, it should be very low, since all other threads are working, so there is nothing to do in R0, which should wait for new data.
On my PC, R0* has a low CPU usage.
We may try to change to use the poll API instead of epoll... and see.
Try to define USEPOLL conditional.
With two potential follow-ups:
1. There may be a problem with the "triggering" mode of epoll.
2. Or we may unsubscribe/subscribe each socket when it is handled in a thread... for epoll this is supposed to be fast, and it would avoid false positives which let the epoll wait API call return without waiting.
Offline
My observations:
- on my hardware I also DO NOT reproduce the bad performance for "/cached-queries?count=20" - it gives 75 561 RPS vs, for example, 9 535 RPS for "/queries?queries=20"
- using poll gives -10% RPS vs epoll. R0 is on top also. The degradation is the same for 24 vs 96 threads, so the problem is not in poll/epoll.
About CPU scheduling - you are right: there is an abnormal increase in cpu-migrations (and a HIGH context-switch rate in both cases, IMHO too) and in user-space CPU utilization for both poll and epoll when the worker count is increased - see the truncated perf output below for 24 vs 96 workers.
I think this can be the root of the problem. I found in the h2o test that the author sets the affinity mask for each thread manually - see here.
Do you think it can help? (A sketch of the idea follows the perf output below.)
$ perf stat ./raw 24
 Performance counter stats for './raw 24':
         63457,69 msec task-clock                #    4,301 CPUs utilized          
         1739496      context-switches          #   27,412 K/sec                  
             7200      cpu-migrations            #  113,461 /sec                   
             1021      page-faults               #   16,089 /sec              
...     
      13,730222000 seconds user
      50,388775000 seconds sys
$ perf stat ./raw 96
 Performance counter stats for './raw 96':
         78413,93 msec task-clock                #    5,293 CPUs utilized          
         3461114      context-switches          #   44,139 K/sec                  
           117782      cpu-migrations            #    1,502 K/sec                  
             1460      page-faults               #   18,619 /sec                   
...
      36,417579000 seconds user
      42,630085000 seconds sys
Last edited by mpv (2022-11-27 12:13:57)
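Here is the sketch mentioned above of what manual pinning could look like on Linux with FPC - this is not mORMot code, just an assumption of how it might be wired: sched_setaffinity() with pid = 0 applies to the calling thread, and each worker would call it once at startup (e.g. with its thread index modulo the CPU count):

type
  TCpuSet = array[0..127] of byte; // 1024-bit cpu_set_t, as in glibc

function sched_setaffinity(pid: integer; cpusetsize: PtrUInt;
  mask: pointer): integer; cdecl; external 'c';

// pin the calling thread to a single logical CPU, as h2o does,
// to avoid the cpu-migrations reported by perf above
procedure PinCurrentThreadToCpu(cpu: integer);
var
  mask: TCpuSet;
begin
  FillChar(mask, SizeOf(mask), 0);
  mask[cpu shr 3] := mask[cpu shr 3] or (1 shl (cpu and 7)); // set bit #cpu
  sched_setaffinity(0, SizeOf(mask), @mask); // pid = 0 -> calling thread
end;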
Offline