Fresh results
Max RPS:
┌──────────────┬────────────┬────────────┬────────────┬────────────┐
│   (index)    │mormot(0720)│mormot(0730)│mormot(0801)│  lithium   │
├──────────────┼────────────┼────────────┼────────────┼────────────┤
│   fortune    │   74318    │   90500    │   91287    │   90064    │
│  plaintext   │   920198   │   977024   │   986253   │  3388906   │
│      db      │   111119   │   116756   │   117624   │   99463    │
│    update    │   10177    │   10108    │   10981    │   25718    │
│     json     │   422771   │   446284   │   458358   │   544247   │
│    query     │   106665   │   113516   │   114842   │   94638    │
│ cached-query │   384818   │   416903   │   419020   │   528433   │
└──────────────┴────────────┴────────────┴────────────┴────────────┘
We have reached a level of performance at which room temperature changes affect the measurements. So I upgraded my PC case to a mid-tower with a manual cooler speed switch. During normal work I set it to minimum for silence, and during measurements to maximum, to prevent the CPU temperature from rising.
Offline
Perhaps with hsoNoStats to disable low-level statistic counters, it may vary a little bit.
Or it may be the room temperature.
https://github.com/synopse/mORMot2/commit/59e1f82c
Offline
I added the new hsoNoStats option.
Also some other small (1%) improvements in PR #111:
- our StrLen is twice as fast as PQGetLength (0.2% vs 0.4% of CPU on the TFB /db test)
- prevent an unnecessary PQGetIsNull call - it only needs to be made for empty values, to distinguish NULL from an empty string (0.6%); see the sketch below
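For illustration, here is a rough sketch of that last idea using the raw libpq entry points (the ColumnIsNull helper and the external declarations are mine, for illustration only - this is not the actual mORMot code):

const
  LIBPQ = 'libpq.so.5';

// raw libpq imports used by the sketch
function PQgetlength(res: pointer; row, col: integer): integer;
  cdecl; external LIBPQ name 'PQgetlength';
function PQgetisnull(res: pointer; row, col: integer): integer;
  cdecl; external LIBPQ name 'PQgetisnull';

// hypothetical helper: only an empty value can actually be NULL,
// so the extra PQgetisnull() call is needed just for that case
function ColumnIsNull(res: pointer; row, col: integer): boolean;
begin
  if PQgetlength(res, row, col) <> 0 then
    result := false
  else
    result := PQgetisnull(res, row, col) = 1;
end;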
Last edited by mpv (2022-08-01 14:45:44)
Offline
How fast is the new MoveFast in mORMot? So fast that I decided to add drogon - the TFB #1 - for comparison (results are without the latest PostgreSQL improvements).
Max RPS:
┌──────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│   (index)    │mormot(0720)│mormot(0730)│mormot(0801)│ mormot(mf) │   drogon   │  lithium   │
├──────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│   fortune    │   74318    │   90500    │   91287    │   113073   │   176131   │   90064    │
│  plaintext   │   920198   │   977024   │   986253   │  1436231   │  3583444   │  3388906   │
│      db      │   111119   │   116756   │   117624   │   153009   │   176776   │   99463    │
│    update    │   10177    │   10108    │   10981    │   15476    │   90230    │   25718    │
│     json     │   422771   │   446284   │   458358   │   590979   │   554328   │   544247   │
│    query     │   106665   │   113516   │   114842   │   148187   │   171092   │   94638    │
│ cached-query │   384818   │   416903   │   419020   │   547307   │            │   528433   │
└──────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
Last edited by mpv (2022-08-01 18:10:18)
Offline
Please find some small changes in https://gist.github.com/synopse/83f4522 … 70d25b9c4d
- use the new TSynMustache.RenderDataArray(), which allows rendering directly from the local variables, with no temporary TDocVariant conversion, for fortune
- tuned TRestBatch options for update
- use an external count variable for rawfortune
We should note that drogon is an interesting C++ framework, much more realistic than lithium, even if what it calls an "ORM" is not really a runtime ORM - there is still a lot of boilerplate code to generate at compile time, as in https://github.com/TechEmpower/Framewor … Fortune.cc . It does not start from an Object, it generates some class types from a .cc controller file.
Ahead-of-time code generation is of course very efficient in terms of performance, but we lose the ORM approach.
There is still a lot of code to write for the methods; for instance, it seems to support only manual JSON serialization.
Offline
Today's state: +10% for fortunes thanks to TSynMustache.RenderDataArray()
Max RPS:
┌──────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│   (index)    │mormot(0720)│mormot(0730)│mormot(0801)│mormot(0802)│mormot(0813)│   drogon   │  lithium   │
├──────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│   fortune    │   74318    │   90500    │   91287    │   113073   │   126055   │   176131   │   90064    │
│  plaintext   │   920198   │   977024   │   986253   │  1436231   │  1373177   │  3583444   │  3388906   │
│      db      │   111119   │   116756   │   117624   │   153009   │   154033   │   176776   │   99463    │
│    update    │   10177    │   10108    │   10981    │   15476    │   15336    │   90230    │   25718    │
│     json     │   422771   │   446284   │   458358   │   590979   │   584294   │   554328   │   544247   │
│    query     │   106665   │   113516   │   114842   │   148187   │   149122   │   171092   │   94638    │
│ cached-query │   384818   │   416903   │   419020   │   547307   │   551230   │            │   528433   │
└──────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
Last edited by mpv (2022-08-13 14:09:15)
Offline
Hi ab,
I have been following mORMot for a long time because your work is amazing! I'm not sure about a runtime ORM. Here is what my ORM does:
/// <summary>
///
/// </summary>
  TDBException = class(TDBModel)
  public
    const FIELD_ID          = 'id';
    const FIELD_MESSAGE     = 'message';
    const FIELD_STACK_TRACE = 'stack_trace';
    const FIELD_MODULE      = 'module';
    const FIELD_CREATED_AT  = 'created_at';
  private
    fID        : Int64;
    fMessage   : string;
    fStackTrace: string;
    fModule    : string;
    fCreatedAt : TDateTime;
    procedure SetMessage(const aValue: string);
    procedure SetStackTrace(const aValue: string);
    procedure SetModule(const aValue: string);
    procedure SetCreatedAt(const aValue: TDateTime);
  protected
    function DoGetFieldValue(const aField: string): TValue; override;
    class function DoGetFieldType(const aField: string): TFieldType; override;
    procedure DoSetInteger(const aField: string; const aValue: Int64); override;
    procedure DoSetDouble(const aField: string; const aValue: Double); override;
    procedure DoSetText(const aField: string; const aValue: string); override;
  public
    class function GetFieldList: TArray<string>; override;
    class function GetDataFieldList: TArray<string>; override;
    class function GetPrimaryKeyFields: TArray<string>; override;
    class function GetTableName: string; override;
  public
    property ID: Int64 read fID;
    property Message: string read fMessage write SetMessage;
    property StackTrace: string read fStackTrace write SetStackTrace;
    property Module: string read fModule write SetModule;
    property CreatedAt: TDateTime read fCreatedAt write SetCreatedAt;
  end;
class function TDBException.DoGetFieldType(const aField: string): TFieldType;
begin
  if InArray(aField, [FIELD_ID]) then
    Result := ftInteger
  else if InArray(aField, [FIELD_MESSAGE, FIELD_STACK_TRACE, FIELD_MODULE]) then
    Result := ftText
  else if InArray(aField, [FIELD_CREATED_AT]) then
    Result := ftDouble
  else
    Result := ftNone;
end;

function TDBException.DoGetFieldValue(const aField: string): TValue;
begin
  if aField = FIELD_ID then
    Result := fID
  else if aField = FIELD_MESSAGE then
    Result := fMessage
  else if aField = FIELD_STACK_TRACE then
    Result := fStackTrace
  else if aField = FIELD_MODULE then
    Result := fModule
  else if aField = FIELD_CREATED_AT then
    Result := fCreatedAt;
end;

procedure TDBException.DoSetDouble(const aField: string; const aValue: Double);
begin
  if aField = FIELD_CREATED_AT then
    fCreatedAt := aValue;
end;

procedure TDBException.DoSetInteger(const aField: string; const aValue: Int64);
begin
  if aField = FIELD_ID then
    fID := aValue;
end;

procedure TDBException.DoSetText(const aField: string; const aValue: string);
begin
  if aField = FIELD_MESSAGE then
    fMessage := aValue
  else if aField = FIELD_STACK_TRACE then
    fStackTrace := aValue
  else if aField = FIELD_MODULE then
    fModule := aValue;
end;

class function TDBException.GetDataFieldList: TArray<string>;
begin
  Result := [
    FIELD_MESSAGE,
    FIELD_STACK_TRACE,
    FIELD_MODULE,
    FIELD_CREATED_AT
  ];
end;

class function TDBException.GetFieldList: TArray<string>;
begin
  Result := [
    FIELD_ID,
    FIELD_MESSAGE,
    FIELD_STACK_TRACE,
    FIELD_MODULE,
    FIELD_CREATED_AT
  ];
end;

class function TDBException.GetPrimaryKeyFields: TArray<string>;
begin
  Result := [
    FIELD_ID
  ];
end;

class function TDBException.GetTableName: string;
begin
  Result := 'exceptions';
end;

procedure TDBException.SetCreatedAt(const aValue: TDateTime);
begin
  if (CompareDateTime(fCreatedAt, aValue) <> EqualsValue) or IsNull(FIELD_CREATED_AT) then
  begin
    fCreatedAt := aValue;
    MarkNotNullAndDirtyField(FIELD_CREATED_AT);
  end;
end;

procedure TDBException.SetMessage(const aValue: string);
begin
  if (fMessage <> aValue) or IsNull(FIELD_MESSAGE) then
  begin
    fMessage := aValue;
    MarkNotNullAndDirtyField(FIELD_MESSAGE);
  end;
end;

procedure TDBException.SetModule(const aValue: string);
begin
  if (fModule <> aValue) or IsNull(FIELD_MODULE) then
  begin
    fModule := aValue;
    MarkNotNullAndDirtyField(FIELD_MODULE);
  end;
end;

procedure TDBException.SetStackTrace(const aValue: string);
begin
  if (fStackTrace <> aValue) or IsNull(FIELD_STACK_TRACE) then
  begin
    fStackTrace := aValue;
    MarkNotNullAndDirtyField(FIELD_STACK_TRACE);
  end;
end;
This is just an example. It does not use RTTI. It runs fast in my use cases. If you would like to use the idea behind my ORM, please let me know, and I will post the full source here!
Offline
Hello nglthach and welcome.
But please follow the forum rules, and don't post huge sets of code in the forum itself.
See https://synopse.info/forum/misc.php?action=rules
Use gist for instance to share some code.
So if I understand correctly, you don't use RTTI but you generate some pascal code from a data model?
This is not really what an ORM is - because, to be honest, there is no Object mapping. It is close to what the pseudo-ORMs of lithium or drogon do in the benchmark.
The benefit of RTTI is that you can reuse a lot of code, once you handle the type mapping between pascal RTTI and SQL columns.
I am not sure your approach, which is text-based for the field lookup, will make a huge performance difference - on the contrary, using RTTI and a bitset is likely to be faster, as we do in TOrmProperties/TOrmModelProperties, especially if you have more than a few fields.
The mORMot approach is not perfect either - it is not a "pure ORM"; for instance we are limited to a TID=Int64 primary key. But this limitation allows us to use the SQLite3 virtual table mechanism, which is a huge benefit. We also have to inherit from the TOrm class, which has a lot of benefits too, because we can use this base type in our ORM CRUD layer with no need for bloated generics, for instance.
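To illustrate the RTTI approach, here is a minimal sketch of the same kind of table expressed as a TOrm descendant - every published property is mapped to a SQL column via RTTI, so none of the per-field boilerplate is needed (the TOrmException name and field layout are just for illustration):

uses
  mormot.core.base,
  mormot.orm.core;

type
  // the inherited ID: TID (Int64) property is the primary key;
  // each published property becomes a column thanks to RTTI
  TOrmException = class(TOrm)
  private
    fMessage: RawUtf8;
    fStackTrace: RawUtf8;
    fModule: RawUtf8;
    fCreatedAt: TDateTime;
  published
    property Message: RawUtf8 read fMessage write fMessage;
    property StackTrace: RawUtf8 read fStackTrace write fStackTrace;
    property Module: RawUtf8 read fModule write fModule;
    property CreatedAt: TDateTime read fCreatedAt write fCreatedAt;
  end;

Register such a class in a TOrmModel and the ORM CRUD layer (Add/Retrieve/Update...) works on it with no per-class code.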
You can share some code on gist or github: it is always interesting to share ideas.
Try to make something similar to https://github.com/synopse/mORMot2/tree … xtdb-bench with your ORM, and share the numbers.
Offline
Hi ab,
Thank you for the notice! I will follow the forum rules. I do know there are many things that need to be optimized. I will do a benchmark and post it in another thread. Now let's return this thread to the TechEmpower Framework Benchmarks.
Offline
I have included a copy of the TFB mORMot test in the repository.
See https://github.com/synopse/mORMot2/tree … ower-bench
With some additional information, and above numbers.
Offline
Please allow me an uneducated question (this is far outside my expertise):
Would building these same benchmarks with both the FPC and Delphi compilers and running them on the same hardware provide any meaningful comparison between the compilers as well, or would the current tests not provide relevant data?
Offline
On Delphi, I expect it to be slower for such a benchmark.
Because
1. it will run only on Windows - and from our tests, both the http.sys and the socket server are slower there than on Linux
2. it won't use our x86_64 memory manager, which scales better than the original FastMM4
3. some of our AVX/AVX2 asm compiles only on FPC
So it won't be a fair comparison - unless we only compare the Windows builds, but that is not the main point of such benchmarks.
In production, I would use a Linux server.
In practice, Delphi and FPC have a similar level of code generation on x86_64. On Win32, Delphi generates slightly better code in my experience.
And mORMot bypasses most of the RTL to use its own routines, so the difference is not noticeable.
Offline
Our TFB pull 7481 is merged into master.
The next tfb-status check starts in ~97 hours, so we should get results after ~225 hours, i.e. around 2022-09-01.
Offline
We have never tested mORMot before on such powerful hardware (Intel Xeon Gold 5120 CPU, 32 GB RAM, enterprise SSD; 3 servers (DB, app and load generator) connected by a dedicated Cisco 10-gigabit link), so some unexpected things may happen, but I hope everything will be OK.
Last edited by mpv (2022-08-22 16:07:04)
Offline
If ZLib is used - I happened to stumble upon this:
https://aws.amazon.com/blogs/opensource … lib-forks/
These are quite a bit faster forks of ZLib; some are not API compatible, and some are only for in-memory operations, but anyhow, check those out.
-Tee-
Last edited by TPrami (2022-08-23 05:58:18)
Offline
For the TFB tests compression is not permitted - see rule ix of the requirements.
In mORMot client <-> mORMot server scenarios the proprietary SynLZ is used.
In real-life high-load Web scenarios (mORMot <-> reverse proxy <-> browser) IMHO the preferred compression is Brotli (see the comparison with gzip), and it can be enabled at the reverse proxy level.
For gzip compression mORMot uses libdeflate. @ab - it would be a good idea to note the sources used to build the static libraries for mORMot in statics/README.md - `/res/static/` is enough.
And inside the /res/static/ library folders - a link to the original sources, because currently it is not clear which exact implementation is used.
Last edited by mpv (2022-08-23 07:18:53)
Offline
Just follow the Readme link and you will find https://github.com/synopse/mORMot2/tree … res/static
All the information and the original C code are there.
And from my tests, libdeflate at level 1 (what we use for HTTP) is faster than brotli.
I will include the latest libdeflate 1.13 statics in the next days, which made level 1 even faster than before.
This AWS blog article is not very accurate.
They just missed the fastest zlib library around, which is libdeflate. Much faster than cloudflare fork, because it is a full rewrite.
It is exciting to wait for next September - and potential results on this configuration.
I am confident it should pass with this HW - which is not much different from what we already used.
And since we reduced the memory allocation to bare minimum, and our MM has good scaling abilities, I expect almost linear progression.
The only blocking process is currently the ORM update, which has a critical section in the ORM core (to be optimized later). But all other process should be non blocking.
Offline
And from my tests, libdeflate at level 1 (what we use for HTTP) is faster than brotli.
I will include the latest libdeflate 1.13 statics in the next days, which made level 1 even faster than before. This AWS blog article is not very accurate.
They just missed the fastest zlib library around, which is libdeflate. Much faster than cloudflare fork, because it is a full rewrite.
Slightly off topic, but I made a QP report for Delphi about the ZLib implementation; if someone cares, please vote and give more info: https://quality.embarcadero.com/browse/RSP-38978
-Tee-
Offline
I have optimized TSynMustache, making some extensions and a code review:
- avoid searching for a space in tag names (to see whether it may be a helper)
- a lot of fixes and refactoring to enhance code generation
https://github.com/synopse/mORMot2/commit/117bd27c
Now on the TFB benchmark /fortunes is very close to /db - as it should be in a perfect world.
See https://gist.github.com/synopse/7ec565b … 435d7f7b3a
My notebook is not a production server for sure (only 2 cores), so we can expect very good numbers on dedicated HW.
Here are the server-side statistics on the above requests:
https://gist.github.com/synopse/6305b88 … 5397f37dd5
As we can see, a lot of requests were processed, and our MM did not sleep once, i.e. it had no multi-thread contention, and consumed only 7MB of RAM when allocating 3GB of small blocks. Nice.
Offline
You are right!
Should be fine with https://github.com/synopse/mORMot2/commit/1392558f
Numbers are good
abouchez@tisab:~/Downloads$ wrk -c 100 -d 15s -t 2 http://localhost:8080/db
Running 15s test @ http://localhost:8080/db
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 4.81ms 1.76ms 27.47ms 79.16%
Req/Sec 10.47k 581.61 11.71k 78.33%
312643 requests in 15.01s, 55.99MB read
Requests/sec: 20835.23
Transfer/sec: 3.73MB
abouchez@tisab:~/Downloads$ wrk -c 100 -d 15s -t 2 http://localhost:8080/fortunes
Running 15s test @ http://localhost:8080/fortunes
2 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 5.32ms 1.94ms 26.68ms 75.58%
Req/Sec 9.37k 706.97 11.15k 74.33%
279683 requests in 15.02s, 370.48MB read
Requests/sec: 18623.19
Transfer/sec: 24.67MB
Offline
On my 12-thread desktop (Ryzen 5 5600G overclocked to 4.2GHz, DDR4-2666 memory (16)) there is no visible difference after the Mustache optimizations.
The new test results are a little slower compared to the ones from 2022-08-13, but this is because I added 2 RAM modules, and now can't overclock the memory to DDR4-3200.
The best result (as expected) is in -c 12 wrk mode (12 threads).
For ./tfb --benchmark:
- fortunes before mustache opt - 124114 RPS
- fortunes after mustache opt - 124510 RPS
If I run the server on the host and test manually with
wrk -H 'Host: tfb-server' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 128 --timeout 8 -t 12 "http://tfb-server:8080/fortunes"
- fortunes before mustache opt - 126819
- fortunes after mustache opt - 127688
BTW, when Postgres is in a container and the app is on the host, `docker-proxy` becomes the bottleneck (tfb runs all parts in docker-compose, so there is no docker-proxy in their case).
I disabled it by forcing Docker port forwarding to use iptables - adding { "userland-proxy": false } to /etc/docker/daemon.json - and now the results from the tfb utility and from manual test execution are nearly the same.
Offline
Our TFB pull 7481 is merged into master.
The next tfb-status check starts in ~97 hours, so we should get results after ~225 hours, i.e. around 2022-09-01.
The current run is still not finished. They seem to have paused their "citrine" server instance.
It did not include the mORMot merge request in this round anyway (they grabbed the 19/8 commit, and mORMot was merged on 22/8).
So I don't know when we could have some official data including mORMot.
Let's wait as marmots do at the end of the Winter.
Offline
The test is finally running. TechEmpower had a lot of problems and stopped the tests, changed HW and moved office.
Thanks in advance to Pavel and Arnaud. Waiting for results...
Pavel, I hope you are OK. I really admire you and your work in this terrible time for you and your family.
Offline
Yes, the tests are running.
Here in Ukraine we currently have huge problems with electricity - russian terrorists are destroying our electric transformers across the country, so we save energy as much as possible. On the positive side, I'm reading paper books again during the blackouts. And we believe in the Armed Forces of Ukraine, together with the help of all civilized countries of the world.
Offline
mormot is #34 for the moment. I saw it also at #52.
Offline
There is definitely a scalability problem. If we look at the "Data table" tab for "Single query", we'll see that we do not scale linearly as the connection count increases.
Con  16     32      64      128     256     512
RPS  69406  116648  217865  305213  308233  306508
If we sort the data table by connection count (I copied it into Calc), then for 16 connections raw mORMot is #10, for 32 - #9, for 64 - #22, for 128 - #23, for 256 - #37, for 512 - #46. The distribution is nearly the same for "orm" mORMot.
Meanwhile the top frameworks show their best at 512 connections.
So my expectation that on powerful hardware we could get some unexpected results is unfortunately true (I test on a 6-core/12-thread CPU, while the TFB Citrine environment is 2x 14-core/28-thread CPUs).
Hopefully the main problem is that we limit the DB connection pool to 64.
PostgreSQL in the TFB test suite is configured with max_connections=2000 (the default is 100!).
I got temporary access to a server with 24 cores, and (when blackouts allow) will try to play with the pool size.
Offline
I was doing the same before @mpv's post :-) - rank/concurrency data for the Fortunes test.
No improvement between 256 and 512.
Framework    16   32   64  128  256  512
Mormot-raw   10    9   13   20   35   38
Mormot-orm   33   32   30   47   54   51
Last edited by ttomas (2022-10-30 10:09:20)
Offline
Huge problems with electricity today in Ukraine. The horde is very active on Mondays.
But I tested on 2 environments. The first is:
1) wrk on a laptop -> 1Gb Ethernet -> DB+App on the Ryzen 5 PC (12-thread CPU)
The second is:
2) DB+wrk+App on 2x Intel Xeon E5649 (24 CPU threads in total)
See the full results in Calc on Google Drive.
In short:
- on the 6-core/12-thread Ryzen 5 CPU the best result is with 64 app threads.
- on the 2x 6-core/12-thread Xeon CPUs - with 128 app threads.
From the TFB Citrine environment description I do not understand whether their Dell R440 is equipped with 2 CPUs or with one (both are possible).
If Citrine has 2 CPUs, we should limit the app server thread pool size to 256; if one - to 128 (instead of 64).
P.S.
On Xeon processors the test results are very reproducible, in contrast to the desktop CPU (even with good cooling). On a laptop CPU, performance tests are nearly impossible because of throttling.
P.S.2
Does anyone know how many CPUs are in Citrine? Tomorrow I plan to prepare a PR for TFB with the new thread pool limits and the latest 2.0.4148 mORMot.
Last edited by mpv (2022-10-31 21:05:12)
Offline
Testing on the same HW will produce misleading thread/worker optimization. We need 2 servers (app and db) with the same CPU/cores.
With this bench we also test PostgreSQL. Looking at
https://pganalyze.com/blog/postgres-14- … monitoring
we can see that the best TPS is when # connections = CPU cores (96 vCPU).
Looking at the drogon source config.json https://github.com/TechEmpower/Framewor … onfig.json
app.threads_num is 0, which means worker threads = CPU cores.
Looking at the actix-http source, the number of worker threads is not changed; the actix default is CPU cores.
For a production server it is OK for worker threads to be (1-4)x CPU cores; for this type of benchmark I would try # CPU cores.
Last edited by ttomas (2022-11-01 09:22:51)
Offline
Testing on the same HW will produce misleading thread/worker optimization. We need 2 servers (app and db) with the same CPU/cores.
Yes, you are right (moreover, we need 3 servers - a third one for wrk), but currently I do not have such hardware.
About "connections = CPU cores" - all frameworks at the top of the TFB rating use HTTP and Postgres pipelining. This IMHO is good for benchmarking or for writing tools like PgBouncer, but in real life it is a big pain (problems with transactions, debugging and so on).
So our approach with a fixed thread pool of synchronous connections is OK. But in this case we need connections > CPU cores. Yes, some connections will be idle (while the app server parses HTTP headers and so on), but (as noted in the article you provided) this is not a big problem for a connection count < 500.
My proposal for now (see my MR) is to remove the 64-worker limitation, so on Citrine (in case there is one CPU there) we spawn 24 cores * 4 = 96 worker threads. Let's wait for the next run and see.
And another note - even with the current results, if we filter for ORM type=full in the /fortunes test, mORMot is #6! #1 is lithium, but it's not a true ORM, as noted by @ab in this post. So actually we are #3, which is PERFECT IMHO.
Last edited by mpv (2022-11-01 11:10:16)
Offline
I think it's possible to implement SQL pipelining for Postgres in synchronous mode. It should speed up the `/queries` endpoint. I will do it for raw mode, and maybe even for the ORM.
Last edited by mpv (2022-11-02 11:22:27)
Offline
I implemented SQL pipelining for PostgreSQL in synchronous mode - see MR #127.
For the TFB "/rawqueries?queries=20" endpoint it gives a ~70% boost, for "rawupdates?queries=20" ~10%.
When the MR is accepted (it would be good to do a mORMot release after this) I'll make an MR to TFB.
@ab - please take a look; maybe in the future you can add pipelining at the ORM level also?
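To make the idea concrete, here is a rough, self-contained sketch of synchronous pipelining using the raw libpq pipeline-mode API (available since the v14 client library) - this is only an illustration of the technique, not the MR #127 code, and error handling is omitted:

const
  LIBPQ = 'libpq.so.5';

type
  PPGconn = pointer;
  PPGresult = pointer;

function PQenterPipelineMode(conn: PPGconn): integer;
  cdecl; external LIBPQ name 'PQenterPipelineMode';
function PQexitPipelineMode(conn: PPGconn): integer;
  cdecl; external LIBPQ name 'PQexitPipelineMode';
function PQpipelineSync(conn: PPGconn): integer;
  cdecl; external LIBPQ name 'PQpipelineSync';
function PQsendQueryParams(conn: PPGconn; command: PAnsiChar; nParams: integer;
  paramTypes, paramValues, paramLengths, paramFormats: pointer;
  resultFormat: integer): integer;
  cdecl; external LIBPQ name 'PQsendQueryParams';
function PQgetResult(conn: PPGconn): PPGresult;
  cdecl; external LIBPQ name 'PQgetResult';
procedure PQclear(res: PPGresult);
  cdecl; external LIBPQ name 'PQclear';

// send all SELECTs in a single network round-trip, then read all the results
procedure PipelinedSelect(conn: PPGconn; const ids: array of AnsiString);
var
  i: PtrInt;
  res: PPGresult;
  p: PAnsiChar;
begin
  PQenterPipelineMode(conn);
  for i := 0 to high(ids) do
  begin
    p := pointer(ids[i]);
    // statements are only queued here - no waiting for individual round-trips
    PQsendQueryParams(conn, 'select id, randomNumber from World where id = $1',
      1, nil, @p, nil, nil, 0);
  end;
  PQpipelineSync(conn);         // flush + explicit synchronization point
  for i := 0 to high(ids) do
  begin
    res := PQgetResult(conn);   // PGRES_TUPLES_OK result of statement #i
    // ... consume the row via PQgetvalue() here ...
    PQclear(res);
    PQclear(PQgetResult(conn)); // nil separator after each statement's results
  end;
  PQclear(PQgetResult(conn));   // PGRES_PIPELINE_SYNC for the sync point above
  PQexitPipelineMode(conn);
end;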
Offline
@mpv I wanted to say thank you and I am following the work.
Offline
Thanks a lot @mpv for your PR !
It is merged now.
I could make a release tomorrow, because I am currently on vacation, and should be away from the keyboard for two weeks.
So I guess I won't be able to finalize something stable for the ORM in a few hours - even if I suspect it is pretty feasible in the future (just integrate the pipelining as one official feature of the abstract TSqlDBConnection).
I spoke about you today (about TFB and PG) during my session at EKON 26 - so some more Delphiers are now looking toward the Ukraine situation. Stay safe!
Offline
I made a PR to TFB based on the [d2a2d829] commit. On my local 12-core environment (after editing "query_url": "/rawqueries?queries=" in benchmark_config.json, because I don't know how to run the raw tests locally):
./tfb --test mormot --type query --query-levels 20
wrk -H 'Host: tfb-server' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 512 --timeout 8 -t 12 "http://tfb-server:8080/rawqueries?queries=20"
WITH PIPELINING
Requests/sec: 17927.58
W/O PIPELINING
Requests/sec: 10449.76
A new TFB run with our MR (where the threads are not limited to 64) should start today, and we will see whether it helps with concurrency; I hope the next run will include pipelining. Let's wait. Have a good vacation!
P.S.
In my previous posts I missed the `--query-levels 20` parameter, so the /query endpoint was verified with 1 query; this is why the numbers are so big.
Last edited by mpv (2022-11-07 17:57:15)
Offline
mormot #30 & #52 position for the moment
Offline
As expected, the results for the server with 96 workers are better than with 64 as in the previous round. Below is a /rawdb concurrency comparison.
Con      16     32      64      128     256     512
64thRPS  69406  116648  217865  305213  308233  306508
96thRPS  68852  116590  217524  232641  354421  349668
As a result, at the end of this run we gain (approximately) +20 positions for db and queries, and +10 for fortunes:
/rawdb #51 (instead of #76)
/db #52 (instead of #77)
/rawfortunes #43 (#52)
/fortunes #73 (#78)
/rawqueries #62 (#81)
/queries #74 (#83)
Hopefully the pipelining MR will be merged by the TFB team, and in the next run we will get much better results for /rawqueries (I expect we will be at least #30 or even better).
Last edited by mpv (2022-11-11 11:29:50)
Offline
From my tests:
See the full results in Calc on Google Drive.
In short:
- on the 6-core/12-thread Ryzen 5 CPU the best result is with 64 app threads.
- on the 2x 6-core/12-thread Xeon CPUs - with 128 app threads.
With 256 threads I get a bad result. I suppose the "reactor" thread (the thread where the async HTTP processing is performed) becomes a bottleneck, but I am not sure.
Offline
@ab - I removed the SQLite3 code from the TFB tests to prevent confusion like this - https://github.com/TechEmpower/Framewor … ssues/7692
Also increased the threads to CPUCount*5 (instead of *4), so it will be 24*5 = 120.
Last edited by mpv (2022-11-12 12:27:53)
Offline
About the problems with the /plaintext endpoint:
My current exploration shows that if I increase the thread pool, then the CPU consumption of the main thread (in user space, so this is our code - not epoll) increases, and performance decreases.
The content of the HTTP headers doesn't matter; the problem is reproduced without headers (so it's not the header parser) using the command
wrk -d 10 -c 512 -t 8 "http://192.168.1.6:8080/plaintext"
On the 12-CPU desktop:
for "./raw 24" the main thread of the 24-thread async server consumes 600% CPU (6 cores), 50% of the overall CPU load is in kernel space, and the result is 751294 RPS
......
for "./raw 96" the main thread of the 96-thread async server consumes 1000% CPU (10 cores), 90% of the overall CPU load is in USER space, and the result is 262234 RPS
Unfortunately, the problem is not reproduced under valgrind, because valgrind in instrumentation mode is slow by itself...
Last edited by mpv (2022-11-13 15:48:59)
Offline
Update to the previous post - it was my measurement mistake - all threads use nearly the same portion of CPU (25% - 9%). But overall CPU usage moves from kernel space to user space as noted above.
Last edited by mpv (2022-11-13 17:09:33)
Offline
@mpv
Hope you are not too bad.
What do you call the "main" thread?
What is its name in htop for instance?
Is it the main process thread (named "main"), or is it TAsyncServer.Execute (named "A..."), or THttpAsyncServer.Execute (named "W..."), or TAsyncConnectionsThread.Execute (named "R..." - probably "R0" for the epoll main thread)?
I only have a small CPU (2 cores) so I can't reproduce the issue.
Offline
I'm currently without electricity, GSM and internet most of the time. I hope we recover our infrastructure soon... In any case, it is better than russian occupation.
As I noted above, all threads use nearly the same portion of CPU (the idea that all the load was in one thread was my measurement mistake), but when I increase the worker count, overall CPU usage moves from kernel space (red mark in htop) to user space (green mark).
Last edited by mpv (2022-11-25 10:34:05)
Offline
Could you send a screenshot of htop so that I could see the various threads?
8-)
If you have a little time, look at https://forum.lazarus.freepascal.org/in … #msg461281
Where our little mORMot can do wonders for a more realistic use case than the TechEmpower benchmark.
Offline
Nice - now I know how to turn on custom thread names in htop.
Command
wrk -d 10 -c 512 -t 8 "http://192.168.1.6:8080/plaintext"
HTOP
24 th ~ 700k RPS : (https://drive.google.com/file/d/1LDU3C1 … sp=sharing)
96 th ~ 400k RPS: (https://drive.google.com/file/d/1_FmtiQ … sp=sharing)
Sorry for the delay.
Last edited by mpv (2022-11-26 18:14:35)
Offline
look at https://forum.lazarus.freepascal.org/in … #msg461281
Where our little mORMot can do wonders for a more realistic use case than the TechEmpower benchmark.
Nice. It would be good to put the mORMot numbers into the project README.md, because there are no Pascal numbers on its main page.
But TFB is a well-known benchmark and it would be good to be in the top 20 there IMHO.
I see strange results for /rawqueries in pipelining mode - 636 RPS is very strange to me and I can't reproduce it.
Offline
You are right: /rawqueries should be faster than /queries.
Also the cached queries are at 3,082 RPS, whereas they should be close to JSON serialization.
It is weird that the relative performance between tests on their HW does not match what we obtain in our own tests.
And also the relative performance against Lithium, e.g. for /plaintext.
It is likely to be a problem of worker threads scheduling, triggered only with a lot of threads.
If there are more threads, it should not be slower. The threads should just remain inactive.
Yes, by default, thread names are not displayed in htop - you have to change the config.
The fact that the R0* thread is the first one is not a good sign.
Normally its CPU usage should be very low: all the other threads are working, so there is nothing to do in R0, which should just wait for new data.
On my PC, R0* has a low CPU usage.
We may try to switch to the poll API instead of epoll... and see.
Try to define the USEPOLL conditional (e.g. compile with -dUSEPOLL).
With two potential follow-ups:
1. There may be a problem with the "triggering" mode of epoll.
2. Or we may unsubscribe/subscribe each socket when it is handled in a thread... for epoll this is supposed to be fast, and it would avoid false positives which let the epoll wait API call return without waiting.
Offline
My observations:
- on my hardware I also DO NOT reproduce the bad performance for "/cached-queries?count=20" - it gives 75561 RPS vs, for example, 9535 RPS for "/queries?queries=20"
- using poll gives -10% RPS vs epoll. R0 is on top there also. The degradation is the same for 24 vs 96 threads, so the problem is not in poll/epoll.
About CPU scheduling - you are right - there is an abnormal increase of cpu-migrations (and a HIGH context-switch count in both cases, IMHO) and of user-space CPU utilization, for both poll and epoll, when the worker count increases - see the truncated perf output below for 24 vs 96 workers.
I think this can be the root of the problem. I found that in the h2o test the author sets the affinity mask for each thread manually - see here.
Do you think it can help?
$ perf stat ./raw 24
Performance counter stats for './raw 24':
63457,69 msec task-clock # 4,301 CPUs utilized
1739496 context-switches # 27,412 K/sec
7200 cpu-migrations # 113,461 /sec
1021 page-faults # 16,089 /sec
...
13,730222000 seconds user
50,388775000 seconds sys
$ perf stat ./raw 96
Performance counter stats for './raw 96':
78413,93 msec task-clock # 5,293 CPUs utilized
3461114 context-switches # 44,139 K/sec
117782 cpu-migrations # 1,502 K/sec
1460 page-faults # 18,619 /sec
...
36,417579000 seconds user
42,630085000 seconds sys
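For reference, here is a hypothetical sketch of how a thread could be pinned to one logical CPU on Linux, via glibc's sched_setaffinity() with pid=0 meaning the calling thread - this is not an existing mORMot API, just an illustration of the affinity idea mentioned above:

program pincpu;

{$mode objfpc}

uses
  ctypes;

// glibc entry point: pid = 0 applies the mask to the calling thread
function sched_setaffinity(pid: cint; cpusetsize: csize_t; mask: pointer): cint;
  cdecl; external 'c' name 'sched_setaffinity';

type
  TCpuSet = array[0..127] of byte; // 1024-bit mask, like glibc cpu_set_t

procedure PinCallingThreadToCpu(cpu: integer);
var
  mask: TCpuSet;
begin
  FillChar(mask, SizeOf(mask), 0);
  mask[cpu shr 3] := 1 shl (cpu and 7); // CPU_SET(cpu) on a zeroed mask
  sched_setaffinity(0, SizeOf(mask), @mask);
end;

begin
  PinCallingThreadToCpu(0); // e.g. bind this thread to logical CPU #0
end.

Whether pinning the R* worker threads this way would actually reduce the cpu-migrations shown above is exactly the open question.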
Last edited by mpv (2022-11-27 12:13:57)
Offline