While reading this brilliant book (it has changed my mind about many programming aspects) I found that the file descriptor returned by the accept() call inherits most of the parent socket's properties under Linux.
A small patch to TCrtSocket in merge request #278 allows saving 3% of electricity.
10000 requests to the /timestamp endpoint, measured using strace -cf ./test (output cropped), show that the 3% of time spent in setsockopt (60 000 calls, or 6 calls per accept) goes away.
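A minimal Free Pascal sketch of the idea, using the plain Sockets unit rather than the actual TCrtSocket code (names and option choices are mine): set the options once on the listening socket and rely on Linux copying them into every socket returned by accept(), so no per-connection setsockopt() is needed (note this inheritance is Linux behaviour, not a POSIX guarantee):
program ListenOnce;
{$mode objfpc}{$H+}
uses
  Sockets;
var
  ListenSock, ClientSock: longint;
  Addr: TInetSockAddr;
  AddrLen: TSockLen;
  Opt: longint;
begin
  ListenSock := fpSocket(AF_INET, SOCK_STREAM, 0);
  Opt := 1;
  // set once on the listening socket: under Linux the accept()ed sockets
  // inherit these options, so no setsockopt() per connection is required
  fpSetSockOpt(ListenSock, SOL_SOCKET, SO_REUSEADDR, @Opt, SizeOf(Opt));
  fpSetSockOpt(ListenSock, SOL_SOCKET, SO_KEEPALIVE, @Opt, SizeOf(Opt));
  Addr.sin_family := AF_INET;
  Addr.sin_port := htons(8881);
  Addr.sin_addr.s_addr := 0; // INADDR_ANY
  fpBind(ListenSock, psockaddr(@Addr), SizeOf(Addr));
  fpListen(ListenSock, 128);
  AddrLen := SizeOf(Addr);
  ClientSock := fpAccept(ListenSock, psockaddr(@Addr), @AddrLen); // inherits the options above
  // ... serve ClientSock without any further setsockopt() calls ...
  CloseSocket(ClientSock);
  CloseSocket(ListenSock);
end.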
Before patch
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
47.95 1.680828 55 30689 276 futex
37.31 1.307881 131 10001 accept
3.33 0.116714 2 60018 setsockopt
2.47 0.086639 9 10004 sendto
1.87 0.065584 7 10002 recvfrom
1.75 0.061495 1 70019 clock_gettime
1.52 0.053310 5 10232 close
1.35 0.047174 5 10002 shutdown
After patch
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
34.84 1.309760 43 30652 349 futex
28.41 1.068102 1068102 1 1 rt_sigsuspend
26.34 0.990362 99 10001 accept
2.21 0.083104 8 10004 sendto
1.97 0.073905 1 70019 clock_gettime
1.57 0.058937 6 10002 recvfrom
1.24 0.046750 5 10231 close
1.24 0.046503 5 10002 shutdown
The next optimization targets could be the huge number of calls to futex (I have not found the reason yet) and clock_gettime (the latter comes from GetTick64 in TrySockRecv). Help is welcome here.
Last edited by mpv (2020-03-06 22:27:29)
Offline
Nice trick!
I have merged it.
About the futex, is it the critical section used in TSynThreadPool.Push()?
Perhaps TSynThreadPoolSubThread.Execute may also be refactored to use an InterlockedExchange() instead of EnterCriticalSection(fOwner.fSafe).
I guess that strace reports a lot of time because there is contention.
We may find some ideas from https://github.com/BeRo1985/pasmp
Offline
I have committed https://synopse.info/fossil/info/d79db41d83
It should reduce the thread contention, by using two locks in TSynThreadPool (one for the main threads list, and another for the contention list), plus another per-thread lock for the processing flag pointer.
There was a single critical section, which was much more likely to trigger concurrent access, and therefore futex API calls. IIRC if there is no concurrent access, there is no OS API call.
Offline
One question: why did you choose TRTLCriticalSection instead of TSynLocker?
Offline
@ab - please roll back these changes - https://synopse.info/fossil/info/d79db41d83 - they produce a deadlock (randomly, sometimes after 1000 requests, sometimes after 9000); it seems to occur when PendingContextCount becomes > 0. In any case (even without the deadlock) the number of futex calls after this patch remains the same (accept x 3) in HTTP 1.0 mode (without keep-alive).
Last edited by mpv (2020-03-08 13:48:30)
Offline
I found where one of the futex calls comes from - it is not from the calls to Enter/LeaveCriticalSection but from the call to fEvent.SetEvent (I used "writeln-based debugging" and I am sure this is the place).
Last edited by mpv (2020-03-08 15:36:44)
Offline
Please check https://github.com/synopse/mORMot/commi … ee5bc1c915
It reverts to a single CS, but still with some enhancements.
It also reduces the TObjectList/TList usage.
Offline
Please see also https://synopse.info/fossil/info/1cd7d503dd
TSynList/TSynObjectList should be a cross-compiler, efficient TList/TObjectList replacement, without any notifications, but slightly faster and easier to inherit/extend.
Offline
Now everything seems to be stable (at least from my CI's point of view).
As a result, in a more or less realistic scenario (TechEmpower single DB query for an SQLite3 DB + SyNode - plain mORMot should be faster - with HTTP 1.0 and 400 concurrent connections) performance increased from ~35000 RPS to ~37000 RPS (on my Core i5U Linux laptop).
At the low level, the next candidate for optimization is clock_gettime:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
18.90 6.102560 6102560 1 1 rt_sigsuspend
15.56 5.023886 9 546456 clock_gettime
15.34 4.952688 91 54637 accept
14.26 4.603774 115 40100 mprotect
7.69 2.481902 61 40805 3434 futex
4.08 1.317904 232 5692 569 lstat
3.73 1.204458 22 55395 close
3.68 1.188450 22 54635 sendto
2.12 0.684840 649 1056 lseek
2.07 0.668234 12 54638 387 shutdown
1.98 0.637603 12 55438 fstat
1.87 0.604021 11 54639 poll
1.71 0.552490 1309 422 read
1.47 0.473438 9 54638 recvfrom
1.46 0.471973 9 54637 getpeername
1.43 0.460655 8 54678 5 ioctl
I'm not sure, but could we compute the timeout in TCrtSocket.TrySockRecv / TrySndLow only when a single read/write operation does not receive/send all the data? In most cases we read/write everything we need, so calling GetTick64 at the beginning is not necessary.
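A hedged sketch of what I mean (hypothetical helper, not the actual TrySockRecv code; the socket is assumed to have a receive timeout or to be polled elsewhere): compute the deadline lazily, only when the first recv() turns out to be partial, so the common single-recv case never calls the tick counter:
{$mode objfpc}{$H+}
uses
  SysUtils, Sockets;

// read exactly Len bytes into Buf, giving up TimeoutMS after the first partial recv
function TryRecvAll(Sock: longint; Buf: PByte; Len: integer;
  TimeoutMS: cardinal): boolean;
var
  Received: PtrInt;
  EndTix: QWord;
begin
  result := false;
  EndTix := 0; // 0 = deadline not computed yet
  while Len > 0 do
  begin
    Received := fpRecv(Sock, Buf, Len, 0);
    if Received <= 0 then
      exit; // socket error or graceful close
    dec(Len, Received);
    inc(Buf, Received);
    if Len = 0 then
      break; // common case: everything arrived at once, GetTickCount64 never called
    if EndTix = 0 then
      EndTix := GetTickCount64 + TimeoutMS // lazy: only computed for partial reads
    else if GetTickCount64 > EndTix then
      exit; // timed out on a partial read
  end;
  result := true;
end;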
Offline
Nice!
Please check https://synopse.info/fossil/info/606597ced5
Offline
And finally I found the root of the evil. FPC uses a raw syscall for clock_gettime, while libc uses the vDSO (a Linux mechanism that exposes some kernel-level functions to user space). See https://github.com/synopse/mORMot/pull/281/files
GetTickCount64 fpc 2 494 563 op/sec
GetTickCount64 libc 119 919 893 op/sec
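A minimal sketch of the idea behind the pull request (not the actual mORMot code; the clock id constant is an assumption for Linux): bind clock_gettime from libc, so the call goes through the vDSO instead of a kernel round-trip, and build a millisecond tick counter on top of it:
{$mode objfpc}{$H+}
uses
  BaseUnix;

const
  CLOCK_MONOTONIC_ID = 1; // Linux clockid_t value for CLOCK_MONOTONIC

// resolved against libc, so the vDSO fast path is used instead of a syscall
function clock_gettime_libc(clk_id: longint; tp: PTimeSpec): longint;
  cdecl; external 'c' name 'clock_gettime';

function LibcGetTickCount64: Int64;
var
  ts: TTimeSpec;
begin
  clock_gettime_libc(CLOCK_MONOTONIC_ID, @ts);
  result := Int64(ts.tv_sec) * 1000 + ts.tv_nsec div 1000000;
end;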
The picture now looks like this:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
22.92 8.514322 214 39795 mprotect
16.64 6.183122 45 138342 accept
11.75 4.365739 31 139100 close
10.13 3.764708 27 138344 sendto
7.06 2.624247 19 138343 recvfrom
6.69 2.486703 18 138383 1 ioctl
6.03 2.241786 16 138343 393 shutdown
5.82 2.161967 16 138342 getpeername
5.56 2.064928 15 138344 poll
3.14 1.165810 3490 334 79 futex
Not ideal yet, but MUCH better.
In absolute numbers (see my previous post): 37680 RPS.
Last edited by mpv (2020-03-09 21:21:05)
Offline
Very nice finding!
I have merged your pull request.
I suspect it could lead to a great deal of performance improvement on a highly loaded server with a lot of REST calls - and therefore a lot of timing.
Note that GetTickCount64() won't be the only beneficiary: I suspect QueryPerformanceMicroSeconds() would also benefit from using the libc.
I guess "mprotect" is from the heap manager reserving memory.
Offline
Our previous commit about GetTick64 removal broke things on Windows.
So I kept it only on POSIX.
See https://synopse.info/fossil/info/1078ba0f05
Offline
Actually we do not need to remove it even on POSIX, now that GetTickCount64 is fixed...
About mprotect - I also think this is heap allocation, so for now I do not plan to optimize it (and most of it is in my code, not in mORMot).
The things that can still be optimized are getpeername and ioctl.
For getpeername I found an elegant solution - it comes from the GetRemoteIP call inside TCrtSocket.AcceptRequest. But we already have the remote IP after the call to accept in the THttpServer.Execute loop. So we can do
ServerSock := fServer.fSocketClass.Create(fServer); IPText(ClientSin, ServerSock.fRemoteIP);
right in the main loop instead of in the thread pool, and pass the ServerSock instance to Task() instead of the socket fd.
This removes the getpeername call on both Windows and POSIX.
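A plain-Sockets sketch of that point (hypothetical names, not the actual THttpServer code): accept() already fills in the peer address, so the textual remote IP can be taken from it directly, with no getpeername() call:
{$mode objfpc}{$H+}
uses
  Sockets;

function AcceptWithRemoteIP(ListenSock: longint; var RemoteIP: string): longint;
var
  ClientSin: TInetSockAddr;
  SinLen: TSockLen;
begin
  SinLen := SizeOf(ClientSin);
  result := fpAccept(ListenSock, psockaddr(@ClientSin), @SinLen);
  if result >= 0 then
    RemoteIP := NetAddrToStr(ClientSin.sin_addr) // peer address comes from accept() itself
  else
    RemoteIP := '';
end;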
But I want to discuss going one step further.
The problem:
In the current implementation we accept a connection and, if all threads are busy, put it into the contention queue (Windows does the same at the HTTP.SYS level).
If the server process has to be closed - because of internal errors, because the OS kills it, or because of a power down - we lose all queued sockets (and the data clients have already sent but we have not read yet - at least the first 4 KB (MTU size) is already sent by the client and buffered in the kernel network interface).
The worst case happens when all threads are busy for a long time (for example database problems or something wrong in the server business logic), the contention queue overflows and the server reboots. All 1000 (default contention queue size) queries are lost.
The idea:
Why should we accept a socket connection at all when there is no free thread in the thread pool?
The socket already has a queue for non-accepted connections (see DefaultListenBacklog in SynCrtSock). If we have a short performance problem, it will work as an analogue of our contention queue. If we have a long-term performance problem, it works even better - the caller gets an error on connect() and:
- in case of a load balancer - it selects the next node from the upstream
- in case of a direct connection - it can try to reconnect
The pros:
- a kernel-level contention queue
- no "parasite" traffic while the server is busy
- resource saving (every accepted connection we put into the contention queue takes one port, creates memory buffers, etc.)
- we can simplify our thread pool implementation - remove the contention queue and simplify the selection of a free thread
The implementation of the HTTP server main loop could be:
while not Terminated do begin
  waitForFreeThreadInPool();
  SockFd := Accept(Sock.Sock, ClientSin);
  ServerSock := fServer.fSocketClass.Create(fServer);
  IPText(ClientSin, ServerSock.fRemoteIP);
  OnConnect;
  ThreadPool.Push(ServerSock);
  ....
end;
Everything looks so beautiful that maybe I have missed some hidden problems?
Sorry for the long post - until the end of the week I temporarily have a very bad connection, so I am buffering all my ideas.
Offline
But is the error on connect acceptable?
For most simple use cases, without any load balancer, the re-connection seems like a lot of trouble if the server is just full for a few milliseconds.
Perhaps we may still use a queue, and call something like waitForFreeThreadInPoolOrPlaceInQueue() before Accept().
Then for a load balancer, we just set up our server with no queue and let it reject the connection and switch to another server.
Offline
Not exactly. If we do not call accept(), then even without any load balancer the client will just wait (up to the client-side read timeout), because on the server side the OS puts the connection into the listen socket backlog. A good explanation of the backlog is here
Just try this:
procedure THttpServer.Execute;
..
// main server process loop
if Sock.Sock>0 then
try
while not Terminated do begin
sleep(10000); <------------------------- the thread pool is full: wait here for 10 seconds
ClientSock := Accept(Sock.Sock,ClientSin);
I verified it - the clients (browser, curl, wrk) do wait.
Offline
In fact, there is already such a waitForFreeThreadInPoolOrPlaceInQueue() waiting loop in TSynThreadPool.Push, when the queue is full.
It will wait up to ContentionAbortDelay ms (5000 by default) and during this delay no new connection is ACCEPTed, since Push() is blocking and does not return to the main THttpServer.Execute loop.
This ContentionAbortDelay was only effective for IOCP, but I have just made it work for the socket server too.
To test it, just set a small queue (e.g. QueueLength=10 instead of 1000 slots) and see it happening.
Together with a refactoring of RemoteIP, taken directly without a GetPeerName call:
please check https://synopse.info/fossil/info/62688161a7
Another potential optimization may be to reuse the THttpServerSocket instances between the calls.
No need to reallocate the SockIn buffer for instance.
But it may be premature optimization.
Offline
@ab - I understand about accept() now - thanks!
I have also applied another small optimization for Linux. Now everything is ideal from my point of view (I agree that reusing the THttpServerSocket instances is premature optimization - see below).
I will try to collect everything together in this post - maybe you will decide to turn it into a blog article (IMHO we are very, very fast in the REST scenario (even without these optimizations) compared to other FPC/Delphi based HTTP servers).
1) Testing methodology
As the server under test we take Samples/21 - the HTTP Client-Server performance server - set it to HTTP 1.0 mode and change the port to 8881 (which allows binding without root permission on Linux):
// launch the server
aHTTPServer := TSQLHttpServer.Create('8881',[aServer]);
// disable keep-alive to prevent thread spawn for every Keep-Alive HTTP connection
(aHTTPServer.HttpServer as THttpServer).ServerKeepAliveTimeOut := 0;
As the client we use wrk - a well-known tool for load testing HTTP servers - running with 8 threads and 400 connections for 5 seconds:
wrk -t8 -c400 -d5s http://localhost:8881/root/timestamp
For every test we warm up the server first, then attach strace to the server process and measure the syscalls
sudo strace -cf -p `pidof Project21HttpServer`
and then stop strace and measure performance.
All tests run on a laptop (this is why we run each test for only 5 seconds and wait between tests - to prevent CPU throttling):
Linux 5.3.0-40-generic (cp65001) 8 x Intel(R) Core(TM) i5-8300H CPU @ 2.30GHz (x64)
The initial stage: ~58 000 requests per second (RPS)
Running 5s test @ http://localhost:8881/root/timestamp
8 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.16ms 9.16ms 211.01ms 99.46%
Req/Sec 7.39k 1.97k 13.89k 71.50%
294288 requests in 5.03s, 50.80MB read
Requests/sec: 58449.58
Transfer/sec: 10.09MB
strace shows:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
62.66 40.416056 1543 26195 poll
17.38 11.208072 61 183355 clock_gettime
5.41 3.486260 10694 326 90 futex
4.28 2.762974 105 26195 accept
3.81 2.455415 16 157170 setsockopt
2.32 1.496650 57 26195 close
1.42 0.913937 35 26195 getpeername
1.23 0.792855 30 26191 sendto
0.52 0.334711 13 26198 recvfrom
0.50 0.321031 12 26195 362 shutdown
0.47 0.305818 12 26195 3 ioctl
0.00 0.002638 36 73 select
Too many system calls. Let's try to remove some of them:
1) clock_gettime - a strange thing in the FPC codebase. Fixed by enabling the vDSO time function
2) setsockopt - Linux gives us file descriptor property inheritance - let's use it
3) getpeername - let's use the peer name we got from accept() instead of calling getpeername, and optimize accept for high load
At this stage the result is ~61 000 RPS
Running 5s test @ http://localhost:8881/root/timestamp
8 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.93ms 5.14ms 209.70ms 99.70%
Req/Sec 7.74k 1.69k 11.72k 69.25%
308383 requests in 5.03s, 53.23MB read
Requests/sec: 61276.04
Transfer/sec: 10.58MB
and the syscalls are
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
60.56 63.161843 845 74746 poll
12.17 12.694926 170 74747 1 recvfrom
7.61 7.938714 106 74747 close
6.68 6.970383 93 74746 4 ioctl
4.63 4.825983 4699 1027 238 futex
3.03 3.164977 42 74746 accept
3.01 3.142014 42 74742 sendto
2.30 2.398296 32 74747 342 shutdown
0.00 0.002180 46 47 select
Much better, but not ideal - if we look at an nginx strace for the same scenario, shutdown and ioctl are absent there.
Let's do the same in our mORMot and remove the unnecessary ioctl and shutdown calls.
And the final result is ~63 000 RPS
Running 5s test @ http://localhost:8881/root/timestamp
8 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.96ms 10.01ms 210.92ms 99.35%
Req/Sec 8.04k 1.89k 13.67k 71.50%
320319 requests in 5.04s, 55.29MB read
Requests/sec: 63547.40
Transfer/sec: 10.97MB
and the system calls are
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
70.36 94.925387 978 97030 poll
9.86 13.308974 137 97030 close
7.01 9.459046 97 97027 sendto
6.23 8.405241 87 97039 recvfrom
4.25 5.736011 2154 2663 684 futex
2.28 3.075656 32 97030 accept
0.00 0.004272 41 103 select
The ideal code is not the one with nothing left to add, but the one with nothing left to remove.
As a result, in the REST scenario the mORMot HTTP server performance under Linux increased from 57 000 RPS to 63 000 RPS (10%).
And all of this would have been impossible without the excellent Linux tools like strace, perf and valgrind that we unfortunately miss on the Windows platform.
P.S.
If the server is in keep-alive mode, the same test gives us an AWESOME 170 000 RPS, but it creates too many threads (for a while). So (IMHO) it is applicable to scenarios with <1000 concurrent users.
Last edited by mpv (2020-03-12 16:13:46)
Offline
Great!
I guess that with nginx as a reverse proxy, we could have slightly better numbers: keep-alive from the client side, and fewer connections to the mORMot server.
Did you make some benchmark tests with nginx in reverse proxy?
I also suspect that a remote connection over a fast local network (1 Gbit link) would be worth measuring.
With nginx as reverse proxy, we would measure with a more realistic scenario.
And it may also validate if using Unix Socket instead of TCP Socket between nginx and the mORMot server makes another difference...
You did an awesome job!
Offline
To test the configuration with nginx we need a powerful server - on a laptop everything becomes slow (25000 RPS at most). This is because the kernel needs to handle twice as many sockets compared to a direct connection. I hope I can set up an appropriate infrastructure next week (in our country the government is moving people to remote work because of COVID, so access to infrastructure is difficult).
Offline
Thanks to Christmas I found some time for mORMot2 (still not enough to migrate my product to it, but enough to start exploring the codebase more closely).
The main mORMot2 feature for me is the epoll-based (Linux) HTTP 1.1 (keep-alive) HTTP/WebSocket server with a fixed-size worker thread pool.
So I started with a dummy THttpAsyncServer instance to see how it works - the results are PERFECT: I got ~213 000 RPS for a dummy server with 1000 concurrent keep-alive connections.
But there are some problems:
- the first one is the broken keep-alive - it is simple and fixed in #70
- the second one is about performance and is more complex: THttpAsyncServer in idle mode (no requests) uses ~3% of the host CPU. This is because of massive numbers of futex & nanosleep syscalls:
$ sudo strace -cf -p `pidof httpServerRaw`
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
32.20 0.329206 120 2733 1366 futex
29.48 0.301435 46 6492 nanosleep
The source of these is TPollSockets.GetOne - the futex is the critical section inside GetOnePending and the nanosleep is inside the SleepStep call.
We definitely should do something about this.
Currently I can't offer a good solution, just some ideas:
- timeouts can be implemented using timerfd_* together with epoll
- the lock in the request queue can be avoided by using a ring buffer
I will continue to investigate.
Last edited by mpv (2022-01-07 18:46:13)
Offline
These are the syscall statistics for 521053 HTTP keep-alive requests (a 10-second load with 1000 concurrent connections):
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
23.46 13.710255 331 41384 12693 futex
21.29 12.438808 8421 1477 1 accept
21.11 12.336652 858 14370 nanosleep
17.78 10.389063 19 522047 sendto
15.20 8.882895 17 522076 recvfrom
1.05 0.612528 333 1835 epoll_wait
0.04 0.024850 8 2952 epoll_ctl
~45% of the time we are in the futex and nanosleep syscalls.
Last edited by mpv (2022-01-07 18:45:19)
Offline
I may need to replace GetOnePending() with a lockless queue, and a classical TEvent thread wakeup.
What I don't understand is that within TPollSockets.GetOne, if no client is connected, SleepStep() should be in 120-250 ms steps, so should be at almost 0% CPU use.
Could you try https://github.com/synopse/mORMot2/commit/3c8b0ea8 ?
I tried to ensure that those 120-250ms steps were used in case of idle server.
Perhaps the culprit is not GetOne or PollForPendingEvents.
Edit: I suspect my modification won't change anything.
Because we are not in an atpReadSingle thread, but in an atpReadPool configuration.
So normally an idle server would be detected and run this:
if fOwner.fClients.fRead.Count = 0 then
fEvent.WaitFor(INFINITE); // blocking until next accept()
which should consume 0% of CPU until next accept().
Edit 2: In fact, this WaitFor(INFINITE) is only reached at the initial accept(), not at any later time when no client is connected.
I have written https://github.com/synopse/mORMot2/commit/5fdcbbce to ensure if no client is connected, then we wait for the next Accept().
Hope it helps for reaching 0% CPU on Idle server.
Remark: I have explicitly documented that this new async server is meant for a huge number of concurrent connections. If you have only a few connections at once, then the regular/old thread server is still there, and may use the resources better.
Offline
Still 3% CPU for an idle server with the latest sources.
I profiled it using valgrind - most of the CPU time is spent in the internal fpc_initialize/fpc_finalize calls that wrap the PollForPendingEvents function:
THTTPAsyncServer.Execute
if fAsync.fClients.fWrite.GetOne(1000, 'W', notif) then
if fPending.Count = 0 then
PollForPendingEvents({timeoutMS=}10); <-------- this call
I removed all try/finally blocks inside PollForPendingEvents, but fpc_initialize/fpc_finalize are still there. Is this because of the records in the var block?
P.S.
the raw server sources I use for testing
Last edited by mpv (2022-01-08 10:06:59)
Offline
So it is about polling for writes.
Most of the time, there is no pending output to write. So I guess polling here does not make any sense.
I will poll only if needed.
The fpc_initialize/fpc_finalize calls are for the local sub/new records.
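For illustration only (not mORMot code): any local record with a managed field - string, dynamic array, variant or interface - makes FPC emit such an implicit init/finalize frame for the whole routine:
{$mode objfpc}{$H+}
type
  TPendingNotify = record
    fd: longint;
    tag: string; // managed field -> compiler-generated initialization/finalization
  end;

procedure PollOnce;
var
  notif: TPendingNotify; // this local alone adds fpc_initialize/fpc_finalize to PollOnce
begin
  notif.fd := -1;
  notif.tag := '';
  // ... use notif ...
end;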
IIRC epoll_wait() can still run in a thread while epoll_ctl() is done in another thread.
So I guess that on Linux we could use a single fPoll[], then use epoll_wait() timeout for waiting instead of polling.
For select and poll (Windows and BSD), this is not possible because those are not thread-safe.
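A hedged, Linux-only sketch of that single-poll idea (assumed fd and buffer names, not the actual TPollSockets code): let epoll_wait() itself do the waiting with a timeout, so an idle thread blocks in the kernel instead of looping over nanosleep:
{$mode objfpc}{$H+}
uses
  Linux, ctypes;

procedure WaitAndDispatch(epfd: cint; timeoutMS: cint);
var
  events: array[0..63] of TEPoll_Event;
  n, i: cint;
begin
  // blocks until at least one event arrives or timeoutMS elapses:
  // no user-land sleep/poll loop, so an idle thread uses 0% CPU
  n := epoll_wait(epfd, @events[0], Length(events), timeoutMS);
  for i := 0 to n - 1 do
    writeln('pending event on fd ', events[i].data.fd); // dispatch to the owning connection here
end;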
Offline
I spent the whole day making a huge refactoring of epoll use.
Currently poll is actually faster than epoll... not what we expected!
Several issues have been identified. I almost doubled the keep-alive performance.
But it broke how WebSockets sent their frame asynchronously using epoll...
So it is not finished yet.
I hope I will publish something stable tomorrow.
Offline
Please try https://github.com/synopse/mORMot2/commit/98d97175
From my tests, the epoll API should use far fewer OS API calls, and also give better performance in keep-alive (HTTP/1.1) mode, as well as when upgraded to WebSockets.
Still need to tune the multi connection (HTTP/1.0) performance.
I am looking forward to your feedback and numbers.
Edit: Please check also https://github.com/synopse/mORMot2/commit/013be075
It should enhance HTTP/1.0 performance on small requests.
Offline
My test results (HTTP server with 4 threads):
- in idle mode CPU consumption is 0% - GOOD
- when requesting with 100 concurrent HTTP 1.1 connections the max RPS is 274 176 - impressive!
For comparison, on the same hardware for the same test we are:
- 6x faster than node.js, which gives 41 000 RPS (with a mORMot thread pool size of 1 I still got 145 659 RPS)
- 2x faster than nginx!!, which gives 104 642 RPS (4 worker threads - the same as for our server)
Now the bad part:
The implementation is not stable - under some conditions the server hangs when ServerThreadPoolCount > 1.
For ServerThreadPoolCount=4, 100 (and even 500) concurrent connections never hang the server, but 1000 does (after some number of requests, 2 threads consume 1 core each after the test ends; the server continues to accept and even read request headers, but never answers).
For ServerThreadPoolCount=16, 500 concurrent connections always hang it (after some number of requests the server consumes 0 CPU, accepts, reads headers, but never answers).
Verified with -O3; the same with the -O1 optimization flag.
Offline
Thanks for the feedback.
Please try with my latest commits of today.
I have fixed some issues, and tried to stabilize the process.
I am also interested in the HTTP/1.0 numbers, which should be much better now.
Could you try to enable the logs and see what is happening when it hangs?
(set aLogVerbose=true last parameter to THttpAsyncServer.Create)
I have your sample server, but how do you create the clients? Which tool do you use?
Edit: try to reduce SleepHiRes(500); into SleepHiRes(10); in line #2914 of mormot.net.async.pas.
Perhaps the async sending thread - from THttpAsyncServer.Execute - is not responsive enough. But I doubt it is the culprit, because only small writes are involved in your test, so no async sending should be triggered.
Offline
With the latest commit the situation is the same.
I still use wrk as the stress-load tool - see the "Load testing tool" thread for details; there are other tools @Eugene Ilyin mentions in this post (Class D).
The problem can be reproduced, for example, on a server with 16 threads using
~/dev/wrk/wrk -c 500 -t 8 http://localhost:8888/echo
After the stress test we can see (using `strace curl -v http://localhost:8888/echo`) that the server accepts the request but does not respond.
And I got a stack trace in the log!
EInvalidPointer {Message:"Invalid pointer operation"} [R0mormotHttpSvr] at 41a1f3 httpServerRaw.lpr (82)
../../src/net/mormot.net.sock.pas mergependingevents (2265)
../../src/net/mormot.net.sock.pas tpollsockets.pollforpendingevents (2464)
../../src/net/mormot.net.async.pas tpollasyncreadsockets.pollforpendingevents (949)
../../src/net/mormot.net.async.pas tasyncconnectionsthread.execute (1625)
Last edited by mpv (2022-01-13 08:55:29)
Offline
I will verify HTTP 1.0 this evening, but it is a little hard to do - we quickly run out of sockets (they remain in the "closed" state) - this is currently my main problem in mORMot1 without keep-alive.
Even with such OS tuning, behind nginx (nginx uses its connections to the back-end very efficiently), with all static content moved to nginx etc., the maximum number of actively working users for one IP address is about 10000. So we need to run several servers on several IPs to handle more. Luckily I have only one client with such a huge user base.
Offline
I guess there was an allocation problem in MergePendingEvents().
Should be fixed by https://github.com/synopse/mORMot2/commit/0c15b057
Your last test already included this fix, which is why you found it stable now.
About your max 10,000 connections, if you use unix socket (and not TCP) between nginx and the mORMot server, you won't suffer from port exhaustion.
The only limit is the OS number of file descriptors, which can be huge.
And from my tests, unix sockets are faster and safer than TCP over the loopback.
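For instance (hypothetical socket path; the mORMot server has to bind to the same path, and nginx needs write access to it), the nginx upstream could point to a Unix domain socket instead of the TCP loopback:
upstream mormot2 {
  server unix:/run/mormot2.sock max_fails=2 fail_timeout=30;
  keepalive 32;
}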
I have also made some fixes and enhancements.
Especially https://github.com/synopse/mORMot2/commit/6f735a13 which seems to fix some problems I discovered with TTestMultiThreadProcess.
Now TTestMultiThreadProcess passes with no issue with 25,000 concurrent clients:
1.1. Multi thread process:
- Create thread pool: 1 assertion passed 666us
- TCP sockets: 599,833 assertions passed 2m28
1=12329/s 2=5210/s 5=842/s
MaxThreads=5 MaxClients=25000 TotalOps=150000 TotalClients=40000
Total failed: 0 / 599,834 - Multi thread process PASSED 2m28
(numbers with logs enabled, which generates 830 MB of log, useful for tracking performance and stability problems)
Here some clients were rejected as "too many connections", but the mORMot REST client retried after a while and eventually succeeded - as they should in the real world.
From my point of view, the server seems stabilized now.
Any more feedback and numbers are welcome!
And... please ensure you include mormot.core.fpcx64mm.pas because it has been found to be very efficient in practice.
Ensure you defined -dFPC_X64MM and -dFPCMM_SERVER for your project.
Offline
I confirm there is no problem with a huge number of concurrent connections when using Unix Domain Sockets.
If I first run
ulimit -H -n
then I can run the following tests:
1.1. Multi thread process:
- Create thread pool: 1 assertion passed 7.37ms
- Unix domain sockets: 1,918,959 assertions passed 1m12
1=17764/s 2=16365/s 5=19613/s
MaxThreads=5 MaxClients=80000 TotalOps=480000 TotalClients=128000
Total failed: 0 / 1,918,960 - Multi thread process PASSED 1m12
So here we have 80,000 concurrent clients over an HTTP/1.1 keep-alive stream, with no issue in the REST Add/Retrieve process on the ORM + SQLite3 database, and perfect scaling.
We can see that it scales much better than TCP sockets, judging from the numbers in my previous post.
Such numbers were impossible to reach with mORMot 1 server.
Offline
Yes, you are right about UDS and my mORMot1 socket problems...
I can confirm that enabling fpcx64mm improves performance - for the test case above it is +12% RPS, which is very good.
I tried to test HTTP 1.0 mode (by simply enabling nginx as a reverse proxy) and found that in both 1.0 and 1.1 mORMot returns ~2000 responses and then hangs. I will continue to investigate later (this evening) to give more info.
nginx config is:
upstream mormot2 {
  server localhost:8888 max_fails=2 fail_timeout=30;
  keepalive 32;
}
server {
  listen 82;
  server_name localhost;
  # prevent nginx version exposing in Server header
  server_tokens off;
  # Enable gzip compression.
  # Default: off
  gzip on;
  location /echo {
    proxy_pass http://mormot2;
    # proxy_http_version 1.1;
  }
  location /pure {
    return 200 'gangnam style!';
  }
}
Test command:
wrk -c 100 -t 16 http://localhost:82/echo
Last edited by mpv (2022-01-14 11:00:52)
Offline
Additional info:
I disabled x64mm and now (behind nginx) I get an AV very quickly (the AV is not logged into the file, only on the console):
20220114 12173421 ' trace mormot.net.async.THttpAsyncConnection(7f211a3574c0) OnRead hrsGetCommand len=56 GET /echo HTTP/1.0$0d$0aHost: mormot2$0d$0aConnection: close$0d$0a$0d$0a
20220114 12173421 ( trace mormot.net.async.THttpAsyncConnection(7f211a357640) AfterWrite Done ContentLength=31 Wr=0 Flags=pc
20220114 12173421 ( trace mormot.net.async.TAsyncConnectionsSockets(7f211b12f340) Write 2d closed by AfterWrite handle=36
20220114 12173421 ( trace mormot.net.async.TAsyncConnectionsSockets(7f211b12f340) UnlockSlotAndCloseConnection: Write() finished on handle=36
20220114 12173421 ( trace mormot.net.async.TAsyncConnectionsSockets(7f211b12f340) Stop sock=2d handle=36 r=2 w=0
20220114 12173421 ( trace mormot.net.async.TAsyncConnectionsSockets(7f211b12f340) OnClose {"THttpAsyncConnection(7f211a357640)":{Handle:36}}
20220114 12173421 2 trace mormot.net.async.TAsyncConnectionsSockets(7f211b12f340) Start sock=2d handle=36
20220114 12173421 2 trace mormot.net.async.THttpAsyncConnections(7f211ae3e040) Execute: Accept(8888)=20220114 12173421 2 warn mormot.net.async.THttpAsyncConnections(7f211ae3e040) Execute raised uncatched EAccessViolation -> terminate mormotHttpServer
20220114 12173421 2 info mormot.net.async.THttpAsyncConnections(7f211ae3e040) Execute: done AW mormotHttpServer
20220114 12173421 2 info mormot.net.async.THttpAsyncConnections(7f211ae3e040) Execute: State=esFinished
20220114 12173421 ' trace mormot.net.async.THttpAsyncConnection(7f211a3574c0) Write len=151 HTTP/1.0 200 OK$0d$0aServer: mORMot2 (Linux)$0d$0aX-Powered-By: mORMot 2 synopse.info$0d$0aContent-Length: 31$0d$0aConnection: Close$0d$0a$0d$0agot request from connection #36
20220114 12173421 ' trace mormot.net.async.THttpAsyncConnection(7f211a3574c0) AfterWrite Done ContentLength=31 Wr=0 Flags=pc
20220114 12173421 ' trace mormot.net.async.TAsyncConnectionsSockets(7f211b12f340) Write 2c closed by AfterWrite handle=35
20220114 12173421 ' trace mormot.net.async.TAsyncConnectionsSockets(7f211b12f340) UnlockSlotAndCloseConnection: Write() finished on handle=35
20220114 12173421 ' trace mormot.net.async.TAsyncConnectionsSockets(7f211b12f340) Stop sock=2c handle=35 r=2 w=0
20220114 12173421 ' trace mormot.net.async.TAsyncConnectionsSockets(7f211b12f340) OnClose {"THttpAsyncConnection(7f211a3574c0)":{Handle:35}}
Offline
The problem exists even if the verbose log is disabled. I think I understand the reason - see these lines:
20220114 12173421 ( trace mormot.net.async.TAsyncConnectionsSockets(7f211b12f340) Stop sock=2d handle=36 r=2 w=0
20220114 12173421 ( trace mormot.net.async.TAsyncConnectionsSockets(7f211b12f340) OnClose {"THttpAsyncConnection(7f211a357640)":{Handle:36}}
20220114 12173421 2 trace mormot.net.async.TAsyncConnectionsSockets(7f211b12f340) Start sock=2d handle=36
The same sock 2d is used twice. This may happen because nginx reuses connections aggressively.
Offline
What I don't understand is that the handle should be unique, even if the socket is reused, which should not be a problem here.
The issue is that handle=36 appears for both Stop and Start...
Is "Start sock=2d handle=36" really there twice in your log?
Could you send me a bigger log content?
Offline
I am sure the same handle appears twice. I repeated the tests - see socket 3e below:
20220114 14062029 ' trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) Write 3b closed by AfterWrite handle=50
20220114 14062029 ' trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) UnlockSlotAndCloseConnection: Write() finished on handle=50
20220114 14062029 ' trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) Stop sock=3b handle=50 r=2 w=0
20220114 14062029 ' trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) OnClose {"THttpAsyncConnection(7f81914dbb40)":{Handle:50}}
20220114 14062029 2 trace mormot.net.async.THttpAsyncConnections(7f8191fc1040) ConnectionNew {"THttpAsyncConnection(7f81914dbe40)":{Handle:52}} socket=3d count=0
20220114 14062029 2 trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) Start sock=3d handle=52
20220114 14062029 2 trace mormot.net.async.THttpAsyncConnections(7f8191fc1040) Execute: Accept(8888)={"THttpAsyncConnection(7f81914dbe40)":{Handle:52}}
20220114 14062029 - trace mormot.net.async.TPollAsyncReadSockets(7f8192051240) GetOnePending(R7 mormotHttpServer)=7f81914dbe40 1 #1/1
20220114 14062029 ( trace mormot.net.async.THttpAsyncConnection(7f81914dbcc0) Write len=151 HTTP/1.0 200 OK$0d$0aServer: mORMot2 (Linux)$0d$0aX-Powered-By: mORMot 2 synopse.info$0d$0aContent-Length: 31$0d$0aConnection: Close$0d$0a$0d$0agot request from connection #51
20220114 14062029 2 trace mormot.net.async.THttpAsyncConnections(7f8191fc1040) ConnectionNew {"THttpAsyncConnection(7f81914dbfc0)":{Handle:53}} socket=3e count=0
20220114 14062029 - trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) ProcessRead recv(3d)=Ok len=56 in 4us {"TPollAsyncReadSockets(7f8192051240)":{PendingCount:1}}
20220114 14062029 - trace mormot.net.async.THttpAsyncConnection(7f81914dbe40) OnRead hrsGetCommand len=56 GET /echo HTTP/1.0$0d$0aHost: mormot2$0d$0aConnection: close$0d$0a$0d$0a
20220114 14062029 ( trace mormot.net.async.THttpAsyncConnection(7f81914dbcc0) AfterWrite Done ContentLength=31 Wr=0 Flags=pc
20220114 14062029 * trace mormot.net.async.TPollAsyncReadSockets(7f8192051240) GetOnePending(R8 mormotHttpServer)=7f81914dbfc0 1 #1/1
20220114 14062029 * trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) ProcessRead recv(3e)=Ok len=56 in 4us {"TPollAsyncReadSockets(7f8192051240)":{}}
20220114 14062029 * trace mormot.net.async.THttpAsyncConnection(7f81914dbfc0) OnRead hrsGetCommand len=56 GET /echo HTTP/1.0$0d$0aHost: mormot2$0d$0aConnection: close$0d$0a$0d$0a
20220114 14062029 * trace mormot.net.async.THttpAsyncConnection(7f81914dbfc0) Write len=151 HTTP/1.0 200 OK$0d$0aServer: mORMot2 (Linux)$0d$0aX-Powered-By: mORMot 2 synopse.info$0d$0aContent-Length: 31$0d$0aConnection: Close$0d$0a$0d$0agot request from connection #53
20220114 14062029 * trace mormot.net.async.THttpAsyncConnection(7f81914dbfc0) AfterWrite Done ContentLength=31 Wr=0 Flags=pc
20220114 14062029 * trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) Write 3e closed by AfterWrite handle=53
20220114 14062029 * trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) UnlockSlotAndCloseConnection: Write() finished on handle=53
20220114 14062029 * trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) Stop sock=3e handle=53 r=2 w=0
20220114 14062029 * trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) OnClose {"THttpAsyncConnection(7f81914dbfc0)":{Handle:53}}
20220114 14062029 - trace mormot.net.async.THttpAsyncConnection(7f81914dbe40) Write len=151 HTTP/1.0 200 OK$0d$0aServer: mORMot2 (Linux)$0d$0aX-Powered-By: mORMot 2 synopse.info$0d$0aContent-Length: 31$0d$0aConnection: Close$0d$0a$0d$0agot request from connection #52
20220114 14062029 ( trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) Write 3c closed by AfterWrite handle=51
20220114 14062029 ( trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) UnlockSlotAndCloseConnection: Write() finished on handle=51
20220114 14062029 ( trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) Stop sock=3c handle=51 r=2 w=0
20220114 14062029 ( trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) OnClose {"THttpAsyncConnection(7f81914dbcc0)":{Handle:51}}
20220114 14062029 - trace mormot.net.async.THttpAsyncConnection(7f81914dbe40) AfterWrite Done ContentLength=31 Wr=0 Flags=pc
20220114 14062029 - trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) Write 3d closed by AfterWrite handle=52
20220114 14062029 - trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) UnlockSlotAndCloseConnection: Write() finished on handle=52
20220114 14062029 - trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) Stop sock=3d handle=52 r=2 w=0
20220114 14062029 - trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) OnClose {"THttpAsyncConnection(7f81914dbe40)":{Handle:52}}
20220114 14062029 2 trace mormot.net.async.TAsyncConnectionsSockets(7f81922b2340) Start sock=3e handle=53
20220114 14062029 2 trace mormot.net.async.THttpAsyncConnections(7f8191fc1040) Execute: Accept(8888)=20220114 14062029 2 warn mormot.net.async.THttpAsyncConnections(7f8191fc1040) Execute raised uncatched EAccessViolation -> terminate mormotHttpServer
20220114 14062029 2 info mormot.net.async.THttpAsyncConnections(7f8191fc1040) Execute: done AW mormotHttpServer
20220114 14062029 2 info mormot.net.async.THttpAsyncConnections(7f8191fc1040) Execute: State=esFinished
Can you not reproduce it?
Last edited by mpv (2022-01-14 14:12:23)
Offline
Perhaps I understand what happened.
After a connection, TAsyncServer.Execute runs TPollAsyncSockets.Start which may delegate the whole connection process to a background thread.
In our case, the connection instance is actually deleted in this background thread BEFORE the connection is logged in Accept()... therefore a random GPF occurs.
Please check https://github.com/synopse/mORMot2/commit/335b2cb1
Offline
Now it works; with wrk it does not even hang, but it is very slow (2000 RPS).
There is a random 60-second delay; it can be reproduced if I send queries through nginx by hand with curl (Up, Enter; Up, Enter in the console) using the command `curl -v http://localhost:82/echo`.
After 15-100 requests (randomly) the server stops responding, thinks for one minute and then continues to respond - see the time in line 4 (15:34) and line 5 (15:35) below:
20220114 15340646 1 trace mormot.net.async.TAsyncConnectionsSockets(7f12a5fef340) UnlockSlotAndCloseConnection: Write() finished on handle=75
20220114 15340646 1 trace mormot.net.async.TAsyncConnectionsSockets(7f12a5fef340) Stop sock=54 handle=75 r=2 w=0
20220114 15340646 1 trace mormot.net.async.TAsyncConnectionsSockets(7f12a5fef340) OnClose {"THttpAsyncConnection(7f12a5219bc0)":{Handle:75}}
20220114 15340704 2 trace mormot.net.async.THttpAsyncConnections(7f12a5cfe040) ConnectionNew {"THttpAsyncConnection(7f12a5219d40)":{Handle:76}} socket=55 count=0
20220114 15350710 2 trace mormot.net.async.THttpAsyncConnections(7f12a5cfe040) ConnectionNew {"THttpAsyncConnection(7f12a5219ec0)":{Handle:77}} socket=56 count=0
20220114 15350710 & trace mormot.net.async.TPollAsyncReadSockets(7f12a5d8e240) GetOnePending(R1 mormotHttpServer)=7f12a5219d40 1 #1/2
20220114 15350710 & trace mormot.net.async.TAsyncConnectionsSockets(7f12a5fef340) ProcessRead recv(55)=Ok len=96 in 12us {"TPollAsyncReadSockets(7f12a5d8e240)":{PendingIndex:1,PendingCount:2}}
20220114 15350710 & trace mormot.net.async.THttpAsyncConnection(7f12a5219d40) OnRead hrsGetCommand len=96 GET /echo11 HTTP/1.1$0d$0aHost: mormot2$0d$0aConnection: close$0d$0aUser-Agent: curl/7.68.0$0d$0aAccept: */*$0d$0a$0d$0a
20220114 15350710 & trace mormot.net.async.THttpAsyncConnection(7f12a5219d40) Write len=151 HTTP/1.0 200 OK$0d$0aServer: mORMot2 (Linux)$0d$0aX-Powered-By: mORMot 2 synopse.info$0d$0aContent-Length: 31$0d$0aConnection: Close$0d$0a$0d$0agot request from connection #76
20220114 15350710 & trace mormot.net.async.THttpAsyncConnection(7f12a5219d40) AfterWrite Done ContentLength=31 Wr=0 Flags=pc
20220114 15350710 & trace mormot.net.async.TAsyncConnectionsSockets(7f12a5fef340) Write 55 closed by AfterWrite handle=76
20220114 15350710 & trace mormot.net.async.TAsyncConnectionsSockets(7f12a5fef340) UnlockSlotAndCloseConnection: Write() finished on handle=76
20220114 15350710 & trace mormot.net.async.TAsyncConnectionsSockets(7f12a5fef340) Stop sock=55 handle=76 r=2 w=0
20220114 15350710 & trace mormot.net.async.TAsyncConnectionsSockets(7f12a5fef340) OnClose {"THttpAsyncConnection(7f12a5219d40)":{Handle:76}}
20220114 15350710 & trace mormot.net.async.TPollAsyncReadSockets(7f12a5d8e240) GetOnePending(R1 mormotHttpServer)=7f12a5219ec0 1 #2/2
20220114 15350710 & trace mormot.net.async.TAsyncConnectionsSockets(7f12a5fef340) ProcessRead recv(56)=Ok len=96 in 9us {"TPollAsyncReadSockets(7f12a5d8e240)":{}}
20220114 15350710 & trace mormot.net.async.THttpAsyncConnection(7f12a5219ec0) OnRead hrsGetCommand len=96 GET /echo11 HTTP/1.1$0d$0aHost: mormot2$0d$0aConnection: close$0d$0aUser-Agent: curl/7.68.0$0d$0aAccept: */*$0d$0a$0d$0a
20220114 15350710 & trace mormot.net.async.THttpAsyncConnection(7f12a5219ec0) Write len=151 HTTP/1.0 200 OK$0d$0aServer: mORMot2 (Linux)$0d$0aX-Powered-By: mORMot 2 synopse.info$0d$0aContent-Length: 31$0d$0aConnection: Close$0d$0a$0d$0agot request from connection #77
20220114 15350710 & trace mormot.net.async.THttpAsyncConnection(7f12a5219ec0) AfterWrite Done ContentLength=31 Wr=0 Flags=pc
20220114 15350710 & trace mormot.net.async.TAsyncConnectionsSockets(7f12a5fef340) Write 56 closed by AfterWrite handle=77
20220114 15350710 & trace mormot.net.async.TAsyncConnectionsSockets(7f12a5fef340) UnlockSlotAndCloseConnection: Write() finished on handle=77
20220114 15350710 & trace mormot.net.async.TAsyncConnectionsSockets(7f12a5fef340) Stop sock=56 handle=77 r=2 w=0
20220114 15350710 & trace mormot.net.async.TAsyncConnectionsSockets(7f12a5fef340) OnClose {"THttpAsyncConnection(7f12a5219ec0)":{Handle:77}}
BTW, the trace is a perfect thing!
Last edited by mpv (2022-01-14 15:42:51)
Offline
The delay occurs only when proxying with HTTP 1.0; when proxied with HTTP 1.1 the performance is ~70000 RPS.
Updated nginx config with 3 endpoints:
- /echo is proxied using HTTP 1.0 without keep-alive
- /echo11 uses HTTP 1.1 with keep-alive (up to 1024 connections for each nginx worker)
- and (just to compare performance) the /pure endpoint, which responds 200 'gangnam style!' at the nginx level
upstream mormot2 {
  server localhost:8888 max_fails=2 fail_timeout=30;
  keepalive 1024;
}
server {
  listen 82;
  server_name localhost;
  server_tokens off;
  location /echo {
    proxy_pass http://127.0.0.1:8888;
  }
  location /echo11 {
    proxy_pass http://mormot2;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
  }
  location /pure {
    return 200 'gangnam style!';
  }
}
Last edited by mpv (2022-01-14 15:59:28)
Offline
Please try with my latest commits.
It should get rid of the delay now.
Edit: I discovered a regression about raw Network protocols (not http/ws) on Delphi / Windows.
I will fix it.
Edit 2: Should be fixed now with https://github.com/synopse/mORMot2/commit/e0aa2703
Offline
Test results:
- The server is stable when the clients are HTTP 1.1, HTTP 1.0, or a mix of 1.1 and 1.0.
- I can't reproduce the delay anymore.
But
HTTP 1.0 mode is still too slow: ~1864 RPS.
The strange thing is that in HTTP 1.0 mode the server consumes only a small part of the CPU (each core is at about ~10%), while in HTTP 1.1 it is at 100%.
I analysed the syscalls - everything is OK.
I tried to profile via valgrind - and got another strange result: if the sampling profiler is active, all responses are non-200, and in verbose mode the server starts reporting
warn mormot.net.async.THttpAsyncConnections(04b36040) Execute: Accept(8888) failed as Too Many Connections
which is not true, because ss -s shows 3595 sockets in use.
P.S.
The test results above were obtained before Edit 2 was added to the post above.
Edit 2:
With the latest commit the result is the same.
Last edited by mpv (2022-01-14 21:36:52)
Offline
I spent days tracking - and fixing - several issues.
The main problem was that the sockets were not closed when epoll() was used!
Then I discovered several difficult race-condition bugs when heavy multi-threading was involved.
I also raised the MaxPending value to 10000 - the previous 2000 made the server reject connections on heavy benchmarks.
Now I have no trouble with HTTP/1.1 or HTTP/1.0, with thousands of connections.
For instance, here are the numbers when forcing HTTP/1.0, with 50 threads running 29980 requests, and therefore connections:
1.1. Multi thread process:
- Create thread pool: 1 assertion passed 3.03ms
- TCP sockets: 119,921 assertions passed 4.18s
1=11682/s 2=16590/s 5=18688/s 10=19351/s 30=14645/s 50=14223/s
MaxThreads=50 MaxClients=5000 TotalOps=29980 TotalClients=9800 HTTP1.0
Total failed: 0 / 119,922 - Multi thread process PASSED 4.25s
Your feedback is welcome.
About the remaining socket errors above 1000 clients with wrk, I guess the problem lies in the wrk tool itself.
It first connects every client in a loop, then starts working with the clients. This is not well handled by our server. I have no trouble with the ORM multi-thread client code, which uses a more regular pattern. My guess is that wrk behaves not like a regular set of clients, but like a DoS attacker. I consider the socket failures when bombing the server a security feature, as long as more regular clients connect with no problem.
I am only concerned about nginx as an HTTP/1.0 reverse proxy. Perhaps in this case the regular useHttpSocket web server should be used instead of useHttpAsync - this is what is currently documented, by the way.
Offline