#1 2019-08-20 09:00:24

berocoder
Member
From: Nedervetil, Finland
Registered: 2017-01-13
Posts: 19

Spontaneous disconnection in production environment

Well the header may describe our problem well.

Since a while, our team has used Synopse for networking. But some parts still use Indy.
The reason for the change was basically the support for WebSockets and nonblocking calls in Synopse.
Then that it has a reputation as fast doesn't hurt either.

Our client.exe and server.exe files use common units for connection to our own developed windows service.
This connection now uses Synopse for that. Some days everything works fine. But some days there are spontaneous disconnections that require manually restart of servers. This is of course serious!

Note that I have not worked with Synopse integration directly. It is Daniel Mauric that has vacation now.
The first question is what version of Synopse we use? I see in git that the latest commit in Synopse folder was 24 June 2019 with commit comment "Fix for thread names".

SynopseCommit.inc contains 1.18.5254. Is that the build-number?

The first change in SynCommons.pas is from

function IPToCardinal(const aIP: RawUTF8; out aValue: cardinal): boolean; overload;

to

function IPToCardinal(P: PUTF8Char; out aValue: cardinal): boolean; overload;

Maybe this is enough to identify our version?

Now the next question is how to fix this?
We have speculated that the reason can be a high load on the network.

This is a sample from logs when this happens at production. JCL is used as an exceptionhandler.

2019-08-16 11:15:13:612 (9584) Failed to send OSS message.: TInterfacedObjectFakeClient.FakeCall(IOSSService.SendOssNotification) failed: 'URI root/OSSService.SendOssNotification ["SYNC","E:GeoLocation:349879534\r\nM:GeoLocation.fLatitude,fLongitude,updated:349879534\r\n"] returned status 'Not Found' (404 - Network problem or request timeout)'
2019-08-16 11:15:13:752 (9584) [EXCEPTION] Raised EInterfaceFactoryException: TInterfacedObjectFakeClient.FakeCall(IOSSService.SendOssNotification) failed: 'URI root/OSSService.SendOssNotification ["SYNC","E:GeoLocation:349879534\r\nM:GeoLocation.fLatitude,fLongitude,updated:349879534\r\n"] returned status 'Not Found' (404 - Network problem or request timeout)'. At Location mORMot.RaiseError (mORMot.pas:54991)
Call Stack:
    [00] mORMot.RaiseError (mORMot.pas:54989)
    [01] mORMot.InternalProcess (mORMot.pas:55077)
    [02] System.@GetMem$qqri (System.pas:4614)
    [03] System.@NewAnsiString$qqrius (System.pas:24129)
    [04] System.@LStrAsg$qqrr27System.%AnsiStringT$us$i0$%x27System.%AnsiStringT$us$i0$% (System.pas:25278)
    [05] SynCommons.FormatUTF8$qqrx31System.%AnsiStringT$us$i65001$%px14System.TVarRecxir31System.%AnsiStringT$us$i65001$% (SynCommons.pas:24176)
    [06] FastMM4.FastReallocMem$qqrpvi
    [07] mORMot.TInterfacedObjectFake.FakeCall$qqrr21Mormot.TFakeCallStack (mORMot.pas:55165)
    [08] System.@UStrCat$qqrr20System.UnicodeStringx20System.UnicodeString (System.pas:29458)
    [09] System.@FreeMem$qqrpv (System.pas:4662)
    [10] System.@UStrClr$qqrpv (System.pas:24272)
    [11] BoldLoggableCriticalSection.TBoldLoggableCriticalSection.Release$qqrv
    [12] AttracsNetworkTransportSynopse.TOSSService.Execute$qqrv (AttracsNetworkTransportSynopse.pas:119)
    [13] System.Classes.Classes.ThreadProc$qqrxp22System.Classes.TThread (System.Classes.pas:14701)
    [14] System.ThreadWrapper$qqspv (System.pas:23825)


2019-08-16 11:15:13:971 (2876) Lost connection to OSS Server

Is there a more recent version of Synopse that can improve the situation?
We also discussed if we maybe could make a demo of the problem, but how do we simulate a high load on the network reproduce?

Of course, another possibility is wrong implementation of Synopse but Daniel doesn't believe that theory.
But I have noticed one thing.
At office when I connect to the database from my local machine this error never happens.
But when working from home when I use VPN it happens maybe 50 - 80 % of the time I try.
So obviously I now prefer to work at the office smile

So for us, this is top priority to solve!

Regards
Roland Bengtsson
Team Attracs Finland

Offline

#2 2019-08-20 09:53:53

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,659
Website

Re: Spontaneous disconnection in production environment

Current revision is 1.18.5308.
So first action would be to compile using this revision.

From your above information, sounds like a communication problem.
Which HTTP client class do you use? Socket or WinHttp?

Offline

#3 2019-08-20 11:35:36

berocoder
Member
From: Nedervetil, Finland
Registered: 2017-01-13
Posts: 19

Re: Spontaneous disconnection in production environment

Hi, it create an instance of TSQLHttpClientWebsockets like this

procedure TOSSService.Initialize;
begin
  fCallback := nil;
  fService := nil;
  fClient := TSQLHttpClientWebsockets.Create(fHost, AnsiString(IntToStr(fPort)), TSQLModel.Create([]));
  Client.Model.Owner := Client;
  Client.KeepAliveMS := 5 * 1000;
  Client.OnWebSocketsClosed := OnDisconnect;
  Client.WebSocketsUpgrade(BOLDSERVICE_TRANSMISSION_KEY);
  if not Client.ServerTimeStampSynchronize then
    raise EOssException.CreateFmt(
      'Error connecting to OSS server at %s:%d', [fHost, fPort]);

  Client.WebSockets.Settings.LoopDelay := 5;
  Client.WebSockets.Settings.HeartbeatDelay := 1000;
  Client.ServiceDefine([IOSSService],sicShared);
  if not Client.Services.Resolve(IOSSService, fService) then
    raise EOssException.Create('Service IOSSService unavailable');

  Client.OnWebSocketsClosed := OnDisconnect;

  fCallback := TOssCallback.Create(Client,IOssCallback);
  fService.SubscribeToOss(fCallback);
  AttracsTraceLog.TraceLog.Trace('Connected to OSS Server');
end;

This is latest code not yet in production.

procedure TOSSService.Initialize;
var
  s: string;
begin
  fCallback := nil;
  fService := nil;
  FreeAndNil(fClient);
  try
    fClient := TSQLHttpClientWebsockets.Create(fHost, AnsiString(IntToStr(fPort)), TSQLModel.Create([]));
    Client.DefaultWebSocketProcessSettings.HeartbeatDelay := 1000;
    Client.DefaultWebSocketProcessSettings.DisconnectAfterInvalidHeartbeatCount := 10;
    Client.OnWebSocketsClosed := DoOnDisconnect;
    Client.Model.Owner := Client;
    Client.KeepAliveMS := 10 * 1000;
    Client.OnWebSocketsClosed := OnDisconnect;
    s := Client.WebSocketsUpgrade(BOLDSERVICE_TRANSMISSION_KEY);
    if s <> '' then
      raise EOssException.Create(s);
    if not Client.ServerTimeStampSynchronize then
      raise EOssException.CreateFmt(Client.LastErrorMessage+
        '. Error connecting to OSS server at %s:%d', [fHost, fPort]);

    Client.WebSockets.Settings.LoopDelay := 5;
    Client.WebSockets.Settings.HeartbeatDelay := 1000;
    Client.ServiceDefine([IOSSService],sicShared);

    if not Client.Services.Resolve(IOSSService, fService) then
      raise EOssException.Create('Service IOSSService unavailable');

    fCallback := TOssCallback.Create(Client, IOssCallback);
    fService.SubscribeToOss({fClientId,} fCallback);
    DoOnConnect(self);
  except
    FreeAndNil(fClient);
    fCallback := nil;
    fService := nil;
    DoOnDisconnect(self);
    raise;
  end;
end;

So I believe that the major difference is only that the new event for disconnection is used.

/Roland

ab wrote:

Current revision is 1.18.5308.
So first action would be to compile using this revision.

From your above information, sounds like a communication problem.
Which HTTP client class do you use? Socket or WinHttp?

Offline

#4 2019-08-20 14:49:55

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,659
Website

Re: Spontaneous disconnection in production environment

Client.OnWebSocketsClosed is set twice... pretty weird...

First thing would be to enable the mORMot logs, with full details for the WebSockets, and check the traced information.

I could be available for code audit and remote debugging, if needed.

Offline

#5 2019-08-21 07:31:30

berocoder
Member
From: Nedervetil, Finland
Registered: 2017-01-13
Posts: 19

Re: Spontaneous disconnection in production environment

ab wrote:

Client.OnWebSocketsClosed is set twice... pretty weird...

Hi, first I must say that it is now I start reading docs for Mormot more deeply.
Both the code and docs are impressive work indeed!!!
Thanks for making this opensource.

Before pasting the code here I remove some commented code to make it more clear.
So that is set OnWebSocketsClosed twice is a small mistake but guess make no harm either.

First thing would be to enable the mORMot logs, with full details for the WebSockets, and check the traced information.

I could be available for code audit and remote debugging if needed.

Yes, I agree that detailed logs would be useful. This is the current code in Windows service that communicates with servers and clients.

  Server.ServiceDefine(TOssService,[IOSSService],sicShared).ByPassAuthentication := true;

  Server.OnSessionClosed := SessionClosed;
  Server.OnSessionCreate := SessionCreate;
  Server.CreateMissingTables;
  HttpServer := TSQLHttpServer.Create(AnsiString(IntToStr(vPort)), [Server], '+', useBidirSocket, 32, secSynShaAes, 'OSS');
  WebSocketServerRest := HttpServer.WebSocketsEnable(Server, BOLDSERVICE_TRANSMISSION_KEY, true);
//  WebSocketServerRest.Settings.SetFullLog;
  WebSocketServerRest.Settings.LogDetails := [];
  WebSocketServerRest.Settings.HeartbeatDelay := 1000;
  WebSocketServerRest.Settings.LoopDelay := 5;
//  TSQLLog.Family.Level := LOG_VERBOSE;
  TSQLLog.Family.PerThreadLog := ptIdentifiedInOnFile;
  WebSocketLog := TSQLLog;
  TDDDRepositoryRestFactory.ComputeSQLRecord([
    TSynMonitorWithSize,TSQLRestServerMonitor]);

As you see we had LOG_VERBOSE turned on in the past but is now commented out. So I believe it only logs exceptions now. The reason was that it relative quick logs GB of data and filled the discs...
I read the docs and Mormot supports rotating logs.

The service is tricky as it needs to be up during worktime 7-17 every day. Preferably 24/7 all days but we must of course stop it for updates. I also found the section with LogViewer to make logs more readable.

So Log configuration
- Is LOG_VERBOSE ok for debugging ?
- Compressed logs to save diskspace.
- When a certain logsize is reached it should close that file and create a new one. That way old files can be deleted if run out of diskspace without restarting service.

So if I want max 20 GB of logs:

  TSQLLog.Family.Level := LOG_VERBOSE;
  TSQLLog.Family.PerThreadLog := ptIdentifiedInOnFile;
  TSQLLog.Family.RotateFileCount := 20;
  TSQLLog.Family.RotateFileSizeKB := 1024 * 1024 // 1 GB log

Those 20 GB is it compressed or uncompressed?
Look at https://synopse.info/files/html/api-1.1 … FILESIZEKB

Maximum size of auto-rotated logging files, in kilo-bytes (per 1024 bytes)
- specify the maximum file size upon which .synlz rotation takes place

What happens when total log size reached the limit, in this case, 20 GB? Are the oldest logs deleted?
That would be the prefered way as that is probably not interesting anymore

This is for the service. Is the same setup for logs needed for our server.exe files to be able to follow both sender and receiver?


Another question, https://synopse.info/files/html/api-1.1 … THUNITNAME
That is used to log unitnames. Does it work with inc-files ?
Bold framework that uses inc-files to store methods in model.
That is a limitation as not all tools support it. It works in JCL but not in MadExcept exceptionhandler for example.

About remote debugging, it is good to know. Let's see then when more people here are back from their vacations.

/Roland

Offline

#6 2019-08-22 06:08:09

pvn0
Member
From: Slovenia
Registered: 2018-02-12
Posts: 211

Re: Spontaneous disconnection in production environment

It's not uncommon to get a timeout (unstable network that causes excessive packet drops, server busy etc...), you need to handle this events and recover or adjust the timeout parameters.

Secondly, use DefaultWebSocketProcessSettings to set the settings, otherwise your settings will not propagate to new connections.

Offline

#7 2019-08-22 07:37:37

berocoder
Member
From: Nedervetil, Finland
Registered: 2017-01-13
Posts: 19

Re: Spontaneous disconnection in production environment

pvn0 wrote:

It's not uncommon to get a timeout (unstable network that causes excessive packet drops, server busy etc...), you need to handle this events and recover or adjust the timeout parameters.

I tried to adjust timeout but got the same error.
I thought mormot handle errors. Have you example of errorhandling code ?
Basically if there is a network error I want to try again until success.


Secondly, use DefaultWebSocketProcessSettings to set the settings, otherwise your settings will not propagate to new connections.

Thanks for the information

Offline

#8 2019-08-22 09:09:24

pvn0
Member
From: Slovenia
Registered: 2018-02-12
Posts: 211

Re: Spontaneous disconnection in production environment

You should look at the OnFailed event property of your client class.

Because you said that you have to restart the servers, I would speculate your own server code manages to deadlock which causes clients to timeout, in this case no amount of parameter tweaking is going to solve your problem.
Judging by your lack of experience, I strongly suggest taking ab's offer on the code audit.

Offline

#9 2019-08-22 10:41:52

berocoder
Member
From: Nedervetil, Finland
Registered: 2017-01-13
Posts: 19

Re: Spontaneous disconnection in production environment

pvn0 wrote:

You should look at the OnFailed event property of your client class.

Could someone point me to the documentation for that ?
I found ONAUTHENTICATIONFAILED but probably not correct.

Because you said that you have to restart the servers, I would speculate your own server code manages to deadlock which causes clients to timeout, in this case no amount of parameter tweaking is going to solve your problem.
Judging by your lack of experience, I strongly suggest taking ab's offer on the code audit.

I guess we do code audit as this is important stuff for us. In the meantime I work to add enhanced logging and make it more configurable.

Offline

#10 2019-08-22 11:18:19

pvn0
Member
From: Slovenia
Registered: 2018-02-12
Posts: 211

Re: Spontaneous disconnection in production environment

Offline

#11 2019-08-22 11:21:17

berocoder
Member
From: Nedervetil, Finland
Registered: 2017-01-13
Posts: 19

Re: Spontaneous disconnection in production environment

Thanks smile

Offline

Board footer

Powered by FluxBB