#1 2024-10-15 14:55:11

danielkuettner
Member
From: Germany
Registered: 2014-08-06
Posts: 357

malloc(): unaligned tcache chunk detected

Hi Arnaud,

I'm running an HTTP async (epoll) server under Linux, built with FPC 3.2.0 -O3 and mORMot2 1c3447f4 (2024-09-17), and I'm using cmem.
I got the error in the title, and our service was killed by the OS (Ubuntu).

Is that error known and fixed in a later version?

Last edited by danielkuettner (2024-10-15 16:08:02)

Offline

#2 2024-10-15 16:14:45

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,686
Website

Re: malloc(): unaligned tcache chunk detected

Very interesting.
This error was also reported during the last TFB testing.
But we were not able to reproduce it here.
Our guess is that there is a memory-related bug in the mORMot async server, which is usually not triggered by the mormot.core.fpcx64mm heap but more often by cmem.

So it is good news that you were able to reproduce it!

Do you have more info?
Where/when is it killed?
What is the load?
Can you generate the logs in verbose mode (with full low-level async process info) and send them to me?

Perhaps with more context, we might be able to locate the faulty part of the code...

Online

#3 2024-10-15 17:28:13

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,571
Website

Re: malloc(): unaligned tcache chunk detected

So the problem is not in HTTP pipelining.
I hope Daniel can give us a way to reproduce this issue.

Offline

#4 2024-10-15 17:41:54

danielkuettner
Member
From: Germany
Registered: 2014-08-06
Posts: 357

Re: malloc(): unaligned tcache chunk detected

Yes, I hope so too.

This error is not really reproducible. I've had it for the first time since July, but we have had several other errors over the last months (e.g. service not responding, undefined exceptions, or the Variant error with _Safe(v)^.InitArrayFromCsvFile and memory corruption after that). We have fixed a lot of possible bugs on our side, but without really knowing the root cause.

I can switch logging to verbose mode and send you the file (it's huge, so I will email you a download URL), but verbose logging wasn't enough in the past. Perhaps adding more logs in mormot.net.async or mormot.net.sock is the way to go.

If the error is async related, could I switch to useHttpSocket for testing (we are using nginx as a reverse proxy in front of our service)?

->Do you have more info?
Only these errors, which have appeared in the log file over the last few weeks:
20241015 15564437 EXC EThreadError {Message:"Thread error"} [R12:root]
20241015 15564446 EXC EThreadError {Message:"Thread error"} [R14:root]
20241015 15564448 EXC EThreadError {Message:"Thread error"} [R13:root]
20241015 16171856 EXC EThreadError {Message:"Thread error"} [R13:root]
20241015 16171859 EXC EThreadError {Message:"Thread error"} [R8:root]
But they seem to be transparent from the client side, because we have no error responses related to those errors.

At the moment I use two services as upstreams under nginx (Linux as the main service and a Windows http.sys service as backup). The Windows service seems to be more stable than the Linux one with the async server.


->Where/when is it killed?
It was killed by the OS today. Here are the lines from syslog:
Oct 15 16:09:53 mssql SOneSrv2[358268]: malloc(): unaligned tcache chunk detected
Oct 15 16:09:53 mssql systemd[1]: SOneSrv2.service: Main process exited, code=killed, status=6/ABRT
Oct 15 16:09:53 mssql systemd[1]: SOneSrv2.service: Failed with result 'signal'.
Oct 15 16:09:53 mssql systemd[1]: SOneSrv2.service: Consumed 28min 29.400s CPU time.
Oct 15 16:10:23 mssql systemd[1]: SOneSrv2.service: Scheduled restart job, restart counter is at 1.

->What is the load?
That's difficult to answer.
We have about 400-500 thousand requests per day with an average request time of 20 ms, but there are also requests with text search in MongoDB that can run for 10 s.
We are using an LXC container with 32 cores, 100G of memory (60G free) and ZFS. Our service has 32 worker threads and there are about 100-200 concurrent users.
The CPU usage is mostly under 10%, but goes over 100% for a few seconds at a time because MongoDB is heavily used in front of Postgres. But I would say we don't have any overload here.

Offline

#5 2024-10-15 20:49:04

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,686
Website

Re: malloc(): unaligned tcache chunk detected

I don't know where this EThreadError comes from.

On FPC, this EThreadError comes from the fpc_threaderror RTL function.
It seems to be triggered only by the cthreads.pp unit, when a TRtlCriticalSection is used or when RTL/Basic Events are created.

Since events are created once per thread, and the exception is raised while existing threads are running, my guess is that it is a failing pthread_mutex_lock/trylock/unlock.
I will review the TRtlCriticalSection instances involved with the async server, but AFAICT there are very few of them.
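
For readers unfamiliar with the RTL primitive discussed above, here is a minimal sketch of the intended TRTLCriticalSection lifecycle in plain FPC. It is not taken from mORMot; the program name and the Counter variable are hypothetical. It only illustrates the init / enter / leave / done order that the underlying pthread mutex relies on, under the assumption that calling these out of order (for instance entering a section after DoneCriticalSection) is the kind of misuse that could make a pthread_mutex call fail.

```pascal
program CriticalSectionSketch;

{$mode objfpc}{$H+}

uses
  {$ifdef unix}
  cthreads, // installs the pthread-based thread manager on Unix targets
  {$endif}
  SysUtils;

var
  Lock: TRTLCriticalSection;
  Counter: integer = 0; // hypothetical shared resource

begin
  // the section must be initialized exactly once before any Enter/Leave
  InitCriticalSection(Lock);
  try
    EnterCriticalSection(Lock);
    try
      inc(Counter); // protected work goes here
    finally
      // always release, even if the protected code raises
      LeaveCriticalSection(Lock);
    end;
  finally
    // no thread may use the section after this call
    DoneCriticalSection(Lock);
  end;
  writeln('Counter = ', Counter);
end.
```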

Online

#6 2024-10-16 08:27:20

danielkuettner
Member
From: Germany
Registered: 2014-08-06
Posts: 357

Re: malloc(): unaligned tcache chunk detected

I've sent you an email with a log file link.

Perhaps this link is interesting for you:

https://stackoverflow.com/questions/693 … ecv-is-suc

I've also read about a hardware issue with ECC RAM but all other programs (postgres, mongodb) run without errors, so I would not go in this direction.

Last edited by danielkuettner (2024-10-16 08:42:20)

Offline

#7 2024-10-16 11:36:06

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,686
Website

Re: malloc(): unaligned tcache chunk detected

Yes, it is likely to be a dangling pointer issue, i.e. trying to free a memory block which is not a memory block.

I don't think it is about HW. The problem is in mORMot, but like a Heisenbug it is very difficult to reproduce, and therefore to fix.

I have again reviewed the async web server code, and there is no obvious problem I could identify.

@mpv
Perhaps the EThreadError may give us some information.
With TFB, we got the malloc problem even if we don't use cmem, but the mormot.core.fpcx64mm.pas heap manager. The only part using malloc is likely to be pthread.
Perhaps there is an execution problem in a thread, which triggers EThreadError, which leaves the pthread library in a weird state, and after a while malloc fails and kills the process.
I have hardened https://github.com/synopse/mORMot2/commit/04e3c28b but I doubt it makes a real difference.

Online

#8 2024-10-16 13:34:29

danielkuettner
Member
From: Germany
Registered: 2014-08-06
Posts: 357

Re: malloc(): unaligned tcache chunk detected

Here are the relevant log file entries for the EThreadError:

20241016 13290923  -    08.340.038
20241016 13290923 EXC           EThreadError {Message:"Thread error"} [R16:root] at 428014
20241016 13290923 debug         uRestServerDB.TSOneRestServerDB(020bafc8) TServiceFactoryServer.InstanceFree: ignored EThreadError exception during IPersons._Release
20241016 13290923 srvr           192.168.1.214 Interface POST root/Persons.GetOneByID=200 out=3 KB in 4.32s
20241016 13290923 ret           mormot.rest.server.TRestServerRoutingRest(7fdb340cd238) {"result":[{"res": true, ...

The call to the sicSingle service method Persons.GetOneByID is OK from the client's point of view.

This error also occurs with useHttpSocket (not only with useHttpAsync) and with the current mORMot2 branch #38874e16c.

Last edited by danielkuettner (2024-10-16 13:53:20)

Offline

#9 2024-10-16 15:25:19

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,686
Website

Re: malloc(): unaligned tcache chunk detected

@daniel
You should perhaps enable the thread identification in the logs.

Online

#10 2024-10-16 15:38:58

danielkuettner
Member
From: Germany
Registered: 2014-08-06
Posts: 357

Re: malloc(): unaligned tcache chunk detected

perThreadLog:= ptIdentifiedInOneFile

This one?

Offline

#11 2024-10-16 18:31:39

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,686
Website

Re: malloc(): unaligned tcache chunk detected

Yes!
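
For context, a minimal sketch of how the setting confirmed above could be combined with verbose logging. It assumes mORMot2 is on the unit search path; the program name is hypothetical and the exact unit that declares each identifier may differ between mORMot2 revisions.

```pascal
program VerboseLogSketch;

{$mode objfpc}{$H+}

uses
  {$ifdef unix}
  cthreads,
  {$endif}
  mormot.core.base,
  mormot.core.log;

begin
  with TSynLog.Family do
  begin
    // log every level, including the low-level debug and trace events
    Level := LOG_VERBOSE;
    // keep a single log file, but tag each line with its thread number
    PerThreadLog := ptIdentifiedInOneFile;
  end;
  // ... start the REST/HTTP server here (not shown in this sketch)
end.
```

With the thread number on each line, the EThreadError entries can be matched to the request that ran on the same thread.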

Online

#12 2024-10-17 19:39:43

danielkuettner
Member
From: Germany
Registered: 2014-08-06
Posts: 357

Re: malloc(): unaligned tcache chunk detected

Do you know this issue, and is it of interest to you?

https://gitlab.com/freepascal.org/fpc/s … sues/40677

Offline

#13 2024-10-18 15:03:56

danielkuettner
Member
From: Germany
Registered: 2014-08-06
Posts: 357

Re: malloc(): unaligned tcache chunk detected

Hi ab,

Just FYI: today's error in my service and the syslog entries led me to the nesting option for LXC containers, and I think I'll give it a try:

https://forum.proxmox.com/threads/lxc-a … -13.36173/

Offline

#14 2024-10-21 08:01:02

danielkuettner
Member
From: Germany
Registered: 2014-08-06
Posts: 357

Re: malloc(): unaligned tcache chunk detected

Back to the EThreadError. This is where it comes from:

20241021 06575224  . debug  uRestServerDB.TSOneRestServerDB(02eb3298) TServiceFactoryServer.InstanceFree: ignored EThreadError exception during IPersons._Release

So I would love to get further instructions from you.

Last edited by danielkuettner (2024-10-21 11:35:26)

Offline

#15 2024-10-22 16:44:44

danielkuettner
Member
From: Germany
Registered: 2014-08-06
Posts: 357

Re: malloc(): unaligned tcache chunk detected

I'm writing again because I think it could be useful to stabilize the interfaces/network (mormot.core.interfaces / mormot.core.soa) part of the framework, which would benefit all users:

My point of view is that the execution of interfaced methods isn't thread safe in some circumstances. Two parallel (sicSingle) service calls come in and their params get mixed up or overwritten. In my case this shows up as corrupt BSONs in MongoDB queries.

@Ab could you please check TInterfaceMethodExecuteCached to see if there is anything that could go wrong under heavier load? All my wrk tests look good, but the live system shows this erroneous behavior, while all other processes look good and the syslog looks good so far.

I've commented out this part of TInterfaceMethodExecuteCached.Acquire

if fCached.TryLock then
  begin
    // reuse this shared instance between calls
    SetOptions(opt);
    exec := self;
    fCachedWR.CancelAllAsNew;
    WR := fCachedWR;
  end

but it's not enough, or not the right place.
Are there any other past changes that came with https://blog.synopse.info/?post/2022/01 … e-Them-All that I could test easily?

Perhaps it is not an issue of wrong locking, but of something reused through caching that is not reinitialized?

Offline

#16 2024-10-29 20:25:42

danielkuettner
Member
From: Germany
Registered: 2014-08-06
Posts: 357

Re: malloc(): unaligned tcache chunk detected

I've changed our code to get a stable version of our software. The errors are gone.
I know there is no issue in the async/socket part of mORMot. But I don't know what exactly caused the errors.
My guess is an issue with TSynLocker. Perhaps we didn't use it correctly or made some mistake in the way we lock/unlock.
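
As an illustration of the lock/unlock discipline mentioned above, here is a minimal sketch of balanced TSynLocker usage. It is not the code from our service. It assumes mORMot2 is on the unit search path, and the program name and the Shared variable are hypothetical.

```pascal
program SynLockerSketch;

{$mode objfpc}{$H+}

uses
  {$ifdef unix}
  cthreads,
  {$endif}
  mormot.core.base,
  mormot.core.os;

var
  Safe: TSynLocker;     // should be initialized once and never copied by value
  Shared: integer = 0;  // hypothetical resource protected by Safe

begin
  Safe.Init; // must be called before the first Lock
  try
    Safe.Lock;
    try
      inc(Shared); // protected work goes here
    finally
      // each Lock is paired with exactly one UnLock, even on exception
      Safe.UnLock;
    end;
  finally
    Safe.Done; // releases the underlying OS lock
  end;
end.
```

A missing Init, an UnLock without a matching Lock, or a lock used after Done are the kinds of mistakes this pattern is meant to rule out.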

Offline
