malloc(): unaligned tcache chunk detected

danielkuettner · 2024-10-15 14:55:11

Hi Arnaud,

I'm running an HTTP async (epoll) server under Linux, built with fpc 3.2.0 -O3 and mORMot2 1c3447f4 (2024-09-17), and I'm using cmem.
I got the error above, and our service was killed by the OS (Ubuntu).

Is this error known, and has it been fixed in a later version?

Last edited by danielkuettner (2024-10-15 16:08:02)

ab · 2024-10-15 16:14:45

Very interesting.
This error was also reported during the last TFB testing.
But we were not able to reproduce it here.
Our guess is that there is a memory-related bug in the mORMot async server, which is usually not triggered by the mormot.core.fpcx64mm heap but more often by cmem.

So it is good news that you were able to reproduce it!

Do you have more info?
Where/when is it killed?
What is the load?
Can you generate the logs in verbose mode (with full low-level async process info) and send them to me?

Perhaps with more context, we would be able to locate the faulty part of the code...

mpv · 2024-10-15 17:28:13

So the problem is not in HTTP pipelining.
I hope Daniel can give us a way to reproduce this issue.

danielkuettner · 2024-10-15 17:41:54

Yes, I hope so too.

This error is not really reproducible. This is the first time I have had it since July, but we had several other errors over the last months (e.g. service not responding, undefined exceptions, or the Variant error with _Safe(v)^.InitArrayFromCsvFile and memory corruption after that), and we have fixed a lot of possible bugs on our side, but without really knowing the cause.

I can change logging to verbose mode and send the file to you (it's huge, so I will give you a download URL by email), but verbose logging wasn't enough in the past either. Perhaps adding logs in mormot.net.async or mormot.net.sock is the way to go.

If the error is async-related, could I switch to useHttpSocket for testing (we are using nginx as a reverse proxy in front of our service)?

->Do you have more info?
Only these errors, which I have had in the log file over the last few weeks:
20241015 15564437 EXC EThreadError {Message:"Thread error"} [R12:root]
20241015 15564446 EXC EThreadError {Message:"Thread error"} [R14:root]
20241015 15564448 EXC EThreadError {Message:"Thread error"} [R13:root]
20241015 16171856 EXC EThreadError {Message:"Thread error"} [R13:root]
20241015 16171859 EXC EThreadError {Message:"Thread error"} [R8:root]
But it seems to be transparent from the client side, because we have no error responses related to those errors.

At the moment I use two services as upstream under nginx (Linux as the main service and a Windows http.sys service as backup). The Windows service seems to be more stable than the Linux one with the async server.

->Where/when is it killed?
It was killed by the OS today. Here are the lines from syslog:
Oct 15 16:09:53 mssql SOneSrv2[358268]: malloc(): unaligned tcache chunk detected
Oct 15 16:09:53 mssql systemd[1]: SOneSrv2.service: Main process exited, code=killed, status=6/ABRT
Oct 15 16:09:53 mssql systemd[1]: SOneSrv2.service: Failed with result 'signal'.
Oct 15 16:09:53 mssql systemd[1]: SOneSrv2.service: Consumed 28min 29.400s CPU time.
Oct 15 16:10:23 mssql systemd[1]: SOneSrv2.service: Scheduled restart job, restart counter is at 1.
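As a side note on reading these syslog lines: glibc prints the "malloc(): ..." message and then calls abort(), so the process is terminated by SIGABRT, which is signal 6 on Linux and matches the status=6/ABRT reported by systemd. A minimal C sketch of that mechanism, with a hypothetical helper name, where a child process simply calls abort() (it does not corrupt the heap):

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that calls abort(), as glibc does after
 * printing a heap-corruption message, and return the number of the signal
 * that killed it. Returns -1 if fork/waitpid failed or if the child was not
 * terminated by a signal. */
int signal_of_aborting_child(void)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        abort();                      /* child: raises SIGABRT */

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Calling signal_of_aborting_child() from a supervising process returns SIGABRT, which is what a service manager sees when the heap check fails.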

->What is the load?
It's difficult to answer.
Per day we have about 400-500 thousand requests with an average request time of 20 ms, but there are also requests with text search in MongoDB that can run for 10 s.
We are using an LXC container with 32 cores, 100G of memory (60G free) and ZFS. Our service has 32 worker threads and there are about 100-200 concurrent users.
CPU usage is mostly under 10%, but goes over 100% for a few seconds at a time because MongoDB is heavily used in front of postgres. But I would say we don't have any overload here.

ab · 2024-10-15 20:49:04

I don't know where this EThreadError comes from.

On FPC, this EThreadError comes from the fpc_threaderror RTL function.
It seems to be triggered only by the cthreads.pp unit, when a TRtlCriticalSection is used or when RTL/Basic Events are created.

Since events are created once per thread, and the exception is raised during the run on existing threads, my guess is that it is a failing pthread_mutex_lock/trylock/unlock.
I will review the TRtlCriticalSection involved with the async server, but AFAICT there are very few of them.

danielkuettner · 2024-10-16 08:27:20

I've sent you an email with a log file link.

Perhaps this link is interesting for you:

https://stackoverflow.com/questions/693 … ecv-is-suc

I've also read about a hardware issue with ECC RAM but all other programs (postgres, mongodb) run without errors, so I would not go in this direction.

Last edited by danielkuettner (2024-10-16 08:42:20)

ab · 2024-10-16 11:36:06

Yes, it is likely to be a dangling pointer issue, i.e. trying to free a memory block which is not a memory block.

I don't think it is about HW. The problem is in mORMot, but like a Heisenbug it is very difficult to reproduce, and therefore to fix.

I have again reviewed the async web server code, and there is no obvious problem I could identify.

@mpv
Perhaps the EThreadError may give us some information.
With TFB, we got the malloc problem even if we don't use cmem, but the mormot.core.fpcx64mm.pas heap manager. The only part using malloc is likely to be pthread.
Perhaps there is an execution problem in a thread, which triggers EThreadError, which leaves the pthread library in a weird state, and after a while malloc fails and kills the process.
I have hardened https://github.com/synopse/mORMot2/commit/04e3c28b but I doubt it makes a real difference.

danielkuettner · 2024-10-16 13:34:29

Here are the relevant log file entries for the EThreadError:

20241016 13290923 - 08.340.038
20241016 13290923 EXC EThreadError {Message:"Thread error"} [R16:root] at 428014
20241016 13290923 debug uRestServerDB.TSOneRestServerDB(020bafc8) TServiceFactoryServer.InstanceFree: ignored EThreadError exception during IPersons._Release
20241016 13290923 srvr 192.168.1.214 Interface POST root/Persons.GetOneByID=200 out=3 KB in 4.32s
20241016 13290923 ret mormot.rest.server.TRestServerRoutingRest(7fdb340cd238) {"result":[{"res": true, ...

The call of the sicSingle service method Persons.GetOneByID is ok from the pov of the client.

This error also occurs with useHttpSocket (not only with useHttpAsync) and under the current mORMot2 branch #38874e16c.

Last edited by danielkuettner (2024-10-16 13:53:20)

ab · 2024-10-16 15:25:19

@daniel
You should perhaps enable the thread identification in the logs.

danielkuettner · 2024-10-16 15:38:58

perThreadLog:= ptIdentifiedInOneFile

This one?

ab · 2024-10-16 18:31:39

Yes!

danielkuettner · 2024-10-17 19:39:43

Do you know this thread and is it interesting to you?

https://gitlab.com/freepascal.org/fpc/s … sues/40677

danielkuettner · 2024-10-18 15:03:56

Hi ab,

Just FYI: today's error in my service and the syslog entries led me to the nested option for LXC containers, and I think I'll give it a try:

https://forum.proxmox.com/threads/lxc-a … -13.36173/

danielkuettner · 2024-10-21 08:01:02

Back to the EThreadError: this is where it comes from:

20241021 06575224 . debug uRestServerDB.TSOneRestServerDB(02eb3298) TServiceFactoryServer.InstanceFree: ignored EThreadError exception during IPersons._Release

So I would love to get further instructions from you.

Last edited by danielkuettner (2024-10-21 11:35:26)

danielkuettner · 2024-10-22 16:44:44

I'm writing again because I think it could be useful to stabilize the interfaces/network (mormot.core.interfaces / mormot.core.soa) part of the framework, which would benefit all users:

My point of view is that the execution of interfaced methods isn't thread-safe in some circumstances. When two parallel (sicSingle) service calls come in, their params get mixed up or overwritten. In my case, this shows up as corrupt BSONs in MongoDB queries.

@Ab could you please check TInterfaceMethodExecuteCached for anything that could go wrong under heavier load? All my wrk tests look good, but the live system shows this error behavior, while all other processes and the syslog look good so far.

I've commented out this part of TInterfaceMethodExecuteCached.Acquire:

if fCached.TryLock then
begin
  // reuse this shared instance between calls
  SetOptions(opt);
  exec := self;
  fCachedWR.CancelAllAsNew;
  WR := fCachedWR;
end

but it's not enough, or not the right place.
Are there any other past changes that came with https://blog.synopse.info/?post/2022/01 … e-Them-All that I could test easily?

Perhaps it is not an issue of wrong locking, but of something reused through caching not being re-initialized?
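For readers unfamiliar with the pattern in question, here is a minimal C sketch of "try to lock one shared cached instance, reset it, and reuse it". All names are hypothetical and this is not mORMot code. The fallback to a freshly allocated instance when the lock is busy is an assumption, since the excerpt above stops before any else branch:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-call execution context holding call parameters. */
typedef struct {
    char   params[128];
    size_t len;
} exec_ctx;

static pthread_mutex_t cached_lock = PTHREAD_MUTEX_INITIALIZER;
static exec_ctx cached_ctx;          /* shared instance reused between calls */

/* Return the cached context if nobody is using it, else a temporary one. */
exec_ctx *ctx_acquire(void)
{
    if (pthread_mutex_trylock(&cached_lock) == 0) {
        /* reset every field: stale state from the previous call must not leak */
        memset(&cached_ctx, 0, sizeof cached_ctx);
        return &cached_ctx;
    }
    /* cached instance is busy: assumed fallback to a zeroed private copy */
    return calloc(1, sizeof(exec_ctx));
}

/* Must be called exactly once for each successful ctx_acquire(). */
void ctx_release(exec_ctx *ctx)
{
    if (ctx == NULL)
        return;
    if (ctx == &cached_ctx)
        pthread_mutex_unlock(&cached_lock);
    else
        free(ctx);
}
```

In this sketch, two overlapping calls never share a context, and a reused context always starts zeroed. If either property is broken (a missed reset, or a release without a matching acquire), parameters from one call can show up in another, which is the kind of symptom described above.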

danielkuettner · 2024-10-29 20:25:42

I've changed our code to get a stable version of our software. The errors are gone.
I know there is no issue in the async/socket part of mORMot. But I don't know what exactly had caused the errors.
My guess is an issue with TSynLocker. Perhaps we haven't used it correctly, or made some mistake in the way we lock/unlock.

mORMot Open Source