#1 2024-11-20 16:32:21

okoba
Member
Registered: 2019-09-29
Posts: 121

Multi thread free memory using fpcx64mm

My program allocates a huge amount of memory in large blocks (32 GB in total), and I want to free it from multiple threads to bring the time down from nearly 4 seconds.
But in my tests, adding threads does not help: the time stays the same.
Default memory manager of FPC does not free it until closing the program or on reuse.
Can it be done?
I am working with FPC and on Windows.
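
For concreteness, the kind of multi-threaded free described above could be sketched as follows. This is only an illustration under stated assumptions, not the actual program: TFreeThread, BLOCK_COUNT, BLOCK_SIZE and THREAD_COUNT are hypothetical names and values, and the code runs against whatever memory manager is installed.

```pascal
program FreeMultiSketch;

{$mode objfpc}{$H+}

uses
  {$ifdef unix} cthreads, {$endif}
  Classes, SysUtils;

const
  BLOCK_COUNT  = 64;               // hypothetical: number of large blocks
  BLOCK_SIZE   = 4 * 1024 * 1024;  // hypothetical: 4 MB each
  THREAD_COUNT = 4;                // hypothetical: number of worker threads

type
  // frees the blocks with index First..Last in its own thread
  TFreeThread = class(TThread)
  private
    fBlocks: PPointer;
    fFirst, fLast: integer;
  protected
    procedure Execute; override;
  public
    constructor Create(aBlocks: PPointer; aFirst, aLast: integer);
  end;

constructor TFreeThread.Create(aBlocks: PPointer; aFirst, aLast: integer);
begin
  fBlocks := aBlocks;
  fFirst := aFirst;
  fLast := aLast;
  inherited Create(false); // not suspended: start running right away
end;

procedure TFreeThread.Execute;
var
  i: integer;
begin
  for i := fFirst to fLast do
  begin
    FreeMem(fBlocks[i]);
    fBlocks[i] := nil;
  end;
end;

var
  blocks: array[0 .. BLOCK_COUNT - 1] of pointer;
  threads: array[0 .. THREAD_COUNT - 1] of TFreeThread;
  i, per: integer;
begin
  for i := 0 to BLOCK_COUNT - 1 do
  begin
    GetMem(blocks[i], BLOCK_SIZE);
    FillChar(blocks[i]^, BLOCK_SIZE, 1); // touch the pages so they are really mapped
  end;
  per := BLOCK_COUNT div THREAD_COUNT;   // each thread gets its own slice
  for i := 0 to THREAD_COUNT - 1 do
    threads[i] := TFreeThread.Create(@blocks[0], i * per, (i + 1) * per - 1);
  for i := 0 to THREAD_COUNT - 1 do
  begin
    threads[i].WaitFor;
    threads[i].Free;
  end;
  writeln('freed ', BLOCK_COUNT, ' blocks using ', THREAD_COUNT, ' threads');
end.
```

Each thread only touches its own slice of the array, so the sketch itself needs no locking. Any serialization that remains comes from the memory manager or the OS.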

Offline

#2 2024-11-20 17:39:16

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,736
Website

Re: Multi thread free memory using fpcx64mm

I suspect you are using large blocks? That is, blocks > 256KB in size.

In fpcx64mm the giant lock for large blocks is held only while switching pointers: the fpmunmap/VirtualFree syscall is then done with no lock, in the current thread.
So with fpcx64mm there is no practical lock when releasing large blocks of memory. If you see no benefit from multithreading, it would be due to the Windows OS itself.
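
The "lock only while switching pointers" pattern can be illustrated with a small sketch. This is not the fpcx64mm source, only a minimal model of the idea: TLargeBlock, ListLock, AddLarge and ReleaseLarge are hypothetical names, and FreeMem stands in for the real fpmunmap/VirtualFree call.

```pascal
program UnlinkThenRelease;

{$mode objfpc}{$H+}

{$ifdef unix}
uses
  cthreads;
{$endif}

type
  PLargeBlock = ^TLargeBlock;
  // hypothetical header kept in front of each large block
  TLargeBlock = record
    Prev, Next: PLargeBlock;
    Size: PtrUInt;
  end;

var
  ListLock: TRTLCriticalSection; // stands in for the "giant lock"
  ListHead: TLargeBlock;         // sentinel of a circular doubly-linked list

procedure AddLarge(block: PLargeBlock);
begin
  EnterCriticalSection(ListLock);
  block^.Next := ListHead.Next;
  block^.Prev := @ListHead;
  ListHead.Next^.Prev := block;
  ListHead.Next := block;
  LeaveCriticalSection(ListLock);
end;

procedure ReleaseLarge(block: PLargeBlock);
begin
  // 1. short critical section: only the pointer switching
  EnterCriticalSection(ListLock);
  block^.Prev^.Next := block^.Next;
  block^.Next^.Prev := block^.Prev;
  LeaveCriticalSection(ListLock);
  // 2. the expensive release runs unlocked, in the calling thread
  //    (FreeMem is a stand-in for fpmunmap / VirtualFree)
  FreeMem(block);
end;

var
  a, b: PLargeBlock;
begin
  InitCriticalSection(ListLock);
  ListHead.Next := @ListHead;
  ListHead.Prev := @ListHead;
  GetMem(a, SizeOf(TLargeBlock) + 1024);
  GetMem(b, SizeOf(TLargeBlock) + 1024);
  AddLarge(a);
  AddLarge(b);
  ReleaseLarge(a);
  ReleaseLarge(b);
  writeln('list empty: ', ListHead.Next = @ListHead);
  DoneCriticalSection(ListLock);
end.
```

With this layout, threads contend only for the few instructions inside the critical section, so the cost of the release call itself is what decides whether freeing scales.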

Offline

#3 2024-11-21 12:43:11

zen010101
Member
Registered: 2024-06-15
Posts: 70

Re: Multi thread free memory using fpcx64mm

okoba wrote:

Default memory manager of FPC does not free it until closing the program or on reuse.

Yes, I hit the same issue. My program calls a web API 10,000 times, and memory usage keeps growing until I close the program. No memory leaks are reported. But with fpcx64mm, memory usage stays at a fixed value.

P.S. The OS is Windows. On aarch64, the FPC memory manager behaves the same as fpcx64mm.

Offline

#4 2024-11-21 12:50:30

okoba
Member
Registered: 2019-09-29
Posts: 121

Re: Multi thread free memory using fpcx64mm

@ab Yes. They are a couple of megabytes each. I was hoping you might know of a way to speed it up.

Offline

#5 2024-11-21 12:56:03

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,736
Website

Re: Multi thread free memory using fpcx64mm

@okoba
Did you try with fpcx64mm?
It is not clear from your posts.

Offline

#6 2024-11-21 13:05:14

okoba
Member
Registered: 2019-09-29
Posts: 121

Re: Multi thread free memory using fpcx64mm

Yes! My test is done with fpcx64mm. Allocation with the default memory manager is much slower. fpcx64mm allocation is very fast, but freeing this much memory still takes time, and I want to speed it up.
I just tested on Linux and it takes 100ms to free, but nearly 4 seconds on Windows.

Offline

#7 2024-11-21 14:16:02

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,736
Website

Re: Multi thread free memory using fpcx64mm

So this is clearly a WinAPI issue.

Perhaps you could try to allocate bigger buffers (e.g. GetMem per 1GB) then do the sub-allocation in your own program, and release the whole big GB buffer at once in a single syscall?
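
As a rough idea of what such sub-allocation could look like, here is a minimal bump ("arena") allocator sketch. All names (TArena, ArenaInit, ArenaAlloc, ArenaDone) are hypothetical, sub-blocks cannot be freed individually, and the code is not thread-safe as written.

```pascal
program ArenaSketch;

{$mode objfpc}{$H+}

type
  // hypothetical arena: one big GetMem, many sub-blocks carved out of it
  TArena = record
    Base: PByte;    // start of the big buffer
    Size: PtrUInt;  // total capacity in bytes
    Used: PtrUInt;  // bytes handed out so far
  end;

procedure ArenaInit(out a: TArena; aSize: PtrUInt);
begin
  a.Size := aSize;
  a.Used := 0;
  GetMem(a.Base, aSize);
end;

// returns nil when the arena is exhausted
function ArenaAlloc(var a: TArena; aSize: PtrUInt): pointer;
begin
  aSize := (aSize + 15) and not PtrUInt(15); // round up to a multiple of 16
  if aSize > a.Size - a.Used then
    exit(nil);
  result := a.Base + a.Used;
  inc(a.Used, aSize);
end;

// a single FreeMem call releases every sub-block at once
procedure ArenaDone(var a: TArena);
begin
  FreeMem(a.Base);
  a.Base := nil;
  a.Size := 0;
  a.Used := 0;
end;

var
  arena: TArena;
  p1, p2: PByte;
  offset: PtrInt;
begin
  ArenaInit(arena, 64 * 1024 * 1024); // 64 MB here; the suggestion above is 1 GB
  p1 := ArenaAlloc(arena, 1000);
  p2 := ArenaAlloc(arena, 1000);
  offset := p2 - p1;
  writeln('second block offset: ', offset);
  writeln('used bytes: ', arena.Used);
  ArenaDone(arena);
end.
```

The trade-off is the one raised in the thread: the release becomes one call per big buffer, but the program has to track sub-block lifetimes itself.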

Offline

#8 2024-11-21 14:50:54

okoba
Member
Registered: 2019-09-29
Posts: 121

Re: Multi thread free memory using fpcx64mm

That can be done but it complicates the code quite a bit, and makes it harder to read. Preferably it should be done by the MM.
I made a test to verify it: https://gitlab.com/-/snippets/4771697
On my Windows 10 machine and FPC, blocks of 128 MB are much faster, but still much slower than FreeSingle on Linux (near 100ms).
Here are the times on Windows:

32 GB, 1 MB: 32768
Allocate: 32 GB in 4.62s i.e. 6.9 GB/s
FreeSingle: 32 GB in 2.57s i.e. 12.4 GB/s
Allocate: 32 GB in 4.66s i.e. 6.8 GB/s
FreeMulti: 32 GB in 2.87s i.e. 11.1 GB/s

32 GB, 4 MB: 8192
Allocate: 32 GB in 4.32s i.e. 7.4 GB/s
FreeSingle: 32 GB in 2.49s i.e. 12.8 GB/s
Allocate: 32 GB in 4.17s i.e. 7.6 GB/s
FreeMulti: 32 GB in 1.41s i.e. 22.6 GB/s

32 GB, 64 MB: 512
Allocate: 32 GB in 4.15s i.e. 7.7 GB/s
FreeSingle: 32 GB in 2.35s i.e. 13.5 GB/s
Allocate: 32 GB in 3.89s i.e. 8.2 GB/s
FreeMulti: 32 GB in 561.76ms i.e. 56.9 GB/s

32 GB, 128 MB: 256
Allocate: 32 GB in 3.96s i.e. 8 GB/s
FreeSingle: 32 GB in 2.40s i.e. 13.3 GB/s
Allocate: 32 GB in 3.85s i.e. 8.2 GB/s
FreeMulti: 32 GB in 514.55ms i.e. 62.1 GB/s

32 GB, 256 MB: 128
Allocate: 32 GB in 4.07s i.e. 7.8 GB/s
FreeSingle: 32 GB in 2.42s i.e. 13.1 GB/s
Allocate: 32 GB in 3.93s i.e. 8.1 GB/s
FreeMulti: 32 GB in 599.45ms i.e. 53.3 GB/s

32 GB, 1 GB: 32
Allocate: 32 GB in 4.04s i.e. 7.9 GB/s
FreeSingle: 32 GB in 2.41s i.e. 13.2 GB/s
Allocate: 32 GB in 3.96s i.e. 8 GB/s
FreeMulti: 32 GB in 1.36s i.e. 23.4 GB/s

What do you think?

One other question about fpcx64mm: why does it allocate many more Private Bytes compared to the default MM?

Here are the numbers I get:

Default: Private Bytes: 32GB, Peak Working Set: 32GB
fpcx64mm: Private Bytes: 48GB, Peak Working Set: 32GB   

Last edited by okoba (2024-11-21 15:08:44)

Offline

#9 2024-11-21 15:45:14

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,736
Website

Re: Multi thread free memory using fpcx64mm

Instead of GetMem + FillZero you could just use AllocMem which fills with zero. ;)
But I guess you want to write something to the memory to actually access the memory.
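
A minimal sketch of the two variants, using only RTL calls. FillChar is used here as a stand-in for FillZero, and BUF_SIZE is an arbitrary value chosen for the example.

```pascal
program AllocMemSketch;

{$mode objfpc}{$H+}

const
  BUF_SIZE = 1024 * 1024; // 1 MB, arbitrary

var
  a, b: PByte;
  same: boolean;
begin
  // two steps: raw allocation, then an explicit zero fill
  GetMem(a, BUF_SIZE);
  FillChar(a^, BUF_SIZE, 0);
  // one step: AllocMem returns memory already filled with zeros
  b := AllocMem(BUF_SIZE);
  // both buffers now hold the same content
  same := CompareByte(a^, b^, BUF_SIZE) = 0;
  writeln('equal: ', same);
  FreeMem(a);
  FreeMem(b);
end.
```

Both buffers end up with identical content. The explicit fill is still the one to keep when the goal is to force a write to every page, as in a benchmark.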

I don't know what to say, other than that the system behavior could change a lot between Windows 10 and Windows 11.
Here the bottleneck is the "VirtualFree" memory call.

"Private Bytes" are in fact reserved memory.
fpcx64mm pre-reserves the memory to avoid any hidden syscall and to make the first access to the RAM faster.
So don't be afraid of this number, which is not the actual memory used in your RAM sticks.

Offline

#10 2024-11-22 10:41:38

okoba
Member
Registered: 2019-09-29
Posts: 121

Re: Multi thread free memory using fpcx64mm

Your guess about FillZero is right.

Are you testing on Windows 11 and seeing different results?
In theory the MM could allocate big blocks (128 MB) and free each one once all of its sub-blocks have been released. But that can be a complicated task if it has to cover general cases.

Offline

#11 2024-11-22 11:30:50

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,736
Website

Re: Multi thread free memory using fpcx64mm

For the "large" blocks, we followed what most MM do - especially FastMM4/FastMM5.
That is, we call the OS directly.

On Windows, I sadly don't see anything that is both easy and faster than VirtualAlloc(MEM_COMMIT) and VirtualFree(MEM_RELEASE).
Perhaps forcing huge pages may help https://learn.microsoft.com/en-us/windo … ge-support
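
For reference, a minimal Windows-only sketch of that allocate/release pair as called from FPC. BLOCK_SIZE is a hypothetical value, and the sketch passes MEM_RESERVE together with MEM_COMMIT, which is an assumption made here for clarity rather than a statement of what any memory manager does.

```pascal
program VirtualFreeSketch;

{$mode objfpc}{$H+}

uses
  Windows;

const
  BLOCK_SIZE = 64 * 1024 * 1024; // hypothetical 64 MB large block

var
  p: pointer;
  ok: boolean;
begin
  // reserve and commit the address range in one call
  p := VirtualAlloc(nil, BLOCK_SIZE, MEM_RESERVE or MEM_COMMIT, PAGE_READWRITE);
  if p = nil then
  begin
    writeln('VirtualAlloc failed, error ', GetLastError);
    halt(1);
  end;
  FillChar(p^, BLOCK_SIZE, 1); // touch the pages so they are really backed
  // MEM_RELEASE needs a size of 0 and the base address VirtualAlloc returned
  ok := VirtualFree(p, 0, MEM_RELEASE);
  if not ok then
    writeln('VirtualFree failed, error ', GetLastError);
end.
```

Timing the VirtualFree call alone, from one thread and then from several, is one way to check whether the cost sits in the OS rather than in the memory manager.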

We may try to use NtAllocateVirtualMemory() and NtFreeVirtualMemory() with 1GB pages as mimalloc does.
https://github.com/microsoft/mimalloc/b … 5/src/os.c
You may consider trying mimalloc for your use case.

Offline

#12 2024-11-22 11:57:01

okoba
Member
Registered: 2019-09-29
Posts: 121

Re: Multi thread free memory using fpcx64mm

You are right.
Thank you for the notes.

Offline

#13 2024-11-27 15:44:43

zen010101
Member
Registered: 2024-06-15
Posts: 70

Re: Multi thread free memory using fpcx64mm

I tried on Windows 11; here are the results:

1. FPC Default MM

16 GB, 1 MB: 16384
Allocate: 16 GB in 21.29s i.e. 769.5 MB/s
FreeSingle: 16 GB in 613.26ms i.e. 26 GB/s
Allocate: 16 GB in 21.38s i.e. 766.1 MB/s
FreeMulti: 16 GB in 35.68ms i.e. 448.4 GB/s

16 GB, 4 MB: 4096
Allocate: 16 GB in 4.87s i.e. 3.2 GB/s
FreeSingle: 16 GB in 583.41ms i.e. 27.4 GB/s
Allocate: 16 GB in 4.45s i.e. 3.5 GB/s
FreeMulti: 16 GB in 33.54ms i.e. 477 GB/s

16 GB, 64 MB: 256
Allocate: 16 GB in 4.43s i.e. 3.6 GB/s
FreeSingle: 16 GB in 535.99ms i.e. 29.8 GB/s
Allocate: 16 GB in 3.89s i.e. 4.1 GB/s
FreeMulti: 16 GB in 42.48ms i.e. 376.6 GB/s

16 GB, 128 MB: 128
Allocate: 16 GB in 4.30s i.e. 3.7 GB/s
FreeSingle: 16 GB in 533.62ms i.e. 29.9 GB/s
Allocate: 16 GB in 3.90s i.e. 4 GB/s
FreeMulti: 16 GB in 68.81ms i.e. 232.4 GB/s

16 GB, 256 MB: 64
Allocate: 16 GB in 4.35s i.e. 3.6 GB/s
FreeSingle: 16 GB in 523.76ms i.e. 30.5 GB/s
Allocate: 16 GB in 3.86s i.e. 4.1 GB/s
FreeMulti: 16 GB in 134.51ms i.e. 118.9 GB/s

16 GB, 1 GB: 16
Allocate: 16 GB in 4.21s i.e. 3.8 GB/s
FreeSingle: 16 GB in 495.69ms i.e. 32.2 GB/s
Allocate: 16 GB in 3.83s i.e. 4.1 GB/s
FreeMulti: 16 GB in 36.50ms i.e. 438.2 GB/s

2. fpcx64mm

16 GB, 1 MB: 16384
Allocate: 16 GB in 3.99s i.e. 4 GB/s
FreeSingle: 16 GB in 564.80ms i.e. 28.3 GB/s
Allocate: 16 GB in 4s i.e. 3.9 GB/s
FreeMulti: 16 GB in 1.12s i.e. 14.2 GB/s

16 GB, 4 MB: 4096
Allocate: 16 GB in 3.96s i.e. 4 GB/s
FreeSingle: 16 GB in 610.13ms i.e. 26.2 GB/s
Allocate: 16 GB in 3.92s i.e. 4 GB/s
FreeMulti: 16 GB in 985.57ms i.e. 16.2 GB/s

16 GB, 64 MB: 256
Allocate: 16 GB in 3.86s i.e. 4.1 GB/s
FreeSingle: 16 GB in 518.28ms i.e. 30.8 GB/s
Allocate: 16 GB in 3.80s i.e. 4.2 GB/s
FreeMulti: 16 GB in 808.84ms i.e. 19.7 GB/s

16 GB, 128 MB: 128
Allocate: 16 GB in 3.85s i.e. 4.1 GB/s
FreeSingle: 16 GB in 488.24ms i.e. 32.7 GB/s
Allocate: 16 GB in 3.83s i.e. 4.1 GB/s
FreeMulti: 16 GB in 776.20ms i.e. 20.6 GB/s

16 GB, 256 MB: 64
Allocate: 16 GB in 3.89s i.e. 4.1 GB/s
FreeSingle: 16 GB in 541.66ms i.e. 29.5 GB/s
Allocate: 16 GB in 3.85s i.e. 4.1 GB/s
FreeMulti: 16 GB in 726.30ms i.e. 22 GB/s

16 GB, 1 GB: 16
Allocate: 16 GB in 3.83s i.e. 4.1 GB/s
FreeSingle: 16 GB in 518.28ms i.e. 30.8 GB/s
Allocate: 16 GB in 3.88s i.e. 4.1 GB/s
FreeMulti: 16 GB in 786.35ms i.e. 20.3 GB/s

The performance of FPC Default MM is very good, except for 2 issues: first, the speed of allocation in units of 1MB is indeed slow, but the speed of allocation in units of 4MB and above is comparable to fpcx64mm; second, memory is only released when the program is closed.
With fpcx64mm, I see the same problem described by @okoba: FreeMulti is not faster than FreeSingle, it is slower.

Offline

#14 2024-11-27 18:31:07

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,736
Website

Re: Multi thread free memory using fpcx64mm

memory is only released when the program is closed.

I guess that may be why the FPC default MM is so fast at "freeing" memory blocks: it does not free anything. ;)

From what I can see, the problem is the Windows VirtualFree() API which does not scale.
The problem is not in fpcx64mm itself, which is almost lock-free when freeing large blocks: all time seems to be spent in the Windows API.
If you try on Linux, you won't find any bottleneck or lock in fpcx64mm.

Offline

#15 2024-11-28 07:48:28

okoba
Member
Registered: 2019-09-29
Posts: 121

Re: Multi thread free memory using fpcx64mm

Default MM leaves the memory and does not free it to the OS as far as I know.

Offline
