#51 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-26 12:34:49

Version 1.0b is released:
http://code.google.com/p/scalemm/downlo … M_v10b.zip
Interthread memory now seems to work OK in a real-world (stress) test (there is still a lock for each item that is freed).

A POC of version 2 (bit scanning and "integer as 32-bit array") is also available:
http://code.google.com/p/scalemm/source … aleMM2.pas
(it just works; many optimizations are still to be done)

#52 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-26 11:02:24

So no direct improvement regarding Delphi's string reference counting? Because that's also done using a single LOCK.

#53 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-26 09:46:55

About RCU: how can a critical section be made with no overhead? IMHO you need at least a LOCK to make a critical section?

#54 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-26 08:55:57

Btw, I got my new POC working (no linked list but "bit arrays"), which makes "interthread" memory easier. Probably no GC thread is needed at all, only some small scanning in the owner thread when it needs more memory, but I'm still working on it :-).

#55 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-26 08:47:01

About the RCU: do you have a simple explanation of how this can be programmed in Delphi?

About thread safety: the current version uses linked lists. When some memory is freed, I link its block into the block list (FirstFreeBlock) for fast reuse (no block scanning). Doing that from another thread is not thread safe, because a block is only guaranteed to stay available as long as not all of its memory has been freed. A block of a short-lived thread can be posted to the global list (because the thread has terminated) and then be taken by another thread, so the block list (the owner of the block) can change to another thread. It is therefore not thread safe to change anything in a block list from another thread.

#56 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-25 11:05:51

Double free: in the current version, if you do
GetMem(P, Size);
FreeMem(P);
FreeMem(P);
you get no error, but very strange results (because the same memory can afterwards be handed out by 2 different GetMem calls).

A background thread is not difficult, but thread safety is :-) (for example, if you have many short-lived threads, mem blocks can be reused in many threads other than the original one)
(the current version is not thread safe for "interthread mem")

#57 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-25 09:32:11

FYI: I'm busy with some rework (using an Integer as a 32-bit array, with BSF and BTR for bit scanning and toggling), which makes it more memory efficient (no fixed array needed) and also simpler (double free is no longer a problem either). Maybe I can even use the same technique to remove the 2 doubly linked lists...
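
To make the "Integer as 32-bit array" idea concrete, here is a minimal sketch of how such a mask could track 32 items. It is not the ScaleMM code: TBitBlock, LowestSetBit, AllocItem and FreeItem are hypothetical names, and the loop in LowestSetBit is a pure-Pascal stand-in for what a single BSF instruction would do. The convention assumed here is "bit set = item is free".

```pascal
program BitMaskSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$C+}  // keep Assert active

type
  // Hypothetical block header: one bit per item, bit set = item is free.
  TBitBlock = record
    FreeMask: Cardinal;
  end;

// Pure-Pascal stand-in for BSF: index of the lowest set bit, -1 if none.
function LowestSetBit(Mask: Cardinal): Integer;
begin
  Result := -1;
  if Mask = 0 then
    Exit;
  Result := 0;
  while (Mask and 1) = 0 do
  begin
    Mask := Mask shr 1;
    Inc(Result);
  end;
end;

procedure InitBlock(var Block: TBitBlock);
begin
  Block.FreeMask := $FFFFFFFF;  // all 32 items start out free
end;

// Returns the index of the claimed item, or -1 when the block is full.
function AllocItem(var Block: TBitBlock): Integer;
var
  Bit: Cardinal;
begin
  Result := LowestSetBit(Block.FreeMask);
  if Result >= 0 then
  begin
    Bit := Cardinal(1) shl Result;
    Block.FreeMask := Block.FreeMask and not Bit;  // clear the bit, like BTR
  end;
end;

// Returns False on a double free (the bit is already set).
function FreeItem(var Block: TBitBlock; Index: Integer): Boolean;
var
  Bit: Cardinal;
begin
  Bit := Cardinal(1) shl Index;
  Result := (Block.FreeMask and Bit) = 0;
  if Result then
    Block.FreeMask := Block.FreeMask or Bit;
end;

var
  B: TBitBlock;
begin
  InitBlock(B);
  Assert(AllocItem(B) = 0);
  Assert(AllocItem(B) = 1);
  Assert(FreeItem(B, 0));       // first free succeeds
  Assert(not FreeItem(B, 0));   // double free is detected
  Assert(AllocItem(B) = 0);     // lowest free item is handed out again
  WriteLn('bit mask checks passed');
end.
```

With this layout a double free is just a bit that is already set, which is presumably why it stops being a problem in the reworked version.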

I'm working/thinking on interthread memory: it uses a LOCK for every FreeMem call now, which makes it slow and scale badly. The current implementation is also not thread safe, so you get AVs when passing memory from thread 1 to thread 2. The new implementation does not use a lock, but needs some kind of GC thread... If someone needs a version with a simple but badly scaling interthread mem implementation, I can make that quite easily.
(I want a fast interthread algorithm, because I use async threads for background data retrieval from the database, so a lot of memory is freed in the main thread)

I hope to have a working version next week.

#58 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-23 14:48:50

Thanks for the rework!
I will try to do some bug fixing (interthread mem etc.) and testing in our real-world app, hopefully this week.
I will post my changes in:
http://code.google.com/p/scalemm/source … caleMM.pas

Btw: why did you change record into object? It is a little confusing, because it is not a class/object.

#59 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-23 08:06:42

Would be nice!

I already implemented the simple "medium size" array (4096-36kb) because our real-world apps use a lot of these sizes.
Still some bugs (AVs) in our multithreaded app (interthread memory), so I need to change that part (it is also too slow due to too many LOCKs). I already wanted to change it, because I want to use "Bit Scanning: BSF and BSR" (a 32- or 64-bit array) instead of a fixed array of pointers (too much memory overhead), using 32-bit masks to quickly find and set free/allocated items, etc. Too many thoughts and wishes, too little time... (especially due to annoying bugs)

#60 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-22 07:41:31

ab wrote:

I just noticed (from svn code) that you changed some i: integer into i: byte
In TGlobalMemManager.GetBlockMemory, I guess your try...finally bl.FRecursive := False is pointless. Previous implementation sounded better to me.
IMHO TMemBlockList.FItemSize should be NativeInt to avoid a word to integer/NativeInt cast in the generated code.

I tried your proposal and got no speed improvement (byte to integer/NativeInt). Changing FItemSize from word to integer/NativeInt even made it a bit slower (probably due to alignment).
So I kept the byte and word, mainly to show that it uses low values and that these sizes are the minimum needed (and it does not get slower with them). The only parts where I use integer instead of NativeInt are the TMemoryManagerEx functions. I changed all other parts to NativeInt where it made sense, and to another type where an integer was not needed.

About try..finally: it is there just to be sure in case of an AV.

#61 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-19 13:38:48

ab wrote:

Nice!
What about your testing status?
Does it scale well on real applications (like a server application)?

Not yet, only in a demo program with 30.000.000 mem allocs, strings + objects in 16 threads on a quad core machine :-).

I'm busy right now @ work with a large refactoring of old CRUD objects, so hopefully next week.

#63 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-18 13:41:26

- "So replace it with GetMem and FreeMem"
  OK, that's better: Scale_GetMem etc. it will be then :-)
- "replace all div with shl"
  Yes I will do that for version 1.0
- about using _fixedoffset:
  this way no (extra) multiply is done each time to calculate offset to threadmem, should be faster...
  But yes, for 64bit (no asm support :-( ) I will probably need to make use of normal threadvar.
- about asm rewrite:
  Yes, that would be nice! I made it a lower prio because of no 64bit asm, but for 32bit it would be good anyway.
  I am curious how much faster you can make it in asm!
- about MM of FPC
  No, I coded it completely myself, but I used some ideas from TopMM (which is slower than FastMM in a real-world test and also too bloated; that's why I made ScaleMM) and of course some from FastMM. I also used the CAS32 of the lock-free queue:
http://www.thedelphigeek.com/2010/02/dy … right.html
  But I will take a look at it!
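
A small note on the div/shift point, as a sketch: for non-negative values, dividing by a power of two corresponds to a right shift (shr), while shl multiplies. So `div 32` matches `shr 5` and `div 256` matches `shr 8`. The function names below are hypothetical and only there to compare the two forms.

```pascal
program ShiftSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$C+}  // keep Assert active

// Hypothetical helper: index of the 32-byte slot that contains Offset.
function SlotIndexDiv(Offset: Cardinal): Cardinal;
begin
  Result := Offset div 32;
end;

// Same result with a shift: 32 = 2^5, so shift right by 5.
function SlotIndexShr(Offset: Cardinal): Cardinal;
begin
  Result := Offset shr 5;
end;

var
  I: Cardinal;
begin
  // Both forms agree for every unsigned value tried here.
  for I := 0 to 4096 do
    Assert(SlotIndexDiv(I) = SlotIndexShr(I));

  Assert(SlotIndexShr(100) = 3);       // 100 div 32 = 3
  Assert((Cardinal(100) shl 5) = 3200); // shl multiplies by 32 instead
  WriteLn('shift checks passed');
end.
```

For signed Integer values that can be negative the two forms differ, so the replacement is only safe where the value is known to be non-negative (sizes, offsets, indexes).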

#64 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-18 13:09:31

Thank you for the positive reply!

- "You don't handle aSize<=0"
  Thank you, will add a check for that (or maybe the RTL already does that?)
- "Everywhere in the code (to be 64 bit ready)"
  Yes, I will change that too; I really hope 64bit (preview) will be available soon to test ScaleMM.
  But for now, I used my scarce spare time to get it working. 64-bit compatibility is for the (near) future, lower prio.
- "the RET in ending of _GetOwnTls is pointless"
  Ok, thanks
- "Perhaps joining all code in only one ScaleMM unit could make sense (easier to deploy/install)"
  Yes, I first thought small units would be better than one big unit, but that is not easier to install (you have to add multiple units to the dpr or use a search path)
- "RELEASE should be the default: a NORELEASE could make sense by default (for version 1.0)"
  Ok, thanks
- "why did you use {$A8} ? {$A4} or 'packed' wouldn't be wrong either. 16 bytes alignment could be an option only, and FastMM4 doesn't always returns 16 byte aligned blocks in the default compilation mode (via OldMM.GetMem)."
  Just some test to see if it makes a difference :-)
  I do not have much experience with alignment yet
- "If you use 16 bytes per TMemHeader, you could add a "magic" cardinal number to it to identify Freemem/Reallocmem misuses"
  Yes, I can use the 8 extra bytes for something else then, but what do you mean by "misuses"?
- "you are using New() and Dispose() which can be safely replaced with a OldMM.GetMem(sizeof()) and OldMM.FreeMem() calls."
  Yes and no: this way it can use its own memory (that's the reason for the "recursive" check)
- "You use some div 32 or div 256: you'd better use  shl 5  and shl 8, which will be faster when working with integer/NativeInt."
  Aha, thanks for the tip
- about downscaling:
  E.g. aSize is 1, itemsize is 256 then:
  1 < 256
    1 + 256 >= 256  ....   hmmm you're right

  should be:
    if aSize < Owner.FItemSize div 2 then   //downscale if more than half smaller
   1    < 256 div 2  -> not ok, too much downscaling
   128 < 256 div 2  -> ok, just the half but ok
- about large mem:
  For the first version, I focus on "make a working POC", so ScaleMM on top of FastMM was the easiest :-).
  Btw, "on top" is also a nice feature: you can put any other MM before ScaleMM!
  I haven't thought much about how to deal with large mem (I would like to make version 2 independent of FastMM),
  but your proposal of using the heap sounds good, thanks! Do you have good ideas about a "large memory" strategy?
  I know FastMM has some special logic for medium and large blocks.
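
On the "magic" number suggestion: one possible reading of "misuses" is a FreeMem/ReallocMem call on a pointer that did not come from this memory manager, or that was already freed. The sketch below shows that reading only; it is not ScaleMM code. TMemHeader here is a hypothetical 8-byte header, HeaderMagic is an arbitrary value, and CheckedGetMem/CheckedFreeMem are made-up names.

```pascal
program MagicHeaderSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$C+}  // keep Assert active

const
  HeaderMagic = $5CA1E001;  // arbitrary marker value (an assumption)

type
  PMemHeader = ^TMemHeader;
  TMemHeader = record       // hypothetical header placed before the user data
    Magic: Cardinal;
    Size: Cardinal;
  end;

function CheckedGetMem(Size: Cardinal): Pointer;
var
  H: PMemHeader;
begin
  GetMem(H, SizeOf(TMemHeader) + Size);
  H^.Magic := HeaderMagic;
  H^.Size := Size;
  Inc(H);                   // user data starts right after the header
  Result := H;
end;

// Returns False (and frees nothing) when the header does not look like ours.
function CheckedFreeMem(P: Pointer): Boolean;
var
  H: PMemHeader;
begin
  H := P;
  Dec(H);                   // step back to the header
  Result := H^.Magic = HeaderMagic;
  if Result then
  begin
    // Clear the marker first: a later free of the same stale pointer will
    // usually fail the check (not guaranteed once the memory is reused).
    H^.Magic := 0;
    FreeMem(H);
  end;
end;

var
  P: Pointer;
  Foreign: array[0..7] of Cardinal;
begin
  FillChar(Foreign, SizeOf(Foreign), 0);
  P := CheckedGetMem(100);
  Assert(CheckedFreeMem(P));                // pointer from CheckedGetMem: accepted
  Assert(not CheckedFreeMem(@Foreign[4]));  // no valid header in front: rejected
  WriteLn('magic header checks passed');
end.
```

The check costs one compare per FreeMem, so it could also be limited to a debug build.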

#65 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-18 10:56:08

I made a new memory manager for Delphi, which uses a threadvar/TLS internally, so each thread has its own memory. This way no locks are used, and memory-intensive apps scale (almost) perfectly now!
The name of the new MM is ScaleMM:
http://code.google.com/p/scalemm/
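
The per-thread idea can be sketched in a few lines. This is only an illustration under assumptions, not the actual ScaleMM design: it handles a single hypothetical item size (ItemSize), keeps one free list per thread in a threadvar, and falls back to the RTL GetMem when the list is empty. ThreadGetItem and ThreadFreeItem are made-up names.

```pascal
program ThreadCacheSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$C+}  // keep Assert active

const
  ItemSize = 32;  // hypothetical fixed item size for this sketch

type
  PFreeItem = ^TFreeItem;
  TFreeItem = record
    Next: PFreeItem;  // stored inside the freed item itself
  end;

threadvar
  // Every thread gets its own copy, initialised to nil.
  ThreadFreeList: PFreeItem;

function ThreadGetItem: Pointer;
begin
  Result := ThreadFreeList;
  if Result <> nil then
    ThreadFreeList := PFreeItem(Result)^.Next  // pop from this thread's list
  else
    GetMem(Result, ItemSize);                  // list empty: ask the RTL manager
end;

procedure ThreadFreeItem(P: Pointer);
begin
  // Push onto the calling thread's list. No lock is needed, because no
  // other thread ever reads or writes this list head.
  PFreeItem(P)^.Next := ThreadFreeList;
  ThreadFreeList := PFreeItem(P);
end;

var
  P, Q: Pointer;
begin
  P := ThreadGetItem;   // list is empty, so this comes from GetMem
  ThreadFreeItem(P);    // goes onto this thread's list
  Q := ThreadGetItem;   // served from the list, without touching GetMem
  Assert(Q = P);
  FreeMem(Q);
  WriteLn('thread cache checks passed');
end.
```

The sketch never hands cached items back when a thread ends; a real manager needs a global pool for that, which is where locking comes back in.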

I released version 0.9 yesterday: still a little bit rough but functional :-).
It is also 4x faster than FastMM, even single-threaded!

I had 2 major things in mind:
1 = make it scale
2 = make it as simple as possible (so faster than bloated FastMM)

I still use a simple CAS lock for interthread memory (memory allocated in thread 1, freed in thread 2).
And for the global cache/pool I use a simple lock, which can maybe be replaced with an interlocked index for getting/putting an item in an array/pool.
If anybody has better ideas how to do it without locking: say it please :-).
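
For readers who have not seen one, a "simple CAS lock" can look like the sketch below. It is not the code from ScaleMM: PoolLock, AcquireLock, TryAcquireLock and ReleaseLock are hypothetical names, and it assumes the Integer versions of InterlockedCompareExchange and InterlockedExchange (Windows unit in Delphi 2009 and later, System unit in FPC).

```pascal
program CasLockSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$C+}  // keep Assert active

uses
  {$IFNDEF FPC}Windows,{$ENDIF}  // Delphi: Interlocked* are declared here
  SysUtils;                      // Sleep

var
  PoolLock: Integer = 0;  // 0 = free, 1 = taken (hypothetical global)

// One attempt: atomically change 0 to 1. The call returns the previous
// value, so 0 means this caller now holds the lock.
function TryAcquireLock(var Lock: Integer): Boolean;
begin
  Result := InterlockedCompareExchange(Lock, 1, 0) = 0;
end;

procedure AcquireLock(var Lock: Integer);
begin
  while not TryAcquireLock(Lock) do
    Sleep(0);  // give up the time slice instead of burning CPU
end;

procedure ReleaseLock(var Lock: Integer);
begin
  InterlockedExchange(Lock, 0);
end;

begin
  AcquireLock(PoolLock);
  Assert(PoolLock = 1);
  Assert(not TryAcquireLock(PoolLock));  // already taken
  ReleaseLock(PoolLock);
  Assert(PoolLock = 0);
  Assert(TryAcquireLock(PoolLock));      // free again
  ReleaseLock(PoolLock);
  WriteLn('CAS lock checks passed');
end.
```

Every acquire is still an interlocked (LOCK-prefixed) operation, so this keeps the cost per call that the per-thread memory avoids; it only replaces a heavier critical section.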

Also other comments or help are welcome!
