Intel Core 2 Quad Q8300 @ 2.5 GHz
4B
1 = 47,83 nanoseconds per cycle
2 = 58,80 nanoseconds per cycle
4 = 124,68 nanoseconds per cycle
8 = 128,21 nanoseconds per cycle
8B
1 = 56,61 nanoseconds per cycle
2 = 73,01 nanoseconds per cycle
4 = 146,73 nanoseconds per cycle
8 = 146,73 nanoseconds per cycle
8BV
1 = 54,40 nanoseconds per cycle
2 = 75,75 nanoseconds per cycle
4 = 177,40 nanoseconds per cycle
8 = 222,44 nanoseconds per cycle
So 4B seems to be the fastest
Btw: I almost have my new ScaleMM algorithm ready; I only need to fix some bugs (to have a working POC, fully working needs some more time)
Offline
Please check out my latest revision:
http://code.google.com/p/delphi-toolbox … Free/?r=30
Could not compile. D6 won't open the project; it dies with an Access Violation (something wrong in my installation). And D2007 gives an internal error when trying to compile it, so I can't compile and test it at all for now...
I'll try it at home, but it would be helpful to have a compiled .exe, and a bat file to run a standard "test suite" with various settings.
Maybe modify the test program to write a log to disk, to make posting here easier...
-TP-
Offline
I compiled it with Delphi 2010. Previous versions will have problems with the record semantics.
But I'm not at the computer on which I compiled it, so I can't upload the exe easily.
Online
I compiled it with Delphi 2010. Previous versions will have problems with the record semantics.
I looked at it and it looked strange to me, but I'm not sure at which point it was introduced...
I'll compile with D2010 at home...
-Tee-
Offline
Could not compile. D6 won't open the project; it dies with an Access Violation (something wrong in my installation). And D2007 gives an internal error when trying to compile it, so I can't compile and test it at all for now...
I'll try it at home, but it would be helpful to have a compiled .exe, and a bat file to run a standard "test suite" with various settings.
Maybe modify the test program to write a log to disk, to make posting here easier...
-TP-
Sorry, I was traveling the past few days...
I started learning lock-free code around late 2008, so I have never tested my code on 2007 and below. But it should work with 2009 and up.
Last edited by AdamWu (2010-12-23 06:15:31)
Offline
D2010 compiled it just fine...
-TP-
Offline
I have a "working" POC of my newest algorithm:
http://code.google.com/p/scalemm/source … aleMM2.pas
It works completely differently from the first version: it does not use preallocated blocks of one fixed size, but does dynamic allocation from one big 1 MB block (like FastMM does with medium blocks). This way you have lower memory usage, because all the memory of the 1 MB block is used for all sizes. The downside of this approach is some more memory fragmentation within the block, but I tried to reduce this by using a lookup index (a bit-array mask) per size, so a small alloc does not use the first available (big) chunk but tries to use the smallest one possible.
It is working for a small number of allocs, but it has a nasty bug somewhere... However, you can get an idea of how it should work. It is not optimized yet, but I tried to use "fast" techniques (shl and shr instead of div) at the base.
I hope I can remove some overhead here and there (too many if statements, etc.).
Note: the bit scanning (also reverse bit scanning to get the highest bit for the size), the use of masks, etc. makes it less easy to follow (more "high tech" than version 1).
Don't know how fast it will be (for small allocs) in the end; maybe use ScaleMM v1 on top of version 2? :-)
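The smallest-fit lookup via a bit-array mask described above can be sketched in C (a hypothetical illustration, not the actual ScaleMM2 code; the function names and the 32-byte base size class are assumptions):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the "lookup index by size" idea: bit i of
 * free_mask is set when size class i (chunks of 32 << i bytes) has at
 * least one free chunk in the 1 MB block. */

/* Map a request size to its class using shifts instead of div:
 * class 0 = up to 32 bytes, class 1 = up to 64 bytes, ... */
static int size_to_class(uint32_t size) {
    int c = 0;
    uint32_t limit = 32;
    while (limit < size) { limit <<= 1; c++; }
    return c;
}

/* Return the smallest class >= wanted that has a free chunk, or -1.
 * Masking off the lower bits and scanning forward finds the best fit,
 * so a small alloc does not grab the first available (big) chunk. */
static int smallest_free_class(uint32_t free_mask, int wanted) {
    uint32_t candidates = free_mask & ~((1u << wanted) - 1);
    if (candidates == 0) return -1;
    return __builtin_ctz(candidates);  /* forward bit scan (like BSF) */
}
```

The reverse bit scan mentioned in the post (BSR) would be the natural way to compute the size class in one instruction instead of the loop above.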
Offline
Good to hear that there is progress.
I was just pondering how ScaleMM will work with large blocks.
Allocating and disposing is locked, because of FastMM (etc.) underneath, but how about using them, like copying smaller blocks for processing etc.? Will that be lock-free? I suppose so... (Just trying to understand where this is currently.)
I need ScaleMM for one server, but quite often it uses large blocks; they are owned by each thread (so no cross-thread usage), but I've just been thinking that operations on those are blocked anyway with FastMM currently.
Another thing that crossed my mind is that most likely OmniThreadLibrary users would also gain from ScaleMM, if there aren't too many dependencies between OTL and FMM...
-TP-
Offline
Good to hear that there is progress.
I was just pondering how ScaleMM will work with large blocks.
Allocating and disposing is locked, because of FastMM (etc.) underneath, but how about using them, like copying smaller blocks for processing etc.? Will that be lock-free? I suppose so... (Just trying to understand where this is currently.)
-TP-
Yes, ScaleMM version 1 works on top of FastMM, so large allocations (or when ScaleMM needs more mem) are locked by FastMM. But all small blocks etc. are processed per thread, so no locking at all!
I need ScaleMM for one server, but quite often it uses large blocks; they are owned by each thread (so no cross-thread usage), but I've just been thinking that operations on those are blocked anyway with FastMM currently.
Another thing that crossed my mind is that most likely OmniThreadLibrary users would also gain from ScaleMM, if there aren't too many dependencies between OTL and FMM...
-TP-
Because of the FastMM lock dependency I am making a large-block allocator. This allocator is fully dynamic, and it seems so good (?) that it could also be used for small blocks. So I am thinking of using it as a complete allocator (no real difference between small and medium mem). Or, if the speed is not good enough, of using ScaleMM1 on top of ScaleMM2 :-).
I really would like to get rid of FastMM locking to get full scaling: this is the future (multi-core, OTL, AsyncCalls, etc.)!
Btw: I hope I solved a nasty bug in ScaleMM2 yesterday evening, so I can test/develop it further.
Offline
First of all: a happy 2011!
Good news: V2 seems to work now. It is not fully working yet (e.g. no inter-thread mem, and it does not release mem back to Windows).
Speed is about 3 times slower (30M allocs/reallocs/frees in 3.1 s; V1 does it in 1.2 s), but there are enough optimizations possible.
http://code.google.com/p/scalemm/source … aleMM2.pas
Offline
Version 2 is almost working, and ScaleMM1 works on top of v2 now.
(ScaleMM2 needs a 16-byte header plus a minimum alloc size of 32 bytes, so too much overhead for small blocks. ScaleMM1 is for small mem (<2 KB), ScaleMM2 for medium (<1 MB), and anything larger is direct VirtualAlloc/VirtualFree.)
http://code.google.com/p/scalemm/source … caleMM.pas
http://code.google.com/p/scalemm/source … aleMM2.pas
Some "small" problems still need to be fixed, like backwards free-block scanning; without it you'll get "out of memory" in intensive tests.
And of course the necessary optimizations, cleanup, documentation, etc.
Speed of ScaleMM1 is about 1100 ms and ScaleMM2 about 2400 ms (30M allocs/reallocs/frees). So small memory allocation is faster than medium.
PS: the ABA problem needs to be fixed too; will do this soon
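For readers unfamiliar with the ABA problem mentioned here: a compare-and-swap on a lock-free free-list head can succeed even though the list changed in between, because the head happens to hold the same pointer value again. One common fix (not necessarily what ScaleMM will use) pairs the pointer with a version counter. A hypothetical single-threaded sketch in C, with the "atomic" CAS simulated by a plain function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pair the head pointer with a version counter so that a pointer which
 * was popped and later pushed back no longer compares equal. */
typedef struct {
    void     *ptr;
    uint64_t  ver;   /* bumped on every successful update */
} tagged_ptr;

/* Single-threaded stand-in for an atomic compare-and-swap on the pair
 * (real code would use a double-width CAS or pack both into one word). */
static bool cas_tagged(tagged_ptr *head, tagged_ptr expect, void *newptr) {
    if (head->ptr != expect.ptr || head->ver != expect.ver)
        return false;
    head->ptr = newptr;
    head->ver = expect.ver + 1;
    return true;
}
```

With a plain pointer, a thread that read head = A before another thread popped A and pushed it back would see its stale CAS succeed; with the version counter, the stale CAS fails because the version has moved on.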
Offline
Should SynScaleMM be used for single-threaded applications, or ones with a small number (up to 3-5) of threads? What guidelines are there to choose the right one from FastMM/ScaleMM/SynScaleMM?
--- we no need no water, let the ... burn ---
Offline
Good question!
One important fact to notice is that SynScaleMM/ScaleMM will create a per-thread heap.
So it's definitely designed for use with a fixed number of threads.
Using a thread pool is a must for SynScaleMM/ScaleMM: if you don't have a thread pool and create a lot of threads, each thread will create a new heap, which will be slow.
So here are some guidelines:
- FastMM4: If you use only one thread;
- FastMM4: If you are short on RAM (SynScaleMM/ScaleMM consume more memory);
- FastMM4: If you use some background threads which are not working continuously (e.g. a background thread for refreshing some data for some milliseconds, then free this thread, while the main thread deals with the UI);
- SynScaleMM or ScaleMM: when you use a server application with background threads running continuously in parallel - but WITHOUT a lot of thread creation (using e.g. a Thread Pool);
- FastMM4: when you use a server application with background threads running continuously in parallel - but WITH a lot of thread creation (no Thread Pool): in this case, consider reimplementing the server, using a Thread pool: this kind of architecture will be slower than SynScaleMM/ScaleMM + thread pool.
Of course, our framework is using a fixed number of threads:
- named pipe connections are designed to keep the connection alive as long as the client software is running, so there is only one thread created per client;
- HTTP/1.1 connections also handle keep-alive, so one thread per client does make sense here;
- HTTP/1.0 connections are not kept alive, but use a thread pool via completion ports, so they will fit perfectly with SynScaleMM/ScaleMM.
Online
Thanks for the clarification - it is helpful for the non-gurus of this field
--- we no need no water, let the ... burn ---
Offline
ScaleMM is faster than FastMM, so it is also good for a single thread.
ScaleMM also caches the thread managers, so it should also be good with many short-lived threads.
ScaleMM just uses more mem than FastMM, so do not use it if you are low on memory (btw: FastMM also uses more mem than low-level Windows mem).
ScaleMM2 is almost ready; I'm busy with the last details. It works without FastMM, so it is faster multithreaded (no FastMM locks underneath), it will have a special medium-block algorithm, and it uses direct VirtualAlloc for large mem (> 1 MB); a special large-block handler can easily be made in case someone uses lots of large blocks.
Offline
In my tests, ScaleMM is slower than FastMM for single-threaded applications...
About the thread manager caching, you're right: I forgot about it!
Thanks for the good news about ScaleMM2.
Have you any preliminary benchmarks?
Online
30M alloc/realloc/free (small mem, 10 - 120 bytes), 1 thread:
FastMM = 4376 ms
ScaleMM2 = 1651 ms (still too high, can be optimized further; ScaleMM1 takes about 1100 ms)
30M alloc/realloc/free (medium mem, 10 KB - 80 KB), 1 thread:
FastMM = 58206 ms (!), with no resize (+10 bytes) = 2302 ms
ScaleMM2 = 2326 ms
for j := 0 to 1000 do
for i := 0 to 10000 do
p1 := GetMemory(10);
p2 := GetMemory(40);
p3 := GetMemory(80);
p1 := ReallocMemory(p1, 30);
p2 := ReallocMemory(p2, 60);
p3 := ReallocMemory(p3, 120);
FreeMemory(p1);
FreeMemory(p2);
FreeMemory(p3);
for j := 0 to 1000 do
for i := 0 to 10000 do
p1 := GetMemory(10 * 1024);
p2 := GetMemory(40 * 1024);
p3 := GetMemory(80 * 1024);
p1 := ReallocMemory(p1, 10 * 1024 + 10);
p2 := ReallocMemory(p2, 40 * 1024 + 100);
p3 := ReallocMemory(p3, 80 * 1024 + 10);
FreeMemory(p1);
FreeMemory(p2);
FreeMemory(p3);
Offline
This test is a bit unrealistic.
You're freeing the just-allocated memory.
This is really a "best case", which doesn't reflect the reality of memory allocation in an application.
Or perhaps there are some missing begin...end in your above code!!!
My tests were with running the whole unitary test benchmark of our framework, using one MM or the other.
And FastMM4 made (a little bit) better results than ScaleMM.
Online
Yeah, you must put begin/end around the for loops :-)
And yes, it is a simple test to test the "core" speed (only memory operations, nothing more).
In real life the results will be different. The FastCodeMMChallenge also showed ScaleMM to be slower in some cases (because it works on top of FastMM: mem larger than 2 KB is passed to FastMM, so slightly slower because of the ScaleMM size-check overhead). I hope ScaleMM2 won't have this limitation. I still have one (?) tiny bug, so I cannot run the full FastCodeMMChallenge yet...
Btw: an alpha version is in source control, with ScaleMM1 in a separate branch
Last edited by andrewdynamo (2011-01-25 10:20:30)
Offline
The latest version seems to work OK now (added medium-mem CheckMem functions: found a couple of nasty bugs with them!)
Also a unit test was added.
http://code.google.com/p/scalemm/source/browse/trunk
It only supports mem < 2 GB, and there is no inter-thread mem support and no mem-leak tracking yet.
And some more optimizations and cleanup are needed...
Offline
Delphi 2007:
[DCC Error] smmFunctions.pas(96): E2107 Operand size mismatch
[DCC Error] smmFunctions.pas(119): E2107 Operand size mismatch
[DCC Error] smmLargeMemory.pas(45): F2063 Could not compile used unit 'smmFunctions.pas'
[ 96] lock cmpxchg dword ptr [aDestination], aNewValue
...
[119] BSF EAX, aValue;
Offline
Delphi 2007:
Thanks for reporting. I only checked it with D2010; I will fix this tomorrow, I think (after D2010 is completely tested)
Offline
It compiles and runs fine on D2007 too now
Extra checks added; a new extensive test is running and going fine so far
Offline
Hmmm, FastCodeMMChallenge depends on big-mem realloc (which is only partly implemented), so it is not as fast as it should be (in that benchmark). So I need to use the same big-mem realloc algorithm as FastMM, I think (increment in steps of 64 KB instead of 1 byte :-) and use VirtualQuery to extend virtual mem)
Offline
Checked in better large-mem handling; now ScaleMM2 is 20% faster in the FastCodeMMChallenge :-)
Average Speed Performance: (Scaled so that the winner = 100%)
DelphiInternal : 79,7
ScaleMem : 100,0
Still some improvements possible, because some test results are too bad (will investigate and "fix" them)
Offline
Some intermediate detailed benchmark:
http://code.google.com/p/scalemm/#Benchmark_2
It is slower in small reallocations (with no size change, so only "function" overhead), but overall it is 25% faster now.
I need to change the slow "owner" determination (small, medium, large); I will change it to use the same kind of logic as FastMM does (use the lowest or highest bits of "size" to mark the free status and size type).
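The "lowest bits of size as flags" trick works because block sizes are multiples of the alignment, so those bits are always zero in a real size. A hypothetical C sketch of this FastMM-style packing (flag names and values are assumptions, not FastMM's actual ones):

```c
#include <assert.h>
#include <stdint.h>

/* Block sizes are multiples of the alignment (here assumed >= 4), so
 * the two lowest bits of the stored size are always zero and can carry
 * "is free" / block-type flags instead of needing extra header space. */
enum {
    FLAG_FREE   = 1u << 0,   /* chunk is on a free list          */
    FLAG_MEDIUM = 1u << 1,   /* chunk belongs to a medium block  */
    FLAG_MASK   = FLAG_FREE | FLAG_MEDIUM
};

static uint32_t pack_size(uint32_t size, uint32_t flags) {
    return size | flags;              /* size must be a multiple of 4 */
}
static uint32_t real_size(uint32_t packed) {
    return packed & ~(uint32_t)FLAG_MASK;
}
static int is_free(uint32_t packed)   { return (packed & FLAG_FREE) != 0; }
static int is_medium(uint32_t packed) { return (packed & FLAG_MEDIUM) != 0; }
```

This is the same idea as the `pm.Size and 1` and `NativeUInt(pm.OwnerBlock) and 1` checks in the ReallocMem code posted further down in the thread.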
Offline
In this benchmark 2, the multi-threaded test results (with Nexus DB or 8 threads) are not better than FastMM4, am I wrong?
Or is it the contrary? I didn't get the % thing... is higher better or lower better?
Online
Sorry, it was a quick test.
FastMM is 100% (time), so anything above is bad (like the first test) and anything below is better (less time means faster)
Offline
Yes, but I want it to be faster overall.
It must be a complete replacement, so it must also be fast single-threaded. If possible.
Offline
I have some slow performance in my code:
https://picasaweb.google.com/lh/photo/G … directlink
https://picasaweb.google.com/lh/photo/Q … directlink
Why are the lines below so slow? (more than 100 CPU cycles)
if NativeUInt(pm.OwnerBlock) and 1 <> 0 then
and:
if ot = PBaseThreadMemory(FCacheSmallMemManager) then
Maybe because of an L1/L2 cache fetch? (FCacheSmallMemManager is located in the current object; maybe it needs to be fetched?)
How can this be optimized?
(these 2 lines consume most of the time)
========================================
function TThreadMemManager.ReallocMem(aMemory: Pointer;
  aSize: NativeUInt): Pointer;
var
  pm: PBaseMemHeader;
  ot: PBaseThreadMemory;
begin
  if FOtherThreadFreedMemory <> nil then
    ProcessFreedMemFromOtherThreads;

  pm := PBaseMemHeader(NativeUInt(aMemory) - SizeOf(TBaseMemHeader));
  //check realloc of freed mem
  if (pm.Size and 1 = 0) then
  begin
    if NativeUInt(pm.OwnerBlock) and 1 <> 0 then
    //lowest bit is mark bit: medium mem has ownerthread instead of ownerblock (temp. optimization trial)
    //otherwise slow L1/L2 fetch needed in case of "large" distance
    begin
      ot := PBaseThreadMemory( NativeUInt(pm.OwnerBlock) and -2); //clear lowest bit
      if ot = PBaseThreadMemory(FCacheMediumMemManager) then
        Result := FCacheMediumMemManager.ReallocMem(aMemory, aSize)
      else
        Result := ReallocMemOfOtherThread(aMemory, aSize);
    end
    else
    begin
      ot := pm.OwnerBlock.OwnerThread;
      if ot = PBaseThreadMemory(FCacheSmallMemManager) then
        Result := FCacheSmallMemManager.ReallocMem(aMemory, aSize)
//      else if ot = PBaseThreadMemory(FCacheMediumMemManager) then
//        Result := FCacheMediumMemManager.ReallocMem(aMemory, aSize)
      else if ot = PBaseThreadMemory(FCacheLargeMemManager) then
        Result := FCacheLargeMemManager.ReallocMemWithHeader(aMemory, aSize)
      else
        Result := ReallocMemOfOtherThread(aMemory, aSize);
    end
  end
  else
    Error(reInvalidPtr);
end;
Last edited by andrewdynamo (2011-02-04 15:11:04)
Offline
Hmm, a high number of resource stalls, L1 loads blocked by stores, etc.:
https://picasaweb.google.com/lh/photo/q … directlink
Maybe the memory is too well aligned? L1 has 4 KB aliasing, so the first 4 KB must differ?
Offline
Status update: tried some offsets to disalign, but no success. I think I just must prevent too many "lookups" (they need L1/L2 cache fetches, or worse: memory fetches).
Instead of a nice structure and good "code smell", I'll have to use speed hacks and/or more (packed) data in the header for intelligent reallocs (I must determine the type: small, medium or large, and even worse: check the thread owner). I have some ideas for it, but this needs some restructuring...
I have already made a simple pre-realloc check: if the new size is smaller but greater than 1/4 of the current size, nothing has to be done (no thread-owner or type check). Now a realloc test is only slightly slower than (asm-optimized!) FastMM/D2010.
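That pre-realloc shrink check could look like this in C (a hypothetical sketch of the rule described above; the function name is invented):

```c
#include <assert.h>
#include <stddef.h>

/* Cheap pre-realloc check: when the requested size does not grow the
 * block and stays above a quarter of the current allocation, keep the
 * block as-is and skip the expensive owner/type determination. */
static int realloc_can_reuse(size_t cur_size, size_t new_size) {
    return new_size <= cur_size && new_size > cur_size / 4;
}
```

The win is that the common "shrink a little" case never touches the header's owner fields, so no extra cache line has to be fetched at all.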
Offline
Hello,
It looks like ScaleMM2 is compatible with Delphi 2007 and up only.
At least, it does not compile with Delphi 7.
Btw,
which is better, Delphi 7 or Delphi 2007?
Carl
Offline
Hello,
It looks like ScaleMM2 is compatible with Delphi 2007 and up only.
At least, it does not compile with Delphi 7.
Carl
That is possible.
I only have D2010, so I made it primarily for that. But "ab" made some changes to make ScaleMM1 compatible with D7, something like changing "record" to "object", so you could try that first. What kind of compiler errors do you get then?
Offline
Hello,
Thank you for your reply.
ScaleMM1 as well as SynScaleMM are working very fine with D7.
For ScaleMM2,
the incompatibilities are mainly about "how" the type declarations are done.
They are not runtime errors; it just does not compile.
If you want, I can give you the list of errors generated in the IDE.
Carl
Offline
I fixed it (in a D7 portable edition :-) ) by changing "record" to "object"
(committed in SVN now)
Last edited by andrewdynamo (2011-03-17 13:18:48)
Offline
Hello,
Thank you
It compiles and works very fine.
I notice that Small, Medium and Large memory are supported.
Does this mean we can remove FastMM??
In any case, with FastMM as the first module I got my application to freeze when I closed a form.
After removing FastMM, all is working fine.
Carl
* Edited 2011-03-18
* Just as a note, I am using it on a very big project.
* And, rarely and randomly, I am getting the program to freeze with ScaleMM2.
* Not always at the same place.
Last edited by cstuffer (2011-03-18 12:37:18)
Offline
I notice that Small, Medium and Large memory are supported.
Does this mean we can remove FastMM ??
Yes and no
Because it seems ScaleMM2 does not pass the validation tests of FastCode Benchmark:
https://forums.embarcadero.com/thread.j … 529#332529
so I should not use it in production yet
* And, rarely and randomly, I am getting the program to freeze with ScaleMM2.
Freeze as in 100% CPU (a loop) or 0% CPU (a deadlock)?
Can you make a minidump of your app (with debug info, like a .map file)?
http://code.google.com/p/asmprofiler/so … nidump.pas
Offline
Hi folks,
2. string types and dynamic arrays just use the same LOCKed asm instruction everywhere, i.e. for every access which may lead into a write to the string.
I want to express my view on the problem with string types and dynamic arrays. I believe everybody understands that dynamic arrays and strings are NOT actually thread safe (see the example below that leads to an AV) and that you have to synchronize access to variables of such types on your own. If so, Embarcadero just has to mention this in the documentation (strings and dynamic arrays are not thread safe) and remove the "LOCK" instruction from the ref counters, as AB did in his workaround.
Am I missing something?
Regards,
Yegor
var
  s: string;

procedure Writer1(p: Pointer); stdcall;
var
  i: Integer;
begin
  for i := 1 to 1000000 do
    s := IntToStr(i);
end;

procedure Writer2(p: Pointer); stdcall;
var
  i: Integer;
begin
  for i := 1 to 1000000 do
    s := s + IntToStr(i);
end;

begin
  BeginThread(@Writer1);
  BeginThread(@Writer2);
  ReadLn;
end.
Offline
The thread-safety of reference-counted variables was never claimed to be 100% effective for both reads and writes.
In short, reading is thread-safe (thanks to the LOCKed asm used), but writing such variables at the same time will fail, as your sample code demonstrates.
See this reference post:
From: "Yorai Aminov (TeamB)" <yaminov@nospam.trendline.co.il>
Subject: Re: Are Delphi Strings Thread Safe
Date: 18 Dec 1999 00:00:00 GMT
Newsgroups: borland.public.delphi.vcl.components.using

On Sat, 18 Dec 1999 05:21:26 +0100, "Kenneth Ellested" <ke@jydsk-data.dk> wrote:

>I have always wondered how delphi dynamic strings are "implemented" by the
>compiler (in details). Will it be possible to have two threads operating on
>the same string ?

Yes. Generally speaking, Delphi 5 strings are thread safe, but strings in earlier versions were not. This does not mean any action performed on a string can be done without the proper synchronization, though.

Write access to strings across threads should always be protected by a critical section or a similar mechanism. This is true for any type, not only strings. The problem with earlier versions of Delphi was that reading a string from two threads at the same time was not safe. This has been fixed in Delphi 5. You can find several lengthy discussions of this in the .objectpascal group.
AFAIK this is the official behavior as described by Borland/Embarcadero.
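The distinction in the quote — reads are safe, concurrent writes are not — comes down to the reference counter being updated with atomic instructions while the pointer assignment itself is not. A hypothetical C11 sketch (invented struct, not Delphi's actual string header layout):

```c
#include <assert.h>
#include <stdatomic.h>

/* Why LOCKed reference counting makes *reads* thread-safe: retain and
 * release may run from any thread, because the counter updates are
 * atomic read-modify-writes. Replacing the string that a shared
 * variable points to is a separate pointer-swap + release sequence,
 * which is not atomic as a whole and still needs a critical section. */
typedef struct {
    atomic_int refcount;
    /* ...character data follows in a real string header... */
} str_header;

static void retain(str_header *s) {
    atomic_fetch_add(&s->refcount, 1);              /* like "lock inc" */
}
static int release(str_header *s) {
    /* returns 1 when the last reference is gone */
    return atomic_fetch_sub(&s->refcount, 1) == 1;  /* like "lock dec" */
}
```

Two threads may retain/release the same header freely, but two threads assigning different strings to the same variable race on the pointer itself, which is exactly the AV the sample code above produces.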
Online
Hello hello,
No talk about ScaleMM or SynScaleMM for some time now.
What is the status of each of the memory managers?
(Have not seen repository changes in ScaleMM for ages either)
-Tee-
Offline
No talk about ScaleMM or SynScaleMM for some time now.
What is the status of each of the memory managers?
(Have not seen repository changes in ScaleMM for ages either)
-Tee-
Yes, that's true...
I've been busy for some time trying to fix the "inter-thread memory" problem, but I could not get it 100% threadsafe. Also, the structure has too much overhead and is too complicated (so it is not easy to find bugs). Furthermore, I'm busy with testing the next version of some piece of software.
But (as coincidence does not exist :-) ) I did some preliminary tests on ScaleMM3 yesterday evening. Results so far seem good (twice as fast as the internal D2010 FastMM). I am still struggling with inter-thread memory: I don't want locks (it makes things twice as slow when I use InterlockedExchangeAdd, for example), but I also don't want to send a list of "other-thread memory" to the owner thread. So I'm thinking about some kind of "relaxed" scanning for memory freed by another thread, done when no memory is available in a block and the block has an "inter-thread mark" (so 90% of the time it is fast, and once in a while a bit slower: better than extensive/exact administration overhead EVERY time).
Some more background info: I use 4 KB pages with "no" header: the minimal information for each page is stored in an array at the beginning of the 1 MB block. This should reduce memory fetches/cache invalidation for previous/next page checks (this is what made ScaleMM2 slow), because the array is one piece of memory (no big gaps between memory reads).
I'm also thinking about some kind of carousel of 8 memory blocks, so when you do 8 contiguous mem requests they are spread out more, to eliminate "false sharing": http://delphitools.info/2011/05/31/a-fi … tmonitors/
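The headerless-page layout described above can be sketched in C (a hypothetical illustration under the assumption that each 1 MB block is 1 MB-aligned; names are invented): the block base and page index fall out of simple masks and shifts, and all per-page metadata sits in one contiguous array at the start of the block.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the ScaleMM3 layout idea: each 1 MB block is
 * assumed 1 MB-aligned and split into 4 KB pages; per-page info lives
 * in one array at the start of the block. A pointer's block base and
 * page index need no per-page header fetch, and the metadata array is
 * one contiguous read (no big gaps between memory accesses). */
#define BLOCK_SIZE  (1u << 20)   /* 1 MB */
#define PAGE_SHIFT  12           /* 4 KB pages */

static uintptr_t block_base(uintptr_t p) {
    return p & ~(uintptr_t)(BLOCK_SIZE - 1);
}
static unsigned page_index(uintptr_t p) {
    return (unsigned)((p & (BLOCK_SIZE - 1)) >> PAGE_SHIFT);
}
```

Page metadata would then be `((page_info *)block_base(p))[page_index(p)]`, so a previous/next page check only walks within one already-cached array.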
Offline
Is this new approach not too complex?
No, in fact it is much easier/simpler than ScaleMM2 (I did not like the complexity of ScaleMM2)
Offline
No, in fact it is much easier/simpler than ScaleMM2 (I did not like the complexity of ScaleMM2)
When can we test drive it?
.-Tee-.
Offline