Posts by andrewdynamo

andrewdynamo · mORMot 1

Update: in the upcoming update of SMS 1.1 you can use the following:

  W3Button1.OnClick := lambda
    var rtti := RTTIRawAttributes;
    for var ir :=Low(rtti) to High(rtti) do
    begin
      var attrib := rtti[ir];
      ShowMessage( JSON.Stringify(attrib) );
    end;
  end;

Interfaces are not directly supported, only via a class:

  ITest = interface
    function GetTest: Integer;
    property Test: Integer read GetTest;
  end;

  TTest = class(ITest)   // <-- rtti reports ITest
  published
    function GetTest: Integer;
    property Test: Integer read GetTest;
  end;

And only properties and fields are supported, no functions (yet?).
So for now, rtti cannot be used for mORMot.

P.S. extended usage info by looking at the unit test code:
http://dwscript.googlecode.com/svn-hist … d_enum.pas
http://dwscript.googlecode.com/svn-hist … y_enum.pas

andrewdynamo · mORMot 1

Quick reply: yes, RTTI is also generated for interfaces (but only when they are used, because of smartlinking; this can probably be tweaked using settings, Eric will know that).
Maybe I can do a quick test tomorrow.

andrewdynamo · mORMot 1

ab wrote:

This is what I had in mind!
That is, have a hidden method able to create some SMS client wrapper code (or C#), from the server side itself.
Similar to the "mex" endpoint to be added for WCF servers, in order to access the WSDL from outside.
I wanted to allow some quick HTTP browsing of the available ORM classes, method-based services and interface-based services, from any browser.
For ORM classes and interface-based services, no need to read the source code, just create the SMS/C# wrapper classes from the RTTI available at runtime on the server.
We may add an option to the server (with command line or external config file), or enable the feature only if the server is run within the Delphi IDE (which could make sense, and seems to be secure).

Any progress on this side? (about the "metadata" endpoint of the mORMot server)
Because then an auto generator could be included in the SMS IDE itself, which can update/create
the client side classes e.g. from a menu option.
The best would be to have full metadata, so if a remote function returns an object, to have a lightweight
client side class generated too for this object (maximum type safety, code completion, etc).

Side note: DataSnap does not offer full metadata for the proxy generators, so only simple function wrappers
can be made. Because of this I probably need to make my own metadata retrieval for DS, in order to be able
to generate DS wrappers for the SMS IDE. So maybe we can share some work/ideas etc. to get a universal
metadata system? (I know, DS is a competitor, but SMS should support as many RPC servers as possible, and this
way we can at least add full mORMot support to the SMS IDE soon too.)

andrewdynamo · Low level and performance

carlos wrote:

I used the google code of ScaleMM2, and it gives me exceptions at the end of the execution. But threading is clearly faster.

Did you use the latest version from Subversion, from the version2 branch?
http://code.google.com/p/scalemm/source … 2Fversion2
(some bugs were fixed lately)

And do you have some demo code/app to reproduce it?
Which version: Delphi XE2, 64bit?

Btw: it is better to post an issue on the ScaleMM project page:
http://code.google.com/p/scalemm/issues/list

andrewdynamo · mORMot 1

ab wrote:

You are right, WebSockets may be implemented in the future.
But it is AFAIK implemented as a long polling (or lock-and-wait) scheme, based on HTTP headers.
So it is not exactly real "push"ing, but header-based plumbing based on HTTP one-way connection.

No, it really is a low-level binary TCP connection! (I made a Delphi implementation, so I should know.)
With WinSock you can do the same, because you can send data from client to server and from server to client over the connection handle any time you want
(you only need to actively check and wait for incoming data on the client side; I made a single thread for this which can wait on 64 connections at once).
So for Delphi there is no real benefit in using WebSockets for duplex communication between a Delphi server and Delphi clients (it can be done already, nothing really new).

However, we are probably going to use WS in Delphi too, because then we only need ONE HTTP server port for both HTML4 clients and binary communication for HTML5 and Delphi clients.

andrewdynamo · mORMot 1

Will only polling (and long polling) be used?
I know this is the only option with plain HTTP, but with WebSockets (over HTTP) you can do real "push"ing:
http://www.websocket.org/quantum.html

Therefore I made a websocket implementation (for RemObjects, sorry) with Indy 10:
https://plus.google.com/u/0/11013108667 … vicDJLfYtQ

It's not perfect, but I tried to keep things separate in a special IOHandler etc., so you can probably use it without RO too.

andrewdynamo · mORMot 1

Can you make a demo implementation with mORMot, RTC and RemObjects SDK/Data Abstract,
so I can compare mORMot against these?

andrewdynamo · Low level and performance

TPrami wrote:

When we can test drive it

I have a working version, but it is not ready yet.
It works for normal threads, but inter-thread memory is still a problem.
So I'm trying a simple locking (interlocked) approach now, although I would like something
without any LOCK... Anyway, it works, but thread memory is not reused or freed entirely,
so it leaks memory, etc etc. Also it is not optimized.

I hope I can post a preliminary version tomorrow if interested, but be aware, it is a draft: it needs code
cleanup, refactoring and some "finishing touch".

andrewdynamo · Source Code repository

My timing was correct :-) I tested it several times, even using QueryPerformanceCounter (TStopWatch) instead of millisecondspan.
But I use SQL Server 2008 R2, so probably it has some kind of caching on its own (not ADO generic)?

I think the timing also depends on the amount of null/integer/string data, because my slightly slower SynOleDB test had a lot of string data (which is converted from UTF8 in SynOleDB??).

By the way: SynOleDB seems to have much higher network usage! It used several MB/s during data retrieval, while ADO used much less, around 100Kb/s?? Maybe some kind of compression? I read all column data to be sure everything is fetched.

andrewdynamo · Source Code repository

I did some testing against our database with low level ADO and SynOleDB:

=========================

ADO, client side cursor:
Execute: 15.894,00ms
1153092 Rows in 31.828,00ms

ADO, server side cursor:
Execute: 2.473,00ms
-1 Rows in 30.212,00ms

SynOLEDB
Execute: 2.421,00ms
1153092 Rows in 18.204,00ms

=========================

ADO, client side cursor:
Execute: 385,84ms
7727 Rows in 147,58ms

ADO, server side cursor:
Execute: 198,28ms
-1 Rows in 532,66ms
Second time:
Execute: 198,00ms
-1 Rows in 176,24ms <--- ???

SynOLEDB
Execute: 197,56ms
7727 Rows in 480,42ms

=========================

My code:

  var i: Integer; s: string;

  cmd := TADOCommand.Create(Self);
  cmd.Connection := Self.ADOConnection1;
  //ADOConnection1.CursorLocation := clUseServer;
  cmd.CommandText := 'select top 1 * from view1';
  recordset := cmd.Execute(iRows, EmptyParam);
  while not recordset.EOF do
  begin
    for i := 0 to recordset.Fields.Count - 1 do
      s := recordset.Collect(i);
    recordset.MoveNext;
  end;

---

  Query := Conn.NewStatement;
  Query.Execute('select * from view1', True, []);

  while Query.Step do
  begin
    for i := 0 to Query.ColumnCount - 1 do
      s := Query.ColumnVariant(i);
  end;

=========================

So for many records (>10.000) SynOleDB seems faster (18s vs 30s fetch time), but ADO seems faster for small record sets (SynOleDB 680ms vs ADO 532ms with a client-side cursor, execute + fetch).
(Does ADO cache results? The second run is sometimes a lot faster.)

andrewdynamo · Low level and performance

ab wrote:

Is this new approach not too complex?

No, in fact it is much easier/simpler than ScaleMM2 (I did not like the complexity of ScaleMM2)

andrewdynamo · Low level and performance

TPrami wrote:

No talk about the ScaleMM or SynScaleMM for some time now.
What is the status of each of the memory managers?
(Have not seen repository changes of ScaleMM for ages also)
-Tee-

Yes that's true...

I've been busy for some time trying to fix the "inter thread memory" problem, but I could not get it 100% threadsafe.
Also the structure has too much overhead and is too complicated (so it is not easy to find bugs). Furthermore I'm busy
with testing the next version of some piece of software.

But (as coincidence does not exist :-) ) I did some preliminary tests on ScaleMM3 yesterday evening. Results so
far seem good (twice as fast as the internal D2010 FastMM). I'm still struggling with interthread memory: I don't want locks
(it becomes twice as slow when I use InterlockedExchangeAdd, for example), but I also don't want to send a list of "other thread
memory" to the owner thread. So I'm thinking about some kind of "relaxed" scanning for memory freed by another thread
when no memory is available in a block and the block has an "interthread mark" (so 90% of the time it is fast and once in a while a bit slower: better than extensive/exact administration overhead EVERY time).

Some more background info: I use 4k pages with "no" header: the minimal information for each page is stored in an array at the beginning of a 1Mb block. This should reduce memory fetches/cache invalidation for previous/next page checks (this made ScaleMM2 slow), because the array is one contiguous piece of memory (no big gaps between memory reads).
I'm also thinking about some kind of carousel of 8 memory blocks, so that 8 consecutive memory requests are spread out more, to eliminate "false sharing": http://delphitools.info/2011/05/31/a-fi … tmonitors/
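
The "page info in an array at the start of the block" idea can be sketched as below. This is only an illustration, not ScaleMM3 code: it assumes every 1Mb block is itself aligned on a 1Mb boundary, and the TPageInfo fields are hypothetical, since the post does not list them.

```pascal
program PageIndexSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  BlockSize = 1024 * 1024;             // 1 MB block, assumed to be 1 MB aligned
  PageSize  = 4096;                    // 4 KB pages
  PageCount = BlockSize div PageSize;  // 256 pages per block

type
  // Hypothetical per-page record; the real fields are not described in the post.
  TPageInfo = packed record
    OwnerThreadId: Cardinal;
    Flags: Cardinal;
  end;
  // All page info lives in one contiguous array at the start of the block.
  TPageInfoArray = array[0..PageCount - 1] of TPageInfo;

function BlockBase(aMemory: NativeUInt): NativeUInt;
begin
  // Clear the low 20 bits to get the start of the enclosing 1 MB block.
  Result := aMemory and not NativeUInt(BlockSize - 1);
end;

function PageIndex(aMemory: NativeUInt): NativeUInt;
begin
  // Offset inside the block, divided by 4096 via a shift.
  Result := (aMemory - BlockBase(aMemory)) shr 12;
end;

begin
  Writeln(PageIndex($00300000 + 5 * PageSize + 123));  // 5
  Writeln(SizeOf(TPageInfoArray));                     // 2048
end.
```

With these assumed field sizes the whole array takes 2048 bytes, so checking a neighbouring page only touches memory inside the first page of the block.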

andrewdynamo · mORMot 1

How much faster is this kernel mode? Do you have any numbers? (response time, throughput, etc)

andrewdynamo · Low level and performance

cstuffer wrote:

I notice that Small, Medium and Large memory are supported.
Does this mean we can remove FastMM ??

Yes and no
Because it seems ScaleMM2 does not pass the validation tests of FastCode Benchmark:
https://forums.embarcadero.com/thread.j … 529#332529
so I would not use it in production yet

cstuffer wrote:

* And, rarely and randomly, i am getting to program to freeze with ScaleMM2.

Freeze as in 100% CPU (loop) or 0% CPU (deadlock)?

Can you make a minidump of your app (with debug info like a .map file)?
http://code.google.com/p/asmprofiler/so … nidump.pas

andrewdynamo · Low level and performance

I fixed it (in a D7 portable edition :-) ) by changing "record" to "object"
(committed in SVN now)

andrewdynamo · Low level and performance

cstuffer wrote:

Hello,
It's look like ScaleMM2 is compatible with Delphi 2007 and up only.
At least, it's does not compile with Delphi 7.
Carl

That is possible
I only have D2010, so I made it primarily for that. But "ab" made some changes to make ScaleMM1 compatible with D7, something about changing "record" to "object", so you could try that first? What kind of compiler errors do you get next?

andrewdynamo · Low level and performance

Status update: I tried some offsets to misalign the memory, but no success. I think I just have to prevent too many "lookups" (they need L1/L2 cache fetches, or worse: a memory fetch).
Instead of a nice structure and good "code smell" I'll have to use speed hacks and/or more (packed) data in the header for intelligent reallocs (I must determine the type: small, medium or large, and even worse: check the thread owner). I have some ideas for it, but this needs some restructuring...

I have already made a simple pre-realloc check: if new size is smaller but greater than 1/4 of current size, nothing has to be done (no thread owner or type check). Now a realloc test is only slightly slower than (asm optimized!) FastMM/D2010.
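
A minimal sketch of such a pre-realloc check, for illustration only (the function name is hypothetical, and treating "same size" as reusable is an assumption; the actual ScaleMM2 code is not shown in the post):

```pascal
program ReallocShortcutSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// True when the existing block can be returned unchanged:
// the new size still fits and uses more than 1/4 of the current size.
function CanKeepBlock(aCurrentSize, aNewSize: NativeUInt): Boolean;
begin
  Result := (aNewSize <= aCurrentSize) and (aNewSize > aCurrentSize shr 2);
end;

begin
  Writeln(CanKeepBlock(1024, 512));   // TRUE
  Writeln(CanKeepBlock(1024, 256));   // FALSE: exactly 1/4, so shrink for real
  Writeln(CanKeepBlock(1024, 2048));  // FALSE: growing needs the full realloc path
end.
```

Only the header's size field is read here, so the common "shrink a little" case never has to look up the owner thread or the block type.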

andrewdynamo · Low level and performance

Hmm, high number of resource stalls, L1 load blocked by stores, etc:
https://picasaweb.google.com/lh/photo/q … directlink

Maybe the memory is too well aligned? L1 has 4k aliasing, so the first 4k must differ?

andrewdynamo · Low level and performance

I have some slow performance in my code:
https://picasaweb.google.com/lh/photo/G … directlink
https://picasaweb.google.com/lh/photo/Q … directlink

Why are these lines below so slow? (more than 100 CPU cycles)
if NativeUInt(pm.OwnerBlock) and 1 <> 0 then
and:
if ot = PBaseThreadMemory(FCacheSmallMemManager) then

Maybe because of L1/L2 cache fetch? (FCacheSmallMemManager is located in current object, maybe needs to fetch it?).
How can this be optimized?
(these 2 lines consume most of the time)

========================================

function TThreadMemManager.ReallocMem(aMemory: Pointer;
  aSize: NativeUInt): Pointer;
var
  pm: PBaseMemHeader;
  ot: PBaseThreadMemory;
begin
  if FOtherThreadFreedMemory <> nil then
    ProcessFreedMemFromOtherThreads;

  pm := PBaseMemHeader(NativeUInt(aMemory) - SizeOf(TBaseMemHeader));
  //check realloc of freed mem
  if (pm.Size and 1 = 0) then
  begin
    if NativeUInt(pm.OwnerBlock) and 1 <> 0 then
    //lowest bit is mark bit: medium mem has ownerthread instead of ownerblock (temp. optimization trial)
    //otherwise slow L1/L2 fetch needed in case of "large" distance
    begin
      ot := PBaseThreadMemory( NativeUInt(pm.OwnerBlock) and -2); //clear lowest bit
      if ot = PBaseThreadMemory(FCacheMediumMemManager) then
        Result := FCacheMediumMemManager.ReallocMem(aMemory, aSize)
      else
        Result := ReallocMemOfOtherThread(aMemory, aSize);
    end
    else
    begin
      ot := pm.OwnerBlock.OwnerThread;

      if ot = PBaseThreadMemory(FCacheSmallMemManager) then
        Result := FCacheSmallMemManager.ReallocMem(aMemory, aSize)
//    else if ot = PBaseThreadMemory(FCacheMediumMemManager) then
//      Result := FCacheMediumMemManager.ReallocMem(aMemory, aSize)
      else if ot = PBaseThreadMemory(FCacheLargeMemManager) then
        Result := FCacheLargeMemManager.ReallocMemWithHeader(aMemory, aSize)
      else
        Result := ReallocMemOfOtherThread(aMemory, aSize);
    end
  end
  else
    Error(reInvalidPtr);
end;
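
The "lowest bit is mark bit" trick used on pm.OwnerBlock above can be shown in isolation. This is a standalone sketch with hypothetical helper names; it relies only on the fact that aligned pointers always have a zero low bit, which leaves that bit free to carry a flag.

```pascal
program TaggedPointerSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Set the lowest bit to flag the pointer (e.g. "owner thread, not owner block").
function MarkPointer(p: Pointer): Pointer;
begin
  Result := Pointer(NativeUInt(p) or 1);
end;

function IsMarked(p: Pointer): Boolean;
begin
  Result := (NativeUInt(p) and 1) <> 0;
end;

// Strip the flag again before the pointer is dereferenced.
function ClearMark(p: Pointer): Pointer;
begin
  Result := Pointer(NativeUInt(p) and not NativeUInt(1));
end;

var
  Dummy: Integer;   // stands in for a thread manager; a global Integer is aligned
  Tagged: Pointer;
begin
  Tagged := MarkPointer(@Dummy);
  Writeln(IsMarked(@Dummy));             // FALSE
  Writeln(IsMarked(Tagged));             // TRUE
  Writeln(ClearMark(Tagged) = @Dummy);   // TRUE
end.
```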

andrewdynamo · Low level and performance

Yes, but I want it to be faster overall.
It must be a complete replacement, so also fast when single-threaded. If possible.

andrewdynamo · Low level and performance

Sorry, it was a quick test.
FastMM is 100% (time), so anything above is worse (like the first test) and anything below is better (lower means less time, so faster).

andrewdynamo · Low level and performance

Some intermediate detailed benchmark:
http://code.google.com/p/scalemm/#Benchmark_2

It is slower for small reallocations (with no change in size, so only "function" overhead), but overall it is 25% faster now.
I need to change the slow "owner" determination (small, medium, large); I will change it to use the same kind of logic as FastMM does
(use the lowest or highest bits of "size" to mark free blocks and the size type).

andrewdynamo · Low level and performance

Checked in better large mem handling, now ScaleMM2 is 20% faster in FastCodeMMChallenge :-)

Average Speed Performance: (Scaled so that the winner = 100%)
DelphiInternal : 79,7
ScaleMem : 100,0

Some improvements are still possible, because some test results are too poor (I will investigate and "fix" them).

andrewdynamo · Low level and performance

Hmmm, FastCodeMMChallenge depends on big mem realloc (which is only partly implemented), so it is not as fast as it should be (in that benchmark). So I think I need to use the same big mem realloc algorithm as FastMM (grow in steps of 64k instead of 1 byte :-) and use VirtualQuery to expand the virtual memory).
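
Growing in 64k steps comes down to rounding the requested size up to the next multiple of 64k. A minimal sketch of that rounding (the helper name is hypothetical, and this is not FastMM's actual code):

```pascal
program LargeReallocStepSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  Granularity = 64 * 1024;  // grow large blocks in 64 KB steps

// Rounds aSize up to the next multiple of 64 KB.
function RoundUpTo64K(aSize: NativeUInt): NativeUInt;
begin
  Result := (aSize + (Granularity - 1)) and not NativeUInt(Granularity - 1);
end;

begin
  Writeln(RoundUpTo64K(1));      // 65536
  Writeln(RoundUpTo64K(65536));  // 65536
  Writeln(RoundUpTo64K(65537));  // 131072
end.
```

A benchmark that repeatedly grows a big block by a few bytes then stays inside the already reserved size most of the time, instead of reallocating on every call.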

andrewdynamo · Low level and performance

Compiles and runs fine on D2007 too now
Extra checks added, new extensive test running and going fine so far

andrewdynamo · Low level and performance

mai62 wrote:

Delphi 2007:

Thanks for reporting, I only checked it for D2010, will fix this tomorrow I think (after D2010 is completely tested)

andrewdynamo · Low level and performance

Latest version seems to work OK now (added medium mem CheckMem functions: found a couple of nasty bugs with it!)
Also unit test added.
http://code.google.com/p/scalemm/source/browse/trunk

It only supports mem < 2Gb and no interthread mem support and no mem leak support yet.
And some more optimizations and cleanup needed...

andrewdynamo · Low level and performance

yeah, you must put begin/end around the for loop :-)

And yes, it is a simple test to test the "core" speed (only memory operations, nothing more)

In real life the results will be different. The FastCodeMMChallenge also showed ScaleMM to be slower in some cases
(because it works on top of FastMM: mem larger than 2kb is passed to FastMM, so it is slightly slower because of
the ScaleMM size check overhead). I hope ScaleMM2 won't have this limitation. I still have one (?) tiny bug, so I cannot
run the full FastCodeMMChallenge yet...

Btw: the alpha version is in source control, ScaleMM1 is in a separate branch

andrewdynamo · Low level and performance

30M alloc/realloc/free (small mem, 10 - 120bytes), 1 thread
FastMM = 4376ms
ScaleMM2 = 1651 (still too high, can be optimized further, ScaleMM1 has about 1100ms)

30M alloc/realloc/free (medium mem, 10kb - 80kb), 1 thread
FastMM = 58206ms (!), with no resize (+10bytes) = 2302ms
ScaleMM2 = 2326ms

  for j := 0 to 1000 do
    for i := 0 to 10000 do
    begin
      p1 := GetMemory(10);
      p2 := GetMemory(40);
      p3 := GetMemory(80);
      p1 := ReallocMemory(p1, 30);
      p2 := ReallocMemory(p2, 60);
      p3 := ReallocMemory(p3, 120);
      FreeMemory(p1);
      FreeMemory(p2);
      FreeMemory(p3);
    end;

  for j := 0 to 1000 do
    for i := 0 to 10000 do
    begin
      p1 := GetMemory(10 * 1024);
      p2 := GetMemory(40 * 1024);
      p3 := GetMemory(80 * 1024);
      p1 := ReallocMemory(p1, 10 * 1024 + 10);
      p2 := ReallocMemory(p2, 40 * 1024 + 100);
      p3 := ReallocMemory(p3, 80 * 1024 + 10);
      FreeMemory(p1);
      FreeMemory(p2);
      FreeMemory(p3);
    end;

andrewdynamo · Low level and performance

ScaleMM is faster than FastMM, so it is also good for single-threaded use.
ScaleMM also caches the thread managers, so it should also be good with many short-lived threads.
The only downside is that ScaleMM uses more mem than FastMM, so do not use it if you are low on memory (btw: FastMM also uses more mem than low-level Windows mem).

ScaleMM2 is almost ready, I'm busy with the last details. It works without FastMM, so it is faster in multithreaded use (no FastMM locks underneath), it will have a special medium block algorithm, and it uses direct VirtualAlloc for large mem (> 1Mb); special large block handling can easily be made in case someone uses lots of large blocks.

andrewdynamo · Low level and performance

Version 2 almost working, ScaleMM1 works on top of v2 now.
(ScaleMM2 needs 16byte header + minimum alloc size of 32 bytes, so too much overhead for small blocks. ScaleMM1 is for small mem (<2kb) and ScaleMM2 for medium (<1Mb) and larger is direct VirtualAlloc/free)
http://code.google.com/p/scalemm/source … caleMM.pas
http://code.google.com/p/scalemm/source … aleMM2.pas
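
The split between the three layers can be sketched as a simple size check. The limits are taken from the description above; the type and function names are hypothetical, and how the exact boundary values are handled is an assumption.

```pascal
program SizeDispatchSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TAllocKind = (akSmall, akMedium, akLarge);

const
  SmallLimit  = 2 * 1024;     // below 2 KB: small-block manager (ScaleMM1)
  MediumLimit = 1024 * 1024;  // below 1 MB: medium-block manager (ScaleMM2)

function ClassifySize(aSize: NativeUInt): TAllocKind;
begin
  if aSize < SmallLimit then
    Result := akSmall
  else if aSize < MediumLimit then
    Result := akMedium
  else
    Result := akLarge;        // handed straight to VirtualAlloc
end;

begin
  Writeln(Ord(ClassifySize(100)));          // 0
  Writeln(Ord(ClassifySize(10 * 1024)));    // 1
  Writeln(Ord(ClassifySize(2048 * 1024)));  // 2
end.
```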

Some "small" problems need to be fixed, like backwards free block scanning, so you'll get "out of memory" in intensive tests.
And of course the necessary optimizations, cleanup, documentation etc.

Speed of ScaleMM1 is about 1100ms and ScaleMM2 about 2400ms (30M allocs/reallocs/free). So small memory allocs are faster than medium ones.

PS: ABA problem needs to be fixed too, will do this soon

andrewdynamo · Low level and performance

First of all: a happy 2011!

Good news: V2 seems to work now. Not everything works yet (such as interthread mem, and it does not release mem to Windows).
Speed is about 3 times slower (30M allocs/reallocs/free in 3.1s, V1 does it in 1.2s) but there are enough optimizations possible.
http://code.google.com/p/scalemm/source … aleMM2.pas

andrewdynamo · Low level and performance

TPrami wrote:

Good to hear that there is an progress.
I was just pondering how ScaleMM will work with large blocks.
Allocating and Disposing is locked, because of FastMM (etc) underneath, but how about using them, like copying smaller blocks for processing etc... Will that be lock free? I Suppose so... (Just to try to understand where this is currently)
-TP-

Yes, ScaleMM version 1 works on top of FastMM, so large allocations (or when ScaleMM needs more mem) are locked by FastMM. But all small blocks etc. are processed per thread, so no locking at all!

TPrami wrote:

I need ScaleMM for one server, but quite often it uses large blocks, they are owned by each thread (so no cross thread usage), but just been thinking that operations on those are blocked anyways with FastMM currently.
And other thing crossed my mind is that most likely the OmniThreadLibrary users would also gain from ScaleMM, if there are no too much depencies between the OTL and FMM...
-TP-

Because of the FastMM lock dependency I am making a large block allocator. This allocator is fully dynamic and it seems so good (?) that it could also be used for small blocks. So I am thinking of using it as a complete allocator (no real difference between small and medium mem). Or, if the speed is not good enough, of using ScaleMM1 on top of ScaleMM2 :-).

I really would like to get rid of FastMM locking to get full scaling: this is the future (multi cores, OTL, AsyncCalls, etc)!

Btw: I hope I solved a nasty bug in ScaleMM2 yesterday evening, so I can test/develop it further.

andrewdynamo · Low level and performance

I have a "working" POC of my newest algorithm:
http://code.google.com/p/scalemm/source … aleMM2.pas

It works completely differently from the first version: it does not use preallocated blocks of one fixed size, but
it does a dynamic allocation from one big block of 1Mb (like FastMM does with medium blocks). This way you have a lower
memory usage, because you use all memory of the 1Mb block for all sizes. The downside of this approach is some more memory
fragmentation within the block, but I tried to reduce this by using a lookup index (bit array mask) by size, so a small alloc does not use the first available (big) mem but tries to use the smallest one possible.
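
The lookup index can be pictured as one bit per size class, where a set bit means "a free chunk of this class exists". The sketch below is an illustration only and is not taken from ScaleMM2: the 32 size classes and all names are hypothetical, and a simple loop stands in for the bit scan.

```pascal
program SizeBitmapSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

var
  // Bit i set = at least one free chunk in size class i (hypothetical: 32 classes).
  FreeMask: Cardinal = 0;

procedure MarkFree(aClass: Integer);
begin
  FreeMask := FreeMask or (Cardinal(1) shl aClass);
end;

// Returns the smallest class >= aClass that has a free chunk, or -1 when none.
function FindSmallestFit(aClass: Integer): Integer;
var
  Candidates: Cardinal;
begin
  Result := -1;
  // Drop all classes that are too small for this request.
  Candidates := FreeMask and not ((Cardinal(1) shl aClass) - Cardinal(1));
  if Candidates = 0 then
    Exit;
  // Forward bit scan; a real allocator would likely use a CPU bit scan instruction.
  Result := 0;
  while (Candidates and 1) = 0 do
  begin
    Candidates := Candidates shr 1;
    Inc(Result);
  end;
end;

begin
  MarkFree(3);
  MarkFree(10);
  Writeln(FindSmallestFit(2));   // 3
  Writeln(FindSmallestFit(4));   // 10
  Writeln(FindSmallestFit(11));  // -1
end.
```

A request for class 2 gets the class 3 chunk and leaves the much bigger class 10 chunk intact, which is the fragmentation reduction described above.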

It works for a small number of allocs, but it has a nasty bug somewhere... However, you can get an idea of how it is supposed to work. It is not optimized yet, but I tried to use "fast" techniques (shl and shr instead of div) in the base.
I hope I can remove some overhead somewhere (too many if statements etc.).

Note: the bit scanning (also reverse bit scanning to get the highest bit for the size), the use of masks etc. make it less easy to follow (more "high tech" than version 1).

Don't know how fast it will be (for small allocs) in the end, maybe using ScaleMM v1 on top of version 2? :-)

andrewdynamo · Low level and performance

Intel Core2 Quad core Q8300 @ 2.5Ghz

4B
1 = 47,83 nanoseconds per cycle
2 = 58,80 nanoseconds per cycle
4 = 124,68 nanoseconds per cycle
8 = 128,21 nanoseconds per cycle

8B
1 = 56,61 nanoseconds per cycle
2 = 73,01 nanoseconds per cycle
4 = 146,73 nanoseconds per cycle
8 = 146,73 nanoseconds per cycle

8BV
1 = 54,40 nanoseconds per cycle
2 = 75,75 nanoseconds per cycle
4 = 177,40 nanoseconds per cycle
8 = 222,44 nanoseconds per cycle

So 4B seems the fastest

Btw: I almost have my new ScaleMM algorithm ready, I only need to fix some bugs (to have a working POC; fully working needs some more time)

andrewdynamo · Low level and performance

About the ABA problem: I think you're right. Not easy to fix it: critical section or larger lock (FRecursive) will make it a lot slower. However, it is a lock on a block so less frequent (not every mem alloc/free).
Version check should also be possible: a little more mem overhead on a block (no problem) and block should be 8 byte aligned for a cmpxchg8b.
I will see what's the best and/or easiest :-).
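
The version check idea, sketched on a single 64-bit value: one half holds the actual data, the other half a counter that is bumped on every update, so a stale reader fails the compare even when the data half looks unchanged. This is an illustration only; the layout and names are hypothetical, and TInterlocked assumes Delphi XE or later.

```pascal
program VersionedCasSketch;
{$APPTYPE CONSOLE}

uses
  SyncObjs;  // TInterlocked, assumed Delphi XE or later

// Hypothetical layout: low 32 bits = index of the first free item,
// high 32 bits = version counter bumped on every successful update.
function Pack(aIndex, aVersion: Cardinal): Int64;
begin
  Result := (Int64(aVersion) shl 32) or Int64(aIndex);
end;

function IndexOf(aValue: Int64): Cardinal;
begin
  Result := Cardinal(aValue and $FFFFFFFF);
end;

function VersionOf(aValue: Int64): Cardinal;
begin
  Result := Cardinal((aValue shr 32) and $FFFFFFFF);
end;

// Replaces the index only if neither index nor version changed since aSeen was read.
function TryUpdate(var aHead: Int64; aSeen: Int64; aNewIndex: Cardinal): Boolean;
var
  NewValue: Int64;
begin
  NewValue := Pack(aNewIndex, VersionOf(aSeen) + 1);
  Result := TInterlocked.CompareExchange(aHead, NewValue, aSeen) = aSeen;
end;

var
  Head, Seen: Int64;
begin
  Head := Pack(7, 0);
  Seen := Head;                            // "thread A" reads index 7, version 0
  TryUpdate(Head, Head, 9);                // "thread B": 7 -> 9
  TryUpdate(Head, Head, 7);                // "thread B": 9 -> 7 again (the ABA case)
  Writeln(IndexOf(Head) = IndexOf(Seen));  // TRUE: a plain 32-bit CAS would be fooled
  Writeln(TryUpdate(Head, Seen, 3));       // FALSE: the version is 2, not 0
end.
```

The demo runs in one thread and only simulates the interleaving; the point is that the 64-bit compare sees the changed version and rejects the stale update.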

andrewdynamo · Low level and performance

AdamWu wrote:

LOCK only applies to individual instructions, and no context switch can happen inside an instruction, locked or not.

I do not think this is true: if what you say is true, I would not need a Sleep() on a "simple" CAS instruction, because it would be completed (very) fast.
It seems however it can be interrupted, so it keeps a lock quite long: I got a race condition, it burned my CPU and I had to wait forever.

andrewdynamo · Low level and performance

ab wrote:

Perhaps a little bit, but an integer mul is handled in a very aggressive way, i.e. within a few CPU cycles with modern CPU (with iCore I mean).
What is CPU consuming, is the div operation. Add or Mul are more or less equivalent.

OK, I thought mul was also slow

ab wrote:

I'm trying to implement some bitmap-based freeing of memory blocks in SynScaleMM, but in a different way you're doing in ScaleMM2. Still finishing the implementation. It'll be set by a USEBITMAP conditional.

I saw the conditional: can you tell a little bit about how you think to do it? How much different?
(ScaleMM2 was a "Proof Of Concept" anyway)

Btw: fixed ScaleMM is committed

andrewdynamo · Low level and performance

I thought (or hoped) a LOCK would not be interrupted by a context switch, but it seems it can be...
(was more or less a dirty test)

So I did a quick but real test, the following gives the best results:

  ia: array of Integer;
  ...
  for l := 0 to 1000 do
  for i := 0 to 10000 do
  begin
    j := ia[i];
    k := j+1;
    while not CAS32(j, k, ia[i]) do
    begin
      if not SwitchToThread then
        Sleep(0);
      j := ia[i];
      k := j+1;
      if CAS32i(j, k, ia[i]) then
        Break
      else
         Sleep(1);
    end;
  end;

Timing results:
Testing with 1 threads...
Average: 182
Testing with 2 threads...
Average: 214
Testing with 3 threads...
Average: 260
Testing with 4 threads...
Average: 279
Testing with 8 threads...
Average: 426
Testing with 16 threads...
Average: 777

SwitchToThread gives slightly better results if thread count > core count (more efficient to switch to thread on same core).
If I don't use sleep(1) I get a race condition and it runs "forever" (I stopped waiting after 30s).
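
The staged back-off (first give up the time slice, then really sleep) can be sketched as a small atomic increment. This is a simplified illustration, not the test code above: the procedure name is hypothetical and TInterlocked (Delphi XE or later, an assumption) stands in for the CAS32 helper.

```pascal
program CasBackoffSketch;
{$APPTYPE CONSOLE}

uses
  Windows,   // SwitchToThread, Sleep
  SyncObjs;  // TInterlocked, assumed Delphi XE or later

// Atomically increments aValue, backing off in stages when the CAS fails.
procedure AtomicIncWithBackoff(var aValue: Integer);
var
  Seen, Attempts: Integer;
begin
  Attempts := 0;
  repeat
    Seen := aValue;
    if TInterlocked.CompareExchange(aValue, Seen + 1, Seen) = Seen then
      Exit;                       // our update won
    Inc(Attempts);
    if Attempts = 1 then
    begin
      if not SwitchToThread then  // first give up the time slice
        Sleep(0);
    end
    else
      Sleep(1);                   // still contended: really wait
  until False;
end;

var
  Counter, i: Integer;
begin
  Counter := 0;
  for i := 1 to 1000 do
    AtomicIncWithBackoff(Counter);
  Writeln(Counter);  // 1000
end.
```

Run single-threaded like this, the CAS never fails and the back-off path is not exercised; it only matters when several threads hit the same variable.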

I will update my ScaleMM soon, first I want to test Fast Code Benchmark (much slower at startup than Fastmm -> never mind seems due to copyfile of exe).
Btw: I use the FMemoryArray now as a cache with an increment of the offset instead of multiply (add is faster than multiply?)

andrewdynamo · Low level and performance

ab wrote:

For Delphi 2010, it's even worse: the conversion is made using CvtInt, and not CvtIntW (which was existing in all cases), and the slow System.@UStrFromPCharLen function is called on every IntToStr() call ! During Unicode refactoring, performance was not a goal in EMB...

How about an Enhanced RTL for D2010? :-)

andrewdynamo · Low level and performance

I have committed version 1.1 with some improvements/optimizations:

Still, the "Fast Code MM Benchmark" results are not good enough:
----------------------------------------------------------------
Average Total Performance: (Scaled so that the winner = 100%)
D2010 : 100,0
ScaleMem : 84,3
Average Speed Performance: (Scaled so that the winner = 100%)
D2010 : 84,1
ScaleMem : 100,0
Average Memory Performance: (Scaled so that the winner = 100%)
D2010 : 100,0
ScaleMem : 50,2
----------------------------------------------------------------

Some tests suffer from many "FreeBlockMemoryToGlobal" and global "GetBlockMemory" calls. So a bad combination of
thread memory and FastMM (large) block locking.
Also, large alloc + free + realloc have some lower performance because ScaleMM does a size check first (some extra
overhead) and passes it to FastMM after that.

Note: memory consumption is higher because less optimal memory usage: each thread has its own (partly) used memory.

So these results can be explained (less optimal working because of mixed thread memory + FastMM locking). But for the time
being this is the best I can get.
Btw: I get much better "StringThreadTest" results if I use ScaleMM on top of HeapMM or VirtualMM (because there are no FastMM "large block" locks), but then the reallocation tests perform very badly. I hope to fix all these problems with ScaleMM2 (but do not expect results soon).

andrewdynamo · Low level and performance

I committed my latest version to SVN:
http://code.google.com/p/scalemm/source … caleMM.pas

StringTest performance is still too low due to a lock on the global table (taken when a full block is free). FastMM also has a lock on large blocks (so the lock is taken 2 times).

All tests and checks are OK, so it is ready for release.

I'm busy with a new allocation algorithm (for any alloc size, not only small); I hope to have a POC next week.

andrewdynamo · Low level and performance

Yes, it is less relevant.
However, to get a high score (to show ScaleMM is really faster than FastMM) I have to "cheat" in my realloc algorithm... :-(
Maybe a simple change of > instead of >= can make a big difference too.

I benchmarked ScaleMM, so it should be "ready for release" now (no crashes in extensive benchmark).
ScaleMM2 is only POC, however a simple app runs fine, but big app won't (mem is never released etc etc).

andrewdynamo · Low level and performance

Overall:
- ScaleMM is faster, only reallocs are slower than with the D2010 internal FastMM.
Of course ScaleMM is much better in multi-threaded tests :-)
- ScaleMM uses more mem than D2010/FastMM

Results:
http://code.google.com/p/scalemm/source … _D2010.txt

So I have to check why reallocs are slower, especially the "StringThreadTest" has very low score!

I'll try to reduce memory overhead in ScaleMM2

andrewdynamo · Low level and performance

I changed "the challenge" using pansichar instead of pchar (D2005 was not yet unicode :-) ) and now all tests are running fine.
Hope to post some results soon.

andrewdynamo · mORMot 1

Do you also have a generator for the data classes?

My current customer has a so called "CRUD generator" which generates all data object from the database (via metadata).
Very convenient if you have a lot of tables and if your DB changes a lot (just "re-generate all" and your model is updated to the latest DB version). Much better than "FieldByName" ;-).

I have just refactored the current implementation (with attributes for metadata and generics), but I will take a look at this implementation too (we use MS SQL server).

andrewdynamo · Low level and performance

I have some (good) thoughts about how to implement medium blocks in a fast and easy (not complex) way, but need to work (and think) it out first.

Performance with your mods is a bit better: 1200ms (my simple demo test) instead of 1300ms.

I found the FastCode MM Challenge this weekend, which contains good MM tests: I already fixed some small problems. There are still some AVs with realloc, I hope to fix this soon.

andrewdynamo · Low level and performance

Btw: if you need a free profiler, which stores the execution time of each call(!), try my AsmProfiler :-)
http://code.google.com/p/asmprofiler/
(little bit old and ugly code but it works)

andrewdynamo · Low level and performance

So for large blocks: direct call VirtualAlloc. If we do it with no locks it would scale better than FastMM. Should not be difficult to make.

But what kind of algorithm should be used, or how should large blocks be buffered/cached? You don't want to cache blocks of e.g. 10M, and/or reuse one for an alloc of 5M -> too much memory overhead.

Or just no caching: alloc and direct free?

andrewdynamo · Low level and performance

TPrami wrote:

Hello hello,
One question, if this is meant for small en medium blocks (most common) how it is going to scale with threads that use very large data blocks, think of handles large bitmaps?? Will it scale better than FastMM or is it the same?
-TP-

For large blocks it will be the same (no scaling) or worse (due to some extra overhead: the block size is calculated first and then passed to FastMM).
So in the future it should also handle large blocks. But for the time being, I wanted to test small/medium blocks first.

mORMot Open Source

#1 Re: mORMot 1 » Smart Mobile Studio mORMot class server » 2013-02-15 07:36:05

#2 Re: mORMot 1 » Smart Mobile Studio mORMot class server » 2013-02-14 21:18:25

#3 Re: mORMot 1 » Smart Mobile Studio mORMot class server » 2013-02-07 12:03:10

#4 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2012-10-10 12:07:10

#5 Re: mORMot 1 » Roadmap: interface-based callbacks for Event Collaboration » 2012-09-10 09:19:59

#6 Re: mORMot 1 » Roadmap: interface-based callbacks for Event Collaboration » 2012-09-07 06:30:18

#7 Re: mORMot 1 » The mORMot attitude » 2012-04-25 11:34:42

#8 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-09-22 13:58:33

#9 Re: Source Code repository » SynOleDB: OpenSource Unit for direct access to any database via OleDB » 2011-07-05 06:28:44

#10 Re: Source Code repository » SynOleDB: OpenSource Unit for direct access to any database via OleDB » 2011-07-04 12:28:04

#11 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-06-17 19:00:51

#12 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-06-17 06:49:27

#13 Re: mORMot 1 » HTTP server using fast http.sys kernel-mode server » 2011-06-07 07:23:15

#14 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-03-21 07:20:30

#15 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-03-17 10:40:48

#16 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-03-17 07:08:14

#17 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-02-08 18:46:45

#18 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-02-05 21:15:54

#19 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-02-04 15:07:46

#20 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-02-02 08:37:47

#21 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-02-02 06:48:52

#22 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-02-01 15:01:45

#23 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-01-31 09:30:06

#24 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-01-28 14:34:13

#25 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-01-28 09:22:40

#26 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-01-27 12:25:47

#27 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-01-27 09:26:44

#28 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-01-25 10:19:29

#29 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-01-25 08:41:42

#30 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-01-24 08:11:28

#31 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-01-14 13:40:23

#32 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2011-01-03 08:18:05

#33 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-30 06:43:41

#34 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-29 11:47:41

#35 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-20 07:12:24

#36 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-13 06:57:50

#37 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-08 07:19:30

#38 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-07 14:53:09

#39 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-07 11:33:47

#40 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-06 12:25:19

#41 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-06 12:06:47

#42 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-12-03 15:21:01

#43 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-29 14:56:27

#44 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-29 11:13:04

#45 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-29 09:09:56

#46 Re: mORMot 1 » Code integration / ORM » 2010-11-29 07:34:38

#47 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-29 07:00:33

#48 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-26 14:36:19

#49 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-26 14:29:52

#50 Re: Low level and performance » Delphi doesn't like multi-core CPUs (or the contrary) » 2010-11-26 13:01:40
