#1 mORMot 1 » enhanced rtl functions » 2022-07-04 18:51:24

RObyDP
Replies: 0

hi Arnaud,

somebody told me that your latest mORMot includes optimized RTL functions, such as FillChar, written in SIMD asm.
Could you point me to these functions?
Can they be used in Delphi?
Under which license are they released?
thank you
R.

#2 Re: mORMot 1 » Fastest AES-PRNG, AES-CTR and AES-GCM Delphi implementation » 2021-02-15 10:31:17

hello,

really great result,

can I ask, do you support Argon2?

I mean starting an E2E instance using SGX or TrustZone in random protected RAM (isolated from the OS) or over a remote net pipe, and stretching a 32bit key with an Argon2 hash. That is the most reliable option so far.

A bientot!

(nowadays nothing other than Argon2 has any value)

Do you plan to implement X3DH, Double Ratchet and Curve25519, in addition to AES-256 and HMAC-SHA256?

#3 Re: mORMot 1 » Fast MM5 » 2020-05-08 14:31:38

Will you release a Delphi version?

#4 mORMot 1 » System.Threading TThreadPool and TLS » 2017-12-02 10:01:54

RObyDP
Replies: 1

sorry for the off-topic,
hello,
I'm using the very powerful TTask API of System.Threading,
but I need one thing:
the pool is dynamic; I attach a TLS class holding DB components to the running thread of the context, and keep it in memory for as long as the thread runs (to avoid creating and destroying a DB context every time).
I need an event such as OnAThreadExit on the TThreadPool of System.Threading, so that I can free the TLS class when the thread is released.
How can this be done without changing the source of System.Threading.pas? Winapi doesn't offer a callback on thread exit; I only found a polling API, which cannot be used here (I need to call the TLS code from within the thread, before it exits).

thank you
R.

#5 Re: mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-19 09:37:39

ok, well, if the ordering doesn't need to be synchronized (the order in which the reads and writes execute is not important there), can I safely proceed without Interlocked* calls?
btw. thanks for your time

#6 Re: mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-18 18:42:31

off-offtopic

thread 1) read @Int64
00000000006CCAC3 488B4538         mov rax,[rbp+$38]       ; pointer to address
00000000006CCAC7 48894530         mov [rbp+$30],rax       ; read the 64bit quadword with a single mov (value into register)

thread 2) write @Int64
00000000006CC8FA 488B4538         mov rax,[rbp+$38]       ; pointer to the same address
00000000006CC8FE 488905739F0500   mov [rel $00059f73],rax ; write the 64bit quadword above with a single mov

can you confirm that we don't need atomic sync functions here, because each access is a single mov on a naturally aligned address (64bit or more, so a torn read or write can never happen)?
thanks
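
A minimal sketch of the alignment precondition behind that question. On x86-64, the Intel and AMD manuals guarantee that a plain load or store of a quadword aligned on an 8-byte boundary is atomic; that guarantee covers neither misaligned values nor read-modify-write sequences, and it says nothing about ordering between threads. `IsNaturallyAligned` and `TMisaligned` are hypothetical names used only for this illustration, not RTL or mORMot identifiers.

```pascal
program AlignedInt64Check;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}

// Hypothetical helper: True when address P is a multiple of Size,
// i.e. a value of that size stored at P is naturally aligned.
function IsNaturallyAligned(P: Pointer; Size: NativeUInt): Boolean;
begin
  Result := (NativeUInt(P) mod Size) = 0;
end;

type
  // A packed record can push an Int64 onto an arbitrary byte offset,
  // which is the case where the single-mov guarantee may no longer hold.
  TMisaligned = packed record
    Pad: Byte;
    Value: Int64;
  end;

var
  Shared: Int64;      // ordinary global, normally placed on an 8-byte boundary
  Bad: TMisaligned;   // Value sits one byte after the start of the record
begin
  WriteLn('global Int64 aligned: ', IsNaturallyAligned(@Shared, SizeOf(Int64)));
  WriteLn('packed field aligned: ', IsNaturallyAligned(@Bad.Value, SizeOf(Int64)));
end.
```

If the value may be misaligned, or if several threads update it based on its previous content, the Interlocked* family is still the safe choice.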

#7 Re: mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-14 06:58:37

indeed, it's very slow compared to lz4, SynLZ or snappy, for example; it's useful only for default web browser compression

#8 Re: mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-13 22:16:05

They don't fill the dictionary table of common tokens during the deflate, so for example "ciao ciao ciao" is not compressed at all; the crc32 checksum is still used.

#9 Re: mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-13 13:03:59

ok, I ran a test on real HTML files produced by WordPress and a PHP forum, at levels giving a similar compression ratio (1 for gzip and deflate, -2 for Intel):
gzip 12 seconds
your mingw 5.6 sec (cloudflare)
intel 2.5 sec

parallelized, Intel provides 20 Gbit/s HTML in -> 4.4 Gbit/s compressed out on an i7 with 4 cores / 8 threads at 3.4 GHz

I'm waiting for the Cloudflare engineer's fix for the LLVM patches, then I'll let you know
A bientot

#10 Re: mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-13 10:15:32

look at the zlib patches from Intel (the files are in Intel IPP for Linux); they work fine as a static DLL, and they introduce a -2 level for DeflateInit and DeflateInit2 as the fastest mode.
Under Win64 this performs very similarly to Cloudflare.

#11 Re: mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-12 21:45:28

I'm talking with the engineer who wrote the Cloudflare patch; indeed, with Clang the issue is in the SSE* calls.

#12 Re: mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-12 21:44:16

I compiled with gcc -c -O3 -msse4.2 -mpclmul, but the result is slow. How can yours be so fast? But I suppose this is a magic secret :-)

#13 Re: mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-12 20:48:28

GCC under Windows? MinGW or Cygwin? But it seems that we cannot use a function from GPL code such as the Linux kernel for commercial purposes, right?

#14 Re: mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-12 20:37:17

I have isolated the problem: it's in LLVM's _mm_crc32_u32. Now I'm checking the function headers (it looks like an integer overflow)

#15 Re: mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-12 20:06:55

clang with O2 works, but a test takes 13 seconds; with O3 it corrupts the results, but takes 3.4 seconds
under Linux with GCC the test takes 3.2 seconds
:-\

#16 Re: mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-12 17:29:48

did you use the PCLMUL define of Cloudflare? Because 'crc32_pclmul_le_16' exists only in the Linux kernel.
Do you perhaps know the Visual C++ equivalent of gcc's -msse4.2 -mpclmul (to enable the xmmintrin and emmintrin SSE headers)?
please bear in mind that I have little experience with C compilers.

#17 Re: mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-12 17:24:10

can I ask, do you use MinGw64 to make the *.obj?

#18 Re: mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-11 19:37:17

well, I built a lot of DLL versions with Intel, Intel+IPP, Cloudflare, checksum on and off...
I would like to have the objects built with Clang
can we try?

#19 mORMot 1 » plz help with LLVM5 and ZlibSSE » 2017-09-11 18:16:50

RObyDP
Replies: 22

hello,

I'm evaluating the speed of several zlib implementations (ng, Cloudflare, Intel), because I want to try to parallelize some parts
all OK when building DLLs with VC 2015-2017

I would like to have OBJ files to link statically into Delphi
all OK with GNU cc

but I'm stuck with LLVM and Cloudflare

PLEASE can somebody help me?

I have done a lot of testing, editing the sources and the compiler options, but I always get wrong results.

So:
1) download Cloudflare from https://github.com/cloudflare/zlib
2) download LLVM 5.0 from www.llvm.org
3) if you have Visual Studio Community 2015, LLVM takes the headers from its default include folders (otherwise download the Windows PSDK)

Try to produce the objects, for example:
clang -c -O3 -D_CRT_NONSTDC_NO_DEPRECATE -D_CRT_SECURE_NO_DEPRECATE -D_CRT_NONSTDC_NO_WARNINGS -DZLIB_WINAPI -DASMV -DASMINF -DWIN64 -mssse3 -msse4.1 -msse4.2 *.c
(there is an option for a crc32 function taken from the Linux kernel, but leaving this define out causes no problems, since it is used by gzip, which I don't need)

ok, then try to link the objects with SynSSLZip and look at the results: with DeflateInit2 (mandatory for browser-compatible deflate) the results are not correct

btw. here a revised struct for z_stream:

TZ_stream = record
  next_in: PByte;
  avail_in: UInt32;
  total_in: UInt64;
  next_out: PByte;
  avail_out: UInt32;
  total_out: UInt64;
  msg: PAnsiChar;
  state: Pinternal_state;
  zalloc: alloc_func;
  zfree: free_func;
  opaque: Pointer;
  data_type: Integer;
  adler: UInt64;
  reserved: UInt64;
end;

sizeof=112
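
A small sketch for checking that figure: with natural alignment on a 64-bit target, the record above comes to 112 bytes because each UInt64 that follows a 32-bit field is padded to an 8-byte boundary. The three placeholder types are assumptions standing in for the real declarations of the zlib binding. Note that stock zlib declares the total_* and adler fields as uLong (C unsigned long, 32 bits on Win64), so the Pascal record has to match whatever the C objects were actually compiled with.

```pascal
program ZStreamLayout;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}

type
  // Placeholders (assumptions): only their pointer size matters here.
  Pinternal_state = Pointer;
  alloc_func = Pointer;
  free_func = Pointer;

  {$A8} // natural alignment, as a C compiler would lay out the struct
  TZ_stream = record
    next_in: PByte;
    avail_in: UInt32;
    total_in: UInt64;
    next_out: PByte;
    avail_out: UInt32;
    total_out: UInt64;
    msg: PAnsiChar;
    state: Pinternal_state;
    zalloc: alloc_func;
    zfree: free_func;
    opaque: Pointer;
    data_type: Integer;
    adler: UInt64;
    reserved: UInt64;
  end;
  PZ_stream = ^TZ_stream;

begin
  WriteLn('SizeOf(TZ_stream)   = ', SizeOf(TZ_stream));
  // Field offsets via the usual nil-pointer idiom.
  WriteLn('offset of total_in  = ', NativeUInt(@PZ_stream(nil)^.total_in));
  WriteLn('offset of total_out = ', NativeUInt(@PZ_stream(nil)^.total_out));
  WriteLn('offset of adler     = ', NativeUInt(@PZ_stream(nil)^.adler));
  if SizeOf(Pointer) = 8 then
    Assert(SizeOf(TZ_stream) = 112, 'unexpected z_stream size');
end.
```

Comparing these offsets with `offsetof(z_stream, ...)` printed from the C side is a quick way to rule out a layout mismatch before looking at the compiler.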

please help; I can also pay something if we solve this. Where is the trick with LLVM?

for the curious: in my tests so far, Intel zlib and Cloudflare zlib are similar: Cloudflare is better at level 2 and up, Intel is better at level -2 (which avoids the crc/adler checksum)
sorry for my English, I'm in a hurry

(if needed: rdp@dellapasqua.com)
A bientot

#21 Re: mORMot 1 » a offtopic question » 2017-09-09 16:53:45

So, let's say: in a method, the local variables are allocated on the stack every time the function is called (and released at function exit), so they don't need to be protected from other threads (every thread gets its own copy of the variables on its stack), right?
Global variables, on the other hand, obviously must be protected against concurrent reads/writes using a monitor, mutex, critical section...
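
A minimal sketch of that distinction, assuming plain TThread workers and a TCriticalSection from SyncObjs (on Delphi the same units are System.Classes, System.SysUtils and System.SyncObjs). `TWorker`, `SharedTotal` and the constants are names made up for the illustration. Each call to `Dummy` gets its own A, B, C in its stack frame, so no lock is needed there; only the global counter is guarded.

```pascal
program LocalsVsGlobals;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}
uses
  {$ifdef unix}cthreads,{$endif}
  Classes, SysUtils, SyncObjs;

const
  ThreadCount = 4;
  Iterations = 100000;

var
  SharedTotal: Integer = 0;   // global: a single instance seen by all threads
  Lock: TCriticalSection;     // protects SharedTotal

// A, B, C live in the stack frame of each call: every thread works on
// private copies, so concurrent calls cannot interfere with each other.
function Dummy(Seed: Integer): Integer;
var
  A, B, C: Integer;
begin
  A := Seed;
  B := A * 2;
  C := A + B;
  Result := C;
end;

type
  TWorker = class(TThread)
  public
    LocalSum: Int64;          // written only by the owning thread
  protected
    procedure Execute; override;
  end;

procedure TWorker.Execute;
var
  I: Integer;
begin
  for I := 1 to Iterations do
  begin
    Inc(LocalSum, Dummy(I));  // locals only: no lock required
    Lock.Acquire;             // shared global: serialize the update
    try
      Inc(SharedTotal);
    finally
      Lock.Release;
    end;
  end;
end;

var
  Workers: array[0..ThreadCount - 1] of TWorker;
  I: Integer;
begin
  Lock := TCriticalSection.Create;
  try
    for I := 0 to High(Workers) do
      Workers[I] := TWorker.Create(False);
    for I := 0 to High(Workers) do
    begin
      Workers[I].WaitFor;
      WriteLn('thread ', I, ' local sum = ', Workers[I].LocalSum);
      Workers[I].Free;
    end;
    // With the lock in place this is ThreadCount * Iterations.
    WriteLn('shared total = ', SharedTotal);
  finally
    Lock.Free;
  end;
end.
```

Removing the Acquire/Release pair is an easy way to see lost updates on the shared counter, while the per-thread sums stay identical either way.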

#22 mORMot 1 » a offtopic question » 2017-09-09 12:38:25

RObyDP
Replies: 4

about thread concurrency with Delphi

function Dummy: integer;
var
  A, B, C: integer;
begin
  // do something
end;

Now suppose many threads call Dummy().
Are the local variables A, B, C a single global instance, or are they instantiated for each thread? Is there some runtime management involved?
Is access to those variables atomic if they are globally unique, or should we take a critical section before calling Dummy()?
So what happens in RAM when two threads call Dummy concurrently and read/write A, B, C at the same time?

thanks for info

#23 Low level and performance » Delphi XE rtl source path add » 2017-08-29 20:06:37

RObyDP
Replies: 1

hello,

I'm trying to add the /rtl source files to the path, after rebuilding the RTL.
I get [dcc64 Fatal Error] System.Variants.pas(1271): E2158 System.Variants unit out of date or corrupted: missing '@VarCast'.

Does anybody know how to solve this problem?

Thanks.
R.

#24 mORMot 1 » real case test MM parallel 4x scalable (i7 6700) » 2017-07-26 05:38:16

RObyDP
Replies: 1

I did a small test with a real-code scenario:
parallel zlib with my patch, a loop of 1000 ZCompress calls on a 1100KB text file:

uses System.Zlib;

threadvar
  INS: TMemoryStream;
  OUTS: pointer;
  SizeIn: integer;
  SizeOUT: integer;

procedure TForm.CompressClick(Sender: TObject);
var
  Count: integer;
begin
  Count := GetTickCount;
  TParallel.For(1, 1000, procedure(I: integer)
    begin
      INS := TMemoryStream.Create;
      INS.LoadFromFile('c:\teststream.txt');
      SizeIn := INS.Size;
      GetMem(OUTS, SizeIn);
      SizeOUT := SizeIn;
      ZCompress(INS.Memory, SizeIn, OUTS, SizeOUT, zcFastest);
      INS.Free;
      FreeMem(OUTS);
    end);
  ShowMessage(IntToStr(GetTickCount - Count));
end;

- fastmm4 900-1000msec
- brainMM 563msec
- msheap 532msec
- my patch Intel IPP + TBB 281 msec

www.dellapasqua.com
www.dellapasqua.com/intelTBB.rar (put a teststream.txt file on c:\ and run files)

#26 Re: mORMot 1 » RTL 64bit patched with Intel SIMD IPP TBB, MM from 70secs to 4 secs ! » 2017-07-24 12:25:52

I'll do a build of IPP+TBB for IA32, and see how it performs.

#28 Re: mORMot 1 » RTL 64bit patched with Intel SIMD IPP TBB, MM from 70secs to 4 secs ! » 2017-07-22 21:23:46

Hi!

Unfortunately the library has a few critical bugs, which will be fixed, I think, this autumn. Look for example here: https://github.com/d-mozulyov/BrainMM/issues/5

Thanks for your attention to my project

Sent from iPhone

On 13 July 2017, at 15:20, Roberto Della Pasqua <rdp@dellapasqua.com> wrote:
Hello Dimitry,

I would like to ask if your BrainMM is reliable enough to be used for 64bit server applications running 24/7.
Does it suffer from memory fragmentation or leaks, or any trouble at all?

Do you have customer reports covering months of work?

Thank you.

Regards.

#29 Re: mORMot 1 » RTL 64bit patched with Intel SIMD IPP TBB, MM from 70secs to 4 secs ! » 2017-07-21 17:10:21

what do you think about OpenMP support in the Delphi compiler?

#30 Re: mORMot 1 » RTL 64bit patched with Intel SIMD IPP TBB, MM from 70secs to 4 secs ! » 2017-07-21 17:08:31

well, parallelizing the MM is not enough to improve overall performance: all the algorithms should be written with parallel classes, such as TParallel.For, etc., and not every task can be parallelized. Anyway, the near future of CPUs is massive growth in core count rather than a GHz war, so this seems to be the direction.

#32 Re: mORMot 1 » RTL 64bit patched with Intel SIMD IPP TBB, MM from 70secs to 4 secs ! » 2017-07-21 12:52:46

FastMM4 22sec
FastMM4-AVX 18sec
Intel IPP+TBB 4sec

(note that single-threaded, FastMM performs slightly better than Intel)

#33 Re: mORMot 1 » RTL 64bit patched with Intel SIMD IPP TBB, MM from 70secs to 4 secs ! » 2017-07-20 18:54:22

do you think it's possible to convert the DLLs into one OBJ, maybe using objconv (AgnSoft)? It's a very hard task, isn't it?

#34 Re: mORMot 1 » RTL 64bit patched with Intel SIMD IPP TBB, MM from 70secs to 4 secs ! » 2017-07-20 18:52:25

correction:
using FastMM4 NoThreadContention 22098 msec
using ScaleMM2 22393 msec
using Windows 10 / Windows 2016 Heap 5102 msec
using Intel TBB + Intel IPP 3975 msec

#35 Re: mORMot 1 » RTL 64bit patched with Intel SIMD IPP TBB, MM from 70secs to 4 secs ! » 2017-07-20 17:34:51

indeed, the results are similar (Intel MM is good for many threads concurrently):

Using IntelTBB
1.1. Low level common:
  Total failed: 0 / 10,952,515  - Low level common PASSED  2.45s

1.2. Low level types:
  Total failed: 0 / 733,788  - Low level types PASSED  173.84ms

1.3. Big table:
  Total failed: 0 / 886,592  - Big table PASSED  1.30s

2.11. DDD shared units:
  Total failed: 0 / 80,388  - DDD shared units PASSED  462.14ms



Without IntelTBB
1.1. Low level common:
  Total failed: 0 / 10,951,964  - Low level common PASSED  2.41s

1.2. Low level types:
  Total failed: 0 / 734,580  - Low level types PASSED  239.36ms

1.3. Big table:
  Total failed: 0 / 886,427  - Big table PASSED  1.39s

2.11. DDD shared units:
  Total failed: 0 / 80,388  - DDD shared units PASSED  991.26ms

(btw. I have used SynSQLite3Static, but the multithreaded test doesn't run, any hint?)

#36 Re: mORMot 1 » RTL 64bit patched with Intel SIMD IPP TBB, MM from 70secs to 4 secs ! » 2017-07-20 17:02:08

because the Intel license permits redistribution in DLL form :-\ (if I read it correctly)

#37 mORMot 1 » RTL 64bit patched with Intel SIMD IPP TBB, MM from 70secs to 4 secs ! » 2017-07-20 12:28:03

RObyDP
Replies: 21

hello Arnaud,

sorry, me again, I'm your nightmare :-P

if you have time, could you please test Synopse with my patches from www.dellapasqua.com? One is for the memory manager and the RTL FillChar, CopyMem, etc.; the other is for zlib.

Alternatively, where can I find the Synopse benchmark, so that I can test the speed and reliability?

Btw. would you like to extend the IPP patches with other functions, such as Math, Vectors, etc.?

A bientot.

Roberto

#38 Re: mORMot 1 » Asm PosBinary function please » 2017-07-11 14:55:44

can I ask a favour?

Have you tested this MM: https://github.com/d-mozulyov/BrainMM ?

It would be nice to see how it performs in a real-life benchmark using the mORMot multithreaded test...

I want to see if it really scales so well, and how reliable it is.

#40 mORMot 1 » Asm PosBinary function please » 2017-07-11 11:01:16

RObyDP
Replies: 4

hello Arnaud,

do you have a fast ASM function that does a Pos on memory pointers or arrays of bytes instead of strings? (I want to avoid the @UniqueString call on every PChar cast)
For 64bit, of course

for example, it could be PosBin(SubStr: TBytes; Source: Pointer; Index: UInt64);

I can do it in Pascal using CompareMem and indexes, but I know your crew is strong with ASM

let me know
thank you
Roberto
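
For reference, a plain-Pascal sketch of the CompareMem-and-index approach mentioned above; it is a baseline, not the ASM version asked for. The signature is an assumption: it adds a `SourceLen` parameter (the search needs to know where the buffer ends) and returns a zero-based offset, or -1 when nothing is found.

```pascal
program PosBinDemo;
{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$APPTYPE CONSOLE}{$endif}
uses
  SysUtils;

// Hypothetical signature (assumption): returns the zero-based offset of the
// first occurrence of SubStr in Source at or after Index, or -1 if not found.
function PosBin(const SubStr: TBytes; Source: PByte; SourceLen: NativeInt;
  Index: NativeInt = 0): NativeInt;
var
  SubLen, Last, I: NativeInt;
  First: Byte;
begin
  Result := -1;
  SubLen := Length(SubStr);
  if (SubLen = 0) or (Source = nil) or (Index < 0) or (SubLen > SourceLen) then
    Exit;
  Last := SourceLen - SubLen;   // last offset where a full match still fits
  First := SubStr[0];
  for I := Index to Last do
    // Cheap first-byte test before paying for the full comparison.
    if (Source[I] = First) and CompareMem(@Source[I], @SubStr[0], SubLen) then
    begin
      Result := I;
      Exit;
    end;
end;

var
  Data, Needle: TBytes;
begin
  Data := TBytes.Create($10, $20, $30, $40, $30, $40, $50);
  Needle := TBytes.Create($30, $40);
  WriteLn(PosBin(Needle, @Data[0], Length(Data)));     // prints 2
  WriteLn(PosBin(Needle, @Data[0], Length(Data), 3));  // prints 4
end.
```

Working on a raw pointer plus length avoids any string conversion or UniqueString call; a SIMD or ASM variant would mainly speed up the first-byte scan.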

#41 Re: mORMot 1 » SNAPPY delphi 64 port » 2016-08-12 00:45:20

WIN32 speed solved

After testing a lot of environments for the toolchain (mingw, cygwin, MS Visual clang/c2, bcc, many config options), I have opted for the latest clang, 3.8.1, for win32/win64

md5 tested ok

Http Json 50KB TMemoryStream file test
Intel core i7 2.6ghz, Windows 10 Pro

Compression ratio 6x

64bit WIN64
Snappy compress in 237.33ms, ratio=85%, 1.6 GB/s
Snappy uncompress in 92.43ms, 4.3 GB/s

32bit WIN32
Snappy compress in 269.96ms, ratio=85%, 1.4 GB/s
Snappy uncompress in 135.88ms, 2.9 GB/s

http://www.dellapasqua.com/snappy64/

tnx

#42 Re: mORMot 1 » SNAPPY delphi 64 port » 2016-08-11 22:15:31

using latest cygwin and gcc 5.4.0

win 64bit (slightly slower than bcc64 llvm)
Snappy compress in 244.04ms, ratio=85%, 1.6 GB/s
Snappy uncompress in 124.18ms, 3.2 GB/s

32bit (a lot faster, bcc32 doesn't optimize through llvm?)
Snappy compress in 264.61ms, ratio=85%, 1.5 GB/s
Snappy uncompress in 108.62ms, 3.6 GB/s

the question is: can I redistribute C objects built with Cygwin (which is GNU) without infringing the license?
Can those libs be linked freely into a closed-source app?
Any help?

If the license permits it, then I can try to build the Android, OSX and iOS versions through gcc/Cygwin...

#43 Re: mORMot 1 » SNAPPY delphi 64 port » 2016-08-11 12:09:43

http://pastebin.com/wBwuGP9b
(it's a random order from a customer of mine, from a web ecommerce platform; they are all between 30kb-90kb)

btw. thanks for your time

#44 Re: mORMot 1 » SNAPPY delphi 64 port » 2016-08-11 11:40:20

Look, I'm passionate about software algorithms and I'm doing these tests only for fun (in the end snappy is also in the lz77 class...)

These are my current tests with your updated source:
(btw I'm using your latest version of mORMot and Delphi Berlin in release build, with FastMM4 as default MM)
I use a 50kb json, a typical order file from an ecommerce platform

64bit
Snappy compress in 238.27ms, ratio=85%, 1.6 GB/s
Snappy uncompress in 125.02ms, 3.2 GB/s
SynLZ compress in 698.31ms, ratio=86%, 588.6 MB/s
SynLZ uncompress in 375.44ms, 1 GB/s

32bit
Snappy compress in 347.30ms, ratio=85%, 1.1 GB/s
Snappy uncompress in 149.52ms, 2.6 GB/s
SynLZ compress in 464.66ms, ratio=86%, 884.6 MB/s
SynLZ uncompress in 316.27ms, 1.2 GB/s

#45 Re: mORMot 1 » SNAPPY delphi 64 port » 2016-08-10 22:08:35

i7 2.6ghz
50KB json TMemoryStream

64bit
Snappy compress in 226.39ms, size=68140000, 287 MB/s
Snappy uncompress in 112.78ms, size=431020000, 3.5 GB/s
SynLZ compress in 605.05ms, size=63580000, 100.2 MB/s
SynLZ uncompress in 342.00ms, size=431020000, 1.1 GB/s

32bit
Snappy compress in 329.53ms, size=68140000, 197.1 MB/s
Snappy uncompress in 165.60ms, size=431020000, 2.4 GB/s
SynLZ compress in 458.01ms, size=63580000, 132.3 MB/s
SynLZ uncompress in 335.78ms, size=431020000, 1.1 GB/s

#46 Re: mORMot 1 » SNAPPY delphi 64 port » 2016-08-10 15:28:16

From stackoverflow:

[...]
A while back I wrote a few garbage collectors to teach myself more about performance optimization in C.
And the results I got is in my mind enough to slightly favor clang.
Especially since garbage collection is mostly about pointer chasing and copying memory.

The results are (numbers in seconds):

+---------------------+-----+-----+
|Type                 |GCC  |Clang|
+---------------------+-----+-----+
|Copying GC           |22.46|22.55|
|Copying GC, optimized|22.01|20.22|
|Mark & Sweep         | 8.72| 8.38|
|Ref Counting/Cycles  |15.14|14.49|
|Ref Counting/Plain   | 9.94| 9.32|
+---------------------+-----+-----+

LLVM seems slightly faster than GCC

#47 Re: mORMot 1 » SNAPPY delphi 64 port » 2016-08-10 15:06:22

Sorry for the delay in this post,

in my latest test these are the results:

Json 50KB TMemoryStream file test, core i7 2.6ghz, WIN64:
compression speed at 275MB/sec
decompression speed at 3.3GB/sec
compression ratio 600%

bcc clang 7.20 uses the LLVM 3.3 backend and optimizes as well as, or slightly better than, GCC and VC

#48 Re: mORMot 1 » SNAPPY delphi 64 port » 2016-08-06 16:36:18

hello again, I looked further into Andy's port, and the debug assertions had not been stripped out of the source.
Now I have updated the library at www.dellapasqua.com/snappy64; here are the results:

50KB json magento order
Snappy compress in 28.16ms, size=6814000, 230.7 MB/s
SynLZ compress in 66.88ms, size=6358000, 90.6 MB/s
Snappy uncompress in 16.65ms, size=43102000, 2.4 GB/s
SynLZ uncompress in 47.30ms, size=43102000, 868.9 MB/s

1GB vhd file
Snappy compress in 1.83s, size=508411112, 264.1 MB/s
SynLZ compress in 3.96s, size=532219716, 128.1 MB/s
Snappy uncompress in 931.06ms, size=1043862016, 1 GB/s
SynLZ uncompress in 3.82s, size=1043862016, 260 MB/s

so the average is 250MB/s compression and 1GB/s decompression
btw. I don't want to start a flame war over the marvellous mORMot code; it just may be useful to have a faster lib for the Delphi "world"
btw2. mORMot and A. are fantastic imho

#49 Re: mORMot 1 » SNAPPY delphi 64 port » 2016-08-05 16:40:24

With a 160KB text file I got these results:

Snappy compress in 93.51ms, size=10044800, 102.4 MB/s
Snappy uncompress in 23.59ms, size=16285800, 658.2 MB/s

SynLZ compress in 105.62ms, size=9391100, 84.7 MB/s
SynLZ uncompress in 68.73ms, size=16285800, 225.9 MB/s

Seems a good algorithm to me :-)

CIAO!

#50 Re: mORMot 1 » SNAPPY delphi 64 port » 2016-08-05 16:29:23

yes, I know mormot is cool, and I love pascal code ;-)
