hi Arnaud,
somebody told me that in your latest mORMot you have built optimized RTL functions such as FillChar using SIMD asm.
Perhaps you can point me to these functions?
Can they be used in Delphi?
Which license do you permit?
thank you
R.
hello,
really great result!
Can I ask, do you support Argon2?
I'm referring to starting an instance end-to-end using SGX or TrustZone in random protected RAM (isolated from the OS) or a remote network pipe, and extending it using a 32-bit key with an Argon2 hash. The most reliable approach until now.
A bientot!
(nowadays nothing other than Argon2 has much value)
Do you plan to implement X3DH, Double Ratchet and Curve25519 in addition to AES-256 and HMAC-SHA256?
Will you release a Delphi version?
sorry for the off-topic,
hello,
I'm using the very powerful TTask API of System.Threading,
however I need one thing:
the pool is dynamic; I attach a TLS class with DB components to the running thread of the context, and keep it in memory while the thread runs (avoiding creating and destroying a DB context every time).
I need an OnAThreadExit event on System.TThreadPool so I can free the TLS class when the thread is released.
How can this be done without changing the source of System.Threading.pas? The Winapi doesn't offer a callback on thread exit, only a polling API, so that cannot be used (I need to call the TLS code within the thread, before it exits).
thank you
R.
ok, well, in case the sequence order doesn't need to be synchronized (the order of the reads and writes there is not important), can I proceed without Interlocked* calls, without worries?
btw. thanks for your time
off-offtopic
thread 1) read @Int64
00000000006CCAC3 488B4538 mov rax,[rbp+$38] pointer to address
00000000006CCAC7 48894530 mov [rbp+$30],rax read from a 64bit quadword single op mov (mov value into register)
thread 2) write @Int64
00000000006CC8FA 488B4538 mov rax,[rbp+$38] pointer to the same address
00000000006CC8FE 488905739F0500 mov [rel $00059f73],rax write the above 64bit quadword with a single op mov
do you confirm that here we don't need atomic sync functions, because each access is a single aligned 64-bit operation, so a torn read/write can never happen?
thanks
indeed, it's very slow compared to lz4, synlz or snappy for example; useful only for default web browser compression.
They don't fill the dictionary table of common tokens during the deflate, so for example "ciao ciao ciao" is not compressed at all; the CRC32 checksum hash is still used.
ok, I did a test on real HTML files produced by WordPress and a PHP forum, run with a similar compression ratio (1 for gzip and deflate, -2 for Intel):
gzip: 12 seconds
your MinGW build (Cloudflare): 5.6 sec
Intel: 2.5 sec
Parallelized, Intel provides 20 Gbit/s HTML in -> 4.4 Gbit/s compressed output on an i7 (4/8 cores, 3.4 GHz).
I'm waiting for the correction from the Cloudflare engineer for the LLVM patches, then I'll let you know.
A bientot
look at the zlib patches from Intel (the files are in the Intel IPP for Linux); they work fine as a static DLL, introducing DeflateInit -2 and DeflateInit2 -2 options for the fastest mode.
Under Win64 this performs very similarly to Cloudflare.
I'm talking with the engineer who did the Cloudflare patch; indeed, with Clang the issue is with the SSE* calls.
I compiled with gcc -c -O3 -msse4.2 -mpclmul, but the results are slow. How can you be so fast? But I suppose this is a magic secret :-)
GCC under Windows? MinGW or Cygwin? But it seems that we cannot use a function from GPL code such as the Linux kernel for commercial purposes, true?
I have isolated the problem: it's in _mm_crc32_u32 under LLVM; now I'm checking the function headers (it seems to be an integer overflow).
clang with -O2 works, but a test takes 13 seconds; with -O3 it corrupts the results, but takes 3.4 seconds.
Under Linux with GCC the test takes 3.2 seconds.
:-\
did you use the PCLMUL define of Cloudflare? Because 'crc32_pclmul_le_16' exists only in the Linux kernel.
Perhaps you know the Visual C++ equivalent of gcc's -msse4.2 -mpclmul (to enable the xmmintrin and emmintrin SSE intrinsics)?
Please consider that I have little experience with C compilers.
Can I ask, do you use MinGW64 to make the *.obj files?
Well, I did a lot of DLL versions with Intel, Intel+IPP, Cloudflare, checksum on and off...
I would like to have objects built with Clang.
Can we try?
hello,
I'm evaluating the speed of many zlib implementations (ng, Cloudflare, Intel), because I want to try to parallelize some parts.
All OK making DLLs with VC 2015-2017.
I would like to have OBJ files to statically link them inside Delphi.
All OK with GNU cc,
but I'm stuck with LLVM and Cloudflare.
PLEASE can somebody help me?
I have done a lot of testing, editing the sources and the compiler options, but I always obtain wrong results.
So:
1) download Cloudflare from https://github.com/cloudflare/zlib
2) download LLVM 5.0 from www.llvm.org
3) if you have Visual Studio Community 2015 then LLVM takes the headers from the default include folders (otherwise download the Windows PSDK)
Try to produce objects, for example:
clang -c -O3 -D_CRT_NONSTDC_NO_DEPRECATE -D_CRT_SECURE_NO_DEPRECATE -D_CRT_NONSTDC_NO_WARNINGS -DZLIB_WINAPI -DASMV -DASMINF -DWIN64 -mssse3 -msse4.1 -msse4.2 *.c
(there is an option for a crc32 function taken from the Linux kernel, but avoiding this define we have no problems; it is used in gzip, which I don't need)
ok, try to map the objects with SynSSLZip and see the results: with DeflateInit2 (mandatory for browser-compatible deflate) the results are not correct
btw. here is a revised record for z_stream:

TZ_stream = record
  next_in: PByte;
  avail_in: UInt32;
  total_in: UInt64;
  next_out: PByte;
  avail_out: UInt32;
  total_out: UInt64;
  msg: PAnsiChar;
  state: Pinternal_state;
  zalloc: alloc_func;
  zfree: free_func;
  opaque: Pointer;
  data_type: Integer;
  adler: UInt64;
  reserved: UInt64;
end;

sizeof = 112
please help; I can also pay something if we solve this. Where is the trick with LLVM?
For your curiosity: actually, from my tests, Intel zlib and Cloudflare zlib are similar; Cloudflare is better at level 2 and up, Intel is better at level -2 (avoiding the CRC/Adler checksum).
sorry for my English, I'm in a hurry
(if needed: rdp@dellapasqua.com)
A bientot
thanks for the info
So, let's say: in a method the local variables are allocated on the stack every time the function is called (and deallocated at function exit), so they don't need to be protected from other threads (every thread gets its own copy of the vars on its stack), all right?
Instead, all global variables should obviously be protected for read/write using monitors, mutexes, critical sections...
about thread concurrency with Delphi
function Dummy: integer;
var
  A, B, C: integer;
begin
  //do something
end;
Then suppose many threads call Dummy().
Are the local variable instances A, B, C unique globally, or instantiated for each thread? Is there some runtime management?
Is access to those variables atomic if they are globally unique, or should we use a critical section around Dummy()?
So what happens in RAM when two threads call Dummy concurrently and read/write A, B, C at the same time?
thanks for info
hello,
I'm trying to add source files from /rtl, after rebuilding the RTL.
I obtain: [dcc64 Fatal Error] System.Variants.pas(1271): E2158 System.Variants unit out of date or corrupted: missing '@VarCast'.
Does somebody know how to solve this problem?
Thanks.
R.
I did a small test with a real code scenario;
look at parallel zlib with my patch, a ZCompress loop of 1000 iterations over a 1100KB text file:
uses System.Zlib, System.Threading;

threadvar
  INS: TMemoryStream;
  OUTS: pointer;
  SizeIn: integer;
  SizeOUT: integer;

procedure TForm.CompressClick(Sender: TObject);
var
  Count: Cardinal;
begin
  Count := GetTickCount;
  TParallel.For(1, 1000,
    procedure(I: integer)
    begin
      INS := TMemoryStream.Create;
      INS.LoadFromFile('c:\teststream.txt');
      SizeIn := INS.Size;
      GetMem(OUTS, SizeIn);
      SizeOUT := SizeIn;
      ZCompress(INS.Memory, SizeIn, OUTS, SizeOUT, zcFastest);
      INS.Free;
      FreeMem(OUTS);
    end);
  ShowMessage(IntToStr(GetTickCount - Count));
end;
- fastmm4 900-1000msec
- brainMM 563msec
- msheap 532msec
- my patch Intel IPP + TBB 281 msec
www.dellapasqua.com
www.dellapasqua.com/intelTBB.rar (put a teststream.txt file on c:\ and run files)
but I haven't time now, so I trust you about 32-bit.
I'll do a build of IPP+TBB for IA32, and we'll see how it performs.
I haven't tested it, but under Win32 it seems the fastest.
Hi!
Unfortunately the library has a few critical bugs, which will be fixed, I think, this autumn. Look for example here: https://github.com/d-mozulyov/BrainMM/issues/5
Thanks for your attention to my project
Sent from iPhone
On 13 July 2017, at 15:20, Roberto Della Pasqua <rdp@dellapasqua.com> wrote:
Hello Dimitry,
I would like to ask if your BrainMM is reliable enough to be used for 64-bit 24/7 server applications;
does it suffer from memory fragmentation or leaks, any trouble at all?
Do you have customer reports from months of operation?
Thank you.
Regards.
what do you think about OpenMP support in the Delphi compiler?
well, parallelizing the MM is not enough to get better overall performance; all the algorithms should be written with the parallel classes, such as TParallel.For, etc., and not every task can be parallelized. Anyway, the near future of CPUs is massive multicore growth instead of a GHz war. So the direction seems to be this.
10.0.15063 (win10 creators update)
FastMM4 22sec
FastMM4-AVX 18sec
Intel IPP+TBB 4sec
(consider that under single thread fastmm performs slightly better than Intel)
do you think it's possible to convert the DLLs into one OBJ, maybe using Agner Fog's objconv? It's a very hard task, no?
correction:
using FastMM4 NoThreadContention 22098 msec
using ScaleMM2 22393 msec
using Windows 10 / Windows 2016 Heap 5102 msec
using Intel TBB + Intel IPP 3975 msec
indeed, the results are similar (the Intel MM is good with many concurrent threads):
Using IntelTBB
1.1. Low level common:
Total failed: 0 / 10,952,515 - Low level common PASSED 2.45s
1.2. Low level types:
Total failed: 0 / 733,788 - Low level types PASSED 173.84ms
1.3. Big table:
Total failed: 0 / 886,592 - Big table PASSED 1.30s
2.11. DDD shared units:
Total failed: 0 / 80,388 - DDD shared units PASSED 462.14ms
Without IntelTBB
1.1. Low level common:
Total failed: 0 / 10,951,964 - Low level common PASSED 2.41s
1.2. Low level types:
Total failed: 0 / 734,580 - Low level types PASSED 239.36ms
1.3. Big table:
Total failed: 0 / 886,427 - Big table PASSED 1.39s
2.11. DDD shared units:
Total failed: 0 / 80,388 - DDD shared units PASSED 991.26ms
(btw. I have used SynSQLite3Static, but the multithreaded test doesn't run; any hint?)
because the Intel license permits redistribution in DLL form :-\ (if I read correctly)
hello Arnaud,
sorry, it's me again, I'm your nightmare :-P
please, if you have time, run a test of Synopse with my patches from www.dellapasqua.com; one is for the memory manager and the RTL FillChar, CopyMem, etc., the other is for zlib.
Alternatively, where can I find the Synopse benchmark, so I can test the speed and reliability?
Btw. would you like to extend the IPP patches with other functions such as Math, Vectors, etc.?
A bientot.
Roberto
can I ask a courtesy?
Have you tested this MM: https://github.com/d-mozulyov/BrainMM ?
It would be nice to see how it performs in a real-life benchmark using the mORMot multithreaded test...
I want to see if it really scales so well, and how reliable it is.
many thanks
hello Arnaud,
do you have a fast ASM function to do a Pos with memory pointers or an array of bytes instead of strings? (I want to avoid the @UniqueString call on each P-cast)
For 64-bit of course.
For example it could be PosBin(SubStr: TBytes; Source: Pointer; Index: UInt64);
I can do it in Pascal using memory compares and indexes, but I know your crew is strong with ASM.
let me know
thank you
Roberto
WIN32 speed solved.
After testing a lot of environments for the toolchain (MinGW, Cygwin, MS Visual Clang/C2, bcc, many config options), I have opted for the latest clang 3.8.1 for win32/win64.
md5 tested ok
Http Json 50KB TMemoryStream file test
Intel core i7 2.6ghz, Windows 10 Pro
Compression ratio 6x
64bit WIN64
Snappy compress in 237.33ms, ratio=85%, 1.6 GB/s
Snappy uncompress in 92.43ms, 4.3 GB/s
32bit WIN32
Snappy compress in 269.96ms, ratio=85%, 1.4 GB/s
Snappy uncompress in 135.88ms, 2.9 GB/s
http://www.dellapasqua.com/snappy64/
tnx
using the latest Cygwin and gcc 5.4.0
win 64-bit (slightly slower than bcc64 LLVM)
Snappy compress in 244.04ms, ratio=85%, 1.6 GB/s
Snappy uncompress in 124.18ms, 3.2 GB/s
32-bit (a lot faster; doesn't bcc32 optimize through LLVM?)
Snappy compress in 264.61ms, ratio=85%, 1.5 GB/s
Snappy uncompress in 108.62ms, 3.6 GB/s
the question is: may I redistribute C objects built with Cygwin's GNU toolchain without license infringement?
Can those libs be linked freely in a closed-source app?
Any help?
If the license permits, then I can try to build for Android, OSX and iOS through gcc/Cygwin...
http://pastebin.com/wBwuGP9b
(it's a random order from a customer of mine, from a web e-commerce platform; the files are all between 30KB and 90KB)
btw. thanks for your time
Look, I'm passionate about software algorithms and I'm doing these tests only for fun (in the end, snappy is also an LZ77-class algorithm...)
These are my actual tests with your updated source:
(btw I'm using your latest version of mORMot and Delphi Berlin in a release build, with the default FastMM4 MM)
I use a 50KB JSON, a typical order file from an e-commerce platform
64bit
Snappy compress in 238.27ms, ratio=85%, 1.6 GB/s
Snappy uncompress in 125.02ms, 3.2 GB/s
SynLZ compress in 698.31ms, ratio=86%, 588.6 MB/s
SynLZ uncompress in 375.44ms, 1 GB/s
32bit
Snappy compress in 347.30ms, ratio=85%, 1.1 GB/s
Snappy uncompress in 149.52ms, 2.6 GB/s
SynLZ compress in 464.66ms, ratio=86%, 884.6 MB/s
SynLZ uncompress in 316.27ms, 1.2 GB/s
i7 2.6ghz
50KB json TMemoryStream
64bit
Snappy compress in 226.39ms, size=68140000, 287 MB/s
Snappy uncompress in 112.78ms, size=431020000, 3.5 GB/s
SynLZ compress in 605.05ms, size=63580000, 100.2 MB/s
SynLZ uncompress in 342.00ms, size=431020000, 1.1 GB/s
32bit
Snappy compress in 329.53ms, size=68140000, 197.1 MB/s
Snappy uncompress in 165.60ms, size=431020000, 2.4 GB/s
SynLZ compress in 458.01ms, size=63580000, 132.3 MB/s
SynLZ uncompress in 335.78ms, size=431020000, 1.1 GB/s
From Stack Overflow:
[...]
A while back I wrote a few garbage collectors to teach myself more about performance optimization in C.
And the results I got is in my mind enough to slightly favor clang.
Especially since garbage collection is mostly about pointer chasing and copying memory.
The results are (numbers in seconds):
+---------------------+-----+-----+
|Type |GCC |Clang|
+---------------------+-----+-----+
|Copying GC |22.46|22.55|
|Copying GC, optimized|22.01|20.22|
|Mark & Sweep | 8.72| 8.38|
|Ref Counting/Cycles |15.14|14.49|
|Ref Counting/Plain | 9.94| 9.32|
+---------------------+-----+-----+
LLVM seems slightly faster than GCC
Sorry for the delay in this post;
in my latest test these are the results:
Json 50KB TMemoryStream file test, core i7 2.6ghz, WIN64:
compression speed at 275MB/sec
decompression speed at 3.3GB/sec
compression ratio 600%
bcc clang 7.20 uses the LLVM 3.3 backend and optimizes as well as, or slightly better than, GCC and VC
hello again, I have taken a further look into Andy's port, and the source hasn't stripped out the debug assertions.
Now I have updated the library at www.dellapasqua.com/snappy64; here are the results:
50KB json magento order
Snappy compress in 28.16ms, size=6814000, 230.7 MB/s
SynLZ compress in 66.88ms, size=6358000, 90.6 MB/s
Snappy uncompress in 16.65ms, size=43102000, 2.4 GB/s
SynLZ uncompress in 47.30ms, size=43102000, 868.9 MB/s
1GB vhd file
Snappy compress in 1.83s, size=508411112, 264.1 MB/s
SynLZ compress in 3.96s, size=532219716, 128.1 MB/s
Snappy uncompress in 931.06ms, size=1043862016, 1 GB/s
SynLZ uncompress in 3.82s, size=1043862016, 260 MB/s
so the average is 250MB/s compress and 1GB/s decompress
btw. I don't want to start a flame about the marvellous mORMot code; it's just that maybe it can be useful to have a faster lib for the Delphi "world"
btw2. mORMot and A. are fantastic imho
With a 160KB text file I got these results:
Snappy compress in 93.51ms, size=10044800, 102.4 MB/s
Snappy uncompress in 23.59ms, size=16285800, 658.2 MB/s
SynLZ compress in 105.62ms, size=9391100, 84.7 MB/s
SynLZ uncompress in 68.73ms, size=16285800, 225.9 MB/s
It seems to me a good algorithm
CIAO!
yes, I know mORMot is cool, and I love Pascal code ;-)