Hello,
I'm evaluating the speed of several zlib implementations (zlib-ng, Cloudflare, Intel) because I want to try to parallelize some parts.
Building DLLs with VC 2015-2017 works fine.
I would like to have OBJ files so that I can link them statically inside Delphi.
That also works fine with GNU cc,
but I'm stuck with LLVM and the Cloudflare fork.
PLEASE, can somebody help me?
I have done a lot of testing, editing the sources and the compiler options, but I always get wrong results.
So:
1) download the Cloudflare fork from https://github.com/cloudflare/zlib
2) download LLVM 5.0 from www.llvm.org
3) if you have Visual Studio Community 2015, LLVM takes the headers from the default include folders (otherwise download the Windows PSDK)
Then try to produce the objects, for example:
clang -c -O3 -D_CRT_NONSTDC_NO_DEPRECATE -D_CRT_SECURE_NO_DEPRECATE -D_CRT_NONSTDC_NO_WARNINGS -DZLIB_WINAPI -DASMV -DASMINF -DWIN64 -mssse3 -msse4.1 -msse4.2 *.c
(there is an option for a crc32 function taken from the Linux kernel, but if we leave out that define we have no problems; it is only used for gzip, which I don't need)
OK, now try to map the objects with SynSSLZip and look at the results: with DeflateInit2 (mandatory for deflate browser compatibility) the results are not correct.
Btw, here is a revised struct for z_stream:
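A quick way to get a known-good stream to compare against is to let a stock zlib produce one. The sketch below uses Python's built-in zlib binding and assumes that "deflate browser compatibility" here means a raw deflate stream, i.e. deflateInit2 called with a negative windowBits; the helper names `raw_deflate` and `raw_inflate` and the sample data are made up for the illustration.

```python
import zlib

def raw_deflate(data: bytes, level: int = 6) -> bytes:
    """Compress to a raw deflate stream (no zlib header, no Adler-32 trailer)."""
    # A negative wbits value corresponds to deflateInit2(..., windowBits = -15, ...).
    comp = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    return comp.compress(data) + comp.flush()

def raw_inflate(blob: bytes) -> bytes:
    """Decompress a raw deflate stream produced by any deflate implementation."""
    return zlib.decompress(blob, -zlib.MAX_WBITS)

# Hypothetical sample input, standing in for a real HTML page.
sample = b"<html><body>" + b"<p>hello</p>" * 200 + b"</body></html>"
packed = raw_deflate(sample, level=1)

# The round trip must give back the input byte for byte.
assert raw_inflate(packed) == sample
print(len(sample), "bytes in,", len(packed), "bytes out")
```

Feeding the output of the Clang-built objects to `raw_inflate` (for example after dumping it to a file) shows whether the stream itself is corrupt or only differs in size from the reference.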
TZ_stream = record
  next_in: PByte;
  avail_in: UInt32;
  total_in: UInt64;
  next_out: PByte;
  avail_out: UInt32;
  total_out: UInt64;
  msg: PAnsiChar;
  state: Pinternal_state;
  zalloc: alloc_func;
  zfree: free_func;
  opaque: Pointer;
  data_type: Integer;
  adler: UInt64;
  reserved: UInt64;
end;
sizeof=112
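The size can be cross-checked outside Delphi. Stock zlib declares total_in, total_out, adler and reserved as uLong (unsigned long), which is 64 bits wide on 64-bit Linux but 32 bits wide with Win64 compilers, so the layout depends on which model the C side was built for. The sketch below mirrors the record with ctypes for both cases; `z_stream_layout` is a hypothetical helper, and whether the build discussed here keeps the stock uLong typedef is an assumption.

```python
import ctypes

def z_stream_layout(ulong_type):
    """Return a ctypes mirror of z_stream, with ulong_type standing in for uLong."""
    class ZStream(ctypes.Structure):
        _fields_ = [
            ("next_in", ctypes.c_void_p),
            ("avail_in", ctypes.c_uint32),
            ("total_in", ulong_type),
            ("next_out", ctypes.c_void_p),
            ("avail_out", ctypes.c_uint32),
            ("total_out", ulong_type),
            ("msg", ctypes.c_void_p),
            ("state", ctypes.c_void_p),
            ("zalloc", ctypes.c_void_p),   # function pointer
            ("zfree", ctypes.c_void_p),    # function pointer
            ("opaque", ctypes.c_void_p),
            ("data_type", ctypes.c_int),
            ("adler", ulong_type),
            ("reserved", ulong_type),
        ]
    return ZStream

# 64-bit uLong: the layout of the Delphi record above.
wide = z_stream_layout(ctypes.c_uint64)
# 32-bit uLong: what a C compiler using a 32-bit unsigned long would lay out.
narrow = z_stream_layout(ctypes.c_uint32)

for name, layout in (("uLong = 64 bit", wide), ("uLong = 32 bit", narrow)):
    print(name, "sizeof =", ctypes.sizeof(layout),
          "total_in offset =", layout.total_in.offset)
```

On a 64-bit Python the first layout reports 112 bytes, matching the record, and the second reports 88. If the C objects and the Delphi record disagree on this, every field after avail_in is read at the wrong offset.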
Please help; I can also pay something if we solve this. Where is the trick with LLVM?
For your curiosity: from my tests so far, Intel zlib and Cloudflare zlib are similar. Cloudflare is better at level 2 and up, Intel is better at its -2 level (avoiding the crc/adler checksum).
Sorry for my English, I'm in a hurry.
(if needed: rdp@dellapasqua.com)
See you soon
Offline
I was able to get the CloudFlare fork working with FPC via https://synopse.info/fossil/finfo?name=SynZLibSSE.pas
Offline
Well, I built a lot of DLL versions with Intel, Intel+IPP, Cloudflare, checksum on and off...
I would like to have object files built with Clang.
Can we try?
Offline
Did you try with FPC and the files available at https://github.com/synopse/mORMot/tree/master/fpc-win64 ?
Offline
Can I ask, do you use MinGW64 to make the *.obj files?
Offline
Did you use the PCLMUL define of Cloudflare? I ask because 'crc32_pclmul_le_16' exists only in the Linux kernel.
Do you perhaps know the Visual C++ equivalent of gcc's -msse4.2 -mpclmul (to enable the xmmintrin and emmintrin SSE intrinsics)?
Please bear in mind that I have little experience with C compilers.
Offline
Clang with -O2 works, but a test takes 13 seconds; with -O3 it corrupts the results, but takes 3.4 seconds.
Under Linux with GCC the same test takes 3.2 seconds.
:-\
Offline
I have isolated the problem: it's in LLVM's _mm_crc32_u32. Now I'm checking the function headers (it looks like an integer overflow).
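To check a suspect intrinsic it helps to have a slow but obviously correct model of it. The sketch below assumes that _mm_crc32_u32 maps to the SSE4.2 CRC32 instruction, which accumulates CRC-32C (Castagnoli polynomial, reflected form 0x82F63B78) with no initial or final inversion. The names `mm_crc32_u8`, `mm_crc32_u32` and `crc32c` are illustrative Python helpers, not part of any library.

```python
CRC32C_POLY = 0x82F63B78  # Castagnoli polynomial, bit-reflected

def _crc32c_step(crc: int, value: int, nbits: int) -> int:
    """Fold nbits of value into crc, one bit at a time."""
    crc = (crc ^ value) & 0xFFFFFFFF
    for _ in range(nbits):
        crc = (crc >> 1) ^ (CRC32C_POLY if crc & 1 else 0)
    return crc

def mm_crc32_u8(crc: int, v: int) -> int:
    """Software model of the 8-bit form of the instruction."""
    return _crc32c_step(crc, v & 0xFF, 8)

def mm_crc32_u32(crc: int, v: int) -> int:
    """Software model of the 32-bit form; the result always fits in 32 bits."""
    return _crc32c_step(crc, v & 0xFFFFFFFF, 32)

def crc32c(data: bytes) -> int:
    """Full CRC-32C with the usual initial and final inversion."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = mm_crc32_u8(crc, b)
    return crc ^ 0xFFFFFFFF

# Standard CRC-32C check value for the ASCII string "123456789".
assert crc32c(b"123456789") == 0xE3069283

# One 32-bit step equals four 8-bit steps over the little-endian bytes.
word = int.from_bytes(b"ciao", "little")
chained = 0
for b in b"ciao":
    chained = mm_crc32_u8(chained, b)
assert mm_crc32_u32(0, word) == chained
```

Calling the Clang-built code with the same (crc, value) pairs and comparing against `mm_crc32_u32` shows whether the intrinsic itself misbehaves or whether the value is damaged afterwards, for example when a 32-bit result is widened or truncated.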
Offline
GCC under Windows? MinGW or Cygwin? But it seems that we cannot use a function from GPL code such as the Linux kernel for commercial purposes, true?
Offline
I compiled with gcc -c -O3 -msse4.2 -mpclmul, but the result is slow. How can yours be so fast? But I suppose this is a magic secret :-)
Offline
I'm talking with the engineer who wrote the Cloudflare patch; indeed, with Clang the issue is with the SSE* calls.
Offline
Look at the zlib patches from Intel (the files are in Intel IPP for Linux). They work fine as a static DLL and introduce DeflateInit -2 and DeflateInit2 -2 options for a fastest mode.
Under Win64 this performs very similarly to Cloudflare.
Offline
OK, I ran a test on real HTML files produced by WordPress and a PHP forum, with settings that give a similar compression ratio (level 1 for gzip and deflate, -2 for Intel):
gzip: 12 seconds
your MinGW build (Cloudflare): 5.6 seconds
Intel: 2.5 seconds
Parallelized Intel gives 20 Gbit/s of HTML in -> 4.4 Gbit/s of compressed output on an i7 with 4 cores / 8 threads at 3.4 GHz.
I'm waiting for the correction from the Cloudflare engineer for the LLVM patches, then I'll let you know.
See you soon
Offline
They don't fill the dictionary table of common tokens during the deflate, so for example "ciao ciao ciao" is not compressed at all; the crc32 checksum hash is still used.
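The cost of skipping the match search can be seen with stock zlib. This is only an analogy, not the patched code discussed above: the sketch uses Python's zlib binding and the standard Z_HUFFMAN_ONLY strategy, which also never looks for repeated strings. `deflate_size` is a hypothetical helper.

```python
import zlib

def deflate_size(data: bytes, strategy: int) -> int:
    """Size of the raw deflate stream for data at level 1 with the given strategy."""
    comp = zlib.compressobj(1, zlib.DEFLATED, -zlib.MAX_WBITS,
                            zlib.DEF_MEM_LEVEL, strategy)
    return len(comp.compress(data) + comp.flush())

text = b"ciao " * 1000

# Default strategy: repeated strings are replaced by back-references.
with_matches = deflate_size(text, zlib.Z_DEFAULT_STRATEGY)
# Huffman only: every byte is coded on its own, repetition is not exploited.
huffman_only = deflate_size(text, zlib.Z_HUFFMAN_ONLY)

print("input:", len(text), "matches:", with_matches, "huffman only:", huffman_only)
```

With matches enabled the repeated word collapses to a few bytes, while the Huffman-only stream still grows in proportion to the input.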
Offline
Indeed, it's very slow compared with lz4, SynLZ or snappy, for example; it is only useful for the default web browser compression.
Offline
off-offtopic
thread 1) read @Int64
00000000006CCAC3 488B4538 mov rax,[rbp+$38] ; pointer to the address
00000000006CCAC7 48894530 mov [rbp+$30],rax ; read of a 64-bit quadword with a single mov (value moved into a register)
thread 2) write @Int64
00000000006CC8FA 488B4538 mov rax,[rbp+$38] ; pointer to the same address
00000000006CC8FE 488905739F0500 mov [rel $00059f73],rax ; write of the above 64-bit quadword with a single mov
Can you confirm that here we don't need atomic sync functions, because each access is a single aligned 64-bit (or wider) memory operation, so a collision can never happen?
thanks
Offline
OK, well, in the case where the order doesn't need to be synchronized (the order in which the read and the write execute is not important there), can I proceed without Interlocked* calls and without worries?
Btw, thanks for your time.
Offline